Lesson 6.1 — What's Different About CI/CD for DE
You've architected GitOps pipelines. You know Jenkins, GitHub Actions, Docker runner images, pipeline-as-code. This lesson is not teaching you CI/CD — it's the delta for DE, in 20 minutes.
The four differences that matter
1 · The database is state — you can't just redeploy
In app deployment, you push a new image, kill old pods, done. In DE, the database schema and data are state. You can't "redeploy" a warehouse from scratch — you'd wipe petabytes.
Consequence: migrations. Every schema change is a versioned, idempotent SQL file applied in order. Tools: schemachange (Flyway-for-Snowflake, the most common), Liquibase, Sqitch.
snowflake-migrations/
├── V1.0.0__initial_schema.sql            # CREATE DATABASE, SCHEMA, ROLES
├── V1.0.1__create_meter_reads.sql        # CREATE TABLE
├── V1.0.2__add_consumption_kwh_idx.sql   # ALTER (though Snowflake doesn't have indexes)
├── V1.1.0__add_quality_column.sql        # ALTER TABLE ADD COLUMN
└── R__masking_policies.sql               # R__ = repeatable; re-runnable views/procs
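As a concrete sketch, the two flavors of file might contain something like this (table, column, and role names are illustrative, not from the source project):

```sql
-- V1.1.0__add_quality_column.sql  (versioned: applied exactly once, in version order)
ALTER TABLE analytics.meter_reads ADD COLUMN quality VARCHAR(20);

-- R__masking_policies.sql  (repeatable: re-applied whenever the file's checksum changes)
CREATE OR REPLACE MASKING POLICY pii_mask AS (val STRING)
  RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() IN ('ANALYST_PII') THEN val ELSE '***' END;
```

Note the repeatable file uses CREATE OR REPLACE precisely so re-running it is idempotent.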
The Jenkins pipeline runs schemachange -f ./snowflake-migrations -a $SF_ACCOUNT -u $SF_USER on merge to main. It applies any new files in version order and tracks applied state in a Snowflake metadata table.
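In a Jenkins shell step, that usually looks something like this (variable names and the change-history table name are placeholders; schemachange reads the password from the SNOWFLAKE_PASSWORD environment variable rather than a flag):

```
export SNOWFLAKE_PASSWORD="$SF_PASSWORD"      # injected by the credentials plugin / secret manager
schemachange \
  -f ./snowflake-migrations \
  -a "$SF_ACCOUNT" -u "$SF_USER" \
  -r "$SF_ROLE" -w "$SF_WAREHOUSE" \
  -c "ANALYTICS.SCHEMACHANGE.CHANGE_HISTORY"  # metadata table tracking which versions have run
```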
2 · "Environment" isn't a branch
In app world, feature/foo deploys to a feature-branch environment via preview deploys. In DE, the data is too large to clone per branch — except that Snowflake's zero-copy clone makes cloning effectively free.
The modern pattern:
- PROD: the real database; every DDL change goes through schemachange.
- DEV: a zero-copy clone of PROD, refreshed nightly (or on-demand per branch). See Zero-Copy Clone.
- CI on a branch:
CREATE DATABASE dev_feature_foo CLONE prod; → apply branch migrations → run data tests → tear down.
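In SQL, that whole lifecycle is two statements plus whatever runs in between (database names hypothetical):

```sql
-- 1. Zero-copy clone: copies metadata only; shares prod's underlying storage
CREATE DATABASE dev_feature_foo CLONE prod;

-- 2. ...apply the branch's migrations and run data tests against dev_feature_foo...

-- 3. Tear down: the clone never materialized its own data, so this is cheap too
DROP DATABASE IF EXISTS dev_feature_foo;
```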
3 · Data tests are first-class CI checks
App CI runs unit tests. DE CI runs data tests: the pipeline that produces daily_consumption must produce non-null, non-negative values, with no duplicates on (meter_id, date). These assertions run on real (or cloned) data, not mocks.
Common tools:
- dbt tests — YAML-declared assertions (not_null, unique, relationships, accepted_values).
- Great Expectations — Python-based, more flexible, heavier-weight.
- Snowflake native — just a SQL assertion wrapped in a CI step.
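The daily_consumption assertions above, written as dbt tests, would look roughly like this (model and column names taken from the example; the multi-column uniqueness test assumes the dbt_utils package):

```yaml
# models/schema.yml
version: 2
models:
  - name: daily_consumption
    tests:
      - dbt_utils.unique_combination_of_columns:   # no duplicates on (meter_id, date)
          combination_of_columns: [meter_id, date]
    columns:
      - name: meter_id
        tests: [not_null]
      - name: consumption_kwh
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0        # non-negative
```

`dbt test` compiles each assertion into a SQL query that fails the build if it returns any rows.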
4 · Secrets have a different shape
If you've used a secret-manager integration (External Secrets Operator, Vault, similar) — same principle here. DE pipelines need:
- Warehouse user credentials (Snowflake)
- AWS role ARNs for stages & IAM
- API keys for source systems (CIS exports, weather feeds)
- Webhook URLs for alerts
Same secret-rotation rigor as app deploys. AWS Secrets Manager is typical. Never put keys in CREATE STAGE SQL that gets committed.
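The keyless alternative in Snowflake is a storage integration: the stage references an IAM role, so nothing secret lands in version control (account ID, role, and bucket below are placeholders):

```sql
CREATE STORAGE INTEGRATION s3_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-stage-role'
  STORAGE_ALLOWED_LOCATIONS = ('s3://meter-data-raw/');

CREATE STAGE raw_stage
  STORAGE_INTEGRATION = s3_int
  URL = 's3://meter-data-raw/';
```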
A complete DE CI/CD pipeline looks like
PR opened
│
├─ 1. Lint SQL files (sqlfluff, snowsql parse)
├─ 2. Create dev clone: dev_pr_<N> CLONE prod
├─ 3. Run schemachange migrations against dev_pr_<N>
├─ 4. Run data tests (dbt test, great_expectations)
└─ 5. Report pass/fail on PR
│
PR merged
│
├─ 6. Run schemachange against PROD (from main branch)
├─ 7. Deploy Matillion job changes via Matillion API
├─ 8. Drop dev_pr_<N>
└─ 9. Notify #data-engineering channel
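The PR half of that flow, sketched as a declarative Jenkinsfile (the runner image, snowsql usage, and variable names are assumptions; CHANGE_ID is Jenkins' built-in PR number on PR builds):

```groovy
pipeline {
  agent { docker { image 'data-ci:latest' } }   // hypothetical image with sqlfluff, schemachange, dbt, snowsql
  environment { DEV_DB = "dev_pr_${env.CHANGE_ID}" }
  stages {
    stage('Lint')    { steps { sh 'sqlfluff lint snowflake-migrations/' } }
    stage('Clone')   { steps { sh 'snowsql -q "CREATE DATABASE $DEV_DB CLONE prod"' } }
    stage('Migrate') { steps { sh 'schemachange -f ./snowflake-migrations -a "$SF_ACCOUNT" -u "$SF_USER" -d "$DEV_DB"' } }
    stage('Test')    { steps { sh 'dbt test' } }
  }
  post { always { sh 'snowsql -q "DROP DATABASE IF EXISTS $DEV_DB"' } }   // tear the clone down, pass or fail
}
```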
The sentence that lands the interview
"Most of DE CI/CD is the pipeline patterns I already know โ version control, Jenkinsfile, secret management, artifact registry. What's unique is (a) schema migrations via schemachange because the database is state, (b) using zero-copy clones to get preview environments, and (c) data-quality tests that run on real data rather than mocks."