Snowflake ETL

data architecture · v03

From CSV chaos
to queryable truth.

Production-grade ETL pipeline: CSV and PostgreSQL sources loaded into a Snowflake warehouse with star-schema modelling and SCD Type 2 history. Built during my Data Engineering internship at Nagarro.

Python 3.11 · Snowflake · Pandas · SQLAlchemy · SCD Type 2 · GitHub Actions
Title: Snowflake ETL Pipeline
Drawn: V. Sharma
Revision: 03
Date: 2025-01-14
Project: NAGARRO-WH
Cadence: Nightly @ 02:00 ACST

§1 Pipeline Architecture

Daily batch flow from raw sources through transform to warehouse

  1. Source: CSV + Postgres
  2. Extract: pandas + JDBC
  3. Transform: SCD Type 2
  4. Load: COPY INTO + MERGE
  5. Snowflake: X-Small WH
Records/run: 2.14M · Wall time: 14m 22s · Throughput: 2500 rec/s · Warehouse: X-Small, auto-suspend 60s · Cost: ≈ $0.18 per run
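As a quick sanity check, the quoted throughput can be recomputed from the record count and wall time. The sketch below uses only the figures stated above; it suggests the 2500 rec/s value is a rounded number.

```python
# Figures quoted in the run summary above.
records_per_run = 2_140_000
wall_time_s = 14 * 60 + 22  # 14m 22s expressed in seconds

# Average throughput over the whole run.
throughput = records_per_run / wall_time_s

print(f"wall time: {wall_time_s} s")
print(f"throughput: {throughput:.0f} rec/s")
```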

§2 Star Schema

Fact at centre · four conformed dimensions · Kimball-style

dim_customer
  customer_key  INT PK
  cust_id       VARCHAR
  name          VARCHAR
  city          VARCHAR
  scd2_valid    DATE

dim_date
  date_key      INT PK
  date          DATE
  year          INT
  quarter       INT
  dow           INT

fact_sales
  sale_id       BIGINT PK
  customer_key  FK
  product_key   FK
  date_key      FK
  store_key     FK
  qty           INT
  gross_amt     DECIMAL
  discount      DECIMAL

dim_product
  product_key   INT PK
  sku           VARCHAR
  name          VARCHAR
  category      VARCHAR
  list_price    DECIMAL

dim_store
  store_key     INT PK
  store_id      VARCHAR
  region        VARCHAR
  opened        DATE
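To illustrate how the fact table joins to its dimensions, here is a minimal sketch that uses Python's built-in sqlite3 as a stand-in for Snowflake. Only two of the four dimensions are created, the sample rows are invented, and treating discount as an absolute amount subtracted from gross_amt is an assumption rather than something the schema states.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Reduced copy of the star schema: one fact table, two dimensions.
conn.executescript("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY, cust_id TEXT, name TEXT, city TEXT
);
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY, sku TEXT, name TEXT, category TEXT
);
CREATE TABLE fact_sales (
    sale_id      INTEGER PRIMARY KEY,
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    qty          INTEGER,
    gross_amt    REAL,
    discount     REAL
);
-- Hypothetical sample data, for illustration only.
INSERT INTO dim_customer VALUES (1, 'C001', 'Asha', 'Adelaide'), (2, 'C002', 'Ben', 'Perth');
INSERT INTO dim_product  VALUES (10, 'SKU-1', 'Widget', 'Hardware'), (11, 'SKU-2', 'Gadget', 'Hardware');
INSERT INTO fact_sales   VALUES (100, 1, 10, 2, 10.0, 1.0),
                                (101, 1, 11, 1,  8.0, 0.0),
                                (102, 2, 10, 3, 15.0, 0.0);
""")

# Typical BI-style query: one join per dimension, aggregate on the fact.
rows = conn.execute("""
SELECT c.city, p.category, SUM(f.gross_amt - f.discount) AS net_amt
FROM fact_sales f
JOIN dim_customer c ON c.customer_key = f.customer_key
JOIN dim_product  p ON p.product_key  = f.product_key
GROUP BY c.city, p.category
ORDER BY c.city
""").fetchall()

for row in rows:
    print(row)
```

Each dimension is reached with a single join from the fact table, which is the property the Kimball-style layout is chosen for.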

§3 SCD Type 2 · Live Demo

Update a customer attribute to see the current row expire and its history preserved

ck | cust_id | name | city | eff_date | exp_date | is_cur
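The SCD Type 2 behaviour the demo shows (expire the current row, append a new current row) can be sketched in plain Python. This is an illustration under assumptions, not the pipeline's actual transform: the function name apply_scd2, the 9999-12-31 open-ended sentinel, the surrogate-key rule and the sample customer are all hypothetical. The column names follow the demo table header.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # assumed sentinel meaning "not yet expired"


def apply_scd2(rows, cust_id, new_city, as_of):
    """Return a new history list with the city change applied as SCD Type 2."""
    current = next(
        (r for r in rows if r["cust_id"] == cust_id and r["is_cur"]), None
    )
    if current is None:
        raise KeyError(f"no current row for {cust_id}")
    if current["city"] == new_city:
        return list(rows)  # attribute unchanged, so no new version is needed

    # Close out the old version instead of overwriting it.
    expired = {**current, "exp_date": as_of, "is_cur": False}
    # Open a new version with a fresh surrogate key.
    new_row = {
        **current,
        "ck": max(r["ck"] for r in rows) + 1,
        "city": new_city,
        "eff_date": as_of,
        "exp_date": OPEN_END,
        "is_cur": True,
    }
    return [expired if r is current else r for r in rows] + [new_row]


history = [
    {"ck": 1, "cust_id": "C001", "name": "Asha", "city": "Adelaide",
     "eff_date": date(2024, 1, 1), "exp_date": OPEN_END, "is_cur": True},
]
updated = apply_scd2(history, "C001", "Sydney", date(2025, 1, 14))
for r in updated:
    print(r["ck"], r["city"], r["eff_date"], r["exp_date"], r["is_cur"])
```

The old row keeps its city and gains an expiry date, so a query as of any past date still resolves to the value that was true at that time.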

§4 Run Metrics

Last 30-day window · production instance

Records / run: 2.14M (▲ 4.2% vs last week)
Wall time: 14m 22s (within SLA)
Data quality: 99.87% (nulls + type checks)
Tables synced: 14 / 14 (all green)
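One way a data quality percentage based on null and type checks could be computed is sketched below. How the pipeline actually derives its 99.87% figure is not stated here, so the function quality_rate, the per-record pass rule and the sample records are all assumptions made for illustration.

```python
def quality_rate(records, schema):
    """Fraction of records whose required columns are non-null and correctly typed."""
    def passes(rec):
        return all(
            rec.get(col) is not None and isinstance(rec[col], typ)
            for col, typ in schema.items()
        )

    if not records:
        return 1.0  # nothing to fail
    return sum(passes(r) for r in records) / len(records)


# Hypothetical expected types for two columns of a sales record.
schema = {"cust_id": str, "qty": int}

records = [
    {"cust_id": "C001", "qty": 2},
    {"cust_id": "C002", "qty": 1},
    {"cust_id": None, "qty": 3},     # fails the null check
    {"cust_id": "C004", "qty": "5"}, # fails the type check
]

rate = quality_rate(records, schema)
print(f"data quality: {rate:.2%}")
```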
  1. Staging tables use VARIANT to tolerate upstream schema drift without blocking the load.
  2. CDC handled via updated_at watermarks — log-based CDC deferred until WAL rates justify it.
  3. Idempotency guaranteed by MERGE INTO on natural keys. Pipeline is safely re-runnable.
  4. Star schema over snowflake schema — fewer joins, faster BI dashboards (Power BI, Metabase).
  5. Alerts via GitHub Actions + Slack webhook. Two retries with 5-minute backoff.
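Notes 2 and 3 above can be sketched together: pull only rows whose updated_at is newer than the stored watermark, then upsert them on a natural key so a re-run leaves the target unchanged. The helper names extract_since and merge, the in-memory dict standing in for a Snowflake table, and the sample rows are hypothetical. The real pipeline uses MERGE INTO, which this only imitates.

```python
from datetime import datetime


def extract_since(source_rows, watermark):
    """Return rows changed after the watermark, plus the advanced watermark."""
    # Strict ">" can miss rows that share the watermark's exact timestamp;
    # a real pipeline would need a policy for that edge case.
    changed = [r for r in source_rows if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in changed), default=watermark)
    return changed, new_watermark


def merge(target, rows, key="cust_id"):
    """Upsert rows into target by natural key (a stand-in for MERGE INTO)."""
    for r in rows:
        target[r[key]] = r  # insert if new, overwrite if the key already exists
    return target


source_rows = [
    {"cust_id": "C001", "city": "Adelaide", "updated_at": datetime(2025, 1, 12, 9, 0)},
    {"cust_id": "C002", "city": "Perth", "updated_at": datetime(2025, 1, 13, 18, 30)},
    {"cust_id": "C003", "city": "Darwin", "updated_at": datetime(2025, 1, 13, 23, 45)},
]
last_watermark = datetime(2025, 1, 13, 2, 0)

changed, new_watermark = extract_since(source_rows, last_watermark)

target = {}
merge(target, changed)
snapshot = dict(target)
merge(target, changed)  # simulate re-running the same batch

print(sorted(target), new_watermark, target == snapshot)
```

Because the upsert is keyed on cust_id, replaying the same batch overwrites rows with identical values instead of duplicating them, which is what makes the re-run safe.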