⭐ Featured Project🚀 Production Ready🏢 Enterprise DWH

Olist E-commerceData Pipeline

Transforming 100K+ Brazilian marketplace orders into actionable business intelligence through automated ELT pipelines, advanced analytics, and enterprise-grade data warehousing

Python 3.9+ - Core programming language for data processing and automation
Python 3.9+
Apache Airflow 2.7 - Workflow orchestration and pipeline scheduling
Airflow 2.7
dbt 1.7.4 - Data transformation and testing framework
dbt 1.7.4
Snowflake - Cloud data warehouse platform
Snowflake
Power BI - Business intelligence and data visualization
Power BI
Docker - Containerization platform for deployment
Docker
Olist Data Pipeline Architecture - Current workflow diagram (Note: olist-architecture-diagram.png not found)
🚀 View GitHub Repository

Problem & Motivation

Olist, Brazil's largest e-commerce marketplace, processes 100K+ orders monthly across 9 interconnected data tables. Raw transactional data alone cannot answer critical business questions: Which regions drive revenue? How does delivery time impact customer satisfaction? What factors influence repeat purchases?

This project builds a production-grade analytics pipeline that transforms Olist's fragmented order data into actionable insights. By implementing Medallion Architecture, automated quality testing, and dimensional modeling, this solution enables business stakeholders to make data-driven decisions about logistics, marketing, and customer retention strategies.

Pipeline Architecture

Complete data pipeline transforming raw Olist e-commerce data into actionable business intelligence through automated ELT processes.

01

Data Ingestion

Extract and load raw data from PostgreSQL database and CSV files into Snowflake.

Sources:

Extract and load from PostgreSQL (9 tables) and CSVs (product categories, geolocation) using Python + SQLAlchemy.

Performance:

<5 minute latency, handling 100K+ records daily for reliable data availability.

02

Bronze Layer

Store raw data in an immutable format in Snowflake with a full audit trail.

Immutability:

Store raw data with a one-to-one mapping to source systems, enabling full data lineage and version control.

Reliability:

100% data preservation guarantee, providing a complete and reliable audit trail.

03

Silver Layer

Clean and standardize data with dbt transformations and quality rules.

Transformation:

Clean and standardize using dbt, including deduplication, schema validation, and type checking.

Accuracy & Performance:

Achieve 99.5% data accuracy with optimized queries and proper indexing.

04

Gold Layer

Transform into a star schema dimensional model optimized for analytics.

Modeling:

Star Schema model centered on FACT_ORDERS, ready for business analysis.

Analytics-Ready:

Enables customer analytics, revenue analysis, and KPI reporting for Power BI.

05

Data Quality

Ensure data reliability with 26 comprehensive automated dbt tests.

Test Coverage:

26 automated dbt tests for uniqueness, not-null, accepted values, and relationships.

Reliability:

Enforce a 99.5% quality threshold with anomaly detection and integrated alerting.

06

Orchestration

Automate scheduling and monitoring with a robust Apache Airflow DAG.

Automation:

Schedule daily pipeline execution at 2 AM UTC using Apache Airflow 2.7 with retry logic.

Dependability:

Maintain a 99.9% pipeline success rate with a <15 minute end-to-end runtime.

Pipeline Impact Summary

9
Source Tables
26
Quality Tests
<15min
Runtime

This comprehensive pipeline processes 100K+ monthly orders from raw ingestion to business-ready analytics, enabling Olist to make data-driven decisions about logistics optimization, customer segmentation, and market expansion strategies.

Data Model Architecture

Star Schema optimized for analytical queries

Olist E-commerce Star Schema Diagram

Dimensional model with Fct_Order_Items at the center, surrounded by dimension tables for customers, products, geography, and time-based analytics

Challenges & Solutions

Key challenges overcome during the development process and their strategic solutions

Data Integration Complexity

Problem

Multiple source systems with different schemas

Solution

Medallion Architecture with incremental loading

Result

99.9% successful data integration rate

Performance Optimization

Problem

Large datasets causing slow query performance

Solution

Columnar storage with partitioning strategy

Result

15-minute pipeline execution time

Data Quality Assurance

Problem

Inconsistent data formats across sources

Solution

26 automated quality tests and validation rules

Result

99.5% data accuracy achievement

Business Insights

Key discoveries that transformed raw data into strategic business advantages

Revenue Distribution

70%
Southeast
15%
Northeast
10%
South
5%
North

70% of total revenue originates from Southeast Brazil

🎯 Identified key geographic markets for targeted marketing campaigns

Delivery Performance

1-3 days4-6 days7-9 days10-12 days13+ days60%80%95%85% Satisfactionat 7-day delivery

Orders with less than 7 day delivery have 85% higher customer satisfaction

⚡ Optimized logistics strategy focusing on delivery speed improvements

Customer Lifetime Value

80/20 Rule
Top 20% = 60% Revenue
Top 20%
60% Rev
Mid 30%
30% Rev
Low 50%
10% Rev

Top 20% of customers contribute to 60% of total revenue

👑 Implemented customer segmentation for personalized retention strategies

Strategic Impact

These insights demonstrate how data-driven decisions can transform operational efficiency and customer satisfaction. By understanding regional performance patterns, delivery optimization opportunities, and customer value distribution, Olist can now make strategic investments that maximize ROI and enhance competitive advantage.

Code Showcase

Interactive code browser - click on any file tab to explore the implementation

dbt Gold Layer Model

SQL
-- Gold Layer: Customer Analytics Fact Table
WITH customer_orders AS (
    SELECT
        c.customer_id,
        c.customer_unique_id,
        c.customer_zip_code_prefix,
        c.customer_city,
        c.customer_state,
        COUNT(o.order_id) as total_orders,
        SUM(o.order_purchase_timestamp) as first_purchase,
        SUM(o.payment_value) as total_spent,
        AVG(o.payment_value) as avg_order_value,
        MAX(o.order_purchase_timestamp) as last_purchase,
        DATEDIFF(MAX(o.order_purchase_timestamp),
                MIN(o.order_purchase_timestamp)) as customer_lifespan_days
    FROM silver_customers c
    LEFT JOIN silver_orders o ON c.customer_id = o.customer_id
    GROUP BY c.customer_id, c.customer_unique_id,
             c.customer_zip_code_prefix, c.customer_city, c.customer_state
)

SELECT
    customer_id,
    customer_unique_id,
    customer_zip_code_prefix,
    customer_city,
    customer_state,
    total_orders,
    first_purchase,
    total_spent,
    avg_order_value,
    last_purchase,
    customer_lifespan_days,
    CASE
        WHEN total_orders >= 3 AND total_spent >= 500 THEN 'High Value'
        WHEN total_orders >= 2 THEN 'Regular'
        ELSE 'New'
    END as customer_segment
FROM customer_orders
QUALIFY ROW_NUMBER() OVER (ORDER BY total_spent DESC) <= 100000
📊 SQL Model🔧 Data Transformation

Key Learnings & Future Vision

Key learnings from implementation and strategic vision for evolving data capabilities

💡 Key Learnings

Idempotent pipeline design

Prevents duplicate processing and ensures data consistency across failures

Early testing saves time

Implementing quality checks during development reduces debugging time by 60%

Documentation is crucial

Clear documentation enables team collaboration and reduces onboarding time

Business context matters

Understanding business requirements leads to more impactful technical solutions

🚀 Future Enhancements

Real-time Kafka streaming

Implement streaming data processing for real-time analytics capabilities

ML delivery predictions

Build machine learning models to predict delivery times and optimize logistics

Multi-cloud support

Extend infrastructure to support AWS and Google Cloud for better reliability

Advanced customer analytics

Develop deeper customer segmentation and lifetime value prediction models

"Great data engineering combines technical excellence with deep business understanding to deliver measurable impact while maintaining flexibility for future growth."
— Engineering Philosophy

"تَعَبُ كُلّها الحَياةُ."

© 2025 Yousef Mahmoud | Data Engineering Portfolio