🎓 Graduation Project🚀 Real-Time Processing📊 Streaming Analytics

Real-Time NYC TransitAnalytics Platform

Building a scalable real-time streaming platform for monitoring NYC transit system with AWS Kinesis, Lambda, Timestream, and Grafana. Processing millions of transit updates per hour with sub-second latency.

Python 3.9+ - Core programming language
Python 3.9+
AWS Kinesis - Real-time streaming
AWS Kinesis
AWS Lambda - Event-driven processing
AWS Lambda
AWS Timestream - Time-series storage
AWS Timestream
Grafana - Real-time dashboards
Grafana
Snowflake - Cloud Data Warehouse
Snowflake
dbt - Data Transformation
dbt
Prefect - Workflow Orchestration
Prefect
NYC Transit - Real-Time Streaming Pipeline Architecture
🚀 View GitHub Repository

Problem & Motivation

NYC's transit system generates millions of real-time updates daily from buses, subways, and trains. Traditional batch processing cannot provide timely insights for route optimization, delay management, and passenger experience improvement. Transit authorities need real-time visibility into system performance to make immediate operational decisions.

This graduation project builds a production-grade streaming analytics platform that processes transit data in real-time using AWS Kinesis, Lambda, and Timestream. The system enables real-time monitoring, route performance analysis, and predictive insights to improve NYC's transit efficiency and passenger satisfaction.

System Architecture

Real-time streaming pipeline for processing NYC transit data with sub-second latency

01

Data Ingestion

Event-driven data flow from NYC MTA API through AWS Lambda Producer to Kinesis stream

AWS Lambda Producer:

Serverless function ingests MTA API data and publishes to Kinesis stream

AWS Kinesis Stream:

Scalable real-time data streaming with automatic sharding and fault tolerance

02

Stream Processing

AWS Lambda Consumer processes Kinesis events in real-time and computes metrics

Event-Driven Processing:

Lambda automatically triggers on Kinesis records with auto-scaling

Real-Time Analytics:

Processes delay calculations, route metrics, and performance indicators

03

Storage Layer

Dual storage: AWS Timestream for real-time monitoring, Snowflake for historical analysis

AWS Timestream:

Time-series database optimized for high-velocity real-time monitoring data

Snowflake Star Schema:

Gold layer with dimensional model for historical batch analysis and trends

04

Real-time Dashboard

Grafana dashboards with live metrics visualization and alerts

Live Data Visualization:

Real-time charts and graphs connected to Timestream and Snowflake

Performance Metrics:

On-time performance, delays, route analytics, and operational insights

System Impact Summary

10M+
Events/Hour
<500ms
Processing Latency
45
Routes Monitored

This real-time platform processes millions of transit updates with sub-second latency, enabling immediate insights for route optimization, delay management, and improved passenger experience.

Pipeline Workflow

Complete end-to-end workflow from data ingestion to real-time analytics

1

Data Ingestion

Batch Pipeline: Ingest 9 CSV files containing static GTFS data (routes, trips, stops, stop times) from MTA's public data repository.

Streaming Pipeline: Event-driven flow: NYC MTA API → AWS Lambda (Producer) → AWS Kinesis Data Streams → AWS Lambda (Consumer) → AWS Timestream for real-time monitoring.

2

Bronze Layer (Raw Storage)

Store raw data in DuckDB with one-to-one mapping to source systems, enabling full data lineage and version control. All incoming data is preserved in its original format.

3

Silver Layer (DBT Transformations)

Standardization: Convert table names to snake_case, set correct data types (bigint → timestamp), and add calculated columns (stop_duration).

Quality Tests: Implement integrity tests on primary keys and validate referential integrity with framework tests. Output: New staging schema.

4

Gold Layer (Star Schema)

Transformation: Remove unnecessary tables and design star schema with FACT_STOP_TIMES as the central fact table.

Dimensions: Create DIM_TRIPS, DIM_STOPS, and DIM_ROUTES dimension tables. Add calculated column: trip_duration.

Output: New NYC_transit production schema in Snowflake, optimized for analytical queries.

5

Orchestration & Monitoring

Prefect: Scheduled DAGs for automated pipeline execution with retry logic and error handling.

Grafana: Live dashboards with real-time and historical metrics for system health monitoring.

Data Model Architecture

Star Schema optimized for transit analytics

NYC Transit Star Schema - FACT_STOP_TIMES with dimension tables

Dimensional model with FACT_STOP_TIMES at the center, connected to dimension tables for trips, stops, and routes. The fact table contains metrics like stop_sequence, accumulative_trip_time, and service_id, enabling comprehensive transit performance analysis across the NYC network.

Challenges & Solutions

Key challenges overcome during the development process and their strategic solutions

Real-time Data Processing

Problem

Processing millions of transit updates per minute with sub-second latency requirements

Solution

Implemented AWS Kinesis for high-throughput ingestion and AWS Lambda for distributed serverless processing

Result

99.9% uptime with <500ms processing latency

Scalability & Performance

Problem

System needed to handle peak loads during rush hours without degradation

Solution

Horizontal scaling with Kinesis sharding and Lambda concurrent execution auto-scaling

Result

Handled 10M+ events/hour with linear scalability

Data Quality & Reliability

Problem

Ensuring data accuracy and handling missing/invalid transit updates

Solution

Implemented validation pipelines, schema evolution, and dead letter queues

Result

99.5% data quality score with automatic error recovery

Business Insights

Discoveries that transformed real-time transit data into actionable operational intelligence

Network Scale

29
Routes
1,488
Stops
84K
Weekly Trips

High utilization across the entire network

🎯 Comprehensive coverage analysis for capacity planning

Route Performance

L
6K+
F
100m+
D
100m+
Z
100m+

Route L handles 6,000+ trips (highest)

🎯 Priority optimization for high-volume routes

Real-Time Delays

0.139s
Average Delay
✓ Excellent Performance
⚠️ Attention Needed
Route L: 11,164s total delay
Hotspots: L29S, L28S, L03N

322 active alerts monitored

🎯 Data-driven maintenance scheduling

Code Showcase

Interactive code browser - click on any file tab to explore the implementation

AWS Lambda Consumer - Kinesis to Timestream

Python
import boto3
import json
import time

def lambda_handler(event, context):
    timestream = boto3.client('timestream-write')
    
    for record in event['Records']:
        # Decode Kinesis Data
        payload = json.loads(record['kinesis']['data'])
        
        dimensions = [
            {'Name': 'route_id', 'Value': payload['route_id']},
            {'Name': 'vehicle_id', 'Value': payload['vehicle_id']}
        ]
        
        record_data = {
            'Time': str(int(time.time() * 1000)),
            'Dimensions': dimensions,
            'MeasureName': 'delay_seconds',
            'MeasureValue': str(payload['delay']),
            'MeasureValueType': 'DOUBLE'
        }
        
        timestream.write_records(
            DatabaseName='NYCTransitDB',
            TableName='RealTimeMetrics',
            Records=[record_data]
        )
    
    return {
        'statusCode': 200,
        'body': json.dumps('Successfully processed records')
    }

Key Learnings & Future Vision

Key learnings from implementation and strategic vision for evolving capabilities

💡 Key Learnings

Streaming Architecture Design

Learned to design fault-tolerant streaming pipelines with proper checkpointing and state management

AWS Kinesis & Lambda Best Practices

Mastered event-driven architecture with Kinesis Data Streams, Lambda triggers, and auto-scaling for serverless real-time processing

Real-time Analytics with Lambda

Gained expertise in event-driven processing, stateless function design, and integrating Lambda with AWS Timestream

System Scalability

Understood horizontal scaling patterns and auto-scaling strategies for streaming workloads

🚀 Future Enhancements

Machine Learning Integration

Add predictive models for delay forecasting and route optimization using historical patterns

Multi-city Expansion

Extend the system to support multiple cities with configurable transit APIs

Mobile App Integration

Build mobile app with push notifications for real-time transit updates

Advanced Analytics

Implement anomaly detection for unusual transit patterns and automated alerting

"Real-time data engineering is about building systems that don't just process data, but transform it into immediate operational intelligence that drives decision-making at the speed of business."
— Engineering Philosophy

"تَعَبُ كُلّها الحَياةُ."

© 2025 Yousef Mahmoud | Data Engineering Portfolio