Building a scalable real-time streaming platform for monitoring NYC transit system with AWS Kinesis, Lambda, Timestream, and Grafana. Processing millions of transit updates per hour with sub-second latency.

NYC's transit system generates millions of real-time updates daily from buses, subways, and trains. Traditional batch processing cannot provide timely insights for route optimization, delay management, and passenger experience improvement. Transit authorities need real-time visibility into system performance to make immediate operational decisions.
This graduation project builds a production-grade streaming analytics platform that processes transit data in real-time using AWS Kinesis, Lambda, and Timestream. The system enables real-time monitoring, route performance analysis, and predictive insights to improve NYC's transit efficiency and passenger satisfaction.
Real-time streaming pipeline for processing NYC transit data with sub-second latency
Event-driven data flow from NYC MTA API through AWS Lambda Producer to Kinesis stream
Serverless function ingests MTA API data and publishes to Kinesis stream
Scalable real-time data streaming with automatic sharding and fault tolerance
AWS Lambda Consumer processes Kinesis events in real-time and computes metrics
Lambda automatically triggers on Kinesis records with auto-scaling
Processes delay calculations, route metrics, and performance indicators
Dual storage: AWS Timestream for real-time monitoring, Snowflake for historical analysis
Time-series database optimized for high-velocity real-time monitoring data
Gold layer with dimensional model for historical batch analysis and trends
Grafana dashboards with live metrics visualization and alerts
Real-time charts and graphs connected to Timestream and Snowflake
On-time performance, delays, route analytics, and operational insights
This real-time platform processes millions of transit updates with sub-second latency, enabling immediate insights for route optimization, delay management, and improved passenger experience.
Complete end-to-end workflow from data ingestion to real-time analytics
Batch Pipeline: Ingest 9 CSV files containing static GTFS data (routes, trips, stops, stop times) from MTA's public data repository.
Streaming Pipeline: Event-driven flow: NYC MTA API → AWS Lambda (Producer) → AWS Kinesis Data Streams → AWS Lambda (Consumer) → AWS Timestream for real-time monitoring.
Store raw data in DuckDB with one-to-one mapping to source systems, enabling full data lineage and version control. All incoming data is preserved in its original format.
Standardization: Convert table names to snake_case, set correct data types (bigint → timestamp), and add calculated columns (stop_duration).
Quality Tests: Implement integrity tests on primary keys and validate referential integrity with framework tests. Output: New staging schema.
Transformation: Remove unnecessary tables and design star schema with FACT_STOP_TIMES as the central fact table.
Dimensions: Create DIM_TRIPS, DIM_STOPS, and DIM_ROUTES dimension tables. Add calculated column: trip_duration.
Output: New NYC_transit production schema in Snowflake, optimized for analytical queries.
Prefect: Scheduled DAGs for automated pipeline execution with retry logic and error handling.
Grafana: Live dashboards with real-time and historical metrics for system health monitoring.
Star Schema optimized for transit analytics

Dimensional model with FACT_STOP_TIMES at the center, connected to dimension tables for trips, stops, and routes. The fact table contains metrics like stop_sequence, accumulative_trip_time, and service_id, enabling comprehensive transit performance analysis across the NYC network.
Key challenges overcome during the development process and their strategic solutions
Problem
Processing millions of transit updates per minute with sub-second latency requirements
Solution
Implemented AWS Kinesis for high-throughput ingestion and AWS Lambda for distributed serverless processing
Result
99.9% uptime with <500ms processing latency
Problem
System needed to handle peak loads during rush hours without degradation
Solution
Horizontal scaling with Kinesis sharding and Lambda concurrent execution auto-scaling
Result
Handled 10M+ events/hour with linear scalability
Problem
Ensuring data accuracy and handling missing/invalid transit updates
Solution
Implemented validation pipelines, schema evolution, and dead letter queues
Result
99.5% data quality score with automatic error recovery
Discoveries that transformed real-time transit data into actionable operational intelligence
High utilization across the entire network
🎯 Comprehensive coverage analysis for capacity planning
Route L handles 6,000+ trips (highest)
🎯 Priority optimization for high-volume routes
322 active alerts monitored
🎯 Data-driven maintenance scheduling
Interactive code browser - click on any file tab to explore the implementation
import boto3
import json
import time
def lambda_handler(event, context):
timestream = boto3.client('timestream-write')
for record in event['Records']:
# Decode Kinesis Data
payload = json.loads(record['kinesis']['data'])
dimensions = [
{'Name': 'route_id', 'Value': payload['route_id']},
{'Name': 'vehicle_id', 'Value': payload['vehicle_id']}
]
record_data = {
'Time': str(int(time.time() * 1000)),
'Dimensions': dimensions,
'MeasureName': 'delay_seconds',
'MeasureValue': str(payload['delay']),
'MeasureValueType': 'DOUBLE'
}
timestream.write_records(
DatabaseName='NYCTransitDB',
TableName='RealTimeMetrics',
Records=[record_data]
)
return {
'statusCode': 200,
'body': json.dumps('Successfully processed records')
}Key learnings from implementation and strategic vision for evolving capabilities
Learned to design fault-tolerant streaming pipelines with proper checkpointing and state management
Mastered event-driven architecture with Kinesis Data Streams, Lambda triggers, and auto-scaling for serverless real-time processing
Gained expertise in event-driven processing, stateless function design, and integrating Lambda with AWS Timestream
Understood horizontal scaling patterns and auto-scaling strategies for streaming workloads
Add predictive models for delay forecasting and route optimization using historical patterns
Extend the system to support multiple cities with configurable transit APIs
Build mobile app with push notifications for real-time transit updates
Implement anomaly detection for unusual transit patterns and automated alerting
"Real-time data engineering is about building systems that don't just process data, but transform it into immediate operational intelligence that drives decision-making at the speed of business."
"تَعَبُ كُلّها الحَياةُ."
© 2025 Yousef Mahmoud | Data Engineering Portfolio