🎬 Serverless Pipeline☁️ AWS Cloud Native📊 100+ Years Data

Hollywood in the CloudBox Office Analytics

A fully automated, cloud-native data engineering project that ingests, processes, and analyzes over 100 years of US Box Office data (1902-2024) using AWS serverless services and Power BI.

AWS Lambda

AWS S3

AWS Glue

AWS Athena

Power BI

Hollywood in the Cloud - Serverless Pipeline Architecture Diagram

🚀 View GitHub Repository

Problem & Motivation

The film industry generates massive amounts of box office data spanning over 100 years (1902-2024), but this data exists in fragmented sources. Production studios and film distributors need data-driven insights to answer critical questions: What are long-term trends in movie production? Which genres are most common vs. most popular? How do English and Non-English films compare?

This project builds a production-grade serverless data pipeline on AWS infrastructure that automatically ingests, processes, and analyzes box office data. By implementing a cloud-native architecture with Lambda, S3, Glue, and Athena, this solution enables stakeholders to make strategic decisions about green-lighting projects and optimal release windows.

System Architecture

Serverless, event-driven architecture for automated box office analytics

Data Ingestion

AWS Lambda function fetches box office data from TMDB API and stores raw CSV files in S3

Automated Fetching:

Lambda function retrieves top-rated, upcoming movies, and genre data from TMDB API

S3 Data Lake:

Raw CSV files stored in S3 with separated locations for crawler compatibility

Schema Discovery

AWS Glue crawlers automatically parse metadata from each CSV file

Automatic Discovery:

Crawlers detect schema, data types, and create table definitions in Glue Data Catalog

Multiple Sources:

Individual crawler per CSV file ensures accurate schema parsing for each dataset

Data Transformation

AWS Glue jobs clean and optimize data, then Athena creates unified queryable table

Data Cleaning:

Glue jobs standardize formats, validate data types, and handle missing values

Unified Table:

Athena SQL queries combine all sources into single analytical table

Visualization

Power BI connects via ODBC to Athena for interactive dashboard analysis

ODBC Connection:

Secure cloud-to-cloud connection between Power BI and AWS Athena

Interactive Dashboards:

Real-time analysis of 100+ years of box office trends, genre performance, and language distribution

Pipeline Impact Summary

100+

Years of Data

Data Sources

Serverless

Architecture

This serverless pipeline processes over 100 years of box office data (1902-2024), enabling production studios and distributors to make data-driven decisions about genre trends, release windows, and market opportunities.

Pipeline Workflow

Complete end-to-end workflow from API ingestion to interactive dashboards

Data Ingestion

AWS Lambda Function: Automated data fetching from TMDB API (The Movie Database). The function retrieves three datasets: top-rated movies, upcoming movies, and movie genres.

S3 Storage: Raw CSV files stored in S3 Data Lake with separated locations: tmdb/raw/top_rated/, tmdb/raw/upcoming/, tmdb/raw/genres/

Schema Discovery

AWS Glue Crawlers: Automatically parse metadata from each CSV file. Individual crawler per CSV ensures accurate schema detection and data type inference.

Glue Data Catalog: Creates table definitions with proper schemas, enabling SQL-like queries on S3 data without loading into traditional databases.

Data Transformation

AWS Glue Jobs: Clean and optimize data - standardize formats, validate data types, handle missing values, and prepare data for analysis.

AWS Athena: Create unified queryable table combining all three data sources. SQL queries enable complex analytics on box office trends, genre performance, and language distribution.

Visualization

Power BI Connection: Connect to AWS Athena via ODBC driver with AWS credentials and IAM role-based access control.

Interactive Dashboards: Real-time analysis of historical trends (1902-2024), genre performance comparison, language distribution, and seasonal release patterns.

Data Model Architecture

Unified table structure optimized for analytical queries

Unified Movies Table Structure

Top Rated Movies

id, title, release_date, vote_average, popularity

Upcoming Movies

id, title, release_date, vote_average, popularity

Movie Genres

Discoveries that transformed 100+ years of box office data into strategic business intelligence

Genre Performance

Drama

Volume

Comedy

Volume

Adventure

Popular

Action

Popular

Adventure & Action achieve highest popularity ratings

🎯 High-value genres for production studios

Release Seasonality

Peak

Q4 (Oct/Nov) peak for holiday audiences

🎯 Strategic release window planning

Foreign Language Films

739

French

~370

Italian

~350

Japanese

French dominates foreign language market

🎯 International distribution insights

Code Showcase

Interactive code browser - click on any file tab to explore the implementation

AWS Lambda - Data Ingestion

Python

import os
import json
import csv
import logging
import urllib.request
import urllib.parse
import boto3
from io import StringIO

# Configure logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)

# S3 configuration
S3_BUCKET = "tmdb-data-071125"

# Dedicated prefixes for each dataset
TOP_RATED_KEY = "tmdb/raw/top_rated/top_rated_movies.csv"
UPCOMING_KEY = "tmdb/raw/upcoming/upcoming_movies.csv"
GENRES_KEY = "tmdb/raw/genres/movie_genres.csv"

# TMDB configuration
TMDB_BASE_URL = "https://api.themoviedb.org/3"
TMDB_API_TOKEN = os.getenv("TMDB_API_TOKEN")

# AWS client
s3_client = boto3.client("s3")

def _make_tmdb_request(endpoint: str, params: dict | None = None) -> dict:
    """Perform a GET request to TMDB using urllib with Bearer token auth."""
    if not TMDB_API_TOKEN:
        logger.error("TMDB_API_TOKEN environment variable is not set")
        raise RuntimeError("TMDB_API_TOKEN environment variable is required")
    
    if params is None:
        params = {}
    
    query = urllib.parse.urlencode(params)
    url = f"{TMDB_BASE_URL}{endpoint}"
    if query:
        url = f"{url}?{query}"
    
    logger.info(f"Requesting TMDB URL: {url}")
    
    headers = {
        "Authorization": f"Bearer {TMDB_API_TOKEN}",
        "Accept": "application/json"
    }
    
    request = urllib.request.Request(url, headers=headers)
    
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode())

def lambda_handler(event, context):
    """Main Lambda handler to fetch and store TMDB data"""
    try:
        # Fetch top rated movies
        top_rated = _make_tmdb_request("/movie/top_rated", {"page": 1})
        # Process and upload to S3...
        
        return {
            'statusCode': 200,
            'body': json.dumps('Data ingestion completed successfully')
        }
    except Exception as e:
        logger.error(f"Error: {str(e)}")
        raise

Key Learnings & Future Vision

Key learnings from implementation and strategic vision for evolving capabilities

💡 Key Learnings

Serverless Architecture Design

Learned to design fully serverless, event-driven pipelines that scale automatically and minimize infrastructure costs

AWS Service Integration

Mastered integration patterns between Lambda, S3, Glue, Athena, and Power BI for seamless data flow

Data Lake Best Practices

Understood proper S3 organization, crawler configuration, and schema evolution patterns for data lakes

Cloud Security & IAM

Gained expertise in IAM role-based access control and least privilege permissions for cloud services

🚀 Future Enhancements

Real-time Streaming

Add Kinesis Data Streams for real-time movie release and box office updates

Machine Learning Integration

Build predictive models for box office performance using historical trends and genre patterns

Multi-source Integration

Extend pipeline to include additional data sources like Rotten Tomatoes, IMDb ratings

Advanced Analytics

Implement time-series analysis for long-term industry trends and forecasting

"Serverless data engineering is about building systems that scale automatically, cost-effectively, and reliably—transforming raw data into insights without managing infrastructure."

Explore Repository View More Projects

"تَعَبُ كُلّها الحَياةُ."