Skip to content

This project showcases an end-to-end ETL pipeline for NYC Taxi data using Azure services and Medallion Architecture. It automates data ingestion via APIs, processes and transforms data using PySpark in Databricks, and stores it in Delta tables for analytics. Finally, Power BI is used to create dashboards for trip insights.

Notifications You must be signed in to change notification settings

saurabhchavan7/Azure-NYC-Taxi-ETL-and-Analytics-Pipeline

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Azure-NYC-Taxi-ETL-and-Analytics-Pipeline

Project Description

The Azure-NYC-Taxi-ETL-and-Analytics-Pipeline is an end-to-end data engineering project designed to process NYC Taxi trip data using Azure services. This project automates the ingestion, transformation, and analysis of taxi trip data while implementing modern big data practices such as Medallion Architecture, Delta Lake, and dynamic pipelines. It integrates with Power BI to provide interactive analytics dashboards.

Key Differentiator: This project eliminates manual workflows by dynamically fetching data from APIs, automating pipelines with Azure Data Factory, and ensuring scalability with Delta Lake.

Architecture

Architecture Diagram

High-Level Architecture

  1. Data Sources: NYC Taxi trip data (Parquet format) and lookup files (manually uploaded CSV files).
  2. Data Ingestion:
    • Built parameterized pipelines using Azure Data Factory to fetch data dynamically via API.
    • Stored raw data in the Bronze Layer of Azure Data Lake Gen2.
  3. Processing and Transformation:
    • Used Azure Databricks and PySpark to clean and transform data.
    • Processed data through the Silver layer (structured and cleaned) and Gold layer (modeled for analytics).
  4. Data Modeling and Storage:
    • Created Delta tables in the Gold layer for transactional support.
    • Stored data in Delta Lake for versioning and ACID compliance.
  5. Visualization:
    • Connected Delta tables with Power BI to build interactive dashboards.

Data Source

The project utilizes the NYC Taxi Trip Data, provided by the NYC Taxi & Limousine Commission. The data is stored in Parquet format and includes details about trip times, fares, pickup/drop-off locations, and more.


Project Objectives

  • Automate Data Ingestion: Dynamically fetch data from the NYC Taxi API into Azure Data Lake Gen2.
  • Implement Medallion Architecture: Process data through Bronze (raw), Silver (transformed), and Gold (modeled) layers.
  • Optimize Data Storage: Use Parquet and Delta Lake formats for scalability and transactional support.
  • Enable Analytics: Create Power BI dashboards to analyze taxi trip patterns.
  • Ensure Security: Utilize Azure IAM roles, Service Principal, and Key Vault for secure access.

Technologies Used

  • Cloud Services: Azure Data Factory, Azure Data Lake Gen2, Azure Databricks
  • Programming Languages: Python (PySpark), SQL
  • File Formats: Parquet, Delta Lake
  • Visualization: Power BI
  • Security: Azure IAM, Service Principal, Azure Key Vault

Key Features

  • Automated Pipelines:
    • Dynamically fetch and process data using Azure Data Factory.
  • Medallion Architecture:
    • Process data through Bronze (raw), Silver (transformed), and Gold (modeled) layers.
  • Delta Lake:
    • Store data with ACID compliance and enable efficient querying.
  • Power BI Integration:
    • Build interactive dashboards for actionable insights.

About

This project showcases an end-to-end ETL pipeline for NYC Taxi data using Azure services and Medallion Architecture. It automates data ingestion via APIs, processes and transforms data using PySpark in Databricks, and stores it in Delta tables for analytics. Finally, Power BI is used to create dashboards for trip insights.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published