The NYC Taxi Trip Dataset is a public dataset of detailed trip records for taxi rides in New York City. It covers yellow taxis, green taxis, and For-Hire Vehicles (FHV), and is published by the NYC Taxi & Limousine Commission (TLC).
Key Features of the Dataset
- Trip Details: Includes pickup & dropoff timestamps and locations.
- Fare & Payment: Tracks fare amount, tips, total cost, and payment type.
- Passenger Info: Number of passengers per trip.
- Trip Distance: Distance traveled in miles.
- Vendor & Ride Type: A vendor ID identifies the provider that supplied each record; the ride type (Yellow, Green, or FHV) is determined by which file the record comes from.
Example Columns
Column Name | Description |
---|---|
tpep_pickup_datetime | Trip start time |
tpep_dropoff_datetime | Trip end time |
passenger_count | Number of passengers |
trip_distance | Distance traveled (miles) |
fare_amount | Base fare for the trip |
tip_amount | Tip given by the passenger |
total_amount | Total fare including tolls & extras |
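payment_type | Payment method, stored as a numeric code (credit card, cash, etc.) |
pickup_longitude/latitude | GPS location of pickup (older files only; recent files use PULocationID, a taxi zone ID) |
dropoff_longitude/latitude | GPS location of drop-off (older files only; recent files use DOLocationID, a taxi zone ID) |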
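Because `payment_type` holds numeric codes rather than text, it usually helps to decode it before plotting or grouping. The sketch below shows one way to do that. The code-to-label mapping is an assumption based on the TLC yellow-taxi data dictionary and should be checked against the current version; `PAYMENT_LABELS`, `decode_payment_type`, and the small sample frame are illustrative names and made-up data, not part of the dataset.

```python
import pandas as pd

# Assumed mapping, based on the TLC yellow-taxi data dictionary;
# verify against the current dictionary before relying on it.
PAYMENT_LABELS = {
    1: "Credit card",
    2: "Cash",
    3: "No charge",
    4: "Dispute",
    5: "Unknown",
    6: "Voided trip",
}


def decode_payment_type(codes: pd.Series) -> pd.Series:
    """Map numeric payment codes to labels; unmapped codes become 'Other'."""
    return codes.map(PAYMENT_LABELS).fillna("Other")


# Made-up sample rows, only to illustrate the mapping
sample = pd.DataFrame({"payment_type": [1, 2, 1, 4, 0]})
sample["payment_label"] = decode_payment_type(sample["payment_type"])
print(sample["payment_label"].value_counts())
```

On the real data, the same function could be applied to `df["payment_type"]` after loading the Parquet file.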
Common Use Cases
- Exploratory Data Analysis (EDA): trip duration, fares, busiest hours
- Time Series Analysis: taxi demand trends over time
- Geospatial Analysis: popular pickup/dropoff locations
- Fare Prediction: using machine learning
Begin EDA
Install Required Libraries
Run this first if you haven't installed the required libraries (pyarrow is needed by pandas to read Parquet files):
pip install pandas numpy matplotlib seaborn folium geopandas requests pyarrow
Python Code for NYC Taxi Data EDA
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import folium
# Step 1: Download a sample NYC Taxi dataset from January 2022 (about 50MB)
url = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet"
# Read the dataset directly from the URL (Parquet format)
df = pd.read_parquet(url)
# Step 2: Display basic info
print("Dataset Info:")
df.info()  # info() prints its report directly and returns None
# Step 3: Show the first few rows
print("\nSample Data:")
print(df.head())
# Step 4: Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())
# Step 5: Descriptive statistics
print("\nStatistical Summary:")
print(df.describe())
# Step 6: Trip Duration Analysis
df["trip_duration"] = (df["tpep_dropoff_datetime"] - df["tpep_pickup_datetime"]).dt.total_seconds() / 60 # Convert to minutes
plt.figure(figsize=(8, 5))
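# Keep only 0-100 minute trips so the 50 bins cover the plotted range rather than the outliers
sns.histplot(df.loc[df["trip_duration"].between(0, 100), "trip_duration"], bins=50, kde=True)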
plt.xlim(0, 100)
plt.title("Trip Duration Distribution (Minutes)")
plt.xlabel("Trip Duration (minutes)")
plt.ylabel("Frequency")
plt.show()
# Step 7: Fare Amount Distribution
plt.figure(figsize=(8, 5))
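# Keep only $0-$100 fares so the 50 bins cover the plotted range rather than the outliers
sns.histplot(df.loc[df["fare_amount"].between(0, 100), "fare_amount"], bins=50, kde=True)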
plt.xlim(0, 100)
plt.title("Fare Amount Distribution")
plt.xlabel("Fare ($)")
plt.ylabel("Frequency")
plt.show()
# Step 8: Number of Passengers per Ride
plt.figure(figsize=(8, 5))
sns.countplot(x=df["passenger_count"])
plt.title("Passenger Count Distribution")
plt.xlabel("Number of Passengers")
plt.ylabel("Ride Count")
plt.show()
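# Step 9: Map of Pickup Locations (using Folium)
# Recent TLC files, including the January 2022 file loaded above, store taxi zone IDs
# (PULocationID/DOLocationID) instead of GPS coordinates, so this step only runs
# on files that still contain pickup_latitude/pickup_longitude.
coord_cols = ["pickup_latitude", "pickup_longitude"]
if set(coord_cols).issubset(df.columns):
    pickup_map = folium.Map(location=[40.75, -74.00], zoom_start=11)  # NYC coordinates
    # Plot a sample of up to 5000 pickup locations with valid coordinates
    valid = df.dropna(subset=coord_cols)
    sample_df = valid.sample(min(5000, len(valid)))
    for index, row in sample_df.iterrows():
        folium.CircleMarker(
            location=[row["pickup_latitude"], row["pickup_longitude"]],
            radius=1,
            color="blue",
            fill=True,
            fill_opacity=0.3,
        ).add_to(pickup_map)
    # Save the map to an HTML file that can be opened in a browser
    pickup_map.save("nyc_taxi_pickups.html")
    print("NYC Pickup Location Map Saved as 'nyc_taxi_pickups.html'")
else:
    print("No GPS coordinate columns in this file; skipping the pickup map.")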
Explanation of the Code
- Loads NYC Taxi Data (Yellow Taxi Trips from January 2022)
- Checks Data Overview (missing values, statistics)
- Trip Duration Analysis (Histogram for trip time)
- Fare Amount Analysis (Distribution of taxi fares)
- Passenger Count Analysis (How many people per ride)
- Geospatial Visualization (Map of pickup locations using Folium)
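The January 2022 file identifies pickups by taxi zone (`PULocationID`) rather than by GPS coordinates, so one way to approximate a "popular pickup locations" view is to count trips per zone. The following is a minimal sketch under that assumption; `top_pickup_zones` is a hypothetical helper, and the zone IDs in the toy frame are made up for illustration.

```python
import pandas as pd


def top_pickup_zones(trips: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Return the n most frequent pickup zones with their trip counts."""
    return (
        trips["PULocationID"]
        .value_counts()              # trips per zone, most frequent first
        .head(n)
        .rename_axis("PULocationID")
        .reset_index(name="trip_count")
    )


# Made-up zone IDs standing in for the real column
toy = pd.DataFrame({"PULocationID": [132, 161, 132, 237, 161, 132]})
top = top_pickup_zones(toy, n=2)
print(top)
```

On the real data this would be called as `top_pickup_zones(df, n=10)`. The TLC also publishes a taxi zone lookup table, which could be joined on the zone ID to get borough and zone names; that join is not shown here.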
What’s Next?
- If you’re working with large datasets, you can filter data by a specific date range:
df = df[(df['tpep_pickup_datetime'] >= '2022-01-01') & (df['tpep_pickup_datetime'] < '2022-02-01')]
- Try correlation analysis between trip duration, fare, and distance
- Use GeoPandas for advanced geospatial analysis
- Perform time-series analysis on taxi rides per day
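As a starting point for the correlation and time-series ideas above, here is a small sketch that computes a Pearson correlation matrix for duration, distance, and fare, and counts rides per day with `resample`. It runs on a made-up five-row frame that only mimics the yellow-taxi column names; `add_duration_minutes` and `daily_ride_counts` are illustrative helpers, not part of any library.

```python
import pandas as pd


def add_duration_minutes(trips: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with trip_duration (in minutes) added."""
    out = trips.copy()
    out["trip_duration"] = (
        out["tpep_dropoff_datetime"] - out["tpep_pickup_datetime"]
    ).dt.total_seconds() / 60
    return out


def daily_ride_counts(trips: pd.DataFrame) -> pd.Series:
    """Count rides per calendar day, based on pickup time."""
    return trips.set_index("tpep_pickup_datetime").resample("D").size()


# Made-up trips, only to illustrate the calls
toy = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2022-01-01 08:00", "2022-01-01 09:30", "2022-01-02 10:00",
        "2022-01-02 18:15", "2022-01-02 23:00",
    ]),
    "tpep_dropoff_datetime": pd.to_datetime([
        "2022-01-01 08:10", "2022-01-01 10:00", "2022-01-02 10:20",
        "2022-01-02 18:20", "2022-01-02 23:45",
    ]),
    "trip_distance": [2.0, 6.5, 4.0, 0.8, 11.0],
    "fare_amount": [9.0, 24.0, 15.5, 5.5, 38.0],
})

toy = add_duration_minutes(toy)
corr = toy[["trip_duration", "trip_distance", "fare_amount"]].corr()
daily = daily_ride_counts(toy)

print(corr.round(2))
print(daily)
```

On the full dataset, it is probably worth removing implausible rows first (for example negative durations or zero distances), since a few extreme outliers can dominate a Pearson correlation.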