
The NYC Taxi Trip Dataset is a public dataset of detailed trip records for taxi rides in New York City. It covers Yellow taxis, Green taxis, and For-Hire Vehicles (FHV) and is published by the NYC Taxi & Limousine Commission (TLC).

📌 Key Features of the Dataset

  • Trip Details: Includes pickup & dropoff timestamps and locations.
  • Fare & Payment: Tracks fare amount, tips, total cost, and payment type.
  • Passenger Info: Number of passengers per trip.
  • Trip Distance: Distance traveled in miles.
  • Vendor & Ride Type: Identifies the service type (Yellow, Green, or FHV); Yellow and Green records also include a VendorID for the provider that supplied the record.

📊 Example Columns

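| Column Name | Description |
|---|---|
| tpep_pickup_datetime | Trip start time |
| tpep_dropoff_datetime | Trip end time |
| passenger_count | Number of passengers |
| trip_distance | Distance traveled (miles) |
| fare_amount | Base fare for the trip |
| tip_amount | Tip given by the passenger |
| total_amount | Total fare including tolls & extras |
| payment_type | Payment method (Cash, Card, etc.) |
| pickup_longitude / pickup_latitude | GPS location of pickup (older files only; newer files, including 2022, provide PULocationID instead) |
| dropoff_longitude / dropoff_latitude | GPS location of drop-off (older files only; newer files, including 2022, provide DOLocationID instead) |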
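
The available columns differ between files: recent TLC files identify pickup and drop-off locations by zone ID (PULocationID, DOLocationID), while older files carry GPS coordinates. Here is a minimal sketch for checking which style a file uses before running any location-based step. The helper name `location_style` and its return labels are made up for illustration.

```python
import pandas as pd


def location_style(columns) -> str:
    """Report how a trip file encodes locations, based on its column names.

    Hypothetical helper: the labels "coordinates", "zone_ids", and "unknown"
    are illustrative and not part of any TLC or pandas API.
    """
    cols = set(columns)
    if {"pickup_latitude", "pickup_longitude"} <= cols:
        return "coordinates"
    if {"PULocationID", "DOLocationID"} <= cols:
        return "zone_ids"
    return "unknown"


# Two empty frames that only mimic the column layouts (no real data).
old_style = pd.DataFrame(columns=["pickup_latitude", "pickup_longitude", "fare_amount"])
new_style = pd.DataFrame(columns=["PULocationID", "DOLocationID", "fare_amount"])

print(location_style(old_style.columns))
print(location_style(new_style.columns))
```

With a real file you would call `location_style(df.columns)` right after loading it.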

📌 Common Use Cases

✔ Exploratory Data Analysis (EDA) (Trip duration, fares, busiest hours)
✔ Time Series Analysis (Taxi demand trends over time)
✔ Geospatial Analysis (Popular pickup/dropoff locations)
✔ Fare Prediction (Using Machine Learning) 🚖📊
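
As a small illustration of the "busiest hours" use case, the sketch below counts trips per pickup hour. It assumes the Yellow Taxi column name `tpep_pickup_datetime`. The function name `trips_per_hour` and the tiny demo frame are hypothetical stand-ins for the real data.

```python
import pandas as pd


def trips_per_hour(trips: pd.DataFrame, time_col: str = "tpep_pickup_datetime") -> pd.Series:
    """Count trips for each hour of the day (0-23), keeping hours with no trips as 0."""
    hours = pd.to_datetime(trips[time_col]).dt.hour
    return hours.value_counts().reindex(range(24), fill_value=0)


# Tiny synthetic frame (made-up timestamps) standing in for the real dataset.
demo = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2022-01-01 08:15",
        "2022-01-01 08:45",
        "2022-01-01 17:05",
        "2022-01-02 08:30",
    ])
})

hourly = trips_per_hour(demo)
print(hourly[hourly > 0])  # only the hours that actually have trips
```

On the full dataset, `trips_per_hour(df).idxmax()` would give the single busiest pickup hour.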

Begin EDA

🔹 Install Required Libraries

Run this first if you haven’t installed the required libraries:

pip install pandas pyarrow numpy matplotlib seaborn folium geopandas requests

🔹 Python Code for NYC Taxi Data EDA

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import folium

# Step 1: Download a sample NYC Taxi dataset from January 2022 (about 50MB)
url = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet"

# Read the dataset directly from the URL (Parquet format)
df = pd.read_parquet(url)

# Step 2: Display basic info
print("Dataset Info:")
df.info()  # info() prints its report directly and returns None

# Step 3: Show the first few rows
print("\nSample Data:")
print(df.head())

# Step 4: Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())

# Step 5: Descriptive statistics
print("\nStatistical Summary:")
print(df.describe())

# Step 6: Trip Duration Analysis
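df["trip_duration"] = (df["tpep_dropoff_datetime"] - df["tpep_pickup_datetime"]).dt.total_seconds() / 60  # Convert to minutes
plt.figure(figsize=(8, 5))
# Keep only 0-100 minute trips before binning so outliers do not stretch the bins
sns.histplot(df.loc[df["trip_duration"].between(0, 100), "trip_duration"], bins=50, kde=True)
plt.xlim(0, 100)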
plt.title("Trip Duration Distribution (Minutes)")
plt.xlabel("Trip Duration (minutes)")
plt.ylabel("Frequency")
plt.show()

# Step 7: Fare Amount Distribution
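plt.figure(figsize=(8, 5))
# Keep only fares from $0 to $100 before binning so outliers do not stretch the bins
sns.histplot(df.loc[df["fare_amount"].between(0, 100), "fare_amount"], bins=50, kde=True)
plt.xlim(0, 100)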
plt.title("Fare Amount Distribution")
plt.xlabel("Fare ($)")
plt.ylabel("Frequency")
plt.show()

# Step 8: Number of Passengers per Ride
plt.figure(figsize=(8, 5))
sns.countplot(x=df["passenger_count"])
plt.title("Passenger Count Distribution")
plt.xlabel("Number of Passengers")
plt.ylabel("Ride Count")
plt.show()

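# Step 9: Map of Pickup Locations (using Folium)
# Note: recent TLC files, including the January 2022 file loaded above, provide
# PULocationID/DOLocationID zone IDs instead of GPS coordinates, so this step
# only runs on older files that still carry pickup_latitude/pickup_longitude.
coord_cols = ["pickup_latitude", "pickup_longitude"]
if set(coord_cols).issubset(df.columns):
    pickup_map = folium.Map(location=[40.75, -74.00], zoom_start=11)  # NYC coordinates

    # Plot a sample of up to 5000 pickup locations
    points = df[coord_cols].dropna()
    sample_df = points.sample(n=min(5000, len(points)), random_state=42)
    for _, row in sample_df.iterrows():
        folium.CircleMarker(
            location=[row["pickup_latitude"], row["pickup_longitude"]],
            radius=1,
            color="blue",
            fill=True,
            fill_opacity=0.3,
        ).add_to(pickup_map)

    # Save the map; open the HTML file in a browser to view it
    pickup_map.save("nyc_taxi_pickups.html")
    print("NYC pickup location map saved as 'nyc_taxi_pickups.html'")
else:
    print("No GPS coordinate columns in this file; skipping the pickup map.")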

🔹 Explanation of the Code

  • ✅ Loads NYC Taxi Data (Yellow Taxi Trips from January 2022)
  • ✅ Checks Data Overview (missing values, statistics)
  • ✅ Trip Duration Analysis (Histogram for trip time)
  • ✅ Fare Amount Analysis (Distribution of taxi fares)
  • ✅ Passenger Count Analysis (How many people per ride)
  • ✅ Geospatial Visualization (Map of pickup locations using Folium)
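
Raw trip records often contain rows that distort the histograms, such as negative durations or zero-distance trips. The sketch below shows one possible cleaning step to run before plotting. The function name `clean_trips` and the 180-minute cap are assumptions chosen for illustration, not rules from the TLC.

```python
import pandas as pd


def clean_trips(trips: pd.DataFrame, max_minutes: float = 180.0) -> pd.DataFrame:
    """Return a copy with a trip_duration column (minutes) and implausible rows removed.

    Hypothetical helper: the thresholds are illustrative assumptions.
    """
    out = trips.copy()
    out["trip_duration"] = (
        out["tpep_dropoff_datetime"] - out["tpep_pickup_datetime"]
    ).dt.total_seconds() / 60
    keep = (
        out["trip_duration"].between(0, max_minutes, inclusive="right")  # 0 < duration <= cap
        & (out["trip_distance"] > 0)
        & (out["fare_amount"] > 0)
    )
    return out.loc[keep].reset_index(drop=True)


# Synthetic rows (made-up values): only the first one passes every check.
demo = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(["2022-01-01 10:00"] * 5),
    "tpep_dropoff_datetime": pd.to_datetime([
        "2022-01-01 10:12",  # valid 12-minute trip
        "2022-01-01 09:50",  # drop-off before pickup
        "2022-01-01 10:20",  # zero distance
        "2022-01-01 10:15",  # negative fare
        "2022-01-02 10:00",  # longer than the cap
    ]),
    "trip_distance": [2.1, 1.0, 0.0, 3.0, 5.0],
    "fare_amount": [9.5, 6.0, 8.0, -4.0, 20.0],
})

cleaned = clean_trips(demo)
print(f"kept {len(cleaned)} of {len(demo)} rows")
```

Calling `clean_trips(df)` on the loaded data and plotting the result is one way to keep outliers from dominating the charts. Adjust the cap to suit your analysis.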

📌 What’s Next?

  • If you’re working with large datasets, you can filter data by a specific date range:
      df = df[(df['tpep_pickup_datetime'] >= '2022-01-01') & (df['tpep_pickup_datetime'] < '2022-02-01')]
  • Try correlation analysis between trip duration, fare, and distance
  • Use GeoPandas for advanced geospatial analysis
  • Perform time-series analysis on taxi rides per day
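
Two of the follow-ups above, correlation analysis and rides per day, can be sketched in a few lines of pandas. This is a rough starting point rather than a full analysis: it assumes a `trip_duration` column in minutes has already been derived (as in Step 6), and the function names `daily_trip_counts` and `numeric_correlations` are hypothetical.

```python
import pandas as pd


def daily_trip_counts(trips: pd.DataFrame, time_col: str = "tpep_pickup_datetime") -> pd.Series:
    """Count trips per calendar day; days with no trips appear as 0."""
    return trips.set_index(time_col).resample("D").size()


def numeric_correlations(trips: pd.DataFrame) -> pd.DataFrame:
    """Pearson correlation matrix for duration, distance, and fare."""
    cols = ["trip_duration", "trip_distance", "fare_amount"]
    return trips[cols].corr()


# Synthetic demo frame (made-up values) standing in for the real data.
demo = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2022-01-01 08:00",
        "2022-01-01 09:30",
        "2022-01-03 10:00",
    ]),
    "trip_duration": [10.0, 20.0, 30.0],
    "trip_distance": [1.0, 2.0, 3.0],
    "fare_amount": [6.0, 11.0, 16.0],
})

daily = daily_trip_counts(demo)
corr = numeric_correlations(demo)
print(daily)
print(corr.round(2))
```

On the real data, `daily.plot()` gives a quick demand-over-time chart, and `sns.heatmap(corr, annot=True)` is a common way to visualize the correlation matrix.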