Boston Housing Dataset for Machine Learning

Categories: Data Analytics

The Boston Housing Dataset is a classic dataset used for regression analysis in machine learning and statistics. It contains information collected by the U.S. Census Service about housing in the area around Boston, Massachusetts. The dataset is often used to predict the median value of owner-occupied homes (MEDV) using the other features as predictors.

Download the dataset from GitHub: https://github.com/selva86/datasets/blob/master/BostonHousing.csv
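Once the file is saved locally, it can be read with Python's standard csv module. The sketch below is a minimal, dependency-free loader, not the only way to do it. The function name load_housing and the local file name BostonHousing.csv are assumptions made for illustration, and the code assumes the file has a header row and that every cell is numeric. Header names are upper-cased so they match the attribute names used in this article.

```python
import csv


def load_housing(path):
    """Read the CSV at `path` into a list of {COLUMN: float} dicts."""
    rows = []
    with open(path, newline="") as handle:
        for record in csv.DictReader(handle):
            # Upper-case the header names so they read CRIM, ZN, ..., MEDV
            # regardless of how the file spells them.
            rows.append({name.strip().upper(): float(value)
                         for name, value in record.items()})
    return rows


# Hypothetical usage, assuming the file was saved next to this script:
#   data = load_housing("BostonHousing.csv")
#   print(len(data), "rows,", len(data[0]), "columns")
```

A missing-value marker such as NA would make float() raise a ValueError, so a file containing one would need extra handling before this loader could be used.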

Features (Attributes):

CRIM: Per capita crime rate by town.
- Indicates the crime level of the area, which may affect property values.
ZN: Proportion of residential land zoned for lots over 25,000 sq. ft.
- Higher values suggest a wealthier, more spacious residential area.
INDUS: Proportion of non-retail business acres per town.
- Represents the level of industrialization in the area.
CHAS: Charles River dummy variable.
- 1 if the tract bounds the Charles River; 0 otherwise.
- Proximity to the river could influence property desirability and value.
NOX: Nitric oxides concentration (parts per 10 million).
- Indicates air pollution; lower values are preferable for residential areas.
RM: Average number of rooms per dwelling.
- A measure of housing size; larger homes generally indicate higher property values.
AGE: Proportion of owner-occupied units built before 1940.
- Higher values suggest older housing stock.
DIS: Weighted distances to five Boston employment centers.
- Indicates accessibility to jobs; higher values suggest less accessibility.
RAD: Index of accessibility to radial highways.
- Higher values indicate better access to highways.
TAX: Full-value property-tax rate per $10,000.
- Represents the tax burden in the area.
PTRATIO: Pupil-teacher ratio by town.
- Lower values suggest better-quality education.
B: A metric related to the proportion of Black residents by town (Bk): B = 1000(Bk - 0.63)^2.
- Reflects demographic characteristics.
LSTAT: Percentage of the lower-status population.
- Higher values suggest a lower socio-economic status in the area.
MEDV: Median value of owner-occupied homes in $1,000s.
- The target variable (dependent variable) for regression tasks.
- Represents the median price of homes in a particular area.
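Of the attributes above, B is the only one defined by a formula, so a short worked example may help. The sketch below evaluates B = 1000(Bk - 0.63)^2 for a given proportion Bk; the function name b_metric is an assumption for illustration. A proportion of 0 gives a B of about 396.9, which is the largest value the formula can produce for Bk between 0 and 1.

```python
def b_metric(bk):
    """Return B = 1000 * (bk - 0.63)**2 for a proportion bk in [0, 1]."""
    if not 0.0 <= bk <= 1.0:
        raise ValueError("bk must be a proportion between 0 and 1")
    return 1000 * (bk - 0.63) ** 2


# The transform is not one-to-one: proportions equally far from 0.63
# map to the same B, so bk cannot be recovered from B alone.
print(round(b_metric(0.53), 6), round(b_metric(0.73), 6))  # both round to 10.0
```

Because two different proportions can share a single B value, the column cannot be converted back into the underlying proportion.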

Key Uses:

Predictive Modeling: The dataset is commonly used to predict MEDV (housing prices) using machine learning algorithms.
Feature Analysis: Helps in understanding the impact of socio-economic, environmental, and structural factors on housing prices.
Regression Problems: Serves as a benchmark for testing regression models like Linear Regression, Decision Trees, Random Forests, etc.

Limitations:

The dataset is relatively small (506 samples and 13 predictors).
It contains data from the 1970s, so prices and neighborhood characteristics do not reflect current housing market conditions.
Some features raise ethical concerns: the B variable, for example, encodes the racial composition of each town.
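A reader who prefers not to model on the B variable can drop it before training. The sketch below assumes the data is held as a list of dicts keyed by upper-case attribute name; both that layout and the function name drop_columns are assumptions for illustration.

```python
def drop_columns(rows, names=("B",)):
    """Return copies of the rows without the listed columns."""
    excluded = set(names)
    return [{key: value for key, value in row.items() if key not in excluded}
            for row in rows]
```

The original rows are left unchanged, so results with and without the column can be compared side by side.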