Used Car Price Prediction: Machine Learning Regression

This project builds a machine learning regression model to predict the market price of used cars. Given how opaque used car pricing can be, the goal is to give buyers and sellers a data-driven estimate based on objective vehicle attributes. By training on a rich dataset of real listings, the model learns which features matter most and how they interact to drive price.

Dataset overview

The dataset aggregates used car listings and captures a wide range of vehicle attributes. Each row represents a single listing and includes the following features:

Feature	Description
Make / Model	Manufacturer and model name (e.g., Toyota Camry)
Year	Model year of manufacture
Mileage	Odometer reading in miles or kilometers
Condition	Seller-reported condition (Excellent, Good, Fair, Poor)
Fuel Type	Petrol, Diesel, Electric, or Hybrid
Transmission	Automatic or Manual
Engine Size	Displacement in litres
Color	Exterior color
Number of Owners	How many previous owners the car has had
Price	Target variable — listed sale price

The 2023 Car Features List asset (2023-Car-Features-List.jpg) served as a reference for standardizing make/model names and aligning feature categories across listing sources during the cleaning phase.

Methodology

Data collection & cleaning

Raw listing data was scraped and compiled from multiple sources. Cleaning steps included:

Removing duplicate listings and entries with missing prices
Standardizing make/model strings using the 2023 Car Features List reference
Imputing missing mileage values using the median within each make/model/year group
Converting categorical fields (fuel type, transmission, condition) to consistent encodings
Clipping extreme outlier prices beyond three standard deviations from the mean
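
The cleaning steps above can be sketched in pandas roughly as follows. This is a minimal illustration rather than the project's actual code: the column names (make, model, year, mileage, price), the clean_listings helper, and the toy rows are all assumptions, and simple lowercasing stands in for the reference-list standardization.

```python
import numpy as np
import pandas as pd


def clean_listings(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper applying the listed cleaning steps in order."""
    out = df.copy()

    # Normalize make/model strings first so near-identical rows can be
    # detected as duplicates (stand-in for the reference-list mapping).
    for col in ["make", "model"]:
        out[col] = out[col].str.strip().str.lower()

    # Remove duplicate listings and rows without a price.
    out = out.drop_duplicates().dropna(subset=["price"])

    # Fill missing mileage with the median of the same make/model/year group.
    group_median = out.groupby(["make", "model", "year"])["mileage"].transform("median")
    out["mileage"] = out["mileage"].fillna(group_median)

    # Clip prices to the mean plus or minus three standard deviations.
    mean, std = out["price"].mean(), out["price"].std()
    out["price"] = out["price"].clip(lower=mean - 3 * std, upper=mean + 3 * std)
    return out.reset_index(drop=True)


# Toy data (assumed): one duplicate, one missing price, one missing mileage.
raw = pd.DataFrame({
    "make": ["Toyota", " toyota ", "Toyota", "Toyota", "Honda"],
    "model": ["Camry", "camry", "Camry", "Camry", "Civic"],
    "year": [2018, 2018, 2018, 2018, 2019],
    "mileage": [40000.0, 40000.0, np.nan, 60000.0, 30000.0],
    "price": [18000.0, 18000.0, 17500.0, np.nan, 16000.0],
})
cleaned = clean_listings(raw)
print(cleaned)
```

Groups in which every mileage value is missing stay unfilled in this sketch, so a real pipeline would need a second fallback (for example, a make-level or global median).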

Exploratory data analysis

EDA examined the distribution of the target variable (price) and the relationships between individual features and price. Key observations were logged and visualized before any modeling began, both to guide feature selection and to keep later model results from influencing those choices (a common source of data leakage).

Feature engineering

New features were derived to better capture the car’s effective age and value depreciation:

car_age: current year minus model year
mileage_per_year: mileage divided by car age (undefined when car_age is 0, so current-year cars need a fallback)
brand_tier: grouping brands into budget, mid-range, and premium tiers based on average listing prices
One-hot encoding for fuel type, transmission, and condition
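
As an illustration, the derived features above might be computed like this. The add_features helper, the column names, the tercile split for brand_tier, and the minimum age of 1 used to avoid dividing by zero are assumptions made for this sketch, not details confirmed by the project.

```python
import pandas as pd


def add_features(df: pd.DataFrame, current_year: int) -> pd.DataFrame:
    """Hypothetical helper deriving the engineered features described above."""
    out = df.copy()
    out["car_age"] = current_year - out["year"]

    # Assumption: treat current-year cars as one year old so the
    # division below never hits zero.
    out["mileage_per_year"] = out["mileage"] / out["car_age"].clip(lower=1)

    # Assumption: split brands into three equal-sized tiers by their
    # average listing price.
    make_avg = out.groupby("make")["price"].mean()
    tiers = pd.qcut(make_avg, q=3, labels=["budget", "mid_range", "premium"])
    out["brand_tier"] = out["make"].map(tiers.astype(str))

    # One-hot encode the categorical fields.
    return pd.get_dummies(out, columns=["fuel_type", "transmission", "condition"])


# Toy listings (assumed values) to exercise the helper.
listings = pd.DataFrame({
    "make": ["Dacia", "Toyota", "BMW", "Toyota"],
    "year": [2015, 2018, 2020, 2023],
    "mileage": [80000, 50000, 30000, 5000],
    "fuel_type": ["Petrol", "Hybrid", "Diesel", "Hybrid"],
    "transmission": ["Manual", "Automatic", "Automatic", "Automatic"],
    "condition": ["Fair", "Good", "Excellent", "Excellent"],
    "price": [6000, 18000, 35000, 30000],
})
features = add_features(listings, current_year=2023)
print(features[["make", "car_age", "mileage_per_year", "brand_tier"]])
```

Because brand_tier is derived from the target (price), the tier boundaries should be computed on the training split only and then applied to the test split; otherwise test prices leak into a feature.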

Model training

Several regression algorithms were trained and compared:

Linear Regression (baseline)
Ridge and Lasso Regression (regularized baselines)
Random Forest Regressor
Gradient Boosting Regressor (XGBoost)

Hyperparameters were tuned using 5-fold cross-validation with GridSearchCV. The Gradient Boosting Regressor produced the best results and was selected as the final model.
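
A tuning run of this kind might look like the sketch below. It uses scikit-learn's GradientBoostingRegressor as a stand-in for XGBoost, and the synthetic data, the parameter grid, and the MAE scoring choice are assumptions for illustration only; the project's actual grid is not stated.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumed): price falls with age and mileage.
rng = np.random.default_rng(0)
n = 300
car_age = rng.integers(0, 15, size=n)
mileage = car_age * 10_000 + rng.normal(0, 5_000, size=n)
price = 30_000 - 1_500 * car_age - 0.05 * mileage + rng.normal(0, 1_000, size=n)
X = np.column_stack([car_age, mileage])

# Hold out 20% of the data for the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=0
)

# Illustrative grid; the values are placeholders, not the project's settings.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

# 5-fold cross-validation over every grid combination, on training data only.
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring="neg_mean_absolute_error",
)
search.fit(X_train, y_train)

test_r2 = search.best_estimator_.score(X_test, y_test)
print("best params:", search.best_params_)
print("test R^2:", round(test_r2, 3))
```

The grid search only ever sees the training split, so the held-out test set remains an unbiased check on the selected model.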

Evaluation

Models were evaluated on a held-out test set (20% of data) using:

R² score — proportion of variance explained
Mean Absolute Error (MAE) — average absolute difference between predicted and actual prices
Root Mean Squared Error (RMSE) — penalizes large errors more heavily

The final Gradient Boosting model achieved an R² of approximately 0.89 on the test set, with an MAE of roughly 7% of the average listing price.
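
To make the three metrics concrete, here is a small plain-Python sketch that computes them directly from their definitions, along with MAE as a percentage of the mean price. The regression_metrics helper and the four example prices are made up for illustration and are not the project's results.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute R^2, MAE, RMSE, and MAE as a percentage of the mean target."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n

    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)

    # R^2 = 1 - (residual sum of squares / total sum of squares).
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot

    return {
        "r2": r2,
        "mae": mae,
        "rmse": rmse,
        "mae_pct_of_mean": 100 * mae / mean_true,
    }


# Made-up prices for illustration only.
actual = [10_000, 20_000, 30_000, 40_000]
predicted = [11_000, 19_000, 32_000, 38_000]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

In this toy case the RMSE (about 1,581) is larger than the MAE (1,500) because the two 2,000 errors are squared before averaging, which is the "penalizes large errors more heavily" behavior described above.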

Key findings

Top predictors: Mileage and car age are consistently the strongest predictors of used car price, accounting for the bulk of the model’s explained variance. Brand tier is the next most important feature — premium brand vehicles depreciate more slowly. Condition rating has an outsized effect at the extremes: cars rated “Poor” sell for significantly less than the mileage/age baseline would suggest.

Additional findings from the analysis:

Diesel vehicles command a modest price premium over equivalent petrol models in the mid-range mileage band (30,000–80,000 miles).
Electric vehicles show a different depreciation curve — early-year models depreciate sharply, while newer models hold value better due to improved battery technology and range.
Color has a small but statistically significant effect: white, black, and silver cars sell faster and at slightly higher prices than less common colors.
First-owner cars carry a meaningful premium over second- or third-owner equivalents, even controlling for mileage and age.

Visualizations

The following plots were produced during the analysis. You can reproduce them by running the notebook in the project repository.

Price distribution histogram — showed a right-skewed distribution; a log transformation improved model fit significantly.
Correlation heatmap — confirmed the strong negative correlation between price and both mileage and car age.
Feature importance bar chart — ranked all input features by their contribution to the Gradient Boosting model.
Prediction vs. actual scatter plot — points clustered tightly around the diagonal for mid-range prices, with wider spread at the high-price end (luxury vehicles with fewer training examples).
Residuals plot — checked for heteroscedasticity; residuals were approximately normally distributed with no strong pattern.
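
The log transformation mentioned for the price histogram can be sketched with scikit-learn's TransformedTargetRegressor, which fits on log1p(price) and converts predictions back with expm1. The synthetic, exponentially depreciating prices and the use of plain LinearRegression are assumptions chosen to keep the example small; the project applied the idea to its own data and models.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic right-skewed prices (assumed): value decays exponentially with age.
rng = np.random.default_rng(42)
car_age = rng.uniform(0, 15, size=500)
price = np.exp(10.5 - 0.15 * car_age + rng.normal(0, 0.05, size=500))
X = car_age.reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=42
)

# Baseline: fit directly on raw prices.
raw_model = LinearRegression().fit(X_train, y_train)

# Same regressor, but trained on log1p(price); predictions are mapped
# back to the original price scale with expm1.
log_model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
).fit(X_train, y_train)

# Both scores are R^2 on the original price scale.
raw_r2 = raw_model.score(X_test, y_test)
log_r2 = log_model.score(X_test, y_test)
print("raw target R^2:", round(raw_r2, 3))
print("log target R^2:", round(log_r2, 3))
```

A side benefit of this setup is that predictions can never be negative, since expm1 of the fitted log-scale value is always above -1 and is far above zero for realistic prices.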

Technologies used

Python

Core programming language for all data processing, modeling, and visualization work.

Pandas

Data loading, cleaning, transformation, and feature engineering pipelines.

Scikit-learn

Model training, cross-validation, hyperparameter tuning, and evaluation metrics.

Matplotlib

Static plots including histograms, scatter plots, and residual diagnostics.

Seaborn

Correlation heatmaps, distribution plots, and styled statistical visualizations.

Jupyter Notebook

Interactive development environment combining code, output, and narrative.

You can browse the full project source code, notebooks, and data samples on GitHub: Sumit-SC/Sumit-SC.github.io