IMDB Movie Analysis: Ratings, Genres & Revenue Trends

This project is an exploratory data analysis (EDA) of IMDB movie data that looks for the patterns separating well-received, commercially successful films from the rest. Rather than building a predictive model, it focuses on asking the right questions of the data and letting visualizations tell the story: which genres dominate ratings, whether release timing affects performance, and which directors consistently produce highly rated work.

Dataset overview

The dataset contains metadata for thousands of movies sourced from IMDB, covering several decades of releases. Each record includes:

Field	Description
Title	Movie title
Genre	One or more genre tags (e.g., Drama, Action, Comedy)
Director	Name of the primary director
Lead Actor/Actress	Primary cast member
Runtime	Film length in minutes
Release Year	Year the film was released
Release Month	Month of theatrical release
IMDB Rating	Average user rating on a 10-point scale
Number of Votes	Total votes contributing to the rating
Gross Revenue	Worldwide box office gross in USD
Language	Primary spoken language
Country	Country of production

The dataset required careful handling: gross revenue figures were missing for a significant portion of films (particularly older or non-English titles), and multi-genre entries needed to be expanded for genre-level analysis.

Methodology

Data loading & cleaning

The raw CSV was loaded with Pandas and inspected for missing values, duplicate entries, and formatting inconsistencies. Key cleaning steps:

Dropped records with missing IMDB ratings or vote counts below a minimum threshold (to filter out obscure entries with unreliable ratings)
Parsed multi-genre strings into individual genre tags using string splitting and explode()
Converted gross revenue to a consistent numeric format, replacing missing values with NaN rather than zero to avoid skewing revenue statistics
Standardized director name formatting to handle punctuation and encoding issues
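
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: the column names are assumed to match the field names in the dataset table, the rows are invented, revenue is assumed to arrive as currency-formatted strings, and MIN_VOTES is a hypothetical threshold, since the project does not state the value it used.

```python
import pandas as pd

# Toy rows standing in for the raw CSV; all values are invented for illustration.
raw = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Genre": ["Drama, Biography", "Action", "Comedy, Drama", "Horror"],
    "IMDB Rating": [7.8, None, 6.4, 5.1],
    "Number of Votes": [120_000, 50_000, 800, 15_000],
    "Gross Revenue": ["$1,200,000", "$300,000,000", None, "n/a"],
})

MIN_VOTES = 1_000  # hypothetical cut-off; the real threshold is not documented


def clean(df: pd.DataFrame, min_votes: int = MIN_VOTES) -> pd.DataFrame:
    """Drop unrated or rarely voted films and make revenue numeric."""
    out = df.dropna(subset=["IMDB Rating", "Number of Votes"])
    out = out[out["Number of Votes"] >= min_votes].copy()
    # Assumes revenue is stored as strings such as "$1,200,000".
    # Strip "$" and ","; anything unparseable becomes NaN rather than 0.
    revenue = out["Gross Revenue"].str.replace(r"[$,]", "", regex=True)
    out["Gross Revenue"] = pd.to_numeric(revenue, errors="coerce")
    return out.reset_index(drop=True)


def explode_genres(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per (film, genre) pair for genre-level analysis."""
    out = df.assign(Genre=df["Genre"].str.split(","))
    out = out.explode("Genre")
    out["Genre"] = out["Genre"].str.strip()
    return out.reset_index(drop=True)


cleaned = clean(raw)
by_genre = explode_genres(cleaned)
```

Keeping the exploded frame separate from the cleaned one is a deliberate choice in this sketch: film-level statistics (for example, mean revenue) would double-count multi-genre films if they were computed on the exploded rows.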

Univariate analysis

Each variable was analyzed in isolation before examining relationships. This step included:

Distribution plots for ratings, runtime, and gross revenue
Frequency counts for genre, language, and country
Year-on-year counts to understand how the dataset’s coverage changes over time
Identifying the central tendency and spread of key numeric variables
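
As a rough sketch of this step, central tendency, spread, and frequency counts can be read off with a few Pandas calls. The data below is invented and the column names are assumed to follow the dataset table.

```python
import pandas as pd

# Invented sample; column names are assumed to match the dataset fields.
movies = pd.DataFrame({
    "Release Year": [1994, 1994, 2008, 2010, 2010, 2010],
    "Runtime": [142, 154, 152, 148, 100, 95],
    "IMDB Rating": [7.9, 8.1, 8.4, 8.2, 6.1, 5.4],
    "Language": ["English", "English", "English", "English", "French", "Spanish"],
})

# Central tendency and spread of the key numeric variables.
summary = movies[["Runtime", "IMDB Rating"]].agg(["mean", "median", "std"])

# Year-on-year counts show how the dataset's coverage changes over time.
films_per_year = movies["Release Year"].value_counts().sort_index()

# Frequency counts for a categorical field.
language_counts = movies["Language"].value_counts()
```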

Bivariate analysis

Pairs of variables were examined to identify meaningful correlations and patterns:

Rating vs. gross revenue (scatter plot with a log scale on revenue)
Runtime vs. rating (binned analysis to check whether longer films rate differently)
Release month vs. average rating and revenue (checking seasonal release windows)
Vote count vs. rating (popularity does not always equal quality — this tension was explored explicitly)
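
Two of these comparisons can be sketched as below: a binned runtime-versus-rating table and a correlation between rating and log-scaled revenue. The bin edges are illustrative assumptions, the data is invented, and Pearson correlation on log10 revenue is one reasonable choice rather than necessarily what the notebook computes.

```python
import numpy as np
import pandas as pd

# Invented sample; one film has no revenue figure.
movies = pd.DataFrame({
    "Runtime": [85, 95, 110, 125, 150, 170],
    "IMDB Rating": [5.5, 6.0, 6.8, 7.2, 7.9, 8.1],
    "Gross Revenue": [2e6, 5e7, np.nan, 3e8, 1e8, 4e7],
})

# Runtime bins (minutes); the edges are illustrative, not the project's.
bins = [0, 90, 120, 150, np.inf]
labels = ["<=90", "91-120", "121-150", ">150"]
movies["Runtime Bin"] = pd.cut(movies["Runtime"], bins=bins, labels=labels)

rating_by_runtime = (
    movies.groupby("Runtime Bin", observed=True)["IMDB Rating"]
    .agg(["count", "mean"])
)

# Rating vs. revenue: drop films without revenue, then correlate on a log scale
# so a few very large grosses do not dominate the estimate.
with_revenue = movies.dropna(subset=["Gross Revenue"])
log_revenue = np.log10(with_revenue["Gross Revenue"])
rating_revenue_corr = with_revenue["IMDB Rating"].corr(log_revenue)
```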

Genre & director analysis

Genre-level aggregations revealed which categories consistently earn high ratings and strong box office returns. Director-level analysis identified the filmmakers with the highest median ratings across a minimum number of films, filtering out directors with only one or two credits to keep comparisons fair.
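
The director filter can be sketched as a grouped aggregation with a minimum-credit cut-off. The director names and ratings below are placeholders, top_directors is a hypothetical helper name, and the default of five films follows the minimum quoted for the director chart.

```python
import pandas as pd

# Placeholder directors and invented ratings, for illustration only.
movies = pd.DataFrame({
    "Director": ["Dir A"] * 5 + ["Dir B"] * 2 + ["Dir C"] * 5,
    "IMDB Rating": [8.4, 8.6, 8.2, 8.8, 8.5,
                    9.0, 9.1,
                    6.5, 7.0, 7.2, 6.8, 7.4],
})


def top_directors(df: pd.DataFrame, min_films: int = 5) -> pd.DataFrame:
    """Rank directors by median rating, keeping those with enough films."""
    stats = df.groupby("Director")["IMDB Rating"].agg(
        films="count", median_rating="median"
    )
    eligible = stats[stats["films"] >= min_films]
    return eligible.sort_values("median_rating", ascending=False)


ranked = top_directors(movies)
```

In this toy data, Dir B has the highest ratings but only two credits, so it is excluded. The median is used rather than the mean so that a single outlier film does not move a director's score much.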

Visualization

All findings were consolidated into a set of clear, labeled visualizations. Each chart was designed to answer a specific question and annotated with the key takeaway directly on the figure, making the notebook readable as a standalone analytical report.

Key findings

Genre impact on ratings: Documentary and Biography films earn consistently higher average IMDB ratings than Action or Horror films, but they attract far fewer votes and significantly lower gross revenue. Drama sits in the middle — reliably rated, reliably watched. The highest-grossing genre, Action/Adventure, averages a full point lower in rating than Documentary.

Seasonal release patterns: Films released in the November–December window average notably higher ratings and higher per-film gross revenue, aligning with the awards season strategy studios use for prestige releases. The summer window (June–August) dominates in sheer volume and total box office gross but shows lower average ratings, reflecting the blockbuster vs. prestige split.
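
The difference between per-film averages and window totals is easy to blur, so here is a small sketch of how both could be computed side by side. The numbers are invented purely to show the mechanics and are not evidence for the finding. Release Month is assumed to be stored as an integer from 1 to 12, and release_window is a hypothetical helper.

```python
import pandas as pd

# Invented films; Release Month assumed to be an integer 1-12.
movies = pd.DataFrame({
    "Release Month": [6, 6, 7, 7, 8, 11, 12],
    "IMDB Rating": [6.0, 6.4, 5.8, 6.6, 6.2, 7.6, 7.8],
    "Gross Revenue": [3e8, 1e8, 2e8, 5e7, 1.5e8, 4e8, 3.5e8],
})


def release_window(month: int) -> str:
    """Map a month number to the release windows discussed above."""
    if month in (6, 7, 8):
        return "Summer (Jun-Aug)"
    if month in (11, 12):
        return "Nov-Dec"
    return "Other"


movies["Window"] = movies["Release Month"].map(release_window)

by_window = movies.groupby("Window").agg(
    films=("IMDB Rating", "size"),          # volume of releases
    mean_rating=("IMDB Rating", "mean"),    # per-film average rating
    mean_gross=("Gross Revenue", "mean"),   # per-film average gross
    total_gross=("Gross Revenue", "sum"),   # total box office for the window
)
```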

Further findings from the analysis:

Vote count as a proxy for reach: Films with more than 100,000 votes have a much tighter rating distribution (7.0–8.5), while films under 10,000 votes span nearly the full 1–10 range. High vote counts signal mainstream appeal and go with a narrower, generally higher band of ratings, without the extremes seen among low-vote titles.
Director consistency: A small group of directors (Christopher Nolan, Denis Villeneuve, and a handful of others) maintain median ratings above 8.0 across five or more films — a rare achievement that distinguishes them from directors with a single breakout film.
Revenue vs. rating disconnect: The correlation between IMDB rating and gross revenue is positive but moderate. Many highly rated films underperform commercially, while several blockbusters with average ratings gross over $1 billion. This suggests that factors not captured in this dataset, such as marketing budget and franchise recognition, matter more for revenue than ratings do.
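
The vote-count comparison in the findings above can be reproduced along these lines: bucket films by vote count and compare the spread of ratings in each bucket. The data is invented, and the band edges mirror the 10,000 and 100,000 cut-offs mentioned in the finding. Note that pd.cut uses right-inclusive intervals by default, so a film with exactly 10,000 votes lands in the lowest band here.

```python
import pandas as pd

# Invented sample covering low-, mid-, and high-vote films.
movies = pd.DataFrame({
    "Number of Votes": [2_000, 5_000, 8_000, 50_000, 150_000, 400_000, 900_000],
    "IMDB Rating": [2.1, 9.4, 5.0, 6.6, 7.2, 8.4, 7.8],
})

vote_band = pd.cut(
    movies["Number of Votes"],
    bins=[0, 10_000, 100_000, float("inf")],
    labels=["up to 10k", "10k-100k", "over 100k"],
)

# Count, range, and standard deviation of ratings per vote band.
spread = movies.groupby(vote_band, observed=True)["IMDB Rating"].agg(
    ["count", "min", "max", "std"]
)
spread["range"] = spread["max"] - spread["min"]
```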

Visualizations

Running the project notebook produces the following charts:

Rating distribution histogram: roughly bell-shaped, centered around 6.5–7.0, with a slight left skew (a longer tail toward low ratings)
Genre popularity bar chart: ranked by number of films and separately by median rating
Top 20 directors by median rating (minimum 5 films): horizontal bar chart with vote-count annotations
Revenue vs. rating scatter plot: log-scaled revenue axis, colored by genre, highlights the blockbuster/prestige divide
Monthly release heatmap: average rating and revenue by release month across years
Runtime distribution by genre: box plots showing that Drama and Biography skew longer while Horror and Comedy skew shorter
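
As an illustration of how one of these charts might be built, here is a minimal Matplotlib sketch of the revenue vs. rating scatter plot with a log-scaled revenue axis and one color per genre. The data, labels, and annotation text are placeholders, and the notebook's actual styling may differ.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Invented films, for illustration only.
movies = pd.DataFrame({
    "Genre": ["Action", "Action", "Drama", "Drama", "Documentary"],
    "IMDB Rating": [6.2, 6.9, 7.6, 8.1, 8.3],
    "Gross Revenue": [9e8, 4e8, 6e7, 1.2e8, 3e6],
})

fig, ax = plt.subplots(figsize=(7, 4))

# One scatter call per genre gives each genre its own color and legend entry.
for genre, group in movies.groupby("Genre"):
    ax.scatter(group["IMDB Rating"], group["Gross Revenue"], label=genre, alpha=0.7)

ax.set_yscale("log")  # revenue spans several orders of magnitude
ax.set_xlabel("IMDB rating")
ax.set_ylabel("Gross revenue (USD, log scale)")
ax.set_title("Revenue vs. rating by genre")
ax.legend(title="Genre")

# Put the key takeaway directly on the figure (placeholder wording).
ax.annotate(
    "Takeaway text goes here",
    xy=(0.02, 0.95),
    xycoords="axes fraction",
    va="top",
)
```

Calling fig.savefig with a file name afterwards would export the annotated figure.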

Technologies used

Python

Core language for all data manipulation and visualization tasks.

Pandas

Data loading, cleaning, aggregation, and multi-genre expansion with explode().

Matplotlib

Histograms, scatter plots, bar charts, and annotated figure exports.

Seaborn

Box plots, heatmaps, and statistical visualizations with consistent styling.

Jupyter Notebook

Self-contained analytical narrative combining code, outputs, and commentary.