This project performs a comprehensive exploratory data analysis (EDA) of IMDB movie data to uncover the patterns that separate well-received, commercially successful films from the rest. Rather than building a predictive model, the emphasis here is on asking the right questions of the data and letting visualizations tell the story: which genres dominate ratings, whether release timing affects performance, and which directors consistently produce highly rated work.
Dataset overview
The dataset contains metadata for thousands of movies sourced from IMDB, covering several decades of releases. Each record includes:

| Field | Description |
|---|---|
| Title | Movie title |
| Genre | One or more genre tags (e.g., Drama, Action, Comedy) |
| Director | Name of the primary director |
| Lead Actor/Actress | Primary cast member |
| Runtime | Film length in minutes |
| Release Year | Year the film was released |
| Release Month | Month of theatrical release |
| IMDB Rating | Average user rating on a 10-point scale |
| Number of Votes | Total votes contributing to the rating |
| Gross Revenue | Worldwide box office gross in USD |
| Language | Primary spoken language |
| Country | Country of production |
Methodology
Data loading & cleaning
The raw CSV was loaded with Pandas and inspected for missing values, duplicate entries, and formatting inconsistencies. Key cleaning steps:
- Dropped records that were missing an IMDB rating, or whose vote count fell below a minimum threshold (to filter out obscure entries with unreliable ratings)
- Parsed multi-genre strings into individual genre tags using string splitting and `explode()`
- Converted gross revenue to a consistent numeric format, replacing missing values with `NaN` rather than zero to avoid skewing revenue statistics
- Standardized director name formatting to handle punctuation and encoding issues
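
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names (`title`, `genre`, `director`, `imdb_rating`, `votes`, `gross`), the `MIN_VOTES` value, the `clean_movies` helper, and the inline sample rows are all assumptions made for the example. The director step only covers Unicode normalization and stray whitespace, as a simplified stand-in for the punctuation and encoding fixes.

```python
import io

import pandas as pd

# Toy stand-in for the raw CSV. Rows and column names are illustrative
# assumptions, not taken from the real dataset.
RAW_CSV = '''title,genre,director,imdb_rating,votes,gross
Film A,"Drama, Biography",Jane Doe,8.1,250000,"$120,000,000"
Film B,"Action, Adventure",  John   Roe ,6.4,90000,
Film C,Horror,Sam Lee,5.2,300,"$1,500,000"
Film D,Comedy,Ann Poe,,12000,"$40,000,000"
Film A,"Drama, Biography",Jane Doe,8.1,250000,"$120,000,000"
'''

MIN_VOTES = 1000  # hypothetical threshold; the write-up does not state its value


def clean_movies(raw: pd.DataFrame, min_votes: int = MIN_VOTES) -> pd.DataFrame:
    """Apply the cleaning steps and return one row per (film, genre)."""
    df = raw.drop_duplicates().copy()

    # Drop rows with no rating, then rows below the vote threshold.
    df = df.dropna(subset=["imdb_rating"])
    df = df[df["votes"] >= min_votes]

    # Strip currency symbols and thousands separators. Missing or
    # unparseable values become NaN rather than zero.
    gross_text = df["gross"].str.replace(r"[$,]", "", regex=True)
    df["gross"] = pd.to_numeric(gross_text, errors="coerce")

    # Normalize Unicode forms and collapse stray whitespace in names.
    df["director"] = (
        df["director"]
        .str.normalize("NFKC")
        .str.replace(r"\s+", " ", regex=True)
        .str.strip()
    )

    # Split "Drama, Biography" into a list, then emit one row per tag.
    df["genre"] = df["genre"].str.split(",")
    df = df.explode("genre", ignore_index=True)
    df["genre"] = df["genre"].str.strip()
    return df


movies = clean_movies(pd.read_csv(io.StringIO(RAW_CSV)))
print(movies[["title", "genre", "director", "gross"]])
```

Because each film appears once per genre after `explode()`, film-level statistics such as total revenue should be computed before the expansion or on de-duplicated titles.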
Univariate analysis
Each variable was analyzed in isolation before examining relationships. This step included:
- Distribution plots for ratings, runtime, and gross revenue
- Frequency counts for genre, language, and country
- Year-on-year counts to understand how the dataset’s coverage changes over time
- Identifying the central tendency and spread of key numeric variables
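
As a rough sketch of the numeric side of this step, the snippet below computes central tendency, spread, and frequency counts with Pandas. The column names and the handful of rows are assumptions for illustration only.

```python
import pandas as pd

# Illustrative rows only; column names are assumed, not from the dataset.
movies = pd.DataFrame({
    "imdb_rating": [8.1, 6.4, 7.2, 5.9, 7.0],
    "runtime": [148, 95, 121, 88, 110],
    "release_year": [2010, 2010, 2015, 2015, 2015],
    "language": ["English", "English", "French", "English", "Korean"],
})

# Central tendency and spread of the key numeric variables.
numeric_summary = movies[["imdb_rating", "runtime"]].agg(["mean", "median", "std"])

# Frequency counts for a categorical field.
language_counts = movies["language"].value_counts()

# Films per year, in chronological order, to see how coverage changes.
films_per_year = movies["release_year"].value_counts().sort_index()

print(numeric_summary)
print(language_counts)
print(films_per_year)
```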
Bivariate analysis
Pairs of variables were examined to identify meaningful correlations and patterns:
- Rating vs. gross revenue (scatter plot with a log scale on revenue)
- Runtime vs. rating (binned analysis to check whether longer films rate differently)
- Release month vs. average rating and revenue (checking seasonal release windows)
- Vote count vs. rating (popularity does not always equal quality — this tension was explored explicitly)
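
The comparisons above could be computed along these lines. This is a sketch under assumptions: the column names, the runtime bin edges, and the sample rows are invented for the example, and the Pearson correlation on log-scaled revenue is one reasonable choice rather than the project's confirmed method.

```python
import numpy as np
import pandas as pd

# Illustrative rows only; column names and values are assumptions.
movies = pd.DataFrame({
    "runtime": [85, 95, 110, 125, 150, 170],
    "imdb_rating": [5.8, 6.2, 6.9, 7.1, 7.8, 8.0],
    "release_month": [7, 7, 11, 12, 12, 6],
    "gross": [2e6, 5e7, 8e6, 1.2e8, np.nan, 9e8],
})

# Runtime vs. rating: bin runtimes (hypothetical edges), then average ratings.
bins = [0, 90, 120, 150, np.inf]
labels = ["<=90", "91-120", "121-150", ">150"]
movies["runtime_bin"] = pd.cut(movies["runtime"], bins=bins, labels=labels)
rating_by_runtime = movies.groupby("runtime_bin", observed=True)["imdb_rating"].mean()

# Release month vs. average rating and revenue (NaN revenue is skipped).
by_month = movies.groupby("release_month").agg(
    mean_rating=("imdb_rating", "mean"),
    mean_gross=("gross", "mean"),
)

# Rating vs. revenue: correlate on log10 revenue, mirroring the log-scaled axis.
movies["log_gross"] = np.log10(movies["gross"])
rating_gross_corr = movies["imdb_rating"].corr(movies["log_gross"])

print(rating_by_runtime)
print(by_month)
print(round(rating_gross_corr, 3))
```

Taking the logarithm before correlating keeps a few billion-dollar outliers from dominating the coefficient.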
Genre & director analysis
Genre-level aggregations revealed which categories consistently earn high ratings and strong box office returns. Director-level analysis identified the filmmakers with the highest median ratings across a minimum number of films, filtering out directors with only one or two credits to keep comparisons fair.
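
A minimal sketch of the director filter described above, assuming columns named `director` and `imdb_rating`. The `rank_directors` helper and the sample rows are hypothetical, and the threshold is lowered to 3 here only so the toy data has qualifying directors.

```python
import pandas as pd

# Illustrative rows only; director names are placeholders.
movies = pd.DataFrame({
    "director": ["Director A"] * 3 + ["Director B"] * 3 + ["Director C"],
    "imdb_rating": [8.2, 8.6, 8.4, 6.1, 7.0, 6.5, 9.0],
})


def rank_directors(df: pd.DataFrame, min_films: int) -> pd.DataFrame:
    """Median rating per director, keeping only those with enough films."""
    stats = df.groupby("director")["imdb_rating"].agg(
        films="count",
        median_rating="median",
    )
    eligible = stats[stats["films"] >= min_films]
    return eligible.sort_values("median_rating", ascending=False)


director_stats = rank_directors(movies, min_films=3)
print(director_stats)
```

Director C has the highest single rating but is excluded, which is the point of the minimum-credit filter: one film is not enough to call a director consistent.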
Key findings
- Genre impact on ratings: Documentary and Biography films earn consistently higher average IMDB ratings than Action or Horror films, but they attract far fewer votes and significantly lower gross revenue. Drama sits in the middle: reliably rated, reliably watched. The highest-grossing genre, Action/Adventure, averages a full point lower in rating than Documentary.
- Seasonal release patterns: Films released in the November–December window average notably higher ratings and gross revenues, aligning with the awards season strategy studios use for prestige releases. The summer window (June–August) dominates in sheer volume and total box office gross but shows lower average ratings, reflecting the blockbuster vs. prestige split.
- Vote count as a proxy for reach: Films with more than 100,000 votes have a much tighter rating distribution (7.0–8.5), while films with fewer than 10,000 votes span nearly the full 1–10 range. High vote counts signal mainstream appeal and coincide with a narrower band of ratings.
- Director consistency: A small group of directors (Christopher Nolan, Denis Villeneuve, and a handful of others) maintain median ratings above 8.0 across five or more films — a rare achievement that distinguishes them from directors with a single breakout film.
- Revenue vs. rating disconnect: The correlation between IMDB rating and gross revenue is positive but moderate. Many highly rated films underperform commercially, while several blockbusters with average ratings gross over $1 billion. This suggests that factors outside the dataset, such as marketing budget and franchise recognition, matter more for revenue than audience ratings do.
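
One way to check a finding like the vote-count pattern is to bucket films by votes and compare the rating spread in each tier. The sketch below assumes columns named `votes` and `imdb_rating`; the rows are invented and do not reproduce the project's numbers.

```python
import numpy as np
import pandas as pd

# Illustrative rows only; they are not drawn from the real dataset.
movies = pd.DataFrame({
    "votes": [500, 3000, 8000, 40000, 75000, 150000, 400000, 900000],
    "imdb_rating": [2.1, 9.3, 5.0, 6.2, 7.4, 7.1, 8.3, 7.8],
})

# Tiers follow the 10,000 and 100,000 cut-offs from the findings.
# pd.cut is right-inclusive by default, so exactly 10,000 lands in the lowest tier.
tiers = pd.cut(
    movies["votes"],
    bins=[0, 10_000, 100_000, np.inf],
    labels=["under 10k", "10k-100k", "over 100k"],
)

spread = movies.groupby(tiers, observed=True)["imdb_rating"].agg(["min", "max", "std"])
spread["range"] = spread["max"] - spread["min"]
print(spread)
```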
Visualizations
Running the project notebook produces the following charts:

- Rating distribution histogram: bell-shaped, centered around 6.5–7.0, with a slight left skew (a longer tail toward low ratings)
- Genre popularity bar chart: ranked by number of films and separately by median rating
- Top 20 directors by median rating (minimum 5 films): horizontal bar chart with vote-count annotations
- Revenue vs. rating scatter plot: log-scaled revenue axis, colored by genre, highlights the blockbuster/prestige divide
- Monthly release heatmap: average rating and revenue by release month across years
- Runtime distribution by genre: box plots showing that Drama and Biography skew longer while Horror and Comedy skew shorter
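
As an example of how one of these charts might be built, here is a sketch of the revenue vs. rating scatter plot with a log-scaled revenue axis and one color per genre. The column names and data points are assumptions, and the notebook's actual styling may differ.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Illustrative rows only; not taken from the real dataset.
movies = pd.DataFrame({
    "imdb_rating": [6.3, 6.8, 7.9, 8.2, 5.4, 5.9],
    "gross": [9.5e8, 4.0e8, 3.0e7, 1.2e7, 6.0e7, 2.5e7],
    "genre": ["Action", "Action", "Drama", "Drama", "Horror", "Horror"],
})

fig, ax = plt.subplots(figsize=(7, 4))

# One scatter call per genre gives each genre its own color and legend entry.
for genre, group in movies.groupby("genre"):
    ax.scatter(group["imdb_rating"], group["gross"], label=genre, alpha=0.8)

ax.set_yscale("log")  # revenue spans several orders of magnitude
ax.set_xlabel("IMDB rating")
ax.set_ylabel("Gross revenue (USD, log scale)")
ax.set_title("Revenue vs. rating by genre")
ax.legend(title="Genre")
fig.tight_layout()
```

Inside a Jupyter notebook the `matplotlib.use("Agg")` line can be dropped so the figure renders inline.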
Technologies used
Python
Core language for all data manipulation and visualization tasks.
Pandas
Data loading, cleaning, aggregation, and multi-genre expansion with `explode()`.
Matplotlib
Histograms, scatter plots, bar charts, and annotated figure exports.
Seaborn
Box plots, heatmaps, and statistical visualizations with consistent styling.
Jupyter Notebook
Self-contained analytical narrative combining code, outputs, and commentary.