GitHub - tomkeith/imdb-scraper: Script to scrape IMDb.com for key movie features - including all genres and short plot summary (and optional movie poster)

IMDb.com Web Scraper

Tom Keith - https://github.com/tomkeith

IMDb URL Structure

IMDb has a great URL structure for scraping. Using Star Wars as an example: www.imdb.com/title/tt0076759/. The IMDb ID (here tt0076759) is all that is needed to fetch the page.
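As a minimal sketch of that URL structure, a title URL can be built from an ID alone. The helper name `title_url` and the ID pattern (`tt` followed by at least seven digits) are assumptions for illustration, not code taken from the notebook.

```python
import re

# Assumed URL template, based on the Star Wars example above.
IMDB_TITLE_URL = "https://www.imdb.com/title/{imdb_id}/"

# Assumption: an IMDb title ID is "tt" followed by seven or more digits.
_ID_PATTERN = re.compile(r"tt\d{7,}")


def title_url(imdb_id: str) -> str:
    """Return the title page URL for an IMDb ID (hypothetical helper)."""
    if not _ID_PATTERN.fullmatch(imdb_id):
        raise ValueError(f"not a valid IMDb title ID: {imdb_id!r}")
    return IMDB_TITLE_URL.format(imdb_id=imdb_id)


print(title_url("tt0076759"))
```

Validating the ID before building the URL keeps malformed rows from the input list from turning into wasted requests.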

IMDb IDs can be sourced from the IMDb open datasets where these unique IDs are represented in the tconst column.
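One possible way to pull IDs out of the open datasets is sketched below. It assumes a tab-separated file with `tconst` and `titleType` columns, as in IMDb's `title.basics` dataset. The function name `movie_ids` and the in-memory sample rows are illustrative only.

```python
import csv
import io


def movie_ids(tsv_file):
    """Collect the tconst value of every row whose titleType is 'movie'."""
    # QUOTE_NONE: treat quote characters inside titles as ordinary text.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row["tconst"] for row in reader if row["titleType"] == "movie"]


# Tiny illustrative sample standing in for the real dataset file.
SAMPLE = (
    "tconst\ttitleType\tprimaryTitle\n"
    "tt0076759\tmovie\tStar Wars\n"
    "tt0000001\tshort\tCarmencita\n"
)

print(movie_ids(io.StringIO(SAMPLE)))
```

With the real dataset, the same function could be passed an open file handle instead of the `io.StringIO` sample.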

Why scrape IMDb if there is an open dataset?

IMDb's open dataset lacked some key features needed for my Movie Genre Prediction project. Most notably, it lists at most 3 genres per title (the first three alphabetically), whereas a title on IMDb.com can have 1-7 genres. Additionally, IMDb's open data does not contain any text data (for example, a plot summary), which I also needed for NLP.

Those reasons are the inspiration behind creating this scraper.

Movie Posters

The function imdb_scrape has an optional second parameter (boolean) to save the movie poster (default location is /posters/ folder).
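The poster-saving step might look roughly like the sketch below. The helper `save_poster`, its parameters, and the `.jpg` file naming are hypothetical. The notebook's actual `imdb_scrape` signature and file layout may differ.

```python
from pathlib import Path


def save_poster(imdb_id, image_bytes, folder="posters"):
    """Write already-downloaded poster bytes to <folder>/<imdb_id>.jpg.

    Hypothetical helper: the naming scheme is an assumption.
    """
    target_dir = Path(folder)
    # Create the folder on first use instead of failing if it is missing.
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{imdb_id}.jpg"
    target.write_bytes(image_bytes)
    return target
```

Naming each file after its IMDb ID keeps posters easy to join back to the scraped rows.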

Notes

The main function imdb_scrape is meant to be run in a loop. The notebook is not meant to be run all at once. Rather, the main cell (the one that is not a function) is meant to be manually updated before each run. See the notes before that cell.
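A batch loop of that kind could be sketched as follows. The `start` and `stop` values play the role of the numbers updated by hand before each run. The name `scrape_batch`, the `scrape_fn` parameter, and the delay are assumptions for illustration, and a stub stands in for the real `imdb_scrape`.

```python
import time


def scrape_batch(imdb_ids, start, stop, scrape_fn, delay_seconds=1.0):
    """Scrape imdb_ids[start:stop], collecting results and failures."""
    results, failed = [], []
    for imdb_id in imdb_ids[start:stop]:
        try:
            results.append(scrape_fn(imdb_id))
        except Exception as exc:
            # Record the failure and keep going so one bad page
            # does not abort the whole batch.
            failed.append((imdb_id, repr(exc)))
        time.sleep(delay_seconds)  # pause between requests
    return results, failed


def fake_scrape(imdb_id):
    """Stub used here in place of the notebook's imdb_scrape."""
    return {"id": imdb_id}


ids = ["tt0076759", "tt0080684", "tt0086190"]
rows, errors = scrape_batch(ids, 0, 2, fake_scrape, delay_seconds=0)
print(rows, errors)
```

Running one slice at a time makes it straightforward to resume from the last completed index after an interruption.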

