Data scraping, also known as web scraping, is the process of extracting information from a website and converting it into a structured form such as a database. It is a practical way to collect data from the web in bulk. Web scraping can also be used to download the images on a website.
Beautiful Soup is a Python library for parsing HTML and XML documents, which makes it easy to extract information from web pages. We will use it to scrape data and images from the IMDB website.
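As a minimal sketch of what parsing with Beautiful Soup looks like, the snippet below reads a small inline HTML string and collects its links. The HTML, the `SAMPLE_HTML` name, and the `parse_links` helper are assumptions made for illustration; they are not IMDB's markup or code from the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page; not IMDB's real markup.
SAMPLE_HTML = """
<html><body>
  <h1 class="title">Sample list</h1>
  <a href="/name/1/">First Person</a>
  <a href="/name/2/">Second Person</a>
</body></html>
"""


def parse_links(html):
    """Return (text, href) pairs for every <a> tag in the document."""
    # "html.parser" is Python's built-in parser; "lxml" could be used instead.
    soup = BeautifulSoup(html, "html.parser")
    return [(a.get_text(strip=True), a.get("href")) for a in soup.find_all("a")]


if __name__ == "__main__":
    for text, href in parse_links(SAMPLE_HTML):
        print(text, href)
```

For a live page, the HTML string would typically come from `get(url).text` using the `requests` library.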
import pandas as pd
import re
import lxml
from PIL import Image
from io import BytesIO
import os
import webbrowser
from bs4 import BeautifulSoup
from requests import get
- Import all necessary libraries.
- Open the IMDB list page: https://www.imdb.com/list/ls068010962/
- Right-click anywhere on the page and select 'Inspect'.
- Find the elements that correspond to the data we want to extract.
- Make a note of the tags as well as attributes such as class and id. We'll use them later to locate the elements.
- Then follow the Indian_Movie_Celebrities_Database_Generator.ipynb notebook available in this repository, which contains a detailed explanation of the code.
- After running the notebook, the generated DataFrame, containing information on 200 celebrities, is displayed as a table in a browser, as shown below:
From celebrity 1
Right until the 200th Celebrity!!!
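The steps above (locating elements by tag and attribute, then building a table) can be sketched roughly as follows. This is only an illustration under stated assumptions: the tag names, the class names `celebrity`, `name`, and `role`, the column names, and the helper functions are hypothetical placeholders, not the ones IMDB or the notebook actually uses.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup: the tags and class names are placeholders.
# Substitute whatever 'Inspect' shows for the real page.
SAMPLE_HTML = """
<div class="celebrity"><h3 class="name">Person A</h3><p class="role">Actor</p></div>
<div class="celebrity"><h3 class="name">Person B</h3><p class="role">Actress</p></div>
"""


def extract_rows(html):
    """Collect one dict per matching <div>, looked up by tag and class."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.find_all("div", class_="celebrity"):
        name = item.find("h3", class_="name")
        role = item.find("p", class_="role")
        rows.append({
            # Guard against missing elements so one bad entry does not crash the run.
            "Name": name.get_text(strip=True) if name else None,
            "Role": role.get_text(strip=True) if role else None,
        })
    return rows


def build_table(html):
    """Turn the extracted rows into a pandas DataFrame."""
    return pd.DataFrame(extract_rows(html), columns=["Name", "Role"])


if __name__ == "__main__":
    df = build_table(SAMPLE_HTML)
    # Render the DataFrame as an HTML table string.
    print(df.to_html(index=False))
```

Passing a file path to `df.to_html(...)` writes the table to disk instead of returning a string, and `webbrowser.open(path)` can then display that file in a browser.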
- Indian_Movie_Celebrities_Database_Generator.ipynb - The Jupyter notebook containing the code for data and image scraping.
- Images/ (folder) - Contains all the images scraped from the IMDB website.
- Top 200 Best Indian Actors and Actresses.html - The .html file created by running the notebook.
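For the Images/ folder, one plausible way to write downloaded image bytes to disk with `PIL` and `BytesIO` is sketched below. The `save_image` and `make_placeholder_bytes` helpers and the file name are hypothetical, and the placeholder image is generated in memory so the example runs without a network connection; the notebook may do this differently.

```python
import os
import tempfile
from io import BytesIO

from PIL import Image


def save_image(image_bytes, folder, filename):
    """Decode raw image bytes and save them as a file inside `folder`."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, filename)
    with Image.open(BytesIO(image_bytes)) as image:
        image.save(path)  # output format is inferred from the file extension
    return path


def make_placeholder_bytes():
    """Create PNG bytes in memory, standing in for a downloaded image."""
    buffer = BytesIO()
    Image.new("RGB", (8, 8), color=(200, 30, 30)).save(buffer, format="PNG")
    return buffer.getvalue()


if __name__ == "__main__":
    # A temporary directory keeps the demo from leaving files behind.
    with tempfile.TemporaryDirectory() as tmp:
        saved = save_image(make_placeholder_bytes(),
                           os.path.join(tmp, "Images"), "sample.png")
        print("Saved:", os.path.basename(saved))
```

With a real page, `image_bytes` would typically be `get(image_url).content`, where `image_url` is taken from the `src` attribute of an `<img>` tag.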