Using Python and BeautifulSoup, we can quickly and efficiently scrape data from a web page. In the example below, I am going to show you how to scrape a web page in 20 lines of code, using BeautifulSoup and Python.
What is Web Scraping:
Web scraping is the process of automatically extracting information from a website. Web scraping, or data scraping, is useful for researchers, marketers and analysts interested in compiling, filtering and repackaging data.
Web Scraping in Python and BeautifulSoup:
Web scraping in Python is a breeze. There are a number of ways to access a web page and scrape its data. I am using Python and BeautifulSoup for the purpose.
In this example, we are scraping college football player data from the ESPN website.
As we are scraping the web page using the BeautifulSoup and Requests libraries, we need to install them first. This can be done using pip:
pip install requests
pip install beautifulsoup4
Ok. Time to brew some Python magic.
Let’s import the required libraries in our code. These include BeautifulSoup, requests, os and csv, as we are going to save the extracted data in a CSV file.
from bs4 import BeautifulSoup
import requests
import os, os.path, csv
The next step is to fetch the web page and store it in a BeautifulSoup object. We also need a parser to parse the fetched web page. BeautifulSoup can work with a variety of parsers; we are using the default html.parser in this example.
listingurl = "http://www.espn.com/college-sports/football/recruiting/databaseresults/_/sportid/24/class/2006/sort/school/starsfilter/GT/ratingfilter/GT/statuscommit/Commitments/statusuncommit/Uncommited"
response = requests.get(listingurl)
soup = BeautifulSoup(response.text, "html.parser")
Now comes the fun part.
We are now going to extract the player name, school, city, playing position and grade.
On viewing the source code (CTRL + U in Chrome), we note that the page uses a table to display the data, rows use the oddrow and evenrow classes to create a shading effect, and fields are enclosed in td tags.
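As an aside, the two row classes can also be matched in a single call with a grouped CSS selector via select(). Here is a minimal offline sketch against a stand-in snippet of markup (the real page's structure is assumed; only the oddrow/evenrow class names come from the page):

```python
from bs4 import BeautifulSoup

# Stand-in for the ESPN table markup; only the oddrow/evenrow
# class names are taken from the real page
html = """
<table>
  <tr class="colhead"><td>Name</td></tr>
  <tr class="oddrow"><td>Player A</td></tr>
  <tr class="evenrow"><td>Player B</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# A grouped CSS selector matches both row classes in one call
rows = soup.select("tr.oddrow, tr.evenrow")
print(len(rows))  # 2
```

This is equivalent to iterating over find_all("tr") and checking the class attribute by hand, as the main example below does.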
The next step is to find all the rows, checking for both odd and even row classes, and traverse their columns to fetch the data.
Note that we need to separate the city and school from the hometown field.
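As a quick standalone illustration of that split (the sample value below is made up; the real text comes from the table cell, assumed to be "city, school" separated by a comma):

```python
# Hypothetical hometown cell value in "city, school" form
hometown = "Hoover, Hoover High School"

comma = hometown.find(",")
city = hometown[:comma]        # text before the comma
school = hometown[comma + 2:]  # text after ", "

print(city)    # Hoover
print(school)  # Hoover High School
```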
The fetched data is appended to a list which will be written to a CSV file at a later stage.
listings = []
for rows in soup.find_all("tr"):
    # only data rows carry the oddrow/evenrow classes
    if ("oddrow" in rows.get("class", [])) or ("evenrow" in rows.get("class", [])):
        name = rows.find("div", class_="name").a.get_text()
        # column indices below are assumed from the page layout:
        # td[1] = hometown, td[2] = position, td[4] = grade
        hometown = rows.find_all("td")[1].get_text()
        school = hometown[hometown.find(",") + 2:]
        city = hometown[:hometown.find(",")]
        position = rows.find_all("td")[2].get_text()
        grade = rows.find_all("td")[4].get_text()
        listings.append([name, school, city, position, grade])
The final section of the code opens a CSV file and writes the content of the list to it. A confirmation message is printed at the end.
with open("footballers.csv", "w", newline="", encoding="utf-8") as toWrite:
    writer = csv.writer(toWrite)
    writer.writerows(listings)

print("ESPN College Football listings fetched.")
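To sanity-check the output format, we can round-trip a couple of rows through the csv module; the rows below are made up for illustration, in the same [name, school, city, position, grade] shape as the scraped data:

```python
import csv

# Made-up sample rows in the [name, school, city, position, grade] shape
sample = [
    ["Player A", "Hoover High School", "Hoover", "QB", "95"],
    ["Player B", "Austin High School", "Austin", "WR", "92"],
]

# newline="" keeps the csv module from inserting blank rows on Windows
with open("footballers.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(sample)

with open("footballers.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))

print(rows == sample)  # True
```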
That’s all folks!
So, this is how Python and BeautifulSoup are used to scrape a web page in just 20 lines of code.
While the code achieves the requirements, it is not very elegant or self-explanatory. A detailed version of the code, with comments and extra bits to tie up the loose ends, is available on GitHub [here].
Resources for Web Scraping using Python and BeautifulSoup:
Books on Web Scraping in Python:
Python Web Scraping – Second Edition: Hands-on data scraping and crawling using PyQt, Selenium, HTML and Python [available here]
Web Scraping with Python: Collecting More Data from the Modern Web [available here]