Using Python and BeautifulSoup, we can quickly and efficiently scrape data from a web page. In the example below, I will show you how to scrape a web page in 20 lines of code using BeautifulSoup and Python.
What is Web Scraping:
Web scraping is the process of automatically extracting information from a website. Web scraping, or data scraping, is useful for researchers, marketers and analysts interested in compiling, filtering and repackaging data.
A word of caution: Always respect the website’s privacy policy and check robots.txt before scraping. If a website offers an API to interact with its data, it is better to use that instead of scraping.
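Checking robots.txt can itself be done from Python with the standard library’s urllib.robotparser. The sketch below parses a sample robots.txt string instead of fetching a real one; the rules shown are illustrative, not ESPN’s actual policy.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, used here only for illustration.
sample_robots = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp = RobotFileParser()
rp.parse(sample_robots.splitlines())

# can_fetch(user_agent, url) tells us whether a path may be crawled.
print(rp.can_fetch("*", "https://example.com/college-sports/"))  # True
print(rp.can_fetch("*", "https://example.com/private/data"))     # False
```

In a real script you would call rp.set_url() with the site’s robots.txt URL and rp.read() to fetch it before checking.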
Web Scraping with Python and BeautifulSoup:
Web scraping in Python is a breeze. There are a number of ways to access a web page and scrape its data; I am using Python and BeautifulSoup for the purpose.
In this example, we are scraping college football recruit data from the ESPN website.
As we are scraping the web page using BeautifulSoup and Requests libraries, we need to install them first. This can be done using pip:
pip install requests
pip install beautifulsoup4
Ok. Time to brew some Python magic.
Let’s import the required libraries in our code. These include BeautifulSoup, requests, os and csv, as we are going to save the extracted data in a CSV file.
from bs4 import BeautifulSoup
import requests
import os, os.path, csv
The next step is to fetch the web page and store it in a BeautifulSoup object. We also need a parser to parse the fetched page. BeautifulSoup works with a variety of parsers; we are using the default html.parser in this example.
listingurl = "http://www.espn.com/college-sports/football/recruiting/databaseresults/_/sportid/24/class/2006/sort/school/starsfilter/GT/ratingfilter/GT/statuscommit/Commitments/statusuncommit/Uncommited"
response = requests.get(listingurl)
soup = BeautifulSoup(response.text, "html.parser")
Now comes the fun part.
We are now going to extract the player name, school, city, playing position and grade.
On viewing the source code (Ctrl + U in Chrome), we note that the page uses a table to display the data, rows use odd and even classes for a shading effect, and fields are enclosed in td tags.
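To make that structure concrete, here is a toy table fragment (hypothetical markup, mimicking the ESPN page’s odd/even rows) parsed the same way we will parse the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the ESPN table layout described above:
# a header row plus data rows alternating "oddrow" / "evenrow" classes.
html = """
<table>
  <tr class="colhead"><td>PLAYER</td><td>HOMETOWN</td></tr>
  <tr class="oddrow"><td><div class="name"><a>A. Player</a></div></td><td>Springfield, IL Sample High</td></tr>
  <tr class="evenrow"><td><div class="name"><a>B. Player</a></div></td><td>Austin, TX Example Prep</td></tr>
</table>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# Only rows carrying the odd/even classes hold player data;
# tr.get("class", []) avoids a KeyError on rows without a class attribute.
for tr in demo_soup.find_all("tr"):
    classes = tr.get("class", [])
    if "oddrow" in classes or "evenrow" in classes:
        print(tr.find("div", class_="name").a.get_text())
```

This prints the two player names and skips the header row, which is exactly the filtering the full script relies on.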
The next step is to find all rows, checking for both odd and even row classes, and traverse their columns to fetch the data.
Note that we need to separate city and school from the hometown field.
The fetched data is appended to a list, which will be written to a CSV file at a later stage.
listings = []
for rows in soup.find_all("tr"):
    # rows.get("class", []) avoids a KeyError on rows without a class attribute
    if ("oddrow" in rows.get("class", [])) or ("evenrow" in rows.get("class", [])):
        name = rows.find("div", class_="name").a.get_text()
        hometown = rows.find_all("td")[1].get_text()
        school = hometown[hometown.find(",")+4:]
        city = hometown[:hometown.find(",")+4]
        position = rows.find_all("td")[2].get_text()
        grade = rows.find_all("td")[4].get_text()
        listings.append([name, school, city, position, grade])
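The hometown slicing in the loop above can be tested in isolation. This sketch assumes the field always has the form “City, ST School Name”, with a two-letter state code after the comma, which is what the +4 offset relies on:

```python
def split_hometown(hometown):
    # The scraped field looks like "City, ST School Name".
    # Slicing up to comma+4 captures "City, ST"; the rest is the school.
    comma = hometown.find(",")
    city = hometown[:comma + 4]
    school = hometown[comma + 4:]
    return city.strip(), school.strip()

# Hypothetical sample value, in the format described above.
print(split_hometown("Bessemer, AL Jess Lanier"))  # ('Bessemer, AL', 'Jess Lanier')
```

If the format ever varies (e.g. a missing state code), the slice will silently misplace characters, so a stricter parse with a regular expression would be the safer long-term choice.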
The final section of the code opens a CSV file and writes the contents of the list to it. A confirmation message is printed at the end.
with open("footballers.csv", "w", newline="", encoding="utf-8") as toWrite:
    writer = csv.writer(toWrite)
    writer.writerows(listings)
print("ESPN College Football listings fetched.")
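Passing newline="" to open() matters here: without it, the csv module can emit blank lines between rows on Windows. A quick round-trip check with hypothetical data confirms the file is well-formed:

```python
import csv
import os
import tempfile

# Hypothetical rows in the same shape the script writes:
# [name, school, city, position, grade]
rows = [
    ["A. Player", "Sample High", "Springfield, IL", "QB", "80"],
    ["B. Player", "Example Prep", "Austin, TX", "WR", "78"],
]

path = os.path.join(tempfile.gettempdir(), "footballers_check.csv")

# Write exactly as the main script does, then read the file back.
with open(path, "w", newline="", encoding="utf-8") as fh:
    csv.writer(fh).writerows(rows)

with open(path, newline="", encoding="utf-8") as fh:
    back = list(csv.reader(fh))

print(back == rows)  # True
```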
That’s all folks!
So, this is how Python and BeautifulSoup are used to scrape a web page in just 20 lines of code.
While the code achieves the requirements, it is not very elegant or self-explanatory. A detailed version of the code, with comments and extra bits to tie up the loose ends, is available at GitHub [here].
Resources for Web Scraping using Python and BeautifulSoup:
BeautifulSoup Documentation. [here]
Requests Library Documentation. [here]
BeautifulSoup code snippets at GitHub. [here]