Today we are going to learn how to do web scraping using Python. I am a big fan of cricket, so I decided to write a script that scrapes the live score from Cricinfo and shows it to the user as a desktop notification. We use the requests module to fetch the URL and beautifulsoup4 to parse the page.
Prerequisites
Install the following packages using pip:
pip install beautifulsoup4
pip install requests
pip install lxml
notify-send: on Linux it is pre-installed; for Windows, install it using the link
Let's dive into the code
Import all the dependencies
import requests ## for fetching the URL
from bs4 import BeautifulSoup ## for web parsing
import os
import time
Fetch contents from the URL using the requests module
url="https://www.espncricinfo.com/series/india-in-south-africa-2021-22-1277060/south-africa-vs-india-2nd-odi-1277083/live-cricket-score"
response = requests.get(url)
Here we send a GET request to the specified URL (a live match was in progress when the script was written).
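A request like this can hang or return an error page, so it may help to add a timeout and check the status code. The following is a minimal sketch, not the article's original code: the helper name fetch_html, the 10-second timeout, and the injectable get parameter (there only so the function can be tried without a network) are all assumptions.

```python
import requests


def fetch_html(url, timeout=10, get=requests.get):
    """Fetch a page and return its raw bytes (hypothetical helper).

    `get` defaults to requests.get; passing another callable makes the
    function easy to exercise offline.
    """
    response = get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.content


# Usage (needs network access):
# html = fetch_html(url)
```

With this in place, a failed request surfaces as an exception rather than as a confusing parsing error later on.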
Find the required div and score using beautifulsoup4
This is the entire page as it appears in the browser. I need only the live score from the batting team's div.
Entire screen
batting team div
So I inspected the webpage and found that the live scorecard has the class “match-info match-info-MATCH match-info-MATCH-full-width”.
To get the HTML tag with this class:
soup = BeautifulSoup(response.content, "lxml")
score_parent = soup.find('div', attrs = {'class':'match-info match-info-MATCH match-info-MATCH-full-width'})
Here, we first pass the entire content of the response to BeautifulSoup, which returns a BeautifulSoup object. Then we find the first div with the required class. Next, we find the batting team div and its child elements using the same find function (and find_all for the score spans):
batting_team_div = score_parent.find("div", attrs={"class": "team"})
name_div = batting_team_div.find('p', attrs = {'class': 'name'})
overs_div = batting_team_div.find('span', attrs={"class": 'score-info'})
score_div = batting_team_div.find_all('span', attrs={"class": "score"})
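Note that find returns None when nothing matches, which happens easily if the site changes its markup. The sketch below shows one way to guard against that. It is an illustration under assumptions: SAMPLE_HTML is made-up markup that only mimics the class names above, find_batting_team is a hypothetical helper, and it uses a CSS selector with the built-in "html.parser" so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Made-up markup that imitates the class names used in this article.
SAMPLE_HTML = """
<div class="match-info match-info-MATCH match-info-MATCH-full-width">
  <div class="team">
    <p class="name">India</p>
    <span class="score-info">(35.2/50 ov)</span>
    <span class="score">187/3</span>
  </div>
</div>
"""


def find_batting_team(html):
    """Return the first team div inside the scorecard, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    # A CSS selector matches the classes regardless of their order.
    score_parent = soup.select_one("div.match-info.match-info-MATCH")
    if score_parent is None:
        return None
    return score_parent.find("div", attrs={"class": "team"})


team = find_batting_team(SAMPLE_HTML)
if team is not None:
    print(team.find("p", attrs={"class": "name"}).text)
```

Checking for None before calling find again avoids an AttributeError when the scorecard is missing from the page.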
Extract text from the div
In the previous step, we got all the required elements. In this step, we get the text from each of them using the text attribute.
batting_team_name = name_div.text
This gives you the name of the batting team. We build a new string with all the details using the following steps:
batting_team_name = name_div.text
display_text = ""
if batting_team_name:
    display_text += batting_team_name
    display_text += " "
overs_text = overs_div.text
if overs_text:
    display_text += overs_text
    display_text += " "
scores = []
for span in score_div:
    scores.append(span.text.replace("\xa0", " "))
scores_string = "".join(scores)
if scores_string:
    display_text += scores_string
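The same string-building steps can be collected into one small function, which makes them easy to try out with plain strings. This is a sketch rather than the original code: build_display_text is a hypothetical name, the sample values are invented, and unlike the steps above it strips stray whitespace so no trailing space is left when a part is empty.

```python
def build_display_text(team_name, overs_text, score_texts):
    """Join team name, overs, and scores into one notification line."""
    # Replace non-breaking spaces, as in the steps above, then join the scores.
    scores = "".join(text.replace("\xa0", " ") for text in score_texts)
    parts = [team_name, overs_text, scores]
    # Skip empty parts and separate the rest with single spaces.
    return " ".join(part.strip() for part in parts if part and part.strip())


print(build_display_text("India", "(35.2/50 ov)", ["187/3"]))
```

Keeping this logic free of BeautifulSoup objects means it can be checked without fetching any page.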
Show Notification on the screen
cmd = f'notify-send "Cricket Score" "{display_text}"'
os.system(cmd)
Using the os module, we trigger the notify-send command, which displays a notification on the screen. Its syntax is notify-send <title> <text to display>
The output will look like this:
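Because the command above is assembled as a single shell string, a double quote or other special character in the scraped text could break it. One alternative, shown here as a sketch and not as the article's method, is to pass the arguments as a list through the standard subprocess module. The helper names build_notify_command and notify are assumptions, and the fallback to print when notify-send is missing is just one possible choice.

```python
import shutil
import subprocess


def build_notify_command(title, message):
    """Return the notify-send command as an argument list (no shell quoting)."""
    return ["notify-send", title, message]


def notify(title, message):
    """Show a desktop notification, or print it if notify-send is unavailable."""
    if shutil.which("notify-send") is None:
        print(f"{title}: {message}")
        return False
    # Arguments are passed directly to the program, so quotes in the
    # message are not interpreted by a shell.
    subprocess.run(build_notify_command(title, message), check=False)
    return True


# Usage:
# notify("Cricket Score", display_text)
```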
Make an infinite loop and call the function every minute
starttime = time.time()
while True:
    print("tick")
    time.sleep(60.0 - ((time.time() - starttime) % 60.0))
    get_score_and_notify()
This is an infinite loop that calls our function every 60 seconds. Here get_score_and_notify() is a function that wraps the fetch, parse, and notify steps above.
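The sleep expression in the loop is worth a closer look: rather than always sleeping a full 60 seconds, it sleeps only until the next 60-second mark counted from starttime, so the time spent scraping does not make the schedule drift. The sketch below isolates that arithmetic; the function name seconds_until_next_tick and the sample numbers are purely illustrative.

```python
def seconds_until_next_tick(start, now, interval=60.0):
    """Time left until the next multiple of `interval` seconds after `start`."""
    elapsed = now - start
    return interval - (elapsed % interval)


# 12.5 s after the start, the next tick is 47.5 s away.
print(seconds_until_next_tick(0.0, 12.5))
```

If the function call itself takes 3 seconds, the next sleep is about 57 seconds, so calls stay roughly one minute apart.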
You can clone the entire code from the repository here
Web scraping is useful for many purposes, such as collecting table data, images, and links from websites.
I hope you learned about web scraping. Please share your suggestions at afsal@parseltongue.co.in
Happy coding!