How to scrape a website using Python?

Sometimes we need to extract data from websites, such as the prices of a particular product. This data might be used for analytics or to build a machine learning model that predicts future prices. Copying and pasting that data by hand and then arranging it for use does not scale. In such cases, we can use web scraping with Python.

Introduction to web scraping using Python

The process of extracting large amounts of information from websites is known as web scraping. In this article, we are going to use Python, one of the most popular programming languages, to scrape a website. Many companies, such as Google, Facebook, and Twitter, provide APIs for accessing their information and services. However, many websites contain data that might be useful to us but do not provide any API. In those cases, web scraping is a great option.

Libraries used

requests: The requests library is used to perform HTTP requests. In this article, we’ll be using it to fetch information from a remote URL.

beautifulsoup4: This library pulls data out of HTML and XML files. It is imported in code as bs4.

pandas: We use this library for data manipulation and analysis.

Implementation

Now, let’s start scraping a website. In this article, we’ll be scraping data from the Goodreads website, which hosts quotes on topics such as love, inspiration, and life.

Fetching

Let’s now import the requests and bs4 packages and fetch the inspirational quotes page from the following URL.

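import requests
from bs4 import BeautifulSoup

# str.format() returns a new string, so the result has to be assigned.
url = "https://www.goodreads.com/quotes/tag/{}".format("inspirational-quotes")
# url == "https://www.goodreads.com/quotes/tag/inspirational-quotes"

res = requests.get(url)
# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(res.text, "html.parser")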

After fetching the page with requests.get(), we can initialize our soup using BeautifulSoup(). Now, let’s do some small exercises to learn how BeautifulSoup works!
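
Before moving on, here is a slightly more defensive sketch of the fetch step. The helper names fetch_html and build_tag_url, the User-Agent string, and the 10-second timeout are illustrative assumptions, not part of requests or of the original example. The optional session argument only exists so that the function can be exercised without touching the network.

```python
import requests

# Hypothetical helper: builds the tag URL used in this article.
def build_tag_url(tag):
    return "https://www.goodreads.com/quotes/tag/{}".format(tag)

# Hypothetical helper: downloads a page and returns its HTML as text.
def fetch_html(url, timeout=10, session=None):
    # Use the given session-like object if there is one, else the requests module.
    http = session if session is not None else requests
    response = http.get(
        url,
        headers={"User-Agent": "quote-scraper-tutorial/0.1"},  # assumed value
        timeout=timeout,  # avoid waiting forever on a stalled connection
    )
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text
```

Calling fetch_html(build_tag_url("inspirational-quotes")) would then return the page's HTML, or raise an exception if the server answers with an error status.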

find_all()

First, we fetch all the <a> (anchor) elements, which are the links on the page. We can do this in the following manner.

links = soup.find_all("a")
print(links)
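
Printing whole tags gets noisy quickly. As a small sketch, the snippet below pulls just the link text and the href attribute out of each anchor. It runs on a made-up HTML string (sample_html) rather than on the live page, so the names and values are only illustrative.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page.
sample_html = """
<ul>
  <li><a href="/quotes/tag/love">Love</a></li>
  <li><a href="/quotes/tag/life">Life</a></li>
  <li><a>No link target</a></li>
</ul>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

links = []
for anchor in sample_soup.find_all("a"):
    href = anchor.get("href")  # None when the tag has no href attribute
    if href:
        links.append((anchor.get_text(strip=True), href))

print(links)
# [('Love', '/quotes/tag/love'), ('Life', '/quotes/tag/life')]
```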

find_all() using class

Now, I want to access the quotes present on the website. First, we need to know the properties of the element that holds a quote, such as its class or id. Inspect the element in the browser's developer tools to find them. In our case, each quote sits in a <div> element with the classes quote and mediumText.

We can fetch every element with that class in the following manner.

quote_divs = soup.find_all("div", attrs={"class": "quote"})
print(quote_divs)
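
To see how class matching behaves, here is a minimal sketch on a made-up HTML string. It compares the attrs form used above with the class_ keyword and with a CSS selector. The sample markup is an assumption for illustration and is not copied from Goodreads.

```python
from bs4 import BeautifulSoup

# Made-up markup: two elements carry the class "quote", one does not.
sample_html = """
<div class="quote mediumText">first</div>
<div class="quote">second</div>
<div class="quoteDetails">third</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Both forms match any <div> that has "quote" as one of its classes.
by_attrs = sample_soup.find_all("div", attrs={"class": "quote"})
by_keyword = sample_soup.find_all("div", class_="quote")

# A CSS selector can require several classes at once.
by_selector = sample_soup.select("div.quote.mediumText")

print([div.get_text() for div in by_attrs])     # ['first', 'second']
print([div.get_text() for div in by_selector])  # ['first']
```

Note that "quoteDetails" is not matched by "quote": the comparison is made against whole class names, not substrings.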

find_next()

Let’s now fetch the details of a quote, namely its text and author. Before that, we need to keep the structure of the container in mind.

The quote is contained inside a <div> element with the class quoteText. First, let’s work on the first element.

quote_div = quote_divs[0]
quoteText_div = quote_div.find_next("div", attrs={"class": "quoteText"})
print(quoteText_div)

The find_next() method returns the first matching element that appears after quote_div in the document, which here is the <div> with class quoteText.
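
One detail worth knowing: find_next() searches forward through the rest of the document, while find() only looks inside the element it is called on. The sketch below shows the difference on a made-up HTML string, where the first container deliberately has no quoteText child.

```python
from bs4 import BeautifulSoup

# Made-up markup: the first container lacks a quoteText <div>.
sample_html = """
<div class="quote" id="q1"><span>no text block here</span></div>
<div class="quote" id="q2"><div class="quoteText">Second quote</div></div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")
first_container = sample_soup.find("div", id="q1")

# find() is limited to descendants of first_container, so it finds nothing.
inside = first_container.find("div", class_="quoteText")

# find_next() keeps going past first_container and lands in the second one.
following = first_container.find_next("div", class_="quoteText")

print(inside)                # None
print(following.get_text())  # Second quote
```

If a page can contain quote containers without a text block, find() is the safer choice, because a missing child shows up as None instead of silently returning a neighbour's text.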

Extract Text

Now, let’s remove the HTML tags using the .text attribute. We can also use the strip() method to remove leading and trailing whitespace.

striped = quoteText_div.text.strip()
print(striped)

Notice that the output spans multiple lines. Now, let’s fetch the quote and the author individually by splitting the text at the newline characters.

striped_text = striped.split("\n")
print(striped_text)

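Now, we have our quote in the first element of the list and our author in the last. The slice [1:-1] drops the first and last characters of the quote, which here are its quotation marks. Let’s print them out as follows: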

quote = striped_text[0][1:-1]
author = striped_text[-1].strip()
 
print(quote)
print(author)
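
The same clean-up steps can be sketched on a plain string, without any HTML involved. The sample text and the helper name parse_quote_text are made up for illustration, and the layout (quote first, a separator line, author last) is an assumption based on the steps above.

```python
# Made-up text shaped like the output of .text on a quoteText block.
sample_text = """
      “An example quote.”
    -

    Example Author
"""

# Hypothetical helper: returns (quote, author) from such a text block.
def parse_quote_text(raw_text):
    # Strip every line and drop the ones that end up empty.
    lines = [line.strip() for line in raw_text.strip().split("\n")]
    lines = [line for line in lines if line]
    quote = lines[0].strip("“”")  # remove the surrounding curly quotation marks
    author = lines[-1]
    return quote, author

quote, author = parse_quote_text(sample_text)
print(quote)   # An example quote.
print(author)  # Example Author
```

Using strip("“”") instead of the [1:-1] slice only removes characters that really are quotation marks, so a quote without them is left unchanged.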

Define function

Similarly, we can fetch the details of all the quotes. But repeating this process by hand for each quote is tedious. Let’s write a function that loops over every <div> with class quote and appends its quote and author to a list.

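def getAllQuotes(url):
    # Fetch and parse the page here so the function really depends on its url argument.
    res = requests.get(url)
    soup = BeautifulSoup(res.text, "html.parser")

    quotes = []
    for quote_div in soup.find_all("div", attrs={"class": "quote"}):
        quoteText_div = quote_div.find_next("div", attrs={"class": "quoteText"})

        # Drop the tags and surrounding whitespace, then split into lines.
        striped = quoteText_div.text.strip()
        striped_text = striped.split("\n")

        # First line: the quote inside quotation marks. Last line: the author.
        quote = striped_text[0][1:-1]
        author = striped_text[-1].strip()

        quotes.append({"quote": quote, "author": author})

    return quotes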

Now, simply call the function to get all the data.

quote_data = getAllQuotes(url)
print(quote_data)
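
One possible refinement, shown here only as a sketch, is to keep the parsing separate from the download so that it can be tried on any HTML string. The function name parse_quotes and the sample markup are assumptions for illustration. The real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical helper: extracts quote/author pairs from an HTML string.
def parse_quotes(html):
    soup = BeautifulSoup(html, "html.parser")
    quotes = []
    for quote_text_div in soup.find_all("div", class_="quoteText"):
        # Keep only the non-empty lines of the block's text.
        lines = [line.strip() for line in quote_text_div.get_text().split("\n")]
        lines = [line for line in lines if line]
        if len(lines) < 2:
            continue  # skip blocks that lack a quote or an author
        quotes.append({"quote": lines[0].strip("“”"), "author": lines[-1]})
    return quotes

# Made-up markup: one complete block and one empty block.
sample_html = """
<div class="quote">
  <div class="quoteText">
    “First example quote.”
    -
    <span>Example Author</span>
  </div>
</div>
<div class="quote"><div class="quoteText"></div></div>
"""

print(parse_quotes(sample_html))
# [{'quote': 'First example quote.', 'author': 'Example Author'}]
```

Because the function takes a string, the empty block can be tested directly, and a malformed entry is skipped instead of raising an IndexError.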

Convert to CSV

Finally, let’s save this data to a CSV file. This step is optional, and we perform it with the help of the pandas library.

import pandas as pd

df = pd.DataFrame(quote_data)

# index=False leaves the DataFrame's row index out of the file.
df.to_csv("scrap.csv", index=False)
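
If pandas is not otherwise needed, the same kind of file can be produced with Python's built-in csv module. The sketch below uses that module instead of pandas, with made-up rows, and writes to an in-memory buffer rather than to scrap.csv so that the result can be inspected directly.

```python
import csv
import io

# Made-up rows in the same shape as quote_data.
sample_rows = [
    {"quote": "First example quote.", "author": "Example Author"},
    {"quote": "Second, with a comma.", "author": "Another Author"},
]

# Write a header plus one line per row into an in-memory text buffer.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["quote", "author"])
writer.writeheader()
writer.writerows(sample_rows)
csv_text = buffer.getvalue()

# Read the text back to confirm that nothing was lost, commas included.
rows_read_back = list(csv.DictReader(io.StringIO(csv_text)))
print(rows_read_back == sample_rows)  # True
```

To write a real file, open it with open("scrap.csv", "w", newline="", encoding="utf-8") and pass the file object to csv.DictWriter in place of the buffer.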

Conclusion

In this article, we learned how to fetch data from a remote URL, how to extract information using the BeautifulSoup library, and finally how to convert the data into a CSV file. Web scraping with Python is an effective way to extract information from websites that do not offer an API. Using this example, try scraping other websites to get hands-on experience with web scraping.