Step 1: Install Dependencies

Before starting, ensure you have the required libraries installed. Use the following command to install them:

pip install requests beautifulsoup4

Step 2: Import Necessary Libraries

import requests
from bs4 import BeautifulSoup

Step 3: Send a Request to the Website

Use the requests library to fetch the webpage content.

url = "https://example.com"  # Replace with the target URL
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    print("Successfully fetched the webpage!")
else:
    print("Failed to fetch the webpage. Status Code:", response.status_code)

Step 4: Parse the HTML Content

Pass the HTML content to BeautifulSoup for parsing.

soup = BeautifulSoup(response.text, "html.parser")
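
To confirm the parse worked, you can preview a re-indented snippet of the tree:

# Print the first 500 characters of the parsed, re-indented HTML
print(soup.prettify()[:500])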

Step 5: Extract Specific Data

You can extract different elements using tags like <title>, <h1>, <p>, etc.

Extract the Title of the Page

# soup.title is None if the page has no <title> tag
title = soup.title.text if soup.title else "No title found"
print("Page Title:", title)

Extract All Headings (h1 Tags)

headings = soup.find_all("h1")
for h in headings:
    print(h.text)

Extract All Paragraphs (p Tags)

paragraphs = soup.find_all("p")
for p in paragraphs:
    print(p.text)

Extract Links (a Tags)

links = soup.find_all("a")
for link in links:
    href = link.get("href")  # returns None if the tag has no href attribute
    if href:
        print(href)
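
Note that href values are often relative (for example, /about). If you need absolute URLs, you can resolve them against the page URL with urljoin from the standard library:

from urllib.parse import urljoin

for link in soup.find_all("a"):
    href = link.get("href")
    if href:
        print(urljoin(url, href))  # resolves relative paths against the page URL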

Step 6: Extract Data from a Specific Section

If you want to extract data from a particular div or table, use find() or find_all() with the class_ or id keyword arguments (class_ has a trailing underscore because class is a reserved word in Python).

# Replace "example-class" with the actual class name used on your target page
div_content = soup.find("div", class_="example-class")
print(div_content.text if div_content else "No content found!")
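
The same approach works for ids and for tables; in the sketch below, "main-table" is a placeholder id, so substitute whatever your target page actually uses:

# Look up a table by its id attribute ("main-table" is a placeholder)
table = soup.find("table", id="main-table")
if table:
    for row in table.find_all("tr"):
        # Collect the text of every header or data cell in the row
        cells = [cell.text.strip() for cell in row.find_all(["td", "th"])]
        print(cells)
else:
    print("No table found!")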

Step 7: Handle Websites with Dynamic Content

Some websites render their content with JavaScript after the initial page load. requests only retrieves the raw HTML, so the data you want may be missing from the response. In such cases, consider using Selenium, which drives a real browser.

pip install selenium
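
A minimal sketch, assuming Chrome and a recent Selenium (version 4.6+ downloads a matching driver automatically); the rendered HTML is then handed to BeautifulSoup as before:

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # Selenium 4.6+ manages the ChromeDriver binary for you
driver.get("https://example.com")  # replace with the target URL
html = driver.page_source  # the HTML after JavaScript has run
driver.quit()

soup = BeautifulSoup(html, "html.parser")
print(soup.title.text if soup.title else "No title found")

For content that appears a moment after the initial render, you may also need Selenium's explicit waits (WebDriverWait) before reading page_source.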

Conclusion

This tutorial covered the basics of web scraping using BeautifulSoup. If a website blocks your requests, try:

  • Adding headers in the requests.get() call (see the sketch after this list).
  • Using proxies or rotating user-agents.
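
A minimal sketch of the first suggestion: sending a browser-like User-Agent header. The header string below is just an example; any realistic browser string works:

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
response = requests.get(url, headers=headers, timeout=10)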

Once you are comfortable with these basics, good next steps are handling pagination, authenticating to sites that require a login, and pulling data directly from APIs where one is available.