IT Skills

Web Scraping with Python





1. Basics of Web Scraping

What is Web Scraping?

Web scraping involves extracting data from websites using automated tools or scripts.

Key Libraries for Web Scraping:

  • requests: Fetches web pages.
  • BeautifulSoup: Parses HTML and XML.
  • lxml: Faster HTML parsing.
  • selenium: Automates browser interaction.
  • scrapy: Advanced web scraping framework.

Ethical Considerations:

  • Check the website’s robots.txt file.
  • Avoid overloading servers; use delays and respect rate limits.
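
Python's standard library can check robots.txt directly; a minimal sketch, assuming a placeholder site and a hypothetical user-agent string:

```python
import urllib.robotparser

# Load and parse the site's robots.txt (example.com is a placeholder)
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Ask whether our user agent may fetch a given path
print(rp.can_fetch("MyScraperBot", "https://example.com/some-page"))
```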

2. Basic Steps for Web Scraping

  1. Send an HTTP request to the webpage using requests.
  2. Parse the HTML content using BeautifulSoup.
  3. Extract the desired data using tags, classes, or attributes.
  4. Store the data in a structured format (CSV, JSON, or database).
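
Putting the four steps together, a minimal end-to-end sketch (the URL and the saved fields are placeholders, not a specific real site):

```python
import json
import requests
from bs4 import BeautifulSoup

# 1. Send an HTTP request
response = requests.get("https://example.com")  # placeholder URL

# 2. Parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')

# 3. Extract the desired data
records = [{"text": a.text, "href": a['href']}
           for a in soup.find_all('a', href=True)]

# 4. Store the data in a structured format (JSON here)
with open("links.json", "w") as f:
    json.dump(records, f, indent=2)
```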

3. Examples of Web Scraping

Simple Scraping with requests and BeautifulSoup

```python
import requests
from bs4 import BeautifulSoup

# Fetch the webpage
url = "https://example.com"
response = requests.get(url)
html = response.text

# Parse HTML
soup = BeautifulSoup(html, 'html.parser')

# Extract data
title = soup.title.text                                    # Page title
links = [a['href'] for a in soup.find_all('a', href=True)]  # All links

print("Page Title:", title)
print("Links:", links)
```

Scraping Tables

```python
table = soup.find('table')        # Find the table
rows = table.find_all('tr')       # Find all rows

for row in rows:
    columns = row.find_all('td')  # Find columns in each row
    data = [col.text for col in columns]
    print(data)
```
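
As an alternative, if pandas is installed, every `<table>` on a page can be pulled in one call; a minimal sketch (the URL is a placeholder, and read_html needs an HTML parser such as lxml or html5lib available):

```python
import pandas as pd

# read_html parses each <table> element into a DataFrame
tables = pd.read_html("https://example.com")  # placeholder URL
print(tables[0].head())                       # preview the first table
```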

Automating Browsers with Selenium

```python
from selenium import webdriver

# Set up a web driver
driver = webdriver.Chrome()

# Open a webpage
driver.get("https://example.com")

# Interact with the page
search_box = driver.find_element("name", "q")
search_box.send_keys("Python web scraping")
search_box.submit()

# Extract results
results = driver.find_elements("css selector", "h3")
for result in results:
    print(result.text)

driver.quit()
```


4. Useful Techniques

CSS Selectors and XPath:

  • Find elements using CSS selectors:

```python
headlines = soup.select('h1, h2, h3')  # All headline tags
```

  • Use XPath (with Selenium or lxml):

```python
element = driver.find_element("xpath", '//div[@class="example-class"]')
```

Handling Pagination:

  • Identify the "Next Page" link:

```python
next_page = soup.find('a', {'rel': 'next'})['href']
```
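
Building on that, a sketch of a loop that follows rel="next" links until none remain (the start URL is a placeholder; real sites vary in how they mark the next page):

```python
import time
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://example.com/listings"  # placeholder start page
while url:
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    # ... extract data from the current page here ...
    next_link = soup.find('a', {'rel': 'next'})
    url = urljoin(url, next_link['href']) if next_link else None
    time.sleep(1)  # stay polite between requests
```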

Rate Limiting with time.sleep():

  • Add a delay between requests to avoid getting blocked:

```python
import time

for page in range(1, 5):
    response = requests.get(f"https://example.com?page={page}")
    time.sleep(2)  # Pause for 2 seconds
```


5. Specific Situations for Web Scraping

Scenario 1: Scraping Product Prices from E-commerce Sites

  • Use BeautifulSoup or Selenium to scrape product names, prices, and reviews.

```python
prices = soup.find_all('span', class_='price')
for price in prices:
    print(price.text)
```

Scenario 2: Collecting News Articles

  • Extract headlines, links, and publication dates from a news website.

```python
articles = soup.find_all('div', class_='article')
for article in articles:
    title = article.find('h2').text
    link = article.find('a')['href']
    print(f"Title: {title}, Link: {link}")
```

Scenario 3: Downloading Images

  • Scrape image URLs and download them using requests.

```python
images = soup.find_all('img', src=True)
for i, img in enumerate(images):
    img_url = img['src']
    with open(f"image_{i}.jpg", 'wb') as f:  # unique filename per image
        f.write(requests.get(img_url).content)
```

Scenario 4: Job Listings

  • Extract job titles, companies, and locations from job portals.

```python
jobs = soup.find_all('div', class_='job-listing')
for job in jobs:
    title = job.find('h2').text
    company = job.find('h3').text
    print(f"Job: {title}, Company: {company}")
```

Scenario 5: Real-Time Data from APIs

  • Use APIs (if available) for structured data rather than scraping.

```python
response = requests.get("https://api.example.com/data")
data = response.json()
print(data)
```

6. Best Practices for Web Scraping

  • Use Headers: Mimic a real browser.

```python
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers)
```

  • Avoid IP Blocking: Use proxies or rotate IPs (see the sketch after this list).
  • Store Data Efficiently: Save data in files or databases.

```python
import csv

# newline='' prevents blank rows in the CSV on Windows
with open('data.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(["Title", "Link"])
    writer.writerow(["Example Title", "https://example.com"])
```
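
A minimal sketch of routing requests through a proxy, assuming you have a proxy endpoint available (the address below is a documentation placeholder, not a working proxy):

```python
import requests

# Placeholder proxy endpoint; substitute a real one
proxies = {
    "http": "http://203.0.113.10:8080",
    "https": "http://203.0.113.10:8080",
}
response = requests.get("https://example.com", proxies=proxies, timeout=10)
print(response.status_code)
```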


7. Resources for Practice

