Building a Modern Web Scraper with Python

{development}{tutorial}::#python#web-scraping#tutorial#async

Web scraping is an essential skill for any developer working with data. In this post, I'll walk you through building a modern, production-ready web scraper that handles real-world challenges like rate limiting, error handling, and async operations.

Why Web Scraping Matters

In today's data-driven world, not all information is available through APIs. Web scraping fills this gap by allowing us to extract structured data from websites programmatically. Whether you're:

Monitoring competitor pricing
Aggregating news articles
Building datasets for machine learning
Tracking product availability

A well-built scraper can save countless hours of manual work.

Always check a website's robots.txt and terms of service before scraping. Respect rate limits and be a good internet citizen.

Architecture Overview

Here's a high-level view of our scraper architecture:

Web Scraper Architecture

The data flows through four main stages: fetching HTML, parsing content, validating data, and storing results. Each stage is isolated and can be tested independently.

The Tech Stack

For this project, we'll use:

Python 3.11+ - Modern Python with excellent async support
httpx - Async HTTP client (faster than requests)
Beautiful Soup 4 - HTML parsing and navigation
Pydantic - Data validation and serialization
tenacity - Retry logic with exponential backoff

Project Structure

Here's how I organize my scraping projects:

scraper/
├── __init__.py
├── models.py      # Pydantic models for data validation
├── scraper.py     # Core scraping logic
├── parser.py      # HTML parsing functions
└── utils.py       # Helper functions (rate limiting, etc.)

Core Implementation

1. Setting Up the HTTP Client

First, let's create an async HTTP client with proper headers and timeout configuration:

import httpx
from typing import Optional

class ScraperClient:
    """Async HTTP client for web scraping with rate limiting."""

    def __init__(
        self,
        base_url: str,
        timeout: float = 30.0,
        max_retries: int = 3
    ):
        self.base_url = base_url
        self.timeout = timeout
        self.max_retries = max_retries

        # Configure client with realistic headers
        self.client = httpx.AsyncClient(
            timeout=httpx.Timeout(timeout),
            headers={
                "User-Agent": "Mozilla/5.0 (compatible; MyBot/1.0)",
                "Accept": "text/html,application/xhtml+xml",
                "Accept-Language": "en-US,en;q=0.9",
            },
            follow_redirects=True,
        )

    async def get(self, url: str) -> Optional[str]:
        """Fetch a URL with retry logic."""
        try:
            response = await self.client.get(url)
            response.raise_for_status()
            return response.text
        except httpx.HTTPError as e:
            print(f"HTTP error occurred: {e}")
            return None

    async def close(self):
        """Clean up the client."""
        await self.client.aclose()

Using httpx.AsyncClient instead of requests can improve scraping speed by 3-5x when fetching multiple pages concurrently.

2. Data Models with Pydantic

Define your data structure upfront. This makes validation automatic and catches errors early:

1 from pydantic import BaseModel, HttpUrl, Field
2 from datetime import datetime
3 from typing import Optional, List
4 
5 class Article(BaseModel):
6     """Represents a scraped article."""
7 
8     title: str = Field(..., min_length=1, max_length=500)
9     url: HttpUrl
10     author: Optional[str] = None
11     published_date: Optional[datetime] = None
12     content: str = Field(..., min_length=10)
13     tags: List[str] = Field(default_factory=list)
14 
15     class Config:
16         json_encoders = {
17             datetime: lambda v: v.isoformat(),
18             HttpUrl: lambda v: str(v),
19         }
20 
21 class ScrapingResult(BaseModel):
22     """Container for scraping results."""
23 
24     articles: List[Article]
25     scraped_at: datetime = Field(default_factory=datetime.now)
26     total_count: int
27     success_rate: float

3. HTML Parsing

Beautiful Soup makes HTML parsing straightforward:

from bs4 import BeautifulSoup
from typing import List, Optional

def parse_article(html: str) -> Optional[Article]:
    """Extract article data from HTML."""
    soup = BeautifulSoup(html, 'html.parser')

    # Find the main article container
    article_elem = soup.find('article', class_='post')
    if not article_elem:
        return None

    # Extract fields with fallbacks
    title = article_elem.find('h1', class_='title')
    title_text = title.get_text(strip=True) if title else None

    author = article_elem.find('span', class_='author')
    author_text = author.get_text(strip=True) if author else None

    content = article_elem.find('div', class_='content')
    content_text = content.get_text(strip=True) if content else ""

    # Extract tags
    tags = [
        tag.get_text(strip=True)
        for tag in article_elem.find_all('a', class_='tag')
    ]

    try:
        return Article(
            title=title_text,
            url=article_elem.find('link', rel='canonical')['href'],
            author=author_text,
            content=content_text,
            tags=tags,
        )
    except Exception as e:
        print(f"Failed to parse article: {e}")
        return None

Advanced Techniques

Rate Limiting

Respect the server by implementing rate limiting:

import asyncio
from collections import deque
from time import time

class RateLimiter:
    """Token bucket rate limiter."""

    def __init__(self, rate: int, per: float):
        self.rate = rate  # requests
        self.per = per    # seconds
        self.allowance = rate
        self.last_check = time()

    async def acquire(self):
        """Wait until a request is allowed."""
        current = time()
        time_passed = current - self.last_check
        self.last_check = current

        self.allowance += time_passed * (self.rate / self.per)
        if self.allowance > self.rate:
            self.allowance = self.rate

        if self.allowance < 1.0:
            sleep_time = (1.0 - self.allowance) * (self.per / self.rate)
            await asyncio.sleep(sleep_time)
            self.allowance = 0.0
        else:
            self.allowance -= 1.0

Concurrent Scraping

Process multiple pages concurrently while respecting rate limits:

async def scrape_multiple_urls(
    urls: List[str],
    client: ScraperClient,
    rate_limiter: RateLimiter,
    max_concurrent: int = 5
) -> List[Article]:
    """Scrape multiple URLs concurrently."""

    semaphore = asyncio.Semaphore(max_concurrent)

    async def scrape_one(url: str) -> Optional[Article]:
        async with semaphore:
            await rate_limiter.acquire()
            html = await client.get(url)
            if html:
                return parse_article(html)
            return None

    tasks = [scrape_one(url) for url in urls]
    results = await asyncio.gather(*tasks, return_exceptions=True)

    # Filter out None and exceptions
    return [r for r in results if isinstance(r, Article)]

Error Handling Best Practices

Never let a single failed request crash your entire scraping job. Always handle exceptions gracefully and log failures for later review.

Here's a robust error handling pattern:

from tenacity import retry, stop_after_attempt, wait_exponential
import logging

logger = logging.getLogger(__name__)

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=10)
)
async def fetch_with_retry(url: str, client: ScraperClient) -> str:
    """Fetch URL with exponential backoff retry."""
    try:
        html = await client.get(url)
        if html is None:
            raise ValueError(f"Failed to fetch {url}")
        return html
    except Exception as e:
        logger.error(f"Error fetching {url}: {e}")
        raise

Performance Comparison

Here's how different approaches compare when scraping 100 pages:

Approach	Time (seconds)	Requests/sec	Memory (MB)	Notes
Sequential (requests)	245	0.4	45	Baseline, very slow
Sequential (httpx)	198	0.5	42	Slightly faster
Async (httpx, no limit)	12	8.3	78	Fast but risky
Async (httpx, rate limited)	35	2.9	65	Recommended approach
Async (httpx, concurrent=5)	28	3.6	58	Best balance

These benchmarks were run on a 2023 MacBook Pro with a 100Mbps connection. Your results may vary based on network conditions and target server response times.

Putting It All Together

Here's the complete main function:

import asyncio
import json
from pathlib import Path

async def main():
    """Main scraping workflow."""

    # Configuration
    base_url = "https://example.com"
    urls_to_scrape = [
        f"{base_url}/article/{i}"
        for i in range(1, 101)
    ]

    # Initialize components
    client = ScraperClient(base_url)
    rate_limiter = RateLimiter(rate=10, per=60)  # 10 req/min

    try:
        # Scrape articles
        print(f"Scraping {len(urls_to_scrape)} articles...")
        articles = await scrape_multiple_urls(
            urls_to_scrape,
            client,
            rate_limiter,
            max_concurrent=5
        )

        # Create result
        result = ScrapingResult(
            articles=articles,
            total_count=len(urls_to_scrape),
            success_rate=len(articles) / len(urls_to_scrape)
        )

        # Save to file
        output_path = Path("scraped_data.json")
        output_path.write_text(result.json(indent=2))

        print(f"✓ Scraped {len(articles)} articles")
        print(f"✓ Success rate: {result.success_rate:.1%}")
        print(f"✓ Saved to {output_path}")

    finally:
        await client.close()

if __name__ == "__main__":
    asyncio.run(main())

Key Takeaways

After building dozens of scrapers, here are my top recommendations:

Start simple - Get basic scraping working before adding complexity
Validate early - Use Pydantic or similar to catch data issues immediately
Be respectful - Implement rate limiting and respect robots.txt
Handle failures - Network issues happen; plan for them
Monitor performance - Track success rates and response times
Test thoroughly - Websites change; your scraper should be resilient

Richardson, L. (2023). Beautiful Soup Documentation. Crummy.com. https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Encode. (2024). HTTPX: A next-generation HTTP client for Python. GitHub. https://github.com/encode/httpx

Colvin, S. (2024). Pydantic: Data validation using Python type hints. Pydantic Documentation. https://docs.pydantic.dev/

What's Next?

In future posts, I'll cover:

Handling JavaScript-heavy sites with Playwright
Distributed scraping with Celery and Redis
Legal considerations and ethical scraping practices
Building a scraping API with FastAPI

Check out my GitHub for the complete source code and additional examples. Feel free to open issues or contribute improvements!

Have questions or suggestions? Found this helpful? Let me know on GitHub or reach out via the links in the footer.

1	from pydantic import BaseModel, HttpUrl, Field
2	from datetime import datetime
3	from typing import Optional, List
4
5	class Article(BaseModel):
6	"""Represents a scraped article."""
7
8	title: str = Field(..., min_length=1, max_length=500)
9	url: HttpUrl
10	author: Optional[str] = None
11	published_date: Optional[datetime] = None
12	content: str = Field(..., min_length=10)
13	tags: List[str] = Field(default_factory=list)
14
15	class Config:
16	json_encoders = {
17	datetime: lambda v: v.isoformat(),
18	HttpUrl: lambda v: str(v),
19	}
20
21	class ScrapingResult(BaseModel):
22	"""Container for scraping results."""
23
24	articles: List[Article]
25	scraped_at: datetime = Field(default_factory=datetime.now)
26	total_count: int
27	success_rate: float