# 🏠 UK Real Estate Scraper (Technical Challenge)

## 🎯 Project Overview

A specialized web scraping tool built to extract property data from highly secured UK real estate platforms. This project was a deep dive into bypassing modern anti-bot measures and handling dynamic JavaScript rendering using browser automation.

πŸ› οΈ Tech Stack


## 🚀 The Journey: Overcoming Technical Hurdles

This project was significantly more challenging than standard scraping tasks due to the target's security layers, which made finally extracting data all the more rewarding.

### The Scrapy vs. JS Wall

The initial attempt used Scrapy, which failed to capture any data because the property listings were injected via asynchronous JavaScript: the site returned a 200 OK, but with a nearly empty HTML body (~4 KB).
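The failure mode can be reproduced offline. The markup and CSS selector below are illustrative stand-ins (the real site's structure is not shown in this README), but they demonstrate why a static parser comes back empty-handed:

```python
from bs4 import BeautifulSoup

# Simulate what Scrapy actually received: a shell page whose listing
# container stays empty until client-side JavaScript populates it.
html = """
<html><body>
  <div id="listings"></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
cards = soup.select("#listings .property-card")  # selector is illustrative
print(len(cards))  # 0 -- nothing for a static parser to extract
```

A JS-capable browser is the only way to see the populated DOM, which motivated the pivot below.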

### Selenium Pivot

Switched to Selenium to simulate a real user environment.

* **Custom Binary Pathing:** Configured the script to point directly to the Chrome binary to resolve environment-specific conflicts.
* **Stealth Configuration:** Implemented `AutomationControlled` flags and custom User-Agents to reduce the bot fingerprint.

### Handling Dynamic Content

### Bypassing Interstitials

Handled complex cookie consent banners that visually obstructed the data layer, using XPath selectors to clear the UI "fog."

## 📊 Data Output

The scraper successfully extracted property titles and prices into a structured JSON format. A sample record:

```json
[
    {
        "title": "Church Gate",
        "price": "£290,000"
    }
]
```
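Serialising the records is standard-library `json`; the one non-obvious detail is `ensure_ascii=False`, which keeps the £ sign readable instead of escaping it (the file name here is illustrative):

```python
import json

# Scraped records as Python dicts (sample values from the output above):
listings = [{"title": "Church Gate", "price": "£290,000"}]

# ensure_ascii=False writes the pound sign as-is rather than "\u00a3":
output = json.dumps(listings, ensure_ascii=False, indent=4)
print(output)

with open("listings.json", "w", encoding="utf-8") as f:
    f.write(output)
```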


βš™οΈ Prerequisites

Before running the scraper, ensure you have the following installed:

  1. Google Chrome: This scraper uses the Chrome browser to render JavaScript. Download it from https://www.google.com/chrome/.
  2. Python 3.10+: Developed and tested on Python 3.13.7.
  3. Chrome Driver: Automatically managed via webdriver-manager, but requires Chrome to be installed in the default system path.

## 🚀 How to Run

1. Clone the repository:

   ```bash
   git clone https://github.com/reory/real_estate_web_scraper.git
   cd real_estate_web_scraper
   ```

2. Set up a virtual environment:

   ```bash
   python -m venv venv
   # On Windows:
   .\venv\Scripts\activate
   # On macOS/Linux:
   source venv/bin/activate
   ```

3. Install dependencies:

   ```bash
   pip install selenium beautifulsoup4 lxml webdriver-manager
   ```

4. Configure the Chrome path: open `crackukproperty.py` and ensure `chrome_options.binary_location` points to your `chrome.exe`.

5. Execute the scraper:

   ```bash
   python crackukproperty.py
   ```

πŸ›‘οΈEthical Note

To ensure compliance with the target's Terms of Service and to prevent IP blacklisting, the project was concluded once the Proof of Concept (PoC) was achieved. Extensive crawling was intentionally avoided to respect server resources.


## 📈 Key Learnings

* The difference between static HTML scraping and dynamic DOM interaction.
* Debugging Selenium `TimeoutException`s.
* Managing Python virtual environments and interpreter paths in VS Code.


πŸ› οΈ Detailed Tech Stack Logic


## 🤝 Contributing

Contributions are what make the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement".

  1. Fork the Project
  2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
  3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
  4. Push to the Branch (`git push origin feature/AmazingFeature`)
  5. Open a Pull Request

## 📒 Potential Improvements

Since this project was a Proof of Concept (PoC), there are several areas for growth:

* **Pagination Logic:** Automating the "Next" button click to crawl multiple pages.
* **Data Cleaning:** Implementing Pydantic models to cast prices as integers and validate addresses.
* **Headless Mode:** Optimizing the Selenium configuration to run without a visible browser window.
* **Proxy Integration:** Adding rotating proxies to further reduce the risk of IP rate-limiting.
* **Playwright Integration:** Adding Playwright, which behaves more like a real user and can scrape more effectively.
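The data-cleaning idea can be sketched with a small Pydantic v2 model that casts the scraped `"£290,000"` string to an integer. Field names mirror the JSON output above; the model itself is a proposal, not existing project code:

```python
from pydantic import BaseModel, field_validator

class Listing(BaseModel):
    """Proposed validated form of one scraped record."""
    title: str
    price: int  # pence-free integer pounds, e.g. 290000

    @field_validator("price", mode="before")
    @classmethod
    def parse_price(cls, v):
        # Accept raw scraped strings like "£290,000" as well as ints.
        if isinstance(v, str):
            return int(v.replace("£", "").replace(",", ""))
        return v

record = Listing(title="Church Gate", price="£290,000")
print(record.price)  # 290000
```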


## 💖 Acknowledgments


Built by **Roy Peters** 😁