
Flatscraper

Finding apartments across multiple listing sites is time-consuming and difficult to track. Flatscraper addresses this challenge by:

  • Multi-site scraping: Extracts property data from major real estate websites: otodom.pl, olx.pl, nieruchomosci-online.pl, and morizon.pl.
  • Smart multi-source deduplication: Merges the same property listed on multiple sites, preferring more complete data sources.
  • Intelligent public transport scoring: Finds the best route across multiple transport profiles (tram, bus) and time windows, accounting for walking distance and transfer complexity.
  • Context-aware ranking: Scores offers based on public transport accessibility, price per square meter, area, and availability of special features like balconies.
  • Aggregated feeds: Generates feeds matching specific criteria, delivered via multiple channels: e-mail digests, HTML, and RSS.
  • SQLite storage: Stores all data in a SQLite database, enabling custom ETL queries.

What Makes This Interesting

The real value isn't the scraping or database code—it's the domain logic you'd need to build for any serious property search system:

  • Public transport route optimization: Computing accessibility isn't just "distance to subway." The system simulates planning your morning commute: it queries departure times at 5-minute intervals across different transport options (tram, bus) to find when you'd actually leave to minimize total travel time—modeling how a real person would plan their route.

  • Multi-source deduplication: The same apartment appears on 3+ sites. A simple solution works here: match offers by exact area and price, then merge, preferring more complete sources (otodom > nieruchomosci-online > morizon). This is good enough for Krakow's market, though a production system would need fuzzy matching and proper conflict resolution. A sketch follows this list.

  • Preference-based filtering: Not just "show me cheap apartments," but "prioritize tram over bus, filter by maximum walking distance, penalize high floors without elevators in old buildings, prefer certain neighborhoods."

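As a concrete illustration of the dedup rule, here is a minimal Haskell sketch of the exact (area, price) merge. The Offer type, field names, and source ordering below are assumptions reconstructed from this README, not the actual flatscraper-core definitions.

```haskell
import Data.List (sortOn)
import qualified Data.Map.Strict as Map

-- Hypothetical source ranking; lower fromEnum = more complete data.
data Source = Otodom | NieruchomosciOnline | Morizon
  deriving (Show, Eq, Ord, Enum, Bounded)

-- Hypothetical offer shape; the real Domain.Offer differs.
data Offer = Offer
  { offerSource :: Source
  , offerArea   :: Int     -- square meters
  , offerPrice  :: Int     -- monthly rent in PLN
  , offerUrl    :: String
  } deriving (Show)

-- Group offers by the exact (area, price) key and keep the offer
-- from the most complete source in each group
-- (otodom > nieruchomosci-online > morizon).
dedupe :: [Offer] -> [Offer]
dedupe offers =
  [ head (sortOn (fromEnum . offerSource) grp)
  | grp <- Map.elems groups
  ]
  where
    groups = Map.fromListWith (++)
      [ ((offerArea o, offerPrice o), [o]) | o <- offers ]
```

A real merge would combine fields from both records instead of discarding the losing source outright, but the exact (area, price) grouping key is the essential trick.
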
These algorithms are language-agnostic—you'd need to implement them whether you used Haskell, Python, or JavaScript. The domain complexity is inherent to the problem, not the technology choice.

While other open source projects focus on either real estate scraping or transit routing, this project bridges that gap—combining offline GTFS-based route optimization with multi-source property aggregation. The two domains (real estate and public transit) are typically handled by different developer communities and tools.

See docs/REFACTORING-TODO.md for details on which pieces of logic are complex enough to warrant testing.

Screenshots

Daily offer web feed: Screenshot of an offer web feed

RSS feed (as viewed in miniflux): Screenshot of RSS feed in miniflux

Design Notes

  • City-specific configuration: Currently optimized for Krakow, Poland, with dedicated transport hubs and points of interest.
  • Offline public transport accessibility scoring: Calculated locally using the loaded GTFS schedule (thanks to the mobroute project); a sketch of the departure-time sampling follows this list.
  • SQLite-centric: All data is stored and processed in SQLite.
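
The departure-time sampling described earlier can be sketched in a few lines of Haskell. Everything here is illustrative: queryRoute stands in for the call into mobroute's offline router, and the real system works over full GTFS data rather than a stub.

```haskell
import Data.List (minimumBy)
import Data.Ord (comparing)

data Profile = Tram | Bus deriving (Show, Eq)

-- Stand-in for the real GTFS query, which flatscraper delegates to
-- mobroute's offline router. Returns door-to-door travel time in
-- minutes for a given profile and departure minute, if a route exists.
queryRoute :: Profile -> Int -> Maybe Int
queryRoute _profile _departureMinute = Just 42  -- stub

-- Sample departures at 5-minute intervals across a time window and
-- across transport profiles, keeping the fastest total trip:
-- roughly "when would a real person leave to minimize travel time?".
bestTrip :: (Int, Int) -> [Profile] -> Maybe (Profile, Int, Int)
bestTrip (windowStart, windowEnd) profiles =
  case candidates of
    [] -> Nothing
    xs -> Just (minimumBy (comparing travelTime) xs)
  where
    travelTime (_, _, t) = t
    candidates =
      [ (profile, dep, t)
      | profile <- profiles
      , dep     <- [windowStart, windowStart + 5 .. windowEnd]
      , Just t  <- [queryRoute profile dep]
      ]
```

For example, bestTrip (8 * 60, 9 * 60) [Tram, Bus] scans an 08:00-09:00 window, which is the "simulate planning your morning commute" idea in code.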

Architecture

This project follows Clean Architecture with compile-time enforcement via Haskell's module system.

Package Structure

The codebase uses 6 internal Haskell packages (not for code size, but for compile-time architecture enforcement):

  • flatscraper-core - Domain entities (Domain.Offer, Domain.PublicTransport)

    • Cannot import: database libraries, HTTP clients, scrapers, or any infrastructure
    • Dependencies: Only pure libraries (lens, containers)
  • flatscraper-usecase - Business logic orchestration

    • Can import: flatscraper-core
    • Cannot import: adapters (database, scrapers, presenters)
  • flatscraper-adapters-* - Infrastructure implementations

    • Can import: flatscraper-core, flatscraper-usecase
    • Includes: dataaccess (SQLite), scraper (HTML parsing), presenter (RSS/HTML), view (email)

If you try to violate dependency rules (e.g., import SQLite in the domain layer), GHC will refuse to compile. This prevents architectural erosion over time.
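
To make the enforcement concrete, a hypothetical cabal stanza for the core package could look like the following; the actual package files in this repository may be organized differently (e.g., via Stack's package.yaml).

```
-- flatscraper-core.cabal (illustrative, not the actual file)
library
  exposed-modules:
      Domain.Offer
      Domain.PublicTransport
  build-depends:
      base
    , containers
    , lens
  -- No sqlite-simple, http-client, or HTML-parsing packages are
  -- declared, so an infrastructure import in the domain layer
  -- simply fails to resolve at compile time.
```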

Dependency Flow

flatscraper-core (Domain entities - NO dependencies)
    ↑
flatscraper-usecase (depends on: core)
    ↑
flatscraper-adapters-* (depends on: core, usecase)
    ↑
CLI executables (depends on: all layers)

Dependencies flow inward only. Use cases define interfaces (e.g., QueryAccess, ScoreboardQueryAccess) implemented by adapters in outer layers.
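
One common way to express such an interface in Haskell is a record of functions defined in the use-case layer and constructed by an adapter. The shape below is a guess for illustration; the interface names come from this README, but the fields and types are not the repository's actual code.

```haskell
import Data.List (sortOn)
import Data.Ord (Down (..))

-- Placeholder domain types; the real ones live in Domain.Offer.
newtype OfferId = OfferId Int
data Offer = Offer { offerId :: OfferId, offerScore :: Double }

-- A port owned by flatscraper-usecase. Field names and types are
-- illustrative guesses.
data QueryAccess m = QueryAccess
  { fetchNewOffers :: m [Offer]
  , markOfferSeen  :: OfferId -> m ()
  }

-- Business logic depends only on the port, never on SQLite or HTTP.
rankFreshOffers :: Monad m => QueryAccess m -> m [Offer]
rankFreshOffers qa = do
  offers <- fetchNewOffers qa
  mapM_ (markOfferSeen qa . offerId) offers
  pure (sortOn (Down . offerScore) offers)
```

An adapter such as flatscraper-adapters-dataaccess can then build a QueryAccess IO value backed by SQLite, and a test suite can build one backed by an in-memory list, with no change to the use case.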

Pragmatic Trade-off: SQL-based Scoring

While following Clean Architecture, this project makes an intentional exception: scoring business logic lives in SQL views (sql/views.sql) rather than the domain layer. This enables rapid experimentation with scoring formulas without recompilation, at the cost of reduced testability. See docs/DUAL-SCORING-ARCHITECTURE.md for details on the migration path to domain-based scoring.
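
For flavor, a scoring view might look roughly like the hypothetical snippet below; this is not the contents of sql/views.sql, and the column names and weights are invented for illustration.

```sql
-- Hypothetical scoring view; the real sql/views.sql differs.
CREATE VIEW offer_scores AS
SELECT
  o.id,
  -- Cheaper per square meter scores higher; weights can be tweaked
  -- here and re-queried immediately, with no recompilation.
  1000.0 * o.area / o.price
    + 0.5 * o.area
    + CASE WHEN o.has_balcony THEN 50 ELSE 0 END
    - 2.0 * o.transit_minutes AS score
FROM offers o;
```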

Command-line Tools

The project provides several executables:

  • flatscraper-scrape: Extract new offers
  • flatscraper-process-location: Enrich offers with location data
  • flatscraper-gen-feed: Generate HTML/RSS feeds
  • flatscraper-send-digest: Send email digests
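
A typical end-to-end run invokes them in this order (a sketch; exact arguments and configuration are installation-specific):

```
stack exec flatscraper-scrape            # fetch and store new offers
stack exec flatscraper-process-location  # enrich offers with location data
stack exec flatscraper-gen-feed          # regenerate the HTML and RSS feeds
stack exec flatscraper-send-digest       # e-mail the daily digest
```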

Setup and Installation

  1. Install Haskell Stack.
  2. Install mobroute and load the GTFS schedule.
  3. Clone this repository and run stack build.
  4. Create an SMTP credentials file (for email functionality).
  5. Run the desired command-line tool. For periodic scraping, consider installing systemd units.
  6. Set up a web server and copy index.html to the webroot. Configure cron or systemd timers to generate HTML/RSS feeds automatically.
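
For step 6, a minimal systemd service/timer pair might look like this (the binary path and schedule are examples only):

```
# flatscraper-gen-feed.service
[Unit]
Description=Regenerate flatscraper HTML/RSS feeds

[Service]
Type=oneshot
ExecStart=/usr/local/bin/flatscraper-gen-feed

# flatscraper-gen-feed.timer
[Unit]
Description=Run flatscraper-gen-feed hourly

[Timer]
OnCalendar=hourly
Persistent=true

[Install]
WantedBy=timers.target
```

Enable it with systemctl enable --now flatscraper-gen-feed.timer.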

Project Background

This project was created before property listing sites began offering e-mail notifications for new offers matching specified filters. The goal was to automate offer filtering, reducing the number of listings that required human review. One challenge was eliminating duplicates of a single offer, re-posted multiple times across different sites. The project started in April 2019.

By 2025, most sites had implemented new-offer notifications, but this did not solve the problem of receiving multiple notifications for a single re-posted offer. At that point, the scoring system was introduced to help identify unusual offers in the market.
