
Greystar Properties Scraper

Overview

This scraper is designed to extract property information from Greystar in a robust and efficient way, using parallel processing and advanced error handling.

Key Features

🔄 Resilient Processing

  • Automatic resumption: If the process is interrupted, it continues from where it left off
  • State persistence: Saves progress to JSON files for recovery (see the sketch after this list)
  • Robust error handling: Continues processing even if some sites fail
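
A minimal sketch of this persistence logic, using the file names listed under Persistence Files below and assuming the progress file stores a flat array of URLs (the helper names are illustrative, not taken from the source):

const fs = require('fs');

const PROGRESS_FILE = 'greystar_progress.json';

// Load previously processed URLs, or start fresh if no progress file exists
function loadProgress() {
    if (!fs.existsSync(PROGRESS_FILE)) return new Set();
    return new Set(JSON.parse(fs.readFileSync(PROGRESS_FILE, 'utf8')));
}

// Persist after every record so an interruption loses at most one URL
function saveProgress(processedUrls) {
    fs.writeFileSync(PROGRESS_FILE, JSON.stringify([...processedUrls], null, 2));
}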

⚡ Parallel Processing

  • 10 concurrent workers: Processes multiple properties simultaneously
  • Intelligent distribution: Divides tasks evenly among workers (see the chunking sketch below)
  • Resource optimization: Specific configuration for headless browsing
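
A sketch of this distribution, assuming links are dealt round-robin into one chunk per worker (chunkify and runWorkers are illustrative helpers, not the repository's exact code):

// Split an array into n roughly equal chunks, one per worker
function chunkify(items, n) {
    const chunks = Array.from({ length: n }, () => []);
    items.forEach((item, i) => chunks[i % n].push(item));
    return chunks;
}

// Run one async worker per chunk and wait for all of them to finish
async function runWorkers(links, workerFn, n = 10) {
    await Promise.all(chunkify(links, n).map((chunk, id) => workerFn(chunk, id)));
}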

🎯 Data Validation

  • Quality filtering: Only saves records with complete information
  • Required fields: Phone, zip code, state, and address/city
  • Detailed logging: Shows which records are skipped and why

System Architecture

Persistence Files

File                    | Purpose                                  | Format
greystar_links.json     | Complete list of extracted links         | JSON
greystar_progress.json  | URLs already processed, for resumption   | JSON
greystar_properties.csv | Extracted property data                  | CSV

Processing Flow

  1. Link Extraction (First time only)

    • Navigates to https://www.greystar.com/properties
    • Extracts all property links by state
    • Saves to greystar_links.json
  2. Progress Verification

    • Loads previous progress from greystar_progress.json
    • Filters already processed links
    • Continues only with pending links
  3. Parallel Processing

    • Divides links into 10 chunks
    • Processes each chunk in a separate worker
    • Updates progress after each record
  4. Validation and Saving

    • Validates that each record has the minimum required data
    • Saves only complete records to CSV
    • Marks all attempts as processed
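
A sketch tying steps 2 and 3 together, reusing the loadProgress and runWorkers helpers sketched earlier (processChunk stands in for the real per-worker function):

// Inside the main async entry point:
const processed = loadProgress();
const pending = allLinks.filter(link => !processed.has(link.communityUrl));
console.log(`${allLinks.length} total, ${processed.size} already done, ${pending.length} pending`);
await runWorkers(pending, processChunk);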

Installation

# Clone the repository
git clone <repository-url>
cd greystarScrapy

# Install dependencies
npm install

# Run the scraper
node greystar_paralell_scrapy_v2.js

Requirements

  • Node.js 14+
  • Google Chrome (for Puppeteer)
  • 8GB+ RAM (recommended for parallel processing)

Usage

The scraper will automatically:

  1. Extract property links if not already done
  2. Resume from the last processed property
  3. Save data to CSV as it processes
  4. Handle errors gracefully

Output

The scraper generates:

  • greystar_properties.csv: Main data file with all extracted properties
  • greystar_links.json: Cache of all property links
  • greystar_progress.json: Progress tracking for resumption

Data Extraction Methods

🔍 Multi-Method Strategy

  1. Structured JSON-LD

    // Search in scripts with structured data
    script[type="application/ld+json"]
  2. Meta Tags

    // Search in meta properties and names
    meta[property], meta[name]
  3. Text Analysis

    // Regex patterns for complete addresses
    /(\d+\s+[A-Za-z\s]+(?:Street|St|Avenue|Ave...).*[A-Z]{2}\s+\d{5})/gi
  4. Specific Selectors

    // Elements with address-related classes
    [class*="address"], [class*="location"], [class*="contact"]

📞 Phone Extraction

  1. Phone links

    a[href^="tel:"]
  2. Text patterns

    /(?:\+?1[-.\s]?)?\(?([0-9]{3})\)?[-.\s]?([0-9]{3})[-.\s]?([0-9]{4})/
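
A sketch combining both phone methods, preferring tel: links and falling back to the text pattern (illustrative, not the exact source code):

const phone = await page.evaluate(() => {
    // 1. Prefer an explicit tel: link
    const tel = document.querySelector('a[href^="tel:"]');
    if (tel) return tel.getAttribute('href').replace('tel:', '').trim();
    // 2. Fall back to the US phone pattern in the visible text
    const m = document.body.innerText.match(
        /(?:\+?1[-.\s]?)?\(?([0-9]{3})\)?[-.\s]?([0-9]{3})[-.\s]?([0-9]{4})/
    );
    return m ? `+1 ${m[1]} ${m[2]} ${m[3]}` : null;
});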

Data Validation

Validation Criteria

A record is considered valid if it has:

  • Phone: Valid format with country code
  • Zip Code: US format (5 digits or ZIP+4)
  • State: 2-letter code (e.g., CA, NY, TX)
  • Address or City: At least one of the two fields

Validation Example

function isValidRecord(communityData, addressParts, community) {
    const hasPhone = communityData.phone && communityData.phone.trim() !== '';
    const hasZip = addressParts.zip && addressParts.zip.trim() !== '';
    const hasState = addressParts.state && addressParts.state.trim() !== '';
    const hasAddressOrCity = (addressParts.address && addressParts.address.trim() !== '') || 
                            (addressParts.city && addressParts.city.trim() !== '');
    
    return hasPhone && hasZip && hasState && hasAddressOrCity;
}
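
The function above only checks that fields are non-empty. A stricter variant that also enforces the formats from the criteria list (5-digit or ZIP+4 codes, 2-letter state) might look like this; it is a sketch, not the repository's code:

// Stricter validation: checks formats, not just presence (illustrative)
function isStrictlyValid(communityData, addressParts) {
    const phoneOk = /(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}/.test(communityData.phone || '');
    const zipOk = /^\d{5}(-\d{4})?$/.test((addressParts.zip || '').trim());
    const stateOk = /^[A-Z]{2}$/.test((addressParts.state || '').trim());
    const placeOk = Boolean((addressParts.address || '').trim() || (addressParts.city || '').trim());
    return phoneOk && zipOk && stateOk && placeOk;
}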

System Configuration

Browser Configuration

{
    headless: true,                    // Headless mode
    timeout: 20000,                    // 20 seconds per page
    workers: 10,                       // Parallel processing
    delay: 800,                        // Pause between requests (ms)
    retries: 3                         // Attempts per URL
}

Chrome Arguments

[
    '--no-sandbox',
    '--disable-setuid-sandbox',
    '--disable-dev-shm-usage',
    '--disable-gpu',
    '--disable-extensions',
    '--disable-background-timer-throttling'
]
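
These settings plug into Puppeteer roughly as follows (a sketch; the exact wiring in greystar_paralell_scrapy_v2.js may differ):

const puppeteer = require('puppeteer');

// Launch one browser per worker with the configuration shown above
async function launchBrowser() {
    const browser = await puppeteer.launch({
        headless: true,
        args: [
            '--no-sandbox',
            '--disable-setuid-sandbox',
            '--disable-dev-shm-usage',
            '--disable-gpu',
            '--disable-extensions',
            '--disable-background-timer-throttling'
        ]
    });
    const page = await browser.newPage();
    page.setDefaultNavigationTimeout(20000);   // 20 seconds per page
    return { browser, page };
}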

Data Structure

CSV Output Format

state_name,communityName,address,city,state_address,zip,phone,email

Record Example

California,"The Residences at Marina Bay","1000 Marina Bay Dr","Richmond","CA","94804","+1 510 555 1234","residencesmarinabay@greystar.com"
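
Appending a record in that format, with quoting so that fields containing commas stay intact (toCsvRow is an illustrative helper, not the source's writer):

const fs = require('fs');

// Quote each field and escape embedded quotes per RFC 4180
function toCsvRow(fields) {
    return fields.map(f => `"${String(f ?? '').replace(/"/g, '""')}"`).join(',') + '\n';
}

fs.appendFileSync('greystar_properties.csv', toCsvRow([
    'California', 'The Residences at Marina Bay', '1000 Marina Bay Dr',
    'Richmond', 'CA', '94804', '+1 510 555 1234', 'residencesmarinabay@greystar.com'
]));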

JSON Links Format

{
    "extractedAt": "2025-07-10T21:25:05.595Z",
    "totalLinks": 3341,
    "links": [
        {
            "state": "California",
            "communityName": "The Residences at Marina Bay",
            "communityUrl": "https://www.greystar.com/properties/..."
        }
    ]
}

Error Handling

Types of Handled Errors

  1. Page Timeouts

    • Timeout configured to 20 seconds
    • Marks as processed and continues
  2. Navigation Errors

    • Pages not found (404)
    • Connectivity issues
    • Continues with next link
  3. Extraction Errors

    • Pages with different structure
    • JavaScript not executed
    • Saves empty record but marks as processed
  4. Validation Errors

    • Incomplete data
    • Incorrect formats
    • Skips from CSV but marks as processed
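
A sketch of the per-URL handling with the configured 3 retries; extractCommunityData stands in for the real extraction function:

async function processWithRetries(page, link, retries = 3) {
    for (let attempt = 1; attempt <= retries; attempt++) {
        try {
            await page.goto(link.communityUrl, { waitUntil: 'domcontentloaded', timeout: 20000 });
            return await extractCommunityData(page);   // hypothetical extractor
        } catch (err) {
            console.error(`✗ Attempt ${attempt}/${retries} failed for ${link.communityName}: ${err.message}`);
        }
    }
    return null;   // all attempts failed; the caller still marks the URL as processed
}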

Logging and Monitoring

Logging Levels

// General information
console.log('Worker 0: ✓ Processed 15/335 - Community Name');

// Warnings (incomplete data)
console.log('Worker 0: ⚠️ Incomplete record skipped - Community Name');

// Errors
console.error('Worker 0: ✗ Error processing Community Name: timeout');

Progress Metrics

  • Total links found
  • Already processed links
  • Remaining links
  • Valid records saved
  • Records skipped by validation

System Usage

Execution

# Run with caffeinate (macOS) to keep the machine awake during long runs
caffeinate node greystar_paralell_scrapy_v2.js

Restart After Interruption

# System automatically detects previous progress
node greystar_paralell_scrapy_v2.js

Start from Scratch

# Remove state files
rm greystar_links.json greystar_progress.json greystar_properties.csv

# Run again
node greystar_paralell_scrapy_v2.js

Implemented Optimizations

Performance

  • Headless browsing: Navigation without GUI
  • Parallel processing: 10 simultaneous workers
  • Optimized timeouts: Balance between speed and stability
  • Controlled pauses: Avoids server overload

Data Quality

  • Strict validation: Only complete records
  • Smart parsing: Multiple extraction methods
  • Normalization: Consistent phone formats (see the sketch below)
  • Email generation: Derived from community names
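
One plausible implementation of the normalization and email rules, consistent with the record example shown earlier ("The Residences at Marina Bay" becomes residencesmarinabay@greystar.com); the stop-word list is an assumption:

// Normalize any captured US number to "+1 XXX XXX XXXX"
function normalizePhone(raw) {
    const digits = (raw || '').replace(/\D/g, '').replace(/^1(?=\d{10}$)/, '');
    return digits.length === 10
        ? `+1 ${digits.slice(0, 3)} ${digits.slice(3, 6)} ${digits.slice(6)}`
        : '';
}

// Build an email from the community name, dropping common stop words
function generateEmail(communityName) {
    const slug = communityName.toLowerCase()
        .split(/\s+/)
        .filter(w => !['the', 'at', 'of'].includes(w))
        .join('')
        .replace(/[^a-z0-9]/g, '');
    return `${slug}@greystar.com`;
}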

Robustness

  • Persistent state: Automatic recovery
  • Error handling: Continues despite individual failures
  • Detailed logging: Facilitates debugging and monitoring
  • Concurrency-safe: Writes to shared state files are serialized to avoid corruption

Technical Considerations

Memory and CPU

  • Each worker consumes ~50-100MB RAM
  • 10 workers = ~500MB-1GB total RAM
  • CPU: Uses multiple cores efficiently

Network and Connectivity

  • ~1 request per second per worker
  • Total: ~10 requests/second
  • Respectful of the target server

Storage

  • JSON links: ~500KB - 1MB
  • Progress JSON: Grows to ~500KB
  • Final CSV: ~1-5MB (depending on valid data)

Maintenance

Necessary Updates

  1. CSS Selectors: If Greystar changes HTML structure
  2. Address Patterns: For new address formats
  3. Timeouts: Adjust according to server speed
  4. Validation: Stricter or more flexible criteria

Recommended Monitoring

  • Review logs every 30 minutes during execution
  • Verify data quality in intermediate CSV
  • Monitor system resource usage
  • Validate that progress is saved correctly

Note: This system is designed to be robust and efficient; it respects the target website's terms of service and implements appropriate delays to avoid overloading the server.

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Commit your changes
  4. Push to the branch
  5. Create a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Keywords

web-scraping, puppeteer, greystar, real-estate, property-scraper, nodejs, parallel-processing, data-extraction, automation, csv-export, rental-properties, apartment-scraper, headless-browser, resilient-scraping, property-data
