This scraper is designed to extract property information from Greystar in a robust and efficient way, using parallel processing and advanced error handling.
- Automatic resumption: If the process is interrupted, it continues from where it left off
- State persistence: Saves progress in JSON files for recovery
- Robust error handling: Continues processing even if some sites fail
- 10 concurrent workers: Processes multiple properties simultaneously
- Intelligent distribution: Divides tasks evenly among workers
- Resource optimization: Specific configuration for headless browsing
- Quality filtering: Only saves records with complete information
- Required fields: Phone, zip code, state, and address/city
- Detailed logging: Shows which records are skipped and why
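The resumption and state-persistence behavior described above can be pictured with a pair of helpers like the following. This is a minimal sketch, not the actual implementation; it assumes `greystar_progress.json` holds a flat JSON array of processed URLs.

```js
const fs = require('fs');

const PROGRESS_FILE = 'greystar_progress.json'; // assumed format: a JSON array of processed URLs

// Load previously processed URLs, or start fresh if no progress file exists yet.
function loadProgress() {
  if (!fs.existsSync(PROGRESS_FILE)) return new Set();
  return new Set(JSON.parse(fs.readFileSync(PROGRESS_FILE, 'utf8')));
}

// Persist the processed-URL set after each record so an interrupted run can resume.
function saveProgress(processed) {
  fs.writeFileSync(PROGRESS_FILE, JSON.stringify([...processed], null, 2));
}
```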
| File | Purpose | Format |
|---|---|---|
| `greystar_links.json` | Complete list of extracted links | JSON |
| `greystar_progress.json` | Already processed URLs for resumption | JSON |
| `greystar_properties.csv` | Extracted data in CSV format | CSV |
- **Link Extraction** (first time only)
  - Navigates to https://www.greystar.com/properties
  - Extracts all property links by state
  - Saves them to `greystar_links.json`
- **Progress Verification**
  - Loads previous progress from `greystar_progress.json`
  - Filters out links that were already processed
  - Continues only with pending links
- **Parallel Processing**
  - Divides the pending links into 10 chunks
  - Processes each chunk in a separate worker
  - Updates progress after each record
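The Progress Verification and Parallel Processing steps above amount to filtering out already-processed URLs and splitting the remainder into equal chunks. A minimal, self-contained sketch (the data and names are illustrative, not from the real script):

```js
// Illustrative data; in the real run these come from greystar_links.json and greystar_progress.json.
const allLinks = [
  { communityUrl: 'https://www.greystar.com/properties/example-1' },
  { communityUrl: 'https://www.greystar.com/properties/example-2' },
  { communityUrl: 'https://www.greystar.com/properties/example-3' },
];
const processed = new Set(['https://www.greystar.com/properties/example-1']);

// Keep only links that were not processed in a previous run.
const pending = allLinks.filter(link => !processed.has(link.communityUrl));

// Split the pending links into N roughly equal chunks, one per worker.
function chunkify(items, workers = 10) {
  const chunks = Array.from({ length: workers }, () => []);
  items.forEach((item, i) => chunks[i % workers].push(item));
  return chunks;
}

console.log(chunkify(pending, 10)); // each non-empty chunk is handled by its own worker
```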
```bash
# Clone the repository
git clone <repository-url>
cd greystarScrapy

# Install dependencies
npm install

# Run the scraper
node greystar_paralell_scrapy_v2.js
```

Requirements:

- Node.js 14+
- Google Chrome (for Puppeteer)
- 8GB+ RAM (recommended for parallel processing)
The scraper will automatically:
- Extract property links if not already done
- Resume from the last processed property
- Save data to CSV as it processes
- Handle errors gracefully
The scraper generates:
- `greystar_properties.csv`: Main data file with all extracted properties
- `greystar_links.json`: Cache of all property links
- `greystar_progress.json`: Progress tracking for resumption
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
web-scraping, puppeteer, greystar, real-estate, property-scraper, nodejs, parallel-processing, data-extraction, automation, csv-export, rental-properties, apartment-scraper, headless-browser, resilient-scraping, property-data
- **Validation and Saving**
  - Validates that each record has the minimum required data
  - Saves only complete records to CSV
  - Marks all attempts as processed
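In code, the save step boils down to appending a CSV row only when validation passes, while marking the URL as processed either way. A rough sketch with hypothetical helper names (the real script's column order is shown in the CSV example further below):

```js
const fs = require('fs');

const CSV_FILE = 'greystar_properties.csv';

// Quote a value for CSV output, escaping embedded double quotes.
function csvQuote(value) {
  return `"${String(value ?? '').replace(/"/g, '""')}"`;
}

// Append one record to the CSV; track the URL as processed whether or not it was valid.
function saveRecord(record, isValid, processed) {
  if (isValid) {
    const row = [
      record.state_name, record.communityName, record.address, record.city,
      record.state_address, record.zip, record.phone, record.email,
    ].map(csvQuote).join(',');
    fs.appendFileSync(CSV_FILE, row + '\n');
  }
  processed.add(record.communityUrl); // marked as processed even if skipped
}
```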
Address and contact data are extracted using several complementary strategies:

- **Structured JSON-LD**: searches scripts with structured data, `script[type="application/ld+json"]`
- **Meta Tags**: searches meta properties and names, `meta[property]`, `meta[name]`
- **Text Analysis**: regex patterns for complete addresses, e.g. `/(\d+\s+[A-Za-z\s]+(?:Street|St|Avenue|Ave...).*[A-Z]{2}\s+\d{5})/gi`
- **Specific Selectors**: elements with address-related classes, `[class*="address"]`, `[class*="location"]`, `[class*="contact"]`
- **Phone links**: `a[href^="tel:"]`
- **Text patterns**: `/(?:\+?1[-.\s]?)?\(?([0-9]{3})\)?[-.\s]?([0-9]{3})[-.\s]?([0-9]{4})/`
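With Puppeteer, these lookups would typically run in the page context via `page.evaluate`. The sketch below covers the JSON-LD, meta-tag, and `tel:`-link strategies; the selectors mirror the list above, everything else is illustrative:

```js
// Intended to run in the browser context, e.g. const basics = await page.evaluate(extractBasics);
function extractBasics() {
  const data = { jsonLd: null, metaDescription: null, phone: null };

  // Structured JSON-LD blocks
  const ld = document.querySelector('script[type="application/ld+json"]');
  if (ld) {
    try { data.jsonLd = JSON.parse(ld.textContent); } catch (e) { /* ignore malformed JSON */ }
  }

  // Meta tags
  const meta = document.querySelector('meta[name="description"], meta[property="og:description"]');
  if (meta) data.metaDescription = meta.getAttribute('content');

  // Phone links
  const tel = document.querySelector('a[href^="tel:"]');
  if (tel) data.phone = tel.getAttribute('href').replace('tel:', '');

  return data;
}
```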
A record is considered valid if it has:
- ✅ Phone: Valid format with country code
- ✅ Zip Code: USA format (5 digits or 5+4)
- ✅ State: 2-letter code (e.g., CA, NY, TX)
- ✅ Address or City: At least one of the two fields
```js
function isValidRecord(communityData, addressParts, community) {
  const hasPhone = communityData.phone && communityData.phone.trim() !== '';
  const hasZip = addressParts.zip && addressParts.zip.trim() !== '';
  const hasState = addressParts.state && addressParts.state.trim() !== '';
  const hasAddressOrCity = (addressParts.address && addressParts.address.trim() !== '') ||
    (addressParts.city && addressParts.city.trim() !== '');
  return hasPhone && hasZip && hasState && hasAddressOrCity;
}
```

Main runtime configuration:

```js
{
  headless: true,   // Headless mode
  timeout: 20000,   // 20 seconds per page
  workers: 10,      // Parallel processing
  delay: 800,       // Pause between requests (ms)
  retries: 3        // Attempts per URL
}
```

Chrome launch arguments:

```js
[
  '--no-sandbox',
  '--disable-setuid-sandbox',
  '--disable-dev-shm-usage',
  '--disable-gpu',
  '--disable-extensions',
  '--disable-background-timer-throttling'
]
```

Example CSV output:

```csv
state_name,communityName,address,city,state_address,zip,phone,email
California,"The Residences at Marina Bay","1000 Marina Bay Dr","Richmond","CA","94804","+1 510 555 1234","residencesmarinabay@greystar.com"
```
Example of `greystar_links.json`:

```json
{
"extractedAt": "2025-07-10T21:25:05.595Z",
"totalLinks": 3341,
"links": [
{
"state": "California",
"communityName": "The Residences at Marina Bay",
"communityUrl": "https://www.greystar.com/properties/..."
}
]
}
```

The scraper handles the following error situations:

- **Page Timeouts**
  - Timeout configured to 20 seconds
  - Marks the URL as processed and continues
- **Navigation Errors**
  - Pages not found (404)
  - Connectivity issues
  - Continues with the next link
- **Extraction Errors**
  - Pages with a different structure
  - JavaScript not executed
  - Saves an empty record but marks it as processed
- **Validation Errors**
  - Incomplete data
  - Incorrect formats
  - Skipped from the CSV but marked as processed
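One common way to get this behavior is a small retry wrapper around each page visit: give up after the configured number of attempts and let the caller mark the URL as processed so the run keeps moving. A sketch assuming the configuration shown earlier (`retries: 3`, `timeout: 20000`):

```js
// Try one URL up to `retries` times; a single failing page never stops the whole run.
async function processWithRetries(page, url, handler, retries = 3) {
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 20000 });
      return await handler(page); // extraction logic for one property page
    } catch (err) {
      console.error(`Attempt ${attempt}/${retries} failed for ${url}: ${err.message}`);
    }
  }
  return null; // caller marks the URL as processed and moves on
}
```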
Example log output:

```js
// General information
console.log('Worker 0: ✓ Processed 15/335 - Community Name');

// Warnings (incomplete data)
console.log('Worker 0: ⚠️ Incomplete record skipped - Community Name');

// Errors
console.error('Worker 0: ✗ Error processing Community Name: timeout');
```

The scraper reports the following statistics:

- Total links found
- Already processed links
- Remaining links
- Valid records saved
- Records skipped by validation
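These counters can be derived from the state files. A hypothetical summary helper, assuming `greystar_links.json` has the shape shown above and `greystar_progress.json` is a JSON array of processed URLs:

```js
const fs = require('fs');

// Hypothetical startup summary built from the state files.
function printSummary() {
  const { totalLinks } = JSON.parse(fs.readFileSync('greystar_links.json', 'utf8'));
  const processed = JSON.parse(fs.readFileSync('greystar_progress.json', 'utf8')).length;
  const savedRows = fs.existsSync('greystar_properties.csv')
    ? fs.readFileSync('greystar_properties.csv', 'utf8').trim().split('\n').length - 1 // minus header
    : 0;

  console.log(`Total links found:     ${totalLinks}`);
  console.log(`Already processed:     ${processed}`);
  console.log(`Remaining links:       ${totalLinks - processed}`);
  console.log(`Valid records saved:   ${savedRows}`);
  console.log(`Skipped by validation: ${processed - savedRows}`);
}

printSummary();
```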
Prevent the machine from sleeping during long runs:

```bash
# Run with caffeinate to avoid sleep
caffeinate node greystar_paralell_scrapy_v2.js
```

Resume an interrupted run:

```bash
# System automatically detects previous progress
node greystar_paralell_scrapy_v2.js
```

Start over from scratch:

```bash
# Remove state files
rm greystar_links.json greystar_progress.json greystar_properties.csv

# Run again
node greystar_paralell_scrapy_v2.js
```

- Headless browsing: Navigation without GUI
- Parallel processing: 10 simultaneous workers
- Optimized timeouts: Balance between speed and stability
- Controlled pauses: Avoids server overload
- Strict validation: Only complete records
- Smart parsing: Multiple extraction methods
- Normalization: Consistent formats for phones
- Email generation: Based on community names
- Persistent state: Automatic recovery
- Error handling: Continues despite individual failures
- Detailed logging: Facilitates debugging and monitoring
- Thread-safe: Safe writing to shared files
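For reference, a minimal sketch of how the headless configuration and Chrome flags listed earlier could be passed to `puppeteer.launch` (assuming the `puppeteer` package is installed):

```js
const puppeteer = require('puppeteer');

// Launch a headless browser with the flags from the configuration section above.
async function launchBrowser() {
  return puppeteer.launch({
    headless: true,
    args: [
      '--no-sandbox',
      '--disable-setuid-sandbox',
      '--disable-dev-shm-usage',
      '--disable-gpu',
      '--disable-extensions',
      '--disable-background-timer-throttling',
    ],
  });
}
```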
Approximate resource usage:

- Each worker consumes ~50-100MB of RAM
- 10 workers = ~500MB-1GB total RAM
- CPU: Uses multiple cores efficiently
Network load:

- ~1 request per second per worker
- Total: ~10 requests/second
- Respectful of the target server
Approximate file sizes:

- Links JSON: ~500KB-1MB
- Progress JSON: grows to ~500KB
- Final CSV: ~1-5MB (depending on the amount of valid data)
The following can be adjusted if needed:

- CSS selectors: update if Greystar changes its HTML structure
- Address patterns: add new address formats
- Timeouts: adjust according to server speed
- Validation: make the criteria stricter or more flexible
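One way to keep these adjustments in a single place is to group the tunable pieces into one constants object. A purely illustrative sketch (names and structure are not from the real script):

```js
// Illustrative central location for the pieces most likely to need adjustment.
const TUNABLES = {
  selectors: {
    address: '[class*="address"], [class*="location"], [class*="contact"]',
    phone: 'a[href^="tel:"]',
  },
  patterns: {
    phone: /(?:\+?1[-.\s]?)?\(?([0-9]{3})\)?[-.\s]?([0-9]{3})[-.\s]?([0-9]{4})/,
  },
  timeoutMs: 20000,       // adjust according to server speed
  strictValidation: true, // toggle stricter or more flexible criteria
};

module.exports = TUNABLES;
```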
- Review logs every 30 minutes during execution
- Verify data quality in intermediate CSV
- Monitor system resource usage
- Validate that progress is saved correctly
Note: This system is designed to be robust and efficient, but it always respects the terms of service of the target website and implements appropriate delays to avoid overloading the server.