A fully automated, cross-platform web scraper that extracts credentialed tax professionals from the official IRS Return Preparer Office site based on dynamic user input. Designed with clean modular architecture and robust input validation, this project utilizes Selenium, pandas, and Python 3 to efficiently collect and export structured data into CSV format.
Watch Demo Video (Update: Will keep CSV formatted the same as a guide because this is the free version)
- Modular Design: Separated logic into validation, scraping, and utility components for scalability and readability.
- User-Centric Input Flow: Validates and guides user input interactively with intuitive prompts.
- Real Data Pipeline: Collects credentialed tax professional data across multiple pages and outputs a clean, structured CSV.
- Cross-Platform File Access: Automatically opens the output file regardless of OS (macOS, Windows, Linux).
- Fast and Scalable: Designed to handle multiple pages of results and custom filtering across six credential categories.
- Language: Python 3.11
- Libraries: selenium, pandas
- Browser Automation: ChromeDriver
- Environment: Cross-platform (tested on macOS and Windows)
I built this to explore how automation and clean code principles could be applied to real-world datasets provided by government portals. The IRS RPO database offered a complex structure that required form interaction, conditional filtering, and pagination — the perfect challenge for testing my Selenium and data processing skills.
| Feature | Description |
|---|---|
| Input Validation | Ensures users provide valid integers, ZIP codes, and boolean responses |
| IRS Website Automation | Interacts with dropdowns, checkboxes, and navigates multiple pages |
| Credential Filtering | Filters by Attorney, CPA, Enrolled Agent, Actuary, and more |
| CSV Export | All results are saved to contacts.csv with structured columns |
| File Launcher | Automatically opens the result in your system’s default CSV viewer |
These improvements make the CSV easier to use, more structured, and ready for manual enrichment.
- Separated Name Columns: Full names are now split into distinct First Name and Last Name columns. The script automatically detects the comma in names like
"Doe, Jane"and separates them accordingly. - Manual Entry Fields: Added two new columns —
EmailandPhone Number— which are left blank for users to fill out manually. These details are often unavailable or unreliable from the IRS database, making manual entry more practical. - Other Info Consolidation: Details such as City, Distance, and other location-related metadata are now grouped into a single column called
Other Info. This provides helpful context for users while completing contact fields likeEmailandPhone Number.
Requirements:
pip install selenium pandas
python main.py
- ZIP Code
- Search distance (in miles)
- Number of pages to scrape
- Which credential types to include (yes/no)
irs-credential-scraper/main.py— Orchestrates the workflow from input to outputinput_validation.py— Handles user interaction and sanitizationweb_scraping.py— Core Selenium scraping logicutils.py— OS-aware utility for opening CSVcontacts.csv— Exported results (generated after run)README.md— This documentation file
- Learned how to automate interaction with complex web forms using Selenium, including dropdowns, checkboxes, and pagination.
- Developed strong input validation techniques to ensure robust user interaction and prevent invalid data entry.
- Gained experience structuring a modular Python project for clarity, maintainability, and scalability.
- Improved skills in data extraction, cleaning, and exporting to CSV format using pandas.
- Learned to handle cross-platform compatibility for launching files in different operating systems.
- Improve the scraper’s resilience by implementing robust error handling to better manage website load delays, unexpected changes in the IRS site structure, and element availability issues.
- Explore replacing Selenium with direct API calls or alternative data sources where possible to increase reliability, reduce maintenance, and speed up data extraction.
- Add support for exporting data in additional formats such as Excel or JSON to increase flexibility for end users.
- Implement multi-threading or asynchronous scraping techniques to improve performance and reduce runtime.
- Enhance filtering options with more detailed credential categories and geographic search parameters to provide more precise results.
git clone https://github.com/yourusername/irs-credential-scraper.git
cd irs-credential-scraper