This project implements a simple web crawler that automatically traverses the web by downloading pages and following links. The crawler extracts relevant data from web pages and stores it in a structured format.
- `src/main/java/com/example/WebCrawler.java`: Main class that initializes the crawling process, manages the URL queue, and handles data extraction.
- `src/main/java/com/example/Config.java`: Responsible for loading configuration settings from the `application.properties` file.
- `src/main/resources/application.properties`: Configuration settings for the application, including crawler settings and logging details.
- `pom.xml`: Maven configuration file specifying project dependencies and build settings.
- Starts from a given URL and extracts page data.
- Follows hyperlinks to traverse the web.
- Respects `robots.txt` rules to avoid restricted pages.
- Prevents revisiting the same URL (see the crawl-loop sketch after this list).
- Saves the state of the URL queue and visited URLs for resumption.
- Uses multithreading to speed up crawling.
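
To make the queue and visited-set behavior above concrete, here is a minimal, hypothetical sketch of a crawl loop. It is single-threaded for clarity, and the `fetchPage`/`extractLinks` helpers and the class name are placeholders, not the project's actual API:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Illustrative sketch only: a FIFO frontier of (url, depth) pairs plus a
// visited set, so each URL is fetched at most once and the crawl stops at
// maxDepth. robots.txt checks and request delays are omitted for brevity.
public class CrawlLoopSketch {

    public static void crawl(String seedUrl, int maxDepth) {
        Queue<Map.Entry<String, Integer>> frontier = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();
        frontier.add(new SimpleEntry<>(seedUrl, 0));

        while (!frontier.isEmpty()) {
            Map.Entry<String, Integer> entry = frontier.poll();
            String url = entry.getKey();
            int depth = entry.getValue();

            // Skip URLs that are too deep or already visited;
            // Set.add returns false when the element was already present.
            if (depth > maxDepth || !visited.add(url)) {
                continue;
            }

            String html = fetchPage(url);            // download the page
            for (String link : extractLinks(html)) { // find outgoing links
                if (!visited.contains(link)) {
                    frontier.add(new SimpleEntry<>(link, depth + 1));
                }
            }
        }
    }

    // Placeholders for the real HTTP fetch and link extraction.
    private static String fetchPage(String url) { return ""; }
    private static List<String> extractLinks(String html) {
        return Collections.emptyList();
    }
}
```

In the actual crawler, where multiple worker threads run this loop concurrently, the frontier and visited set would need thread-safe equivalents (e.g., a `ConcurrentLinkedQueue` and a set backed by `ConcurrentHashMap`).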
- Java 8 or higher
- Maven
```
git clone https://github.com/yourusername/web-crawler.git
cd web-crawler
mvn clean package
mvn exec:java
```

You can customize the behavior of the web crawler by modifying the `application.properties` file. The available settings include:
- `crawler.maxDepth`: Maximum depth of links to follow.
- `crawler.userAgent`: User agent string to use for HTTP requests.
- `crawler.delayBetweenRequests`: Delay between requests in milliseconds.
- `logging.level`: Log level for the application (e.g., INFO, DEBUG).
- `logging.file`: Log file path.
- `crawled.urls.output.file`: Output file path for storing crawled URLs.
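
For illustration, an `application.properties` using these keys might look like the following. The values shown are hypothetical examples, not the project's shipped defaults:

```properties
# Example values only; adjust to your needs.
crawler.maxDepth=3
crawler.userAgent=WebCrawler/1.0
crawler.delayBetweenRequests=1000
logging.level=INFO
logging.file=logs/crawler.log
crawled.urls.output.file=output/crawled_urls.txt
```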
The crawler logs its activity to a file specified in the `application.properties` file. By default, it logs to `logs/crawler.log`.
The crawler saves the state of the URL queue and visited URLs to disk, allowing it to resume from where it left off after being shut down. The state is saved to the following files (a sketch of the save/restore logic follows the list):
- `output/queue_state.txt`
- `output/visited_urls_state.txt`
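
A minimal sketch of that save/restore cycle, assuming one URL per line in each state file (the crawler's actual on-disk format may differ):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Queue;
import java.util.Set;

// Hypothetical illustration of persisting and restoring crawl state.
// The file paths match the README; the one-URL-per-line format and the
// class/method names are assumptions for this sketch.
public class CrawlStateSketch {

    static final Path QUEUE_FILE = Paths.get("output/queue_state.txt");
    static final Path VISITED_FILE = Paths.get("output/visited_urls_state.txt");

    // Write the frontier and visited set to disk, one URL per line.
    static void save(Queue<String> queue, Set<String> visited) throws IOException {
        Files.createDirectories(QUEUE_FILE.getParent());
        Files.write(QUEUE_FILE, queue);
        Files.write(VISITED_FILE, visited);
    }

    // Rebuild the frontier and visited set from the state files, if present.
    static void restore(Queue<String> queue, Set<String> visited) throws IOException {
        if (Files.exists(QUEUE_FILE)) {
            queue.addAll(Files.readAllLines(QUEUE_FILE));
        }
        if (Files.exists(VISITED_FILE)) {
            visited.addAll(Files.readAllLines(VISITED_FILE));
        }
    }
}
```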