This project is a multi-threaded web crawler built in C++. It uses the Curl library for making HTTP requests, the Gumbo library for parsing HTML, and the nlohmann/json library for JSON manipulation. The crawler can crawl websites up to a specified depth and save metadata such as page titles, descriptions, and links.
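Under the hood, each fetch boils down to a libcurl request that buffers the response body for the HTML parser. The sketch below shows that basic pattern; it is a minimal illustration rather than the project's actual source, and the URL is a placeholder.

```cpp
#include <curl/curl.h>
#include <iostream>
#include <string>

// libcurl write callback: append each chunk of the response body to a std::string.
static size_t write_cb(char* data, size_t size, size_t nmemb, void* userdata) {
    auto* body = static_cast<std::string*>(userdata);
    body->append(data, size * nmemb);
    return size * nmemb;
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL* curl = curl_easy_init();
    if (!curl) return 1;

    std::string body;
    curl_easy_setopt(curl, CURLOPT_URL, "https://example.com"); // placeholder URL
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);         // follow redirects
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);

    CURLcode res = curl_easy_perform(curl);
    if (res == CURLE_OK)
        std::cout << "fetched " << body.size() << " bytes\n";
    else
        std::cerr << "fetch failed: " << curl_easy_strerror(res) << "\n";

    curl_easy_cleanup(curl);
    curl_global_cleanup();
}
```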
- Getting Started
- Setup Instructions
- Installation Guide for Dependencies
- Cloning the Repository
- Building the Project
- Running the Web Crawler
To set up and run the web crawler project, follow these instructions:
Make sure you have the following dependencies installed on your system:
- C++ compiler (e.g., GCC, Clang)
- CMake (version 3.14 or later)
- Curl library
- Gumbo library
- nlohmann/json library
To install the Curl library, use your package manager. For example, on Debian-based systems (like Ubuntu), run:
sudo apt-get install libcurl4-openssl-dev
On Windows, you can install Curl using vcpkg. First, install vcpkg and then run:
vcpkg install curl
On Debian-based systems, install the Gumbo library using the following command:
sudo apt-get install libgumbo-dev
To install Gumbo on Windows, use vcpkg:
vcpkg install gumbo-parser
On Debian-based systems, you can install the nlohmann/json library via the package manager:
sudo apt-get install nlohmann-json3-dev
To install nlohmann/json on Windows, use vcpkg:
vcpkg install nlohmann-json
Make sure to configure your build system to include the paths to these libraries as needed.
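Once the libraries are installed, a small smoke-test program is a quick way to confirm that your compiler and linker can find all three. This is an optional, illustrative check; the file name and build command below are examples and may need adjusting for your system.

```cpp
// check_deps.cpp -- compiles and runs only if all three dependencies are visible.
// Example build (library paths may differ):
//   g++ -std=c++17 check_deps.cpp -lcurl -lgumbo -o check_deps
#include <curl/curl.h>
#include <gumbo.h>
#include <nlohmann/json.hpp>
#include <iostream>

int main() {
    // libcurl: print the linked version string.
    std::cout << "curl: " << curl_version() << "\n";

    // Gumbo: parse a trivial document and release it again.
    GumboOutput* out = gumbo_parse("<title>ok</title>");
    std::cout << "gumbo: " << (out->root ? "parse ok" : "parse failed") << "\n";
    gumbo_destroy_output(&kGumboDefaultOptions, out);

    // nlohmann/json: build and serialize a small object.
    nlohmann::json j = {{"deps", "ok"}};
    std::cout << "json: " << j.dump() << "\n";
}
```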
git clone https://github.com/your-username/web-crawler.git
cd web-crawler
Create an obj directory, generate the necessary files, and compile the project with the provided script:
./make.sh
To start crawling, pass a seed URL, the number of worker threads, and the maximum crawl depth:
./bin/web_crawler javatpoint.com 16 3
Replace javatpoint.com with the desired seed URL; the second argument (16) sets the number of worker threads and the third (3) sets the maximum crawl depth.
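To give a feel for the metadata extraction step, the sketch below walks a Gumbo parse tree to find a page's `<title>`. It is a simplified illustration of the technique, not the crawler's actual implementation.

```cpp
#include <gumbo.h>
#include <iostream>
#include <string>

// Depth-first search for the first <title> element and return its text content.
static std::string find_title(const GumboNode* node) {
    if (node->type != GUMBO_NODE_ELEMENT)
        return "";
    if (node->v.element.tag == GUMBO_TAG_TITLE) {
        const GumboVector* kids = &node->v.element.children;
        if (kids->length > 0) {
            const GumboNode* text = static_cast<const GumboNode*>(kids->data[0]);
            if (text->type == GUMBO_NODE_TEXT)
                return text->v.text.text;
        }
        return "";
    }
    const GumboVector* children = &node->v.element.children;
    for (unsigned int i = 0; i < children->length; ++i) {
        std::string title =
            find_title(static_cast<const GumboNode*>(children->data[i]));
        if (!title.empty())
            return title;
    }
    return "";
}

int main() {
    const char* html = "<html><head><title>Example Page</title></head><body></body></html>";
    GumboOutput* out = gumbo_parse(html);
    std::cout << "title: " << find_title(out->root) << "\n";
    gumbo_destroy_output(&kGumboDefaultOptions, out);
}
```

The same traversal pattern extends to meta-description tags and `<a href>` links.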
To contribute changes, the workflow below keeps your fork in sync with the original repository:
# 1. Clone your forked repo
git clone https://github.com/your-username/repo-name.git
cd repo-name
# 2. Add upstream remote to keep up with the original repo
git remote add upstream https://github.com/original-owner/repo-name.git
git fetch upstream
# 3. Create a new branch for your feature/fix
git checkout -b feature-or-fix-description
# 4. Before making changes, ensure your local main branch is up-to-date
git checkout main
git fetch upstream
git rebase upstream/main
# 5. Rebase your feature branch onto the updated main branch
git checkout feature-or-fix-description
git rebase main
# 6. Make your changes, then stage and commit with a descriptive message
git add .
git commit -m "Description of changes"
# 7. Push your branch to your fork
git push origin feature-or-fix-description
# 8. If updates happen on upstream/main while your PR is open, keep your branch updated:
git fetch upstream # Get latest updates from the original repo
git rebase upstream/main # Rebase your feature branch onto it
git push origin feature-or-fix-description --force # Force-push updated branch
This web crawler is a multi-threaded application that crawls websites, extracts relevant metadata, and enforces per-domain rate limiting. Its modular design makes it easy to extend or modify for additional use cases. You can tune the number of worker threads, the crawl depth, and the per-domain request delay to match your performance requirements.
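As one illustration of the per-domain request delay mentioned above, a rate limiter can map each domain to the earliest time the next request is allowed, guarded by a mutex so worker threads can share it. This is a sketch of one possible approach, not the project's actual implementation; the class and method names are made up for the example.

```cpp
#include <chrono>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>

// Hypothetical per-domain rate limiter: wait() blocks until at least `delay`
// has elapsed since the previous request to the same domain.
class DomainRateLimiter {
public:
    explicit DomainRateLimiter(std::chrono::milliseconds delay) : delay_(delay) {}

    void wait(const std::string& domain) {
        std::chrono::steady_clock::time_point next;
        {
            std::lock_guard<std::mutex> lock(mu_);
            auto now = std::chrono::steady_clock::now();
            auto& slot = next_allowed_[domain];   // defaults to the epoch for new domains
            if (slot < now) slot = now;           // domain is idle: go immediately
            next = slot;
            slot += delay_;                       // reserve the slot after ours
        }
        std::this_thread::sleep_until(next);      // sleep outside the lock
    }

private:
    std::chrono::milliseconds delay_;
    std::mutex mu_;
    std::unordered_map<std::string, std::chrono::steady_clock::time_point> next_allowed_;
};
```

Each worker would call something like `limiter.wait(domain)` immediately before issuing a request; reserving the next slot while holding the lock keeps concurrent workers from racing on the same domain.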
Feel free to contribute to the project by submitting issues or pull requests. Happy crawling!