A professional, production-ready PHP sitemap generator by IProDev (Hemn Chawroka) with support for concurrent crawling, robots.txt, gzip compression, sitemap index files, and comprehensive error handling.
- High Performance: Concurrent HTTP requests using Guzzle
- Robots.txt Support: Respects robots.txt rules, including wildcards
- Gzip Compression: Automatic `.gz` file generation
- Sitemap Index: Automatic index file creation for large sites
- Error Handling: Comprehensive error handling and validation
- Logging: PSR-3 compatible logging support
- Canonical URLs: Automatic canonical URL detection
- Well Tested: Comprehensive unit tests with PHPUnit
- Docker Support: Ready-to-use Docker configuration
- CLI Tool: Professional command-line interface with progress reporting
- PHP >= 8.0
- Composer
- Extensions: `curl`, `xml`, `mbstring`, `zlib`
Install via Composer:

```bash
composer require iprodev/sitemap-generator-pro
```

Basic usage:

```bash
php bin/sitemap --url=https://www.example.com
```

Full example:

```bash
php bin/sitemap \
--url=https://www.iprodev.com \
--out=./sitemaps \
--concurrency=20 \
--max-pages=10000 \
--max-depth=5 \
--public-base=https://www.iprodev.com \
--verbose
```

| Option | Required | Default | Description |
|---|---|---|---|
| `--url` | Yes | - | Starting URL to crawl |
| `--out` | No | `./output` | Output directory for sitemap files |
| `--concurrency` | No | `10` | Number of concurrent HTTP requests (1-100) |
| `--max-pages` | No | `50000` | Maximum number of pages to crawl |
| `--max-depth` | No | `5` | Maximum link depth to follow |
| `--public-base` | No | - | Public base URL for sitemap index |
| `--verbose`, `-v` | No | `false` | Enable verbose output |
| `--help`, `-h` | No | - | Show help message |
Example output:

```
======================================================================
PHP XML Sitemap Generator
======================================================================
Configuration:
  URL: https://www.example.com
  Domain: www.example.com
  Output: ./output
  Concurrency: 20
  Max Pages: 10000
  Max Depth: 5
======================================================================
[0.50s] [info] Initializing crawler...
[0.75s] [info] Fetching robots.txt...
[1.20s] [info] Starting crawl...
[45.30s] [info] Crawl completed {"duration":"45.3s","pages":1523}
======================================================================
✅ Success!
======================================================================
Generated Files:
  • sitemap-1.xml.gz (125.4 KB)
  • sitemap-index.xml (892 B)
Statistics:
  • Total Pages: 1523
  • Total Time: 46.2s
  • Crawl Speed: 33.0 pages/sec
  • Memory Used: 45.8 MB
  • Output Dir: ./output
======================================================================
```
Basic library usage:

```php
use IProDev\Sitemap\Fetcher;
use IProDev\Sitemap\Crawler;
use IProDev\Sitemap\SitemapWriter;
use IProDev\Sitemap\RobotsTxt;
// Initialize fetcher
$fetcher = new Fetcher(['concurrency' => 10]);
// Load robots.txt
$robots = RobotsTxt::fromUrl('https://www.example.com', $fetcher);
// Create crawler
$crawler = new Crawler($fetcher, $robots);
// Crawl website
$pages = $crawler->crawl('https://www.example.com', 10000, 5);
// Write sitemap files
$files = SitemapWriter::write(
$pages,
__DIR__ . '/sitemaps',
50000,
'https://www.example.com'
);
echo "Generated " . count($files) . " files\n";use IProDev\Sitemap\Fetcher;
use IProDev\Sitemap\Crawler;
use IProDev\Sitemap\SitemapWriter;
use IProDev\Sitemap\RobotsTxt;
use Monolog\Logger;
use Monolog\Handler\StreamHandler;
// Create logger
$logger = new Logger('sitemap');
$logger->pushHandler(new StreamHandler('sitemap.log', Logger::INFO));
// Initialize with logger
$fetcher = new Fetcher([
'concurrency' => 20,
'timeout' => 15,
], $logger);
$robots = RobotsTxt::fromUrl('https://www.example.com', $fetcher);
$crawler = new Crawler($fetcher, $robots, $logger);
// Crawl with error handling
try {
$pages = $crawler->crawl('https://www.example.com', 10000, 5);
$files = SitemapWriter::write($pages, './sitemaps', 50000, 'https://www.example.com');
// Get statistics
$stats = $crawler->getStats();
echo "Processed: {$stats['processed']} pages\n";
echo "Unique URLs: {$stats['unique_urls']}\n";
} catch (\InvalidArgumentException $e) {
echo "Configuration error: {$e->getMessage()}\n";
} catch (\RuntimeException $e) {
echo "Runtime error: {$e->getMessage()}\n";
}
```

Fetcher configuration options:

```php
$fetcher = new Fetcher([
'concurrency' => 20,
'timeout' => 15,
'connect_timeout' => 10,
'headers' => [
'User-Agent' => 'MyBot/1.0',
],
'verify' => true, // SSL verification
], $logger);
```

Run unit tests:

```bash
composer install
vendor/bin/phpunit
```

Run with coverage:

```bash
vendor/bin/phpunit --coverage-html coverage
```

Code style check:

```bash
vendor/bin/phpcs --standard=PSR12 src/ tests/
```

Build the Docker image:

```bash
docker build -t sitemap-generator-pro .
```

Run the container:

```bash
docker run --rm \
-v $(pwd)/sitemaps:/app/output \
sitemap-generator-pro \
--url=https://www.iprodev.com \
--out=/app/output \
--concurrency=20 \
--max-pages=10000 \
--public-base=https://www.iprodev.com \
--verbose
```

API reference for the `Fetcher` class:

```php
// Constructor
new Fetcher(array $options = [], ?LoggerInterface $logger = null)

// Fetch multiple URLs concurrently
fetchMany(array $urls, callable $onFulfilled, ?callable $onRejected = null): void

// Fetch a single URL
get(string $url): ResponseInterface

// Get the concurrency setting
getConcurrency(): int
```
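A minimal usage sketch for `Fetcher`. Only the signatures above come from this reference; the arguments passed to the fulfilled/rejected callbacks (response or failure reason plus the request URL) are an assumption:

```php
use IProDev\Sitemap\Fetcher;

$fetcher = new Fetcher(['concurrency' => 5, 'timeout' => 15]);

$urls = [
    'https://www.example.com/',
    'https://www.example.com/about',
];

// Concurrent fetch; the callback argument order is assumed, not documented here
$fetcher->fetchMany(
    $urls,
    function ($response, $url) {
        echo "OK {$url}: " . $response->getStatusCode() . "\n";
    },
    function ($reason, $url) {
        echo "Failed {$url}\n";
    }
);

// Single request returns a PSR-7 ResponseInterface
$html = (string) $fetcher->get('https://www.example.com')->getBody();
echo $fetcher->getConcurrency() . " concurrent requests configured\n";
```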
The `Crawler` class:

```php
// Constructor
new Crawler(Fetcher $fetcher, RobotsTxt $robots, ?LoggerInterface $logger = null)

// Crawl a website
crawl(string $startUrl, int $maxPages = 10000, int $maxDepth = 5): array

// Get crawl statistics
getStats(): array
```
The `SitemapWriter` class:

```php
// Write sitemap files
static write(
    array $pages,
    string $outPath,
    int $maxPerFile = 50000,
    ?string $publicBase = null
): array
```
HTML parsing helpers:

```php
// Extract links from HTML
static extractLinks(string $html, string $baseUrl): array

// Resolve a relative URL against a base URL
static resolveUrl(string $href, string $base): ?string

// Get the canonical URL
static getCanonical(string $html, string $baseUrl): ?string

// Get meta robots directives
static getMetaRobots(string $html): array
```
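A short sketch of how these helpers might be combined; the class that hosts them is not named here, so `Parser` below is a placeholder:

```php
use IProDev\Sitemap\Fetcher;
// "Parser" is a placeholder; substitute the class that actually exposes
// the static helpers listed above.
use IProDev\Sitemap\Parser;

$fetcher = new Fetcher();
$baseUrl = 'https://www.example.com';
$html    = (string) $fetcher->get($baseUrl)->getBody();

$links     = Parser::extractLinks($html, $baseUrl);   // absolute URLs found in the page
$canonical = Parser::getCanonical($html, $baseUrl);   // canonical URL or null
$robots    = Parser::getMetaRobots($html);            // meta robots directives
$resolved  = Parser::resolveUrl('../pricing', $baseUrl . '/docs/page.html');
```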
The `RobotsTxt` class:

```php
// Load from URL
static fromUrl(string $baseUrl, Fetcher $fetcher): RobotsTxt

// Check if a URL is allowed
isAllowed(string $url): bool

// Get disallow rules
getDisallows(): array

// Get allow rules
getAllows(): array
```
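A quick sketch using only the methods above:

```php
use IProDev\Sitemap\Fetcher;
use IProDev\Sitemap\RobotsTxt;

$fetcher = new Fetcher(['concurrency' => 5]);
$robots  = RobotsTxt::fromUrl('https://www.example.com', $fetcher);

// Skip URLs that robots.txt disallows
if (!$robots->isAllowed('https://www.example.com/private/report')) {
    echo "Blocked by robots.txt\n";
}

// Inspect the parsed rules
print_r($robots->getDisallows());
print_r($robots->getAllows());
```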
Utility helpers:

```php
static normalizeUrl(string $url): string
static formatBytes(int $bytes, int $precision = 2): string
static formatDuration(float $seconds): string
static isValidUrl(string $url): bool
static getDomain(string $url): ?string
static calculateProgress(int $current, int $total): float
static progressBar(int $current, int $total, int $width = 50): string
static getMemoryUsage(): string
static getPeakMemoryUsage(): string
static cleanUrl(string $url, bool $removeQuery = false): string
```
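A brief sketch of these helpers in use; `Utils` is an assumed class name for where they live:

```php
// "Utils" is an assumed class name for the static helpers listed above.
use IProDev\Sitemap\Utils;

var_dump(Utils::isValidUrl('https://www.example.com'));   // expect bool(true)
echo Utils::getDomain('https://www.example.com/about'), "\n";
echo Utils::formatBytes(131072), "\n";                     // human-readable size
echo Utils::formatDuration(46.2), "\n";                    // human-readable duration
echo Utils::progressBar(250, 1000), "\n";                  // text progress bar
echo Utils::cleanUrl('https://www.example.com/page?utm_source=x', true), "\n";
```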
Recommended settings:

- Small sites: `--concurrency=5 --max-pages=1000 --max-depth=10`
- Medium sites: `--concurrency=10 --max-pages=10000 --max-depth=5`
- Large sites: `--concurrency=20 --max-pages=50000 --max-depth=3`

The library includes comprehensive error handling:

- Invalid URLs: Validates all URLs before processing
- Network Errors: Gracefully handles timeouts and connection failures
- Memory Management: Efficient memory usage for large sites
- File System Errors: Proper validation and clear error messages
- Robots.txt Parsing: Handles malformed robots.txt files
The generator creates the following files:
- `sitemap-1.xml` - First sitemap file
- `sitemap-1.xml.gz` - Compressed version
- `sitemap-2.xml.gz` - Additional files if needed
- `sitemap-index.xml` - Index file listing all sitemaps
Security features:

- Path traversal prevention
- URL validation and sanitization
- Safe XML generation with proper escaping
- Robots.txt respect
- Meta robots tag support
- SSL certificate verification
Performance tips:

- Increase Concurrency: Higher concurrency (up to 100) speeds up crawling
- Reduce Max Depth: A lower depth keeps the crawl focused on important pages
- Memory: Ensure adequate memory is available for large sites
- Network: A fast, stable internet connection is recommended
- Robots.txt: A well-maintained robots.txt reduces unnecessary requests
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Write tests for new features
- Follow PSR-12 coding standards
- Submit a pull request
MIT License - see LICENSE.md for details
Created by iprodev - https://github.com/iprodev
Support:

- Issues: GitHub Issues
- Discussions: GitHub Discussions
Made with ❤️ by iprodev