PHP XML Sitemap Generator (Library + CLI)

A professional, production-ready PHP sitemap generator by IProDev (Hemn Chawroka). It supports concurrent crawling, robots.txt rules, gzip compression, sitemap index files, and comprehensive error handling.


✨ Features

  • πŸš€ High Performance: Concurrent HTTP requests using Guzzle
  • πŸ€– Robots.txt Support: Respects robots.txt rules (including wildcards)
  • πŸ“¦ Gzip Compression: Automatic .gz file generation
  • πŸ“Š Sitemap Index: Automatic index file creation for large sites
  • πŸ›‘οΈ Error Handling: Comprehensive error handling and validation
  • πŸ“ Logging: PSR-3 compatible logging support
  • 🎯 Canonical URLs: Automatic canonical URL detection
  • πŸ§ͺ Well Tested: Comprehensive unit tests with PHPUnit
  • 🐳 Docker Support: Ready-to-use Docker configuration
  • πŸ’» CLI Tool: Professional command-line interface with progress reporting

📋 Requirements

  • PHP >= 8.0
  • Composer
  • Extensions: curl, xml, mbstring, zlib

📥 Installation

composer require iprodev/sitemap-generator-pro
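
If you use the library programmatically (see Programmatic Usage below), include Composer's autoloader first; the PHP examples in this README assume it has already been loaded:

require __DIR__ . '/vendor/autoload.php';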

🚀 CLI Usage

Basic Usage

php bin/sitemap --url=https://www.example.com

Advanced Usage

php bin/sitemap \
  --url=https://www.iprodev.com \
  --out=./sitemaps \
  --concurrency=20 \
  --max-pages=10000 \
  --max-depth=5 \
  --public-base=https://www.iprodev.com \
  --verbose

CLI Options

Option          Required  Default   Description
--url           Yes       -         Starting URL to crawl
--out           No        ./output  Output directory for sitemap files
--concurrency   No        10        Number of concurrent HTTP requests (1-100)
--max-pages     No        50000     Maximum number of pages to crawl
--max-depth     No        5         Maximum link depth to follow
--public-base   No        -         Public base URL for sitemap index
--verbose, -v   No        false     Enable verbose output
--help, -h      No        -         Show help message

CLI Output Example

======================================================================
  PHP XML Sitemap Generator
======================================================================
Configuration:
  URL:         https://www.example.com
  Domain:      www.example.com
  Output:      ./output
  Concurrency: 20
  Max Pages:   10000
  Max Depth:   5
======================================================================

[0.50s] [info] Initializing crawler...
[0.75s] [info] Fetching robots.txt...
[1.20s] [info] Starting crawl...
[45.30s] [info] Crawl completed {"duration":"45.3s","pages":1523}

======================================================================
  ✅ Success!
======================================================================
Generated Files:
  • sitemap-1.xml.gz (125.4 KB)
  • sitemap-index.xml (892 B)

Statistics:
  • Total Pages:    1523
  • Total Time:     46.2s
  • Crawl Speed:    33.0 pages/sec
  • Memory Used:    45.8 MB
  • Output Dir:     ./output
======================================================================

💻 Programmatic Usage

Basic Example

use IProDev\Sitemap\Fetcher;
use IProDev\Sitemap\Crawler;
use IProDev\Sitemap\SitemapWriter;
use IProDev\Sitemap\RobotsTxt;

// Initialize fetcher
$fetcher = new Fetcher(['concurrency' => 10]);

// Load robots.txt
$robots = RobotsTxt::fromUrl('https://www.example.com', $fetcher);

// Create crawler
$crawler = new Crawler($fetcher, $robots);

// Crawl website
$pages = $crawler->crawl('https://www.example.com', 10000, 5);

// Write sitemap files
$files = SitemapWriter::write(
    $pages, 
    __DIR__ . '/sitemaps', 
    50000, 
    'https://www.example.com'
);

echo "Generated " . count($files) . " files\n";

Advanced Example with Logging

use IProDev\Sitemap\Fetcher;
use IProDev\Sitemap\Crawler;
use IProDev\Sitemap\SitemapWriter;
use IProDev\Sitemap\RobotsTxt;
use Monolog\Logger;
use Monolog\Handler\StreamHandler;

// Create logger
$logger = new Logger('sitemap');
$logger->pushHandler(new StreamHandler('sitemap.log', Logger::INFO));

// Initialize with logger
$fetcher = new Fetcher([
    'concurrency' => 20,
    'timeout' => 15,
], $logger);

$robots = RobotsTxt::fromUrl('https://www.example.com', $fetcher);
$crawler = new Crawler($fetcher, $robots, $logger);

// Crawl with error handling
try {
    $pages = $crawler->crawl('https://www.example.com', 10000, 5);
    $files = SitemapWriter::write($pages, './sitemaps', 50000, 'https://www.example.com');
    
    // Get statistics
    $stats = $crawler->getStats();
    echo "Processed: {$stats['processed']} pages\n";
    echo "Unique URLs: {$stats['unique_urls']}\n";
    
} catch (\InvalidArgumentException $e) {
    echo "Configuration error: {$e->getMessage()}\n";
} catch (\RuntimeException $e) {
    echo "Runtime error: {$e->getMessage()}\n";
}

Custom Fetcher Configuration

$fetcher = new Fetcher([
    'concurrency' => 20,
    'timeout' => 15,
    'connect_timeout' => 10,
    'headers' => [
        'User-Agent' => 'MyBot/1.0',
    ],
    'verify' => true, // SSL verification
], $logger);

🧪 Testing

Run unit tests:

composer install
vendor/bin/phpunit

Run with coverage:

vendor/bin/phpunit --coverage-html coverage

Code style check:

vendor/bin/phpcs --standard=PSR12 src/ tests/

🐳 Docker Usage

Build the Docker image:

docker build -t sitemap-generator-pro .

Run the container:

docker run --rm \
  -v $(pwd)/sitemaps:/app/output \
  sitemap-generator-pro \
  --url=https://www.iprodev.com \
  --out=/app/output \
  --concurrency=20 \
  --max-pages=10000 \
  --public-base=https://www.iprodev.com \
  --verbose

📚 API Documentation

Fetcher

// Constructor
new Fetcher(array $options = [], ?LoggerInterface $logger = null)

// Fetch multiple URLs concurrently
fetchMany(array $urls, callable $onFulfilled, ?callable $onRejected = null): void

// Fetch single URL
get(string $url): ResponseInterface

// Get concurrency setting
getConcurrency(): int
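
A minimal sketch using only the methods listed above. get() returns a PSR-7 response, so getStatusCode() and getBody() are available; the arguments passed to the fetchMany() callbacks are an assumption here (a response for fulfilled requests, a failure reason for rejected ones), so check the class for the actual callback contract.

use IProDev\Sitemap\Fetcher;

$fetcher = new Fetcher(['concurrency' => 5]);

// Single request: returns a PSR-7 ResponseInterface.
$response = $fetcher->get('https://www.example.com');
echo $response->getStatusCode(), "\n";

// Concurrent requests; callback parameters are assumed, not verified.
$fetcher->fetchMany(
    ['https://www.example.com/', 'https://www.example.com/about'],
    function ($response = null) {
        echo "fetched\n";
    },
    function ($reason = null) {
        echo "failed\n";
    }
);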

Crawler

// Constructor
new Crawler(Fetcher $fetcher, RobotsTxt $robots, ?LoggerInterface $logger = null)

// Crawl website
crawl(string $startUrl, int $maxPages = 10000, int $maxDepth = 5): array

// Get crawl statistics
getStats(): array

SitemapWriter

// Write sitemap files
static write(
    array $pages, 
    string $outPath, 
    int $maxPerFile = 50000, 
    ?string $publicBase = null
): array
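
The examples above only call count() on the return value. Assuming write() returns a list of generated file paths (an assumption, not confirmed by the signature alone), the files could be reported like this:

// Assumes $files is a list of file paths returned by SitemapWriter::write().
foreach ($files as $file) {
    printf("%s (%s)\n", basename($file), \IProDev\Sitemap\Utils::formatBytes((int) filesize($file)));
}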

Parser

// Extract links from HTML
static extractLinks(string $html, string $baseUrl): array

// Resolve relative URL
static resolveUrl(string $href, string $base): ?string

// Get canonical URL
static getCanonical(string $html, string $baseUrl): ?string

// Get meta robots directives
static getMetaRobots(string $html): array
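
A sketch combining the helpers above on a small HTML snippet. The exact shape of the returned arrays (e.g. plain URL strings from extractLinks(), directive names from getMetaRobots()) is an assumption:

use IProDev\Sitemap\Parser;

$html = '<html><head>'
      . '<link rel="canonical" href="https://www.example.com/page">'
      . '<meta name="robots" content="noindex, nofollow">'
      . '</head><body><a href="/about">About</a></body></html>';

$links     = Parser::extractLinks($html, 'https://www.example.com/');
$canonical = Parser::getCanonical($html, 'https://www.example.com/');
$robots    = Parser::getMetaRobots($html);
$absolute  = Parser::resolveUrl('/contact', 'https://www.example.com/');

var_dump($links, $canonical, $robots, $absolute);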

RobotsTxt

// Load from URL
static fromUrl(string $baseUrl, Fetcher $fetcher): RobotsTxt

// Check if URL is allowed
isAllowed(string $url): bool

// Get disallow rules
getDisallows(): array

// Get allow rules
getAllows(): array
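
For example, sticking to the methods listed above:

use IProDev\Sitemap\Fetcher;
use IProDev\Sitemap\RobotsTxt;

$fetcher = new Fetcher();
$robots  = RobotsTxt::fromUrl('https://www.example.com', $fetcher);

if (!$robots->isAllowed('https://www.example.com/private/page')) {
    echo "Blocked by robots.txt\n";
}

// Inspect the parsed rules.
var_dump($robots->getDisallows(), $robots->getAllows());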

Utils

static normalizeUrl(string $url): string
static formatBytes(int $bytes, int $precision = 2): string
static formatDuration(float $seconds): string
static isValidUrl(string $url): bool
static getDomain(string $url): ?string
static calculateProgress(int $current, int $total): float
static progressBar(int $current, int $total, int $width = 50): string
static getMemoryUsage(): string
static getPeakMemoryUsage(): string
static cleanUrl(string $url, bool $removeQuery = false): string
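
A few of these helpers in action; the output strings in the comments are illustrative guesses and may differ from the actual formatting:

use IProDev\Sitemap\Utils;

echo Utils::normalizeUrl('HTTPS://Example.com/path/'), "\n";
echo Utils::formatBytes(125400), "\n";          // e.g. "122.46 KB"
echo Utils::formatDuration(46.2), "\n";         // e.g. "46.2s"
echo Utils::getDomain('https://www.example.com/page'), "\n";
echo Utils::progressBar(50, 100), "\n";
var_dump(Utils::isValidUrl('not a url'));       // presumably false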

🔧 Configuration Best Practices

For Small Sites (< 1,000 pages)

--concurrency=5 --max-pages=1000 --max-depth=10

For Medium Sites (1,000 - 10,000 pages)

--concurrency=10 --max-pages=10000 --max-depth=5

For Large Sites (> 10,000 pages)

--concurrency=20 --max-pages=50000 --max-depth=3
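
The same profiles apply to programmatic use: --concurrency corresponds to the Fetcher 'concurrency' option, while --max-pages and --max-depth are the second and third arguments to Crawler::crawl(). The large-site profile above would look roughly like this:

use IProDev\Sitemap\Fetcher;
use IProDev\Sitemap\Crawler;
use IProDev\Sitemap\RobotsTxt;

$fetcher = new Fetcher(['concurrency' => 20]);
$robots  = RobotsTxt::fromUrl('https://www.example.com', $fetcher);
$crawler = new Crawler($fetcher, $robots);
$pages   = $crawler->crawl('https://www.example.com', 50000, 3);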

🛡️ Error Handling

The library includes comprehensive error handling:

  • Invalid URLs: Validates all URLs before processing
  • Network Errors: Gracefully handles timeouts and connection failures
  • Memory Management: Efficient memory usage for large sites
  • File System Errors: Proper validation and error messages
  • Robots.txt Parsing: Handles malformed robots.txt files

📁 Generated Files

The generator creates the following files:

  • sitemap-1.xml - First sitemap file
  • sitemap-1.xml.gz - Compressed version
  • sitemap-2.xml.gz - Additional files if needed
  • sitemap-index.xml - Index file listing all sitemaps

🔒 Security Considerations

  • Path traversal prevention
  • URL validation and sanitization
  • Safe XML generation with proper escaping
  • Robots.txt respect
  • Meta robots tag support
  • SSL certificate verification

📊 Performance Tips

  1. Increase Concurrency: Raise --concurrency (up to 100) for faster crawling
  2. Reduce Max Depth: Lower --max-depth to focus on the most important pages
  3. Memory: Ensure PHP has adequate memory (memory_limit) for large sites
  4. Network: A fast, stable internet connection is recommended
  5. Robots.txt: A well-maintained robots.txt reduces unnecessary requests

🤝 Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Write tests for new features
  4. Follow PSR-12 coding standards
  5. Submit a pull request

📄 License

MIT License - see LICENSE.md for details

πŸ™ Credits

Created by iprodev - https://github.com/iprodev

📞 Support


Made with ❤️ by iprodev
