Package, which extracts paths and attributes from the image, anchor and other tags of the provided html.
composer require sebastiansulinski/path-extractorYou can instantiate Extractor either by using new keyword or static make method.
Constructor takes and optional argument, which represents the string to be parsed.
use SSD\PathExtractor\Extractor;
$extractor = new Extractor;
$extractor = new Extractor($html);
$extractor = Extractor::make();
$extractor = Extractor::make($html);Apart from being able to pass your string via constructor, you can also use the Extractor::for method to set it on the instance.
$extractor = new Extractor;
$extractor->for($html);To extract all images use the Extractor::extract(Image::class) method.
use \SSD\PathExtractor\Tags\Image;
$html = '<img src="/media/image.jpg" alt="My image">';
$html = .'<img src="/media/image2.png" alt="My image 2">';
$images = Extractor::make($html)->extract(Image::class);The above will return array containing the collection of \SSD\PathExtractor\Tags\Image class instances with properties src and alt available.
To extract all anchors use the Extractor::extract(Anchor::class) method.
use \SSD\PathExtractor\Tags\Anchor;
$html = '<a href="/media/files/one.pdf" target="_blank">Document one</a>';
$html = .'<a href="/media/files/two.docx" title="Word document">Word document</a>';
$anchors = Extractor::make($html)->extract(Anchor::class);The above will return array containing the collection of \SSD\PathExtractor\Tags\Anchor class instances with properties href, target, title and nodeValue available.
To extract all anchors use the Extractor::extract(Script::class) method.
use \SSD\PathExtractor\Tags\Script;
$html = '<script src="/media/script/one.js" async></script>';
$html = .'<script src="/media/script/two.js" async defer></script>';
$html = .'<script src="/media/script/three.js"></script>';
$scripts = Extractor::make($html)->extract(Script::class);The above will return array containing the collection of \SSD\PathExtractor\Tags\Script class instances with properties src, async, and defer available - last two with boolean true / false set based on whether they are present or not.
Sometimes you might want to only extract images or anchors with certain extensions.
To do this use the Extractor::withExtensions() method and pass the required extensions as argument.
$images = Extractor::make($html)->withExtensions('jpg')->extract(Image::class);
$anchors = Extractor::make($html)->withExtensions(['pdf', 'docx'])->extract(Anchor::class);
$anchors = Extractor::make($html)->withExtensions('pdf', 'docx')->extract(Anchor::class);Sometimes you might wish to prepend the protocol, domain name and even a port to the relative paths extracted from your html.
To do this, use the Extractor::withUrl() method.
$html = '<img src="/media/image.jpg" alt="My image">';
$html .= '<img src="https://ssdtutorials.com/media/image2.jpg" alt="My image 2">';
$images = Extractor::make($html)->withUrl('https://mywebsite.com')->extract(Image::class);The above will return an array containing two instances of \SSD\PathExtractor\Tags\Image - one with src set to https://mywebsite.com/media/image.jpg and the other to https://ssdtutorials.com/media/image2.jpg. Please note - it will not replace the paths which already contain protocol and domain.
If you'd like your input to first undergo the purification, you can use the Extractor::withTidy() method.
This method takes 2 optional arguments: array $config = [], which allows you to overwrite default tidy extension configuration as well as string $encoding = 'utf8' should you need to change the encoding.
By default config is set to
[
'clean' => 'yes',
'output-html' => 'yes',
'wrap' => 0,
]More on config options at HTML Tidy Configuration Options.
If you decide NOT to use tidy to purify your input, where for instance you will do this before passing the html to the constructor or for method and if the provided html contains invalid syntax, the \SSD\PathExtractor\InvalidHtmlException will be thrown - so make sure you catch it and act accordingly.
Each implementation of \SSD\PathExtractor\Tags\Tag will have their own, unique set of properties available
\SSD\PathExtractor\Tags\Anchor
- href
- target
- title
- rel
- nodeValue (represents text in between opening and closing a tag)
\SSD\PathExtractor\Tags\Image
- src
- alt
- width
- height
\SSD\PathExtractor\Tags\Script
- src
- type
- charset
- async
- defer
\SSD\PathExtractor\Tags\Link
- href
- type
- relOnce you have extracted the collection of resources, you can then return an html tag for each one by simply casting it to string or by calling the tag() method on it.
$html = '<img src="/media/image.jpg" alt="My image">';
$html = .'<img src="/media/image2.png" alt="My image 2">';
$tag1 = (string)Extractor::make($html)->withExtensions('jpg')->extract(Image::class)[0];
$tag2 = Extractor::make($html)->withExtensions('jpg')->extract(Image::class)[0]->tag();Both of the above will return
<img src="/media/image.jpg" alt="My image">You can also obtain array representation of each instance by calling Tag::toArray() method on it
Extractor::make($html)->withExtensions('jpg')->extract(Image::class)[0]->toArray()If you need more tag types i.e. link - simply add new class that extends \SSD\PathExtractor\Tags\Tag and implement the abstract methods required by it.
use SSD\PathExtractor\Tags\Tag;
use SSD\PathExtractor\Tags\Type;
class Link extends Tag
{
/**
* Get tag name.
*
* @return string
*/
static public function tagName(): string
{
return 'link';
}
/**
* Get path attribute.
*
* @return string
*/
static public function pathAttribute(): string
{
return 'href';
}
/**
* Get available attributes.
*
* @return array
*/
static public function availableAttributes(): array
{
return [
'href' => Type::STRING,
'type' => Type::STRING,
'rel' => Type::STRING,
];
}
/**
* Get formatted tag.
*
* @return string
*/
public function tag(): string
{
return '<link'.$this->tagAttributes('href', 'type', 'rel').'>';
}
}$string = '<img src="/media/image/one.jpg" alt="Image one">';
$string .= '<img src="https://mysite.com/media/image/two.jpg" alt="Image two">';
$string .= '<a href="/media/files/two.pdf" '.
'target="_blank" title="Document">Document</a>';
$string .= '<script src="/media/script/three.js" async></script>';
$string .= '<link href="/media/link/three.css" rel="stylesheet">';
$extractor = Extractor::make($string);
$images = array_map(function (Tag $tag) {
return $tag->path();
}, $extractor->extract(Image::class));
$anchors = array_map(function (Tag $tag) {
return $tag->path();
}, $extractor->extract(Anchor::class));
$scripts = array_map(function (Tag $tag) {
return $tag->path();
}, $extractor->extract(Script::class));
$links = array_map(function (Tag $tag) {
return $tag->path();
}, $extractor->extract(Link::class));
$this->assertEquals([
'/media/image/one.jpg',
'https://mysite.com/media/image/two.jpg',
'/media/files/two.pdf',
'/media/script/three.js',
'/media/link/three.css',
], array_merge($images, $anchors, $scripts, $links));