Crawler skip_uri_patterns is incomplete #35

I was using tarantula and noticed that it was trying to crawl a "tel:" link, ultimately failing because the path of the parsed URI was nil. I looked into it and saw that a simple fix would be to add tel to the [skip_uri_patterns](https://github.com/relevance/tarantula/blob/master/lib/relevance/tarantula/crawler.rb#L32) list in Crawler's initialize function. However, the crawler would have the same issue with other [URI schemes](http://en.wikipedia.org/wiki/URI_scheme) that aren't listed in skip_uri_patterns, so it seems like a more general approach may be better. Do you think it would make more sense to skip URIs that start with any scheme name, or is there a reason you specifically chose to only skip the javascript, mailto, and http schemes?
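To make the "skip any scheme" idea concrete, here is a minimal sketch of what such a check might look like. The constant `ANY_SCHEME_PATTERN` and the helper `skip_uri?` are hypothetical names used only for illustration and are not part of tarantula. The pattern follows the scheme grammar in RFC 3986 (a letter, then letters, digits, `+`, `-`, or `.`, then a colon).

```ruby
# Hypothetical sketch, not tarantula's actual code: treat any link that
# starts with a URI scheme name followed by a colon as one to skip.
# Scheme grammar per RFC 3986: ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
ANY_SCHEME_PATTERN = /\A[a-z][a-z0-9+.\-]*:/i

# Returns true when the link begins with a scheme (tel:, mailto:, http:, ...),
# false for relative links such as "/users/1".
def skip_uri?(uri)
  ANY_SCHEME_PATTERN.match?(uri.to_s)
end

%w[tel:555-1234 mailto:someone@example.com http://example.com/ /users/1].each do |link|
  puts "#{link}: #{skip_uri?(link) ? 'skip' : 'crawl'}"
end
```

If this approach were adopted, a single pattern of this kind could presumably replace the individual javascript, mailto, and http entries, since each of those also begins with a scheme name. One trade-off is that a relative link whose first path segment contains a colon (for example `foo:bar`) would also be skipped.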

