Add trafilatura as alternative to readability / mercury / html2text / defuddle #5

@pirate

Add a plugin similar to the existing readability, mercury (aka postlight-parser), and html2text plugins, but built on trafilatura instead:

https://github.com/adbar/trafilatura

We don't need its crawling/discovery features, only the single-URL-in, extracted-output-out features. Ideally it should expose env vars to toggle the various output formats it supports, including:

  • markdown
  • CSV
  • html
  • plain text
  • any others that might be useful
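
A rough sketch of how the env var toggles could map onto trafilatura's `extract()` call. The env var names, default values, and output filenames here are hypothetical placeholders, not an agreed-upon config; the `output_format` strings are the ones trafilatura documents for `extract()`, though `html` and `markdown` may require a reasonably recent release.

```python
import os

# Hypothetical env var names, defaults, and filenames (not settled anywhere yet).
# Each entry: env var -> (trafilatura output_format, output filename, default on/off)
FORMAT_ENV_VARS = {
    "TRAFILATURA_OUTPUT_MARKDOWN": ("markdown", "content.md", True),
    "TRAFILATURA_OUTPUT_CSV": ("csv", "content.csv", False),
    "TRAFILATURA_OUTPUT_HTML": ("html", "content.html", False),
    "TRAFILATURA_OUTPUT_TXT": ("txt", "content.txt", False),
    "TRAFILATURA_OUTPUT_JSON": ("json", "content.json", False),
}

TRUTHY = ("1", "true", "yes", "on")


def enabled_formats(environ=None):
    """Return [(output_format, filename), ...] for every toggle that is on."""
    environ = os.environ if environ is None else environ
    selected = []
    for var, (output_format, filename, default) in FORMAT_ENV_VARS.items():
        raw = environ.get(var)
        is_on = default if raw is None else raw.strip().lower() in TRUTHY
        if is_on:
            selected.append((output_format, filename))
    return selected


def extract_outputs(html, url=None, environ=None):
    """Run trafilatura once per enabled format on already-downloaded HTML."""
    import trafilatura  # imported lazily so the config helpers work without it

    results = {}
    for output_format, filename in enabled_formats(environ):
        text = trafilatura.extract(html, url=url, output_format=output_format)
        if text is not None:  # extract() returns None when nothing usable is found
            results[filename] = text
    return results
```

Calling `extract()` once per format re-parses the page each time; that seems acceptable for a first version, but a single parse with multiple serializations might be worth exploring later.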

We should wire it up to take in the existing HTML already extracted by the singlefile output, Chrome DOM output, wget output, etc., similar to readability / mercury, instead of re-downloading the page from scratch.
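
For illustration, picking the HTML source could look something like the sketch below. The relative paths and their priority order are assumptions made up for this example; the real plugin would presumably reuse whatever lookup readability / mercury already use.

```python
from pathlib import Path

# Hypothetical output locations, in preference order; the actual paths
# depend on how the singlefile / dom / wget plugins lay out their files.
CANDIDATE_HTML_PATHS = (
    "singlefile/singlefile.html",
    "dom/output.html",
    "wget/index.html",
)


def find_existing_html(snapshot_dir, candidates=CANDIDATE_HTML_PATHS):
    """Return the first non-empty HTML file left by another extractor, or None."""
    root = Path(snapshot_dir)
    for relative in candidates:
        path = root / relative
        if path.is_file() and path.stat().st_size > 0:
            return path
    return None


def load_existing_html(snapshot_dir):
    """Read previously saved HTML so the page is never fetched again."""
    path = find_existing_html(snapshot_dir)
    if path is None:
        return None
    # errors="replace" keeps going on pages with broken encodings
    return path.read_text(encoding="utf-8", errors="replace")
```

The returned string can be passed straight to `trafilatura.extract()`, which accepts HTML content directly, so trafilatura's own downloader is never involved.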
