feature: Introduce LLM Scraper #7

@rafmartom

Description

Current Behaviour

Scraping each documentation set currently requires a separate ruleset per documentation. The process needs to be unified as much as possible; a template is available at ./documentations/template.sh. The rules should be unified as far as practical, while always keeping the flexibility to create your own, since there are edge cases where spidering is not easily done, or parsing differs (e.g. pages that need some JS, or something different to be checked, etc.).
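As an illustration only (the real template lives in ./documentations/template.sh and its actual fields may differ), a unified ruleset could be expressed as a small mapping from subpath prefixes to selector pairs, with a fallback for pages that match no specific rule:

```python
# Hypothetical ruleset shape -- the field names here are illustrative,
# not taken from ./documentations/template.sh.
RULESET = {
    "base_url": "https://example.com/docs/",
    "default": {"title": "h1", "body": "main"},  # fallback selectors
    "subpaths": {
        "api/": {"title": "h1.api-title", "body": "article"},
        "guides/": {"title": "header h1", "body": "div.guide-body"},
    },
}

def selectors_for(path: str) -> dict:
    """Return the selector pair for a page path, longest-prefix match first."""
    for prefix in sorted(RULESET["subpaths"], key=len, reverse=True):
        if path.startswith(prefix):
            return RULESET["subpaths"][prefix]
    return RULESET["default"]
```

With this shape, a new documentation only needs its own RULESET; the per-subpath entries cover the pages whose selectors differ from the default.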

Still, for the bulk of the documentations the process is straightforward, and it currently requires human intervention at the following points:

  • Given a link, or a list of links
  • Select the subpaths of interest from the spidered .html
  • Apply a ruleset of selectors to all the .html pages, and if they differ, create per-subpath rules.
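The filtering step above (selecting subpaths of interest from a list of spidered links) can be sketched with stdlib-only URL parsing; the function and prefix names here are assumptions, not part of this repo:

```python
from urllib.parse import urlparse

def filter_subpaths(links, allowed_prefixes):
    """Keep only links whose URL path starts with one of the allowed subpaths."""
    kept = []
    for link in links:
        path = urlparse(link).path.lstrip("/")
        if any(path.startswith(prefix) for prefix in allowed_prefixes):
            kept.append(link)
    return kept

links = [
    "https://example.com/docs/api/intro.html",
    "https://example.com/docs/guides/setup.html",
    "https://example.com/blog/post.html",
]
# Keep only the documentation subpaths, dropping e.g. blog pages.
docs_only = filter_subpaths(links, ["docs/api/", "docs/guides/"])
```

This is exactly the step the issue proposes delegating to an LLM: deciding which prefixes belong in allowed_prefixes for a given documentation.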

Intended Behaviour

Make the LLMs assist in selecting the subpaths of interest for the Filtering Stage, and in determining the Title and Body selectors for each subpath.
Make the LLM update the ruleset over time.
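One way the LLM assistance could be wired in is: prompt the model with a sample page, ask for selectors in a fixed JSON shape, and fold the reply into the ruleset. Everything below (function names, the JSON reply format, the ruleset shape) is an assumption for illustration, not an existing API of this repo:

```python
import json

def build_selector_prompt(html_snippet: str) -> str:
    """Hypothetical prompt asking the model for Title/Body CSS selectors."""
    return (
        "Given this documentation page HTML, reply with JSON of the form "
        '{"title": "<css-selector>", "body": "<css-selector>"}:\n'
        + html_snippet
    )

def merge_llm_reply(ruleset: dict, subpath: str, reply: str) -> dict:
    """Fold the model's selector suggestion into the ruleset, so the
    ruleset can be updated over time as new subpaths are analysed."""
    suggestion = json.loads(reply)
    ruleset.setdefault("subpaths", {})[subpath] = {
        "title": suggestion["title"],
        "body": suggestion["body"],
    }
    return ruleset

ruleset = {"subpaths": {}}
reply = '{"title": "h1", "body": "main"}'  # stand-in for a real LLM response
merge_llm_reply(ruleset, "api/", reply)
```

Keeping the reply constrained to a small JSON schema makes the update step deterministic, so the human only needs to review the proposed selectors rather than write them.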
