decide-pdf-scraper

This service gathers the download URLs of new PDFs containing meeting resolutions of local governments. A PDF is considered new if its download URL is not yet present in the triple store as the object of the predicate eli:is_exemplified_by of an ELI Manifestation.
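The "new PDF" rule above can be pictured as a SPARQL ASK query per candidate URL. The following is a minimal sketch, not the service's actual code: the helper name build_known_url_query is hypothetical, and it assumes the download URL is stored as a URI (not a string literal) and that the eli: prefix resolves to the standard ELI ontology namespace.

```python
# Hypothetical helper, for illustration only; the service may check this differently.
ELI = "http://data.europa.eu/eli/ontology#"

# Characters that may not appear inside a SPARQL IRI reference (<...>).
_FORBIDDEN = set('<>"{}|^`\\')


def build_known_url_query(download_url):
    """Return an ASK query that is true when the URL is already known.

    Assumption: the URL is the object of eli:is_exemplified_by as a URI.
    """
    if any(ch in _FORBIDDEN or ord(ch) <= 0x20 for ch in download_url):
        raise ValueError(f"not usable as an IRI: {download_url!r}")
    return (
        f"PREFIX eli: <{ELI}>\n"
        "ASK {\n"
        f"  ?manifestation eli:is_exemplified_by <{download_url}> .\n"
        "}"
    )


if __name__ == "__main__":
    print(build_known_url_query("https://example.org/decisions/2025-10.pdf"))
```

A PDF would count as new when the endpoint answers false to this query.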

Set-up

  1. Add the service to your Semantic.Works application in the docker-compose.yml:

     pdf-scraper:
       image: semanticai/decide-pdf-scraper:0.0.1
       environment:
         TARGET_GRAPH: http://mu.semte.ch/graphs/harvesting
         PUBLICATION_GRAPH: http://mu.semte.ch/graphs/public/pdf
         ALLOW_MU_AUTH_SUDO: "true"
    
  2. The file sparql_config.py lets you configure SPARQL prefixes and URIs. If you want a single graph for input and a single graph for output, set the environment variables TARGET_GRAPH (input) and/or PUBLICATION_GRAPH (output).
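To illustrate how the two environment variables could be picked up, here is a minimal sketch. It is not the content of sparql_config.py: the function name load_graph_config is hypothetical, and the fallback values are simply the graph URIs from the docker-compose.yml example above, not confirmed defaults of the service.

```python
import os

# Assumed fallbacks, copied from the docker-compose.yml example above.
DEFAULT_TARGET_GRAPH = "http://mu.semte.ch/graphs/harvesting"
DEFAULT_PUBLICATION_GRAPH = "http://mu.semte.ch/graphs/public/pdf"


def load_graph_config(env=None):
    """Read the input and output graph URIs from the environment.

    `env` defaults to os.environ; passing a plain dict makes testing easy.
    """
    env = os.environ if env is None else env
    return {
        "target_graph": env.get("TARGET_GRAPH", DEFAULT_TARGET_GRAPH),
        "publication_graph": env.get("PUBLICATION_GRAPH", DEFAULT_PUBLICATION_GRAPH),
    }


if __name__ == "__main__":
    print(load_graph_config())
```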

Running

Run the container using

docker compose up -d # run without -d flag when you don't want to run it in the background

Example

Open your local SPARQL query editor (by default configured to run on http://localhost:8890/sparql as set by lblod/app-decide), and run the following query to create a Task to scrape for new PDFs:

PREFIX adms: <http://www.w3.org/ns/adms#>
PREFIX task: <http://redpencil.data.gift/vocabularies/tasks/>
PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>
PREFIX nfo:  <http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#>
PREFIX nie:  <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#>
PREFIX mu:   <http://mu.semte.ch/vocabularies/core/>

INSERT DATA {

  GRAPH <http://mu.semte.ch/graphs/harvesting> {
    <http://data.lblod.info/id/tasks/demo-pdf-scraping>
      a task:Task ;
      mu:uuid "demo-pdf-scraping" ;
      adms:status <http://redpencil.data.gift/id/concept/JobStatus/scheduled> ;
      task:operation <http://lblod.data.gift/id/jobs/concept/TaskOperation/pdf-scraping> ;
      task:inputContainer <http://data.lblod.info/id/data-container/demo-scraping> ;
      dct:created "2025-10-31T09:00:00Z"^^xsd:dateTime .
  }

  GRAPH <http://mu.semte.ch/graphs/harvesting> {
    <http://data.lblod.info/id/data-container/demo-scraping>
      a nfo:DataContainer ;
      mu:uuid "demo-scraping" ;
      task:hasHarvestingCollection
        <http://lblod.data.gift/id/harvest-collections/demo-collection> .
  }

  GRAPH <http://mu.semte.ch/graphs/harvesting> {
    <http://lblod.data.gift/id/harvest-collections/demo-collection>
      a <http://lblod.data.gift/vocabularies/harvesting/HarvestingCollection> ;
      mu:uuid "demo-collection" ;
      dct:hasPart <http://lblod.data.gift/id/remote-data-objects/demo-source> .
  }

  GRAPH <http://mu.semte.ch/graphs/harvesting> {
    <http://lblod.data.gift/id/remote-data-objects/demo-source>
      a nfo:RemoteDataObject ;
      mu:uuid "demo-source" ;
      nie:url <https://district09.gent/nl/over-ons/wettelijke-documenten/besluiten-overlegorgaan> .  # CHANGE THIS TO THE DESIRED SOURCE
  }
}

Trigger this task using

curl -X POST http://localhost:8080/delta \
  -H "Content-Type: application/json" \
  -d '[
    {
      "inserts": [
        {
          "subject": { "type": "uri", "value": "http://data.lblod.info/id/tasks/demo-pdf-scraping" },
          "predicate": { "type": "uri", "value": "http://www.w3.org/ns/adms#status" },
          "object": { "type": "uri", "value": "http://redpencil.data.gift/id/concept/JobStatus/scheduled" },
          "graph": { "type": "uri", "value": "http://mu.semte.ch/graphs/harvesting" }
        }
      ],
      "deletes": []
    }
  ]'
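The same delta message can also be sent from a script. The sketch below mirrors the curl call using only the Python standard library; the function names build_delta and post_delta are hypothetical, and the endpoint http://localhost:8080/delta is taken from the command above.

```python
import json
import urllib.request

ADMS_STATUS = "http://www.w3.org/ns/adms#status"
SCHEDULED = "http://redpencil.data.gift/id/concept/JobStatus/scheduled"


def build_delta(task_uri, graph="http://mu.semte.ch/graphs/harvesting"):
    """Build the delta payload announcing that a task is scheduled."""
    def uri(value):
        return {"type": "uri", "value": value}

    insert = {
        "subject": uri(task_uri),
        "predicate": uri(ADMS_STATUS),
        "object": uri(SCHEDULED),
        "graph": uri(graph),
    }
    return [{"inserts": [insert], "deletes": []}]


def post_delta(payload, endpoint="http://localhost:8080/delta"):
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    delta = build_delta("http://data.lblod.info/id/tasks/demo-pdf-scraping")
    print(post_delta(delta))
```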

The new PDFs are represented as remote data objects in the triple store, grouped within a harvesting collection that belongs to the task's output data container. The following SPARQL queries can be used to check the results:

Check the tasks (including their output data containers):

PREFIX adms: <http://www.w3.org/ns/adms#>
PREFIX task: <http://redpencil.data.gift/vocabularies/tasks/>

SELECT ?task ?status ?operation ?resultsContainer
WHERE {
  GRAPH <http://mu.semte.ch/graphs/harvesting> {
    ?task a task:Task ;
          adms:status ?status ;
          task:operation ?operation .

    OPTIONAL { ?task task:resultsContainer ?resultsContainer . }
  }
}
ORDER BY ?task

Check the remote objects:

PREFIX nfo:  <http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#>
PREFIX nie:  <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#>
PREFIX mu:   <http://mu.semte.ch/vocabularies/core/>

SELECT ?remoteDataObject ?uuid ?url
FROM <http://mu.semte.ch/graphs/harvesting>
WHERE {
  ?remoteDataObject a nfo:RemoteDataObject ;
                    mu:uuid ?uuid ;
                    nie:url ?url .
}

Check the harvesting collection and its parts:

PREFIX mu: <http://mu.semte.ch/vocabularies/core/>
PREFIX dct: <http://purl.org/dc/terms/>

SELECT ?harvestCollection ?uuid ?part
FROM <http://mu.semte.ch/graphs/harvesting>
WHERE {
  ?harvestCollection a <http://lblod.data.gift/vocabularies/harvesting/HarvestingCollection> ;
                     mu:uuid ?uuid ;
                     dct:hasPart ?part .
}

Check the output data container and its harvesting collection:

PREFIX mu: <http://mu.semte.ch/vocabularies/core/>
PREFIX nfo: <http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#>
PREFIX task: <http://redpencil.data.gift/vocabularies/tasks/>

SELECT ?dataContainer ?uuid ?harvestCollection
FROM <http://mu.semte.ch/graphs/harvesting>
WHERE {
  ?dataContainer a nfo:DataContainer ;
                 mu:uuid ?uuid ;
                 task:hasHarvestingCollection ?harvestCollection .
}
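The check queries above can also be run from a script instead of the query editor. This is a minimal sketch under a few assumptions: the endpoint is the http://localhost:8890/sparql address mentioned earlier, it accepts form-encoded POST requests, and it returns the standard SPARQL JSON results format. The function names run_select and rows_from_results are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def rows_from_results(results):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts.

    Unbound variables (for example from OPTIONAL) are left out of a row.
    """
    variables = results["head"]["vars"]
    return [
        {name: binding[name]["value"] for name in variables if name in binding}
        for binding in results["results"]["bindings"]
    ]


def run_select(query, endpoint="http://localhost:8890/sparql"):
    """Send a SELECT query via POST and return the flattened rows."""
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    request = urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Accept": "application/sparql-results+json",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return rows_from_results(json.load(response))


if __name__ == "__main__":
    query = """
    PREFIX nie: <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#>
    SELECT ?remoteDataObject ?url
    FROM <http://mu.semte.ch/graphs/harvesting>
    WHERE { ?remoteDataObject nie:url ?url . }
    """
    for row in run_select(query):
        print(row)
```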
