
Add ELOG scraper and deployment for FNAL dCache ELOG using FNAL Ollama server #456

Open
nhduongvn wants to merge 2 commits into archi-physics:dev from nhduongvn:fnal-dcache-elog

Conversation

@nhduongvn

  • Add ElogScraper to crawl ELOG logbooks (pagination, entry parsing, structured metadata extraction: tech, category, node, incident_date, etc.)
  • Support explicit elog-<url> prefix in input lists for unambiguous ELOG URL detection alongside the existing heuristic (_is_elog_url)
  • Update cms-comp-ops agent prompt with ELOG tool guidance: use tech: field for person queries, cite url metadata (not internal hashes), clarify that [N] are result indices not ELOG entry numbers, note 5-result limit
  • Add examples/deployments/basic-ollama-fnal with config targeting ollama.fnal.gov and the FNAL dCache ELOG as a data source

Assisted by Claude Sonnet 4.6
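
The prefix rule in the second bullet can be sketched as follows. This is a minimal illustration and not the code from this PR: `classify_input_line` and `_looks_like_elog` are hypothetical names, and the heuristic shown (a path segment equal to `elog`) is only an assumption, since the real `_is_elog_url` logic is not reproduced in this conversation.

```python
from urllib.parse import urlparse

ELOG_PREFIX = "elog-"  # explicit marker described in the PR summary


def _looks_like_elog(url: str) -> bool:
    """Hypothetical heuristic: any path segment equal to 'elog' (assumption)."""
    path = urlparse(url).path.lower()
    return "elog" in [segment for segment in path.split("/") if segment]


def classify_input_line(line: str) -> tuple[str, str]:
    """Return (kind, url) for one input-list line; kind is 'elog' or 'web'."""
    entry = line.strip()
    if entry.startswith(ELOG_PREFIX):
        # The explicit prefix wins and is stripped before the URL is used.
        return "elog", entry[len(ELOG_PREFIX):]
    if _looks_like_elog(entry):
        # Fall back to the heuristic for unprefixed lines.
        return "elog", entry
    return "web", entry
```

Checking the prefix first keeps the heuristic as a fallback only, so a logbook whose URL does not contain a recognizable marker can still be routed to the ELOG scraper.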

@pmlugato added enhancement (New feature or request) labels Feb 23, 2026
@swinney requested a review from Copilot March 16, 2026 19:48
Contributor

Copilot AI left a comment


Pull request overview

Adds first-class scraping support for ELOG logbooks and wires it into the existing ScraperManager, along with an example FNAL/Ollama deployment configuration and agent guidance.

Changes:

  • Extend ScraperManager to detect elog- URLs (and simple heuristics) and run an ELOG collection step.
  • Introduce ElogScraper integration to crawl ELOG index pages, discover entries, and persist each entry as a ScrapedResource.
  • Add example deployment config, input lists, and prompt/agent guidance updates for CMS Comp Ops usage.
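
As an illustration of the entry-parsing step, the sketch below collects hidden form inputs from an entry page into a metadata dictionary. It rests on assumptions: the PR's scraper uses BeautifulSoup and its parsing code is not shown here, so this version substitutes the standard-library `html.parser`, and the idea that fields such as `tech` or `category` arrive as hidden inputs is inferred only from a code comment quoted later in this review. `HiddenInputCollector` and `extract_hidden_metadata` are hypothetical names.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.fields: dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr_map = dict(attrs)
        is_hidden = (attr_map.get("type") or "").lower() == "hidden"
        name = attr_map.get("name")
        if is_hidden and name:
            # Missing values are stored as empty strings.
            self.fields[name] = attr_map.get("value") or ""


def extract_hidden_metadata(html: str) -> dict[str, str]:
    """Return the hidden-input fields found in an HTML fragment."""
    collector = HiddenInputCollector()
    collector.feed(html)
    return collector.fields
```

Each resulting dictionary could then be attached to the persisted resource as metadata; how the PR actually maps fields onto a ScrapedResource is not shown in this conversation.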

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| src/data_manager/collectors/scrapers/scraper_manager.py | Adds ELOG config parsing, URL classification, and a new collect_elog collection path. |
| src/data_manager/collectors/scrapers/integrations/elog_scraper.py | New requests/BeautifulSoup-based crawler for ELOG pagination + entry extraction. |
| examples/deployments/basic-ollama-fnal/miscellanea.list | Example input list content (mostly commented). |
| examples/deployments/basic-ollama-fnal/dcache-elog.list | Example ELOG-prefixed logbook URL input list. |
| examples/deployments/basic-ollama-fnal/config.yaml | Example deployment config enabling ELOG scraping options and referencing input lists. |
| examples/deployments/basic-ollama-fnal/condense.prompt | Adds condense prompt template for the example deployment. |
| examples/deployments/basic-ollama-fnal/agent.prompt | Adds example agent prompt. |
| examples/agents/cms-comp-ops.md | Adds guidance for using metadata tools with ELOG-derived fields/URLs. |


Comment on lines +160 to +162
def collect_elog(self, persistence: PersistenceService, extra_urls: List[str] = []) -> int:
    """Collect all entries from configured ELOG logbooks.
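
The text of the review comment on this signature is not reproduced here. One thing a reviewer might plausibly flag is the mutable default `extra_urls: List[str] = []`, so the sketch below shows the general Python pitfall and the usual `None` sentinel alternative. The function names are hypothetical and unrelated to the PR; note also that the body excerpt that follows copies the argument with `list(extra_urls)`, so the default would not actually be mutated there.

```python
from typing import List, Optional


def append_shared(item: str, bucket: List[str] = []) -> List[str]:
    """Pitfall: the default list is created once and shared across calls."""
    bucket.append(item)
    return bucket


def append_fresh(item: str, bucket: Optional[List[str]] = None) -> List[str]:
    """Idiomatic form: use None as the sentinel and build a new list per call."""
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket
```

With the sentinel form, two successive calls without a `bucket` argument each return a one-element list, whereas the shared-default form keeps growing the same list.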

Comment on lines +171 to +181
urls_to_scrape: List[str] = list(extra_urls)
if self.elog_enabled:
    urls_to_scrape.append(self.elog_config.get("url"))

if not urls_to_scrape:
    return 0

total = 0
for url in urls_to_scrape:
    cfg = {**self.elog_config, "url": url}
    scraper = ElogScraper(cfg)
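
In the excerpt above, `self.elog_config.get("url")` returns `None` when no `url` key is configured, and a URL listed both in the config and in `extra_urls` would be scraped twice. Whether the review comment raised either point is not shown here. A minimal sketch of a more defensive target list, with `build_elog_targets` as a hypothetical helper name:

```python
from typing import Any, List, Mapping, Optional


def build_elog_targets(
    elog_config: Mapping[str, Any],
    extra_urls: Optional[List[str]] = None,
    enabled: bool = True,
) -> List[str]:
    """Combine extra URLs with the configured one, skipping empties and repeats."""
    candidates: List[Optional[str]] = list(extra_urls or [])
    if enabled:
        candidates.append(elog_config.get("url"))

    seen: set = set()
    targets: List[str] = []
    for url in candidates:
        if not url or url in seen:
            continue  # drop None, empty strings, and duplicates
        seen.add(url)
        targets.append(url)
    return targets
```

The order of first appearance is preserved, so explicitly listed URLs are still processed before the configured one.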
Comment on lines +33 to +40
self.base_url = config.get("url", "").rstrip("/") + "/"
self.max_entries: Optional[int] = config.get("max_entries")
self.verify_ssl = config.get("verify_ssl", False)
self._session = requests.Session()
if not self.verify_ssl:
    import urllib3
    urllib3.disable_warnings()
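
The excerpt above defaults `verify_ssl` to `False` and then silences urllib3 warnings for the whole process. The review comment itself is not reproduced, but a safer pattern is to verify certificates unless the config explicitly opts out. The sketch below shows that default using the standard-library `ssl` module rather than the `requests` session the PR uses; `build_ssl_context` is a hypothetical helper, not part of the PR.

```python
import ssl
from typing import Any, Mapping


def build_ssl_context(config: Mapping[str, Any]) -> ssl.SSLContext:
    """Return a TLS context that verifies certificates unless told otherwise."""
    verify = config.get("verify_ssl", True)  # secure default (assumption)
    context = ssl.create_default_context()
    if not verify:
        # Hostname checking must be disabled before verification is turned off.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context
```

With a `requests.Session`, the analogous choice would be to default the session's `verify` setting to `True` and to suppress warnings only when verification is deliberately disabled.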

Comment on lines +149 to +150
for part in text.split():
    pass  # entry_time already in meta via hidden inputs if needed
# This is a very general prompt for condensing histories, so for base installs it will not need to be modified
#
# All condensing prompts must have the following tags in them, which will be filled with the appropriate information:
# {chat_history}
@@ -0,0 +1,54 @@
# Basic configuration file for an Archi deployment
local:
  enabled: true
  base_url: https://ollama.fnal.gov # make sure this matches your Ollama server URL!
  mode: ollama # calls the LangChain class ChatOllama; the other option is openai_compat, which calls the LangChain class ChatOpenAI
Comment on lines +63 to +67
def _discover_entry_urls(self) -> list[str]:
    """Return deduplicated entry URLs collected from all index pages."""
    seen: set[str] = set()
    result: list[str] = []
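
Only the first lines of `_discover_entry_urls` are quoted, so the rest of the method is unknown. The sketch below shows one way such a paginated, deduplicating crawl could be structured, under stated assumptions: page fetching is injected as a callable returning `(entry_urls, next_page_url)`, which is a hypothetical interface, and `max_entries` mirrors the config key seen in the constructor excerpt.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical fetcher type: page URL -> (entry URLs on the page, next page URL or None)
PageFetcher = Callable[[str], Tuple[List[str], Optional[str]]]


def discover_entry_urls(
    start_url: str,
    fetch_page: PageFetcher,
    max_entries: Optional[int] = None,
) -> List[str]:
    """Walk index pages in order and return entry URLs without duplicates."""
    seen: set = set()
    result: List[str] = []
    visited_pages: set = set()

    page_url: Optional[str] = start_url
    while page_url and page_url not in visited_pages:
        visited_pages.add(page_url)  # guards against pagination loops
        entry_urls, next_page = fetch_page(page_url)
        for url in entry_urls:
            if url in seen:
                continue
            seen.add(url)
            result.append(url)
            if max_entries is not None and len(result) >= max_entries:
                return result
        page_url = next_page
    return result
```

Injecting the fetcher keeps the traversal logic testable without network access, and tracking visited pages prevents an endless loop if the last index page links back to the first.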

@juanpablosalas
Collaborator

I left some initial comments and I'm testing it out using FNAL's Storage Archi instance. However, I think a big missing piece is updating base-config.yaml and setting up the config manager so that the ELOG configuration is propagated there as well.

@juanpablosalas self-requested a review March 27, 2026 21:07
@nhduongvn
Author

Thank you, Juan Pablo. I am taking a look at these comments.

@pmlugato changed the base branch from main to dev April 3, 2026 18:43

Labels

enhancement New feature or request low prio


4 participants