Add ELOG scraper and deployment for FNAL dCache ELOG using FNAL Ollama server #456
Open
nhduongvn wants to merge 2 commits into archi-physics:dev
Conversation
- Add ElogScraper to crawl ELOG logbooks (pagination, entry parsing, structured metadata extraction: tech, category, node, incident_date, etc.)
- Support an explicit `elog-<url>` prefix in input lists for unambiguous ELOG URL detection, alongside the existing heuristic (`_is_elog_url`)
- Update the cms-comp-ops agent prompt with ELOG tool guidance: use the tech: field for person queries, cite url metadata (not internal hashes), clarify that [N] are result indices rather than ELOG entry numbers, and note the 5-result limit
- Add examples/deployments/basic-ollama-fnal with a config targeting ollama.fnal.gov and the FNAL dCache ELOG as a data source

Assisted by Claude Sonnet 4.6
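The `elog-<url>` prefix convention described above could be resolved along these lines. This is a minimal sketch; the helper name `resolve_input_url` and its return shape are illustrative and not taken from the PR, whose actual detection also falls back to the `_is_elog_url` heuristic:

```python
def resolve_input_url(raw: str) -> tuple[str, bool]:
    """Return (url, is_elog); an explicit 'elog-' prefix overrides any heuristic."""
    prefix = "elog-"
    if raw.startswith(prefix):
        # Strip the marker so the scraper receives a plain URL.
        return raw[len(prefix):], True
    # In the real code a heuristic such as _is_elog_url(raw) would run here.
    return raw, False
```

With this shape, an input-list line like `elog-https://dbweb0.fnal.gov/ELog/` is routed unambiguously to the ELOG collection path while plain URLs keep their existing handling.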
Contributor
Pull request overview
Adds first-class scraping support for ELOG logbooks and wires it into the existing ScraperManager, along with an example FNAL/Ollama deployment configuration and agent guidance.
Changes:
- Extend ScraperManager to detect elog- URLs (and simple heuristics) and run an ELOG collection step.
- Introduce ElogScraper integration to crawl ELOG index pages, discover entries, and persist each entry as a ScrapedResource.
- Add example deployment config, input lists, and prompt/agent guidance updates for CMS Comp Ops usage.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| src/data_manager/collectors/scrapers/scraper_manager.py | Adds ELOG config parsing, URL classification, and a new collect_elog collection path. |
| src/data_manager/collectors/scrapers/integrations/elog_scraper.py | New requests/BeautifulSoup-based crawler for ELOG pagination and entry extraction. |
| examples/deployments/basic-ollama-fnal/miscellanea.list | Example input list content (mostly commented). |
| examples/deployments/basic-ollama-fnal/dcache-elog.list | Example ELOG-prefixed logbook URL input list. |
| examples/deployments/basic-ollama-fnal/config.yaml | Example deployment config enabling ELOG scraping options and referencing input lists. |
| examples/deployments/basic-ollama-fnal/condense.prompt | Adds condense prompt template for the example deployment. |
| examples/deployments/basic-ollama-fnal/agent.prompt | Adds example agent prompt. |
| examples/agents/cms-comp-ops.md | Adds guidance for using metadata tools with ELOG-derived fields/URLs. |
Comment on lines +160 to +162

```python
def collect_elog(self, persistence: PersistenceService, extra_urls: List[str] = []) -> int:
    """Collect all entries from configured ELOG logbooks.
```
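A reviewer would likely flag the mutable default argument (`extra_urls: List[str] = []`) in the quoted signature: the list is created once at definition time and shared across calls. A minimal sketch of the usual fix, reduced to a free function for illustration (the method body shown is not the PR's actual implementation):

```python
from typing import List, Optional

def collect_elog(extra_urls: Optional[List[str]] = None) -> List[str]:
    """Illustrative reduction: None as the sentinel avoids a shared default list."""
    # A fresh list is built on every call, so mutations by one caller
    # can never leak into the next call's default.
    urls_to_scrape: List[str] = list(extra_urls) if extra_urls else []
    return urls_to_scrape
```

This keeps the call sites unchanged (`collect_elog()` and `collect_elog(["https://…"])` both still work) while removing the shared-state hazard.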
Comment on lines +171 to +181

```python
urls_to_scrape: List[str] = list(extra_urls)
if self.elog_enabled:
    urls_to_scrape.append(self.elog_config.get("url"))

if not urls_to_scrape:
    return 0

total = 0
for url in urls_to_scrape:
    cfg = {**self.elog_config, "url": url}
    scraper = ElogScraper(cfg)
```
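One subtlety in the quoted block: `self.elog_config.get("url")` returns `None` when the key is missing, and that `None` would be appended and later passed to the scraper. A guarded version could look like this sketch (a free function standing in for the method; names mirror the PR's attributes but the function itself is illustrative):

```python
from typing import Any, Dict, List

def build_urls_to_scrape(
    elog_enabled: bool, elog_config: Dict[str, Any], extra_urls: List[str]
) -> List[str]:
    """Illustrative guard: only append the configured URL when it is actually set."""
    urls: List[str] = list(extra_urls)
    configured = elog_config.get("url")
    # Skip empty strings and missing keys alike, so no None/"" ever
    # reaches ElogScraper.
    if elog_enabled and configured:
        urls.append(configured)
    return urls
```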
Comment on lines +33 to +40

```python
self.base_url = config.get("url", "").rstrip("/") + "/"
self.max_entries: Optional[int] = config.get("max_entries")
self.verify_ssl = config.get("verify_ssl", False)
self._session = requests.Session()
if not self.verify_ssl:
    import urllib3
    urllib3.disable_warnings()
```
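The quoted constructor silences the warning but, as shown, the `verify_ssl` flag still has to be passed on every request. One common alternative is to set `verify` once on the `Session` so every request inherits it; a sketch under that assumption (not how the PR necessarily wires it):

```python
import requests

def make_session(verify_ssl: bool) -> requests.Session:
    """Illustrative: configure TLS verification once, at session level."""
    session = requests.Session()
    # Session.verify is the default for all requests made through this session.
    session.verify = verify_ssl
    if not verify_ssl:
        # Suppress only the warning class that unverified HTTPS triggers.
        import urllib3
        urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
    return session
```

Narrowing `disable_warnings()` to `InsecureRequestWarning` also avoids hiding unrelated urllib3 warnings.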
Comment on lines +149 to +150

```python
for part in text.split():
    pass  # entry_time already in meta via hidden inputs if needed
```
```
# This is a very general prompt for condensing histories, so for base installs it will not need to be modified
#
# All condensing prompts must have the following tags in them, which will be filled with the appropriate information:
# {chat_history}
```
```yaml
@@ -0,0 +1,54 @@
# Basic configuration file for an Archi deployment
local:
  enabled: true
  base_url: https://ollama.fnal.gov  # make sure this matches your ollama server URL!
  mode: ollama  # calls the LangChain class ChatOllama; the other option is openai_compat, which calls the LangChain class ChatOpenAI
```
Comment on lines +63 to +67

```python
def _discover_entry_urls(self) -> list[str]:
    """Return deduplicated entry URLs collected from all index pages."""
    seen: set[str] = set()
    result: list[str] = []
```
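The `seen` set plus `result` list pairing in the quoted snippet is the standard order-preserving deduplication idiom; its core can be sketched as a standalone helper (illustrative, not the PR's exact loop body, which also crawls index pages):

```python
def dedupe_preserving_order(urls: list[str]) -> list[str]:
    """First occurrence wins; discovery order of entry URLs is kept."""
    seen: set[str] = set()
    result: list[str] = []
    for url in urls:
        if url not in seen:   # O(1) membership check via the set
            seen.add(url)
            result.append(url)
    return result
```

Keeping the list alongside the set matters here because entry order reflects crawl order, which a plain `set()` would discard.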
Collaborator

I left some initial comments and I'm testing it out using FNAL's Storage Archi instance. However, I think a big part that is missing is the base-config.yaml and setting up the config manager so that the ELOG configuration is propagated there as well.
Author

Thank you Juan Pablo. I am taking a look at these comments.