@rubenszinho rubenszinho commented Dec 7, 2025

Overview

This tool navigates public CNPq/Lattes researcher profiles to detect Conflicts of Interest (COI) and summarize academic production over a configurable time window (default: 5 years).

Implementation

Dual deployment architecture:

  1. Open WebUI Tool (tool/) - Can be imported directly into Open WebUI's Tools interface for natural language interaction
  2. FastAPI Service (api/) - Standalone REST API deployed on Railway for direct HTTP access and testing

Live API health check endpoint: https://lattes-navigator-api-production.up.railway.app/health

Features

  • Browser automation using the browser-use library with Playwright
  • 7 COI detection rules (co-authorship, advisor-advisee, institutional overlap, etc.)
  • Configurable time window for production analysis
  • Structured JSON output with evidence URLs
  • Health check and debug endpoints for monitoring
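
As an illustration of how a rule could consume the structured output, here is a minimal sketch of a co-authorship check. The function names, the title-based matching, and the `cutoff_year = current_year - time_window` computation are assumptions made for this example, not the tool's actual rule; only the `publications`, `title`, and `year` fields come from the JSON template used in this PR.

```python
from datetime import date
from typing import Optional


def normalize(title: str) -> str:
    """Lowercase and collapse whitespace so trivially different titles match."""
    return " ".join(title.lower().split())


def shared_publications(profile_a: dict, profile_b: dict,
                        time_window: int = 5,
                        current_year: Optional[int] = None) -> list:
    """Return publications from profile_b that also appear in profile_a.

    Hypothetical rule: two entries match when they share a normalized
    title and a year, and the year falls inside the time window.
    """
    current_year = current_year or date.today().year
    cutoff_year = current_year - time_window  # assumed window definition
    seen = {
        (normalize(p["title"]), p["year"])
        for p in profile_a.get("publications", [])
        if p["year"] >= cutoff_year
    }
    return [
        p for p in profile_b.get("publications", [])
        if (normalize(p["title"]), p["year"]) in seen
    ]
```

A non-empty result would be one possible signal of a co-authorship conflict between the two researchers; a real rule would likely need fuzzier title matching.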

Testing

A demo module (demo/) was implemented for local testing and validation of agent navigation procedures. During testing, we observed that the Lattes platform implements CAPTCHA challenges to block automated access. A fallback mechanism was added to detect and log these cases gracefully.
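
A minimal sketch of what such a fallback might look like is shown below. The marker strings, the logger name, and the helper names are assumptions for illustration; the payload shape mirrors the `captcha_blocked` error response defined in the agent prompt.

```python
import logging

logger = logging.getLogger("lattes_navigator")  # hypothetical logger name

# Hypothetical marker strings; the real pages may use different wording.
BLOCK_MARKERS = ("captcha", "access denied", "acesso negado")


def empty_result(warning: str) -> dict:
    """Build the structured error payload carrying a single warning code."""
    return {
        "warnings": [warning],
        "publications": [],
        "projects": [],
        "advising": [],
        "affiliations": [],
        "coauthors": [],
        "last_update": None,
    }


def check_blocked(page_text: str, url: str):
    """Return a captcha_blocked payload if the page looks blocked, else None."""
    lowered = page_text.lower()
    if any(marker in lowered for marker in BLOCK_MARKERS):
        logger.warning("Blocked by CAPTCHA or access control at %s", url)
        return empty_result("captcha_blocked")
    return None
```

Returning a payload instead of raising keeps the response shape identical for blocked and successful runs, so callers only need to inspect `warnings`.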

API reference and test results are documented in:

  • README.md - Usage, endpoints, COI rules
  • TESTING.md - Test procedures and observed results

Deployment

The API is publicly available for testing on Railway:

[Screenshot: Railway deployment]

Note: The Open WebUI instance (https://open-webui-production-de8c.up.railway.app/) is temporarily deactivated because of memory constraints (~2GB RAM). Validation is being conducted directly through the FastAPI endpoint.

Known Limitations

  • The Lattes platform may block automated access with CAPTCHAs
  • Data extraction depends on LLM response quality
  • Rate limiting is implemented to respect server resources
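
Because extraction depends on the quality of the LLM reply, the reply cannot be assumed to be clean JSON. The commit notes in this PR mention checking for fenced json blocks first and then raw JSON, with a response preview in warnings. The sketch below shows one way that could work; the function name and the `invalid_json` warning code are assumptions, not the code in this PR.

```python
import json
import re

FENCE = "`" * 3  # built at runtime so this example nests cleanly in Markdown
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE,
                      re.DOTALL | re.IGNORECASE)


def extract_json(reply: str) -> dict:
    """Try fenced blocks first, then the outermost {...} span of the raw text."""
    candidates = FENCE_RE.findall(reply)
    start, end = reply.find("{"), reply.rfind("}")
    if start != -1 and end > start:
        candidates.append(reply[start:end + 1])
    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    # Hypothetical warning code; a short preview helps when debugging.
    return {"warnings": ["invalid_json: " + reply[:80]]}
```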

Request

curl -s -X POST https://lattes-navigator-api-production.up.railway.app/analyze \
  -H "Content-Type: application/json" \
  -d '{"researchers": [{"name": "Ricardo Marcacini", "lattes_id": "4003190744770195"}], "time_window": 5}' \
  | python3 -m json.tool
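
For scripted testing, the same request can be sent from Python with only the standard library. This is a sketch; the helper names `build_request` and `analyze` and the 300-second timeout are assumptions, while the URL and payload fields are taken from the curl command above.

```python
import json
import urllib.request

API_URL = "https://lattes-navigator-api-production.up.railway.app/analyze"


def build_request(researchers: list, time_window: int = 5) -> urllib.request.Request:
    """Build the POST request without sending it."""
    body = json.dumps(
        {"researchers": researchers, "time_window": time_window}
    ).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(researchers: list, time_window: int = 5, timeout: float = 300.0) -> dict:
    """Send the request and decode the JSON response (performs network I/O)."""
    request = build_request(researchers, time_window)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `analyze([{"name": "Ricardo Marcacini", "lattes_id": "4003190744770195"}])` would issue the same request as the curl command; the generous timeout reflects the fact that browser automation runs can take a while.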

Prompt:

TASK: Extract academic data from Brazilian Lattes CV for "{name}".

IMPORTANT INSTRUCTIONS:
- WAIT at least 5 seconds after each page navigation for JavaScript to load
- Use TEXT-BASED selectors (click buttons by their text like "Buscar", not by index)
- If a click fails, WAIT 3 seconds and retry up to 3 times
- The CNPq website is slow - be patient

NAVIGATION STEPS:
1. Navigate to: https://buscatextual.cnpq.br/buscatextual/busca.do?metodo=apresentar
2. WAIT 5 seconds for page to fully load
3. Look for input field labeled "Nome" (text input for researcher name) and type: {name}
4. Find and click the button containing text "Buscar" (it has a magnifying glass icon with class "mini-ico-lupa")
5. WAIT 5 seconds for search results to appear
6. In results table, find and click on the link containing "{name}" or ID "{lattes_id}"
7. If search fails after 3 attempts, try direct URL: {profile_url}
8. WAIT 5 seconds for profile page to load

BUTTON SELECTOR HINTS:
- Search button has: <span class="mini-ico mini-ico-lupa"></span>Buscar
- Use text "Buscar" to find the button, or look for element containing "mini-ico-lupa" class

ON PROFILE PAGE:
- WAIT for text "{name}" to appear on page (confirms page loaded)
- If you see "Currículo não encontrado" or blank page, return profile_not_found error
- If you see captcha or access denied, return captcha_blocked error

EXTRACT DATA (only years {cutoff_year}-{current_year}):
- Look for section "Artigos completos publicados em periódicos" - extract titles, years, venues
- Look for section "Projetos de pesquisa" - extract project names, years
- Look for section "Orientações" - extract student names, levels (PhD/Masters), years
- Extract current institution from header

RETURN ONLY THIS JSON (no other text):
{{
  "last_update": null,
  "affiliations": [{{"institution": "Institution Name", "department": "Department"}}],
  "publications": [{{"title": "Paper Title", "year": 2024, "type": "journal", "venue": "Journal"}}],
  "projects": [{{"title": "Project Name", "start_year": 2022, "status": "active"}}],
  "advising": [{{"name": "Student Name", "level": "PhD", "year": 2023}}],
  "coauthors": [],
  "warnings": []
}}

ERROR RESPONSES:
- Captcha/blocked: {{"warnings": ["captcha_blocked"], "publications": [], "projects": [], "advising": [], "affiliations": [], "coauthors": [], "last_update": null}}
- Profile not found: {{"warnings": ["profile_not_found"], "publications": [], "projects": [], "advising": [], "affiliations": [], "coauthors": [], "last_update": null}}
- Page error: {{"warnings": ["page_error"], "publications": [], "projects": [], "advising": [], "affiliations": [], "coauthors": [], "last_update": null}}
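
The doubled braces in the JSON examples suggest the prompt is a Python `str.format` template, where `{{` and `}}` render as literal braces and `{name}`, `{lattes_id}`, `{profile_url}`, `{cutoff_year}`, and `{current_year}` are substituted. A minimal sketch of how it could be filled in follows; the abbreviated template, the `build_prompt` helper, and the `cutoff_year = current_year - time_window` computation are assumptions, and the profile URL form is the one seen in the tool logs.

```python
from datetime import date

# Abbreviated stand-in for the full prompt; doubled braces stay literal.
PROMPT_TEMPLATE = (
    'TASK: Extract academic data from Brazilian Lattes CV for "{name}".\n'
    "If search fails after 3 attempts, try direct URL: {profile_url}\n"
    "EXTRACT DATA (only years {cutoff_year}-{current_year}):\n"
    "RETURN ONLY THIS JSON (no other text):\n"
    '{{"last_update": null, "warnings": []}}\n'
)


def build_prompt(name: str, lattes_id: str, time_window: int = 5, today=None) -> str:
    """Fill the template; unused keyword arguments are ignored by str.format."""
    today = today or date.today()
    return PROMPT_TEMPLATE.format(
        name=name,
        lattes_id=lattes_id,
        profile_url=f"http://lattes.cnpq.br/{lattes_id}",
        cutoff_year=today.year - time_window,  # assumed window definition
        current_year=today.year,
    )
```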

Tool logs:

INFO     [Agent] 📍 Step 1:
INFO     [Agent]   👍 Eval: The navigation to the CNPq search page has not yet been executed successfully due to an empty DOM tree. Verdict: Uncertain.
INFO     [Agent]   🧠 Memory: I need to navigate to the CNPq search page and wait for it to load before I can interact with any elements on that page. The next steps involve searching for Ricardo Marcacini's profile.
INFO     [Agent]   🎯 Next goal: Navigate to the CNPq search page at https://buscatextual.cnpq.br/buscatextual/busca.do?metodo=apresentar and wait for 5 seconds for it to load.
INFO     [Agent]   ▶️  [1/2] navigate: url: https://buscatextual.cnpq.br/buscatextual/busca.do?metodo=apresentar, new_tab: False
WARNING  [BrowserSession] ⚠️ Page readiness timeout (4.0s, 4773ms) for https://buscatextual.cnpq.br/buscatextual/busca.do?metodo=apresentar
INFO     [tools] 🔗 Navigated to https://buscatextual.cnpq.br/buscatextual/busca.do?metodo=apresentar
INFO     [Agent]   ▶️  [2/2] wait: seconds: 5
INFO     [tools] 🕒 waited for 5 seconds
INFO     [Agent] 
INFO     [Agent] 📍 Step 2:
INFO     [Agent]   👍 Eval: Successfully navigated to the CNPq search page and waited for it to load. Verdict: Success
INFO     [Agent]   🧠 Memory: Currently on the CNPq search page, ready to input the name 'Ricardo Marcacini' into the designated field. Next steps involve searching for his profile.
INFO     [Agent]   🎯 Next goal: Input 'Ricardo Marcacini' into the text field labeled 'Nome'.
INFO     [Agent]   ▶️  [1/3] input: index: 8, text: Ricardo Marcacini, clear: True
INFO     [BrowserSession] ⌨️ Typed "Ricardo Marcacini" into element with index 8
INFO     [Agent]   ▶️  [2/3] click: index: 2475, coordinate_x: None, coordinate_y: None, force: False
INFO     [tools] 🖱️ Clicked a "Buscar" id=botaoBuscaFiltros
INFO     [Agent]   ▶️  [3/3] wait: seconds: 5
INFO     [tools] 🕒 waited for 5 seconds
INFO     [Agent] 📍 Step 3:
INFO     [Agent] 
INFO     [Agent]   ❔ Eval: Clicked the 'Buscar' button, but did not find a valid link to Ricardo Marcacini's profile. Verdict: Uncertain.
INFO     [Agent]   🧠 Memory: Currently on the CNPq search page with search results for 'Ricardo Marcacini'. Attempted to click on the search button and now need to check for clickable links in the results or use a direct URL.
INFO     [Agent]   🎯 Next goal: Attempt to click on the link containing 'Ricardo Marcacini' or use the direct URL for his profile.
INFO     [Agent]   ▶️  [1/4] click: index: 16843, coordinate_x: None, coordinate_y: None, force: False
INFO     [tools] 🖱️ Clicked a
INFO     [Agent]   ▶️  [2/4] wait: seconds: 5
INFO     [tools] 🕒 waited for 5 seconds
INFO     [Agent]   ▶️  [3/4] navigate: url: http://lattes.cnpq.br/4003190744770195, new_tab: False
WARNING  [BrowserSession] ⚠️ Page readiness timeout (4.0s, 4527ms) for http://lattes.cnpq.br/4003190744770195
INFO     [tools] 🔗 Navigated to http://lattes.cnpq.br/4003190744770195
INFO     [Agent]   ▶️  [4/4] wait: seconds: 5
INFO     [tools] 🕒 waited for 5 seconds
INFO     [Agent] 
INFO     [Agent] 📍 Step 4:
INFO     [Agent]   ⚠️ Eval: The attempt to access Ricardo Marcacini's profile resulted in an error page, indicating that the profile could not be found. Verdict: Failure.
INFO     [Agent]   🧠 Memory: Currently on an error page after attempting to access Ricardo Marcacini's profile through the CNPq search. The direct URL for his profile is known and will be used next.
INFO     [Agent]   🎯 Next goal: Navigate directly to Ricardo Marcacini's Lattes profile at http://lattes.cnpq.br/4003190744770195 and wait for 5 seconds for the page to load.
INFO     [Agent]   ▶️  [1/2] navigate: url: http://lattes.cnpq.br/4003190744770195, new_tab: False
WARNING  [BrowserSession] ⚠️ Page readiness timeout (4.0s, 4434ms) for http://lattes.cnpq.br/4003190744770195
INFO     [tools] 🔗 Navigated to http://lattes.cnpq.br/4003190744770195
INFO     [Agent]   ▶️  [2/2] wait: seconds: 5
INFO     [tools] 🕒 waited for 5 seconds
INFO     [Agent] 
INFO     [Agent] 📍 Step 5:
INFO     [Agent]   ⚠️ Eval: Attempted to access Ricardo Marcacini's profile but encountered an error page. Verdict: Failure.
INFO     [Agent]   🧠 Memory: Currently on an error page after attempting to access Ricardo Marcacini's profile through the CNPq search. The direct URL for his profile is known and will be used next.
INFO     [Agent]   🎯 Next goal: Navigate directly to Ricardo Marcacini's Lattes profile at http://lattes.cnpq.br/4003190744770195 and wait for 5 seconds for the page to load.
INFO     [Agent]   ▶️  [1/2] navigate: url: http://lattes.cnpq.br/4003190744770195, new_tab: False
WARNING  [BrowserSession] ⚠️ Page readiness timeout (4.0s, 4463ms) for http://lattes.cnpq.br/4003190744770195
INFO     [tools] 🔗 Navigated to http://lattes.cnpq.br/4003190744770195

…ght) and add missing X11 dependencies for Chromium
…d langchain-openai from requirements simplified LLM instantiation
Change to structured steps, clear STEP 1-4 format and Portuguese labels.
Add actual section names from Lattes JSON code block.
Example wrapped in json for better parsing.
Warnings now include response preview.
Dual JSON extraction. Checks for json blocks first, then raw JSON.
Increased steps, 25 for complex pages.
Direct URL navigation - Using visualizacv.do?id= endpoint instead of profile URL
Explicit DO NOT use search engine - Prevents DuckDuckGo fallback
Captcha fallback - Returns structured error if blocked
Simpler JSON template
…apture all relevant content from agent responses
…ing results and known limitations related to captcha protection and JSON response handling
…avigate directly to lattes URL, include detailed navigation steps and error response handling
…on instructions, explicit wait times, and robust error handling
…on flow, enhancing instructions and error handling for CAPTCHA scenarios
…avigation steps, enhanced wait times, and improved error response handling
…structions, enhancing error handling for no results, and optimizing browser settings for improved performance
…uplication of publications, and improved JSON response structure for activities and evidence details
@rubenszinho rubenszinho force-pushed the main branch 5 times, most recently from bf6055b to bd970bb Compare December 8, 2025 19:11
…arnings, refining JSON response structure, and optimizing navigation instructions for better performance
@rubenszinho rubenszinho force-pushed the main branch 4 times, most recently from 4f1e1fa to f413c8b Compare December 8, 2025 20:06
…ach to profile collection and conflict of interest analysis, improving navigation instructions