Add ORCID registration task to article publication flow#494
Add ORCID registration task to article publication flow#494
Conversation
- Create operations/orcid_operations.py with post_data and register_orcid - Integrate ORCID registration in try_register_documents after document save - Add ENABLE_ORCID_PUSH and ORCID_PUSH_URL environment variables - Update packtools to 4.16.1 for build_payload support - Add unit tests for ORCID operations Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com>
Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Adds an ORCID “push” side-effect to the article publication flow so that, after a document is saved to OPAC, author/work data derived from the article XML is sent to the orcid-push service (POST /works) behind an environment-variable toggle.
Changes:
- Bumps
packtoolsversion to accessbuild_payloadfor ORCID payload generation. - Introduces
operations/orcid_operations.pywith a retrying HTTP POST helper andregister_orcid()that builds and posts payloads per author ORCID. - Integrates
register_orcid()intotry_register_documents()afterdocument.save(), plus adds unit tests and docker-compose env wiring.
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| requirements.txt | Updates packtools dependency to a newer version used by ORCID payload building. |
| docker-compose.yml | Adds ENABLE_ORCID_PUSH / ORCID_PUSH_URL env passthrough for local runs. |
| docker-compose-dev.yml | Same env passthrough for dev compose setup. |
| airflow/dags/operations/orcid_operations.py | New ORCID push module with tenacity retry and payload generation via packtools. |
| airflow/dags/operations/sync_kernel_to_website_operations.py | Calls ORCID registration after saving the article. |
| airflow/tests/test_orcid_operations.py | Adds unit tests covering toggle behavior, posting, and error handling. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| try: | ||
| if fetch_document_xml: | ||
| document_xml = fetch_document_xml(document_id) | ||
| register_orcid(document_id, document_xml) |
There was a problem hiding this comment.
This adds a second fetch_document_xml(document_id) call for every saved article. The XML was already fetched earlier inside article_factory (same file) to build fields from SPS_Package, so this doubles Kernel traffic and adds latency per document. Consider fetching the XML once in try_register_documents and reusing it for both article_factory and register_orcid (or refactor article_factory to return/cache the XML).
| stop=stop_after_attempt(5), | ||
| ) | ||
| def post_data(url, data=None, headers=None, timeout=10, verify=True): |
There was a problem hiding this comment.
With timeout=10 and stop_after_attempt(5), a degraded ORCID Push endpoint can block the publication flow for up to ~60s per author (plus exponential waits) even though failures are meant to be non-blocking. Consider reducing timeout/attempts for this best-effort side effect and/or adding a total time cap (e.g., stop-after-delay) so ORCID push outages don’t significantly slow down the DAG run.
| stop=stop_after_attempt(5), | |
| ) | |
| def post_data(url, data=None, headers=None, timeout=10, verify=True): | |
| stop=stop_after_attempt(3), | |
| ) | |
| def post_data(url, data=None, headers=None, timeout=5, verify=True): |
| logger = logging.getLogger(__name__) | ||
|
|
||
| ENABLE_ORCID_PUSH = os.environ.get("ENABLE_ORCID_PUSH", "false").lower() == "true" | ||
| ORCID_PUSH_URL = os.environ.get("ORCID_PUSH_URL", "") |
There was a problem hiding this comment.
This module introduces logger while other operations/* modules consistently use Logger = logging.getLogger(__name__) (e.g., operations/docs_utils.py:22, operations/check_website_operations.py:22). For consistency across the codebase, consider renaming this to Logger and updating references in this module.
| @@ -0,0 +1,229 @@ | |||
| import os | |||
There was a problem hiding this comment.
Unused import: os is imported but not used in this test module. Removing it will keep the test file clean and avoids failing CI if linting is enabled later.
| import os |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 6 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| return | ||
|
|
||
| try: | ||
| xml_tree = et.XML(document_xml) |
There was a problem hiding this comment.
XML is parsed with et.XML(document_xml) using the default parser. Elsewhere in this repo XML parsing uses an XMLParser(no_network=True) (e.g., docs_utils.py) to avoid network/entity resolution risks. Consider using an explicit secure parser here (e.g., no_network=True and disabling entity resolution) when parsing document_xml to mitigate XXE/SSRF vectors.
| xml_tree = et.XML(document_xml) | |
| parser = et.XMLParser(no_network=True, resolve_entities=False) | |
| xml_tree = et.XML(document_xml, parser=parser) |
| except Exception as exc: | ||
| logging.error( | ||
| "Could not register ORCID for document '%s': %s", | ||
| document_id, exc, |
There was a problem hiding this comment.
This except Exception logs only the exception message, which can make diagnosing ORCID push failures difficult (no traceback). Since this is intentionally non-blocking, consider using logging.exception(...) here (or including exc_info=True) so the stack trace is captured in logs.
| except Exception as exc: | |
| logging.error( | |
| "Could not register ORCID for document '%s': %s", | |
| document_id, exc, | |
| except Exception: | |
| logging.exception( | |
| "Could not register ORCID for document '%s'", | |
| document_id, |
O que esse PR faz?
Adiciona registro de ORCID ao fluxo de publicação de artigos. Após salvar um documento com sucesso em
try_register_documents, envia os dados de cada autor com ORCID para o serviçoorcid-pushvia POST/works. Falhas no push ORCID são logadas mas não bloqueiam o registro do documento.Controlado por duas variáveis de ambiente:
ENABLE_ORCID_PUSH— liga/desliga a funcionalidade (default:false)ORCID_PUSH_URL— URL do serviço (ex:https://orcid-push.scielo.org/)Onde a revisão poderia começar?
airflow/dags/operations/orcid_operations.py— módulo novo compost_data(retry com tenacity) eregister_orcid(usabuild_payloaddo packtools para gerar payloads a partir do XML).Depois, a integração em
airflow/dags/operations/sync_kernel_to_website_operations.pylinhas ~737-745.Como este poderia ser testado manualmente?
ENABLE_ORCID_PUSH=trueeORCID_PUSH_URL=https://orcid-push.scielo.org/sync_kernel_to_websitecom documentos que possuam autores com ORCID no XML/worksforam realizadasENABLE_ORCID_PUSH=falsee confirmar que nenhuma requisição é feitaAlgum cenário de contexto que queira dar?
Alterações principais:
operations/orcid_operations.py—post_data()com retry exponencial (5 tentativas, baseado no pattern do scms-uploadrequester.py);register_orcid()que parseia XML, gera payloads via packtools e posta para cada autorsync_kernel_to_website_operations.py— chamada aregister_orcidapósdocument.save(), envolvida em try/except isoladorequirements.txt— packtools2.9.5→4.16.1para acesso apacktools.sps.formats.orcid.payload_builder.build_payloaddocker-compose.yml/docker-compose-dev.yml— novas env varsENABLE_ORCID_PUSH,ORCID_PUSH_URLtests/test_orcid_operations.py— 14 testes unitários cobrindo retry, toggle, erros, payloads múltiplosScreenshots
N/A
Quais são tickets relevantes?
N/A
Referências
prepare_request_databuild_payloadOriginal prompt
💬 We'd love your input! Share your thoughts on Copilot coding agent in our 2 minute survey.