Skip to content

Conversation

@silveirado
Copy link
Member

@silveirado silveirado commented Dec 30, 2025

Descrição

Este PR adiciona funcionalidades de streaming de dados, tabelas dinâmicas (pivot) e geração de gráficos, com integração Python para processamento eficiente de grandes volumes de dados.

Principais Funcionalidades

1. HTTP Streaming Endpoint (findStream)

  • ✅ Novo endpoint /rest/stream/:document/findStream para streaming HTTP verdadeiro
  • ✅ Processamento registro a registro sem acumular dados em memória
  • ✅ Aplicação de permissões e transformações em tempo real
  • ✅ Uso inteligente de nós secundários do MongoDB com fallback
  • ✅ Ordenação padrão para consistência

2. Pivot Tables Endpoint

  • ✅ Novo endpoint /rest/data/:document/pivot para tabelas dinâmicas
  • ✅ Integração Python com Polars para processamento rápido
  • ✅ Formato hierárquico de saída com metadados enriquecidos
  • ✅ Suporte a campos aninhados e formatação de lookups
  • ✅ Labels multilíngue (pt-BR/en)
  • Estrutura hierárquica de colunas (columnHeaders) para suporte a colunas multi-nível
  • ✅ Suporte a date buckets (D, W, M, Q, Y) para agregação temporal
  • ✅ Documentação completa atualizada com exemplos de colunas hierárquicas

3. Graph Endpoint

  • ✅ Novo endpoint /rest/data/:document/graph para geração de gráficos SVG
  • ✅ Integração Python usando Polars para agregações (3-10x mais rápido)
  • ✅ Uso de pandas/matplotlib para visualização
  • ✅ Suporte a 6 tipos de gráficos: bar, line, pie, scatter, histogram, timeSeries
  • ✅ Processamento interno com streaming (findStream)

Commits Incluídos

  1. feat: [WIP] add HTTP streamable endpoint findStream - Endpoint base de streaming
  2. refactor: apply clean code principles - Refatoração seguindo princípios clean code
  3. docs: add findStream endpoint documentation - Documentação do findStream
  4. docs: add Architecture Decision Records (ADRs) - ADRs para decisões arquiteturais
  5. feat: implement smart secondary node usage - Uso inteligente de nós secundários
  6. fix: remove unused imports - Limpeza de imports não utilizados
  7. feat(pivot): implement hierarchical output - Implementação de pivot com saída hierárquica
  8. docs(pivot): update API documentation and add ADR - Documentação do pivot
  9. docs(postman): update pivot endpoint example - Atualização da collection Postman
  10. fix(docker): update Dockerfile for Python support - Suporte Python no Docker
  11. feat: add graph endpoint with Polars and Pandas - Endpoint de gráficos
  12. docs(pivot): update documentation and tests for columnHeaders - Atualização de documentação e testes para estrutura hierárquica de colunas

Arquivos Criados

Streaming

  • src/imports/data/api/findStream.ts
  • src/imports/data/api/findUtils.ts
  • src/imports/data/api/streamTransforms.ts
  • src/imports/data/api/streamConstants.ts
  • src/server/routes/rest/stream/streamApi.ts

Pivot

  • src/imports/data/api/pivotStream.ts
  • src/imports/data/api/pivotMetadata.ts
  • src/imports/types/pivot.ts
  • src/scripts/python/pivot_table.py

Graph

  • src/imports/data/api/graphStream.ts
  • src/imports/types/graph.ts
  • src/scripts/python/graph_generator.py

Testes

  • __test__/data/api/runFindStreamTests.ts
  • __test__/data/api/runFindStreamBenchmark.ts
  • __test__/data/api/runFindStreamConfidenceTest.ts
  • __test__/data/api/runPivotIntegrationTest.ts
  • __test__/data/api/runGraphIntegrationTest.ts
  • __test__/data/api/pivotStream.test.ts
  • __test__/data/api/graphStream.test.ts

Documentação

  • docs/pt-BR/adr/0001-http-streaming-para-busca-de-dados.md
  • docs/pt-BR/adr/0002-extracao-de-logica-comum-para-find-utils.md
  • docs/pt-BR/adr/0003-node-transform-streams-para-processamento-sequencial.md
  • docs/pt-BR/adr/0004-ordenacao-padrao-para-consistencia.md
  • docs/pt-BR/adr/0005-uso-obrigatorio-nos-secundarios-para-leitura.md
  • docs/pt-BR/adr/0006-integracao-python-para-pivot-tables.md
  • docs/pt-BR/adr/0007-formato-hierarquico-saida-pivot.md
  • docs/pt-BR/adr/0008-graph-endpoint-com-polars-pandas.md
  • (Versões em inglês de todos os ADRs)

Arquivos Modificados

  • src/imports/data/api/index.ts
  • src/imports/data/api/pythonStreamBridge.ts
  • src/imports/utils/mongo.ts (hasSecondaryNodes)
  • src/server/routes/rest/data/dataApi.ts
  • src/server/routes/index.ts
  • Dockerfile (suporte Python/uv)
  • docs/pt-BR/api.md e docs/en/api.md (atualizado com columnHeaders)
  • docs/postman/Konecty-API.postman_collection.json
  • __test__/data/api/pivotStream.test.ts (testes atualizados para columnHeaders)
  • __test__/data/api/runPivotIntegrationTest.ts (testes de integração atualizados)
  • docs/en/adr/0007-hierarchical-pivot-output-format.md (atualizado com columnHeaders)
  • docs/pt-BR/adr/0007-formato-hierarquico-saida-pivot.md (atualizado com columnHeaders)

Testes

  • ✅ Testes unitários para findStream, pivotStream e graphStream
  • ✅ Testes de integração para todos os endpoints
  • ✅ Testes de benchmark comparando performance
  • ✅ Testes de confiança validando consistência de dados
  • Testes atualizados para verificar estrutura hierárquica de columnHeaders
  • ✅ Build TypeScript sem erros

Performance

  • findStream: Streaming verdadeiro, sem acumular dados em memória
  • Pivot: Polars para processamento rápido de grandes volumes
  • Graph: Polars 3-10x mais rápido que Pandas para agregações
  • MongoDB: Uso inteligente de nós secundários com fallback

Documentação

  • ✅ 8 ADRs documentando decisões arquiteturais (pt-BR e en)
  • ✅ Documentação completa da API para todos os endpoints
  • Documentação atualizada com exemplos de columnHeaders hierárquicos
  • Exemplos de colunas multi-nível (date buckets com status)
  • ✅ Collection Postman atualizada com exemplos reais
  • ✅ Exemplos de uso para cada funcionalidade

Dependências Python

  • polars - Para agregações rápidas (pivot e graph)
  • pandas - Para visualização (graph)
  • matplotlib - Para geração de SVG (graph)
  • pyarrow - Para conversão Polars → Pandas

Todas as dependências são gerenciadas automaticamente pelo uv quando os scripts rodam pela primeira vez.

Mudanças Recentes

Estrutura Hierárquica de Colunas (columnHeaders)

O endpoint de pivot tables agora retorna uma estrutura hierárquica de cabeçalhos de coluna (columnHeaders) que suporta:

  • Colunas multi-nível (ex: date buckets com status)
  • Formatação automática de lookups
  • Suporte a date buckets (D=dia, W=semana, M=mês, Q=trimestre, Y=ano)
  • Estrutura similar ao ExtJS mz-pivot axisTop

A documentação e testes foram atualizados para refletir essas mudanças.


Note

Adds high-throughput data retrieval and analytics endpoints plus infra to support them.

  • New endpoints: GET /rest/stream/:document/findStream (NDJSON streaming), GET /rest/data/:document/pivot (hierarchical JSON), GET /rest/data/:document/graph (SVG)
  • Python integration: Orchestrates Node → Python via pythonStreamBridge; uses Polars (aggregation) and Pandas/matplotlib (charts)
  • Dockerfile: Installs Python, Rust, uv; prebuilds Polars and copies /app/scripts/python
  • Query/streaming core: Shared buildFindQuery in findUtils; transform streams in streamTransforms; default sort; secondary read preference with fallback
  • Tests: Unit/integration/e2e, confidence and benchmark runners for findStream, pivotStream, graphStream
  • Docs: API docs expanded; ADRs added (streaming, transforms, findUtils, default sorting, secondary reads, Python pivot, hierarchical pivot, graphs); Postman collection updated

Written by Cursor Bugbot for commit 3695247. This will update automatically on new commits. Configure here.

… streaming

- Create findStream function with record-by-record processing
- Extract common logic to findUtils.ts (DRY principle)
- Create Transform streams for field permissions and date conversion
- Add ObjectToJsonTransform for HTTP streaming
- Add new endpoint /rest/stream/:document/findStream
- Register streamApi in routes/index.ts
- Add unit tests for Transform streams and findUtils
- Add integration, E2E, and benchmark tests
- Add confidence test to validate data consistency
- All tests execute directly in Node (no Jest dependency)
- Benchmark shows 82% memory reduction and 99% faster TTFB for 55k records

TODO: Refactor and cleanup
- Extract magic numbers to streamConstants.ts (DRY)
- Replace let with const (const-pref)
- Replace forEach/for loops with functional methods (.map, .filter, .reduce)
- Extract helper functions from findUtils.ts (buildSortOptions, buildAccessConditionsForField, buildAccessConditionsMap, calculateConditionsKeys)
- Extract parseFilterFromQuery to eliminate duplication in streamApi.ts
- Create streamTestHelpers.ts with reusable test functions
- Use BluebirdPromise.map with concurrency limits in all promise operations
- Add default sort { _id: 1 } to findStream for consistent ordering
- Match find.ts behavior in findUtils.ts for query construction consistency
- Refactor test files to use helpers and functional methods
- Fix test variable references (testResults.allPassed)

All tests passing:
- Unit and integration tests: 7/7 passed
- Benchmark: 99.3% faster TTFB, 45% faster total time, 81.8% better throughput
- Confidence test: All datasets match exactly with find paginated endpoint
- Add comprehensive documentation for /rest/stream/:document/findStream endpoint
- Document streaming format (newline-delimited JSON)
- Include client-side processing examples (JavaScript)
- Add advantages comparison with traditional find endpoint
- Add usage guidelines and best practices
- Update Postman collection with 3 new requests:
  - Find Stream (main request with all parameters)
  - Find Stream - Contact (simple example)
  - Find Stream - With Filter (complex filter example)
- Documentation available in pt-BR and en
- Include response examples and error handling
…ntation

- ADR-0001: HTTP Streaming para Busca de Dados
  Documents decision to implement HTTP streaming endpoint
  Includes performance metrics (68% memory reduction, 99.3% faster TTFB)

- ADR-0002: Extração de Lógica Comum para findUtils
  Documents DRY principle application
  Explains shared logic extraction between find and findStream

- ADR-0003: Node.js Transform Streams para Processamento Sequencial
  Documents use of Transform streams for record-by-record processing
  Explains pipeline architecture

- ADR-0004: Ordenação Padrão para Consistência
  Documents default sorting decision ({ _id: 1 })
  Explains consistency requirements for confidence tests

All ADRs available in pt-BR and en
Includes README files with index
- Add hasSecondaryNodes() function to check for available secondary nodes
- Implement dynamic read preference selection:
  - Uses 'secondary' when secondaries are available (maximum isolation)
  - Falls back to 'secondaryPreferred' when no secondaries (no errors)
- Add performance optimizations:
  - STREAM_BATCH_SIZE: 1000 documents per batch
  - STREAM_MAX_TIME_MS: 5 minutes max query time
- Apply same read preference to countDocuments for consistency
- Update ADR-0005 to reflect smart fallback approach
- Works in all environments (dev without secondaries, prod with secondaries)

See ADR-0005 for detailed rationale
- Remove KonectyResult (not used)
- Remove errorReturn (not used)
- Remove successReturn (not used)
- Remove DataDocument (not used directly, only in streamTransforms)

All imports are now used, lint passes without errors
- Add hierarchical pivot table structure with nested children
- Enrich pivot config with metadata from MetaObject.Meta
- Implement lookup field formatting with formatPattern
- Add recursive field metadata resolution for nested lookups
- Concatenate parent labels in nested fields (e.g., 'Grupo > Nome')
- Calculate subtotals per hierarchy level
- Calculate grand totals for all data
- Update Python script to build hierarchical structure
- Support Accept-Language header for multilingual labels
- Update integration and unit tests for new structure

Breaking changes:
- Pivot API response format changed from flat array to hierarchical structure
- Response now includes metadata, data (hierarchical), and grandTotals
…rmat

- Update API documentation (en/pt-BR) with new hierarchical response structure
- Add examples showing metadata, nested children, subtotals, and grandTotals
- Document lookup formatting rules and nested field label concatenation
- Add ADR-0007 documenting hierarchical pivot output format decision
- Update ADR READMEs to include new ADR

Breaking changes documented:
- Response format changed from flat array to hierarchical structure
- New metadata field with enriched field information
- Nested children arrays for multi-level hierarchies
- Subtotals per level and grand totals
- Update Postman collection example response to show new hierarchical structure
- Include metadata, nested children, subtotals, and grandTotals in example
- Reflect breaking change in response format
- Add Rust, cargo, and musl-dev for building polars from source on Alpine
- Fix ENV format to use key=value syntax (removes warning)
- Fix COPY paths to use absolute paths (/app instead of app)
- Add python3-dev and py3-pip for Python development dependencies
- Ensure konecty user has access to build tools
- Note: polars will compile on first execution (takes ~2-5 minutes), then cached

Alpine Linux (musl) doesn't have precompiled polars wheels, so compilation
from source is required. This is handled automatically by uv when the script
runs for the first time.
- Add GET /rest/data/:document/graph endpoint for SVG chart generation
- Implement graphStream function orchestrating findStream + Python
- Create graph_generator.py script using Polars for aggregations and pandas/matplotlib for visualization
- Support 6 chart types: bar, line, pie, scatter, histogram, timeSeries
- Add collectSVGFromPython function to pythonStreamBridge for SVG collection
- Add GraphConfig and GraphStreamParams TypeScript types
- Create unit and integration tests for graph endpoint
- Add ADR-0008 documenting Polars+Pandas decision (pt-BR and en)
- Update API documentation with graph endpoint examples (pt-BR and en)
- Update Postman collection with graph examples using Opportunity document
- Performance: Polars is 3-10x faster than Pandas for aggregations
- Convert only aggregated results to Pandas (memory efficient)
- Add pyarrow dependency for Polars to_pandas() conversion
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is being reviewed by Cursor Bugbot

Details

Your team is on the Bugbot Free tier. On this plan, Bugbot will review limited PRs each billing cycle for each member of your team.

To receive Bugbot reviews on all of your PRs, visit the Cursor dashboard to activate Pro and start your 14-day free trial.

- Fix wrong variable name in runFindStreamTests.ts (failed++ -> testResults.failed++)
- Change BENCHMARK_ITERATION_CONCURRENCY from 3 to 1 for accurate memory measurements
- Fix TypeScript linting errors (any type, empty line, type guards)
@silveirado
Copy link
Member Author

✅ Fixed issues reported by Cursor Bugbot:

  1. Fixed wrong variable name in runFindStreamTests.ts (changed failed++ to testResults.failed++)
  2. Changed BENCHMARK_ITERATION_CONCURRENCY from 3 to 1 for accurate memory measurements
  3. Fixed TypeScript linting errors

All issues have been resolved in commit a763b95.

@silveirado silveirado changed the title feat: Add graph endpoint with Polars and Pandas integration feat: add HTTP streaming, pivot tables and graph endpoints with Python integration Dec 30, 2025
- Replace echo with printf in polars pre-build step
- BusyBox ash doesn't interpret \n in echo, causing malformed input
- printf correctly interprets \n as newline character
- This ensures polars is properly pre-compiled during Docker build
- Prevents multi-minute delay on first pivot/graph request
cpuSystem: endCpu.system / MILLISECONDS_PER_SECOND,
recordCount,
throughput,
peakMemory: peakMemory - startMemory.heapUsed,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Double subtraction causes incorrect peak memory in benchmark

The readStreamRecordsWithMetrics helper function already returns peakMemory as a delta (computed as memoryState.peakMemory - startMemory.heapUsed at line 107 of streamTestHelpers.ts). However, benchmarkFindStream subtracts startMemory.heapUsed again at line 131, resulting in peakMemory - 2 * startMemory.heapUsed. This causes incorrect (likely negative) peak memory values for the stream endpoint benchmark, while benchmarkFindPaginated correctly computes the delta from the raw peak value. The fix is to use peakMemory directly without the second subtraction.

Additional Locations (1)

Fix in Cursor Fix in Web

- Add support for hierarchical column headers in pivotStream response
- Update tests to validate presence and structure of columnHeaders
- Modify API documentation to reflect new columnHeaders field
- Implement logic in Python script to handle and return column headers
- Ensure backward compatibility with existing pivot functionality

Breaking changes:
- Response format now includes columnHeaders, enhancing the pivot table structure.
"description": "Find Opportunity records with complex filter. Example filtering by multiple status values."
},
"response": []
},
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Malformed JSON structure in Postman collection item

Medium Severity

The "Find Stream - With Filter" item has incorrect indentation that breaks the JSON structure. Comparing with the correctly formatted "Find Stream - Contact" item (line 630), the description at line 658 and response at line 660 are indented one level less than required. This causes response to appear outside its parent item object, making the Postman collection invalid JSON that would fail to import.

Fix in Cursor Fix in Web


if (findStr !== streamStr) {
// Show first difference for debugging
return `${key}: find=${findStr.substring(0, MAX_SAMPLE_LENGTH)}... vs stream=${streamStr.substring(0, MAX_SAMPLE_LENGTH)}...`;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling substring on undefined causes TypeError

Medium Severity

In compareRecordFields, when a key exists in one record but not the other, accessing the missing key returns undefined. Calling JSON.stringify(undefined) returns the primitive undefined (not a string). Then calling .substring() on lines 186 throws a TypeError: Cannot read property 'substring' of undefined. This crashes the confidence test whenever records have different fields.

Fix in Cursor Fix in Web

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants