Consolidate load functions, add schema for casting, and add CSV options for reading #2423
Draft
arienandalibi wants to merge 61 commits into master from consolidate_load_functions
Conversation
… load_edges_from_polars that internally calls to_pandas() on the polars dataframe. Fireducks works.
… in rust instead of obtaining each column of each batch from Python individually.
…g up. Committing benchmarks and tests that check graph equality when using different ingestion pathways.
…id of them and always stream. Added benchmark for loading from fireducks.
…e_props, that all use the __arrow_c_stream__() interface. If a data source is passed with no __len__ function, we calculate the len ourselves. Updated ingestion benchmarks to also test pandas_streaming, fireducks_streaming, polars_streaming
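The `__len__` fallback mentioned above can be illustrated with a small, self-contained sketch: try the fast path first, and if the data source does not implement `__len__`, count rows batch by batch while streaming. The names `row_count` and `source` are hypothetical, not raphtory's actual API.

```python
# Hypothetical sketch of the __len__ fallback: use len() when the data
# source provides it, otherwise consume the stream and sum batch lengths.

def row_count(source):
    """Return the number of rows in a (possibly streamed) data source."""
    try:
        return len(source)  # fast path: the source implements __len__
    except TypeError:
        # slow path: iterate the stream and sum per-batch lengths
        return sum(len(batch) for batch in source)

# A generator has no __len__, so the fallback path is taken:
batches = ([1, 2, 3], [4, 5], [6])
print(row_count(iter(batches)))  # 6
```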
# Conflicts:
#	python/python/raphtory/__init__.pyi
…renamed to load_edge_metadata/load_node_metadata.
…if the data source provides __len__(), and if not, the loading/progress bar for loading nodes and edges doesn't show progression, only iterations per second.
…ess bar for loading updates properly when using the __arrow_c_stream__ interface.
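The behaviour described in the two commits above can be sketched in a toy form: with a known total the progress display can show percentage completion, while a source without `__len__` only permits an iteration count. `progress_label` is an illustrative name, not raphtory's progress implementation.

```python
# Toy sketch: degrade the progress display gracefully when the data
# source's total length is unknown (no __len__).

def progress_label(done, total=None):
    """Format a progress label; show only iterations when total is unknown."""
    if total is None:
        return f"{done} it"            # no __len__: iterations only
    pct = 100 * done // total
    return f"{done}/{total} ({pct}%)"  # __len__ available: show progression

print(progress_label(5))      # 5 it
print(progress_label(5, 10))  # 5/10 (50%)
```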
…ingestion pathways
…dge_deletions_from_df
# Conflicts:
#	python/python/raphtory/vectors/__init__.pyi
…vent graphs/persistent graphs.
# Conflicts:
#	python/python/raphtory/graphql/__init__.pyi
#	python/python/raphtory/vectors/__init__.pyi
…ta from python. Replaced it with PyRecordBatchReader::from_arrow_pycapsule for safety and future changes.
…rrow format at once. Now stream 1 mil rows at a time.
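The fixed-size streaming described above (the commit streams 1,000,000 rows at a time rather than materializing everything at once) can be sketched with a plain chunking helper; a small chunk size is used here for the demo, and `chunked` is an illustrative name.

```python
# Illustrative sketch of streaming a large source in fixed-size chunks
# instead of converting the whole input at once.
from itertools import islice

def chunked(rows, chunk_size):
    """Yield lists of at most chunk_size rows from an iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk

print(list(chunked(range(10), 4)))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```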
… function on PyProperties. Added test for schema casting.
# Conflicts:
#	raphtory/src/python/graph/io/arrow_loaders.rs
…ify what type to cast columns to.
…nested type using pyarrow Table. Cast whole RecordBatch at once now using StructArray.
…taTypes can be extracted from Python without feature gating behind arrow (larger dependency). Refactored data_type_as_prop_type to be in raphtory-api as long as any of "arrow", "storage", or "python" features is enabled, since they all have dep:arrow-schema.
…rison for PropType. Fixed previous tests and added tests for dict schema input, pyarrow types, nested (StructArray) properties, nested schemas, mixed and matched PropType and pyarrow types, both in property and in schema,...
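The idea behind the user-supplied casting schema above can be shown with a minimal, library-free sketch: the schema maps column names to target types, and each named column is cast accordingly. Raphtory itself casts whole Arrow RecordBatches and accepts PropType and pyarrow types; the `cast_columns` helper below is purely illustrative.

```python
# Toy illustration of schema-driven column casting: columns named in the
# schema are converted; columns not named are left untouched.

def cast_columns(columns, schema):
    """Cast each column of `columns` to the type given in `schema`, if any."""
    out = {}
    for name, values in columns.items():
        cast = schema.get(name)  # None means leave the column as-is
        out[name] = [cast(v) for v in values] if cast else list(values)
    return out

columns = {"id": ["1", "2"], "weight": ["0.5", "1.5"]}
schema = {"id": int, "weight": float}
print(cast_columns(columns, schema))
# {'id': [1, 2], 'weight': [0.5, 1.5]}
```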
…ssed but no CSV files were detected.
# Conflicts:
#	python/python/raphtory/__init__.pyi
#	python/python/raphtory/iterables/__init__.pyi
#	python/python/raphtory/node_state/__init__.pyi
#	python/tests/test_ingestion_equivalence_df.py
#	python/tests/test_load_from_df.py
#	raphtory-api/src/python/mod.rs
#	raphtory/src/python/graph/graph.rs
#	raphtory/src/python/graph/graph_with_deletions.rs
#	raphtory/src/python/graph/io/arrow_loaders.rs
#	raphtory/src/python/packages/base_modules.rs
…d parquet/csv). Make sure each ingestion path returns the same node ids.
… well as failures from malformed inputs
…lformed (or any column). Added tests for malformed inputs in csv.
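In the spirit of the malformed-CSV tests mentioned above, a stdlib-only sketch of flagging malformed rows (here, rows with the wrong field count) might look like this; `read_rows` is a hypothetical helper, not raphtory's CSV loader.

```python
# Hedged sketch: partition CSV rows into well-formed and malformed based
# on the expected number of fields per row.
import csv
import io

def read_rows(text, expected_fields):
    """Return (good, bad) lists of (line_number, row) pairs."""
    good, bad = [], []
    for lineno, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        (good if len(row) == expected_fields else bad).append((lineno, row))
    return good, bad

data = "a,b,c\n1,2,3\n4,5\n"
good, bad = read_rows(data, expected_fields=3)
print(len(good), len(bad))  # 2 1
```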
…umn is not found. Removed the extra_field parquet test because it didn't work. Cleaned up the test file.
…e_deletions for csv files/directories
…v optional-dependencies in pyproject.toml. General clean-up before adding other functions (load_edge, load_node_metadata, ...) in python graph.
…unctions. Added load_edges, load_node_metadata, load_edge_metadata functions to PyGraph and PyPersistentGraph. Removed Pandas loaders.
…folder which is not available in the crate root. Fixed parquet_loaders.rs.
Takes care of #2394
Not included in the issue above: