Refactor #1

Conversation
…ependency in output formatting and enhance row reading efficiency in reader. Update version to 0.1.0 in project files.
…r to support progress callbacks for improved user experience during file operations.
…ns for improved table presentation in the console.
Summary of Changes

Hello @Tendo33, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces significant performance and memory optimizations across the …
Note: CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds a Rich progress UI to the CLI split flow, a …
Sequence Diagram(s)

A condensed code sketch of this flow follows the diagram.

```mermaid
sequenceDiagram
autonumber
actor User
participant CLI as parq/cli.py
participant Progress as Rich Progress
participant Reader as ParquetReader
User->>CLI: invoke split command
CLI->>Progress: start Progress context & add task
CLI->>Reader: call split_file(..., progress_callback)
loop per batch
Reader->>Reader: process batch
Reader->>CLI: progress_callback(current, total)
CLI->>Progress: update task progress
end
Reader-->>CLI: return split results
CLI->>Progress: stop Progress context
CLI-->>User: print results / exit
```
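For orientation, a condensed sketch of the flow above (a sketch only: the exact option names, console wiring, and return values in parq/cli.py may differ):

```python
from rich.progress import BarColumn, Progress, SpinnerColumn, TextColumn, TimeRemainingColumn

from parq.output import console
from parq.reader import ParquetReader


def split(input_path: str, output_pattern: str, file_count: int) -> None:
    # Illustrative wiring of the progress callback described in the diagram.
    reader = ParquetReader(input_path)

    with Progress(
        SpinnerColumn(),
        TextColumn("[bold blue]{task.description}"),
        BarColumn(),
        TextColumn("[progress.percentage]{task.percentage:>3.0f}%"),
        TimeRemainingColumn(),
        console=console,
    ) as progress:
        task = progress.add_task("Splitting", total=reader.num_rows)

        def on_progress(current: int, total: int) -> None:
            # Invoked by ParquetReader.split_file after each batch is written.
            progress.update(task, completed=current, total=total)

        output_files = reader.split_file(
            output_pattern,
            file_count=file_count,
            progress_callback=on_progress,
        )

    for path in output_files:
        console.print(str(path))
```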
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches: ❌ 1 failed check (inconclusive), ✅ 2 passed.
Code Review
This pull request is a great step forward in optimizing the tool by removing the pandas dependency. The performance and memory usage improvements, especially in read_head, read_tail, and print_table, are significant. The addition of a progress bar for the split command is also a nice touch for user experience. My review includes a few suggestions to further enhance code style, maintainability, and performance in some areas. Overall, excellent work on this refactor!
```python
from rich.progress import (
    BarColumn,
    Progress,
    SpinnerColumn,
    TextColumn,
    TimeRemainingColumn,
)

from parq.output import console
```
For better code style and maintainability, it's recommended to move imports to the top of the file, as per PEP 8 guidelines [1]. While local imports can sometimes be useful for lazy loading or avoiding circular dependencies, in this case rich is a core part of the CLI, and moving these imports to the top would improve consistency and readability. I've suggested removing them here so you can add them at the top of the file (a sketch of the hoisted imports follows the footnote).

Style Guide References

Footnotes

[1] PEP 8: Imports are always put at the top of the file, just after any module comments and docstrings, and before global variables and constants.
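As an illustration only (the file's real import layout is an assumption), the hoisted block at the top of parq/cli.py could look like this:

```python
# parq/cli.py — illustrative top-of-module layout; other existing imports omitted

# Standard library
from typing import Optional

# Third-party
from rich.progress import (
    BarColumn,
    Progress,
    SpinnerColumn,
    TextColumn,
    TimeRemainingColumn,
)

# Local
from parq.output import console
from parq.reader import ParquetReader
```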
```python
for row_idx in range(arrow_table.num_rows):
    row_values = []
    for col_name in arrow_table.column_names:
        value = arrow_table[col_name][row_idx].as_py()
        row_values.append(str(value))
    table.add_row(*row_values)
```
This implementation iterates through the table cell by cell, which can be inefficient. A more performant and Pythonic approach is to convert columns to Python lists first and then iterate over the transposed data. This leverages PyArrow's optimized columnar operations.
```diff
-for row_idx in range(arrow_table.num_rows):
-    row_values = []
-    for col_name in arrow_table.column_names:
-        value = arrow_table[col_name][row_idx].as_py()
-        row_values.append(str(value))
-    table.add_row(*row_values)
+# Add rows by transposing columns for efficiency
+column_data = [col.to_pylist() for col in arrow_table.columns]
+for row in zip(*column_data):
+    # Convert all values in the row to string before adding
+    table.add_row(*[str(item) for item in row])
```
parq/output.py (Outdated)

```python
table.add_row(*row_values)

console.print(table)
# {{END MODIFICATIONS}}
```
parq/reader.py (Outdated)
```python
return table.slice(0, min(n, self.num_rows))

# Fast path: small files or single row group
# Avoids overhead of row group calculation
if self.num_rows <= n * 10 or self.num_row_groups == 1:
```
The number 10 here is a magic number, making the condition's intent less clear. Consider extracting it into a named constant at the module level (e.g., FAST_PATH_ROWS_MULTIPLIER = 10) with a comment explaining this heuristic. This would improve readability and maintainability. This same magic number is also used in the read_tail method on line 161.
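A sketch of what that might look like (the constant name follows this suggestion; surrounding code is abbreviated):

```python
# parq/reader.py (illustrative)
# Heuristic: when the requested head/tail covers a large share of the file,
# reading the whole table once is cheaper than selective row-group reads.
FAST_PATH_ROWS_MULTIPLIER = 10

# ... then, inside read_head and read_tail:
#     if self.num_rows <= n * FAST_PATH_ROWS_MULTIPLIER or self.num_row_groups == 1:
#         ...  # fast path: read everything and slice
```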
```python
output_pattern: str,
file_count: Optional[int] = None,
record_count: Optional[int] = None,
progress_callback: Optional[callable] = None,
```
The type hint callable is quite generic. For better type safety and clarity, you can use typing.Callable to specify the signature of the callback function. You will need to add from typing import Callable at the top of the file.
```diff
-    progress_callback: Optional[callable] = None,
+    progress_callback: Optional["Callable[[int, int], None]"] = None,
```
tests/test_reader.py (Outdated)
```python
assert df.iloc[1]["id"] == 5

# Verify it's the last rows (using PyArrow directly)
# Extract id column values
id_values = [table["id"][i].as_py() for i in range(len(table))]
```
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
parq/reader.py (1)
262-313: Lazy‑open writers and ensure final progress hits 100%

Open each writer when its first rows are written; close as you advance. Also emit a final progress update.
```diff
-        # Create writers for each output file
-        writers = []
-        try:
-            for output_path in output_files:
-                # Create parent directories if needed
-                output_path.parent.mkdir(parents=True, exist_ok=True)
-
-                # Create writer with same schema and compression as source
-                writer = pq.ParquetWriter(
-                    output_path,
-                    self.schema,
-                    compression=self._get_compression_type(),
-                )
-                writers.append(writer)
-
-            # Read and distribute data in batches
-            current_file_idx = 0
-            current_file_rows = 0
-            rows_processed = 0
+        # Read and distribute data in batches
+        writers: dict[int, pq.ParquetWriter] = {}
+        try:
+            current_file_idx = 0
+            current_file_rows = 0
+            rows_processed = 0
@@
-                    if rows_to_write == batch_rows and batch_offset == 0:
+                    # Lazily open current writer
+                    if current_file_idx not in writers:
+                        out_path = output_files[current_file_idx]
+                        out_path.parent.mkdir(parents=True, exist_ok=True)
+                        writers[current_file_idx] = pq.ParquetWriter(
+                            out_path,
+                            self.schema,
+                            compression=self._get_compression_type(),
+                        )
+
+                    if rows_to_write == batch_rows and batch_offset == 0:
                         # Write entire batch
-                        writers[current_file_idx].write_batch(batch)
+                        writers[current_file_idx].write_batch(batch)
                     else:
                         # Write partial batch
                         slice_batch = batch.slice(batch_offset, rows_to_write)
-                        writers[current_file_idx].write_batch(slice_batch)
+                        writers[current_file_idx].write_batch(slice_batch)
@@
-                    if progress_callback:
-                        progress_callback(rows_processed, total_rows)
+                    if progress_callback:
+                        progress_callback(rows_processed, total_rows)
@@
-                        current_file_idx += 1
+                        # Close finished writer before moving on
+                        if current_file_idx in writers:
+                            writers[current_file_idx].close()
+                        current_file_idx += 1
                         current_file_rows = 0
+
+            # Ensure final 100% update
+            if progress_callback and rows_processed != total_rows:
+                progress_callback(total_rows, total_rows)
         finally:
-            # Always close writers
-            for writer in writers:
-                if writer:
-                    writer.close()
+            # Always close any open writers
+            for w in writers.values():
+                w.close()
```

This avoids empty files and reduces open file handles when num_files is large.
🧹 Nitpick comments (11)
tests/test_reader.py (1)
74-78: Prefer vectorized column extraction via to_pylist()

Current loop indexes each scalar. You can simplify and speed this up:
```diff
-    # Extract id column values
-    id_values = [table["id"][i].as_py() for i in range(len(table))]
-    assert id_values[0] == 4
-    assert id_values[1] == 5
+    # Extract id column values
+    id_values = table["id"].to_pylist()
+    assert id_values == [4, 5]
```

parq/output.py (4)
106-108: Docstring wording: not “zero‑copy”

Per‑cell .as_py() materializes Python objects; this isn’t zero‑copy. Suggest: “Avoids pandas conversion; reads directly from PyArrow for lower overhead.”
118-120: Table styling defaults

show_lines=True and extra padding look nice but can slow rendering for large previews and wrap on narrow terminals. Consider show_lines=False and rely on zebra row styles, or gate this via a flag.
122-134: Avoid per‑cell scalar access; batch to Python lists once

Indexing each scalar via arrow_table[col_name][row_idx].as_py() is O(rows×cols) Python dispatch. For small head/tail it’s fine; for wider tables it’s noticeably slower. Convert each column once, then zip rows:

```diff
-        # Add rows directly from PyArrow table (zero-copy access)
-        # This avoids the overhead of pandas conversion and iterrows()
-        for row_idx in range(arrow_table.num_rows):
-            row_values = []
-            for col_name in arrow_table.column_names:
-                value = arrow_table[col_name][row_idx].as_py()
-                row_values.append(str(value))
-            table.add_row(*row_values)
+        # Add rows: convert columns to Python lists once, then zip rows
+        columns_py = [arrow_table[name].to_pylist() for name in arrow_table.column_names]
+        for row in zip(*columns_py):
+            table.add_row(*(str(v) for v in row))
```

If you need custom formatting (e.g., bytes, timestamps, decimals), add a small formatter per type before str().
197-245: Split summary UX nits
- “Rows per file” is a floor; you already prefix with “~”, good. Consider displaying “min–max rows per file” if cheap to compute.
- Start file index at 1 for user-facing table.
```diff
-        table.add_column("#", style="cyan", no_wrap=True)
+        table.add_column("#", style="cyan", no_wrap=True)
 ...
-        for idx, file_path in enumerate(output_files):
+        for idx, file_path in enumerate(output_files, start=1):
             if file_path.exists():
                 size = OutputFormatter._format_file_size(file_path.stat().st_size)
-                table.add_row(str(idx), str(file_path), size)
+                table.add_row(str(idx), str(file_path), size)
```

parq/cli.py (2)
233-251: Progress UI: handle non‑TTY and interrupts gracefully

- Fallback to plain text when not attached to a TTY to avoid escape sequences in logs/CI. Use console.is_terminal or honor env flags (TTY_COMPATIBLE / TTY_INTERACTIVE). Based on learnings.
- Catch KeyboardInterrupt to close progress cleanly and emit a friendly message.

I can draft a small wrapper to choose between Progress vs. simple prints (see the sketch below).
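For illustration, one possible shape of that wrapper (the split_progress name and the plain-text fallback policy are assumptions, not the PR's code):

```python
from contextlib import contextmanager

from rich.console import Console
from rich.progress import BarColumn, Progress, SpinnerColumn, TextColumn, TimeRemainingColumn

console = Console()


@contextmanager
def split_progress(description: str, total: int):
    """Yield an update(current, total) callable; plain prints when not on a TTY."""
    if console.is_terminal:
        # Rich progress bar; the `with` block also stops cleanly on KeyboardInterrupt.
        with Progress(
            SpinnerColumn(),
            TextColumn("[bold blue]{task.description}"),
            BarColumn(),
            TextColumn("[progress.percentage]{task.percentage:>3.0f}%"),
            TimeRemainingColumn(),
            console=console,
        ) as progress:
            task = progress.add_task(description, total=total)

            def update(current: int, new_total: int) -> None:
                progress.update(task, completed=current, total=new_total)

            yield update
    else:
        # Logs/CI: avoid escape sequences, print roughly every 10% of rows.
        step = max(total // 10, 1)

        def update(current: int, new_total: int) -> None:
            if current % step == 0 or current >= new_total:
                print(f"{description}: {current}/{new_total} rows")

        yield update
```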
253-255: Nit: unused parameter in update_progress

total isn’t used; drop it for clarity, or use it to compute rate/ETA if you render it.
197-208: Type hint progress_callback precisely

Use typing.Callable for better editor/static‑type support:

```diff
-    progress_callback: Optional[callable] = None,
+    from typing import Callable, Optional
+    progress_callback: Optional[Callable[[int, int], None]] = None,
```
244-260: Eager file preparation can create empty outputs

When file_count > total_rows, this creates and closes empty Parquet files. Prefer lazy writer creation on first write, and only return paths that received data. If you must keep indices stable, at least avoid touching the filesystem for files that get 0 rows.
I can supply a lazy-writer refactor; see next comment for a concrete diff.
80-85: Metadata key naming

You store a single compression under "compression_types". Consider "compression" (singular) or a mapping per column if you plan to extend it.
336-339: Optional: preserve per‑column compression

If source columns use mixed compression, capture a mapping and pass it to ParquetWriter:

```python
comp = {}
for i, field in enumerate(self.schema):
    comp[field.name] = self.metadata.row_group(0).column(i).compression
return comp or "SNAPPY"
```

And then call ParquetWriter(..., compression=comp). Falls back to a string if uniform.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
uv.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
- parq/cli.py (1 hunks)
- parq/output.py (1 hunks)
- parq/reader.py (5 hunks)
- pyproject.toml (0 hunks)
- tests/test_reader.py (1 hunks)
💤 Files with no reviewable changes (1)
- pyproject.toml
🧰 Additional context used
🧬 Code graph analysis (1)
parq/cli.py (1)
parq/reader.py (3)
ParquetReader (13-339), num_rows (40-42), split_file (192-322)
… edge cases; optimize row reading performance in OutputFormatter by utilizing columnar access.
… reader.py for cleaner code; streamline the codebase by eliminating unnecessary comments.
Actionable comments posted: 0
♻️ Duplicate comments (3)
parq/output.py (1)
126-135: Use zip to avoid per‑cell indexed lookups.

Zipping the column lists is simpler and typically faster than double indexing; also handle None cleanly.

```diff
-        # Transpose and iterate over rows
-        for row_idx in range(arrow_table.num_rows):
-            row_values = [
-                str(columns_data[col_idx][row_idx]) for col_idx in range(len(columns_data))
-            ]
-            table.add_row(*row_values)
+        # Transpose and iterate rows via zip for efficiency
+        for row in zip(*columns_data):
+            table.add_row(*[(str(v) if v is not None else "") for v in row])
```

parq/cli.py (1)
227-233: Move Rich/console imports to module top.

PEP 8: keep imports at top; Rich is core here and also used elsewhere. Improves readability and static analysis.

```diff
-    from rich.progress import (
-        BarColumn,
-        Progress,
-        SpinnerColumn,
-        TextColumn,
-        TimeRemainingColumn,
-    )
-    from parq.output import console
+    # (Move these to the top of the file with other imports)
+    # from rich.progress import BarColumn, Progress, SpinnerColumn, TextColumn, TimeRemainingColumn
+    # from parq.output import console
```

Also applies to: 235-235
parq/reader.py (1)
224-224: Tighten the callback type hint.

Use typing.Callable for clarity and tooling.

```diff
-from typing import List, Optional
+from typing import List, Optional, Callable
@@
-    progress_callback: Optional[callable] = None,
+    progress_callback: Optional["Callable[[int, int], None]"] = None,
```
🧹 Nitpick comments (5)
parq/output.py (1)
118-119: Avoid heavy UI/render and peak memory on large previews.

Always-on show_lines and full-column to_pylist can be expensive for big tables. Gate lines and consider chunking.

```diff
-    table = Table(
+    show_lines_arg = arrow_table.num_rows <= 200
+    table = Table(
         title=f"[bold blue]📄 {title}[/bold blue]",
         box=box.ROUNDED,
         show_header=True,
         header_style="bold magenta",
-        padding=(0, 1),
-        show_lines=True,
+        padding=(0, 1),
+        show_lines=show_lines_arg,
     )
```

Optional: when arrow_table.num_rows > 50_000, build rows from small batches (iter_batches) to cap peak memory; I can provide a patch if desired — a rough sketch follows below.
Also applies to: 126-129
parq/cli.py (2)
247-249: Use the provided total when updating.

Keeps the task accurate if the source adjusts totals.

```diff
-        def update_progress(current: int, total: int):
-            progress.update(task, completed=current)
+        def update_progress(current: int, total: int):
+            progress.update(task, completed=current, total=total)
```
237-245: Apply Progress hardening for CI/non‑TTY environments.

Enable transient completion and auto‑disable when not a terminal:

```diff
     with Progress(
         SpinnerColumn(),
         TextColumn("[bold blue]{task.description}"),
         BarColumn(),
         TextColumn("[progress.percentage]{task.percentage:>3.0f}%"),
         TimeRemainingColumn(),
         console=console,
+        transient=True,
+        disable=not console.is_terminal,
     ) as progress:
```

Console.is_terminal is a read-only boolean property that indicates whether the Console is writing to a terminal, making this approach valid for detecting and disabling progress in non-TTY environments. Rich 14.x adds TTY_COMPATIBLE and TTY_INTERACTIVE environment variables if you need to override auto-detection in CI pipelines.
parq/reader.py (2)
282-341: Optional: clean up partial outputs on failure.

On write errors, the current code closes writers but leaves partial files. Consider deleting any created files to avoid corrupt artifacts.

Sketch:

```python
try:
    ...
except Exception:
    for p in output_files:
        try:
            if p.exists():
                p.unlink()
        except Exception:
            pass
    raise
finally:
    for w in writers:
        if w:
            w.close()
```

I can send a precise patch if you want this behavior.
343-353: Normalize compression value for consistency with PyArrow documentation examples.

PyArrow's ParquetWriter accepts the compression parameter case-insensitively, so the current code is functionally correct. However, documentation examples commonly use lowercase names like "snappy" and "gzip", and normalizing ensures consistency. Additionally, metadata.row_group(0).column(0).compression may return an enum object that requires conversion to a string.

The suggested refactor is sound practice:

```diff
-        if self.num_row_groups > 0:
-            compression = self.metadata.row_group(0).column(0).compression
-            return compression
-        return "SNAPPY"  # Default compression
+        if self.num_row_groups > 0:
+            comp = self.metadata.row_group(0).column(0).compression
+            # Enum -> name, else str; normalize to lowercase for consistency.
+            name = getattr(comp, "name", str(comp)).lower()
+            return name
+        return "snappy"  # Default
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- parq/cli.py (1 hunks)
- parq/output.py (1 hunks)
- parq/reader.py (6 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
parq/cli.py (1)
parq/reader.py (2)
num_rows (45-47), split_file (219-341)
🔇 Additional comments (3)
parq/output.py (1)
106-108: Good move away from pandas conversion.

Direct PyArrow access reduces copies and speeds up preview.
parq/reader.py (2)
12-16: Nice: extracted FAST_PATH_ROWS_MULTIPLIER.

Clearer intent and avoids magic numbers.
136-149: Solid guards and optimized row‑group reads.

Input validation + zero‑row fast return + selective row‑group reads look good.
Also applies to: 179-186
Refactor output and reader for optimized performance
Summary by CodeRabbit
New Features
Performance Improvements
Bug Fixes
Tests
Chores