@res-life
Contributor

Add test case

Description

Fix a difference in CSV reading results compared to Spark on CPU
closes #20812

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: Chong Gao <res_life@163.com>

Add test case

Signed-off-by: Chong Gao <res_life@163.com>
@res-life requested a review from a team as a code owner December 18, 2025 07:46
@copy-pr-bot

copy-pr-bot bot commented Dec 18, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@res-life marked this pull request as draft December 18, 2025 07:46
@github-actions bot added the "libcudf" label (Affects libcudf (C++/CUDA) code.) Dec 18, 2025
@res-life
Contributor Author

/ok to test

@copy-pr-bot

copy-pr-bot bot commented Dec 18, 2025

/ok to test

@res-life, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@res-life added the "bug" (Something isn't working) and "non-breaking" (Non-breaking change) labels Dec 18, 2025
@res-life
Contributor Author

/ok to test b76068c

@res-life
Contributor Author

This is a draft; I'm not sure whether this PR will break other cases.

@res-life
Contributor Author

pre-commit.ci autofix

@GaryShen2008
Contributor

Hi @vuule, we tried to fix the CSV diff issue using Cursor. We tested the reported cases and the fix seems to work, but we are not confident about other cases. Could you review this PR and let us know whether the approach makes sense? Thanks.

@res-life
Contributor Author

/ok to test dc4c828

@vuule
Contributor

vuule commented Dec 19, 2025

I'll take a look, thanks for the PR!

@vuule
Contributor

vuule commented Dec 23, 2025

One of the new tests fails with the changes in this PR. So the fix here is about as far as I got on my end. Looking into fixing the second test case.

@res-life
Contributor Author

res-life commented Jan 6, 2026

@vuule I double-checked: this PR fixes the second test case of #20812, but it also introduces a regression.

@vuule
Contributor

vuule commented Jan 6, 2026

The new test is failing for me:

27: [ RUN      ] CsvReaderTest.DoubleQuotesContinuous
27: /home/coder/cudf/cpp/tests/utilities/column_utilities.cu:557: Failure
27: Failed
27: first difference: lhs[0] = "packageName":"test","type":"test","url_scheme":false,"referer":",test, rhs[0] = "packageName":"test","type":"test","url_scheme":false,"referer":"",test
27: Google Test trace:
27: /home/coder/cudf/cpp/tests/io/csv_test.cpp:2687:  <--  line of failure
27: 
27: 
27: [  FAILED  ] CsvReaderTest.DoubleQuotesContinuous (0 ms)

@vuule
Contributor

vuule commented Jan 6, 2026

/ok to test 0098d49

@vuule
Contributor

vuule commented Jan 6, 2026

The core issue seems to be that cuDF always applies some string processing steps that should only be done when the string is actually quoted.

I think I have a fix (all tests pass), but it seems to lower the performance quite a bit in some cases.
Working on getting the changes into a mergeable state.
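
To illustrate the issue described above, here is a minimal Python sketch. It is not the libcudf code or the actual fix; the function names `unescape_always` and `unescape_if_quoted` are hypothetical, and the rule "a field is quoted if it starts and ends with the quote character" is a simplifying assumption. It contrasts collapsing doubled quotes unconditionally with collapsing them only when the field is quoted.

```python
def unescape_always(raw: str, quote: str = '"') -> str:
    """Collapse doubled quote characters whether or not the field is quoted."""
    return raw.replace(quote * 2, quote)


def unescape_if_quoted(raw: str, quote: str = '"') -> str:
    """Strip enclosing quotes and collapse doubled quotes only for quoted fields.

    Assumption for this sketch: a field counts as quoted only if it both
    starts and ends with the quote character.
    """
    is_quoted = len(raw) >= 2 and raw.startswith(quote) and raw.endswith(quote)
    if not is_quoted:
        # Unquoted field: leave every character, including "", untouched.
        return raw
    inner = raw[1:-1]
    return inner.replace(quote * 2, quote)


# Field value taken from the failing test log; it starts with a quote
# but does not end with one, so this sketch treats it as unquoted.
field = '"packageName":"test","type":"test","url_scheme":false,"referer":"",test'

collapsed = unescape_always(field)      # the "" before ,test becomes "
preserved = unescape_if_quoted(field)   # returned unchanged
quoted = unescape_if_quoted('"say ""hi"""')  # enclosing quotes removed, "" collapsed
```

The two strings in the test log differ by exactly this kind of collapsed `""`: `collapsed` ends in `"referer":",test`, while `preserved` keeps `"referer":"",test`.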

@res-life
Copy link
Contributor Author

res-life commented Jan 8, 2026

@vuule Thanks very much for helping with this issue, which comes from a customer.
I'll close this draft PR; feel free to put up your own PR.
Once your PR is ready, I'll validate the customer query.

@vuule
Contributor

vuule commented Jan 8, 2026

I'm fine with pushing to this PR, no need to close


Labels

bug (Something isn't working), libcudf (Affects libcudf (C++/CUDA) code.), non-breaking (Non-breaking change)

Development

Successfully merging this pull request may close these issues.

[BUG]The data was wrong when reading CSV with double quotation marks in some case

3 participants