Got error when converting TPC-DS in .dat format to parquet format #6

@linqinluli

After I run

cargo run --release -- generate --benchmark tpcds \
  --scale 1000 \
  --partitions 48 \
  --generator-path /path/to/DSGen-software-code-3.2.0rc1/tools \
  --output /tmp/tpcds/sf1000/

The data is generated in the folder /tmp/tpcds/sf1000/. Then I run

mkdir /tmp/tpcds/sf1000-parquet

cargo run --release -- convert --benchmark tpcds \
  --input /tmp/tpcds/sf1000/ \
  --output /tmp/tpcds/sf1000-parquet/

I get the error below:
ArrowError(CsvError("incorrect number of fields for line 1, expected 31 got more than 31"))
I think the code that triggers the error might be this line in lib.rs:
df.write_parquet(&output_filename, Some(props)).await?;

After I deleted the first number in call_center.dat/part-1.dat, the error changed to
ArrowError(CsvError("incorrect number of fields for line 2, expected 31 got 32"))

However, converting the TPC-H data works fine. I obtained the TPC-H and TPC-DS generators as described in your repo.
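
To narrow this down, it may help to count the fields per line in the generated file and compare the result with the 31 columns the reader expects. The "got 32" message would be consistent with each row ending in a trailing `|`, but that is an assumption and not confirmed here. The sketch below uses only the Rust standard library; `count_fields` and `mismatched_lines` are hypothetical helper names and are not part of the repo.

```rust
/// Count the delimiter-separated fields on one line of a .dat file.
/// A trailing delimiter counts as one extra (empty) field.
fn count_fields(line: &str, delimiter: char) -> usize {
    line.split(delimiter).count()
}

/// Return (1-based line number, field count) for every line whose
/// field count differs from `expected`.
fn mismatched_lines(contents: &str, delimiter: char, expected: usize) -> Vec<(usize, usize)> {
    contents
        .lines()
        .enumerate()
        .map(|(i, line)| (i + 1, count_fields(line, delimiter)))
        .filter(|&(_, n)| n != expected)
        .collect()
}
```

For example, reading call_center.dat/part-1.dat with `std::fs::read_to_string` and passing the contents to `mismatched_lines(&contents, '|', 31)` would list every line that does not split into exactly 31 fields. If every line reports 32, the extra field most likely comes from the row terminator rather than from corrupt data.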
