creating artifacts/tokenizer/refinedweb_mix/spm_32000_unigram.model #6

@virtustate

This is very cool, thanks for creating and sharing! Judging by all the .md files, it looks like you used some fairly sophisticated AI-assisted coding. I'd love to hear what you used, along with any details you're willing to share.

Some scripts, e.g. scripts/data/run_sample.sh, fail because they cannot find the tokenizer at artifacts/tokenizer/refinedweb_mix/spm_32000_unigram.model, which I have not yet figured out how to create.

scripts/data/run_full.sh looks like the way to create the tokenizer, but it fails because it cannot find a train split:
Bad split: train. Available splits: ['test']

Any hints on getting further with this would be appreciated. Thanks!

  1410         elif split in splits_generators:
  1411             splits_generator = splits_generators[split]
  1412         else:
❱ 1413             raise ValueError(f"Bad split: {split}. Available splits: {list(splits_genera
  1414
  1415         # Create a dataset for each of the given splits
  1416         datasets = map_nested(

locals:
  base_path = None
  dl_manager = <datasets.download.streaming_download_manager.StreamingDownloadManager object at 0x1156d53a0>
  self = <datasets.packaged_modules.text.text.Text object at 0x1068c5970>
  split = 'train'
  splits_generators = {
      'test': SplitGenerator(
          name='test',
          gen_kwargs={
              'files': [
                  <datasets.utils.file_utils.FilesIterable object at 0x11577e630>,
                  <datasets.utils.file_utils.FilesIterable object at 0x114991e80>,
                  <datasets.utils.file_utils.FilesIterable object at 0x11563e2a0>,
                  <datasets.utils.file_utils.FilesIterable object at 0x116a6f260>,
                  <datasets.utils.file_utils.FilesIterable object at 0x116a6f920>,
                  <datasets.utils.file_utils.FilesIterable object at 0x116a6eb70>,
                  <datasets.utils.file_utils.FilesIterable object at 0x116a6f0b0>,
                  <datasets.utils.file_utils.FilesIterable object at 0x116a6f7a0>,
                  <datasets.utils.file_utils.FilesIterable object at 0x116a6ec30>,
                  <datasets.utils.file_utils.FilesIterable object at 0x116a6ee70>,
                  ... +92
              ]
          },
          split_info=SplitInfo(
              name='test',
              num_bytes=0,
              num_examples=0,
              shard_lengths=None,
              dataset_name=None
          )
      )
  }
ValueError: Bad split: train. Available splits: ['test']
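
One possible explanation, offered as an assumption rather than a confirmed diagnosis: when no explicit split mapping is given, the Hugging Face datasets library infers split names from file and directory names, so text files whose paths contain "test" can all end up in a 'test' split with no 'train' split at all. The sketch below shows one way to sidestep that inference by passing an explicit data_files mapping to the "text" builder. The helper names build_data_files and load_text_split, the "*.txt" pattern, and the directory layout are hypothetical and not taken from this repository's scripts.

```python
from pathlib import Path


def build_data_files(root, pattern="*.txt", split="train"):
    """Map every matching file under `root` to one explicitly named split.

    Hypothetical helper: the "*.txt" pattern and the directory layout are
    assumptions, not something taken from the repository.
    """
    files = sorted(str(p) for p in Path(root).rglob(pattern))
    if not files:
        raise FileNotFoundError(f"No files matching {pattern!r} under {root}")
    return {split: files}


def load_text_split(root, split="train", streaming=True):
    """Load plain-text files as `split`, regardless of how they are named."""
    # Imported lazily so build_data_files can be used without `datasets`.
    from datasets import load_dataset

    return load_dataset(
        "text",
        data_files=build_data_files(root, split=split),
        split=split,
        streaming=streaming,
    )
```

With an explicit mapping, the requested split name no longer depends on whether a path happens to contain "train" or "test". Whether this is the right change for scripts/data/run_full.sh depends on how that script locates its data, which the traceback does not show.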
