
Description of the Dataset

This release contains the complete data sequence used to train CrystalCoder. It covers all three pre-training stages, combining data from two prior works, the SlimPajama dataset and StarCoder, for a total of approximately 1300 billion tokens. These tokens are distributed across the three stages with distinct mixture weights.

Stage 1

This initial stage uses half of the SlimPajama data, approximately 345 billion tokens.

Stage 2

The second stage uses the remaining half of the SlimPajama data, along with two epochs of StarCoder data. For the StarCoder data, we apply fill-in-the-middle (FIM) augmentation with an FIM rate of 0.9 and an SPM rate of 0.5. The total token count for this stage is 0.5 * 690 + 2 * 291 = 927 billion tokens.
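For readers unfamiliar with FIM, the sketch below shows one common way fill-in-the-middle augmentation is applied at the document level. It is a minimal illustration, not the CrystalCoder preprocessing code: the sentinel strings and the exact SPM layout are assumptions, as neither is specified in this card.

```python
import random

# Hypothetical sentinel strings; the real tokenizer defines its own FIM tokens.
FIM_PREFIX, FIM_MIDDLE, FIM_SUFFIX = "<fim_prefix>", "<fim_middle>", "<fim_suffix>"

def fim_augment(doc: str, fim_rate: float = 0.9, spm_rate: float = 0.5) -> str:
    """Rewrite a document for fill-in-the-middle training.

    With probability fim_rate, split the document at two random points into
    (prefix, middle, suffix); otherwise keep it as a plain left-to-right
    sample. Among FIM samples, use SPM ordering with probability spm_rate
    and PSM ordering otherwise.
    """
    if len(doc) < 2 or random.random() >= fim_rate:
        return doc
    lo, hi = sorted(random.sample(range(len(doc) + 1), 2))
    prefix, middle, suffix = doc[:lo], doc[lo:hi], doc[hi:]
    if random.random() < spm_rate:
        # SPM: the suffix is presented before the prefix (exact sentinel
        # placement varies between implementations).
        return f"{FIM_SUFFIX}{suffix}{FIM_PREFIX}{prefix}{FIM_MIDDLE}{middle}"
    # PSM: prefix, then suffix, then the middle the model must predict.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"
```

With a 0.9 FIM rate, roughly nine out of ten StarCoder documents are rearranged this way, and about half of those use the SPM ordering.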

Stage 3

The third stage reuses the Python and web-related subsets of the StarCoder data, namely HTML, CSS, and JavaScript. This data is trained on for three epochs, with FIM applied at a rate of 0.3 alongside an SPM rate of 0.5, for a total of 100 billion tokens. Additionally, a small portion of the SlimPajama dataset, excluding the GitHub part, is also reused, contributing around 10 billion tokens.
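As a rough illustration, subsets like these can be streamed per language from the StarCoder training data on the Hugging Face Hub. The sketch below assumes the per-language directory layout of the bigcode/starcoderdata repository; whether CrystalCoder drew its Stage 3 subsets exactly this way is an assumption, not something stated in this card.

```python
from datasets import load_dataset

# Stream one subset per Stage 3 language without downloading everything.
# The repo id and data_dir layout follow bigcode/starcoderdata.
stage3_languages = ["python", "html", "css", "javascript"]
streams = {
    lang: load_dataset(
        "bigcode/starcoderdata",
        data_dir=lang,
        split="train",
        streaming=True,
    )
    for lang in stage3_languages
}

# Peek at the first record of each subset.
for lang, stream in streams.items():
    first = next(iter(stream))
    print(lang, sorted(first.keys()))
```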

Instruction tuning (Stage 3a)

To enhance the model's proficiency in real chat scenarios, we use a diverse set of instruction tuning datasets, totaling approximately 1 billion tokens. Specifically, the data includes OASST1-guanaco, SlimOrca, ShareGPT_V4.3, Evol-ShareGPT, CodeAlpaca, Rosetta Code, Evol-CodeAlpaca 1, Evol-CodeAlpaca 2, and a self-generated dataset focused on website creation, produced with the Alpaca pipeline. We will release the full dataset soon.

The detailed breakdown of the tokens is as follows:

[Figure: data split — per-source token breakdown across the training stages]

Primary Usage

This dataset serves as the foundation for training CrystalCoder and supports reproduction of our results. For training from scratch, please refer to our training code. For resuming training from intermediate checkpoints, please load the dataloader states stored in the checkpoints and follow this tutorial.
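For a quick look at the data itself, the snippet below streams a few rows with the Hugging Face datasets library. Treat it as a sketch: the split name and repository layout are assumptions, so check the repo's file listing for the actual configurations.

```python
from datasets import load_dataset

# Stream rather than download: the full corpus is far too large for a
# local copy. The "train" split name is an assumption; check the repo.
ds = load_dataset("LLM360/CrystalCoderDatasets", split="train", streaming=True)

for example in ds.take(3):  # inspect the first three rows
    print(example)
```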

License

Pretraining data for language models mostly comes from a collection of data sources with various licenses. Any use of all or part of the data here must abide by the terms of the original licenses, including attribution clauses when relevant. We refer users to the SlimPajama dataset and StarCoder for detailed license attribution.

We release our work under ODC-BY, granting rights over the dataset as a collection, but not over its individual contents.
