| Name | Modified | Size | Downloads / Week |
|---|---|---|---|
| 4.2.0 source code.tar.gz | 2025-10-09 | 2.1 MB | |
| 4.2.0 source code.zip | 2025-10-09 | 2.2 MB | |
| README.md | 2025-10-09 | 2.1 kB | |
| Totals: 3 Items | | 4.3 MB | 0 |
Dataset Features
- Sample without replacement option when interleaving datasets by @radulescupetru in https://github.com/huggingface/datasets/pull/7786

  ```python
  from datasets import interleave_datasets

  ds = interleave_datasets(datasets, stopping_strategy="all_exhausted_without_replacement")
  ```
- Parquet: add `on_bad_files` argument to error/warn/skip bad files by @lhoestq in https://github.com/huggingface/datasets/pull/7806

  ```python
  from datasets import load_dataset

  ds = load_dataset(parquet_dataset_id, on_bad_files="warn")
  ```
- Add parquet scan options and docs by @lhoestq in https://github.com/huggingface/datasets/pull/7801
  - docs to select columns and filter data efficiently

    ```python
    from datasets import load_dataset

    ds = load_dataset(parquet_dataset_id, columns=["col_0", "col_1"])
    ds = load_dataset(parquet_dataset_id, filters=[("col_0", "==", 0)])
    ```
  - new argument to control buffering and caching when streaming

    ```python
    import pyarrow
    import pyarrow.dataset
    from datasets import load_dataset

    # Prefetch one range at a time, with ranges of up to 128 MiB
    fragment_scan_options = pyarrow.dataset.ParquetFragmentScanOptions(
        cache_options=pyarrow.CacheOptions(prefetch_limit=1, range_size_limit=128 << 20)
    )
    ds = load_dataset(parquet_dataset_id, streaming=True, fragment_scan_options=fragment_scan_options)
    ```
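The `filters=[("col_0", "==", 0)]` argument above uses the tuple-based filter notation also accepted by pyarrow's Parquet readers. As a rough illustration of what such a filter means, the sketch below evaluates it row by row in plain Python. It assumes the usual disjunctive-normal-form reading: a flat list of tuples is AND-ed together, and a list of such lists is OR-ed. The helper `row_matches` and the sample rows are hypothetical and are not part of `datasets`; the library pushes the filter down to the Parquet reader rather than looping over rows like this.

```python
import operator

# Comparison operators allowed in (column, op, value) filter tuples.
_OPS = {
    "=": operator.eq,
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "in": lambda value, choices: value in choices,
    "not in": lambda value, choices: value not in choices,
}


def row_matches(row, filters):
    """Hypothetical helper: check one row (a dict) against DNF-style filters."""
    # A flat list of tuples is a single AND group; wrap it so that
    # both accepted shapes are handled as a list of AND groups.
    if filters and isinstance(filters[0], tuple):
        filters = [filters]
    # The row is kept if every condition of at least one group holds.
    return any(
        all(_OPS[op](row[column], value) for column, op, value in group)
        for group in filters
    )


# Made-up rows, for illustration only.
rows = [
    {"col_0": 0, "col_1": "a"},
    {"col_0": 1, "col_1": "b"},
    {"col_0": 0, "col_1": "c"},
]
kept = [row for row in rows if row_matches(row, [("col_0", "==", 0)])]
```

Here `kept` holds the first and third rows. Passing `[[("col_0", "==", 0)], [("col_1", "in", ["b"])]]` instead would keep all three, since the second row satisfies the second group.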
What's Changed
- Document HDF5 support by @klamike in https://github.com/huggingface/datasets/pull/7740
- update tips in docs by @lhoestq in https://github.com/huggingface/datasets/pull/7790
- feat: avoid some copies in torch formatter by @drbh in https://github.com/huggingface/datasets/pull/7787
- Support huggingface_hub v0.x and v1.x by @Wauplin in https://github.com/huggingface/datasets/pull/7783
- Define CI future by @lhoestq in https://github.com/huggingface/datasets/pull/7799
- More Parquet streaming docs by @lhoestq in https://github.com/huggingface/datasets/pull/7803
- Less api calls when resolving data_files by @lhoestq in https://github.com/huggingface/datasets/pull/7805
- typo by @lhoestq in https://github.com/huggingface/datasets/pull/7807
New Contributors
- @drbh made their first contribution in https://github.com/huggingface/datasets/pull/7787
Full Changelog: https://github.com/huggingface/datasets/compare/4.1.1...4.2.0