Name | Modified | Size | Downloads / Week |
---|---|---|---|
4.0.0 source code.tar.gz | 2025-07-09 | 1.9 MB | |
4.0.0 source code.zip | 2025-07-09 | 2.0 MB | |
README.md | 2025-07-09 | 8.4 kB | |
Totals: 3 Items | | 3.9 MB | 5 |
New Features
- Add `IterableDataset.push_to_hub()` by @lhoestq in https://github.com/huggingface/datasets/pull/7595

  ```python
  # Build streaming data pipelines in a few lines of code !
  from datasets import load_dataset

  ds = load_dataset(..., streaming=True)
  ds = ds.map(...).filter(...)
  ds.push_to_hub(...)
  ```
- Add `num_proc=` to `.push_to_hub()` (Dataset and IterableDataset) by @lhoestq in https://github.com/huggingface/datasets/pull/7606

  ```python
  # Faster push to Hub ! Available for both Dataset and IterableDataset
  ds.push_to_hub(..., num_proc=8)
  ```
- New `Column` object
  - Implementation of iteration over values of a column in an IterableDataset object by @TopCoder2K in https://github.com/huggingface/datasets/pull/7564
  - Lazy column by @lhoestq in https://github.com/huggingface/datasets/pull/7614

  ```python
  # Syntax:
  ds["column_name"]  # datasets.Column([...]) or datasets.IterableColumn(...)

  # Iterate on a column:
  for text in ds["text"]:
      ...

  # Load one cell without bringing the full column in memory
  first_text = ds["text"][0]  # equivalent to ds[0]["text"]
  ```
- Torchcodec decoding by @TyTodd in https://github.com/huggingface/datasets/pull/7616
  - Enables streaming only the ranges you need !
  ```python
  # Don't download full audios/videos when it's not necessary
  # Now with torchcodec it only streams the required ranges/frames:
  from datasets import load_dataset

  ds = load_dataset(..., streaming=True)
  for example in ds:
      video = example["video"]
      frames = video.get_frames_in_range(start=0, stop=6, step=1)  # only stream certain frames
  ```
  - Requires `torch>=2.7.0` and FFmpeg >= 4
  - Not available for Windows yet but it is coming soon - in the meantime please use `datasets<4.0`
  - Load audio data with `AudioDecoder`:

  ```python
  audio = dataset[0]["audio"]  # <datasets.features._torchcodec.AudioDecoder object at 0x11642b6a0>
  samples = audio.get_all_samples()  # or use get_samples_played_in_range(...)
  samples.data  # tensor([[ 0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 2.3447e-06, -1.9127e-04, -5.3330e-05]]
  samples.sample_rate  # 16000

  # old syntax is still supported
  array, sr = audio["array"], audio["sampling_rate"]
  ```
  - Load video data with `VideoDecoder`:

  ```python
  video = dataset[0]["video"]  # <torchcodec.decoders._video_decoder.VideoDecoder object at 0x14a61d5a0>
  first_frame = video.get_frame_at(0)
  first_frame.data.shape  # (3, 240, 320)
  first_frame.pts_seconds  # 0.0
  frames = video.get_frames_in_range(0, 6, 1)  # frames 0 to 5, the stop index is exclusive
  frames.data.shape  # torch.Size([6, 3, 240, 320])
  ```
Breaking changes
- Remove scripts altogether by @lhoestq in https://github.com/huggingface/datasets/pull/7592
  - `trust_remote_code` is no longer supported
- Torchcodec decoding by @TyTodd in https://github.com/huggingface/datasets/pull/7616
  - torchcodec replaces soundfile for audio decoding
  - torchcodec replaces decord for video decoding
- Replace Sequence by List by @lhoestq in https://github.com/huggingface/datasets/pull/7634
  - Introduction of the `List` type

  ```python
  from datasets import Features, List, Value

  features = Features({
      "texts": List(Value("string")),
      "four_paragraphs": List(Value("string"), length=4)
  })
  ```
  - `Sequence` was a legacy type from tensorflow datasets which converted lists of dicts to dicts of lists. It is no longer a type: it is now a utility that returns a `List` or a `dict` depending on the subfeature

  ```python
  from datasets import Sequence, Value

  Sequence(Value("string"))  # List(Value("string"))
  Sequence({"texts": Value("string")})  # {"texts": List(Value("string"))}
  ```
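To make the legacy `Sequence` behavior concrete, here is a minimal pure-Python sketch of the "list of dicts to dict of lists" conversion mentioned above. It does not use the `datasets` library and is not its implementation; the helper name `list_of_dicts_to_dict_of_lists` is hypothetical, and the sketch assumes every row has the same keys.

```python
def list_of_dicts_to_dict_of_lists(rows):
    """Transpose a list of dicts into a dict of lists (illustration only)."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            # Create the column on first sight, then append this row's value
            columns.setdefault(key, []).append(value)
    return columns


# Two "rows", each a dict with the same keys
rows = [{"text": "a", "label": 0}, {"text": "b", "label": 1}]
columns = list_of_dicts_to_dict_of_lists(rows)
print(columns)
```

With `List`, a list of dicts stays a list of dicts; the transposed layout sketched here is what the legacy type produced.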
Other improvements and bug fixes
- Refactor `Dataset.map` to reuse cache files mapped with different `num_proc` by @ringohoffman in https://github.com/huggingface/datasets/pull/7434
- fix string_to_dict test by @lhoestq in https://github.com/huggingface/datasets/pull/7571
- Preserve formatting in concatenated IterableDataset by @francescorubbo in https://github.com/huggingface/datasets/pull/7522
- Fix typos in PDF and Video documentation by @AndreaFrancis in https://github.com/huggingface/datasets/pull/7579
- fix: Add embed_storage in Pdf feature by @AndreaFrancis in https://github.com/huggingface/datasets/pull/7582
- load_dataset splits typing by @lhoestq in https://github.com/huggingface/datasets/pull/7587
- Fixed typos by @TopCoder2K in https://github.com/huggingface/datasets/pull/7572
- Fix regex library warnings by @emmanuel-ferdman in https://github.com/huggingface/datasets/pull/7576
- [MINOR:TYPO] Update save_to_disk docstring by @cakiki in https://github.com/huggingface/datasets/pull/7575
- Add missing property on `RepeatExamplesIterable` by @SilvanCodes in https://github.com/huggingface/datasets/pull/7581
- Avoid multiple default config names by @albertvillanova in https://github.com/huggingface/datasets/pull/7585
- Fix broken link to albumentations by @ternaus in https://github.com/huggingface/datasets/pull/7593
- fix string_to_dict usage for windows by @lhoestq in https://github.com/huggingface/datasets/pull/7598
- No TF in win tests by @lhoestq in https://github.com/huggingface/datasets/pull/7603
- Docs and more methods for IterableDataset: push_to_hub, to_parquet... by @lhoestq in https://github.com/huggingface/datasets/pull/7604
- Tests typing and fixes for push_to_hub by @lhoestq in https://github.com/huggingface/datasets/pull/7608
- fix parallel push_to_hub in dataset_dict by @lhoestq in https://github.com/huggingface/datasets/pull/7613
- remove unused code by @lhoestq in https://github.com/huggingface/datasets/pull/7615
- Update `_dill.py` to use `co_linetable` for Python 3.10+ in place of `co_lnotab` by @qgallouedec in https://github.com/huggingface/datasets/pull/7609
- Fixes in docs by @lhoestq in https://github.com/huggingface/datasets/pull/7620
- Add albumentations to use dataset by @ternaus in https://github.com/huggingface/datasets/pull/7596
- minor docs data aug by @lhoestq in https://github.com/huggingface/datasets/pull/7621
- fix: raise error in FolderBasedBuilder when data_dir and data_files are missing by @ArjunJagdale in https://github.com/huggingface/datasets/pull/7623
- fix save_infos by @lhoestq in https://github.com/huggingface/datasets/pull/7639
- better features repr by @lhoestq in https://github.com/huggingface/datasets/pull/7640
- update docs and docstrings by @lhoestq in https://github.com/huggingface/datasets/pull/7641
- fix length for ci by @lhoestq in https://github.com/huggingface/datasets/pull/7642
- Backward compat sequence instance by @lhoestq in https://github.com/huggingface/datasets/pull/7643
- fix sequence ci by @lhoestq in https://github.com/huggingface/datasets/pull/7644
- Custom metadata filenames by @lhoestq in https://github.com/huggingface/datasets/pull/7663
- Update the beans dataset link in Preprocess by @HJassar in https://github.com/huggingface/datasets/pull/7659
- Backward compat list feature by @lhoestq in https://github.com/huggingface/datasets/pull/7666
- Fix infer list of images by @lhoestq in https://github.com/huggingface/datasets/pull/7667
- Fix audio bytes by @lhoestq in https://github.com/huggingface/datasets/pull/7670
- Fix double sequence by @lhoestq in https://github.com/huggingface/datasets/pull/7672
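
For context on the `_dill.py` entry in the list above: `co_linetable` and `co_lnotab` are attributes of Python code objects that encode line-number information. `co_linetable` was added in Python 3.10, and `co_lnotab` has been deprecated since Python 3.12. The following is a minimal sketch of a version-guarded lookup, not the actual change made in the pull request; the helper name `line_table_bytes` is hypothetical.

```python
import sys


def line_table_bytes(code):
    """Return the raw line-number table of a code object (illustration only)."""
    if sys.version_info >= (3, 10):
        # co_linetable is available from Python 3.10 onwards
        return code.co_linetable
    # Older interpreters only expose co_lnotab
    return code.co_lnotab


def sample():
    return 1


table = line_table_bytes(sample.__code__)
print(type(table).__name__, len(table))
```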
New Contributors
- @TopCoder2K made their first contribution in https://github.com/huggingface/datasets/pull/7564
- @francescorubbo made their first contribution in https://github.com/huggingface/datasets/pull/7522
- @emmanuel-ferdman made their first contribution in https://github.com/huggingface/datasets/pull/7576
- @SilvanCodes made their first contribution in https://github.com/huggingface/datasets/pull/7581
- @ternaus made their first contribution in https://github.com/huggingface/datasets/pull/7593
- @ArjunJagdale made their first contribution in https://github.com/huggingface/datasets/pull/7623
- @TyTodd made their first contribution in https://github.com/huggingface/datasets/pull/7616
- @HJassar made their first contribution in https://github.com/huggingface/datasets/pull/7659
Full Changelog: https://github.com/huggingface/datasets/compare/3.6.0...4.0.0