Datasets - Browse /4.0.0 at SourceForge.net


Name	Modified	Size	Downloads / Week
Parent folder
4.0.0 source code.tar.gz	2025-07-09	1.9 MB	0
4.0.0 source code.zip	2025-07-09	2.0 MB	3
README.md	2025-07-09	8.4 kB	0
Totals: 3 Items		3.9 MB	3

New Features

Add IterableDataset.push_to_hub() by @lhoestq in https://github.com/huggingface/datasets/pull/7595

```python
# Build streaming data pipelines in a few lines of code!
from datasets import load_dataset

ds = load_dataset(..., streaming=True)
ds = ds.map(...).filter(...)
ds.push_to_hub(...)
```
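To illustrate why a streamed `map(...).filter(...)` chain does not need the whole dataset in memory, here is a minimal pure-Python sketch built on generators. It is an analogy only, not the library's implementation: `read_examples`, `lazy_map`, and `lazy_filter` are hypothetical helpers invented for this illustration.

```python
def read_examples():
    """Yield examples one at a time, standing in for a streamed source."""
    for i in range(6):
        yield {"id": i, "text": f"example {i}"}


def lazy_map(examples, fn):
    """Apply fn to each example only when the consumer asks for it."""
    for example in examples:
        yield fn(example)


def lazy_filter(examples, predicate):
    """Pass through only the examples for which predicate is true."""
    for example in examples:
        if predicate(example):
            yield example


# Nothing is read or computed here: the pipeline is just a chain of generators.
pipeline = lazy_filter(
    lazy_map(read_examples(), lambda ex: {**ex, "length": len(ex["text"])}),
    lambda ex: ex["id"] % 2 == 0,
)

# Consuming the pipeline pulls examples through one by one.
kept_ids = [example["id"] for example in pipeline]
print(kept_ids)
```

Each example flows through every stage before the next one is read, which is the property that makes it possible to push a processed stream without first materializing it.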

Add num_proc= to .push_to_hub() (Dataset and IterableDataset) by @lhoestq in https://github.com/huggingface/datasets/pull/7606

```python
# Faster push to the Hub! Available for both Dataset and IterableDataset
ds.push_to_hub(..., num_proc=8)
```

New Column object
Implementation of iteration over values of a column in an IterableDataset object by @TopCoder2K in https://github.com/huggingface/datasets/pull/7564
Lazy column by @lhoestq in https://github.com/huggingface/datasets/pull/7614

```python
# Syntax:
ds["column_name"]  # datasets.Column([...]) or datasets.IterableColumn(...)

# Iterate on a column:
for text in ds["text"]:
    ...

# Load one cell without bringing the full column in memory
first_text = ds["text"][0]  # equivalent to ds[0]["text"]
```

Torchcodec decoding by @TyTodd in https://github.com/huggingface/datasets/pull/7616
Enables streaming only the ranges you need!

```python
# Don't download full audios/videos when it's not necessary
# Now with torchcodec it only streams the required ranges/frames:
from datasets import load_dataset

ds = load_dataset(..., streaming=True)
for example in ds:
    video = example["video"]
    frames = video.get_frames_in_range(start=0, stop=6, step=1)  # only stream certain frames
```

Requires torch>=2.7.0 and FFmpeg >= 4
Not yet available on Windows, but support is coming soon; in the meantime, please use datasets<4.0
Load audio data with AudioDecoder:

```python
audio = dataset[0]["audio"]  # <datasets.features._torchcodec.AudioDecoder object at 0x11642b6a0>
samples = audio.get_all_samples()  # or use get_samples_played_in_range(...)
samples.data  # tensor([[ 0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 2.3447e-06, -1.9127e-04, -5.3330e-05]])
samples.sample_rate  # 16000

# old syntax is still supported
array, sr = audio["array"], audio["sampling_rate"]
```

Load video data with VideoDecoder:

```python
video = dataset[0]["video"]  # <torchcodec.decoders._video_decoder.VideoDecoder object at 0x14a61d5a0>
first_frame = video.get_frame_at(0)
first_frame.data.shape  # (3, 240, 320)
first_frame.pts_seconds  # 0.0
frames = video.get_frames_in_range(0, 6, 1)
frames.data.shape  # torch.Size([5, 3, 240, 320])
```
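The torchcodec requirements stated above (torch>=2.7.0, no Windows support yet) can be checked before relying on the new decoders. The following is a minimal sketch using only the standard library; `parse_version` and `torchcodec_decoding_supported` are hypothetical helpers, not part of `datasets`, and the FFmpeg >= 4 requirement is deliberately left unchecked.

```python
import platform
import re

# Minimum torch version taken from the release notes.
MIN_TORCH = (2, 7, 0)


def parse_version(version):
    """Return the leading major.minor.patch of a version string as a tuple.

    Suffixes such as "+cu126" or "rc1" are ignored; a missing patch counts as 0.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return tuple(int(part or 0) for part in match.groups())


def torchcodec_decoding_supported(torch_version, system=None):
    """Check the torch and platform requirements (hypothetical helper).

    `system` defaults to the current platform name, e.g. "Linux" or "Windows".
    """
    system = system or platform.system()
    if system == "Windows":
        return False
    return parse_version(torch_version) >= MIN_TORCH


print(torchcodec_decoding_supported("2.7.1+cu126", system="Linux"))
print(torchcodec_decoding_supported("2.6.0", system="Linux"))
print(torchcodec_decoding_supported("2.8.0", system="Windows"))
```

In a real environment the installed version could be obtained with `importlib.metadata.version("torch")` and passed in as `torch_version`.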

Breaking changes

Remove scripts altogether by @lhoestq in https://github.com/huggingface/datasets/pull/7592
trust_remote_code is no longer supported
Torchcodec decoding by @TyTodd in https://github.com/huggingface/datasets/pull/7616
torchcodec replaces soundfile for audio decoding
torchcodec replaces decord for video decoding
Replace Sequence by List by @lhoestq in https://github.com/huggingface/datasets/pull/7634
Introduction of the List type

```python
from datasets import Features, List, Value

features = Features({
    "texts": List(Value("string")),
    "four_paragraphs": List(Value("string"), length=4)
})
```

Sequence was a legacy type inherited from TensorFlow Datasets that converted a list of dicts into a dict of lists. It is no longer a type; it is now a utility that returns a List or a dict, depending on the subfeature.

```python
from datasets import Sequence, Value

Sequence(Value("string"))  # List(Value("string"))
Sequence({"texts": Value("string")})  # {"texts": List(Value("string"))}
```
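To make the legacy "list of dicts to dict of lists" conversion concrete, here is a minimal pure-Python sketch of that transposition. It only illustrates the data shape involved and is not the library's code; `list_of_dicts_to_dict_of_lists` is a hypothetical helper, and it assumes every row has the same keys.

```python
def list_of_dicts_to_dict_of_lists(rows):
    """Transpose a list of dicts into a dict of lists.

    Assumes all rows share the keys of the first row.
    """
    if not rows:
        return {}
    return {key: [row[key] for row in rows] for key in rows[0]}


rows = [
    {"text": "hello", "label": 0},
    {"text": "world", "label": 1},
]
columns = list_of_dicts_to_dict_of_lists(rows)
print(columns)  # {'text': ['hello', 'world'], 'label': [0, 1]}
```

This mirrors why `Sequence({"texts": Value("string")})` now resolves to a dict whose values are `List` features rather than to a list of dicts.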

Other improvements and bug fixes

Refactor Dataset.map to reuse cache files mapped with different num_proc by @ringohoffman in https://github.com/huggingface/datasets/pull/7434
fix string_to_dict test by @lhoestq in https://github.com/huggingface/datasets/pull/7571
Preserve formatting in concatenated IterableDataset by @francescorubbo in https://github.com/huggingface/datasets/pull/7522
Fix typos in PDF and Video documentation by @AndreaFrancis in https://github.com/huggingface/datasets/pull/7579
fix: Add embed_storage in Pdf feature by @AndreaFrancis in https://github.com/huggingface/datasets/pull/7582
load_dataset splits typing by @lhoestq in https://github.com/huggingface/datasets/pull/7587
Fixed typos by @TopCoder2K in https://github.com/huggingface/datasets/pull/7572
Fix regex library warnings by @emmanuel-ferdman in https://github.com/huggingface/datasets/pull/7576
[MINOR:TYPO] Update save_to_disk docstring by @cakiki in https://github.com/huggingface/datasets/pull/7575
Add missing property on RepeatExamplesIterable by @SilvanCodes in https://github.com/huggingface/datasets/pull/7581
Avoid multiple default config names by @albertvillanova in https://github.com/huggingface/datasets/pull/7585
Fix broken link to albumentations by @ternaus in https://github.com/huggingface/datasets/pull/7593
fix string_to_dict usage for windows by @lhoestq in https://github.com/huggingface/datasets/pull/7598
No TF in win tests by @lhoestq in https://github.com/huggingface/datasets/pull/7603
Docs and more methods for IterableDataset: push_to_hub, to_parquet... by @lhoestq in https://github.com/huggingface/datasets/pull/7604
Tests typing and fixes for push_to_hub by @lhoestq in https://github.com/huggingface/datasets/pull/7608
fix parallel push_to_hub in dataset_dict by @lhoestq in https://github.com/huggingface/datasets/pull/7613
remove unused code by @lhoestq in https://github.com/huggingface/datasets/pull/7615
Update _dill.py to use co_linetable for Python 3.10+ in place of co_lnotab by @qgallouedec in https://github.com/huggingface/datasets/pull/7609
Fixes in docs by @lhoestq in https://github.com/huggingface/datasets/pull/7620
Add albumentations to use dataset by @ternaus in https://github.com/huggingface/datasets/pull/7596
minor docs data aug by @lhoestq in https://github.com/huggingface/datasets/pull/7621
fix: raise error in FolderBasedBuilder when data_dir and data_files are missing by @ArjunJagdale in https://github.com/huggingface/datasets/pull/7623
fix save_infos by @lhoestq in https://github.com/huggingface/datasets/pull/7639
better features repr by @lhoestq in https://github.com/huggingface/datasets/pull/7640
update docs and docstrings by @lhoestq in https://github.com/huggingface/datasets/pull/7641
fix length for ci by @lhoestq in https://github.com/huggingface/datasets/pull/7642
Backward compat sequence instance by @lhoestq in https://github.com/huggingface/datasets/pull/7643
fix sequence ci by @lhoestq in https://github.com/huggingface/datasets/pull/7644
Custom metadata filenames by @lhoestq in https://github.com/huggingface/datasets/pull/7663
Update the beans dataset link in Preprocess by @HJassar in https://github.com/huggingface/datasets/pull/7659
Backward compat list feature by @lhoestq in https://github.com/huggingface/datasets/pull/7666
Fix infer list of images by @lhoestq in https://github.com/huggingface/datasets/pull/7667
Fix audio bytes by @lhoestq in https://github.com/huggingface/datasets/pull/7670
Fix double sequence by @lhoestq in https://github.com/huggingface/datasets/pull/7672

New Contributors

@TopCoder2K made their first contribution in https://github.com/huggingface/datasets/pull/7564
@francescorubbo made their first contribution in https://github.com/huggingface/datasets/pull/7522
@emmanuel-ferdman made their first contribution in https://github.com/huggingface/datasets/pull/7576
@SilvanCodes made their first contribution in https://github.com/huggingface/datasets/pull/7581
@ternaus made their first contribution in https://github.com/huggingface/datasets/pull/7593
@ArjunJagdale made their first contribution in https://github.com/huggingface/datasets/pull/7623
@TyTodd made their first contribution in https://github.com/huggingface/datasets/pull/7616
@HJassar made their first contribution in https://github.com/huggingface/datasets/pull/7659

Full Changelog: https://github.com/huggingface/datasets/compare/3.6.0...4.0.0

Source: README.md, updated 2025-07-09

Datasets Files

Hub of ready-to-use datasets for ML models
