
New Features

```python
# Build streaming data pipelines in a few lines of code !
from datasets import load_dataset

ds = load_dataset(..., streaming=True)
ds = ds.map(...).filter(...)
ds.push_to_hub(...)
```
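To make the idea of a streaming pipeline concrete, here is a minimal plain-Python sketch of lazy `map` and `filter` steps chained over an example stream. It does not use the `datasets` library, and it is not how the library is implemented. The names `stream_examples`, `lazy_map`, and `lazy_filter` are hypothetical helpers introduced only for illustration.

```python
from typing import Callable, Iterable, Iterator


def stream_examples(n: int, reads: list) -> Iterator[dict]:
    """Hypothetical source: yields examples one at a time and records each read."""
    for i in range(n):
        reads.append(i)
        yield {"id": i, "text": f"example {i}"}


def lazy_map(examples: Iterable[dict], fn: Callable[[dict], dict]) -> Iterator[dict]:
    """Apply fn to each example only when the consumer asks for it."""
    for example in examples:
        yield fn(example)


def lazy_filter(examples: Iterable[dict], predicate: Callable[[dict], bool]) -> Iterator[dict]:
    """Skip examples that fail the predicate, still one at a time."""
    for example in examples:
        if predicate(example):
            yield example


reads: list = []
pipeline = lazy_filter(
    lazy_map(
        stream_examples(1_000_000, reads),
        lambda ex: {**ex, "length": len(ex["text"])},
    ),
    lambda ex: ex["id"] % 2 == 0,
)

# Building the pipeline reads nothing; work happens only on consumption.
assert reads == []
first = next(pipeline)
```

In this model, pulling the first result reads exactly one source example, which is the property that lets a streaming pipeline start producing output without loading the whole dataset.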

```python
# Faster push to Hub ! Available for both Dataset and IterableDataset
ds.push_to_hub(..., num_proc=8)
```

```python
# Syntax:
ds["column_name"]  # datasets.Column([...]) or datasets.IterableColumn(...)

# Iterate on a column:
for text in ds["text"]:
    ...

# Load one cell without bringing the full column in memory
first_text = ds["text"][0]  # equivalent to ds[0]["text"]
```

  • Torchcodec decoding by @TyTodd in https://github.com/huggingface/datasets/pull/7616 - Enables streaming only the ranges you need !

```python
# Don't download full audios/videos when it's not necessary
# Now with torchcodec it only streams the required ranges/frames:
from datasets import load_dataset

ds = load_dataset(..., streaming=True)
for example in ds:
    video = example["video"]
    frames = video.get_frames_in_range(start=0, stop=6, step=1)  # only stream certain frames
```

  • Requires torch>=2.7.0 and FFmpeg >= 4
  • Not available for Windows yet but it is coming soon - in the meantime please use datasets<4.0
  • Load audio data with AudioDecoder:

```python
audio = dataset[0]["audio"]  # <datasets.features._torchcodec.AudioDecoder object at 0x11642b6a0>
samples = audio.get_all_samples()  # or use get_samples_played_in_range(...)
samples.data  # tensor([[ 0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 2.3447e-06, -1.9127e-04, -5.3330e-05]])
samples.sample_rate  # 16000

# old syntax is still supported
array, sr = audio["array"], audio["sampling_rate"]
```

  • Load video data with VideoDecoder:

```python
video = dataset[0]["video"]  # <torchcodec.decoders._video_decoder.VideoDecoder object at 0x14a61d5a0>
first_frame = video.get_frame_at(0)
first_frame.data.shape  # (3, 240, 320)
first_frame.pts_seconds  # 0.0
frames = video.get_frames_in_range(0, 6, 1)
frames.data.shape  # torch.Size([5, 3, 240, 320])
```
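The requirements stated for torchcodec decoding (torch>=2.7.0, FFmpeg >= 4, no Windows support yet) can be checked up front before relying on the new decoders. The sketch below is one possible way to encode them. `parse_version` and `torchcodec_decoding_supported` are hypothetical helpers, not part of `datasets` or `torchcodec`, and the caller is assumed to supply the FFmpeg major version.

```python
import re


def parse_version(text: str) -> tuple:
    """Parse the leading numeric part of a version string, e.g. '2.7.1+cu126' -> (2, 7, 1)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


def torchcodec_decoding_supported(system: str, torch_version: str, ffmpeg_major: int) -> bool:
    """Return True when the three requirements from the release notes are all met.

    system        -- OS name in the style of platform.system(), e.g. 'Linux', 'Darwin', 'Windows'
    torch_version -- version string in the style of torch.__version__
    ffmpeg_major  -- major version of the installed FFmpeg (assumed to be known by the caller)
    """
    return (
        system != "Windows"
        and parse_version(torch_version) >= (2, 7, 0)
        and ffmpeg_major >= 4
    )


# Example calls with literal inputs:
linux_ok = torchcodec_decoding_supported("Linux", "2.7.1+cu126", 6)
windows_ok = torchcodec_decoding_supported("Windows", "2.7.1", 6)
```

In practice the inputs could come from `platform.system()` and `torch.__version__`. When the check fails on Windows, the notes recommend staying on datasets<4.0 for now.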

Breaking changes

```python
from datasets import Features, List, Value

features = Features({
    "texts": List(Value("string")),
    "four_paragraphs": List(Value("string"), length=4)
})
```

  • Sequence was a legacy type from TensorFlow Datasets that converted lists of dicts into dicts of lists. It is no longer a type; it is now a utility that returns a List or a dict, depending on the subfeature.

```python
from datasets import Sequence, Value

Sequence(Value("string"))  # List(Value("string"))
Sequence({"texts": Value("string")})  # {"texts": List(Value("string"))}
```
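The dispatch rule described above can be modelled in a few lines of plain Python. This is only an illustrative sketch, not the library's implementation. `Value` and `List` here are small stand-in dataclasses rather than the real `datasets` classes, `sequence` is a hypothetical function, and the `-1` default for a variable length is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Any, Union


@dataclass(frozen=True)
class Value:
    """Stand-in for a scalar feature with a dtype (not datasets.Value)."""
    dtype: str


@dataclass(frozen=True)
class List:
    """Stand-in for a list feature (not datasets.List)."""
    feature: Any
    length: int = -1  # assumption: -1 means variable length in this sketch


def sequence(feature: Any, length: int = -1) -> Union[List, dict]:
    """Model of the utility: a dict subfeature becomes a dict of Lists, anything else a List."""
    if isinstance(feature, dict):
        return {name: List(sub, length) for name, sub in feature.items()}
    return List(feature, length)


def list_of_dicts_to_dict_of_lists(rows: list) -> dict:
    """The legacy layout conversion: [{'a': 1}, {'a': 2}] becomes {'a': [1, 2]}."""
    if not rows:
        return {}
    return {key: [row[key] for row in rows] for key in rows[0]}


plain = sequence(Value("string"))
nested = sequence({"texts": Value("string")})
columns = list_of_dicts_to_dict_of_lists([{"text": "a"}, {"text": "b"}])
```

The dict branch mirrors the legacy behaviour: a sequence of dicts is represented as a dict whose fields are each a list, which is why the utility returns `{"texts": List(...)}` rather than a `List` of dicts.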

Other improvements and bug fixes

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/3.6.0...4.0.0

Source: README.md, updated 2025-07-09