Name	Modified	Size	Downloads / Week
py_data_juicer-1.5.1-py3-none-any.whl	2026-03-17	2.1 MB	0
README.md	2026-03-17	4.6 kB	0
Release v1.5.1_ LaTeX OPs_ Compressed Format Support_ Operator Robustness Fixes source code.tar.gz	2026-03-17	51.6 MB	0
Release v1.5.1_ LaTeX OPs_ Compressed Format Support_ Operator Robustness Fixes source code.zip	2026-03-17	52.5 MB	0
Totals: 4 Items		106.2 MB	0

Major Updates

📊 Stats: 13 PRs merged, from 7 contributors
📄 Two new LaTeX-focused mapper OPs shipped, extending data-juicer's document processing capabilities to handle .tex archives and figure contexts.
🗜️ Compressed dataset format support: json[l].gz files can now be loaded directly, and Ray datasets gain proper support for reading compressed JSON files.
📚 New documentation added covering cache, export, and tracing workflows to help users better understand and debug data processing pipelines.
🤖 Major refactor and upgrade of data-juicer-agents completed: the project architecture and CLI/session capabilities were comprehensively redesigned for better maintainability and extensibility. See data-juicer-agents for more details.

New OPs

latex_merge_tex_mapper: Got a bunch of .tex files packed in an archive? This OP automatically extracts and merges them into a single unified LaTeX document, making it much easier to process multi-file LaTeX projects. [#932]
latex_figure_context_extractor_mapper: Extracts figure-related context (e.g., captions, surrounding paragraphs) from LaTeX source files, so you can build richer multimodal datasets from academic papers. [#923]
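
To give a feel for what merging a multi-file LaTeX project involves, here is a minimal sketch that inlines \input and \include references into a single document. It is not the implementation behind latex_merge_tex_mapper: the function name merge_tex, the in-memory files dict, and the simplified regex are all assumptions made for illustration, and archive extraction is left out entirely.

```python
import re

# Hypothetical simplification: only \input{...} and \include{...} are handled.
_INPUT_RE = re.compile(r"\\(?:input|include)\{([^}]+)\}")


def merge_tex(files, root):
    """Inline referenced files recursively, starting from `root`.

    `files` maps file names to their LaTeX source (an assumed stand-in
    for the contents of an extracted archive).
    """

    def expand(name, seen):
        # Guard against circular includes.
        if name in seen:
            return ""
        seen = seen | {name}

        def repl(match):
            target = match.group(1).strip()
            if not target.endswith(".tex"):
                target += ".tex"
            # Leave references to unknown files untouched.
            if target not in files:
                return match.group(0)
            return expand(target, seen)

        return _INPUT_RE.sub(repl, files[name])

    return expand(root, frozenset())


project = {
    "main.tex": "\\documentclass{article}\n\\input{intro}\n\\end{document}",
    "intro.tex": "Hello \\input{missing}",
}
merged = merge_tex(project, "main.tex")
```

A real merger would also need to handle comments, relative paths, and packages such as subfiles, which this sketch ignores.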

Enhancements

Load dataset with extra kwargs: You can now pass arbitrary extra arguments to datasets.load_dataset() via the new load_dataset_kwargs config field — handy for datasets that need non-standard loading options. [#922]
Custom tokenizer in RemoveRepeatSentencesMapper: The mapper now accepts a custom tokenizer, so you're no longer stuck with the default sentence splitter — great for non-English text or domain-specific tokenization needs. [#925]
Compressed JSON support: Added support for reading json[l].gz files directly, and fixed Ray datasets to properly handle compressed JSON — no more manual decompression before feeding data in. [#919]
Faster TokenNumFilter with batch tokenization: Instead of tokenizing one sample at a time, TokenNumFilter now processes the whole batch in one shot, significantly speeding up token-count-based filtering. [#929]
Cache redundant sum() calls in repetition filters: Repetition filters were calling sum() multiple times on the same data. These results are now cached, saving unnecessary computation on large batches. [#924]
New docs: cache, export, and tracing: Added dedicated documentation pages explaining how data-juicer handles caching, result exporting, and execution tracing — a much-needed addition for debugging complex pipelines. [#935]
Enhanced op_search with BM25/Regex & MCP Server upgrade: Added BM25 and regex search modes to op_search (no longer requiring dj-agents), and expanded the MCP server with four new tools covering op search, dataset analysis, config schema retrieval, and dataset loading strategy discovery. [#937]
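
For readers curious what the compressed JSON support replaces, the manual step used to look roughly like the sketch below: open the .jsonl.gz file with the standard library and parse it line by line. The helper name read_jsonl_gz is hypothetical and this is not data-juicer's loader, only an illustration of the format being read.

```python
import gzip
import json


def read_jsonl_gz(path):
    """Yield one dict per non-empty line of a gzip-compressed JSON Lines file."""
    # "rt" opens the gzip stream in text mode so lines arrive as str.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Because the function is a generator, samples are decompressed and parsed lazily, so a large file does not need to fit in memory at once.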

Fixed Bugs

Wrong cache key in ImageFaceCountFilter: The filter was using an incorrect key when reading from cache, causing it to miss cached results and redo redundant work. Now fixed. [#921]
GeneralFusedOP silently dropping Mapper results: When running a fused pipeline, Mapper outputs were being discarded instead of passed downstream. This was a silent data loss bug — now properly fixed. [#928]
Shared _default_kwargs mutation polluting other OP instances: Operator instances were accidentally sharing a mutable default kwargs dict, meaning modifying one OP's config could inadvertently affect other instances. Each instance now gets its own copy. [#926]
NlpaugEnMapper only augmenting the first sample in a batch: Due to a bug in the batching logic, text augmentation was only being applied to the very first sample, leaving the rest untouched. All samples in a batch are now correctly augmented. [#927]
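
The shared _default_kwargs issue is an instance of a common Python pitfall, sketched below under stated assumptions. The class names BuggyOp and FixedOp and the batch_size key are hypothetical; only the _default_kwargs attribute name comes from the notes above, and the actual fix in [#926] may differ in detail.

```python
import copy


class BuggyOp:
    # Class-level dict: one object shared by every instance.
    _default_kwargs = {"batch_size": 1}

    def __init__(self, **kwargs):
        # Bug: this binds the shared dict itself, so update() mutates it.
        self.kwargs = self._default_kwargs
        self.kwargs.update(kwargs)


class FixedOp:
    _default_kwargs = {"batch_size": 1}

    def __init__(self, **kwargs):
        # Fix: each instance works on its own copy of the defaults.
        self.kwargs = copy.deepcopy(self._default_kwargs)
        self.kwargs.update(kwargs)


buggy_first = BuggyOp(batch_size=8)
buggy_second = BuggyOp()  # silently inherits batch_size=8

fixed_first = FixedOp(batch_size=8)
fixed_second = FixedOp()  # keeps the default batch_size=1
```

A deep copy is used here so that nested values in the defaults are isolated too; a shallow copy would be enough only if every default value is immutable.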

Acknowledgements

@JohnGiorgi contributed three impactful improvements: load_dataset_kwargs support, custom tokenizer in RemoveRepeatSentencesMapper, and batch tokenization optimization in TokenNumFilter. [#922] [#925] [#929]
@dubin555 squashed multiple operator bugs and added performance optimizations across filters and the fused pipeline. [#921] [#924] [#926] [#927] [#928]
@leeyyi and @liyuyi-2001 made their first contributions with two brand-new LaTeX OPs. [#923] [#932]
@HunterLine added compressed JSON dataset support. [#919]

Full Changelog: https://github.com/datajuicer/data-juicer/compare/v1.5.0...v1.5.1

Source: README.md, updated 2026-03-17

Data-Juicer Files

Data processing for and with foundation models
