| Name | Modified | Size |
| --- | --- | --- |
| py_data_juicer-1.5.1-py3-none-any.whl | 2026-03-17 | 2.1 MB |
| README.md | 2026-03-17 | 4.6 kB |
| Release v1.5.1_ LaTeX OPs_ Compressed Format Support_ Operator Robustness Fixes source code.tar.gz | 2026-03-17 | 51.6 MB |
| Release v1.5.1_ LaTeX OPs_ Compressed Format Support_ Operator Robustness Fixes source code.zip | 2026-03-17 | 52.5 MB |

Totals: 4 items, 106.2 MB

Major Updates

  • 📊 Stats: 13 PRs merged, from 7 contributors
  • 📄 Two new LaTeX-focused mapper OPs shipped, extending data-juicer's document processing capabilities to handle .tex archives and figure contexts.
  • 🗜️ Compressed dataset format support: json[l].gz files can now be loaded directly, and Ray datasets gain proper support for reading compressed JSON files.
  • 📚 New documentation added covering cache, export, and tracing workflows to help users better understand and debug data processing pipelines.
  • 🤖 Major refactor and upgrade of data-juicer-agents completed: the project architecture and CLI/session capabilities were comprehensively redesigned for better maintainability and extensibility. See data-juicer-agents for more details.

New OPs

  • latex_merge_tex_mapper: Got a bunch of .tex files packed in an archive? This OP automatically extracts and merges them into a single unified LaTeX document, making it much easier to process multi-file LaTeX projects. [#932]
  • latex_figure_context_extractor_mapper: Extracts figure-related context (e.g., captions, surrounding paragraphs) from LaTeX source files, so you can build richer multimodal datasets from academic papers. [#923]
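
To make the idea behind latex_figure_context_extractor_mapper concrete, here is a minimal sketch of pulling a caption, a label, and the preceding paragraph out of LaTeX source. It is not the OP's actual implementation: the function name `extract_figure_contexts`, the output keys, and the regex-based approach are all assumptions chosen for illustration, and captions containing nested braces are deliberately not handled.

```python
import re
from typing import Dict, List

# Hypothetical patterns for illustration; the real OP may parse LaTeX differently.
FIGURE_RE = re.compile(r"\\begin\{figure\*?\}(.*?)\\end\{figure\*?\}", re.DOTALL)
CAPTION_RE = re.compile(r"\\caption\{([^{}]*)\}")  # no nested braces
LABEL_RE = re.compile(r"\\label\{([^{}]*)\}")


def extract_figure_contexts(tex: str) -> List[Dict[str, str]]:
    """Return caption, label, and preceding paragraph for each figure."""
    contexts = []
    for match in FIGURE_RE.finditer(tex):
        body = match.group(1)
        caption = CAPTION_RE.search(body)
        label = LABEL_RE.search(body)
        # Treat the last blank-line-separated block before the figure
        # as its surrounding paragraph.
        before = tex[: match.start()].strip()
        preceding = before.split("\n\n")[-1].strip() if before else ""
        contexts.append(
            {
                "caption": caption.group(1) if caption else "",
                "label": label.group(1) if label else "",
                "preceding_paragraph": preceding,
            }
        )
    return contexts


SAMPLE = r"""Intro paragraph.

As shown in Figure~\ref{fig:arch}, the pipeline has three stages.

\begin{figure}
\includegraphics{arch.pdf}
\caption{Pipeline overview.}
\label{fig:arch}
\end{figure}

Later text.
"""

figures = extract_figure_contexts(SAMPLE)
```

Pairing each figure with nearby text in this way is what lets a caption and its referring paragraph travel together as one multimodal sample.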

Enhancements

  • Load dataset with extra kwargs: You can now pass arbitrary extra arguments to datasets.load_dataset() via the new load_dataset_kwargs config field — handy for datasets that need non-standard loading options. [#922]
  • Custom tokenizer in RemoveRepeatSentencesMapper: The mapper now accepts a custom tokenizer, so you're no longer stuck with the default sentence splitter — great for non-English text or domain-specific tokenization needs. [#925]
  • Compressed JSON support: Added support for reading json[l].gz files directly, and fixed Ray datasets to properly handle compressed JSON — no more manual decompression before feeding data in. [#919]
  • Faster TokenNumFilter with batch tokenization: Instead of tokenizing one sample at a time, TokenNumFilter now processes the whole batch in one shot, significantly speeding up token-count-based filtering. [#929]
  • Cache redundant sum() calls in repetition filters: Repetition filters were calling sum() multiple times on the same data. These results are now cached, saving unnecessary computation on large batches. [#924]
  • New docs: cache, export, and tracing: Added dedicated documentation pages explaining how data-juicer handles caching, result exporting, and execution tracing — a much-needed addition for debugging complex pipelines. [#935]
  • Enhanced op_search with BM25/Regex & MCP Server upgrade: Added BM25 and regex search modes to op_search (no longer requiring dj-agents), and expanded the MCP server with four new tools covering op search, dataset analysis, config schema retrieval, and dataset loading strategy discovery. [#937]
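
For the compressed JSON item above [#919], the following sketch shows what reading a json[l].gz file without a manual decompression step can look like, using only the Python standard library. It illustrates the general technique and is not data-juicer's loader: the helper name `iter_jsonl` and the suffix-based dispatch are assumptions.

```python
import gzip
import json
from pathlib import Path
from typing import Any, Dict, Iterator


def iter_jsonl(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one record per non-empty line of a .jsonl or .jsonl.gz file."""
    # Hypothetical dispatch rule: pick the opener from the final suffix.
    opener = gzip.open if Path(path).suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines between records
                yield json.loads(line)
```

Because the file is opened in text mode and decompressed as a stream, records can be consumed one at a time without writing an uncompressed copy to disk.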

Fixed Bugs

  • Wrong cache key in ImageFaceCountFilter: The filter was using an incorrect key when reading from cache, causing it to miss cached results and redo redundant work. Now fixed. [#921]
  • GeneralFusedOP silently dropping Mapper results: When running a fused pipeline, Mapper outputs were being discarded instead of passed downstream. This was a silent data loss bug — now properly fixed. [#928]
  • Shared _default_kwargs mutation polluting other OP instances: Operator instances were accidentally sharing a mutable default kwargs dict, meaning modifying one OP's config could inadvertently affect other instances. Each instance now gets its own copy. [#926]
  • NlpaugEnMapper only augmenting the first sample in a batch: Due to a bug in the batching logic, text augmentation was only being applied to the very first sample, leaving the rest untouched. All samples in a batch are now correctly augmented. [#927]
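
The shared _default_kwargs bug above [#926] is an instance of a common Python pitfall, sketched below. Only the attribute name `_default_kwargs` comes from the release notes; the class names, the `batch_size` key, and the use of `copy.deepcopy` are assumptions for illustration, and the actual fix in data-juicer may differ.

```python
import copy


class SharedDefaultsOp:
    """Buggy pattern: every instance aliases the same class-level dict."""

    _default_kwargs = {"batch_size": 1}

    def __init__(self):
        self.kwargs = self._default_kwargs  # alias, not a copy


class IsolatedDefaultsOp:
    """Fixed pattern: each instance starts from its own copy."""

    _default_kwargs = {"batch_size": 1}

    def __init__(self):
        self.kwargs = copy.deepcopy(self._default_kwargs)


# Mutating one buggy instance leaks into the other.
first, second = SharedDefaultsOp(), SharedDefaultsOp()
first.kwargs["batch_size"] = 64
leaked = second.kwargs["batch_size"]  # 64, although `second` was never touched

# With per-instance copies the change stays local.
third, fourth = IsolatedDefaultsOp(), IsolatedDefaultsOp()
third.kwargs["batch_size"] = 64
isolated = fourth.kwargs["batch_size"]  # still 1
```

A deep copy is used here so that nested values inside the defaults are also isolated; a shallow copy would be enough only if every value in the dict is immutable.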

Acknowledgements

  • @JohnGiorgi contributed three impactful improvements: load_dataset_kwargs support, custom tokenizer in RemoveRepeatSentencesMapper, and batch tokenization optimization in TokenNumFilter. [#922] [#925] [#929]
  • @dubin555 squashed multiple operator bugs and added performance optimizations across filters and the fused pipeline. [#921] [#924] [#926] [#927] [#928]
  • @leeyyi and @liyuyi-2001 made their first contributions with two brand-new LaTeX OPs. [#923] [#932]
  • @HunterLine added compressed JSON dataset support. [#919]

Full Changelog: https://github.com/datajuicer/data-juicer/compare/v1.5.0...v1.5.1

Source: README.md, updated 2026-03-17