| Name | Modified | Size | Downloads / Week |
|---|---|---|---|
| py_data_juicer-1.5.1-py3-none-any.whl | 2026-03-17 | 2.1 MB | |
| README.md | 2026-03-17 | 4.6 kB | |
| Release v1.5.1_ LaTeX OPs_ Compressed Format Support_ Operator Robustness Fixes source code.tar.gz | 2026-03-17 | 51.6 MB | |
| Release v1.5.1_ LaTeX OPs_ Compressed Format Support_ Operator Robustness Fixes source code.zip | 2026-03-17 | 52.5 MB | |
| Totals: 4 Items | | 106.2 MB | 0 |
Major Updates
- 📊 Stats: 13 PRs merged, from 7 contributors
- 📄 Two new LaTeX-focused mapper OPs shipped, extending data-juicer's document processing capabilities to handle `.tex` archives and figure contexts.
- 🗜️ Compressed dataset format support: `json[l].gz` files can now be loaded directly, and Ray datasets gain proper support for reading compressed JSON files.
- 📚 New documentation added covering cache, export, and tracing workflows to help users better understand and debug data processing pipelines.
- 🤖 Major refactor and upgrade of data-juicer-agents completed: the project architecture and CLI/session capabilities were comprehensively redesigned for better maintainability and extensibility. See data-juicer-agents for more details.
New OPs
- `latex_merge_tex_mapper`: Got a bunch of `.tex` files packed in an archive? This OP automatically extracts and merges them into a single unified LaTeX document, making it much easier to process multi-file LaTeX projects. [#932]
- `latex_figure_context_extractor_mapper`: Extracts figure-related context (e.g., captions, surrounding paragraphs) from LaTeX source files, so you can build richer multimodal datasets from academic papers. [#923]
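To make the idea behind merging a multi-file LaTeX project concrete, here is a minimal sketch using only the Python standard library. It is not the implementation of `latex_merge_tex_mapper`: the function names `read_tex_files` and `merge_tex` are hypothetical, and the approach (inlining `\input`/`\include` targets recursively) is an assumption about what such a merge could look like.

```python
import io
import re
import tarfile

# Matches \input{name} or \include{name}. This is a simplification: it does
# not skip commented-out lines or handle \import-style packages.
INPUT_RE = re.compile(r"\\(?:input|include)\{([^}]+)\}")


def read_tex_files(archive_bytes: bytes) -> dict:
    """Return {member name: text} for every .tex file in a tar archive."""
    files = {}
    # Mode "r" lets tarfile auto-detect gzip/bz2/xz compression.
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r") as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.endswith(".tex"):
                files[member.name] = tar.extractfile(member).read().decode("utf-8")
    return files


def merge_tex(files: dict, main: str, _seen: frozenset = frozenset()) -> str:
    """Inline referenced files into `main`, guarding against include cycles."""
    seen = _seen | {main}

    def replace(match):
        name = match.group(1).strip()
        if not name.endswith(".tex"):
            name += ".tex"
        # Leave the command untouched if the target is missing or would
        # create a cycle.
        if name not in files or name in seen:
            return match.group(0)
        return merge_tex(files, name, seen)

    return INPUT_RE.sub(replace, files[main])
```

Calling `merge_tex(read_tex_files(data), "main.tex")` on the bytes of an archive would return one string with the referenced files spliced in where they were included. Choosing the main file automatically is left out of this sketch.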
Enhancements
- Load dataset with extra kwargs: You can now pass arbitrary extra arguments to `datasets.load_dataset()` via the new `load_dataset_kwargs` config field, which is handy for datasets that need non-standard loading options. [#922]
- Custom tokenizer in `RemoveRepeatSentencesMapper`: The mapper now accepts a custom tokenizer, so you're no longer stuck with the default sentence splitter. This is great for non-English text or domain-specific tokenization needs. [#925]
- Compressed JSON support: Added support for reading `json[l].gz` files directly, and fixed Ray datasets to properly handle compressed JSON, so there is no more manual decompression before feeding data in. [#919]
- Faster `TokenNumFilter` with batch tokenization: Instead of tokenizing one sample at a time, `TokenNumFilter` now processes the whole batch in one shot, significantly speeding up token-count-based filtering. [#929]
- Cache redundant `sum()` calls in repetition filters: Repetition filters were calling `sum()` multiple times on the same data. These results are now cached, saving unnecessary computation on large batches. [#924]
- New docs: cache, export, and tracing: Added dedicated documentation pages explaining how data-juicer handles caching, result exporting, and execution tracing, a much-needed addition for debugging complex pipelines. [#935]
- Enhanced `op_search` with BM25/Regex & MCP Server upgrade: Added BM25 and regex search modes to `op_search` (no longer requiring dj-agents), and expanded the MCP server with four new tools covering op search, dataset analysis, config schema retrieval, and dataset loading strategy discovery. [#937]
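For readers unfamiliar with what reading a `jsonl.gz` file directly involves, the following is a minimal standard-library sketch. It is not data-juicer's loader; `iter_jsonl` is a hypothetical helper, and picking the opener from the file suffix is an assumption made for illustration.

```python
import gzip
import json
from typing import Iterator


def iter_jsonl(path: str) -> Iterator[dict]:
    """Yield one record per non-empty line of a .jsonl or .jsonl.gz file."""
    # gzip.open in text mode decompresses on the fly, so the file never has
    # to be unpacked to disk first.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)
```

Because the function is a generator, records are decoded lazily and a large compressed file does not need to fit in memory at once.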
Fixed Bugs
- Wrong cache key in `ImageFaceCountFilter`: The filter was using an incorrect key when reading from cache, causing it to miss cached results and redo redundant work. Now fixed. [#921]
- `GeneralFusedOP` silently dropping Mapper results: When running a fused pipeline, Mapper outputs were being discarded instead of passed downstream. This was a silent data loss bug and is now properly fixed. [#928]
- Shared `_default_kwargs` mutation polluting other OP instances: Operator instances were accidentally sharing a mutable default kwargs dict, meaning modifying one OP's config could inadvertently affect other instances. Each instance now gets its own copy. [#926]
- `NlpaugEnMapper` only augmenting the first sample in a batch: Due to a bug in the batching logic, text augmentation was only being applied to the very first sample, leaving the rest untouched. All samples in a batch are now correctly augmented. [#927]
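The shared `_default_kwargs` bug is an instance of a common Python pitfall, which the sketch below reproduces in isolation. The classes `SharedDefaultsOp` and `IsolatedDefaultsOp` and the `batch_size` key are hypothetical and are not taken from the data-juicer code base. Only the attribute name `_default_kwargs` comes from the release notes, and using `copy.deepcopy` is one plausible fix rather than necessarily the one applied in #926.

```python
import copy


class SharedDefaultsOp:
    """Buggy pattern: every instance aliases the same class-level dict."""

    _default_kwargs = {"batch_size": 1}

    def __init__(self):
        # No copy is made, so self.kwargs IS the class attribute.
        self.kwargs = self._default_kwargs


class IsolatedDefaultsOp:
    """Fixed pattern: each instance starts from its own copy."""

    _default_kwargs = {"batch_size": 1}

    def __init__(self):
        # deepcopy also protects nested mutable values inside the defaults.
        self.kwargs = copy.deepcopy(self._default_kwargs)
```

With the first class, changing `kwargs` on one instance changes it for every other instance, as well as for the class itself. With the second, the change stays local to the instance that made it.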
Acknowledgements
- @JohnGiorgi contributed three impactful improvements: `load_dataset_kwargs` support, custom tokenizer in `RemoveRepeatSentencesMapper`, and batch tokenization optimization in `TokenNumFilter`. [#922] [#925] [#929]
- @dubin555 squashed multiple operator bugs and added performance optimizations across filters and the fused pipeline. [#921] [#924] [#926] [#927] [#928]
- @leeyyi and @liyuyi-2001 made their first contributions with two brand-new LaTeX OPs. [#923] [#932]
- @HunterLine added compressed JSON dataset support. [#919]
Full Changelog: https://github.com/datajuicer/data-juicer/compare/v1.5.0...v1.5.1