Release v1.4.1: MCP server, GPU-based Minhash deduplicator, Improved unit test coverage

Files (all modified 2025-07-16)

  • py_data_juicer-1.4.1-py3-none-any.whl (1.8 MB)
  • README.md (1.9 kB)
  • Release v1.4.1_ MCP server_ GPU-based Minhash deduplicator_ Improved unit test coverage. source code.tar.gz (33.4 MB)
  • Release v1.4.1_ MCP server_ GPU-based Minhash deduplicator_ Improved unit test coverage. source code.zip (34.0 MB)

Totals: 4 items, 69.1 MB

Major Updates

  • 🔧 Introduce the Data-Juicer MCP server, which lets users access Data-Juicer's data processing capabilities conveniently through MCP. [#690] [#737]
  • 💪🏻 Unit test coverage is raised to 85%+, and several bugs in test cases (OOM, encoding errors, and so on) are resolved, which makes Data-Juicer more reliable. [#698] [#717] [#720] [#727]
  • 🤝 GPU-based Minhash deduplication is supported, developed in collaboration with developers from NVIDIA. [#694] [#644]
  • 🧩 RayExporter can now export a Ray dataset in more formats, in addition to json/jsonl. [#687]
  • 🎥 Two demo videos are added to introduce Data-Juicer's core functions, agentic usage, and the sandbox. [#738]
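
For readers unfamiliar with the Minhash deduplication mentioned above, the idea can be illustrated with a minimal CPU-only sketch in plain Python. This is not Data-Juicer's implementation and not the GPU-based operator. The function names, the character 5-gram shingling, and the choice of 128 salted BLAKE2b hashes are all assumptions made for illustration.

```python
import hashlib


def shingles(text, n=5):
    """Return the set of character n-grams of a lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def minhash_signature(text, num_hashes=128, n=5):
    """For each salted hash function, keep the minimum hash value over all shingles."""
    grams = [g.encode("utf-8") for g in shingles(text, n)]
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g, digest_size=8, salt=salt).digest(), "little"
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    doc = "The quick brown fox jumps over the lazy dog near the river bank"
    near_dup = doc + "."
    other = "Completely unrelated sentence about GPU kernels and memory"
    sig = minhash_signature(doc)
    print(estimated_jaccard(sig, minhash_signature(near_dup)))
    print(estimated_jaccard(sig, minhash_signature(other)))
```

A deduplicator would typically treat pairs whose estimated similarity exceeds a chosen threshold as duplicates. Comparing fixed-length signatures instead of full shingle sets is what makes this kind of workload a plausible fit for batched or GPU execution.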

New Operators

  • download_file_mapper downloads data from URLs to local files or specified fields. [#709]
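
The behavior described for download_file_mapper (URL in, local file or field out) can be pictured with the standard-library sketch below. It does not use Data-Juicer's operator API. The helper name `download_sample` and the parameters `url_key`, `save_dir`, `path_key`, and `save_field` are hypothetical and are not taken from the operator's actual configuration.

```python
import os
import urllib.request
from urllib.parse import urlparse


def download_sample(sample, url_key="url", save_dir=None,
                    path_key="local_path", save_field=None):
    """Fetch sample[url_key]; write it to save_dir and/or store the bytes in save_field.

    All parameter names are illustrative assumptions, not Data-Juicer options.
    """
    url = sample[url_key]
    with urllib.request.urlopen(url) as response:
        data = response.read()

    result = dict(sample)  # leave the input sample untouched
    if save_dir is not None:
        os.makedirs(save_dir, exist_ok=True)
        # Derive a file name from the URL path, with a fallback for bare hosts.
        filename = os.path.basename(urlparse(url).path) or "download.bin"
        target = os.path.join(save_dir, filename)
        with open(target, "wb") as out:
            out.write(data)
        result[path_key] = target
    if save_field is not None:
        result[save_field] = data
    return result
```

Returning a copy rather than mutating the input keeps the function safe to use in a map-style pipeline where the same sample may be reprocessed.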

Enhancements

  • A new analysis method is added: correlation analysis among stats. [#663]
  • Several core dependencies are updated and fixed to newer versions, and dependency conflicts are resolved. [#715] [#717] [#723]
  • The EasyAnimate pipelines in the sandbox are updated to follow the sandbox refactoring. [#710]
  • Apply more reliable pre-commit tools to improve the code style of Data-Juicer. [#714]
  • Support storing and processing image bytes data in the dataset. [#725]
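
As a rough illustration of what correlation analysis among stats can look like, the sketch below computes pairwise Pearson correlations between per-sample stats in plain Python. The release notes do not say which correlation measure Data-Juicer uses, so Pearson is an assumption here. The stat names and values are also made up for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant column
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(stats):
    """Map each pair of stat names to the correlation of their value columns."""
    names = list(stats)
    return {a: {b: pearson(stats[a], stats[b]) for b in names} for a in names}


if __name__ == "__main__":
    # Hypothetical per-sample stats; one list entry per sample.
    stats = {
        "text_len": [10, 20, 30, 40],
        "num_words": [2, 4, 6, 8],
        "special_char_ratio": [0.4, 0.3, 0.2, 0.1],
    }
    for name, row in correlation_matrix(stats).items():
        print(name, {k: round(v, 3) for k, v in row.items()})
```

Strongly correlated stats carry largely redundant information, so a matrix like this can help decide which filters are worth keeping in a recipe.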

Bugs Fixed

  • Fix the wheel and Docker image build bug. [#706]
  • Fix bugs in log_summarization. [#710]
  • Fix "no module named data_juicer" error after installing from the wheel file. [#727]

Acknowledgement

  • @fanronghai helps to fix the param error in the dataset_splitting_by_language tool. [#713]
  • @ayushdg helps to support a GPU-version Minhash deduplicator. [#644]
  • @ricksun2023 helps to fix the bugs that occur when more than one OP with the same name appears in the configs. [#730]

Full Changelog: https://github.com/modelscope/data-juicer/compare/v1.4.0...v1.4.1

Source: README.md, updated 2025-07-16