Release v1.4.6: introduce Q&A Copilot, Video bytes I/O, Tracer for Ray mode

Major Updates

  • 🤖 Q&A copilot: a new Q&A copilot answers questions from users. The bot is now available in the docs, the DingTalk group, Discord, etc. [#891]
  • 🎬 I/O for video bytes: videos can now be read and stored as bytes. [#882]
  • 🫆 Tracer for Ray mode: the tracer can now trace changed samples in Ray mode. [#885]

Enhancements

  • Add a new Dockerfile for the embodied-AI use case, and update the CUDA/system/... versions of the base Docker image. [#887]
  • Add Copilot news and refine the DingTalk and Discord links/QR codes in the docs. [#891]
  • Switch word lookups from lists to sets to speed up two OPs. [#890]
  • Add a new workflow that automatically fetches the traffic report from GitHub Insights. [#899] [#900]
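
The list-to-set change in [#890] can be illustrated with a minimal sketch. The function and variable names below are hypothetical and are not taken from the Data-Juicer code base; the sketch only shows why a set helps when many tokens are checked against a fixed word list.

```python
# Hypothetical illustration; names are not taken from Data-Juicer.
def count_flagged_words(words, flagged):
    """Count how many tokens in `words` appear in `flagged`."""
    # Build the set once. Membership tests on a set take O(1) time on
    # average, whereas `w in some_list` scans the list on every lookup.
    flagged_set = set(flagged)
    return sum(1 for w in words if w in flagged_set)


tokens = ["clean", "spam", "data", "junk", "spam"]
flagged_list = ["spam", "junk"]
count = count_flagged_words(tokens, flagged_list)
```

The gain grows with the size of the word list, since each lookup no longer depends on how many words it contains.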

Fixed Bugs

  • Fix a TypeError when using field_types in the YAML config for RequiredFieldsValidator. [#886]
  • Replace the deprecated concurrency parameter with the compute parameter in the ray.data.Dataset.map_batches() call. [#888]
  • Prevent divide-by-zero in calculate_ray_np when the Ray cluster is not ready. [#864]
  • Add thread limiting for multi-process workloads to prevent over-subscription. [#877]
  • Fix a bug where the unit tests for standalone mode could hang. [#896]
  • Update several out-of-date links in the docs. [#898]
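
As a rough sketch of the thread-limiting idea in [#877], one common approach is to cap the thread pools of numerical libraries so that workers times threads does not exceed the available cores. The helper names below are hypothetical and this is not the Data-Juicer implementation; it assumes the usual OpenMP/MKL/OpenBLAS/NumExpr environment variables, which generally need to be set before those libraries are loaded.

```python
import os

# Hypothetical helper; not the Data-Juicer implementation.
# Environment variables commonly honored by OpenMP, MKL, OpenBLAS and NumExpr.
THREAD_ENV_VARS = (
    "OMP_NUM_THREADS",
    "MKL_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "NUMEXPR_NUM_THREADS",
)


def threads_per_worker(num_workers, cpu_count=None):
    """Split the available cores evenly across workers, never below 1."""
    cpu_count = cpu_count or os.cpu_count() or 1
    # max(1, ...) on the divisor also avoids dividing by zero when the
    # worker count is not known yet and is reported as 0.
    return max(1, cpu_count // max(1, num_workers))


def limit_threads(num_workers, cpu_count=None, env=os.environ):
    """Write the per-worker thread cap into `env` and return it."""
    n = threads_per_worker(num_workers, cpu_count)
    for var in THREAD_ENV_VARS:
        env[var] = str(n)
    return n
```

For example, calling `limit_threads(4, cpu_count=16)` before starting the worker processes would cap each worker at 4 threads, so 4 workers use at most 16 threads in total.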

Acknowledgement

  • @dubin555 helped fix several bugs and improve the processing performance of some OPs. [#886] [#890]
  • @xyuzh helped update the Ray usage in some OPs to the latest version, fix some bugs, and optimize the parallel strategies. [#888] [#864]
  • @XinyuLiu1999 helped fix an over-subscription bug in multi-process workloads. [#877]

Full Changelog: https://github.com/datajuicer/data-juicer/compare/v1.4.5...v1.4.6

Source: README.md, updated 2026-02-02