Home / v1.5.0
Name | Modified | Size
py_data_juicer-1.5.0-py3-none-any.whl | 2026-02-27 | 2.1 MB
README.md | 2026-02-26 | 3.2 kB
Release v1.5.0_ Partitioned Ray Executor_ Embodied-AI OPs_ OP-level Env Management source code.tar.gz | 2026-02-26 | 51.5 MB
Release v1.5.0_ Partitioned Ray Executor_ Embodied-AI OPs_ OP-level Env Management source code.zip | 2026-02-26 | 52.4 MB
Totals: 4 items, 106.1 MB

Major Updates

  • 📊 Stats: 244 files changed with 22,394 additions and 2,053 deletions, from 12 contributors
  • 🗂️ New partitioned Ray executor: [#748]
      • Supports data partitioning, checkpointing, and event logging in Ray mode.
      • Improves fault tolerance, extensibility, observability, flexibility, and processing performance.
  • 🤖 New OPs for embodied AI: improved processing capability for camera-view videos.
  • 🧩 Support for maintaining OP-level isolated environments in Ray mode, which helps resolve dependency conflicts between different OPs. [#892]
      • Environments of different OPs that share common dependencies can be merged under different strategies, and created environments are reused.
      • Based on the Ray runtime environment.
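The environment merging described above can be pictured with a small sketch. This is not Data-Juicer's implementation: the OP names (`op_a`, `op_b`, `op_c`), the requirement format, and the single greedy strategy are all assumptions made for illustration. It groups OPs that share at least one package and have no conflicting version pins, then renders each group as a list of pip specifiers of the kind Ray's `runtime_env` accepts under its `pip` key.

```python
from typing import Dict, List, Optional

# Hypothetical format: package name -> pinned version (None means "any version").
Requirements = Dict[str, Optional[str]]


def can_merge(a: Requirements, b: Requirements) -> bool:
    """Mergeable if the sets share a package and no shared package has two different pins."""
    shared = a.keys() & b.keys()
    if not shared:
        return False
    return all(a[p] is None or b[p] is None or a[p] == b[p] for p in shared)


def merge(a: Requirements, b: Requirements) -> Requirements:
    """Union of both sets, keeping a concrete pin whenever one side has it."""
    merged = dict(a)
    for pkg, version in b.items():
        if merged.get(pkg) is None:
            merged[pkg] = version
    return merged


def group_envs(op_reqs: Dict[str, Requirements]) -> List[dict]:
    """Greedy strategy (an assumption): put each OP into the first mergeable environment."""
    envs: List[dict] = []
    for op_name, reqs in op_reqs.items():
        for env in envs:
            if can_merge(env["reqs"], reqs):
                env["ops"].append(op_name)
                env["reqs"] = merge(env["reqs"], reqs)
                break
        else:
            envs.append({"ops": [op_name], "reqs": dict(reqs)})
    return envs


def to_pip_specs(reqs: Requirements) -> List[str]:
    """Render requirements as pip specifier strings, sorted for stable output."""
    return sorted(p if v is None else f"{p}=={v}" for p, v in reqs.items())


# Hypothetical OPs: op_a and op_b agree on torch, op_c pins a conflicting version.
ops = {
    "op_a": {"torch": "2.1.0", "numpy": None},
    "op_b": {"torch": "2.1.0", "opencv-python": None},
    "op_c": {"torch": "1.13.1"},
}
envs = group_envs(ops)
```

Here `op_a` and `op_b` end up sharing one environment, while `op_c` gets its own because its `torch` pin conflicts. A real strategy would also need to handle version ranges rather than exact pins.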

New OPs

  • video_camera_calibration_static_deepcalib_mapper: Compute the camera intrinsics and field of view (FOV) for a static camera using DeepCalib. [#871]
  • video_camera_calibration_static_moge_mapper: Compute the camera intrinsics and field of view (FOV) for a static camera using MoGe-2. [#871]
  • video_undistort_mapper: Undistort raw videos with corresponding camera intrinsics and distortion coefficients. [#871]
  • video_hand_reconstruction_hawor_mapper: Use HaWoR and MoGe-2 for hand reconstruction. [#893]
  • video_camera_pose_mapper: Extract camera poses with MegaSaM and MoGe-2. [#894]
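For readers unfamiliar with how camera intrinsics relate to the field of view mentioned for the calibration OPs, the standard pinhole-camera relation is sketched below. It is independent of DeepCalib and MoGe-2 and does not reflect the OPs' actual output format; the function names and example values are illustrative assumptions, and the principal point is assumed to sit at the image centre.

```python
import math
from typing import List


def intrinsics_matrix(fx: float, fy: float, cx: float, cy: float) -> List[List[float]]:
    """3x3 pinhole intrinsics matrix K, with focal lengths and principal point in pixels."""
    return [
        [fx, 0.0, cx],
        [0.0, fy, cy],
        [0.0, 0.0, 1.0],
    ]


def fov_degrees(focal_px: float, size_px: float) -> float:
    """Field of view along one axis: FOV = 2 * atan(size / (2 * focal))."""
    return math.degrees(2.0 * math.atan(size_px / (2.0 * focal_px)))


# Illustrative values only: a 1920x1080 frame with fx = fy = 960 pixels.
width, height = 1920, 1080
K = intrinsics_matrix(fx=960.0, fy=960.0, cx=width / 2, cy=height / 2)
horizontal_fov = fov_degrees(K[0][0], width)
vertical_fov = fov_degrees(K[1][1], height)
```

With these values the horizontal FOV is 90 degrees, since the half-width equals the focal length.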

Enhancements

  • Allow batch inference for image_captioning_mapper to improve processing performance. [#901]
  • Optimize the logic of a branch by avoiding unnecessary function calls. [#903]
  • Refactor Operator Search and Metadata Extraction for Enhanced Accuracy. [#889]
  • Allow the extract_keyframes func to return meta info only, and remove the sample info from error logs to reduce log size. [#904]
  • Reduce the memory usage in convert_to_absolute_paths func by iterating only over the specified columns. [#907]
  • Reorganize the main readme and update the tutorials in the playground to the latest version. [#908]
  • Optimize issue templates: emphasize English usage and add Q&A Copilot check. [#912]
  • Convert to absolute paths for datasets in object stores. [#913]
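The idea behind the convert_to_absolute_paths change, touching only the columns that hold paths, can be sketched as follows. This is a simplified stand-in and not the library function: the name `to_absolute_paths`, its signature, and the sample layout are assumptions, and POSIX-style paths are used so the result does not depend on the platform.

```python
import posixpath
from typing import Dict, Iterable, List


def to_absolute_paths(
    samples: List[Dict[str, object]],
    base_dir: str,
    path_keys: Iterable[str],
) -> List[Dict[str, object]]:
    """Rewrite relative paths in the named columns only; other columns are never read."""
    keys = list(path_keys)
    for sample in samples:
        for key in keys:
            value = sample.get(key)
            # Leave missing values, non-strings, and already-absolute paths untouched.
            if isinstance(value, str) and not posixpath.isabs(value):
                sample[key] = posixpath.normpath(posixpath.join(base_dir, value))
    return samples


# Hypothetical samples: only the "video" column contains paths.
samples = [
    {"text": "a short caption", "video": "clips/a.mp4"},
    {"text": "another caption", "video": "/data/b.mp4"},
]
result = to_absolute_paths(samples, base_dir="/root/ds", path_keys=["video"])
```

Restricting the loop to `path_keys` means large columns such as `text` are skipped entirely instead of being inspected for every sample.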

Fixed Bugs

  • Fix a bug so that the minhash deduplicator can trace all duplicate items. [#906]
  • Fix the "multiple values for num_proc" bug in TextFormatter. [#905]
  • Fix the homepage rendering issue and remove outdated OP docs. [#910]
  • Fix several bugs in test stability and robustness. [#918]
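The "multiple values for num_proc" message is the kind of `TypeError` Python raises when the same keyword reaches a function twice, typically once explicitly and once through an unpacked dict. The notes do not say how the bug arose in Data-Juicer, so the sketch below only reproduces the general failure mode and one common remedy; `run_loader`, `call_buggy`, and `call_fixed` are hypothetical names.

```python
from typing import Any, Dict


def run_loader(path: str, num_proc: int = 1, **options: Any) -> Dict[str, Any]:
    """Stand-in for a loader that accepts num_proc plus arbitrary extra options."""
    return {"path": path, "num_proc": num_proc, **options}


def call_buggy(path: str, num_proc: int, extra_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    # If extra_kwargs also contains "num_proc", the keyword is supplied twice
    # and Python raises TypeError before run_loader's body ever runs.
    return run_loader(path, num_proc=num_proc, **extra_kwargs)


def call_fixed(path: str, num_proc: int, extra_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    # Drop the duplicate key from a copy so the explicit argument wins
    # and the caller's dict is left unmodified.
    cleaned = {k: v for k, v in extra_kwargs.items() if k != "num_proc"}
    return run_loader(path, num_proc=num_proc, **cleaned)


extra = {"num_proc": 8, "encoding": "utf-8"}

try:
    call_buggy("data.txt", 4, extra)
    error_message = ""
except TypeError as exc:
    error_message = str(exc)

fixed_result = call_fixed("data.txt", 4, extra)
```

Whether the explicit argument or the dict entry should take precedence is a design choice; the sketch lets the explicit argument win.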

Acknowledgement

  • @dubin555 helped improve the processing performance of some OPs and funcs. [#901] [#903]
  • @HunterLine helped fix a bug in the minhash deduplicator so that it traces all duplicate items. [#906]

New Contributors

All Contributors

@HYLcool @dubin555 @claude @Qirui-jiao @cmgzn @Cathy0908 @Dludora @yxdyc @gemini-code-assist @HunterLine @ext.wanghao204 @cyruszhang

Full Changelog: https://github.com/datajuicer/data-juicer/compare/v1.4.6...v1.5.0

Source: README.md, updated 2026-02-26