Data-Juicer v1.4.4

Major Updates

  • 🎉 NeurIPS 2025 news: our Data-Juicer 2.0 paper was accepted as a NeurIPS'25 Spotlight (top 3.1% of all submissions), and two of our other works were also accepted at NeurIPS'25. [#788]
  • 🧩 The sandbox component, data-juicer recipes, and data-juicer agents have been officially split from the main repository as data-juicer-sandbox/hub/agents respectively, to enable independent development and faster iteration. [#817] [#827] [#830]
  • 🤝 S3 I/O support: Added S3 support in data loader and exporter for seamless cloud storage integration. [#806]

New OPs

  • detect_main_character_mapper: Extract all main character names based on the given image and its caption. [#795]
  • detect_character_locations_mapper: Given an image and a list of main character names, extract the bounding boxes for each present character. (YOLOE + MLLM) [#795]
  • detect_character_attributes_mapper: Takes an image, a caption, and main character names as input to extract the characters' attributes. [#795]
  • vggt_mapper: Input a video of a single scene, and use VGGT to extract information including Camera Pose, Depth Maps, Point Maps, and 3D Point Tracks. [#804]
  • video_whole_body_pose_estimation_mapper: Input a video containing people, and use the DWPose model to extract the body, hand, feet, and face keypoints of the human subjects in the video, i.e., 2D Whole-body Pose Estimation. [#812]
  • video_hand_reconstruction_mapper: Use the WiLoR model for hand localization and reconstruction. [#818]

Enhancements

  • Enhanced documentation for operator details, significantly expanding coverage of effect demonstrations and usage examples, and improved homepage styling for better readability. [#778] [#819]
  • Added notebook detection and auto-redirect in logger setup for better user experience in Jupyter environments. [#790]
  • Optimized the build_op_doc hook for more reliable documentation generation. [#794]
  • Improved auto num_proc calculation in Ray mode for better resource utilization across operators. [#789] [#825]
  • Enabled support for video and audio in WebDataset I/O, expanding multimodal data handling capabilities. [#803]
  • Updated repository URLs and links across the project for consistency and correctness. [#805]
  • Added support for FFmpeg and Decord backends in video data processing, improving flexibility and performance. [#826] [#829]
  • Added an MCP server CLI entry point to facilitate modular service deployment, and updated the MCP documentation. [#798]
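
The notebook detection mentioned above [#790] can be illustrated with a common best-effort check. This is a minimal sketch, not Data-Juicer's actual logger code: the function names `running_in_notebook` and `pick_log_stream` are hypothetical, and since these notes do not say what exactly gets redirected, the choice between stdout and stderr is an assumption made purely for illustration.

```python
import sys


def running_in_notebook() -> bool:
    """Best-effort check for a Jupyter kernel (hypothetical helper)."""
    # Only consult IPython if the host process has already imported it;
    # a plain `python script.py` run never has, so we return False early.
    ipython_module = sys.modules.get("IPython")
    if ipython_module is None:
        return False
    shell = ipython_module.get_ipython()
    # Jupyter kernels run a ZMQInteractiveShell; terminal IPython uses
    # TerminalInteractiveShell, which we treat as "not a notebook".
    return shell is not None and type(shell).__name__ == "ZMQInteractiveShell"


def pick_log_stream():
    """Assumed policy: log to stdout in notebooks, stderr elsewhere."""
    return sys.stdout if running_in_notebook() else sys.stderr


log_stream = pick_log_stream()
```

The check relies on the shell class name, so it is a heuristic: other front ends may need additional cases.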

Fixed Bugs

  • Fixed the Auto Prompt pipeline in sandbox to restore correct prompt generation behavior. [#791]
  • Fixed a Ray connection error by properly passing the config parameter through resource utility functions. [#808]
  • Fixed several CUDA-based operators to use the internal resource monitor. [#809]
  • Fixed custom op module loading issues and optimized video_extract_frames_mapper for saving extracted frames. [#803]
  • Reset num_proc for vLLM and set default batch_size to 10 for CUDA operators to improve stability. [#814]
  • Fixed Sphinx autodoc compatibility issue in the SpecialTokens metaclass to restore documentation build. [#816]
  • Resolved a bug in trace_filter by excluding the __dj_stats__ column during dataset comparison. [#828]
  • Fixed several typos in video_split_by_scene_mapper. [#744]
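
To see why the trace_filter fix [#828] matters, consider comparing a dataset before and after a filter runs: a filter may attach a `__dj_stats__` column to the samples it keeps, so a naive equality check would report every sample as changed. The sketch below shows the idea on plain Python dicts. It is an illustration only, not the project's implementation; the helper names `strip_stats` and `dropped_samples` and the sample data are hypothetical.

```python
STATS_KEY = "__dj_stats__"  # column name taken from the release note


def strip_stats(sample: dict) -> dict:
    """Return a copy of the sample without the stats column."""
    return {k: v for k, v in sample.items() if k != STATS_KEY}


def dropped_samples(before: list[dict], after: list[dict]) -> list[dict]:
    """Samples from `before` whose non-stats content is absent from `after`."""
    kept = [strip_stats(s) for s in after]
    return [s for s in before if strip_stats(s) not in kept]


# Hypothetical data: the filter kept one sample and attached stats to it.
before = [{"text": "good sample"}, {"text": "bad"}]
after = [{"text": "good sample", STATS_KEY: {"text_len": 11}}]

filtered_out = dropped_samples(before, after)
```

With the stats column excluded, only `{"text": "bad"}` is reported as filtered out; without that exclusion, the kept sample would also look different because of its added stats.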

Acknowledgement

  • @kyo-tom helped fix the Ray connection error in [#808]
  • @liuyuhanalex helped fix several small typos in [#744]

Full Changelog: https://github.com/datajuicer/data-juicer/compare/v1.4.3...v1.4.4

Source: README.md, updated 2025-12-01