Data-Juicer - Browse /v1.4.4 at SourceForge.net

Name	Modified	Size	Downloads / Week
py_data_juicer-1.4.4-py3-none-any.whl	2025-12-01	1.9 MB	0
README.md	2025-12-01	3.6 kB	0
Release v1.4.4_ NeurIPS 2025 Spotlight_ New Video _ Multimodal Ops_ Repo Reorganization_ S3 I_O Support source code.tar.gz	2025-12-01	50.7 MB	0
Release v1.4.4_ NeurIPS 2025 Spotlight_ New Video _ Multimodal Ops_ Repo Reorganization_ S3 I_O Support source code.zip	2025-12-01	51.6 MB	0
Totals: 4 Items		104.2 MB	0

Major Updates

🎉 NeurIPS 2025 news: our Data-Juicer 2.0 paper was accepted as a NeurIPS'25 Spotlight (top 3.1% of all submissions), and two of our other works were also accepted at NeurIPS'25. [#788]
🧩 The sandbox component, data-juicer recipes, and data-juicer agents have been officially split from the main repository into data-juicer-sandbox, data-juicer-hub, and data-juicer-agents, respectively, to enable independent development and faster iteration. [#817] [#827] [#830]
🤝 S3 I/O support: Added S3 support in data loader and exporter for seamless cloud storage integration. [#806]
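
As a rough illustration of what S3-aware I/O has to do before any cloud call is made, the sketch below splits an `s3://` URI into a bucket and an object key using only the Python standard library. The helper name `split_s3_uri` and the example URI are hypothetical and are not taken from the Data-Juicer codebase; the actual loader and exporter may handle paths differently.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split 's3://bucket/some/key' into (bucket, key).

    Hypothetical helper for illustration only; not a Data-Juicer API.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The bucket is the network location; the key is the path without its leading slash.
    return parsed.netloc, parsed.path.lstrip("/")


if __name__ == "__main__":
    bucket, key = split_s3_uri("s3://my-bucket/datasets/demo.jsonl")
    print(bucket, key)
```

A local path such as `/data/demo.jsonl` has no `s3` scheme, so the helper raises `ValueError`, which gives a caller a simple way to choose between local and cloud code paths.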

New Operators

detect_main_character_mapper: Extract all main character names based on the given image and its caption. [#795]
detect_character_locations_mapper: Given an image and a list of main character names, extract the bounding boxes for each present character. (YOLOE + MLLM) [#795]
detect_character_attributes_mapper: Takes an image, a caption, and main character names as input to extract the characters' attributes. [#795]
vggt_mapper: Takes a video of a single scene as input and uses VGGT to extract information including camera pose, depth maps, point maps, and 3D point tracks. [#804]
video_whole_body_pose_estimation_mapper: Takes a video containing people as input and uses the DWPose model to extract the body, hand, foot, and face keypoints of the human subjects in the video, i.e., 2D whole-body pose estimation. [#812]
video_hand_reconstruction_mapper: Use the WiLoR model for hand localization and reconstruction. [#818]

Enhancements

Enhanced documentation for operator details, significantly expanding coverage of effect demonstrations and usage examples, and improved homepage styling for better readability. [#778] [#819]
Added notebook detection and auto-redirect in logger setup for better user experience in Jupyter environments. [#790]
Optimized the build_op_doc hook for more reliable documentation generation. [#794]
Improved auto num_proc calculation in Ray mode for better resource utilization across operators. [#789] [#825]
Enabled support for videos and audios in WebDataset I/O, expanding multimodal data handling capabilities. [#803]
Updated repository URLs and links across the project for consistency and correctness. [#805]
Added support for FFmpeg and Decord backends in video data processing, improving flexibility and performance. [#826] [#829]
Added an MCP server CLI entry point to facilitate modular service deployment and updated the MCP documentation. [#798]
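
The notebook detection mentioned above can be approximated with a widely used IPython check. This is a minimal sketch of that common pattern, assuming that a Jupyter kernel exposes a shell class named `ZMQInteractiveShell`; the function name `running_in_notebook` is hypothetical, and the check Data-Juicer actually performs in its logger setup may differ.

```python
def running_in_notebook() -> bool:
    """Return True when the code appears to run inside a Jupyter kernel.

    Hypothetical helper for illustration; not the Data-Juicer implementation.
    """
    try:
        from IPython import get_ipython
    except ImportError:
        # IPython is not installed, so this cannot be a notebook session.
        return False
    shell = get_ipython()  # None when running in a plain Python interpreter
    return shell is not None and shell.__class__.__name__ == "ZMQInteractiveShell"


if __name__ == "__main__":
    print(running_in_notebook())
```

A logger setup could call such a check once at start-up and pick an output target that renders well in notebook cells.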

Bug Fixes

Fixed the Auto Prompt pipeline in sandbox to restore correct prompt generation behavior. [#791]
Fixed a Ray connection error by properly passing the config parameter through resource utility functions. [#808]
Fixed several CUDA-based operators to use the internal resource monitor. [#809]
Fixed custom op module loading issues and optimized video_extract_frames_mapper for saving extracted frames. [#803]
Reset num_proc for vLLM and set default batch_size to 10 for CUDA operators to improve stability. [#814]
Fixed Sphinx autodoc compatibility issue in the SpecialTokens metaclass to restore documentation build. [#816]
Resolved a bug in trace_filter by excluding the __dj_stats__ column during dataset comparison. [#828]
Fixed several typos in video_split_by_scene_mapper. [#744]
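
To show why a stats column can disturb a before/after comparison, the sketch below compares two lists of samples while ignoring the `__dj_stats__` column. It is an illustration under stated assumptions, not the tracer's actual code: samples are modeled as plain dictionaries, and the helper names `strip_stats` and `removed_samples` are hypothetical.

```python
STATS_COLUMN = "__dj_stats__"  # column name as given in the release notes


def strip_stats(sample: dict) -> dict:
    """Return a copy of the sample without the stats column."""
    return {k: v for k, v in sample.items() if k != STATS_COLUMN}


def removed_samples(before: list, after: list) -> list:
    """List the samples present in `before` but missing from `after`.

    Hypothetical helper: the stats column is ignored on both sides, so a
    sample that only gained stats is not reported as removed.
    """
    kept = [strip_stats(s) for s in after]
    removed = []
    for sample in before:
        core = strip_stats(sample)
        if core in kept:
            kept.remove(core)  # consume one match so duplicates are counted correctly
        else:
            removed.append(core)
    return removed


if __name__ == "__main__":
    before = [{"text": "a"}, {"text": "b"}]
    after = [{"text": "a", STATS_COLUMN: {"text_len": 1}}]
    print(removed_samples(before, after))
```

Without the `strip_stats` step, the surviving sample `{"text": "a"}` would not match its counterpart that carries stats, and it would be wrongly reported as filtered out.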

Full Changelog: https://github.com/datajuicer/data-juicer/compare/v1.4.3...v1.4.4

Source: README.md, updated 2025-12-01