Name | Modified | Size | Downloads / Week |
---|---|---|---|
NVIDIA NeMo Curator 1.0.0 source code.tar.gz | 2025-10-01 | 3.4 MB | |
NVIDIA NeMo Curator 1.0.0 source code.zip | 2025-10-01 | 3.9 MB | |
README.md | 2025-10-01 | 10.8 kB | |
Totals: 3 Items | 7.3 MB | 4 |
This major release represents a fundamental architecture shift from Dask to Ray, expanding NeMo Curator to support multimodal data curation with new video and audio capabilities. This refactor enables unified backend processing, better heterogeneous computing support, and enhanced autoscaling for dynamic workloads.
Installation Updates
- New Docker container: Updated Docker infrastructure with CUDA 12.8.1 and Ubuntu 24.04 base, available from the NGC Catalog (`nvcr.io/nvidia/nemo-curator:25.09`)
- Dockerfile for building your own image: Simplified Dockerfile structure for custom container builds with FFmpeg support
- UV source installations: Integrated UV package manager (v0.8.22) for faster dependency management
- PyPI improvements: Enhanced PyPI installation with modular extras for targeted functionality:
Extra | Installation Command | Description |
---|---|---|
All Modalities | `nemo-curator[all]` | Complete installation with all modalities and GPU support |
Text Curation | `nemo-curator[text_cuda12]` | GPU-accelerated text processing with RAPIDS |
Image Curation | `nemo-curator[image_cuda12]` | Image processing with NVIDIA DALI |
Audio Curation | `nemo-curator[audio_cuda12]` | Speech recognition with NeMo ASR models |
Video Curation | `nemo-curator[video_cuda12]` | Video processing with GPU acceleration |
Basic GPU | `nemo-curator[cuda12]` | CUDA utilities without modality-specific dependencies |
All GPU installations require the NVIDIA PyPI index:
```bash
uv pip install --extra-index-url https://pypi.nvidia.com nemo-curator[EXTRA]
```
New Modalities
Video
NeMo Curator now supports comprehensive video data curation with distributed processing capabilities:
- Video splitting: Fixed-stride and scene-change detection (TransNetV2) for clip extraction
- Semantic deduplication: K-means clustering and pairwise similarity for near-duplicate clip removal
- Content filtering: Motion-based filtering and aesthetic filtering for quality improvement
- Embedding generation: InternVideo2 and Cosmos-Embed1 models for clip-level embeddings
- Ray-based distributed architecture: Scalable video processing with autoscaling support
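As a rough illustration of the fixed-stride splitting idea above, the sketch below computes clip boundaries for a video of known duration. It is not NeMo Curator's implementation: `ClipSpan`, `fixed_stride_spans`, and the `min_len` cutoff are hypothetical names and parameters introduced here, and the real stages also decode and transcode the clips.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClipSpan:
    """Start and end of one clip in seconds (hypothetical helper type)."""
    start: float
    end: float


def fixed_stride_spans(duration: float, clip_len: float, stride: float,
                       min_len: float = 0.0) -> list[ClipSpan]:
    """Cut [0, duration) into windows of clip_len seconds, stepping by stride.

    The final window is truncated at the end of the video and dropped if it
    is shorter than min_len (an assumed policy, chosen for this sketch).
    """
    if clip_len <= 0 or stride <= 0:
        raise ValueError("clip_len and stride must be positive")
    spans = []
    index = 0
    while index * stride < duration:
        # Multiply instead of accumulating to avoid floating-point drift.
        start = index * stride
        end = min(start + clip_len, duration)
        if end - start >= min_len:
            spans.append(ClipSpan(start, end))
        index += 1
    return spans


# A 25-second video cut into 10-second clips yields two full clips and a 5-second tail.
spans = fixed_stride_spans(duration=25.0, clip_len=10.0, stride=10.0, min_len=2.0)
for span in spans:
    print(span)
```

Setting `stride` smaller than `clip_len` produces overlapping clips; scene-change detection replaces the fixed grid with boundaries predicted by a model.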
Audio
New audio curation capabilities for speech data processing:
- ASR inference: Automatic speech recognition using NeMo Framework pretrained models
- Quality assessment: Word Error Rate (WER) and Character Error Rate (CER) calculation
- Speech metrics: Duration analysis and speech rate metrics (words/characters per second)
- Text integration: Seamless integration with text curation workflows via `AudioToDocumentStage`
- Manifest support: JSONL manifest format for audio file management
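For readers unfamiliar with the quality metrics listed above, here is a minimal pure-Python sketch of how WER, CER, and speech rate are conventionally defined. It uses the standard Levenshtein edit distance and does not call NeMo Curator; the function names are hypothetical, and text normalization (casing, punctuation) is assumed to have happened beforehand.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def words_per_second(text: str, duration_s: float) -> float:
    """Speech rate in words per second for a clip of duration_s seconds."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return len(text.split()) / duration_s


reference = "the cat sat on the mat"
hypothesis = "the cat sat on mat"   # the ASR output dropped one word
print(wer(reference, hypothesis))   # 1 error over 6 reference words
print(cer(reference, hypothesis))   # 4 deleted characters over 22
print(words_per_second(reference, duration_s=3.0))
```

A high WER against the reference transcript is a common signal for dropping or re-labeling a sample, which is why these metrics sit next to the filtering stages.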
Modality Refactors
Text
- Ray backend migration: Complete transition from Dask to Ray for distributed text processing
- Improved model-based classifier throughput: Better overlapping of compute between tokenization and inference through length-based sequence sorting for optimal GPU memory utilization
- Task-centric architecture: New `Task`-based processing model for finer-grained control
- Pipeline redesign: Updated `ProcessingStage` and `Pipeline` architecture with resource specification
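The length-based sorting mentioned above can be illustrated with a small sketch. Batches are padded to their longest sequence, so grouping sequences of similar length wastes fewer padded slots. The code below is an assumption-laden toy, not the classifier code in NeMo Curator: all function names are hypothetical, and token IDs are stand-in zeros.

```python
def make_batches(seqs: list[list[int]], batch_size: int) -> list[list[list[int]]]:
    """Split sequences into consecutive batches of at most batch_size."""
    return [seqs[i:i + batch_size] for i in range(0, len(seqs), batch_size)]


def padded_slots(batches: list[list[list[int]]]) -> int:
    """Token slots used when each batch is padded to its longest sequence."""
    return sum(len(batch) * max(len(seq) for seq in batch) for batch in batches)


def length_sorted_batches(seqs: list[list[int]], batch_size: int):
    """Sort by length before batching; also return the permutation used."""
    order = sorted(range(len(seqs)), key=lambda i: len(seqs[i]))
    return make_batches([seqs[i] for i in order], batch_size), order


def restore_order(results: list, order: list[int]) -> list:
    """Map per-sequence results from sorted order back to input order."""
    restored = [None] * len(order)
    for position, original_index in enumerate(order):
        restored[original_index] = results[position]
    return restored


# Four tokenized documents with lengths 2, 9, 3, and 8.
seqs = [[0] * n for n in (2, 9, 3, 8)]
naive = make_batches(seqs, batch_size=2)
sorted_batches, order = length_sorted_batches(seqs, batch_size=2)
print(padded_slots(naive), padded_slots(sorted_batches))

# Stand-in for inference: one result per sequence, produced in sorted order.
results = [len(seq) for batch in sorted_batches for seq in batch]
print(restore_order(results, order))
```

Restoring the original order afterwards matters because downstream stages expect predictions to line up with the input documents.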
Image
- Pipeline-based architecture: Transitioned from legacy `ImageTextPairDataset` to modern stage-based processing with `ImageReaderStage`, `ImageEmbeddingStage`, and filter stages
- DALI-based image loading: New `ImageReaderStage` uses NVIDIA DALI for high-performance WebDataset tar shard processing with GPU/CPU fallback
- Modular processing stages: Separate stages for embedding generation, aesthetic filtering, and NSFW filtering
- Task-based data flow: Images processed as `ImageBatch` tasks containing `ImageObject` instances with metadata, embeddings, and classification scores
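To make the task-based data flow above more concrete, the sketch below models a batch of image records that accumulate scores as they pass through stages, followed by a threshold filter. The `Sketch*` classes and their fields are hypothetical stand-ins written for this example; they are not the real `ImageBatch` and `ImageObject` definitions, whose attributes may differ.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SketchImageObject:
    """One image plus what earlier stages attached to it (assumed fields)."""
    image_id: str
    metadata: dict = field(default_factory=dict)
    embedding: Optional[list] = None
    aesthetic_score: Optional[float] = None
    nsfw_score: Optional[float] = None


@dataclass
class SketchImageBatch:
    """A task: a group of images that moves through the pipeline together."""
    task_id: str
    images: list


def filter_by_aesthetic(batch: SketchImageBatch, min_score: float) -> SketchImageBatch:
    """Keep images whose aesthetic score is present and at least min_score."""
    kept = [img for img in batch.images
            if img.aesthetic_score is not None and img.aesthetic_score >= min_score]
    return SketchImageBatch(task_id=batch.task_id, images=kept)


batch = SketchImageBatch(
    task_id="shard-0000",
    images=[
        SketchImageObject("img-a", aesthetic_score=0.81),
        SketchImageObject("img-b", aesthetic_score=0.32),
        SketchImageObject("img-c"),  # not scored yet, so it is dropped
    ],
)
filtered = filter_by_aesthetic(batch, min_score=0.5)
print([img.image_id for img in filtered.images])
```

Returning a new batch rather than mutating the input keeps each stage a plain function from task to task, which is what lets an executor schedule stages independently.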
Learn more about image curation.
Deduplication Improvements
Enhanced deduplication capabilities across all modalities with improved performance and flexibility:
- Exact and Fuzzy deduplication: Updated rapidsmpf-based shuffle backend for more efficient GPU-to-GPU data transfer and better spilling capabilities
- Semantic deduplication: Support for deduplicating text, image, and video datasets using unified embedding-based workflows
- New ranking strategies: Added `RankingStrategy`, which lets you rank elements within each cluster to decide which point to prioritize during duplicate removal. It supports metadata-based ranking to prioritize specific datasets or inputs
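One way to picture metadata-based ranking during duplicate removal is the sketch below: records already assigned to duplicate groups are ranked by chosen metadata fields, and the top-ranked record in each group survives. `keep_top_ranked` and the record fields are hypothetical, and this does not reproduce the actual `RankingStrategy` interface.

```python
from collections import defaultdict


def keep_top_ranked(records: list, group_key: str, rank_keys: list,
                    descending: bool = True) -> list:
    """Return one survivor per duplicate group, chosen by metadata ranking.

    records    -- dicts that each carry group_key and every key in rank_keys
    rank_keys  -- metadata fields compared in order (earlier keys dominate)
    """
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record)

    survivors = []
    for group_id in sorted(groups):
        # sorted() is stable even with reverse=True, so ties keep input order.
        ranked = sorted(groups[group_id],
                        key=lambda r: tuple(r[k] for k in rank_keys),
                        reverse=descending)
        survivors.append(ranked[0])
    return survivors


records = [
    {"id": "a", "cluster": 0, "source_priority": 1},
    {"id": "b", "cluster": 0, "source_priority": 3},
    {"id": "c", "cluster": 1, "source_priority": 2},
    {"id": "d", "cluster": 1, "source_priority": 2},
]
survivors = keep_top_ranked(records, group_key="cluster", rank_keys=["source_priority"])
print([r["id"] for r in survivors])
```

Here a higher `source_priority` stands for a more trusted dataset, so near-duplicates from lower-priority sources are the ones removed.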
Core Refactors
The architecture refactor introduces a layered system with unified interfaces and multiple execution backends:
User Layer: Pipeline → ProcessingStage X→Y → ProcessingStage Y→Z → ProcessingStage Z→W
↓
Orchestration Layer: BaseExecutor Interface
↓
Backend Layer: XennaExecutor (Production Ready) | RayActorPoolExecutor (Experimental) | RayDataExecutor (Experimental)
↓
Adaptation Layer: Xenna Adapter | Ray Actor Adapter | Ray Data Adapter
↓
Execution Layer: Cosmos-Xenna (Streaming/Batch) | Ray Actor Pool (Load Balancing) | Ray Data API (Dataset Processing)
Pipelines
- New Pipeline API: Ray-based pipeline execution with `BaseExecutor` interface
- Multiple backends: Support for Xenna, Ray Actor Pool, and Ray Data execution backends
- Resource specification: Configurable CPU and GPU memory requirements per stage
- Stage composition: Improved stage validation and execution orchestration
Stages
- ProcessingStage redesign: Generic `ProcessingStage[X, Y]` base class with type safety
- Resource requirements: Built-in resource specification for CPU and GPU memory
- Backend adapters: Stage adaptation layer for different Ray orchestration systems
- Input/output validation: Enhanced type checking and data validation
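The layering described in this section (typed stages, per-stage resources, and a swappable executor) can be sketched in plain Python as follows. Every name here (`SketchStage`, `SketchResources`, `SketchExecutor`, `SequentialExecutor`) is a hypothetical stand-in, and the in-process executor only marks where a Ray-backed backend would plug in; the real `ProcessingStage` and `BaseExecutor` signatures are not reproduced.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Generic, TypeVar

X = TypeVar("X")
Y = TypeVar("Y")


@dataclass(frozen=True)
class SketchResources:
    """Per-stage resource request (assumed fields)."""
    cpus: float = 1.0
    gpu_memory_gb: float = 0.0


class SketchStage(ABC, Generic[X, Y]):
    """A stage that turns a task of type X into a task of type Y."""
    input_type: type = object
    output_type: type = object
    resources = SketchResources()

    @abstractmethod
    def process(self, task: X) -> Y: ...


class SplitWords(SketchStage[str, list]):
    input_type, output_type = str, list

    def process(self, task: str) -> list:
        return task.split()


class CountItems(SketchStage[list, int]):
    input_type, output_type = list, int
    resources = SketchResources(cpus=0.5)

    def process(self, task: list) -> int:
        return len(task)


def validate_chain(stages: list) -> None:
    """Check that each stage's output type feeds the next stage's input type."""
    for upstream, downstream in zip(stages, stages[1:]):
        if not issubclass(upstream.output_type, downstream.input_type):
            raise TypeError(f"{type(upstream).__name__} -> {type(downstream).__name__} "
                            "has mismatched task types")


class SketchExecutor(ABC):
    """Backend-agnostic interface; a distributed backend would subclass this."""
    @abstractmethod
    def execute(self, stages: list, tasks: list) -> list: ...


class SequentialExecutor(SketchExecutor):
    """Runs every stage in-process, one after another (for illustration only)."""
    def execute(self, stages: list, tasks: list) -> list:
        validate_chain(stages)
        for stage in stages:
            tasks = [stage.process(task) for task in tasks]
        return tasks


counts = SequentialExecutor().execute([SplitWords(), CountItems()], ["a b c", "d e"])
print(counts)
```

Because the pipeline only talks to the executor interface, the same stage list could in principle be handed to a different backend without changing the stages themselves.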
Tutorials
- Text tutorials: Updated all text curation tutorials to use new Ray-based API
- Image tutorials: Migrated image processing tutorials to unified backend
- Audio tutorials: New audio curation tutorials
- Video tutorials: New video processing tutorials
For all tutorial content, refer to the tutorials directory in the NeMo Curator GitHub repository.
Known Limitations
(Pending Refactor in Future Release)
Generation
- Synthetic data generation: Synthetic text generation features are being refactored for Ray compatibility
- Hard negative mining: Retrieval-based data generation workflows under development
PII
- PII processing: Personally Identifiable Information removal tools are being updated for Ray backend
- Privacy workflows: Enhanced privacy-preserving data curation capabilities in development
Blending & Shuffling
- Data blending: Multi-source dataset blending functionality being refactored
- Dataset shuffling: Large-scale data shuffling operations under development
Docs Refactor
- Local preview capability: Improved documentation build system with local preview support
- Modality-specific guides: Comprehensive documentation for each supported modality (text, image, audio, video)
- API reference: Complete API documentation with type annotations and examples
What's Next
The next release will focus on completing the refactor of Generation, PII, and Blending & Shuffling features, along with additional performance optimizations and new modality support.