Name | Modified | Size | Downloads / Week |
---|---|---|---|
NVIDIA NeMo Curator 0.9.0 source code.tar.gz | 2025-07-28 | 5.8 MB | |
NVIDIA NeMo Curator 0.9.0 source code.zip | 2025-07-28 | 6.2 MB | |
README.md | 2025-07-28 | 1.4 kB | |
Totals: 3 Items | | 12.0 MB | 1 |
Major Features and Enhancements
- New How-to Data Recipes (Tutorials)
  - Multimodal DAPT Curation w/ PDF Extraction
  - Llama Nemotron Data Curation
  - LLM NIM - PII Redaction
- Performance and Code Optimizations
  - Simplified Clustering Logic: Significantly improved semantic deduplication clustering performance
    - Removed convoluted backend switching logic that caused performance issues
    - Eliminated expensive length assertions that could cause timeouts on large datasets
    - Improved GPU utilization during KMeans clustering operations
    - Tested on 37M embedding dataset (80GB) across 7 GPUs with substantial performance gains
Bug Fixes
- FastText Download URL Fix
  - Corrected the `fasttext` model download URL in the nemotron-cc tutorial
  - Changed from `dl.fbaipublicfiles.com/fastText/` to `dl.fbaipublicfiles.com/fasttext/`
  - Ensures reliable model downloads for language identification
- NeMo Retriever Tutorial Bug Fix
  - Fixed lambda function bug in `RetrieverEvalSetGenerator`
  - Corrected score assignment from `df["question"].apply(lambda: 1)` to `df["score"] = 1`
- API Usage Updates
  - Updated examples and tutorials to use the correct `DocumentDataset` API
  - Replaced deprecated `write_to_disk(result, output_dir, output_type="parquet")` with `result.to_parquet(output_dir)`
  - Updated exact deduplication workflows: `deduplicator.remove()` now returns a `DocumentDataset` directly
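
The score-assignment fix in the NeMo Retriever tutorial can be reproduced with plain pandas. The sketch below is illustrative only: the DataFrame contents are made up, and it does not use `RetrieverEvalSetGenerator` itself. It shows why a zero-argument lambda fails under `Series.apply` and how a scalar assignment gives every row the same score.

```python
import pandas as pd

# Hypothetical stand-in for the tutorial's generated-question DataFrame.
df = pd.DataFrame({"question": ["What is NeMo Curator?", "What is DAPT?"]})

# Buggy pattern: Series.apply passes each element to the function,
# but `lambda: 1` accepts no arguments, so the call raises TypeError.
try:
    df["question"].apply(lambda: 1)
    apply_failed = False
except TypeError:
    apply_failed = True

# Fixed pattern: assigning a scalar broadcasts it to every row.
df["score"] = 1
```

The scalar assignment also avoids a per-row Python function call, which `apply` would need even with a correctly written lambda.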
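
The API usage updates above might look roughly like this in an exact deduplication workflow. Treat it as a sketch, not the tutorials' actual code: the function name `exact_dedup_to_parquet`, the `id`/`text` field names, and the constructor arguments are assumptions, and signatures may differ between NeMo Curator versions. It also assumes a Dask client has already been started.

```python
def exact_dedup_to_parquet(input_dir: str, output_dir: str, cache_dir: str) -> None:
    """Remove exact duplicates from a Parquet dataset and write the result as Parquet."""
    # Imported lazily so the function can be defined without NeMo Curator installed.
    from nemo_curator import ExactDuplicates
    from nemo_curator.datasets import DocumentDataset

    dataset = DocumentDataset.read_parquet(input_dir)

    # Assumed constructor arguments; adjust the field names to match your data.
    deduplicator = ExactDuplicates(id_field="id", text_field="text", cache_dir=cache_dir)

    duplicates = deduplicator.identify_duplicates(dataset)

    # Per the release notes, remove() now returns a DocumentDataset directly.
    result = deduplicator.remove(dataset, duplicates)

    # Replaces the deprecated write_to_disk(result, output_dir, output_type="parquet").
    result.to_parquet(output_dir)
```

Because `remove()` returns a `DocumentDataset`, the result can be written with `to_parquet` without any intermediate conversion step.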