NeMo Curator - Browse /v0.9.0 at SourceForge.net

Name	Modified	Size	Downloads / Week
NVIDIA NeMo Curator 0.9.0 source code.tar.gz	2025-07-28	5.8 MB	0
NVIDIA NeMo Curator 0.9.0 source code.zip	2025-07-28	6.2 MB	1
README.md	2025-07-28	1.4 kB	0
Totals: 3 Items		12.0 MB	1

New How-to Data Recipes (Tutorials)
- Multimodal DAPT Curation w/ PDF Extraction
- Llama Nemotron Data Curation
- LLM NIM - PII Redaction

Performance and Code Optimizations
- Simplified Clustering Logic: significantly improved semantic deduplication clustering performance
  - Removed convoluted backend switching logic that caused performance issues
  - Eliminated expensive length assertions that could cause timeouts on large datasets
  - Improved GPU utilization during KMeans clustering operations
  - Tested on a 37M-embedding dataset (80 GB) across 7 GPUs with substantial performance gains

FastText Download URL Fix
- Corrected the fastText model download URL in the nemotron-cc tutorial
- Changed the URL path from dl.fbaipublicfiles.com/fastText/ to dl.fbaipublicfiles.com/fasttext/ (the path is case-sensitive)
- Ensures reliable model downloads for language identification
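
The path change above can be sketched as a small helper for anyone with the old URL hard-coded in their own scripts. This is an illustration only, not code from the tutorial: the function name `fix_fasttext_url` and the file name `example-model.bin` are hypothetical, and only the two path prefixes come from the release notes.

```python
from urllib.parse import urlsplit, urlunsplit

# The two prefixes come from the release notes; everything else is illustrative.
HOST = "dl.fbaipublicfiles.com"
OLD_PREFIX = "/fastText/"
NEW_PREFIX = "/fasttext/"


def fix_fasttext_url(url: str) -> str:
    """Rewrite the outdated mixed-case path prefix to the lowercase one."""
    parts = urlsplit(url)
    # Host names are case-insensitive, URL paths are not, so only the
    # path prefix is compared with exact case.
    if parts.netloc.lower() == HOST and parts.path.startswith(OLD_PREFIX):
        new_path = NEW_PREFIX + parts.path[len(OLD_PREFIX):]
        parts = parts._replace(path=new_path)
    return urlunsplit(parts)


# "example-model.bin" is a hypothetical file name used only for this demo.
old_url = "https://dl.fbaipublicfiles.com/fastText/example-model.bin"
new_url = fix_fasttext_url(old_url)
```

URLs on other hosts, or URLs that already use the lowercase prefix, pass through unchanged.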

NeMo Retriever Tutorial Bug Fix
- Fixed a lambda function bug in RetrieverEvalSetGenerator
- Corrected the score assignment from df["question"].apply(lambda: 1) to df["score"] = 1
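
The lambda bug noted above can be reproduced with plain pandas. The sketch below is a minimal illustration, not the RetrieverEvalSetGenerator code itself: the sample DataFrame is invented, and it assumes the original line assigned the result of `apply` to the `score` column.

```python
import pandas as pd

# Invented sample data; only the column names follow the release notes.
df = pd.DataFrame(
    {"question": ["What is NeMo Curator?", "What does deduplication do?"]}
)

# Buggy pattern: the lambda accepts no arguments, but Series.apply passes
# each element to it, so the call fails with a TypeError.
try:
    df["score"] = df["question"].apply(lambda: 1)
    apply_failed = False
except TypeError:
    apply_failed = True

# Fixed pattern: assigning a scalar broadcasts the constant to every row.
df["score"] = 1
```

Assigning the scalar directly also avoids a per-row Python function call, which `apply(lambda _: 1)` would still incur.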

API Usage Updates
- Updated examples and tutorials to use the correct DocumentDataset API
- Replaced the deprecated write_to_disk(result, output_dir, output_type="parquet") with result.to_parquet(output_dir)
- Updated exact deduplication workflows: deduplicator.remove() now returns a DocumentDataset directly

Source: README.md, updated 2025-07-28
