cc_net provides tools to download, segment, clean, and filter Common Crawl in order to build large-scale text corpora, including monolingual datasets and the multilingual CC-100 collection introduced in the associated paper. Its pipelines fetch snapshots, extract text, de-duplicate paragraphs, identify languages, and apply quality filtering based on heuristics and language models. The outputs are intended for pretraining language models and for building standardized corpora that can be reproduced or updated with new crawls.

The repository documents practical concerns such as HTTP failures, differences between snapshots, and the statistics JSON files it emits, reflecting community use across many languages. Note that the repository has been archived and is now read-only, so expect to run it as-is or fork it for maintenance. Even in its archived state, the issues and releases pages remain useful references for implementation details and dataset lineage.
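The de-duplication step works by hashing normalized paragraphs and dropping any paragraph whose hash has been seen before. The sketch below illustrates the general technique only; the function names and the 8-byte SHA-1 truncation are illustrative assumptions, not cc_net's actual API.

```python
import hashlib


def normalize(paragraph: str) -> str:
    # Lowercase and collapse whitespace so trivially different copies
    # of the same paragraph hash to the same value.
    return " ".join(paragraph.lower().split())


def dedup_paragraphs(docs, seen=None):
    """Yield each document with previously seen paragraphs removed.

    `seen` can be shared across shards to de-duplicate a whole snapshot.
    Truncated hashes keep the set small at a tiny collision risk.
    """
    seen = set() if seen is None else seen
    for doc in docs:
        kept = []
        for para in doc.split("\n"):
            h = hashlib.sha1(normalize(para).encode("utf-8")).digest()[:8]
            if h not in seen:
                seen.add(h)
                kept.append(para)
        if kept:
            yield "\n".join(kept)


docs = ["Hello world.\nUnique text.", "hello   WORLD.\nMore text."]
print(list(dedup_paragraphs(docs)))
# → ['Hello world.\nUnique text.', 'More text.']
```

Hashing paragraphs rather than whole documents is what lets the pipeline strip boilerplate (navigation bars, cookie notices) that repeats across otherwise distinct pages.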
Features
- End-to-end Common Crawl download and extraction
- Language identification and monolingual segmentation
- Quality filtering and de-duplication pipelines
- Support for building multilingual datasets like CC-100
- Reproducible statistics and corpus metadata outputs
- Scripts and configs for snapshot-by-snapshot processing
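The language-model quality filtering above assigns each document a perplexity score against a model trained on clean reference text, then splits the corpus into head/middle/tail buckets per language. A minimal sketch of the bucketing step, assuming hypothetical per-language percentile cutoffs (the real cutoffs are computed from the perplexity distribution over a snapshot):

```python
def bucket(perplexity: float, cutoff_head: float, cutoff_tail: float) -> str:
    """Assign a quality bucket from a language-model perplexity score.

    Lower perplexity means the document looks more like the clean
    reference text the LM was trained on, so it lands in "head".
    """
    if perplexity <= cutoff_head:
        return "head"
    if perplexity <= cutoff_tail:
        return "middle"
    return "tail"


# Hypothetical cutoffs for a single language.
print(bucket(120.0, 300.0, 900.0))   # → head
print(bucket(1500.0, 300.0, 900.0))  # → tail
```

Keeping all three buckets rather than discarding the tail lets downstream users trade corpus size against quality, which is how the CC-100 splits are typically consumed.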