Showing 41 open source projects for "cleaning"

  • 1
    AI Data Science Team

    An AI-powered data science team of agents

    AI Data Science Team is a Python library and agent ecosystem designed to accelerate and automate common data science workflows by modeling them as specialized AI “agents” that can be orchestrated to perform tasks like data cleaning, transformation, analysis, visualization, and machine learning. It provides a modular agent framework where each agent focuses on a step in the typical data science pipeline — for example, loading data from CSV/Excel files, cleaning and wrangling messy datasets, engineering predictive features, building models with AutoML, connecting to SQL databases, and producing visual outputs — all driven by natural language or programmatic instructions. ...
    Downloads: 3 This Week
  • 2
    nb-clean

    Clean Jupyter notebooks of outputs, metadata, and empty cells

    ...Note that the Git filter and pre-commit hook work differently, with different effects on your working directory. The pre-commit hook operates on the notebook on disk, cleaning the copy in your working directory. The Git filter cleans notebooks as they are added to the index, leaving the copy in your working directory dirty.
    Downloads: 0 This Week
  • 3
    labelme Image Polygonal Annotation

    Image polygonal annotation with Python

    ...It is written in Python and uses Qt for its graphical interface. It supports image annotation with polygons, rectangles, circles, lines, and points; image flag annotation for classification and cleaning; video annotation; GUI customization (predefined labels and flags, auto-saving, label validation, etc.); and export of VOC-format datasets for semantic and instance segmentation and COCO-format datasets for instance segmentation. The first time you run labelme, it will create a config file in ~/.labelmerc. ...
    Downloads: 13 This Week
  • 4
    ExtractThinker

    ExtractThinker is a Document Intelligence library for LLMs

    ExtractThinker is a tool designed to facilitate the extraction and analysis of information from various data sources, aiding in data processing and knowledge discovery.
    Downloads: 0 This Week
  • 5
    DOLMA

    Data and tools for generating and inspecting OLMo pre-training data

    Dolma ("Data to feed OLMo's Appetite") is an open dataset and toolkit for building and managing large-scale pre-training data for language models.
    Downloads: 0 This Week
  • 6
    The Data Engineering Handbook

    Links to everything you'd ever want to learn about data engineering

    ...Rather than being a code project itself, it’s a learning handbook that links to books, articles, tutorials, community groups, boot camps, and real-world project examples that collectively form a roadmap to mastering data engineering skills. It includes beginner and intermediate boot camps, interview guides, data cleaning and transformation resources, and curated lists of newsletters and industry communities, making it useful both for self-study and technical interview preparation. The repository is actively maintained and widely starred, reflecting its role as a go-to reference for newcomers and experienced practitioners alike.
    Downloads: 3 This Week
  • 7
    Open Interpreter

    A natural language interface for computers

    Open Interpreter is an open-source tool that provides a natural-language interface for interacting with your computer. It lets large language models (LLMs) run code locally (Python, JavaScript, shell, etc.), enabling you to ask your computer to do tasks like data analysis, file manipulation, browsing, etc. in human terms (“chat with your computer”), with safeguards. Runs locally or via configured remote LLM servers/inference backends, giving flexibility to use models you trust or have...
    Downloads: 20 This Week
  • 8
    All-in-RAG

    Large Model Application Development in Practice, Part 1

    All-in-RAG is an open-source educational project designed to teach developers how to build applications using retrieval-augmented generation techniques. The repository provides a structured learning path that covers both theoretical foundations and practical implementation steps for RAG systems. It explains the full development pipeline required to create knowledge-aware AI assistants, including data preparation, document indexing, vector embedding generation, and retrieval strategies. The...
    Downloads: 0 This Week
  • 9
    Agentic Data Scientist

    An end-to-end Data Scientist

    Agentic Data Scientist is an experimental AI-driven research framework that orchestrates data science workflows through autonomous agents that can reason, plan, and execute complex analytics tasks. Unlike traditional scripted pipelines, this project lets AI agents break down high-level research goals into sub-tasks such as data acquisition, cleaning, modeling, evaluation, and reporting, with minimal human direction. Each agent is designed to independently call functions, interact with data sources, and adapt to uncertainties during processing, enabling iterative refinement of models without manual coordination. The framework supports interoperability with existing data tools and libraries, letting the agents leverage libraries like pandas, scikit-learn, and visualization frameworks to perform real computations rather than mock demonstrations.
    Downloads: 0 This Week
  • 10
    NeMo Curator

    Scalable data preprocessing and curation toolkit for LLMs

    NeMo Curator is a Python library designed for fast and scalable dataset preparation and curation for large language model (LLM) use cases such as foundation model pretraining, domain-adaptive pretraining (DAPT), supervised fine-tuning (SFT), and parameter-efficient fine-tuning (PEFT). It accelerates data curation by leveraging GPUs with Dask and RAPIDS, resulting in significant time savings. The library provides a customizable and modular interface, simplifying pipeline...
    Downloads: 0 This Week
  • 11
    PandasAI

    PandasAI is a Python library that integrates generative AI

    PandasAI is a Python library that adds generative AI capabilities to pandas, the popular data analysis and manipulation tool. It is designed to be used in conjunction with pandas, and is not a replacement for it. PandasAI makes pandas (and the most widely used data analysis libraries) conversational, allowing you to ask questions of your data in natural language. For example, you can ask PandasAI to find all the rows in a DataFrame where the value of a column is greater than 5, and it will...
    Downloads: 0 This Week
  • 12
    Perfect Pixel

    Refine and quantize messy AI pixel art into clean, perfect pixels

    ...This makes it useful for game developers, sprite artists, and hobbyists who want to use AI-assisted ideation without shipping “almost pixel art” assets. It is designed to be easy to slot into an existing pipeline, whether you are batch-processing images or cleaning up a few key assets for a project.
    Downloads: 0 This Week
  • 13
    ALSC - Advanced Linux System Cleaner

    Simplify the maintenance and cleaning of Linux systems.

    This program was developed to facilitate maintenance and cleaning of Linux systems.
    Downloads: 3 This Week
  • 14
    sketch

    AI code-writing assistant that understands data content

    ...The tool integrates directly into pandas dataframes through an extension, making it easy to use within existing Python workflows without requiring additional IDE plugins. Sketch supports a variety of tasks including data cleaning, feature engineering, visualization, and exploratory analysis, all driven by simple natural language prompts. It also includes advanced capabilities for generating structured outputs and applying transformations directly to datasets, reducing the need for manual coding.
    Downloads: 3 This Week
  • 15
    Temp_Cleaner GUI

    A free and open-source program to free up disk space

    While most of us tend to ignore them, browser history, cookies, and cache take up quite a lot of disk space. Deleting them not only helps you reclaim storage space but can also speed up the PC. Temp_Cleaner GUI is a simple and straightforward utility that enables you to clean your Windows-based computer of junk and obsolete files. The app comes with a single-window interface packed with a long list of options. As you have probably guessed, all you need to do is...
    Downloads: 2,274 This Week
  • 16
    Streamline Analyst

    AI agent that streamlines the entire process of data analysis

    Streamline Analyst is an open-source application powered by Large Language Models (LLMs) and designed to automate data analysis. This Data Analysis Agent automates tasks such as data cleaning and preprocessing, as well as more complex operations like identifying target objects, partitioning test sets, and selecting the best-fit models based on your data. It also handles visualization and evaluation of results.
    Downloads: 0 This Week
  • 17
    CMD Utilities
    This tool provides commands that can be called from the Windows Command Prompt, such as cleaning temporary files or defragmenting disks. It can also install and uninstall some browsers through CMD.
    Downloads: 7 This Week
  • 18
    Temporary File Cleaner
    ...Some files cannot be completely deleted, such as those used in real-time (for example, by browser extensions). The deleted temporary files can be both user files and Windows files (the Automatic Cleaning will delete both).
    Downloads: 4 This Week
  • 19
    transmission_cleanup

    Clean up torrent files using the RPC protocol

    This application connects to the Transmission web client through its RPC interface. It lets the user set the initial download folder for torrents and sorts files into their own folders based on file type. It also allows the cleaning process to be scheduled daily or weekly at a time you set during installation. You supply your username and password for the RPC web interface, which are encrypted by the application and saved to disk. The application checks whether a torrent is completed and has finished seeding, then sorts the files into the correct folder, e.g. a video into the media folder, an MP3 into the music folder, etc. ...
    Downloads: 0 This Week
  • 20
    Data Preprocessing Automate

    Data Preprocessing Automation: A GUI for easy data cleaning & visualiz

    Data Preprocessing Automation is a Python-based GUI application designed to simplify and automate data preprocessing tasks. It allows users to upload Excel files, automatically handle missing values, remove duplicates, and detect and remove outliers using statistical methods. The application provides data visualization tools, including box plots for distribution analysis and scatter plots for exploring relationships between variables. Users can download the processed data for further...
    Downloads: 0 This Week
  • 21
    text-dedup

    All-in-one text de-duplication

    ...This is especially useful for NLP tasks where duplicated training data can skew model performance. text-dedup scales to billions of documents and offers tools for chunking, hashing, and comparing text efficiently with low memory usage. It supports Jaccard similarity thresholding, parallel execution, and flexible deduplication strategies, making it ideal for cleaning web-scraped data, language model training datasets, or document archives.
    Downloads: 0 This Week
  • 22
    paramspider

    Mine parameterized URLs from web archives for security testing

    ...These endpoints are commonly used during reconnaissance because parameters often expose inputs that may be vulnerable to issues like cross-site scripting, SQL injection, or server-side request forgery. ParamSpider automates the process of retrieving archived URLs, cleaning them, and preparing them for fuzzing or further probing. It can process a single domain or multiple domains from a list, making it useful for both targeted testing and large-scale reconnaissance.
    Downloads: 2 This Week
  • 23
    funNLP

    Resources, corpora, and tools for Chinese natural language processing

    ...The repository is organized into categories such as sentiment analysis, text classification, named entity recognition, knowledge graphs, and various lexicons (e.g. sensitive words, emotion dictionaries, stopwords). It also includes links to academic papers, open-source model implementations, and practical utilities like word segmentation or text cleaning scripts. The project is highly community-oriented, frequently updated with contributions and new resources, and it’s widely used in both academic and applied NLP research. Its value lies in providing not just tools but also curated, domain-specific data, which can be hard to find elsewhere.
    Downloads: 0 This Week
  • 24
    VAMS

    Virtual Assistant Maintenance System

    Virtual Assistant Maintenance System, also known as VAMS, is an AI software application that helps users with some computer maintenance issues. Application requirements: Operating system: Windows 8.1/10/11; Processor: Intel Core i5 or equivalent; RAM: 4 GB or higher; Free disk space: 500 MB.
    Downloads: 4 This Week
  • 25
    SageMaker Experiments Python SDK

    Experiment tracking and metric logging for Amazon SageMaker notebooks

    ...There is no relationship, such as ordering, between Trial Components. Trial Component: A description of a single step in a machine learning workflow, for example data cleaning, feature extraction, model training, or model evaluation.
    Downloads: 0 This Week