dataset free download

Showing 730 open source projects for "dataset"

1

The Unsplash Dataset

Unsplash images made available for research and machine learning

The Unsplash Dataset draws on images from more than 350,000 contributing photographers worldwide and on data sourced from hundreds of millions of searches across a nearly unlimited number of uses and contexts. Due to the breadth of intent and semantics contained within the Unsplash dataset, it enables new opportunities for research and learning.

Downloads: 5 This Week

Last Update: 2026-06-12
See Project
2

Easy DataSet

A powerful tool for creating datasets for LLM fine-tuning

...The system includes automated question-generation capabilities, hierarchical label trees, and answer generation pipelines that use LLM APIs to produce coherent paired data with customizable templates. Beyond dataset creation, Easy-dataset also provides a built-in evaluation system with model testing and blind-test features, helping teams validate model performance using curated test sets.

Downloads: 4 This Week

Last Update: 2026-04-10
See Project
3

Mathematics Dataset

This dataset code generates mathematical question and answer pairs

The Mathematics Dataset, developed by Google DeepMind, is a synthetic dataset designed to evaluate and train machine learning models on mathematical reasoning and symbolic manipulation. It generates question-and-answer pairs across a wide range of mathematical topics typically found in school-level curricula, testing a model’s ability to reason about algebra, arithmetic, calculus, probability, and more.

Downloads: 3 This Week

Last Update: 2026-06-13
See Project
4

The Hypersim Dataset

Photorealistic Synthetic Dataset for Holistic Indoor Scene

Hypersim is a large-scale, photorealistic synthetic dataset and tooling suite for indoor scene understanding research. It provides richly annotated renderings—RGB, depth, surface normals, instance and semantic segmentations, and material/lighting metadata—produced from high-fidelity virtual environments. The dataset spans diverse furniture layouts, room types, and camera trajectories, enabling robust training for geometry, segmentation, and SLAM-adjacent tasks.

Downloads: 0 This Week

Last Update: 2026-01-09
See Project
5

DataSet Serialize for Delphi and Lazarus

JSON to DataSet and DataSet to JSON converter for Delphi and Lazarus

DataSet Serialize is a set of features that makes working with JSON and DataSet simple. It can export or import records into a DataSet, validate that JSON has all required attributes (previously entered in the DataSet), and export or import the structure of DataSet fields in JSON format. It also manages nested JSON through master-detail or TDataSetField (you choose the way that suits you best).

Downloads: 0 This Week

Last Update: 2026-05-04
See Project
6

Passport Index Dataset

Passport Index 2023: visa requirements for 199 countries, in .csv

There are 6 datasets with identical visa requirements data. Three datasets are matrix and three are long (tidy) formats. Each comes in 3 versions: with country codes as specified in ISO-2 (two-letter codes), ISO-3 (three-letter codes), and full country names from no particular standard. In distance matrices (files with matrix in the filename), the first column represents a passport (=from), each remaining column represents a destination (=to). Files in tidy format (with tidy in filename)...
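The relationship between the matrix and tidy layouts described above can be sketched in a few lines of Python. This is a minimal illustration only: the sample values, the two-letter codes, and the column names `Passport`, `Destination`, and `Requirement` are invented here and are not taken from the actual files.

```python
import csv
import io

# Hypothetical sample in matrix layout: the first column is the passport (from),
# each remaining column is a destination (to). All values are invented.
MATRIX_CSV = """Passport,AA,BB,CC
AA,-1,visa free,visa required
BB,visa free,-1,e-visa
CC,visa required,visa on arrival,-1
"""

def matrix_to_tidy(matrix_text):
    """Turn a passport-by-destination matrix into one row per (passport, destination) pair."""
    reader = csv.reader(io.StringIO(matrix_text))
    header = next(reader)
    destinations = header[1:]  # every column after the first is a destination
    rows = []
    for record in reader:
        if not record:
            continue  # skip blank lines
        passport, values = record[0], record[1:]
        for destination, requirement in zip(destinations, values):
            rows.append({
                "Passport": passport,
                "Destination": destination,
                "Requirement": requirement,
            })
    return rows

tidy_rows = matrix_to_tidy(MATRIX_CSV)
for row in tidy_rows:
    print(row["Passport"], row["Destination"], row["Requirement"])
```

A 3 x 3 matrix becomes 9 tidy rows, which is why the two layouts can carry identical data.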

Downloads: 1 This Week

Last Update: 2025-01-12
See Project
7

Exclusively Dark Image Dataset

ExDARK dataset is the largest collection of low-light images

The Exclusively Dark (ExDARK) dataset is one of the largest curated collections of real-world low-light images designed to support research in computer vision tasks under challenging lighting conditions. It contains 7,363 images captured across ten different low-light scenarios, ranging from extremely dark environments to twilight. Each image is annotated with both image-level labels and object-level bounding boxes for 12 object categories, making it suitable for detection and classification tasks. ...

Downloads: 4 This Week

Last Update: 3 days ago
See Project
8

Image Harmonization Dataset iHarmony4

The first large-scale public benchmark dataset for image harmonization

This repository provides the iHarmony4 dataset, which is a large-scale dataset designed for image harmonization tasks. Image harmonization involves adjusting the appearance of a foreground in a composite image so that it is consistent with the background (in color, tone, illumination, etc.). The iHarmony4 dataset comprises four sub-datasets (HCOCO, HAdobe5k, HFlickr, Hday2night), each making composite images by combining a foreground from one image with a background from another, along with associated ground truth harmonized images and foreground masks. ...

Downloads: 0 This Week

Last Update: 2026-02-24
See Project
9

IPFS GeoIP

GeoIP lookup over DAG-CBOR dataset loaded from IPFS

GeoIP lookup over a DAG-CBOR dataset loaded from IPFS.

Downloads: 1 This Week

Last Update: 2026-02-27
See Project
10

embedchain

Framework to easily create LLM powered bots over any dataset

Embedchain is a framework to easily create LLM-powered bots over any dataset. If you want a JavaScript version, check out embedchain-js. Embedchain empowers you to create chatbot models similar to ChatGPT, using your own evolving dataset. Start building LLM-powered bots in under 30 seconds.

Downloads: 0 This Week

Last Update: 3 days ago
See Project
11

CO3D (Common Objects in 3D)

Tooling for the Common Objects In 3D dataset

CO3Dv2 (Common Objects in 3D, version 2) is a large-scale 3D computer vision dataset and toolkit from Facebook Research designed for training and evaluating category-level 3D reconstruction methods using real-world data. It builds upon the original CO3Dv1 dataset, expanding both scale and quality—featuring 2× more sequences and 4× more frames, with improved image fidelity, more accurate segmentation masks, and enhanced annotations for object-centric 3D reconstruction. ...

Downloads: 3 This Week

Last Update: 6 days ago
See Project
12

Fluid

Fluid, elastic data abstraction and acceleration for BigData/AI apps

Fluid is an elastic data abstraction and acceleration layer for big data and AI applications in the cloud. It provides a DataSet abstraction over underlying heterogeneous data sources with multidimensional management in a cloud environment. It enables dataset warmup and acceleration for data-intensive applications by using a distributed cache in Kubernetes with observability, portability, and scalability. It also takes the characteristics of applications and data into consideration when scheduling cloud applications and datasets, to improve performance.

Downloads: 0 This Week

Last Update: 2025-10-31
See Project
13

Parquet.jl

Julia implementation of Parquet columnar file format reader

A parquet file or dataset can be loaded using the read_parquet function. A parquet dataset is a directory with multiple parquet files, each of which is a partition belonging to the dataset.

Downloads: 0 This Week

Last Update: 2025-08-01
See Project
14

Datumaro

Dataset Management Framework, a Python library and a CLI tool to build

Datumaro is a flexible Python-based dataset management framework and command-line tool for building, analyzing, transforming, and converting computer vision datasets in many popular formats. It supports importing and exporting annotations and images across a wide variety of standards like COCO, PASCAL VOC, YOLO, ImageNet, Cityscapes, and many more, enabling easy integration with different training pipelines and tools.

Downloads: 0 This Week

Last Update: 2 days ago
See Project
15

Open X-Embodiment

Unified open dataset enabling cross-embodiment learning for robotics

Open X-Embodiment is a large-scale collaborative initiative led by Google DeepMind to unify robotic learning datasets into a consistent and standardized format, simplifying access and usage across the robotics research community. Its primary goal is to make all available open-source robotic data interoperable by representing them using the RLDS (Reinforcement Learning Dataset Structure) episode format. This enables seamless integration for training, evaluation, and model development across diverse robotic tasks and embodiments. The dataset aggregates contributions from multiple open-source robotic projects, all harmonized under a single unified data schema. The repository also provides Colab notebooks for dataset visualization, batching, and model inference, along with pretrained model checkpoints such as RT-1-X, a multitask robotic transformer model trained on this data.

Downloads: 2 This Week

Last Update: 6 days ago
See Project
16

DOLMA

Data and tools for generating and inspecting OLMo pre-training data

DOLMA (Data to feed OLMo's Appetite) is a dataset and toolkit for generating and inspecting large-scale pre-training data for OLMo language models.

Downloads: 0 This Week

Last Update: 2025-06-25
See Project
17

CleanVision

Automatically find issues in image datasets

CleanVision automatically detects potential issues in image datasets, such as images that are blurry, under- or over-exposed, or (near) duplicates. This data-centric AI package is a quick first step for any computer vision project to find problems in the dataset, which you want to address before applying machine learning. CleanVision is simple to use: run the same couple of lines of Python code to audit any image dataset! The quality of machine learning models hinges on the quality of the data used to train them, but it is hard to manually identify all of the low-quality data in a big dataset. ...

Downloads: 4 This Week

Last Update: 2026-01-05
See Project
18

Datasets

Hub of ready-to-use datasets for ML models

Datasets is a library for easily accessing and sharing datasets, and evaluation metrics for Natural Language Processing (NLP), computer vision, and audio tasks. Load a dataset in a single line of code, and use our powerful data processing methods to quickly get your dataset ready for training in a deep learning model. Backed by the Apache Arrow format, process large datasets with zero-copy reads without any memory constraints for optimal speed and efficiency. We also feature a deep integration with the Hugging Face Hub, allowing you to easily load and share a dataset with the wider NLP community. ...

Downloads: 0 This Week

Last Update: 2026-06-05
See Project
19

Redis

An in-memory database that persists on disk

...You can run atomic operations on these types, like appending to a string; incrementing the value in a hash; pushing an element to a list; computing set intersection, union and difference; or getting the member with highest ranking in a sorted set. To achieve top performance, Redis works with an in-memory dataset. Depending on your use case, you can persist your data either by periodically dumping the dataset to disk or by appending each command to a disk-based log.
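The two persistence styles mentioned above (periodically dumping the dataset, or appending each command to a log) can be illustrated with a toy key-value store. This is a sketch of the general idea only, not Redis's implementation or file formats; the class `TinyStore`, the function `replay`, and the JSON-lines log format are all hypothetical names and choices made for this example.

```python
import json
import os
import tempfile

class TinyStore:
    """Toy in-memory key-value store (hypothetical; not Redis itself)."""

    def __init__(self, log_path):
        self.data = {}
        self.log_path = log_path

    def _log(self, command):
        # Append-only persistence: write each mutating command as one JSON line.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps(command) + "\n")

    def set(self, key, value):
        self.data[key] = value
        self._log(["set", key, value])

    def incr(self, key):
        self.data[key] = int(self.data.get(key, 0)) + 1
        self._log(["incr", key])
        return self.data[key]

    def dump(self, snapshot_path):
        # Snapshot persistence: write the whole in-memory dataset in one go.
        with open(snapshot_path, "w", encoding="utf-8") as snap:
            json.dump(self.data, snap)

def replay(log_path):
    """Rebuild the dataset by re-applying every logged command in order."""
    data = {}
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            command = json.loads(line)
            if command[0] == "set":
                data[command[1]] = command[2]
            elif command[0] == "incr":
                data[command[1]] = int(data.get(command[1], 0)) + 1
    return data

workdir = tempfile.mkdtemp()
log_path = os.path.join(workdir, "commands.log")
snapshot_path = os.path.join(workdir, "snapshot.json")

store = TinyStore(log_path)
store.set("greeting", "hello")
store.incr("visits")
store.incr("visits")
store.dump(snapshot_path)

recovered_from_log = replay(log_path)
with open(snapshot_path, encoding="utf-8") as snap:
    recovered_from_snapshot = json.load(snap)
```

Both recovery paths yield the same dataset here. The trade-off the sketch hints at is that a snapshot only reflects the moment it was taken, while the log captures every command at the cost of a file that grows with each write.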

Downloads: 47 This Week

Last Update: 2026-06-04
See Project
20

In-The-Wild Jailbreak Prompts on LLMs

A dataset consisting of 15,140 ChatGPT prompts from Reddit

...Researchers analyze these prompts to identify patterns, attack strategies, and techniques commonly used to trick language models into producing restricted or harmful outputs. The dataset includes thousands of prompts collected across multiple platforms and represents one of the largest collections of jailbreak attempts available for research.

Downloads: 0 This Week

Last Update: 2026-03-05
See Project
21

ARC-AGI

The Abstraction and Reasoning Corpus

...The dataset is structured as grid-based puzzles, where each task requires understanding transformations such as symmetry, counting, or spatial manipulation. Unlike traditional machine learning benchmarks, ARC emphasizes generalization and reasoning over statistical pattern recognition, making it particularly challenging for current AI systems.
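To make the idea of a grid-based puzzle concrete, here is a small Python sketch of a symmetry task. The task itself is invented for illustration, and the dictionary layout (a `train` list of input/output pairs plus a `test` input) is an assumption of this example rather than a description of the corpus files. The helper names `mirror_left_right` and `fits_all_examples` are hypothetical.

```python
# Hypothetical task: each pair maps an input grid to an output grid.
# Cell values are small integers standing for colors; the grids are invented.
task = {
    "train": [
        {"input": [[1, 0, 0], [2, 0, 0]], "output": [[0, 0, 1], [0, 0, 2]]},
        {"input": [[0, 3], [4, 0]], "output": [[3, 0], [0, 4]]},
    ],
    "test": [{"input": [[5, 0, 6]]}],
}

def mirror_left_right(grid):
    """Candidate transformation: reflect every row horizontally."""
    return [list(reversed(row)) for row in grid]

def fits_all_examples(transform, pairs):
    """Check that a candidate transformation reproduces every demonstration pair."""
    return all(transform(pair["input"]) == pair["output"] for pair in pairs)

# Only apply the candidate to the test input if it explains all demonstrations.
if fits_all_examples(mirror_left_right, task["train"]):
    prediction = mirror_left_right(task["test"][0]["input"])
else:
    prediction = None

print(prediction)
```

The point of the sketch is that a solver sees only a handful of demonstrations per task, so it has to infer the rule instead of fitting statistics over many examples.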

Downloads: 2 This Week

Last Update: 2026-04-03
See Project
22

OpenCLIP

An open source implementation of CLIP

The goal of this repository is to enable training models with contrastive image-text supervision and to investigate their properties such as robustness to distribution shift. Our starting point is an implementation of CLIP that matches the accuracy of the original CLIP models when trained on the same dataset. Specifically, a ResNet-50 model trained with our codebase on OpenAI's 15 million image subset of YFCC achieves 32.7% top-1 accuracy on ImageNet. OpenAI's CLIP model reaches 31.3% when trained on the same subset of YFCC. For ease of experimentation, we also provide code for training on the 3 million images in the Conceptual Captions dataset, where a ResNet-50x4 trained with our codebase reaches 22.2% top-1 ImageNet accuracy. ...

Downloads: 2 This Week

Last Update: 2026-02-27
See Project
23

NYC Taxi Data

Import public NYC taxi and for-hire vehicle (Uber, Lyft)

...It also contains example analyses (spatial and temporal visualizations like maps, time-series plots, and hotspot detection) highlighting insights such as patterns of demand, peak times, and geospatial distributions. The repository is often used as a benchmark dataset and as an example for teaching and demonstrations in the data science and urban analytics communities.

Downloads: 3 This Week

Last Update: 2025-10-01
See Project
24

CellTypist

A tool for semi-automatic cell type classification, harmonization

CellTypist is an automated tool for cell type classification, harmonization, and integration. Classification: transfer cell type labels from the reference to the query dataset. Harmonization: match and harmonize cell types defined by independent datasets. Integration: integrate cells and cell types with supervision from harmonization. CellTypist recapitulates cell type structure and biology of independent datasets. Regularised linear models with Stochastic Gradient Descent provide a fast and accurate prediction. ...

Downloads: 0 This Week

Last Update: 2025-06-25
See Project
25

all AI news

A list of online news & info sources in the AI/ML/Data Science space

...Overall, it provides a foundational dataset for tracking AI industry trends and updates.

Downloads: 0 This Week

Last Update: 2026-04-21
See Project