Open Source Linux Synthetic Data Generation Software

Synthetic Data Generation Software for Linux

Browse free open source Synthetic Data Generation software and projects for Linux below. Use the toggles on the left to filter open source Synthetic Data Generation software by OS, license, language, programming language, and project status.

1

ML for Trading

Code for machine learning for algorithmic trading, 2nd edition

Across more than 800 pages, this revised and expanded 2nd edition demonstrates how ML can add value to algorithmic trading through a broad range of applications. Organized in four parts and 24 chapters, it covers the end-to-end workflow from data sourcing and model development to strategy backtesting and evaluation, including key aspects of financial feature engineering and portfolio management. Topics include the design and evaluation of long-short strategies based on a broad range of ML algorithms; extracting tradeable signals from financial text data such as SEC filings, earnings call transcripts, or financial news; using deep learning models like CNN and RNN with financial and alternative data; generating synthetic data with Generative Adversarial Networks; and training a trading agent using deep reinforcement learning.

Downloads: 3 This Week

Last Update: 2021-11-24
See Project
2

Synthea Patient Generator

Synthetic Patient Population Simulator

Synthea™ is an open-source, synthetic patient generator that models the medical history of synthetic patients. Our mission is to provide high-quality, synthetic, realistic but not real, patient data and associated health records covering every aspect of healthcare. The resulting data is free from cost, privacy, and security restrictions, enabling research with Health IT data that is otherwise legally or practically unavailable. The models used to generate synthetic patients are informed by numerous academic publications. Our synthetic populations provide insight into the validity of this research and encourage future studies in population health. Synthetic data establishes a risk-free environment for Health IT development and experimentation. This includes the evaluation of new treatment models, care management systems, clinical decision support, and more.

Downloads: 3 This Week

Last Update: 2026-03-05
See Project
3

Mimesis

High-performance fake data generator for Python

Mimesis is an open source high-performance fake data generator for Python, able to provide data for various purposes in various languages. It's currently the fastest fake data generator for Python, and supports many different data providers that can produce data related to people, food, transportation, internet and many more. Mimesis is really easy to use, with everything you need just an import away. Simply import an object, called a Provider, which represents the type of data you need. Mimesis currently supports 34 different locales; specifying a locale when creating a provider returns data that is appropriate for the language or country associated with that locale.

Downloads: 1 This Week

Last Update: 2026-01-11
See Project
4

SDGym

Benchmarking synthetic data generation methods

The Synthetic Data Gym (SDGym) is a benchmarking framework for modeling and generating synthetic data. Measure performance and memory usage across different synthetic data modeling techniques – classical statistics, deep learning and more! The SDGym library integrates with the Synthetic Data Vault ecosystem. You can use any of its synthesizers, datasets or metrics for benchmarking. You can also customize the process to include your own work. Select any of the publicly available datasets from the SDV project, or input your own data. Choose from any of the SDV synthesizers and baselines, or write your own custom machine learning model. In addition to performance and memory usage, you can also measure synthetic data quality and privacy through a variety of metrics. Install SDGym using pip or conda. We recommend using a virtual environment to avoid conflicts with other software on your device.

Downloads: 1 This Week

Last Update: 2026-04-17
See Project
5

Synth

The Declarative Data Generator

Synth is an open-source data-as-code tool that provides a simple CLI workflow for generating consistent data in a scalable way. Use Synth to generate correct, anonymized data that looks and quacks like production. Generate test data fixtures for your development, testing, and continuous integration. Generate data that tells the story you want to tell. Specify constraints, relations, and all your semantics. Seed development environments and CI. Anonymize sensitive production data. Create realistic data to your specifications. Synth uses a declarative configuration language that allows you to specify your entire data model as code. Synth can import data straight from existing sources and automatically create accurate and versatile data models. Synth supports semi-structured data and is database agnostic, playing nicely with SQL and NoSQL databases. Synth supports generation for thousands of semantic types such as credit card numbers, email addresses, and more.

Downloads: 1 This Week

Last Update: 2023-05-22
See Project
6

nITROGEN

Internet of Things RandOm GENerator

1 Review

Downloads: 2 This Week

Last Update: 2022-04-25
See Project
7

A Data Generator

A tool to generate synthetic test data useful to Record matchers

With the growing amount of information from multiple sources, it has become very hard to relate information to the correct real-life entities. Record matching software tries to solve this with machine learning techniques. To do this effectively, it is necessary to train the record matcher with proper test data that closely resembles real-life data. Hence, there is a need for a data generator that creates synthetic data for evaluating the quality and capability of record matching software. A data generator creates high-quality test data that accounts for the various real-life data glitches introduced through human data entry, voice dictation and data scanning. The data generation process is done in several steps: org data creation, data grouping, pair generation, data mutation and matching data patterns. The data generator also mangles field values of the generated test data to introduce data errors and correlates them in real-life contexts like families, households, organizations, etc.
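The pair generation and data mutation steps described for this tool can be illustrated with a minimal Python sketch. It is not the project's code: the function names `mutate_value` and `make_pair`, the three glitch types, and the sample record are all assumptions chosen for illustration.

```python
import random

def mutate_value(value: str, rng: random.Random) -> str:
    """Apply one random character-level glitch to a field value (illustrative only)."""
    if len(value) < 2:
        return value
    kind = rng.choice(["swap", "drop", "replace"])
    i = rng.randrange(len(value) - 1)
    if kind == "swap":
        # Transpose two adjacent characters, as in a typing slip.
        return value[:i] + value[i + 1] + value[i] + value[i + 2:]
    if kind == "drop":
        # Omit one character, as in hurried data entry.
        return value[:i] + value[i + 1:]
    # Replace one character with a random lowercase letter, as in a scan error.
    return value[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + value[i + 1:]

def make_pair(record: dict, rng: random.Random) -> tuple:
    """Return (original, mutated copy): a labelled 'same entity' pair."""
    field = rng.choice(sorted(record))
    mutated = dict(record)  # copy so the original record stays untouched
    mutated[field] = mutate_value(record[field], rng)
    return record, mutated

rng = random.Random(42)  # seeded so runs are repeatable
original, noisy = make_pair({"name": "Margaret Jones", "city": "Leeds"}, rng)
print(original)
print(noisy)
```

Each pair is known to refer to the same entity, so a record matcher can be scored on whether it links the noisy copy back to its original.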

Downloads: 0 This Week

Last Update: 2013-12-08
See Project
8

Ava: Testdata Xsl

Generates test data based on Excel: creates XML, Excel, CSV, HTML, SQL, and more

This test data generation tool takes an Excel sheet as its primary input. The second important parameter is the number of test records to produce. The Excel data is reused for as long as data is needed. The tool is highly parameterizable through XSL scripts. Data can be created, updated, modified, and finally exported in a format of your choice. Main functions: (1) Generates test data (Excel, XSL, XML) (2) Exports generated test data in multiple formats (CSV, Excel, HTML, SQL insert, or custom formats via XSL extension) (3) Collects all processed data in Excel files (4) Plus: XSL Executor, which lets you run XSL scripts independently (5) Plus: User interface

Downloads: 0 This Week

Last Update: 2015-12-12
See Project
9

BlenderProc

Blender pipeline for photorealistic training image generation

A procedural Blender pipeline for photorealistic training image generation. BlenderProc has to be run inside the Blender Python environment, as only there can the Blender API be accessed. Therefore, instead of running your script with the usual Python interpreter, the command line interface of BlenderProc has to be used. In general, one run of your script first loads or constructs a 3D scene, then sets some camera poses inside this scene and renders different types of images (RGB, distance, semantic segmentation, etc.) for each of those camera poses. Usually, you will run your script multiple times, each time producing a new scene and rendering e.g. 5-20 images from it. With a little more experience, it is also possible to change scenes during a single script call. As BlenderProc runs in Blender's separate Python environment, debugging your BlenderProc script cannot be done in the same way as with any other Python script.

Downloads: 0 This Week

Last Update: 2024-10-22
See Project
10

CTGAN

Conditional GAN for generating synthetic tabular data

CTGAN is a collection of Deep Learning based synthetic data generators for single table data, which are able to learn from real data and generate synthetic data with high fidelity. If you're just getting started with synthetic data, we recommend installing the SDV library which provides user-friendly APIs for accessing CTGAN. The SDV library provides wrappers for preprocessing your data as well as additional usability features like constraints. When using the CTGAN library directly, you may need to manually preprocess your data into the correct format: continuous data must be represented as floats, discrete data must be represented as ints or strings, and the data should not contain any missing values.
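The three format requirements named in the CTGAN description (floats for continuous columns, ints or strings for discrete columns, no missing values) can be checked before training. The following is a minimal standard-library sketch, not part of CTGAN or SDV: the helper names `is_missing` and `prepare_rows`, the sample columns, and the choice to drop incomplete rows rather than impute them are all assumptions.

```python
import math

def is_missing(value) -> bool:
    """Treat None, empty strings and NaN as missing (an assumption of this sketch)."""
    return value is None or value == "" or (isinstance(value, float) and math.isnan(value))

def prepare_rows(rows, discrete_columns):
    """Drop incomplete rows, cast continuous columns to float, keep discrete ones as int/str."""
    prepared = []
    for row in rows:
        if any(is_missing(v) for v in row.values()):
            continue  # the data should not contain any missing values
        clean = {}
        for name, value in row.items():
            if name in discrete_columns:
                # Discrete data must be ints or strings.
                clean[name] = value if isinstance(value, (int, str)) else str(value)
            else:
                # Continuous data must be floats.
                clean[name] = float(value)
        prepared.append(clean)
    return prepared

rows = [
    {"age": 34, "income": "52000.5", "city": "Oslo"},
    {"age": None, "income": "41000", "city": "Bergen"},  # dropped: missing age
    {"age": 51, "income": 38750, "city": "Tromso"},
]
prepared = prepare_rows(rows, discrete_columns={"city"})
print(prepared)
```

Dropping rows is the simplest way to satisfy the no-missing-values rule; imputing values instead would preserve more data at the cost of altering it.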

Downloads: 0 This Week

Last Update: 2026-02-13
See Project
11

Copulas

A library to model multivariate data using copulas

Copulas is a Python library for modeling multivariate distributions and sampling from them using copula functions. Given a table of numerical data, use Copulas to learn the distribution and generate new synthetic data following the same statistical properties. Choose from a variety of univariate distributions and copulas – including Archimedean Copulas, Gaussian Copulas and Vine Copulas. Compare real and synthetic data visually after building your model. Visualizations are available as 1D histograms, 2D scatterplots and 3D scatterplots. Access & manipulate learned parameters. With complete access to the internals of the model, set or tune parameters to your choosing.
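To make the idea of learning a distribution and sampling from it concrete, here is a rough two-column Gaussian copula written with the Python standard library only. It does not use the Copulas library or its API. The function names, the rank-based empirical marginals, and the toy data are assumptions made for this sketch.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def normal_scores(column):
    """Map each value to a standard-normal score through its rank."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def empirical_quantile(sorted_column, u):
    """Return the observed value at cumulative probability u."""
    index = min(int(u * len(sorted_column)), len(sorted_column) - 1)
    return sorted_column[index]

def fit_and_sample(xs, ys, n_samples, rng):
    """Estimate the copula correlation, then draw correlated synthetic pairs."""
    rho = pearson(normal_scores(xs), normal_scores(ys))
    sx, sy = sorted(xs), sorted(ys)
    samples = []
    for _ in range(n_samples):
        z1 = rng.gauss(0.0, 1.0)
        # Second normal score is correlated with the first by rho.
        z2 = rho * z1 + math.sqrt(max(0.0, 1.0 - rho * rho)) * rng.gauss(0.0, 1.0)
        samples.append((empirical_quantile(sx, STD_NORMAL.cdf(z1)),
                        empirical_quantile(sy, STD_NORMAL.cdf(z2))))
    return rho, samples

# Toy data: the second column rises with the first, plus a small repeating offset.
xs = [150 + i for i in range(40)]
ys = [50 + 0.5 * i + (i * 7) % 5 for i in range(40)]
rho, samples = fit_and_sample(xs, ys, 100, random.Random(7))
print(round(rho, 3), samples[:3])
```

Because the marginals here are empirical, every sampled value is one that was observed in the input; fitting parametric univariate distributions, as the library description mentions, would allow new values as well.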

Downloads: 0 This Week

Last Update: 2026-02-05
See Project
12

DATA Gen™

DATA Gen™ - Test Data Generator to generate realistic test data.

DATA Gen™ Test Data Generator offers facilities to automate the task of creating test data for new or existing databases. It helps lower the programming effort required, while reducing manual test data generation errors and the ripple effect that they cause on production systems, users and maintenance.

2 Reviews

Downloads: 0 This Week

Last Update: 2015-07-09
See Project
13

DBFeeder

Highly Customizable Test Data Generator

DBFeeder is a great tool to generate synthetic test data for Oracle databases, and it is ideal for companies that want to outsource development. Thanks to its original approach, data is highly customizable and even satisfies the primary and foreign key constraints of tables.

Downloads: 0 This Week

Last Update: 2022-09-06
See Project
14

Gretel Synthetics

Synthetic data generators for structured and unstructured text

Unlock unlimited possibilities with synthetic data. Share, create, and augment data with cutting-edge generative AI. Generate unlimited data in minutes with synthetic data delivered as-a-service. Synthesize data that is as good as or better than your original dataset, and maintain relationships and statistical insights. Customize privacy settings so that data is always safe while remaining useful for downstream workflows. Ensure data accuracy and privacy confidently with expert-grade reports. Need to synthesize one or multiple data types? We have you covered. Even take advantage of multimodal data generation. Synthesize and transform multiple tables or entire relational databases. Mitigate GDPR and CCPA risks, and promote safe data access. Accelerate CI/CD workflows, performance testing, and staging. Augment AI training data, including minority classes and unique edge cases. Amaze prospects with personalized product experiences.

Downloads: 0 This Week

Last Update: 2025-03-17
See Project
15

JRandO

JRandO is a test data generator or, more precisely, a test object generator framework. It can be used in JUnit tests or in performance tests (e.g. using JMeter). It may also be useful for data anonymization or in a simulation environment.

Downloads: 0 This Week

Last Update: 2014-05-02
See Project
16

JRandOSample

Sample code for the JRandO project (test data generator, test object generator, simulation).

Downloads: 0 This Week

Last Update: 2014-04-23
See Project
17

OraMasking

Data masking tool for Oracle database

For Oracle databases, mask sensitive data by replacement (static or expression), substitution (synthetic data set is included), random values (Lorem Ipsum) or deletion. Generates update/delete statements or triggers, or runs directly against the database.

Downloads: 0 This Week

Last Update: 2017-03-24
See Project
18

Synthetic Data Kit

Tool for generating high quality Synthetic datasets

Synthetic Data Kit is a CLI-centric toolkit for generating high-quality synthetic datasets to fine-tune Llama models, with an emphasis on producing reasoning traces and QA pairs that line up with modern instruction-tuning formats. It ships an opinionated, modular workflow that covers ingesting heterogeneous sources (documents, transcripts), prompting models to create labeled examples, and exporting to fine-tuning schemas with minimal glue code. The kit’s design goal is to shorten the “data prep” bottleneck by turning dataset creation into a repeatable pipeline rather than ad-hoc notebooks. It supports generation of rationales/chain-of-thought variants, configurable sampling, and guardrails so outputs meet format constraints and quality checks. Examples and guides show how to target task-specific behaviors like tool use or step-by-step reasoning, then save directly into training-ready files.

Downloads: 0 This Week

Last Update: 2025-10-25
See Project
19

Synthetic Mixed Data Generator

A Synthetic Data Generator for producing mixed datasets described by relevant, irrelevant, and redundant features.

Downloads: 0 This Week

Last Update: 2021-11-17
See Project
20

TGAN

Generative adversarial training for generating synthetic tabular data

We are happy to announce that our new model for synthetic data called CTGAN is open-sourced. The new model is simpler and gives better performance on many datasets. TGAN is a tabular data synthesizer. It can generate fully synthetic data from real data. Currently, TGAN can generate numerical columns and categorical columns. TGAN has been developed and runs on Python 3.5, 3.6 and 3.7. Also, although it is not strictly required, the usage of a virtualenv is highly recommended in order to avoid interfering with other software installed in the system where TGAN is run. For development, you can use make install-develop instead in order to install all the required dependencies for testing and code linting. In order to be able to sample new synthetic data, TGAN first needs to be fitted to existing data.

Downloads: 0 This Week

Last Update: 2023-03-21
See Project
21

Tofu

Tofu is a Python tool for generating synthetic UK Biobank data

Tofu is a Python library for generating synthetic UK Biobank data. The UK Biobank is a large open-access prospective research cohort study of 500,000 middle-aged participants recruited in England, Scotland and Wales. The study has collected and continues to collect extensive phenotypic and genotypic detail about its participants, including data from questionnaires, physical measures, sample assays, accelerometry, multimodal imaging, genome-wide genotyping and longitudinal follow-up for a wide range of health-related outcomes. Tofu will generate synthetic data which conforms to the structure of the baseline data UK Biobank sends researchers by generating random values. For categorical variables (single or multiple choices), a random value will be picked from the UK Biobank data dictionary for that field. For continuous variables, a random value will be generated based on the distribution of values reported for that field on the UK Biobank showcase.
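The two generation rules in the Tofu description (a random pick for categorical fields, a distribution-based draw for continuous ones) can be sketched in a few lines of Python. This is not Tofu's code or interface. The field names, choices, and summary statistics below are invented placeholders, not entries from the UK Biobank data dictionary, and the normal distribution is an assumption.

```python
import random

# Hypothetical field definitions for illustration only.
FIELDS = {
    "sex": {"type": "categorical", "choices": ["Female", "Male"]},
    "smoking_status": {"type": "categorical", "choices": ["Never", "Previous", "Current"]},
    "standing_height_cm": {"type": "continuous", "mean": 168.0, "stdev": 9.0},
}

def synthetic_row(fields, rng):
    """Build one synthetic record, field by field."""
    row = {}
    for name, spec in fields.items():
        if spec["type"] == "categorical":
            # Pick any allowed value from the field's dictionary of choices.
            row[name] = rng.choice(spec["choices"])
        else:
            # Draw from an assumed normal distribution with the reported summary stats.
            row[name] = round(rng.gauss(spec["mean"], spec["stdev"]), 1)
    return row

rng = random.Random(0)  # seeded so runs are repeatable
rows = [synthetic_row(FIELDS, rng) for _ in range(5)]
for row in rows:
    print(row)
```

Because each field is drawn independently, rows like these match the structure of the source data but carry no relationships between columns.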

Downloads: 0 This Week

Last Update: 2023-05-22
See Project
22

Twinify

Privacy-preserving generation of a synthetic twin to a data set

twinify is a software package for the privacy-preserving generation of a synthetic twin to a given sensitive tabular data set. On a high level, twinify follows the differentially private data-sharing process introduced by Jälkö et al. Depending on the nature of your data, twinify implements either the NAPSU-MQ approach described by Räisä et al. or finds an approximate parameter posterior for any probabilistic model you formulated using differentially private variational inference (DPVI). For the latter, twinify also offers automatic modeling for easy building of models fitting the data. If you have existing experience with NumPyro you can also implement your own model directly. Often data that would be very useful for the scientific community is subject to privacy regulations and concerns and cannot be shared. Differentially private data sharing allows the generation of synthetic data that is statistically similar to the original data.

Downloads: 0 This Week

Last Update: 2023-05-22
See Project
23

YData Synthetic

Synthetic data generators for tabular and time-series data

A package to generate synthetic tabular and time-series data leveraging state-of-the-art generative models. Synthetic data is artificially generated data that is not collected from real-world events. It replicates the statistical components of real data without containing any identifiable information, ensuring individuals' privacy. This repository contains material related to Generative Adversarial Networks for synthetic data generation, in particular regular tabular data and time-series. It consists of a set of different GAN architectures developed using TensorFlow 2.0. Several example Jupyter Notebooks and Python scripts are included to show how to use the different architectures. YData Synthetic now has a UI to guide you through the steps and inputs to generate structured tabular data. The Streamlit app is available from v1.0.0 onwards.

Downloads: 0 This Week

Last Update: 2026-04-23
See Project
24

Zylthra

Zylthra: A PyQt6 app to generate synthetic datasets with DataLLM.

Welcome to Zylthra, a powerful Python-based desktop application built with PyQt6, designed to generate synthetic datasets using the DataLLM API from data.mostly.ai. This tool allows users to create custom datasets by defining columns, configuring generation parameters, and saving setups for reuse, all within a sleek, dark-themed interface.

Downloads: 0 This Week

Last Update: 2025-04-10
See Project
25

databene benerator

benerator is a framework for creating realistic and valid high-volume test data, used for load and performance testing and showcase setup. Data is generated from an easily configurable metadata model and exported to databases, XML, CSV or flat files.

8 Reviews

Downloads: 0 This Week

Last Update: 2019-01-28
See Project

© 2026 Slashdot Media. All Rights Reserved.