45 Integrations with Apache Parquet

View a list of Apache Parquet integrations and software that integrates with Apache Parquet below. Compare the best Apache Parquet integrations as well as features, ratings, user reviews, and pricing of software that integrates with Apache Parquet. Here are the current Apache Parquet integrations in 2026:

  • 1
    Ficstar

    Ficstar Software Inc.

    Ficstar is a fully managed web scraping and enterprise data extraction company headquartered in Toronto, Canada. Founded in 2005, Ficstar provides end-to-end data collection solutions that handle every aspect of the scraping pipeline — infrastructure, proxy management, data parsing, structuring, and delivery — so enterprise clients receive clean, accurate, real-time data without building or maintaining any in-house scraping systems. Serving 200+ major companies across industries including e-commerce, finance, retail, and market research, Ficstar specializes in large-scale, compliance-conscious web data extraction tailored to each client's specific requirements. Solutions are fully customized, scalable, and designed for seamless integration with existing business intelligence and data workflows. With over two decades of experience, Ficstar is a trusted partner for enterprises that depend on reliable, structured web data to power competitive intelligence, pricing analysis, lead generation, and more.
    Starting Price: $1,000
  • 2
    QuerySurge
    QuerySurge is the enterprise-grade data quality platform that continuously automates the validation of data across your entire ecosystem, from data warehouses and big data lakes to BI reports and enterprise applications. With AI-powered test creation, a scalable architecture, and seamless CI/CD integration, QuerySurge consistently ensures data integrity at every stage of the pipeline: accelerating delivery, reducing risk, and enabling confident decision-making.
    Use Cases:
    - Data Warehouse & ETL Testing
    - Big Data Testing
    - DevOps for Data / DataOps / Continuous Testing
    - Data Migration Testing
    - BI Report Testing
    - Enterprise App/ERP Testing
    QuerySurge Features:
    - Data Validation: enterprise-grade data validation platform
    - AI: automatically create data validation tests
    - BI Report Testing: fully automated, no-code approach
    - DevOps for Data (DataOps): API with 60+ calls and Swagger docs; integrate continuous testing into your CI/CD pipelines
    - Data Connectors: for 200+ platforms
  • 3
    StarfishETL

    StarfishETL is an Integration Platform as a Service (iPaaS), and although “integration” is in the name, it’s capable of much more. An iPaaS lives in the cloud and can integrate different systems by using their APIs. This makes it adaptable beyond integration for migration, data governance, and data cleansing. Unlike traditional integration apps, StarfishETL provides low-code mapping and powerful scripting tools to manage, personalize, and manipulate data at scale.
    Features:
    - Drag-and-drop mapping
    - AI-powered connections
    - Purpose-built integrations
    - Extensibility through scripting
    - Secure on-premises connections
    - Scalable data capacity
    Starting Price: $400/month
  • 4
    Flyte

    Union.ai

    The workflow automation platform for complex, mission-critical data and ML processes at scale. Flyte makes it easy to create concurrent, scalable, and maintainable workflows for machine learning and data processing. Flyte has been battle-tested in production at Lyft, Spotify, Freenome, and others. At Lyft, Flyte has been serving production model training and data processing for over four years, becoming the de facto platform for teams like pricing, locations, ETA, mapping, autonomous, and more. In fact, Flyte manages over 10,000 unique workflows at Lyft, totaling over 1,000,000 executions every month, 20 million tasks, and 40 million containers. It is entirely open source with an Apache 2.0 license under the Linux Foundation, with a cross-industry overseeing committee. Configuring machine learning and data workflows can get complex and error-prone with YAML.
    Starting Price: Free
  • 5
    Indexima Data Hub
    Reshape your perception of time in data analytics. Access your business’s data instantly and work directly on your dashboard without going back and forth with the IT team. Meet Indexima DataHub, a new space-time where operational and functional users gain instant access to their data. With a combination of its unique indexing engine and machine learning, Indexima allows businesses to access all their data to simplify and speed up analytics. Robust and scalable, the solution allows organizations to query all their data directly at the source, across volumes of tens of billions of rows, in just a few milliseconds. The Indexima platform allows users to implement instant analytics on all their data in just one click. Thanks to Indexima’s ROI and TCO calculator, find out in 30 seconds the ROI of your data platform: infrastructure costs, project deployment time, and data engineering costs, alongside the boost to your analytical performance.
    Starting Price: $3,290 per month
  • 6
    PI.EXCHANGE

    Easily connect your data to the engine, either by uploading a file or connecting to a database. Then, start analyzing your data through visualizations, or prepare it for machine learning modeling using data wrangling actions with repeatable recipes. Get the most out of your data by building machine learning models using regression, classification, or clustering algorithms, all without any code. Uncover insights into your data using the feature importance, prediction explanation, and what-if tools. Make predictions and integrate them seamlessly into your existing systems through our ready-to-go connectors, so you can start taking action.
    Starting Price: $39 per month
  • 7
    Tonic Ephemeral
    Stop wasting time provisioning and maintaining databases yourself. Effortlessly create isolated test databases to ship features faster. Equip your developers with the ready-to-go data they need to keep fast-paced projects on track. Spin up pre-populated databases for testing purposes as part of your CI/CD pipeline, and automatically tear them down once the tests are done. Quickly and painlessly spin up databases at the click of a button for testing, bug reproduction, demos, and more with built-in container orchestration. Use our patented subsetter to shrink PBs down to GBs without breaking referential integrity, then leverage Tonic Ephemeral to spin up a database with only the data needed for development to cut cloud costs and maximize efficiency. Pair our patented subsetter with Tonic Ephemeral to get all the data subsets you need for only as long as you need them. Maximize efficiency by getting your developers access to one-off datasets for local development.
    Starting Price: $199 per month
  • 8
    PuppyGraph

    PuppyGraph empowers you to seamlessly query one or multiple data stores as a unified graph model. Graph databases are expensive, take months to set up, and need a dedicated team. Traditional graph databases can take hours to run multi-hop queries and struggle beyond 100GB of data. A separate graph database complicates your architecture with brittle ETLs and inflates your total cost of ownership (TCO). Connect to any data source anywhere, with cross-cloud and cross-region graph analytics. No complex ETLs or data replication are required. PuppyGraph enables you to query your data as a graph by directly connecting to your data warehouses and lakes, eliminating the need to build and maintain the time-consuming ETL pipelines required by a traditional graph database setup. No more waiting for data and failed ETL processes. PuppyGraph eradicates graph scalability issues by separating computation and storage.
    Starting Price: Free
  • 9
    Timeplus

    Timeplus is a simple, powerful, and cost-efficient stream processing platform. All in a single binary, easily deployed anywhere. We help data teams process streaming and historical data quickly and intuitively, in organizations of all sizes and industries. Lightweight, single binary, without dependencies. End-to-end streaming and historical analytics functionality. 1/10 the cost of similar open source frameworks. Turn real-time market and transaction data into real-time insights. Leverage append-only streams and key-value streams to monitor financial data. Implement real-time feature pipelines using Timeplus. One platform for all infrastructure logs, metrics, and traces, the three pillars supporting observability. Timeplus supports a wide range of data sources in its web console UI. You can also push data via REST API, or create external streams without copying data into Timeplus.
    Starting Price: $199 per month
  • 10
    Timbr.ai

    Timbr is the ontology-based semantic layer used by leading enterprises to make faster, better decisions with ontologies that transform structured data into AI-ready knowledge. By unifying enterprise data into a SQL-queryable knowledge graph, Timbr makes relationships, metrics, and context explicit, enabling both humans and AI to reason over data with accuracy and speed. Its open, modular architecture connects directly to existing data sources, virtualizing and governing them without replication. The result is a dynamic, easily accessible model that powers analytics, automation, and LLMs through SQL, APIs, SDKs, and natural language. Timbr lets organizations operationalize AI on their data - securely, transparently, and without dependence on proprietary stacks - maximizing data ROI and enabling teams to focus on solving problems instead of managing complexity.
    Starting Price: $599/month
  • 11
    Amazon Data Firehose
    Easily capture, transform, and load streaming data. Create a delivery stream, select your destination, and start streaming real-time data with just a few clicks. Automatically provision and scale compute, memory, and network resources without ongoing administration. Transform raw streaming data into formats like Apache Parquet, and dynamically partition streaming data without building your own processing pipelines. Amazon Data Firehose provides the easiest way to acquire, transform, and deliver data streams within seconds to data lakes, data warehouses, and analytics services. To use Amazon Data Firehose, you set up a stream with a source, destination, and required transformations. Amazon Data Firehose continuously processes the stream, automatically scales based on the amount of data available, and delivers it within seconds. Select the source for your data stream or write data using the Firehose Direct PUT API.
    Starting Price: $0.075 per month
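    The Direct PUT flow described above can be sketched with boto3. This is a minimal illustration, not an official recipe: the stream name and event payload are hypothetical, and because Firehose concatenates record payloads at the destination, a trailing newline is commonly appended to each JSON record so events stay splittable downstream.

```python
import json


def encode_record(event: dict) -> dict:
    """Format one event for the Firehose Direct PUT API.

    Firehose concatenates record payloads, so the trailing newline
    keeps JSON events separable in the delivered objects.
    """
    return {"Data": (json.dumps(event) + "\n").encode("utf-8")}


def send_events(events, stream_name="example-stream"):
    # boto3 is imported lazily so encode_record can be used (and tested)
    # without AWS credentials; stream_name is a made-up example.
    import boto3

    firehose = boto3.client("firehose")
    # PutRecordBatch accepts up to 500 records per call.
    return firehose.put_record_batch(
        DeliveryStreamName=stream_name,
        Records=[encode_record(e) for e in events],
    )


if __name__ == "__main__":
    print(encode_record({"user": "a", "clicks": 3}))
```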
  • 12
    MLJAR Studio
    It's a desktop app with Jupyter Notebook and Python built in, installed with just one click. It includes interactive code snippets and an AI assistant to make coding faster and easier, perfect for data science projects. We manually hand-crafted over 100 interactive code recipes that you can use in your data science projects. Code recipes detect packages available in the current environment. Install needed modules with one click, literally. You can create and interact with all variables available in your Python session. Interactive recipes speed up your work. The AI Assistant has access to your current Python session, variables, and modules; this broad context makes it smart. Our AI Assistant was designed to solve data problems with the Python programming language. It can help you with plots, data loading, data wrangling, machine learning, and more. Use AI to quickly solve issues with code, just click the Fix button. The AI assistant will analyze the error and propose a solution.
    Starting Price: $20 per month
  • 13
    QStudio

    TimeStored

    QStudio is a free, modern SQL editor supporting over 30 databases, including MySQL, PostgreSQL, and DuckDB. It offers features such as server browsing for easy viewing of tables, variables, functions, and configuration settings; SQL syntax highlighting; code completion; the ability to query servers directly from the editor; and built-in charts for data visualization. QStudio runs on Windows, Mac, and Linux, providing particularly good support for kdb+, Parquet, PRQL, and DuckDB. Additional functionalities include data pivoting similar to Excel, exporting data to Excel or CSV, and AI-powered tools like Text2SQL for generating queries from plain English, Explain-My-Query for code walkthroughs, and Explain-My-Error for debugging assistance. Simply send the query you want and select the chart type to draw a chart. Send queries straight from within the editor to any of your servers. All data structures are handled perfectly.
    Starting Price: Free
  • 14
    Streamkap

    Streamkap is a streaming data platform that makes streaming as easy as batch. Stream data from databases (via change data capture) or event sources to your favorite database, data warehouse, or data lake. Streamkap can be deployed as a SaaS or in a bring-your-own-cloud (BYOC) deployment.
    Starting Price: $600 per month
  • 15
    Tad

    Tad is a free (MIT-licensed) desktop application for viewing and analyzing tabular data. It is a fast viewer for CSV and Parquet files, as well as SQLite and DuckDB databases, with support for large files. It's a pivot table for analyzing and exploring data. Internally, Tad uses DuckDB for fast, accurate processing. Designed to fit into the workflow of data engineers and data scientists. Recent releases include an update to DuckDB 1.0, the ability to export filtered tables as Parquet (as well as CSV), a fix for formatting numbers in scientific notation, and other minor bug fixes and dependency upgrades. A packaged installer for Tad is available for macOS (x86 and Apple Silicon), Linux, and Windows.
    Starting Price: Free
  • 16
    Apache DataFusion

    Apache Software Foundation

    Apache DataFusion is an extensible, high-performance query engine written in Rust that utilizes Apache Arrow as its in-memory format. Designed for developers building data-centric systems such as databases, data frames, machine learning, and streaming applications, DataFusion offers SQL and DataFrame APIs, a vectorized, multi-threaded, streaming execution engine, and support for partitioned data sources. It natively supports formats like CSV, Parquet, JSON, and Avro, and allows for seamless integration with object stores including AWS S3, Azure Blob Storage, and Google Cloud Storage. The engine features a comprehensive query planner, a state-of-the-art optimizer with capabilities like expression coercion and simplification, projection and filter pushdown, sort and distribution-aware optimizations, and automatic join reordering. DataFusion is highly customizable, enabling the addition of user-defined scalar, aggregate, and window functions, custom data sources, query languages, etc.
    Starting Price: Free
  • 17
    OpenObserve

    OpenObserve is an open source observability platform for logs, metrics, and traces that emphasizes high performance, scalability, and dramatically lower cost. It supports petabyte-scale observability thanks to features like data compression using columnar storage and the ability to use “bring your own bucket” storage (local disk, S3, GCS, Azure Blob, etc.). It is written in Rust, uses the DataFusion query engine to directly query Parquet files, and provides a stateless, horizontally scalable architecture with caching (both result and disk) to maintain speed under heavy load. It embraces open standards (OpenTelemetry compatibility, vendor-neutral APIs), so it fits into existing monitoring/logging workflows. Key modules include logs, metrics, traces, frontend monitoring, pipelines, alerts, and dashboards/visualizations.
    Starting Price: $0.30 per GB
  • 18
    Querri

    Querri is an AI-powered data analytics platform designed to make data collaboration effortless by enabling users to connect, clean, analyze, and visualize data all in one place. It features a natural-language interface that lets you ask questions in plain English and instantly get visual answers. It also includes automated data cleansing and ingestion tools that handle messy or disparate files (CSV, Excel, JSON, Parquet) and cloud-storage sources (Google Drive, OneDrive, Dropbox), joining and reformatting them so you can start analyzing without delay. A drag-and-drop dashboard builder enables the quick creation of shareable reports, while built-in integrations cover spreadsheets and business apps (Excel, Smartsheet, QuickBooks, Airtable, among others). Querri offers white-label capabilities so you can embed or brand the analytics engine within your own product.
    Starting Price: $16 per month
  • 19
    Sliq

    Sliq is an AI-powered data cleaning platform that transforms messy raw datasets into clean, analysis-ready data in minutes by automatically detecting and fixing common quality issues such as incorrect formats, missing values, schema inconsistencies, and formatting errors, so analysts and engineers spend less time on “janitor work” and more time on insights and modeling. It uses context-aware intelligence to understand the semantic domain of uploaded data (for example, whether it’s financial records, ecommerce logs, or medical data) and tailors a cleaning plan specifically for that dataset instead of applying one-size-fits-all rules. Users can upload files directly or integrate with workflows programmatically, and Sliq supports common data formats, including CSV, JSON, and Parquet, while seamlessly integrating into existing data ecosystems.
    Starting Price: $30
  • 20
    OrcaSheets

    OrcaSheets is a local-first analytics platform that enables teams to analyze large datasets using a spreadsheet-style interface combined with powerful data processing capabilities. The platform connects to multiple data sources such as databases, warehouses, APIs, and flat files, allowing organizations to unify data from different systems into a single workspace. OrcaSheets can process billions of rows directly on a user’s hardware, delivering fast query performance without relying entirely on cloud infrastructure. Users can explore data using plain English queries or switch to SQL for advanced analysis, making the platform accessible to both business users and data professionals. By combining spreadsheet simplicity with high-performance analytics, OrcaSheets helps teams run financial reporting, operational analysis, and growth analytics more efficiently.
    Starting Price: $0
  • 21
    Warp 10
    Warp 10 is a modular open source platform that collects, stores, and analyzes data from sensors. Shaped for the IoT with a flexible data model, Warp 10 provides a unique and powerful framework to simplify your processes from data collection to analysis and visualization, with support for geolocated data in its core model (called Geo Time Series). Warp 10 is both a time series database and a powerful analytics environment, allowing you to perform statistics, feature extraction for model training, data filtering and cleaning, pattern and anomaly detection, synchronization, and even forecasting. The analysis environment can be integrated within a large ecosystem of software components such as Spark, Kafka Streams, Hadoop, Jupyter, Zeppelin, and many more. It can also access data stored in many existing solutions: relational or NoSQL databases, search engines, and S3-type object storage systems.
  • 22
    Gravity Data
    Gravity's mission is to make streaming data easy from over 100 sources while only paying for what you use. Gravity removes the reliance on engineering teams to deliver streaming pipelines, with a simple interface to get streaming up and running in minutes from databases, event data, and APIs. Everyone in the data team can now build with simple point-and-click so that you can focus on building apps, services, and customer experiences. Full execution trace and detailed error messaging for quick diagnosis and resolution. We have implemented new, feature-rich ways for you to quickly get started, from bulk set-up, default schemas, and data selection to different job modes and statuses. Spend less time wrangling with infrastructure and more time analysing data while allowing our intelligent engine to keep your pipelines running. Gravity integrates with your systems for notifications and orchestration.
  • 23
    Autymate

    Our one-time, no-code integrations work with 200+ of the world’s biggest platforms. From HR and payroll to managing customers and vendors, you can connect everyone with everything without lifting a finger. We made our interface so intuitive that it looks like you are doing the automation within QuickBooks itself. Seamlessly integrate QuickBooks and your accounting systems, eliminating data entry and boosting your team's productivity. Make accounting effortless for your franchise business. Stay ahead of your competition and make your customers stay longer with a white-labeled accounting automation app. Connect your enterprise's most complex systems in one easy workflow and automate all the busy work in between. Your accountants can do what they love doing and work on more meaningful tasks that have a more significant impact.
  • 24
    GribStream

    GribStream is a fast and efficient historical weather forecast API that provides high-speed access to both historical and real-time weather data from the National Blend of Models (NBM) and the Global Forecast System (GFS). Designed for companies, meteorologists, and researchers, GribStream enables users to retrieve tens of thousands of hourly data points, spanning months, in a single HTTP request that completes within seconds. The platform offers a simple API with open source clients and clear documentation, facilitating quick integration. It supports various result formats, including CSV, Parquet, JSON lines, and image formats like PNG, JPG, and TIFF. Users can specify locations by latitude and longitude, and define time ranges for data retrieval. GribStream is actively developing features such as additional datasets, result formats, aggregation methods, and notification systems.
    Starting Price: $9.90 per month
  • 25
    CSViewer

    EasyMorph

    CSViewer is a fast and free Windows desktop application for viewing and analyzing large delimited text and binary data files, such as CSV, TSV, Parquet, and QVD. It can load millions of rows in seconds and offers advanced filtering, instant profiling with aggregates, null counts, and outlier detection. Users can export filtered data, save analysis views, and visualize data through charts and cross-tables. CSViewer is designed for easy exploratory analysis without sending data to the cloud. Aggregates and charts are updated immediately when a filter is applied or changed. Null counts, uniques, min/max values, aggregates, etc., are shown for every column. Export the filtered subset of rows into another file for sharing or use in another application. Convert data from one file format to another, e.g. from CSV to QVD. When you export into CSViewer's native .dset file format, data is saved together with filters and charts for your convenience.
    Starting Price: Free
  • 26
    Astera Dataprep
    Astera Dataprep is an AI-powered, chat-based data preparation solution that lets users clean, transform, and ready raw data for analysis, reporting, and integration using natural language commands through a simple conversational interface, eliminating the need for coding, complex formulas, or technical skills. You describe what you want in plain English, and it performs actions like merging, filtering, deduplicating, reshaping, and transforming data in real time while showing an interactive, Excel-like preview of changes. It connects to diverse sources such as spreadsheets, CSV files, database tables, and cloud storage, so you can combine multi-source data in one workspace, visualize data quality issues like missing values and duplicates, fix them instantly, and ensure consistent, accurate results. Users can save preparation steps as reusable workflows, schedule automated jobs to keep data up to date, and export clean data to analytics or BI tools.
  • 27
    Tictable

    Tictable is a minimalist, AI-powered data studio designed to help users work with everything from small datasets to massive data collections through a fast, browser-based environment. It combines the familiarity of spreadsheets with the power of a built-in SQL engine, allowing users to run queries directly in the browser without server round-trips, ensuring near-instant results and smooth performance even with millions of rows. It connects to multiple data sources such as CSV, JSON, Parquet, and local databases, automatically importing, cleaning, and structuring data through its “magic import” system, which detects formatting issues and prepares datasets for immediate use. Tictable integrates an agentic AI assistant that can explore data, generate filters, create formulas, and build reports from natural language prompts, executing queries in real time to transform raw data into actionable insights.
    Starting Price: $30 per month
  • 28
    Mage Sensitive Data Discovery
    Uncover hidden sensitive data locations within your enterprise through Mage's patented Sensitive Data Discovery module. Find data hidden in all types of data stores in the most obscure locations, be it structured, unstructured, Big Data, or on the cloud. Leverage the power of artificial intelligence and natural language processing to uncover data in the most complex of locations. Ensure efficient identification of sensitive data, with minimal false positives, through a patented approach to data discovery. Configure any additional data classifications over and above the 70+ out-of-the-box data classifications covering all popular PII and PHI data. Schedule sample, full, or even incremental scans through a simplified discovery process.
  • 29
    Hadoop

    Apache Software Foundation

    The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failure. A wide variety of companies and organizations use Hadoop for both research and production. Users are encouraged to add themselves to the Hadoop PoweredBy wiki page. Apache Hadoop 3.3.4 incorporates a number of significant enhancements over the previous major release line (hadoop-3.2).
  • 30
    Blotout

    Activate customer journeys with complete visibility using infrastructure-as-code. Blotout’s SDK offers companies all of the analytics and remarketing tools they are accustomed to, while offering best-in-class privacy preservation for the company’s users. Blotout’s SDK is out of the box compliant with GDPR, CCPA & COPPA. Blotout’s SDK uses on-device, distributed edge computing for analytics, messaging and remarketing, all without using user personal data, device IDs or IP addresses. Measure, attribute, optimize, and activate customer data with 100% customer coverage. The only stack that gives you the complete customer lifecycle by unifying event, online, and offline data sources. Establish a trusted data relationship with your customers to build loyalty and maintain compliance with the GDPR and global privacy laws.
  • 31
    IBM Db2 Event Store
    IBM Db2 Event Store is a cloud-native database system designed to handle massive amounts of structured data stored in Apache Parquet format. Because it is optimized for event-driven data processing and analysis, this high-speed data store can capture, analyze, and store more than 250 billion events per day. The data store is flexible and scalable to adapt quickly to your changing business needs. With the Db2 Event Store service, you can create these data stores in your Cloud Pak for Data cluster so that you can govern the data and use it for more in-depth analysis. It can rapidly ingest large amounts of streaming data (up to one million inserts per second per node) and use it for real-time analytics with integrated machine learning capabilities, for example, analyzing incoming data from different medical devices in real time to provide better health outcomes for patients while reducing the cost of moving the data to storage.
  • 32
    Meltano

    Meltano provides the ultimate flexibility in deployment options. Own your data stack, end to end. An ever-growing library of 300+ connectors that have been running in production for years. Run workflows in isolated environments, execute end-to-end tests, and version control everything. Open source gives you the power to build your ideal data stack. Define your entire project as code and collaborate confidently with your team. The Meltano CLI enables you to rapidly create your project, making it easy to start replicating data. Meltano is designed to be the best way to run dbt to manage your transformations. Your entire data stack is defined in your project, making it simple to deploy to production. Validate your changes in development before moving to CI, and in staging before moving to production.
  • 33
    Semarchy xDI
    Experience Semarchy’s flexible unified data platform to empower better business decisions enterprise-wide. Integrate all your data with xDI, the high-performance, agile, and extensible data integration for all styles and use cases. Its single technology federates all forms of data integration, and mapping converts business rules into deployable code. xDI has extensible and open architecture supporting on-premise, cloud, hybrid, and multi-cloud environments.
  • 34
    Amazon SageMaker Data Wrangler
    Amazon SageMaker Data Wrangler reduces the time it takes to aggregate and prepare data for machine learning (ML) from weeks to minutes. With SageMaker Data Wrangler, you can simplify the process of data preparation and feature engineering, and complete each step of the data preparation workflow (including data selection, cleansing, exploration, visualization, and processing at scale) from a single visual interface. You can use SQL to select the data you want from a wide variety of data sources and import it quickly. Next, you can use the Data Quality and Insights report to automatically verify data quality and detect anomalies, such as duplicate rows and target leakage. SageMaker Data Wrangler contains over 300 built-in data transformations so you can quickly transform data without writing any code. Once you have completed your data preparation workflow, you can scale it to your full datasets using SageMaker data processing jobs; train, tune, and deploy models.
  • 35
    APERIO DataWise
    Data is used in every aspect of a processing plant or facility; it underlies most operational processes, most business decisions, and most environmental events. Failures are often attributed to this same data, in terms of operator error, bad sensors, safety or environmental events, or poor analytics. This is where APERIO can alleviate these problems. Data integrity is a key element of Industry 4.0, the foundation upon which more advanced applications, such as predictive models, process optimization, and custom AI tools, are developed. APERIO DataWise is the industry-leading provider of reliable, trusted data. Automate the quality of your PI data or digital twins continuously and at scale. Ensure validated data across the enterprise to improve asset reliability. Empower operators to make better decisions. Detect threats to operational data to ensure operational resilience. Accurately monitor and report sustainability metrics.
  • 36
    3LC
    Light up the black box and pip install 3LC to gain the clarity you need to make meaningful changes to your models in moments. Remove the guesswork from your model training and iterate fast. Collect per-sample metrics and visualize them in your browser. Analyze your training and eliminate issues in your dataset. Model-guided, interactive data debugging and enhancements. Find important or inefficient samples. Understand what samples work and where your model struggles. Improve your model in different ways by weighting your data. Make sparse, non-destructive edits to individual samples or in a batch. Maintain a lineage of all changes and restore any previous revisions. Dive deeper than standard experiment trackers with per-sample per epoch metrics and data tracking. Aggregate metrics by sample features, rather than just epoch, to spot hidden trends. Tie each training run to a specific dataset revision for full reproducibility.
  • 37
    Arroyo
    Scale from zero to millions of events per second. Arroyo ships as a single, compact binary. Run locally on macOS or Linux for development, and deploy to production with Docker or Kubernetes. Arroyo is a new kind of stream processing engine, built from the ground up to make real-time easier than batch. Arroyo was designed from the start so that anyone with SQL experience can build reliable, efficient, and correct streaming pipelines. Data scientists and engineers can build end-to-end real-time applications, models, and dashboards, without a separate team of streaming experts. Transform, filter, aggregate, and join data streams by writing SQL, with sub-second results. Your streaming pipelines shouldn't page someone just because Kubernetes decided to reschedule your pods. Arroyo is built to run in modern, elastic cloud environments, from simple container runtimes like Fargate to large, distributed deployments on Kubernetes.
  • 38
    e6data
    Competition in this space is limited by deep barriers to entry: specialized know-how, massive capital needs, and long time-to-market. Existing platforms are indistinguishable in price and performance, reducing the incentive to switch, and migrating from one engine’s SQL dialect to another involves months of effort. e6data offers truly format-neutral computing, interoperable with all major open standards. Enterprise data leaders are hit by an unprecedented explosion in computing demand for data intelligence. They are surprised to find that 10% of their heavy, compute-intensive use cases consume 80% of the cost, engineering effort, and stakeholder complaints. Unfortunately, such workloads are also mission-critical and non-discretionary. e6data amplifies ROI on enterprises' existing data platforms and architecture. e6data’s truly format-neutral compute has the unique distinction of being equally efficient and performant across leading data lakehouse table formats.
  • 39
    Gable
    Data contracts facilitate communication between data teams and developers. Don’t just detect problematic changes, prevent them at the application level. Detect every change, from every data source, using AI-based asset registration. Drive the adoption of data initiatives with upstream visibility and impact analysis. Shift left both data ownership and management through data governance as code and data contracts. Build data trust through the timely communication of data quality expectations and changes. Eliminate data issues at the source by seamlessly integrating our AI-driven technology. Everything you need to make your data initiative a success. Gable is a B2B data infrastructure SaaS that provides a collaboration platform to author and enforce data contracts. Data contracts are API-based agreements between the software engineers who own upstream data sources and the data engineers and analysts who consume that data to build machine learning models and analytics.
  • 40
    Tenzir
    Tenzir is a data pipeline engine specifically designed for security teams, facilitating the collection, transformation, enrichment, and routing of security data throughout its lifecycle. It enables users to seamlessly gather data from various sources, parse unstructured data into structured formats, and transform it as needed. It optimizes data volume, reduces costs, and supports mapping to standardized schemas like OCSF, ASIM, and ECS. Tenzir ensures compliance through data anonymization features and enriches data by adding context from threats, assets, and vulnerabilities. It supports real-time detection and stores data efficiently in Parquet format within object storage systems. Users can rapidly search and materialize necessary data and reactivate at-rest data back into motion. Tenzir is built for flexibility, allowing deployment as code and integration into existing workflows, ultimately aiming to reduce SIEM costs and provide full control.
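    Setting Tenzir's own pipeline syntax aside, the Parquet storage step it describes can be sketched generically with pyarrow, the reference Parquet library for Python. The event fields below are illustrative only, not Tenzir's actual schema or OCSF mapping.

    ```python
    # Minimal sketch: write security-event records to a Parquet file and read
    # them back with pyarrow. Field names are hypothetical examples.
    import os
    import tempfile

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Build a columnar table of example events.
    events = pa.table({
        "timestamp": ["2024-01-01T00:00:00Z", "2024-01-01T00:00:05Z"],
        "source_ip": ["10.0.0.1", "10.0.0.2"],
        "severity": [3, 7],
    })

    # Parquet is columnar and compresses well; zstd is one common choice.
    path = os.path.join(tempfile.mkdtemp(), "events.parquet")
    pq.write_table(events, path, compression="zstd")

    # Reading the file back yields an identical table.
    restored = pq.read_table(path)
    assert restored.equals(events)
    ```

    In practice the file would land in object storage (e.g. S3) rather than a local temp directory, which pyarrow also supports via its filesystem interfaces.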
  • 41
    SDF
    SDF is a developer platform for data that enhances SQL comprehension across organizations, enabling data teams to unlock the full potential of their data. It provides a transformation layer to streamline query writing and management, an analytical database engine for local execution, and an accelerator for improved transformation processes. SDF also offers proactive quality and governance features, including reports, contracts, and impact analysis, to ensure data integrity and compliance. By representing business logic as code, SDF facilitates the classification and management of data types, enhancing the clarity and maintainability of data models. It integrates seamlessly with existing data workflows, supporting various SQL dialects and cloud environments, and is designed to scale with the growing needs of data teams. SDF's open-core architecture, built on Apache DataFusion, allows for customization and extension, fostering a collaborative ecosystem for data development.
  • 42
    Visplore

    Visplore GmbH

    Visplore is a visual analytics software solution for rapid industrial troubleshooting and root-cause analysis. When KPIs and simple trends are not enough and action is time-critical, it complements dashboards with guided forensic “why” analyses that deliver insights for problem-solving and process optimization. It works across the entire IT/OT landscape, from process and asset data to quality and material data, and is easy to use for all engineers.
    - Guided, transparent root-cause analysis with intuitive visuals: no black boxes, no complex modeling
    - Works with your data, where it lives
    - Seamless IT/OT connectivity
    - From troubleshooting to standardized best practice
    - Proven templates, excellent expert support, and workflows that scale into automated monitoring and reporting
    Compared to other data analysis tools such as Seeq and TrendMiner, Visplore is built for everyday engineering use, making industrial data analysis accessible, repeatable, and ready for action.
  • 43
    SSIS Integration Toolkit
    Jump right to our product page to see our full range of data integration software, including solutions for SharePoint and Active Directory. With over 300 individual data integration tools for connectivity and productivity, our data integration solutions allow developers to take advantage of the flexibility and power of the SSIS ETL engine to integrate virtually any application or data source. You don't have to write a single line of code to make data integration happen so your development can be done in a matter of minutes. We make the most flexible integration solution on the market. Our software offers intuitive user interfaces that are flexible and easy to use. With a streamlined development experience and an extremely simple licensing model, our solution offers the best value for your investment. Our software offers many specifically designed features that help you achieve the best possible performance without having to hijack your budget.
  • 44
    Data Sentinel
    As a business leader, you need to trust your data and be 100% certain that it’s well-governed, compliant, and accurate. Including all data, in all sources, and in all locations, without limitations. Understand your data assets. Audit for risk, compliance, and quality in support of your project. Catalog a complete data inventory across all sources and data types, creating a shared understanding of your data assets. Run a one-time, fast, affordable, and accurate audit of your data. PCI, PII, and PHI audits are fast, accurate, and complete. As a service, with no software to purchase. Measure and audit data quality and data duplication across all of your enterprise data assets, cloud-native and on-premises. Comply with global data privacy regulations at scale. Discover, classify, track, trace and audit privacy compliance. Monitor PII/PCI/PHI data propagation and automate DSAR compliance processes.
  • 45
    Mage Platform

    Mage Data
    Mage Data™ is the leading solutions provider of data security and data privacy software for global enterprises. Built upon a patented and award-winning solution, the Mage platform enables organizations to stay on top of privacy regulations while ensuring the security and privacy of their data. Top Swiss banks, Fortune 10 organizations, Ivy League universities, and industry leaders in financial services and healthcare protect their sensitive data with the Mage platform for data privacy and security. Deploying state-of-the-art privacy-enhancing technologies for securing data, Mage Data™ delivers robust data security while ensuring the privacy of individuals. Visit the website to explore the company’s solutions.