ML.NET
ML.NET is a free, open source, and cross-platform machine learning framework designed for .NET developers to build custom machine learning models using C# or F# without leaving the .NET ecosystem. It supports various machine learning tasks, including classification, regression, clustering, anomaly detection, and recommendation systems. ML.NET integrates with other popular ML frameworks like TensorFlow and ONNX, enabling additional scenarios such as image classification and object detection. It offers tools like Model Builder and the ML.NET CLI, which utilize Automated Machine Learning (AutoML) to simplify the process of building, training, and deploying high-quality models. These tools automatically explore different algorithms and settings to find the best-performing model for a given scenario.
Learn more
MLlib
Apache Spark's MLlib is a scalable machine learning library that integrates seamlessly with Spark's APIs, supporting Java, Scala, Python, and R. It offers a comprehensive suite of algorithms and utilities, including classification, regression, clustering, collaborative filtering, and tools for constructing machine learning pipelines. MLlib's high-quality algorithms leverage Spark's iterative computation capabilities, delivering performance up to 100 times faster than traditional MapReduce implementations. It is designed to operate across diverse environments, running on Hadoop, Apache Mesos, Kubernetes, standalone clusters, or in the cloud, and accessing various data sources such as HDFS, HBase, and local files. This flexibility makes MLlib a robust solution for scalable and efficient machine learning tasks within the Apache Spark ecosystem.
Learn more
Dask
Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, pandas, and scikit-learn. Dask uses existing Python APIs and data structures to make it easy to switch between NumPy, pandas, scikit-learn to their Dask-powered equivalents. Dask's schedulers scale to thousand-node clusters and its algorithms have been tested on some of the largest supercomputers in the world. But you don't need a massive cluster to get started. Dask ships with schedulers designed for use on personal machines. Many people use Dask today to scale computations on their laptop, using multiple cores for computation and their disk for excess storage. Dask exposes lower-level APIs letting you build custom systems for in-house applications. This helps open source leaders parallelize their own packages and helps business leaders scale custom business logic.
Learn more
Gensim
Gensim is a free, open source Python library designed for unsupervised topic modeling and natural language processing, focusing on large-scale semantic modeling. It enables the training of models like Word2Vec, FastText, Latent Semantic Analysis (LSA), and Latent Dirichlet Allocation (LDA), facilitating the representation of documents as semantic vectors and the discovery of semantically related documents. Gensim is optimized for performance with highly efficient implementations in Python and Cython, allowing it to process arbitrarily large corpora using data streaming and incremental algorithms without loading the entire dataset into RAM. It is platform-independent, running on Linux, Windows, and macOS, and is licensed under the GNU LGPL, promoting both personal and commercial use. The library is widely adopted, with thousands of companies utilizing it daily, over 2,600 academic citations, and more than 1 million downloads per week.
Learn more