Universal Sentence Encoder
The Universal Sentence Encoder (USE) encodes text into high-dimensional vectors that can be used for tasks such as text classification, semantic similarity, and clustering. It comes in two model variants, one based on the Transformer architecture and one based on a Deep Averaging Network (DAN), which trade accuracy against computational cost. The Transformer-based model produces context-sensitive embeddings by processing the entire input sequence at once, while the DAN-based model averages the input word embeddings and passes the result through a feedforward neural network. The resulting embeddings make semantic similarity calculations efficient and improve performance on downstream tasks with minimal supervised training data. USE is available through TensorFlow Hub, which simplifies integration into applications.
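The DAN idea described above (average the word embeddings, then apply a feedforward layer) can be illustrated with a minimal pure-Python sketch. This is not the USE implementation: the toy vocabulary, the single tanh layer, and the names `average_vectors`, `dense_tanh`, and `dan_encode` are all assumptions made for illustration.

```python
import math


def average_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def dense_tanh(x, weights, bias):
    """One feedforward layer: tanh(W x + b), with W given as a list of rows."""
    return [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, bias)
    ]


def dan_encode(tokens, embeddings, weights, bias):
    """Average the embeddings of known tokens, then apply one dense layer."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("no known tokens to encode")
    return dense_tanh(average_vectors(vectors), weights, bias)


if __name__ == "__main__":
    # Hypothetical 2-dimensional word embeddings and layer parameters.
    toy_embeddings = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}
    toy_weights = [[1.0, 0.0], [0.0, 1.0]]
    toy_bias = [0.0, 0.0]
    print(dan_encode(["good", "movie"], toy_embeddings, toy_weights, toy_bias))
```

Because the first step is a plain average, word order does not affect the result, which is one reason an averaging encoder is cheaper but less context-sensitive than a Transformer.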
fastText
fastText is a free, open source, lightweight library developed by Facebook's AI Research (FAIR) lab for efficient learning of word representations and text classification. It supports both unsupervised learning of word vectors and supervised learning for text classification tasks. A key feature of fastText is its ability to capture subword information by representing words as bags of character n-grams, which improves the handling of morphologically rich languages and out-of-vocabulary words. The library is optimized for performance and can train on large datasets quickly, and the resulting models can be reduced in size for deployment on mobile devices. Pre-trained word vectors are available for 157 languages, trained on Common Crawl and Wikipedia data, and can be downloaded for immediate use. fastText also offers aligned word vectors for 44 languages, which support cross-lingual natural language processing tasks.
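To make the "bag of character n-grams" idea concrete, here is a small sketch that extracts the n-grams of a word after wrapping it in boundary markers. It illustrates the technique only and is not fastText's own code; the function name `char_ngrams` is hypothetical, and the default range of 3 to 6 characters is an assumption taken from the range described in the fastText paper.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, including boundary markers.

    The word is wrapped in '<' and '>' so that prefixes and suffixes
    produce n-grams distinct from the same letters inside a word.
    """
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


if __name__ == "__main__":
    print(char_ngrams("where", min_n=3, max_n=3))
    # → ['<wh', 'whe', 'her', 'ere', 're>']
```

An out-of-vocabulary word still shares many of these n-grams with known words, so a vector can be composed for it from the n-gram vectors learned during training.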
Gensim
Gensim is a free, open source Python library designed for unsupervised topic modeling and natural language processing, with a focus on large-scale semantic modeling. It supports training models such as Word2Vec, FastText, Latent Semantic Analysis (LSA), and Latent Dirichlet Allocation (LDA), which represent documents as semantic vectors and help discover semantically related documents. Gensim is optimized for performance with efficient implementations in Python and Cython, and it can process arbitrarily large corpora by using data streaming and incremental algorithms instead of loading the entire dataset into RAM. It is platform-independent, running on Linux, Windows, and macOS, and is licensed under the GNU LGPL, which permits both personal and commercial use. The library is widely adopted, with thousands of companies using it daily, over 2,600 academic citations, and more than 1 million downloads per week.
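The streaming approach mentioned above can be sketched with a restartable iterable that reads one document per line and never holds the whole file in memory. The sketch uses only the standard library; the class name `LineCorpus`, the one-document-per-line layout, and the whitespace tokenizer are assumptions for illustration. An object of this shape is the kind of input that Gensim's training classes are designed to consume, but the Gensim call itself is left out here.

```python
import os
import tempfile


class LineCorpus:
    """Stream a text file as token lists, one document per line.

    Each call to __iter__ reopens the file, so the corpus can be
    traversed several times (as multi-pass training requires) while
    keeping only a single line in memory at a time.
    """

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.lower().split()
                if tokens:  # skip blank lines
                    yield tokens


if __name__ == "__main__":
    # Write a tiny throwaway corpus to demonstrate two passes over it.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.write("Human machine interface\n\nGraph of trees\n")
    corpus = LineCorpus(tmp.name)
    print(list(corpus))
    print(sum(len(doc) for doc in corpus))
    os.remove(tmp.name)
```

A plain generator would be exhausted after one pass, which is why the sketch uses a class with `__iter__` rather than a generator function.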
Gemini Embedding 2
Gemini Embedding models, including the newer Gemini Embedding 2, are part of Google’s Gemini AI ecosystem and are designed to convert text (from short phrases to full sentences) and code into numerical vector representations that capture semantic meaning. Unlike generative models, which produce new content, an embedding model maps its input to dense vectors, so that software can compare and analyze information by conceptual similarity rather than exact wording. These embeddings support applications such as semantic search, recommendation systems, document retrieval, clustering, classification, and retrieval-augmented generation pipelines. The model can process input in more than 100 languages and supports up to 2048 tokens per request, allowing it to embed longer pieces of text or code while maintaining strong contextual understanding.
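Comparing inputs "by conceptual similarity" usually comes down to a vector comparison such as cosine similarity. The sketch below shows that step on hand-written toy vectors; it does not call any Gemini API, and the function names `cosine_similarity` and `rank_by_similarity`, along with the example vectors, are assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def rank_by_similarity(query, candidates):
    """Return candidate names sorted from most to least similar to `query`."""
    return sorted(
        candidates,
        key=lambda name: cosine_similarity(query, candidates[name]),
        reverse=True,
    )


if __name__ == "__main__":
    # Hypothetical 3-dimensional embeddings; real models return far more dimensions.
    documents = {
        "doc_cats": [0.9, 0.1, 0.0],
        "doc_finance": [0.0, 0.2, 0.9],
    }
    query_vector = [1.0, 0.0, 0.0]
    print(rank_by_similarity(query_vector, documents))
```

In a semantic search or retrieval-augmented generation pipeline, the query and the documents would be embedded by the same model before this ranking step.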