Showing 19 open source projects for "vocabulary"

View related business solutions
  • Go From AI Idea to AI App Fast Icon
    Go From AI Idea to AI App Fast

    One platform to build, fine-tune, and deploy ML models. No MLOps team required.

    Access Gemini 3 and 200+ models. Build chatbots, agents, or custom models with built-in monitoring and scaling.
    Try Free
  • Train ML Models With SQL You Already Know Icon
    Train ML Models With SQL You Already Know

    BigQuery automates data prep, analysis, and predictions with built-in AI assistance.

    Build and deploy ML models using familiar SQL. Automate data prep with built-in Gemini. Query 1 TB and store 10 GB free monthly.
    Try Free
  • 1
    SAM 3

    SAM 3

    Code for running inference and finetuning with SAM 3 model

    SAM 3 (Segment Anything Model 3) is a unified foundation model for promptable segmentation in both images and videos, capable of detecting, segmenting, and tracking objects. It accepts both text prompts (open-vocabulary concepts like “red car” or “goalkeeper in white”) and visual prompts (points, boxes, masks) and returns high-quality masks, boxes, and scores for the requested concepts. Compared with SAM 2, SAM 3 introduces the ability to exhaustively segment all instances of an open-vocabulary concept specified by a short phrase or exemplars, scaling to a vastly larger set of categories than traditional closed-set models. ...
    Downloads: 35 This Week
    Last Update:
    See Project
  • 2
    ML Ferret

    ML Ferret

    Refer and Ground Anything Anywhere at Any Granularity

    Ferret is Apple’s end-to-end multimodal large language model designed specifically for flexible referring and grounding: it can understand references of any granularity (boxes, points, free-form regions) and then ground open-vocabulary descriptions back onto the image. The core idea is a hybrid region representation that mixes discrete coordinates with continuous visual features, so the model can fluidly handle “any-form” referring while maintaining precise spatial localization. The repo presents the vision-language pipeline, model assets, and paper resources that show how Ferret answers questions, follows instructions, and returns grounded outputs rather than just text. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 3
    Mistral Finetune

    Mistral Finetune

    Memory-efficient and performant finetuning of Mistral's models

    ...The repo includes utilities for data preprocessing (e.g. reformat_data.py), validation scripts, and example YAML configs for training variants like 7B base or instruct models. It supports function-calling style datasets (via "messages" keys) as well as plain text formats, with guidelines on formatting, tokenization, and vocabulary extension (e.g. extending vocab to 32768 for some models) before finetuning. The project also provides tutorial notebooks (e.g. mistral_finetune_7b.ipynb) to walk through the steps.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 4
    minbpe

    minbpe

    Minimal, clean code for the Byte Pair Encoding (BPE) algorithm

    ...It operates on UTF-8 encoded bytes rather than Unicode characters, which makes it robust to arbitrary text inputs and avoids needing a language-specific character vocabulary. The repository is structured as a teaching-oriented implementation that shows how to train a tokenizer by learning merge rules, then apply those merges to encode text into token IDs and decode tokens back into text. It is intentionally small and readable so developers can understand each stage of BPE, including the mechanics of pair counting, merge application, and vocabulary growth. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • Full-stack observability with actually useful AI | Grafana Cloud Icon
    Full-stack observability with actually useful AI | Grafana Cloud

    Our generous forever free tier includes the full platform, including the AI Assistant, for 3 users with 10k metrics, 50GB logs, and 50GB traces.

    Built on open standards like Prometheus and OpenTelemetry, Grafana Cloud includes Kubernetes Monitoring, Application Observability, Incident Response, plus the AI-powered Grafana Assistant. Get started with our generous free tier today.
    Create free account
  • 5
    WeClone

    WeClone

    One-stop solution for creating your digital avatar from chat history

    ...It is intended primarily as an experimental exploration of digital personality modeling and conversational AI personalization. By processing large volumes of conversation data, WeClone can build a profile of an individual’s writing tone, vocabulary preferences, and conversational tendencies. Developers can use the resulting model to create chatbots that simulate a specific user’s communication patterns for testing or research purposes. Overall, WeClone explores the idea of digital identity replication through machine learning and conversational modeling.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 6
    seq2seq-couplet

    seq2seq-couplet

    Play couplet with seq2seq model

    ...In addition to local execution, the project includes Docker files, which make it easier to package and deploy the application in a more reproducible way. The repository also points users to an external dataset source and documents vocabulary formatting requirements for custom datasets, showing that it is meant for both experimentation and extension.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 7
    gTTS

    gTTS

    Python library and CLI tool to interface with Google Translate

    ...The library is designed to handle long texts, using a speech-specific sentence tokenizer that keeps intonation and punctuation natural while splitting requests into acceptable chunks. It supports customizable text pre-processors, which can correct pronunciations, tweak formatting, or handle domain-specific vocabulary before sending it to the API. gTTS is primarily aimed at developers who want a quick way to add cloud-backed speech to scripts, apps, or pipelines without managing any model weights locally. A small CLI utility, gtts-cli, makes it easy to test or batch-generate MP3 files right from the shell.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 8
    Chinese-LLaMA-Alpaca 2

    Chinese-LLaMA-Alpaca 2

    Chinese LLaMA-2 & Alpaca-2 Large Model Phase II Project

    ...It is the second phase of the Chinese LLaMA&Alpaca large model project. The Chinese LLaMA-2 base model and the Alpaca-2 instruction fine-tuning large model are open-sourced. These models expand and optimize the Chinese vocabulary on the basis of the original Llama-2, use large-scale Chinese data for incremental pre-training, and further improve the basic semantics and command understanding of Chinese. Performance improvements. The related model supports FlashAttention-2 training, supports 4K context and can be extended up to 18K+ through the NTK method.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 9
    Detic

    Detic

    Code release for "Detecting Twenty-thousand Classes

    Detic (“Detecting Twenty-thousand Classes using Image-level Supervision”) is a large-vocabulary object detector that scales beyond fully annotated datasets by leveraging image-level labels. It decouples localization from classification, training a strong box localizer on standard detection data while learning classifiers from weak supervision and large image-tag corpora. A shared region proposal backbone feeds a flexible classification head that can expand to tens of thousands of categories without exhaustive box annotations. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • Stop Storing Third-Party Tokens in Your Database Icon
    Stop Storing Third-Party Tokens in Your Database

    Auth0 Token Vault handles secure token storage, exchange, and refresh for external providers so you don't have to build it yourself.

    Rolling your own OAuth token storage can be a security liability. Token Vault securely stores access and refresh tokens from federated providers and handles exchange and renewal automatically. Connected accounts, refresh exchange, and privileged worker flows included.
    Try Auth0 for Free
  • 10
    Chinese-LLaMA-Alpaca-2 v2.0

    Chinese-LLaMA-Alpaca-2 v2.0

    Chinese LLaMA & Alpaca large language model + local CPU/GPU training

    This project has open-sourced the Chinese LLaMA model and the Alpaca large model with instruction fine-tuning to further promote the open research of large models in the Chinese NLP community. Based on the original LLaMA , these models expand the Chinese vocabulary and use Chinese data for secondary pre-training, which further improves the basic semantic understanding of Chinese. At the same time, the Chinese Alpaca model further uses Chinese instruction data for fine-tuning, which significantly improves the model's ability to understand and execute instructions.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 11
    CPT

    CPT

    CPT: A Pre-Trained Unbalanced Transformer

    A Pre-Trained Unbalanced Transformer for Both Chinese Language Understanding and Generation. We replace the old BERT vocabulary with a larger one of size 51271 built from the training data, in which we 1) add missing 6800+ Chinese characters (most of them are traditional Chinese characters); 2) remove redundant tokens (e.g. Chinese character tokens with ## prefix); 3) add some English tokens to reduce OOV. Position Embeddings We extend the max_position_embeddings from 512 to 1024. ...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 12
    XLM (Cross-lingual Language Model)

    XLM (Cross-lingual Language Model)

    PyTorch original implementation of Cross-lingual Language Model

    ...It popularized objectives like Masked Language Modeling (MLM) across many languages and Translation Language Modeling (TLM) that jointly trains on parallel sentence pairs to tighten cross-lingual alignment. Using a shared subword vocabulary, XLM learns language-agnostic features that work well for classification and sequence labeling tasks such as XNLI, NER, and POS without target-language supervision. The repository provides preprocessing pipelines, training code, and fine-tuning scripts so you can reproduce benchmark results or adapt models to your own multilingual corpora. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 13
    Synonyms

    Synonyms

    Chinese synonyms, chat robot, intelligent question and answer toolkit

    Chinese Synonyms for natural language processing and understanding. Better Chinese synonyms, chatbot, intelligent question and answer toolkit. synonymsCan be used for many tasks in natural language understanding, text alignment, recommendation algorithms, similarity calculation, semantic shifting, keyword extraction, concept extraction, automatic summarization, search engines, etc. Print synonyms in a friendly way for easy debugging. "Synonyms Cilin" was compiled by Mei Jiaju and others in...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 14
    fastNLP

    fastNLP

    fastNLP: A Modularized and Extensible NLP Framework

    fastNLP is a lightweight framework for natural language processing (NLP), the goal is to quickly implement NLP tasks and build complex models. A unified Tabular data container simplifies the data preprocessing process. Built-in Loader and Pipe for multiple datasets, eliminating the need for preprocessing code. Various convenient NLP tools, such as Embedding loading (including ELMo and BERT), intermediate data cache, etc.. Provide a variety of neural network components and recurrence models...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 15
    Magnitude

    Magnitude

    A fast, efficient universal vector embedding utility package

    ...It is primarily intended to be a simpler / faster alternative to Gensim but can be used as a generic key-vector store for domains outside NLP. It offers unique features like out-of-vocabulary lookups and streaming of large models over HTTP. Published in our paper at EMNLP 2018 and available on arXiv.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 16
    PyTorch Natural Language Processing

    PyTorch Natural Language Processing

    Basic Utilities for PyTorch Natural Language Processing (NLP)

    ...Now you've setup your pipeline, you may want to ensure that some functions run deterministically. Wrap any code that's random, with fork_rng and you'll be good to go. Now that you've computed your vocabulary, you may want to make use of pre-trained word vectors to set your embeddings.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 17

    PyVocabularyTree

    A vocabulary tree for image classification using OpenCV

    A vocabulary tree for image classification have been designed to be integrated in mobile robotic applications. It is a learning schema based on decission trees, bags of features and inverted files. The design provides training and optimization parameters that have been characterized using several detectors and descriptors for several input datasets.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 18
    A python implementation of Fresnel, a display vocabulary for the Resource Description Framework (RDF). See http://www.w3.org/2005/04/fresnel-info/.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 19
    Top Engine the Semantic Web Engine for the Enterprise. Top Engine is a Business Rule Engine that utilize OWL DL ontologies for vocabulary primitive to write rules on top of ontology. Top Engine support forward and backward chaining with truth maintenance
    Downloads: 0 This Week
    Last Update:
    See Project
  • Previous
  • You're on page 1
  • Next
MongoDB Logo MongoDB