Showing 100 open source projects for "text t"

View related business solutions
  • MongoDB Atlas runs apps anywhere Icon
    MongoDB Atlas runs apps anywhere

    Deploy in 115+ regions with the modern database for every enterprise.

    MongoDB Atlas gives you the freedom to build and run modern applications anywhere—across AWS, Azure, and Google Cloud. With global availability in over 115 regions, Atlas lets you deploy close to your users, meet compliance needs, and scale with confidence across any geography.
    Start Free
  • Earn up to 16% annual interest with Nexo. Icon
    Earn up to 16% annual interest with Nexo.

    Let your crypto work for you

    Put idle assets to work with competitive interest rates, borrow without selling, and trade with precision. All in one platform. Geographic restrictions, eligibility, and terms apply.
    Get started with Nexo.
  • 1
    PaddleOCR

    PaddleOCR

    Awesome multilingual OCR toolkits based on PaddlePaddle

    PaddleOCR offers exceptional, multilingual, and practical Optical Character Recognition (OCR) tools that can help users train better models and apply them into practice. Inspired by PaddlePaddle, PaddleOCR is an ultra lightweight OCR system, with multilingual recognition, digit recognition, vertical text recognition, as well as long text recognition. It features a PPOCR series of high-quality pre-trained models, which includes: ultra lightweight ppocr_mobile series models, general ppocr_server series models, and ultra lightweight compression ppocr_mobile_slim series models. PaddleOCR is easy to install and easy to use on Windows, Linux, MacOS and other systems.
    Downloads: 71 This Week
    Last Update:
    See Project
  • 2
    Wan2.2

    Wan2.2

    Wan2.2: Open and Advanced Large-Scale Video Generative Model

    ...The model is trained on significantly larger datasets than its predecessor, greatly enhancing motion complexity, semantic understanding, and aesthetic diversity. Wan2.2 also open-sources a 5-billion parameter high-compression VAE-based hybrid text-image-to-video (TI2V) model that supports 720P video generation at 24fps on consumer-grade GPUs like the RTX 4090. It supports multiple video generation tasks including text-to-video.
    Downloads: 169 This Week
    Last Update:
    See Project
  • 3
    Wan2.1

    Wan2.1

    Wan2.1: Open and Advanced Large-Scale Video Generative Model

    Wan2.1 is a foundational open-source large-scale video generative model developed by the Wan team, providing high-quality video generation from text and images. It employs advanced diffusion-based architectures to produce coherent, temporally consistent videos with realistic motion and visual fidelity. Wan2.1 focuses on efficient video synthesis while maintaining rich semantic and aesthetic detail, enabling applications in content creation, entertainment, and research. The model supports text-to-video and image-to-video generation tasks with flexible resolution options suitable for various GPU hardware configurations. ...
    Downloads: 97 This Week
    Last Update:
    See Project
  • 4
    SAM 3

    SAM 3

    Code for running inference and finetuning with SAM 3 model

    SAM 3 (Segment Anything Model 3) is a unified foundation model for promptable segmentation in both images and videos, capable of detecting, segmenting, and tracking objects. It accepts both text prompts (open-vocabulary concepts like “red car” or “goalkeeper in white”) and visual prompts (points, boxes, masks) and returns high-quality masks, boxes, and scores for the requested concepts. Compared with SAM 2, SAM 3 introduces the ability to exhaustively segment all instances of an open-vocabulary concept specified by a short phrase or exemplars, scaling to a vastly larger set of categories than traditional closed-set models. ...
    Downloads: 64 This Week
    Last Update:
    See Project
  • Full-stack observability with actually useful AI | Grafana Cloud Icon
    Full-stack observability with actually useful AI | Grafana Cloud

    Our generous forever free tier includes the full platform, including the AI Assistant, for 3 users with 10k metrics, 50GB logs, and 50GB traces.

    Built on open standards like Prometheus and OpenTelemetry, Grafana Cloud includes Kubernetes Monitoring, Application Observability, Incident Response, plus the AI-powered Grafana Assistant. Get started with our generous free tier today.
    Create free account
  • 5
    Qwen3-TTS

    Qwen3-TTS

    Qwen3-TTS is an open-source series of TTS models

    Qwen3-TTS is an open-source text-to-speech (TTS) project built around the Qwen3 large language model family, focused on generating high-quality, natural-sounding speech from plain text input. It provides researchers and developers with tools to transform text into expressive, intelligible audio, supporting multiple languages and voice characteristics tuned for clarity and fluidity.
    Downloads: 21 This Week
    Last Update:
    See Project
  • 6
    FLUX.2

    FLUX.2

    Official inference repo for FLUX.2 models

    ...FLUX.2 is built with a modern architecture (a flow-matching transformer + a revamped VAE + a strong vision-language encoder), enabling strong prompt adherence, correct rendering of text/typography in images, reliable lighting, layout, and physical realism, and consistent style/character/product identity across multiple generations or edits.
    Downloads: 53 This Week
    Last Update:
    See Project
  • 7
    HeartMuLa

    HeartMuLa

    A Family of Open Sourced Music Foundation Models

    ...The project also includes HeartCodec, a music codec optimized for high reconstruction fidelity, enabling efficient tokenization and reconstruction workflows that are critical for training and generation pipelines. For text extraction from audio, it provides HeartTranscriptor, a Whisper-based model tuned specifically for lyrics transcription, which helps bridge generated or recorded audio back into structured text. It also introduces HeartCLAP, which aligns audio and text into a shared embedding space.
    Downloads: 16 This Week
    Last Update:
    See Project
  • 8
    HunyuanImage-3.0

    HunyuanImage-3.0

    A Powerful Native Multimodal Model for Image Generation

    HunyuanImage-3.0 is a powerful, native multimodal text-to-image generation model released by Tencent’s Hunyuan team. It unifies multimodal understanding and generation in a single autoregressive framework, combining text and image modalities seamlessly rather than relying on separate image-only diffusion components. It uses a Mixture-of-Experts (MoE) architecture with many expert subnetworks to scale efficiently, deploying only a subset of experts per token, which allows large parameter counts without linear inference cost explosion. ...
    Downloads: 7 This Week
    Last Update:
    See Project
  • 9
    CogVideo

    CogVideo

    Text and image to video generation: CogVideoX and CogVideo

    CogVideo is an open-source family of advanced video generation models that can create videos from text, images, or existing video inputs. Built on large-scale Transformer and diffusion architectures, it enables multimodal generation across text-to-video, image-to-video, and video continuation tasks. The latest CogVideoX models offer higher resolution outputs, longer video durations, and improved controllability through prompt engineering.
    Downloads: 20 This Week
    Last Update:
    See Project
  • Add Two Lines of Code. Get Full APM. Icon
    Add Two Lines of Code. Get Full APM.

    AppSignal installs in minutes and auto-configures dashboards, alerts, and error tracking.

    Works out of the box for Rails, Django, Express, Phoenix, and more. Monitoring exceptions and performance in no time.
    Start Free
  • 10
    DeepSeek VL2

    DeepSeek VL2

    Mixture-of-Experts Vision-Language Models for Advanced Multimodal

    DeepSeek-VL2 is DeepSeek’s vision + language multimodal model—essentially the next-gen successor to their first vision-language models. It combines image and text inputs into a unified embedding / reasoning space so that you can query with text and image jointly (e.g. “What’s going on in this scene?” or “Generate a caption appropriate to context”). The model supports both image understanding (vision tasks) and multimodal reasoning, and is likely used as a component in agent systems to process visual inputs as context for downstream tasks. ...
    Downloads: 9 This Week
    Last Update:
    See Project
  • 11
    FLUX.1

    FLUX.1

    Official inference repo for FLUX.1 models

    FLUX.1 repository contains inference code and tooling for the FLUX.1 text-to-image diffusion models, enabling developers and researchers to generate and edit images from natural-language prompts using open-weight versions of the model on their own hardware or within custom applications. The project is part of a larger family of FLUX models developed by Black Forest Labs, designed to produce high-quality, detailed visuals from text descriptions with competitive prompt adherence and artistic fidelity. ...
    Downloads: 30 This Week
    Last Update:
    See Project
  • 12
    GLM-OCR

    GLM-OCR

    Accurate × Fast × Comprehensive

    GLM-OCR is an open-source multimodal optical character recognition (OCR) model built on a GLM-V encoder–decoder foundation that brings robust, accurate document understanding to complex real-world layouts and modalities. Designed to handle text recognition, table parsing, formula extraction, and general information retrieval from documents containing mixed content, GLM-OCR excels across major benchmarks while remaining highly efficient with a relatively compact parameter size (~0.9B), enabling deployment in high-concurrency services and edge environments. The model’s multimodal capabilities allow it to reason across image and text content holistically, capturing structured and unstructured information from pages that include dense tables, seals, code snippets, and varied document graphics. ...
    Downloads: 23 This Week
    Last Update:
    See Project
  • 13
    DeepSeek-OCR

    DeepSeek-OCR

    Contexts Optical Compression

    DeepSeek-OCR is an open-source optical character recognition solution built as part of the broader DeepSeek AI vision-language ecosystem. It is designed to extract text from images, PDFs, and scanned documents, and integrates with multimodal capabilities that understand layout, context, and visual elements beyond raw character recognition. The system treats OCR not simply as “read the text” but as “understand what the text is doing in the image”—for example distinguishing captions from body text, interpreting tables, or recognizing handwritten versus printed words. ...
    Downloads: 2 This Week
    Last Update:
    See Project
  • 14
    Kitten TTS

    Kitten TTS

    State-of-the-art TTS model under 25MB

    KittenTTS is an open-source, ultra-lightweight, and high-quality text-to-speech model featuring just 15 million parameters and a binary size under 25 MB. It is designed for real-time CPU-based deployment across diverse platforms. Ultra-lightweight, model size less than 25MB. CPU-optimized, runs without GPU on any device. High-quality voices, several premium voice options available. Fast inference, optimized for real-time speech synthesis.
    Downloads: 17 This Week
    Last Update:
    See Project
  • 15
    GLM-TTS

    GLM-TTS

    Controllable & emotion-expressive zero-shot TTS

    GLM-TTS is an advanced text-to-speech synthesis system built on large language model technologies that focuses on producing high-quality, expressive, and controllable spoken output, including features like emotion modulation and zero-shot voice cloning. It uses a two-stage architecture where a generative LLM first converts text into intermediate speech token sequences and then a Flow-based neural model converts those tokens into natural audio waveforms, enabling rich prosody and voice character even for unseen speakers. ...
    Downloads: 4 This Week
    Last Update:
    See Project
  • 16
    IndexTTS2

    IndexTTS2

    Industrial-level controllable zero-shot text-to-speech system

    IndexTTS is a modern, zero-shot text-to-speech (TTS) system engineered to deliver high-quality, natural-sounding speech synthesis with few requirements and strong voice-cloning capabilities. It builds on state-of-the-art models such as XTTS and other modern neural TTS backbones, improving them with a conformer-based speech conditional encoder and upgrading the decoder to a high-quality vocoder (BigVGAN2), leading to clearer and more natural audio output.
    Downloads: 8 This Week
    Last Update:
    See Project
  • 17
    Qwen-Image

    Qwen-Image

    Qwen-Image is a powerful image generation foundation model

    Qwen-Image is a powerful 20-billion parameter foundation model designed for advanced image generation and precise editing, with a particular strength in complex text rendering across diverse languages, especially Chinese. Built on the MMDiT architecture, it achieves remarkable fidelity in integrating text seamlessly into images while preserving typographic details and layout coherence. The model excels not only in text rendering but also in a wide range of artistic styles, including photorealistic, impressionist, anime, and minimalist aesthetics. ...
    Downloads: 7 This Week
    Last Update:
    See Project
  • 18
    MedGemma

    MedGemma

    Collection of Gemma 3 variants that are trained for performance

    MedGemma is a collection of specialized open-source AI models created by Google as part of its Health AI Developer Foundations initiative, built on the Gemma 3 family of transformer models and trained for medical text and image comprehension tasks that help accelerate the development of healthcare-focused AI applications. It includes multiple variants such as a 4 billion-parameter multimodal model that can process both medical images and text and a 27 billion-parameter text-only (and multimodal) model that offers deeper clinical reasoning and understanding at higher capacity, making it suitable for complex tasks like medical question answering, summarization of clinical notes, or generating reports from radiology images. ...
    Downloads: 2 This Week
    Last Update:
    See Project
  • 19
    HunyuanWorld 1.0

    HunyuanWorld 1.0

    Generating Immersive, Explorable, and Interactive 3D Worlds

    ...The architecture integrates panoramic proxy generation, semantic layering, and hierarchical 3D reconstruction to produce high-quality scene-scale 3D worlds from both text and images. HunyuanWorld-1.0 surpasses existing open-source methods in visual quality and geometric consistency, demonstrated by superior scores in BRISQUE, NIQE, Q-Align, and CLIP metrics.
    Downloads: 7 This Week
    Last Update:
    See Project
  • 20
    Step-Video-T2V

    Step-Video-T2V

    State-of-the-art (SoTA) text-to-video pre-trained model

    Step-Video-T2V is a state-of-the-art text-to-video foundation model developed to generate videos from natural-language prompts; its 30B-parameter architecture is designed to produce coherent, temporally extended video sequences — up to around 204 frames — based on input text. Under the hood it uses a compressed latent representation (a Video-VAE) to reduce spatial and temporal redundancy, and a denoising diffusion (or similar) process over that latent space to generate smooth, plausible motion and visuals. ...
    Downloads: 2 This Week
    Last Update:
    See Project
  • 21
    CLIP

    CLIP

    CLIP, Predict the most relevant text snippet given an image

    CLIP (Contrastive Language-Image Pretraining) is a neural model that links images and text in a shared embedding space, allowing zero-shot image classification, similarity search, and multimodal alignment. It was trained on large sets of (image, caption) pairs using a contrastive objective: images and their matching text are pulled together in embedding space, while mismatches are pushed apart. Once trained, you can give it any text labels and ask it to pick which label best matches a given image—even without explicit training for that classification task. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 22
    Tiktoken

    Tiktoken

    tiktoken is a fast BPE tokeniser for use with OpenAI's models

    tiktoken is a high-performance, tokenizer library (based on byte-pair encoding, BPE) designed for use with OpenAI’s models. It handles encoding and decoding text to token IDs efficiently, with minimal overhead. Because tokenization is a fundamental step in preparing text for models, tiktoken is optimized for speed, memory, and correctness in model contexts (e.g. matching OpenAI’s internal tokenization). The repo supports multiple encodings (e.g. “cl100k_base”) and lets users switch encoding names to match different model contexts. ...
    Downloads: 7 This Week
    Last Update:
    See Project
  • 23
    Qwen3-Omni

    Qwen3-Omni

    Qwen3-omni is a natively end-to-end, omni-modal LLM

    Qwen3-Omni is a natively end-to-end multilingual omni-modal foundation model that processes text, images, audio, and video and delivers real-time streaming responses in text and natural speech. It uses a Thinker-Talker architecture with a Mixture-of-Experts (MoE) design, early text-first pretraining, and mixed multimodal training to support strong performance across all modalities without sacrificing text or image quality. The model supports 119 text languages, 19 speech input languages, and 10 speech output languages. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 24
    Qwen2.5-Omni

    Qwen2.5-Omni

    Capable of understanding text, audio, vision, video

    Qwen2.5-Omni is an end-to-end multimodal flagship model in the Qwen series by Alibaba Cloud, designed to process multiple modalities (text, images, audio, video) and generate responses both as text and natural speech in streaming real-time. It supports “Thinker-Talker” architecture, and introduces innovations for aligning modalities over time (for example synchronizing video/audio), robust speech generation, and low-VRAM/quantized versions to make usage more accessible. ...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 25
    Qwen-2.5-VL

    Qwen-2.5-VL

    Qwen2.5-VL is the multimodal large language model series

    ...The models are available in various sizes, including 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B parameters, catering to diverse computational requirements. Trained on a comprehensive dataset of up to 18 trillion tokens, Qwen2.5 models exhibit significant improvements in instruction following, long-text generation (exceeding 8,000 tokens), and structured data comprehension, such as tables and JSON formats. They support context lengths up to 128,000 tokens and offer multilingual capabilities in over 29 languages, including Chinese, English, French, Spanish, and more. The models are open-source under the Apache 2.0 license, with resources and documentation available on platforms like Hugging Face and ModelScope.
    Downloads: 16 This Week
    Last Update:
    See Project
  • Previous
  • You're on page 1
  • 2
  • 3
  • 4
  • Next
MongoDB Logo MongoDB