Showing 80 open source projects for "encoder"

View related business solutions
  • $300 in Free Credit Towards Top Cloud Services Icon
    $300 in Free Credit Towards Top Cloud Services

    Build VMs, containers, AI, databases, storage—all in one place.

    Start your project in minutes. After credits run out, 20+ products include free monthly usage. Only pay when you're ready to scale.
    Get Started
  • Full-stack observability with actually useful AI | Grafana Cloud Icon
    Full-stack observability with actually useful AI | Grafana Cloud

    Our generous forever free tier includes the full platform, including the AI Assistant, for 3 users with 10k metrics, 50GB logs, and 50GB traces.

    Built on open standards like Prometheus and OpenTelemetry, Grafana Cloud includes Kubernetes Monitoring, Application Observability, Incident Response, plus the AI-powered Grafana Assistant. Get started with our generous free tier today.
    Create free account
  • 1
    VGGT

    VGGT

    [CVPR 2025 Best Paper Award] VGGT

    VGGT is a transformer-based framework aimed at unifying classic visual geometry tasks—such as depth estimation, camera pose recovery, point tracking, and correspondence—under a single model. Rather than training separate networks per task, it shares an encoder and leverages geometric heads/decoders to infer structure and motion from images or short clips. The design emphasizes consistent geometric reasoning: outputs from one head (e.g., correspondences or tracks) reinforce others (e.g., pose or depth), making the system more robust to challenging viewpoints and textures. The repo provides inference pipelines to estimate geometry from monocular inputs, stereo pairs, or brief sequences, together with evaluation harnesses for common geometry benchmarks. ...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 2
    LatentSync

    LatentSync

    Taming Stable Diffusion for Lip Sync

    ...In effect, given a source video (with masked or reference frames) and an audio track, LatentSync directly generates frames whose lip motions and expressions align with the audio, producing convincing talking-head or animated lip-sync output. The system leverages a U-Net diffusion backbone, with cross-attention of audio embeddings (via an audio encoder) and reference video frames to guide generation, and applies a set of loss functions (temporal, perceptual, sync-net based) to enforce lip-sync accuracy, visual fidelity, and temporal consistency. Over versions, LatentSync has improved temporal stability and lowered resource requirements — making inference more practical (e.g. 8 GB VRAM for earlier versions, somewhat higher for latest models).
    Downloads: 2 This Week
    Last Update:
    See Project
  • 3
    Multimodal

    Multimodal

    TorchMultimodal is a PyTorch library

    ...It includes a collection of ready model classes—like ALBEF, CLIP, BLIP-2, COCA, FLAVA, MDETR, and Omnivore—that serve as reference implementations you can adopt or adapt. The design emphasizes composability: you can mix and match encoder, fusion, and decoder components rather than starting from monolithic models. The repository also includes example scripts and datasets for common multimodal tasks (e.g. retrieval, visual question answering, grounding) so you can test and compare models end to end. Installation supports both CPU and CUDA, and the codebase is versioned, tested, and maintained.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 4
    NVIDIA Isaac GR00T

    NVIDIA Isaac GR00T

    NVIDIA Isaac GR00T N1.5 is the world's first open foundation model

    ...The vision-language model remains frozen during both pretraining and finetuning, preserving language understanding and improving generalization. Streamlined MLP connection between vision encoder and LLM with added layer normalization.
    Downloads: 0 This Week
    Last Update:
    See Project
  • AI-generated apps that pass security review Icon
    AI-generated apps that pass security review

    Stop waiting on engineering. Build production-ready internal tools with AI—on your company data, in your cloud.

    Retool lets you generate dashboards, admin panels, and workflows directly on your data. Type something like “Build me a revenue dashboard on my Stripe data” and get a working app with security, permissions, and compliance built in from day one. Whether on our cloud or self-hosted, create the internal software your team needs without compromising enterprise standards or control.
    Try Retool free
  • 5
    Step-Audio 2

    Step-Audio 2

    Multi-modal large language model designed for audio understanding

    Step-Audio2 is an advanced, end-to-end multimodal large language model designed for high-fidelity audio understanding and natural speech conversation: unlike many pipelines that separate speech recognition, processing, and synthesis, Step-Audio2 processes raw audio, reasons about semantic and paralinguistic content (like emotion, speaker characteristics, non-verbal cues), and can generate contextually appropriate responses — including potentially generating or transforming audio output. It integrates a latent-space audio encoder, discrete acoustic tokens, and reinforcement-learning–based training (CoT + RL) to enhance its ability to capture and reproduce voice styles, intonations, and subtle vocal cues. Moreover, Step-Audio2 supports tool-calling and retrieval-augmented generation (RAG), allowing it to access external knowledge sources or audio/text databases, thus reducing hallucinations and improving coherence in complex dialogues.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 6
    HunyuanOCR

    HunyuanOCR

    OCR expert VLM powered by Hunyuan's native multimodal architecture

    HunyuanOCR is an open-source, end-to-end OCR (optical character recognition) Vision-Language Model (VLM) developed by Tencent‑Hunyuan. It’s designed to unify the entire OCR pipeline, detection, recognition, layout parsing, information extraction, translation, and even subtitle or structured output generation, into a single model inference instead of a cascade of separate tools. Despite being fairly lightweight (about 1 billion parameters), it delivers state-of-the-art performance across a...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 7
    PI-Based Image Encoder / Converter

    PI-Based Image Encoder / Converter

    Python code able to convert / compress image to PI (3.14, π) Indexes

    Image processing tool that encodes pixel data as indices within the first 16.7 million digits of PI (π). Features high-performance Numba-accelerated search and a signature 'film-grain' aesthetic upon reconstruction. ZIP also include 16 MB file with 16,7 mil numbers of PI Benchmark(Single-Thread): Hardware & Environment Apple Silicon: Apple M2 (Mac mini/MacBook) x86_64 Platform: Intel Core Ultra 5 225F (Arrow Lake, 10 Cores) OS 1: Fedora 43 (GNOME) OS 2: Windows 11 Pro...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 8
    StyleTTS 2

    StyleTTS 2

    Towards Human-Level Text-to-Speech through Style Diffusion

    ...The repository includes training scripts, configuration files, and pre-trained auxiliary modules such as a text aligner, pitch extractor, and PL-BERT-based linguistic encoder.
    Downloads: 5 This Week
    Last Update:
    See Project
  • 9
    Bert-VITS2

    Bert-VITS2

    VITS2 backbone with multilingual-bert

    Bert-VITS2 is a neural text-to-speech project that combines a VITS2 backbone with a multilingual BERT front-end to produce high-quality speech in multiple languages. The core idea is to use BERT-style contextual embeddings for text encoding while relying on a refined VITS2 architecture for acoustic generation and vocoding. The repository includes everything needed to train, fine-tune, and run the model, from configuration files to preprocessing scripts, spectrogram utilities, and training...
    Downloads: 6 This Week
    Last Update:
    See Project
  • Gemini 3 and 200+ AI Models on One Platform Icon
    Gemini 3 and 200+ AI Models on One Platform

    Access Google's best plus Claude, Llama, and Gemma. Fine-tune and deploy from one console.

    Build generative AI apps with Vertex AI. Switch between models without switching platforms.
    Start Free
  • 10
    LAME (Lame Aint an MP3 Encoder)

    LAME (Lame Aint an MP3 Encoder)

    A high quality MP3 encoder

    LAME is an educational tool to be used for learning about MP3 encoding. The goal of the LAME project is to improve the psycho acoustics, quality and speed of MP3 encoding. Note: we provide source code only!
    Leader badge
    Downloads: 21,105 This Week
    Last Update:
    See Project
  • 11
    Coqui TTS

    Coqui TTS

    A deep learning toolkit for Text-to-Speech, battle-tested in research

    ...TTS comes with pre-trained models, tools for measuring dataset quality and is already used in 20+ languages for products and research projects. High-performance Deep Learning models for Text2Speech tasks. Text2Spec models (Tacotron, Tacotron2, Glow-TTS, SpeedySpeech). Speaker Encoder to compute speaker embeddings efficiently. Vocoder models (MelGAN, Multiband-MelGAN, GAN-TTS, ParallelWaveGAN, WaveGrad, WaveRNN) Fast and efficient model training. Detailed training logs on the terminal and Tensorboard. Support for Multi-speaker TTS. Efficient, flexible, and lightweight but feature complete Trainer API. Released and ready-to-use models. ...
    Downloads: 38 This Week
    Last Update:
    See Project
  • 12
    Shap-E

    Shap-E

    Generate 3D objects conditioned on text or images

    The shap-e repository provides the official code and model release for Shap-E, a conditional generative model designed to produce 3D assets (implicit functions, meshes, neural radiance fields) from text or image prompts. The model is built with a two-stage architecture: first an encoder that maps existing 3D assets into parameterizations of implicit functions, and then a conditional diffusion model trained on those parameterizations to generate new assets. Because it works at the level of implicit functions, Shap-E can render output both as textured meshes and NeRF-style volumetric renderings. The repository contains sample notebooks (e.g. sample_text_to_3d.ipynb, sample_image_to_3d.ipynb) so users can try out text → 3D or image → 3D generation. ...
    Downloads: 2 This Week
    Last Update:
    See Project
  • 13
    eCxx

    eCxx

    A C++ library for AVR and NodeMCU

    NOTE: This project is marked with 'Status: Abandoned' on SourceForge because not enough time can be dedicated to this project. However it may still get sporadic commits to the repository. eCxx is a library for AVR and NodeMCU tailored for micro LED displays and lighting effects. eCxx is utilizing Makefile build system. Java and Python based applications/tools are also included to ease the development and debugging process using the host PC. On one side, eCxx supports the original...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 14
    OpenFlamingo

    OpenFlamingo

    An open-source framework for training large multimodal models

    ...If you have any questions, please feel free to open an issue. We also welcome contributions! We provide an initial OpenFlamingo 9B model using a CLIP ViT-Large vision encoder and a LLaMA-7B language model. In general, we support any CLIP vision encoder. For the language model, we support LLaMA, OPT, GPT-Neo, GPT-J, and Pythia models. OpenFlamingo is a multimodal language model that can be used for a variety of tasks. It is trained on a large multimodal dataset.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 15
    Basaran

    Basaran

    Basaran, an open-source alternative to the OpenAI text completion API

    ...The open source community will eventually witness the Stable Diffusion moment for large language models (LLMs), and Basaran allows you to replace OpenAI's service with the latest open-source model to power your application without modifying a single line of code. Stream generation using various decoding strategies. Support both decoder-only and encoder-decoder models. Detokenizer that handles surrogates and whitespace. Multi-GPU support with optional 8-bit quantization. Real-time partial progress using server-sent events. Compatible with OpenAI API and client libraries. Comes with a fancy web-based playground. Docker images are available on Docker Hub and GitHub Packages.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 16
    MetaTransformer

    MetaTransformer

    Meta-Transformer for Unified Multimodal Learning

    We're thrilled to present OneLLM, an ensembling Meta-Transformer framework with Multimodal Large Language Models, which performs multimodal joint training, supports more modalities including fMRI, Depth, and Normal Maps, and demonstrates very impressive performances on 25 benchmarks.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 17
    OpenNMT-tf

    OpenNMT-tf

    Neural machine translation and sequence learning using TensorFlow

    ...Models are described with code to allow training custom architectures and overriding default behavior. For example, the following instance defines a sequence-to-sequence model with 2 concatenated input features, a self-attentional encoder, and an attentional RNN decoder sharing its input and output embeddings. Sequence to sequence models can be trained with guided alignment and alignment information are returned as part of the translation API.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 18
    iJEPA

    iJEPA

    Official codebase for I-JEPA

    i-JEPA (Image Joint-Embedding Predictive Architecture) is a self-supervised learning framework that predicts missing high-level representations rather than reconstructing pixels. A context encoder sees visible regions of an image and predicts target embeddings for masked regions produced by a slowly updated target encoder, focusing learning on semantics instead of texture. This objective sidesteps generative pixel losses and avoids heavy negative sampling, producing features that transfer strongly with linear probes and minimal fine-tuning. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 19
    lora-svc

    lora-svc

    Singing voice change based on whisper, lora for singing voice clone

    singing voice change based on whisper, and lora for singing voice clone. You will feel the beauty of the code from this project. Uni-SVC main branch is for singing voice clone based on whisper with speaker encoder and speaker adapter. Uni-SVC main target is to develop lora for SVC. With lora, maybe clone a singer just need 10 stence after 10 minutes train. Each singer is a plug-in of the base model.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 20
    Stable-Dreamfusion

    Stable-Dreamfusion

    Text-to-3D & Image-to-3D & Mesh Exportation with NeRF + Diffusion

    ...Different from Imagen, Stable-Diffusion is a latent diffusion model, which diffuses in a latent space instead of the original image space. Therefore, we need the loss to propagate back from the VAE's encoder part too, which introduces extra time costs in training. We use the multi-resolution grid encoder to implement the NeRF backbone (implementation from torch-ngp), which enables much faster rendering.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 21
    CPT

    CPT

    CPT: A Pre-Trained Unbalanced Transformer

    A Pre-Trained Unbalanced Transformer for Both Chinese Language Understanding and Generation. We replace the old BERT vocabulary with a larger one of size 51271 built from the training data, in which we 1) add missing 6800+ Chinese characters (most of them are traditional Chinese characters); 2) remove redundant tokens (e.g. Chinese character tokens with ## prefix); 3) add some English tokens to reduce OOV. Position Embeddings We extend the max_position_embeddings from 512 to 1024. We...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 22
    Karlo

    Karlo

    Text-conditional image generation model based on OpenAI's unCLIP

    ...In the case of Prior and Decoder, we use ViT-L/14 provided by OpenAI’s CLIP repository. Unlike the original implementation of unCLIP, we replace the trainable transformer in the decoder into the text encoder in ViT-L/14 for efficiency. In the case of the SR module, we first train the model using the DDPM objective in 1M steps, followed by additional 234K steps to fine-tune the additional component.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 23
    Stable Diffusion

    Stable Diffusion

    A latent text-to-image diffusion model

    Stable Diffusion is a widely used open-source latent text-to-image diffusion model developed by the CompVis group for generating high-quality images from natural language prompts. The model operates by conditioning a diffusion process on text embeddings produced by a CLIP text encoder, enabling detailed and controllable image synthesis. It was trained on large-scale image datasets and later fine-tuned to produce 512×512 images with strong visual fidelity. Because the system runs efficiently on consumer hardware compared to earlier generative models, it helped popularize local AI image generation workflows. The repository includes reference scripts and model configurations that allow researchers and developers to reproduce, modify, or extend the architecture. ...
    Downloads: 36 This Week
    Last Update:
    See Project
  • 24
    EnCodec

    EnCodec

    State-of-the-art deep learning based audio codec

    ...Unlike traditional codecs (like MP3 or Opus), Encodec uses a learned quantizer and decoder to reconstruct complex waveforms with remarkable accuracy at bitrates as low as 1.5 kbps. It employs a convolutional encoder–decoder architecture trained with perceptual loss functions that optimize for human auditory quality rather than raw waveform distance. The model can operate in real time and supports variable bandwidths, bitrates, and multi-band audio. Encodec has applications in speech and music compression, generative modeling, and efficient data transmission for communication systems. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 25

    SimpleVideoEncoder

    Simple video encoder

    Simple video encoder is GUI for ffmpeg designed to encode video files. The application is designed so that the process of starting the encoding of one or more videofiles takes 2-3 clicks.
    Downloads: 0 This Week
    Last Update:
    See Project
MongoDB Logo MongoDB