22 projects for "batch text processing" with 2 filters applied:

  • Gemini 3 and 200+ AI Models on One Platform Icon
    Gemini 3 and 200+ AI Models on One Platform

    Access Google's best plus Claude, Llama, and Gemma. Fine-tune and deploy from one console.

    Build generative AI apps with Vertex AI Studio. Switch between models without switching platforms.
    Start Free
  • Cut Cloud Costs with Google Compute Engine Icon
    Cut Cloud Costs with Google Compute Engine

    Save up to 91% with Spot VMs and get automatic sustained-use discounts. One free VM per month, plus $300 in credits.

    Save on compute costs with Compute Engine. Reduce your batch jobs and workload bill 60-91% with Spot VMs. Compute Engine's committed use offers customers up to 70% savings through sustained use discounts. Plus, you get one free e2-micro VM monthly and $300 credit to start.
    Try Compute Engine
  • 1
    DeepSeek-OCR

    DeepSeek-OCR

    Contexts Optical Compression

    DeepSeek-OCR is an open-source optical character recognition solution built as part of the broader DeepSeek AI vision-language ecosystem. It is designed to extract text from images, PDFs, and scanned documents, and integrates with multimodal capabilities that understand layout, context, and visual elements beyond raw character recognition. The system treats OCR not simply as “read the text” but as “understand what the text is doing in the image”—for example distinguishing captions from body text, interpreting tables, or recognizing handwritten versus printed words. ...
    Downloads: 6 This Week
    Last Update:
    See Project
  • 2
    DeepSeek-OCR 2

    DeepSeek-OCR 2

    Visual Causal Flow

    DeepSeek-OCR-2 is the second-generation optical character recognition system developed to improve document understanding by introducing a “visual causal flow” mechanism, enabling the encoder to reorder visual tokens in a way that better reflects semantic structure rather than strict raster scan order. It is designed to handle complex layouts and noisy documents by giving the model causal reasoning capabilities that mimic human visual scanning behavior, enhancing OCR performance on documents...
    Downloads: 8 This Week
    Last Update:
    See Project
  • 3
    LTX-Video

    LTX-Video

    Official repository for LTX-Video

    ...The toolkit is built with both real-time and offline workflows in mind, enabling applications from consumer editing to professional content creation and batch processing. Internally optimized for multi-core processors and hardware acceleration where available, LTX-Video makes it feasible to work with high-resolution content and complex timelines without sacrificing responsiveness.
    Downloads: 5 This Week
    Last Update:
    See Project
  • 4
    Qwen3-ASR

    Qwen3-ASR

    Qwen3-ASR is an open-source series of ASR models

    Qwen3-ASR is an automatic speech recognition system in the QwenLM family, developed to convert spoken language into text with strong accuracy and real-time performance. As a specialized ASR variant of the broader Qwen language model ecosystem, it focuses on capturing reliable transcriptions from audio sources such as recordings, live streams, or conversational inputs while supporting low latency use cases. The architecture combines advanced neural acoustic modeling with context-aware...
    Downloads: 3 This Week
    Last Update:
    See Project
  • $300 in Free Credit Across 150+ Cloud Services Icon
    $300 in Free Credit Across 150+ Cloud Services

    VMs, containers, AI, databases, storage | build anything. No commitment to start.

    Start your project in minutes. After credits run out, 20+ products include free monthly usage. Only pay when you're ready to scale with Google Cloud.
    Start Building Free
  • 5
    FireRed-Image-Edit

    FireRed-Image-Edit

    General-purpose image editing model that delivers high-fidelity

    ...In addition to editing single images, FireRed supports multi-image editing scenarios such as virtual try-on or batch transformations, making it suitable for both creative and practical workflows.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 6
    VibeVoice

    VibeVoice

    Open-source multi-speaker long-form text-to-speech model

    VibeVoice-1.5B is Microsoft’s frontier open-source text-to-speech (TTS) model designed for generating expressive, long-form, multi-speaker conversational audio such as podcasts. Unlike traditional TTS systems, it excels in scalability, speaker consistency, and natural turn-taking for up to 90 minutes of continuous speech with as many as four distinct speakers. A key innovation is its use of continuous acoustic and semantic speech tokenizers operating at an ultra-low frame rate of 7.5 Hz, enabling high audio fidelity with efficient processing of long sequences. ...
    Downloads: 19 This Week
    Last Update:
    See Project
  • 7
    MiniMax-01

    MiniMax-01

    Large-language-model & vision-language-model based on Linear Attention

    MiniMax-01 is the official repository for two flagship models: MiniMax-Text-01, a long-context language model, and MiniMax-VL-01, a vision-language model built on top of it. MiniMax-Text-01 uses a hybrid attention architecture that blends Lightning Attention, standard softmax attention, and Mixture-of-Experts (MoE) routing to achieve both high throughput and long-context reasoning. It has 456 billion total parameters with 45.9 billion activated per token and is trained with advanced parallel...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 8
    HunyuanDiT

    HunyuanDiT

    Diffusion Transformer with Fine-Grained Chinese Understanding

    HunyuanDiT is a high-capability text-to-image diffusion transformer with bilingual (Chinese/English) understanding and multi-turn dialogue capability. It trains a diffusion model in latent space using a transformer backbone and integrates a Multimodal Large Language Model (MLLM) to refine captions and support conversational image generation. It supports adapters like ControlNet, IP-Adapter, LoRA, and can run under constrained VRAM via distillation versions. LoRA, ControlNet (pose, depth,...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 9
    Vidi2

    Vidi2

    Large Multimodal Models for Video Understanding and Editing

    Vidi is a family of large multimodal models developed for deep video understanding and editing tasks, integrating vision, audio, and language to allow sophisticated querying and manipulation of video content. It’s designed to process long-form, real-world videos and answer complex queries such as “when in this clip does X happen?” or “where in the frame is object Y during that moment?” — offering temporal retrieval, spatio-temporal grounding (i.e. locating objects over time + space), and...
    Downloads: 3 This Week
    Last Update:
    See Project
  • Catch Bugs Before Your Customers Do Icon
    Catch Bugs Before Your Customers Do

    Real-time error alerts, performance insights, and anomaly detection across your full stack. Free 30-day trial.

    Move from alert to fix before users notice. AppSignal monitors errors, performance bottlenecks, host health, and uptime—all from one dashboard. Instant notifications on deployments, anomaly triggers for memory spikes or error surges, and seamless log management. Works out of the box with Rails, Django, Express, Phoenix, Next.js, and dozens more. Starts at $23/month with no hidden fees.
    Try AppSignal Free
  • 10
    Step3-VL-10B

    Step3-VL-10B

    Multimodal model achieving SOTA performance

    ...It achieves this efficiency and strong performance through unified pre-training on a massive 1.2 trillion-token multimodal corpus that jointly optimizes a language-aligned perception encoder with a powerful decoder, creating deep synergy between image processing and text understanding.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 11
    Kimi-Audio

    Kimi-Audio

    Audio foundation model excelling in audio understanding

    Kimi-Audio is an ambitious open-source audio foundation model designed to unify a wide array of audio processing tasks — from speech recognition and audio understanding to generative conversation and sound event classification — within a single cohesive architecture. Instead of fragmenting work across specialized models, Kimi-Audio handles automatic speech recognition (ASR), audio question answering, automatic audio captioning, speech emotion recognition, and audio-to-text chat in one system, enabling developers to build rich, multimodal audio applications without stitching together disparate components. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 12
    DreamCraft3D

    DreamCraft3D

    Official implementation of DreamCraft3D

    DreamCraft3D is DeepSeek’s generative 3D modeling framework / model family that likely extends their earlier 3D efforts (e.g. Shap-E or Point-E style models) with more capability, control, or expression. The name suggests a “dream crafting” metaphor—users probably supply textual or image prompts and generate 3D assets (point clouds, meshes, scenes). The repository includes model code, inference scripts, sample prompts, and possibly dataset preparation pipelines. It may integrate rendering or...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 13
    GLM-4.5V

    GLM-4.5V

    GLM-4.6V/4.5V/4.1V-Thinking, towards versatile multimodal reasoning

    GLM-4.5V is the preceding iteration in the GLM-V series that laid much of the groundwork for general multimodal reasoning and vision-language understanding. It embodies the design philosophy of mixing visual and textual modalities into a unified model capable of general-purpose reasoning, content understanding, and generation, while already supporting a wide variety of tasks: from image captioning and visual question answering to content recognition, GUI-based agents, video understanding,...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 14
    GPT-2 Output Dataset

    GPT-2 Output Dataset

    Dataset of GPT-2 outputs for research in detection, biases, and more

    The GPT-2 Output Dataset is a large collection of model-generated text, released by OpenAI alongside the GPT-2 research paper to study the behaviors and limitations of large language models. It contains 250,000 samples of GPT-2 outputs, generated with different sampling strategies such as top-k truncation, to highlight the diversity and quality of model completions. The dataset also includes corresponding human-written text for comparison, enabling researchers to explore methods for...
    Downloads: 3 This Week
    Last Update:
    See Project
  • 15
    CSM (Conversational Speech Model)

    CSM (Conversational Speech Model)

    A Conversational Speech Generation Model

    The CSM (Conversational Speech Model) is a speech generation model developed by Sesame AI that creates RVQ audio codes from text and audio inputs. It uses a Llama backbone and a smaller audio decoder to produce audio codes for realistic speech synthesis. The model has been fine-tuned for interactive voice demos and is hosted on platforms like Hugging Face for testing. CSM offers a flexible setup and is compatible with CUDA-enabled GPUs for efficient execution.
    Downloads: 4 This Week
    Last Update:
    See Project
  • 16
    Universal Sentence Encoder

    Universal Sentence Encoder

    Encoder of greater-than-word length text trained on a variety of data

    The Universal Sentence Encoder (USE) is a pre-trained deep learning model designed to encode sentences into fixed-length embeddings for use in various natural language processing (NLP) tasks. It leverages Transformer and Deep Averaging Network (DAN) architectures to generate embeddings that capture the semantic meaning of sentences. The model is designed for tasks like sentiment analysis, semantic textual similarity, and clustering, and provides high-quality sentence representations in a...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 17
    Denoiser

    Denoiser

    Real Time Speech Enhancement in the Waveform Domain (Interspeech 2020)

    ...The implementation includes data augmentation techniques applied to the raw waveforms (e.g. noise mixing, reverberation) to improve model robustness and generalization to diverse noise types. The project supports both offline denoising (batch inference) and live audio processing (e.g. via loopback audio interfaces), making it practical for real-time use in calls or recording. The codebase includes training and evaluation scripts, configuration management via Hydra, and pretrained models on standard noise datasets.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 18
    Bio_ClinicalBERT

    Bio_ClinicalBERT

    ClinicalBERT model trained on MIMIC notes for clinical NLP tasks

    Bio_ClinicalBERT is a domain-specific language model tailored for clinical natural language processing (NLP), extending BioBERT with additional training on clinical notes. It was initialized from BioBERT-Base v1.0 and further pre-trained on all clinical notes from the MIMIC-III database (~880M words), which includes ICU patient records. The training focused on improving performance in tasks like named entity recognition and natural language inference within the healthcare domain. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 19
    mms-300m-1130-forced-aligner

    mms-300m-1130-forced-aligner

    CTC-based forced aligner for audio-text in 158 languages

    ...The alignment pipeline includes audio processing, emission generation, tokenization, and span detection, making it suitable for speech analysis, transcription syncing, and dataset creation. This model is especially useful for researchers and developers working with low-resource languages or building multilingual speech systems.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 20
    Ministral 3 8B Instruct 2512

    Ministral 3 8B Instruct 2512

    Compact 8B multimodal instruct model optimized for edge deployment

    Ministral 3 8B Instruct 2512 is a balanced, efficient model in the Ministral 3 family, offering strong multimodal capabilities within a compact footprint. It combines an 8.4B-parameter language model with a 0.4B vision encoder, enabling both text reasoning and image understanding. This FP8 instruct-fine-tuned variant is optimized for chat, instruction following, and structured outputs, making it ideal for daily assistant tasks and lightweight agentic workflows. Designed for edge deployment,...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 21
    Ministral 3 3B Base 2512

    Ministral 3 3B Base 2512

    Small 3B-base multimodal model ideal for custom AI on edge hardware

    ...It supports dozens of languages, making it practical for multilingual, global, or distributed environments. With a large 256k token context window, it can handle long documents, extended inputs, or multi-step processing workflows even at its small size.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 22
    Ministral 3 14B Instruct 2512

    Ministral 3 14B Instruct 2512

    Efficient 14B multimodal instruct model with edge deployment and FP8

    Ministral 3 14B Instruct 2512 is the largest model in the Ministral 3 family, delivering frontier performance comparable to much larger systems while remaining optimized for edge-level deployment. It combines a 13.5B-parameter language model with a 0.4B-parameter vision encoder, enabling strong multimodal understanding in both text and image tasks. This FP8 instruct-tuned variant is designed specifically for chat, instruction following, and agentic workflows with robust system-prompt...
    Downloads: 0 This Week
    Last Update:
    See Project
  • Previous
  • You're on page 1
  • Next
MongoDB Logo MongoDB