Showing 15 open source projects for "spatial"

View related business solutions
  • $300 Free Credits to Build on Google Cloud Icon
    $300 Free Credits to Build on Google Cloud

    New to Google Cloud? Get $300 in credits to explore Compute Engine, BigQuery, Cloud Run, Gemini Enterprise Agent Platform, and more.

    Start your next project with $300 in free Google Cloud credit. Spin up VMs, run containers, query petabytes in BigQuery, or build agents with Gemini Enterprise Agent Platform. Once your credits are used, keep building with 20+ always-free tier products including Compute Engine, Cloud Storage, GKE, and Cloud Run functions. No commitment required—just sign up and start building.
    Claim $300 Free
  • Build Agents and Models on One Platform Icon
    Build Agents and Models on One Platform

    Everything you need to build production-ready agents and models. Access 200+ Google and third-party AI models and tools.

    Gemini Enterprise Agent Platform is Google Cloud's comprehensive platform for developers to build, scale, govern, and optimize agents and models. Choose from Google's most advanced models and third-party models like Anthropic's Claude Model Family.
    Try It Free
  • 1
    Qwen3-VL

    Qwen3-VL

    Qwen3-VL, the multimodal large language model series by Alibaba Cloud

    ...Qwen3-VL is built for complex tasks such as GUI automation, multimodal coding (converting images or videos into HTML, CSS, JS, or Draw.io diagrams), long-context reasoning with support up to 1M tokens, and comprehensive video understanding. It also brings advanced perception capabilities, including spatial grounding, object recognition, OCR across 32 languages, and robust handling of challenging inputs like low-light or distorted text.
    Downloads: 6 This Week
    Last Update:
    See Project
  • 2
    Lyra 2

    Lyra 2

    Project Lyra: Open Generative 3D World Models

    ...It enables the creation of fully explorable 3D environments from minimal inputs such as a single image or video, leveraging self-distillation methods to generate consistent spatial representations. The system evolves across versions, with newer iterations introducing long-horizon generation and improved 3D consistency across frames. It combines elements of computer vision, generative modeling, and spatial intelligence to produce dynamic and navigable virtual worlds. The architecture is designed to handle both 3D and 4D scene generation, making it suitable for applications such as simulation, gaming, and virtual environments. ...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 3
    SAM 3D Objects

    SAM 3D Objects

    Models for object and human mesh reconstruction

    SAM 3D Objects is a foundation model that reconstructs full 3D geometry, texture, and spatial layout of objects and scenes from a single image. Given one RGB image and object masks (for example, from the Segment Anything family), it can generate a textured 3D mesh for each object, including pose and approximate scene layout. The model is specifically designed to be robust in real-world images with clutter, occlusions, small objects, and unusual viewpoints, where many earlier 3D-from-image systems struggle. ...
    Downloads: 9 This Week
    Last Update:
    See Project
  • 4
    DeepSeek-OCR 2

    DeepSeek-OCR 2

    Visual Causal Flow

    ...It is designed to handle complex layouts and noisy documents by giving the model causal reasoning capabilities that mimic human visual scanning behavior, enhancing OCR performance on documents with rich spatial structure. The repository provides model code and inference scripts that let researchers and developers run and benchmark the system on both images and PDFs, with support for batch evaluation and optimized pipelines leveraging vLLM and transformers.
    Downloads: 1 This Week
    Last Update:
    See Project
  • Compliant and Reliable File Transfers Backed by Top Security Certifications Icon
    Compliant and Reliable File Transfers Backed by Top Security Certifications

    Cerberus FTP Server delivers SOC 2 Type II certified security and FIPS 140-2 validated encryption.

    Stop relying on non-certified, legacy file transfer tools that creak under the weight of modern security demands. Get full audit trails, advanced access controls and more supported by an award-winning team of experts. Start your free 25-day trial today.
    Start Free Trial
  • 5
    HY-World 1.5

    HY-World 1.5

    A Systematic Framework for Interactive World Modeling

    ...It blends advanced reasoning with multimodal synthesis, enabling agents to describe scenes, generate context-appropriate responses, and contribute to narrative or gameplay flows. The underlying framework typically supports large-context state tracking across extended interactions, blending temporal and spatial multimodal signals.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 6
    Qwen-2.5-VL

    Qwen-2.5-VL

    Qwen2.5-VL is the multimodal large language model series

    Qwen2.5 is a series of large language models developed by the Qwen team at Alibaba Cloud, designed to enhance natural language understanding and generation across multiple languages. The models are available in various sizes, including 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B parameters, catering to diverse computational requirements. Trained on a comprehensive dataset of up to 18 trillion tokens, Qwen2.5 models exhibit significant improvements in instruction following, long-text generation...
    Downloads: 7 This Week
    Last Update:
    See Project
  • 7
    SlowFast

    SlowFast

    Video understanding codebase from FAIR for reproducing video models

    SlowFast is a video understanding framework that captures both spatial semantics and temporal dynamics efficiently by processing video frames at two different temporal resolutions. The slow pathway encodes semantic context by sampling frames sparsely, while the fast pathway captures motion and fine temporal cues by operating on densely sampled frames with fewer channels. Together, these two pathways complement each other, allowing the network to model both appearance and motion without excessive computational cost. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 8
    Seamless Communication

    Seamless Communication

    Foundational Models for State-of-the-Art Speech and Text Translation

    Seamless Communication is a research project focused on building more integrated, low-latency multimodal communication between humans and AI agents. The motivation is to move beyond “text in, text out” and enable direct, live, multi-turn exchange involving language, gesture, gaze, vision, and modality switching without user friction. The system architecture includes a real-time multimodal signal pipeline for audio, video, and sensor data, a dialog manager that can decide when to act (speak,...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 9
    Qwen-Image-Layered

    Qwen-Image-Layered

    Qwen-Image-Layered: Layered Decomposition for Inherent Editablity

    Qwen-Image-Layered is an extension of the Qwen series of multimodal models that introduces layered image understanding, enabling the model to reason about hierarchical visual structures — such as separating foreground, background, objects, and contextual layers within an image. This architecture allows richer semantic interpretation, enabling use cases such as scene decomposition, object-level editing, layered captioning, and more fine-grained multimodal reasoning than with flat image...
    Downloads: 0 This Week
    Last Update:
    See Project
  • Stop Cyber Threats with VM-Series Next-Gen Firewall on Azure Icon
    Stop Cyber Threats with VM-Series Next-Gen Firewall on Azure

    Native application identity and user-based security for your Azure cloud

    Gain integrated visibility across all traffic in a single pass. Deploy Palo Alto Networks VM-Series to determine application identity and content while automating security policy updates via rich APIs.
    Get a free trial
  • 10
    ControlNet

    ControlNet

    Let us control diffusion models

    ControlNet is a neural network architecture designed to add conditional control to text-to-image diffusion models. Rather than training from scratch, ControlNet “locks” the weights of a pre-trained diffusion model and introduces a parallel trainable branch that learns additional conditions—like edges, depth maps, segmentation, human pose, scribbles, or other guidance signals. This allows the system to control where and how the model should focus during generation, enabling users to steer...
    Downloads: 2 This Week
    Last Update:
    See Project
  • 11
    Mask2Former

    Mask2Former

    Code release for "Masked-attention Mask Transformer

    ...Its core idea is to cast segmentation as mask classification: a transformer decoder predicts a set of mask queries, each with an associated class score, eliminating the need for task-specific heads. A pixel decoder fuses multi-scale features and feeds masked attention in the transformer so each query focuses computation on its current spatial support. This leads to accurate masks with sharp boundaries and strong small-object performance while remaining efficient on high-resolution inputs. The project provides extensive configurations and pretrained models across popular benchmarks like COCO, ADE20K, and Cityscapes. Built on top of Detectron2, it includes training scripts, inference tools, and visualization utilities that make experimentation straightforward.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 12
    TimeSformer

    TimeSformer

    The official pytorch implementation of our paper

    ...TimeSformer was influential in showing that pure transformer architectures—without convolutional backbones—can perform strongly on video classification tasks. Its flexible attention design allows experimenting with different factoring (spatial-then-temporal, joint, etc.) to trade off compute, memory, and accuracy.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 13
    SG2Im

    SG2Im

    Code for "Image Generation from Scene Graphs", Johnson et al, CVPR 201

    ...The pipeline typically predicts object layouts (bounding boxes and masks) from the graph, then renders a realistic image conditioned on those layouts. This separation lets the model reason about geometry and composition before committing to texture and color, improving spatial fidelity. The repository includes training code, datasets, and evaluation scripts so researchers can reproduce baselines and extend components such as the graph encoder or image generator. In practice, sg2im demonstrates how structured semantics can guide generative models to produce controllable, compositional imagery.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 14
    layoutlm-base-uncased

    layoutlm-base-uncased

    Multimodal Transformer for document image understanding and layout

    ...This base version has 113 million parameters and is pre-trained on 11 million documents from the IIT-CDIP dataset. LayoutLM enables better performance in tasks where the spatial arrangement of text plays a crucial role. The model uses a standard BERT-like architecture but enriches input with 2D positional embeddings. It achieves state-of-the-art results in form understanding and information extraction benchmarks. This model is particularly useful for document AI applications like document classification, question answering, and named entity recognition.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 15
    Qwen2.5-VL-3B-Instruct

    Qwen2.5-VL-3B-Instruct

    Qwen2.5-VL-3B-Instruct: Multimodal model for chat, vision & video

    ...The model can serve as an intelligent visual agent capable of interacting with digital interfaces and understanding long-form videos by dynamically sampling resolution and frame rate. It uses a SwiGLU and RMSNorm-enhanced ViT architecture and introduces mRoPE updates for robust temporal and spatial understanding. The model supports flexible image input (file path, URL, base64) and outputs structured responses like bounding boxes or JSON, making it highly versatile in commercial and research settings. It excels in a wide range of benchmarks such as DocVQA, InfoVQA, and AndroidWorld control tasks.
    Downloads: 0 This Week
    Last Update:
    See Project
  • Previous
  • You're on page 1
  • Next
Auth0 Logo