Alternatives to Azure AI Custom Vision

Compare Azure AI Custom Vision alternatives for your business or organization using the curated list below. SourceForge ranks the best alternatives to Azure AI Custom Vision in 2026. Compare features, ratings, user reviews, pricing, and more from Azure AI Custom Vision competitors and alternatives in order to make an informed decision for your business.

  • 1
    Google Cloud Vision AI
    Derive insights from your images in the cloud or at the edge with AutoML Vision or use pre-trained Vision API models to detect emotion, understand text, and more. Google Cloud offers two computer vision products that use machine learning to help you understand your images with industry-leading prediction accuracy. Automate the training of your own custom machine learning models. Simply upload images and train custom image models with AutoML Vision’s easy-to-use graphical interface; optimize your models for accuracy, latency, and size; and export them to your application in the cloud, or to an array of devices at the edge. Google Cloud’s Vision API offers powerful pre-trained machine learning models through REST and RPC APIs. Assign labels to images and quickly classify them into millions of predefined categories. Detect objects and faces, read printed and handwritten text, and build valuable metadata into your image catalog.
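    As a rough illustration of the pre-trained Vision API flow described above, the sketch below builds a `LABEL_DETECTION` request for the documented `images:annotate` REST endpoint. The API key and image URL are placeholders, and the request is not sent here; treat this as a sketch, not the definitive client.

```python
import json
import urllib.request

# Documented REST endpoint for the Cloud Vision API.
VISION_ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"

def build_annotate_request(image_uri, max_labels=10):
    """Build the JSON body for a LABEL_DETECTION request."""
    return {
        "requests": [{
            "image": {"source": {"imageUri": image_uri}},
            "features": [{"type": "LABEL_DETECTION", "maxResults": max_labels}],
        }]
    }

def annotate(image_uri, api_key):
    """Send the request; requires a valid API key, so it is not invoked here."""
    body = json.dumps(build_annotate_request(image_uri)).encode("utf-8")
    req = urllib.request.Request(
        f"{VISION_ENDPOINT}?key={api_key}",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

    The same endpoint accepts other feature types (e.g. `TEXT_DETECTION`, `FACE_DETECTION`) by swapping the `type` field.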
  • 2
    Ango Hub (iMerit)

    Ango Hub is a quality-focused, enterprise-ready data annotation platform for AI teams, available on cloud and on-premise. It supports computer vision, medical imaging, NLP, audio, video, and 3D point cloud annotation, powering use cases from autonomous driving and robotics to healthcare AI. Built for AI fine-tuning, RLHF, LLM evaluation, and human-in-the-loop workflows, Ango Hub boosts throughput with automation, model-assisted pre-labeling, and customizable QA while maintaining accuracy. Features include centralized instructions, review pipelines, issue tracking, and consensus across up to 30 annotators. With nearly twenty labeling tools, such as rotated bounding boxes, label relations, nested conditional questions, and table-based labeling, it supports both simple and complex projects. It also enables annotation pipelines for chain-of-thought reasoning and next-gen LLM training, and offers enterprise-grade security with HIPAA compliance, SOC 2 certification, and role-based access controls.
  • 3
    Dataloop AI

    Manage unstructured data and pipelines to develop AI solutions at amazing speed. Enterprise-grade data platform for vision AI. Dataloop is a one-stop shop for building and deploying powerful computer vision pipelines: data labeling, automating data ops, customizing production pipelines, and weaving humans-in-the-loop for data validation. Our vision is to make machine learning-based systems accessible, affordable and scalable for all. Explore and analyze vast quantities of unstructured data from diverse sources. Rely on automated preprocessing and embeddings to identify similarities and find the data you need. Curate, version, clean, and route your data to wherever it’s needed to create exceptional AI applications.
  • 4
    Roboflow

    Roboflow has everything you need to build and deploy computer vision models. Connect Roboflow at any step in your pipeline with APIs and SDKs, or use the end-to-end interface to automate the entire process from image to inference. Whether you’re in need of data labeling, model training, or model deployment, Roboflow gives you building blocks to bring custom computer vision solutions to your business.
  • 5
    Ailiverse NeuCore
    Build & scale with ease. With NeuCore you can develop, train, and deploy your computer vision model in a few minutes and scale it to millions. A one-stop platform that manages the model lifecycle, including development, training, deployment, and maintenance. Advanced data encryption is applied to protect your information at all stages of the process, from training to inference. Fully integrable vision AI models fit easily into your existing workflows and systems, or even edge devices. Seamless scalability accommodates your growing business needs and evolving requirements. Segmentation divides an image into segments of the different objects within it. OCR extracts text from images, making it machine-readable; this model also works on handwriting. With NeuCore, building computer vision models is as easy as drag-and-drop and one click. For more customization, advanced users can access provided code scripts and follow tutorial videos.
  • 6
    Eyewey

    Train your own models, get access to pre-trained computer vision models and app templates, and learn how to create AI apps or solve a business problem using computer vision in a couple of hours. Start creating your own dataset for detection by adding images of the object you need to train on. You can add up to 5,000 images per dataset. After images are added to your dataset, they are pushed automatically into training, and you will be notified once the model is finished training. You can simply download your model to be used for detection, or integrate it with our pre-existing app templates for quick coding. Our mobile app, available on both Android and iOS, uses computer vision to help people with complete blindness in their day-to-day lives. It can alert users to hazardous objects or signs, detect common objects, recognize text as well as currencies, and understand basic scenarios through deep learning.
    Starting Price: $6.67 per month
  • 7
    AI Verse

    When real-life data capture is challenging, we generate diverse, fully labeled image datasets. Our procedural technology ensures the highest quality, unbiased, labeled synthetic datasets that will improve your computer vision model’s accuracy. AI Verse empowers users with full control over scene parameters, ensuring you can fine-tune the environments for unlimited image generation, giving you an edge in the competitive landscape of computer vision development.
  • 8
    Hive Data
    Create training datasets for computer vision models with our fully managed solution. We believe that data labeling is the most important factor in building effective deep learning models. We are committed to being the field's leading data labeling platform and helping companies take full advantage of AI's capabilities. Organize your media with discrete categories. Identify items of interest with one or many bounding boxes. Like bounding boxes, but with additional precision. Annotate objects with accurate width, depth, and height. Classify each pixel of an image. Mark individual points in an image. Annotate straight lines in an image. Measure the yaw, pitch, and roll of an item of interest. Annotate timestamps in video and audio content. Annotate freeform lines in an image.
    Starting Price: $25 per 1,000 annotations
  • 9
    Manot

    Your insight management platform for computer vision model performance. Pinpoint precisely where, how, and why models fail, bridging the gap between product managers and engineers through actionable insights. Manot provides an automated and continuous feedback loop for product managers to effectively communicate with engineering teams. Manot's simple user interface allows both technical and non-technical team members to benefit from the platform. Manot is designed with product managers in mind. Our platform provides actionable insights in the form of images pinpointing how, where, and why your model will perform poorly.
  • 10
    PaliGemma 2
    PaliGemma 2, the next evolution in tunable vision-language models, builds upon the performant Gemma 2 models, adding the power of vision and making it easier than ever to fine-tune for exceptional performance. With PaliGemma 2, these models can see, understand, and interact with visual input, opening up a world of new possibilities. It offers scalable performance with multiple model sizes (3B, 10B, 28B parameters) and resolutions (224px, 448px, 896px). PaliGemma 2 generates detailed, contextually relevant captions for images, going beyond simple object identification to describe actions, emotions, and the overall narrative of the scene. Our research demonstrates leading performance in chemical formula recognition, music score recognition, spatial reasoning, and chest X-ray report generation, as detailed in the technical report. Upgrading to PaliGemma 2 is a breeze for existing PaliGemma users.
  • 11
    Qwen2.5-VL (Alibaba)

    Qwen2.5-VL is the latest vision-language model from the Qwen series, representing a significant advancement over its predecessor, Qwen2-VL. This model excels in visual understanding, capable of recognizing a wide array of objects, including text, charts, icons, graphics, and layouts within images. It functions as a visual agent, capable of reasoning and dynamically directing tools, enabling applications such as computer and phone usage. Qwen2.5-VL can comprehend videos exceeding one hour in length and can pinpoint relevant segments within them. Additionally, it accurately localizes objects in images by generating bounding boxes or points and provides stable JSON outputs for coordinates and attributes. The model also supports structured outputs for data like scanned invoices, forms, and tables, benefiting sectors such as finance and commerce. Available in base and instruct versions across 3B, 7B, and 72B sizes, Qwen2.5-VL is accessible through platforms like Hugging Face and ModelScope.
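    As a hedged illustration of the JSON grounding output mentioned above, the helper below parses the list of bounding boxes Qwen2.5-VL typically emits for localization prompts. The `bbox_2d`/`label` keys and the optional markdown-fence wrapping are assumptions about the common output shape, not a guaranteed format.

```python
import json

def parse_grounding_output(model_text):
    """Parse the JSON list of detections from a Qwen2.5-VL grounding prompt.

    Assumes the common shape [{"bbox_2d": [x1, y1, x2, y2], "label": "..."}];
    the exact keys may vary with the prompt."""
    text = model_text.strip()
    # The model may wrap the JSON in a ```json fence; strip it if present.
    if text.startswith("```"):
        text = text.strip("`")
        if text.startswith("json"):
            text = text[len("json"):]
    items = json.loads(text)
    return [(tuple(item["bbox_2d"]), item.get("label", "")) for item in items]
```

    A post-processing step like this turns the model's text output into coordinates your application can draw or filter on.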
  • 12
    Qwen2-VL (Alibaba)

    Qwen2-VL is the latest version of the vision-language models based on Qwen2 in the Qwen model family. Compared with Qwen-VL, Qwen2-VL adds: SoTA understanding of images of various resolutions and ratios, achieving state-of-the-art performance on visual understanding benchmarks including MathVista, DocVQA, RealWorldQA, MTVQA, etc.; understanding of videos over 20 minutes long for high-quality video-based question answering, dialog, and content creation; agent capabilities to operate mobile phones, robots, and other devices, using complex reasoning and decision making to act automatically based on the visual environment and text instructions; and multilingual support, understanding texts in many languages beyond English and Chinese inside images.
  • 13
    Supervisely

    The leading platform for the entire computer vision lifecycle. Iterate from image annotation to accurate neural networks 10x faster. With our best-in-class data labeling tools, transform your images, videos, and 3D point clouds into high-quality training data. Train your models, track experiments, visualize and continuously improve model predictions, and build custom solutions within a single environment. Our self-hosted solution guarantees data privacy, powerful customization capabilities, and easy integration into your technology stack. A turnkey solution for computer vision: multi-format data annotation and management, quality control at scale, and neural network training in an end-to-end platform. Inspired by professional video editing software and created by data scientists for data scientists, it is the most powerful video labeling tool for machine learning and more.
  • 14
    GPT-4V (Vision)
    GPT-4 with vision (GPT-4V) enables users to instruct GPT-4 to analyze image inputs provided by the user, and is the latest capability we are making broadly available. Incorporating additional modalities (such as image inputs) into large language models (LLMs) is viewed by some as a key frontier in artificial intelligence research and development. Multimodal LLMs offer the possibility of expanding the impact of language-only systems with novel interfaces and capabilities, enabling them to solve new tasks and provide novel experiences for their users. In this system card, we analyze the safety properties of GPT-4V. Our work on safety for GPT-4V builds on the work done for GPT-4 and here we dive deeper into the evaluations, preparation, and mitigation work done specifically for image inputs.
  • 15
    IBM Maximo Visual Inspection
    IBM Maximo Visual Inspection puts the power of computer vision AI capabilities into the hands of your quality control and inspection teams. It makes computer vision, deep learning, and automation more accessible to your technicians as it’s an intuitive toolset for labeling, training, and deploying artificial intelligence vision models. Built for easy and rapid deployment, simply train your model using our drag-and-drop visual user interface or import a custom model, and you’re ready to activate when and where you need it using mobile and edge devices. With IBM Maximo Visual Inspection, you can create your own detect and correct solution, with self-learning machine algorithms. Watch the demo below to understand how easy it is to automate your inspection processes with visual inspection tools.
  • 16
    SolVision (Solomon)

    SolVision is an advanced AI vision system developed by Solomon 3D, designed to enhance industrial automation through rapid and accurate visual inspections. Leveraging Solomon’s proprietary rapid AI model training technology, SolVision enables users to train AI models in minutes, significantly reducing setup time compared to traditional systems. It excels in various applications, including defect detection, item classification, optical character recognition, and presence/absence checks, making it suitable for industries such as manufacturing, food & beverage, textiles, and electronics. A standout feature is its ability to learn from as few as 1–5 image samples, streamlining the training process and minimizing the need for extensive data annotation. SolVision's intuitive user interface allows for simultaneous labeling of multiple defect types, facilitating complex classification tasks.
  • 17
    Strong Analytics

    Our platforms provide a trusted foundation upon which to design, build, and deploy custom machine learning and artificial intelligence solutions. Build next-best-action applications that learn, adapt, and optimize using reinforcement learning-based algorithms. Custom, continuously improving deep learning vision models solve your unique challenges. Predict the future using state-of-the-art forecasts. Enable smarter decisions throughout your organization with cloud-based tools to monitor and analyze. The process of taking a modern machine learning application from research and ad hoc code to a robust, scalable platform remains a key challenge for experienced data science and engineering teams. Strong ML simplifies this process with a complete suite of tools to manage, deploy, and monitor your machine learning applications.
  • 18
    Rosepetal AI

    Rosepetal AI is an innovative technology company specializing in advanced artificial vision and deep-learning solutions designed specifically for industrial quality control. Our platform integrates dataset handling, automated labelling and training of adaptive neural networks, enabling real-time defect detection without requiring advanced technical expertise. This intuitive, no-code SaaS solution democratizes access to sophisticated AI, significantly enhancing efficiency, reducing waste, and driving operational excellence across multiple industries such as automotive, food processing, pharmaceuticals, plastics, and electronics. The unique strength of Rosepetal AI lies in its dynamic adaptability and scalability. Our system allows industrial companies to quickly deploy robust AI models directly onto their production lines, continuously adjusting to new product variations and emerging defects. This capability ensures consistent quality and minimizes downtime.
  • 19
    Cloneable

    Cloneable packs sophisticated logic into an incredibly easy-to-use, no-code builder to develop custom, deep-tech applications compatible with any device. Cloneable integrates deep tech with your unique business logic, so you can create and deploy tailored apps to any edge device. Apps can be built in minutes, making it perfect for non-technical audiences to make instant process changes and for engineers who want to rapidly develop and iterate on complex field tools. Launch, update and test your AI and computer vision models on any device (phone, IoT, cloud, robot). Apps are instantly deployable from the Cloneable builder. Bring your own model or build from one of our templates to move any data collection process to the edge. Cloneable was built with unlimited flexibility, so you can count, measure, inspect, and track assets across any location. Intelligent apps can digitize manual processes, scale human expertise, increase transparency, improve auditability, and much more.
  • 20
    Datature

    Datature is a comprehensive, end-to-end, no-code computer vision and MLOps platform that simplifies the entire deep-learning lifecycle by letting users manage data, annotate images and videos, train models, evaluate performance, and deploy AI vision solutions, all within one unified environment without coding. Its intuitive visual interface and workflow tools guide you through dataset onboarding and annotation (including bounding boxes, segmentation, and advanced labeling), let you build automated training pipelines, monitor model training, and assess model accuracy with rich performance analytics, and then deploy models via API or for edge use so trained models can be used in real-world applications. Designed to democratize access to AI vision, Datature accelerates project timelines by reducing manual coding and debugging, supports collaboration across teams, and accommodates tasks like object detection, classification, semantic segmentation, and video analysis.
  • 21
    Florence-2 (Microsoft)

    Florence-2-large is an advanced vision foundation model developed by Microsoft, capable of handling a wide variety of vision and vision-language tasks, such as captioning, object detection, segmentation, and OCR. Built with a sequence-to-sequence architecture, it uses the FLD-5B dataset containing over 5 billion annotations and 126 million images to master multi-task learning. Florence-2-large excels in both zero-shot and fine-tuned settings, providing high-quality results with minimal training. The model supports tasks including detailed captioning, object detection, and dense region captioning, and can process images with text prompts to generate relevant responses. It offers great flexibility by handling diverse vision-related tasks through prompt-based approaches, making it a competitive tool in AI-powered visual tasks. The model is available on Hugging Face with pre-trained weights, enabling users to quickly get started with image processing and task execution.
  • 22
    Clarifai

    Clarifai is a leading AI platform for modeling image, video, text and audio data at scale. Our platform combines computer vision, natural language processing and audio recognition as building blocks for developing better, faster and stronger AI. We help our customers create innovative solutions for visual search, content moderation, aerial surveillance, visual inspection, intelligent document analysis, and more. The platform comes with the broadest repository of pre-trained, out-of-the-box AI models built with millions of inputs and context. Our models give you a head start for extending your own custom AI models. Clarifai Community builds upon this and offers 1000s of pre-trained models and workflows from Clarifai and other leading AI builders. Users can build and share models with other community members. Founded in 2013 by Matt Zeiler, Ph.D., Clarifai has been recognized by leading analysts, IDC, Forrester and Gartner, as a leading computer vision AI platform. Visit clarifai.com
  • 23
    Black.ai

    Respond to events and make better decisions with the help of AI and your existing IP camera infrastructure. Cameras are almost exclusively used for security and surveillance purposes. We add cutting-edge machine vision models to unlock a high-impact resource available to your team daily. We help you improve operations for your staff and customers without compromising privacy. No facial recognition, no long-term tracking, no exceptions. Fewer people in the loop: a reliance on staff compiling and watching footage is invasive and unscalable. We help you review only the things that matter, and only at the right time. Black.ai creates a privacy layer that sits between security cameras and operations teams, so you can build a better experience for people without breaching their trust. Black.ai interfaces with your existing cameras using parallel streaming protocols. Our system is installed without additional infrastructure cost or any risk of obstructing operations.
  • 24
    alwaysAI

    alwaysAI provides developers with a simple and flexible way to build, train, and deploy computer vision applications to a wide variety of IoT devices. Select from a catalog of deep learning models or upload your own. Use our flexible and customizable APIs to quickly enable core computer vision services. Quickly prototype, test, and iterate with a variety of camera-enabled ARM-32, ARM-64, and x86 devices. Identify objects in an image by name or classification. Identify and count objects appearing in a real-time video feed. Follow the same object across a series of frames. Find faces or full bodies in a scene to count or track. Locate and define borders around separate objects. Separate key objects in an image from background visuals. Determine human body poses, detect falls, and recognize emotions. Use our model training toolkit to train an object detection model to identify virtually any object. Create a model tailored to your specific use case.
  • 25
    Aya Vision
    Aya Vision is a research model advancing multilingual multimodal AI through innovative synthetic data generation, cross-modal model merging, and a comprehensive benchmark suite. It achieves state-of-the-art performance across 23 languages, surpassing larger models while efficiently addressing data scarcity and catastrophic forgetting, reducing computational overhead by up to 40% via optimized training techniques.
  • 26
    Ultralytics

    Ultralytics offers a full-stack vision-AI platform built around its flagship YOLO model suite that enables teams to train, validate, and deploy computer-vision models with minimal friction. The platform allows you to drag and drop datasets, select from pre-built templates or fine-tune custom models, then export to a wide variety of formats for cloud, edge or mobile deployment. With support for tasks including object detection, instance segmentation, image classification, pose estimation and oriented bounding-box detection, Ultralytics’ models deliver high accuracy and efficiency and are optimized for both embedded devices and large-scale inference. The product also includes Ultralytics HUB, a web-based tool where users can upload their images/videos, train models online, preview results (even on a phone), collaborate with team members, and deploy via an inference API.
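    A minimal sketch of the inference flow described above, using the `ultralytics` Python package's `YOLO` class. The weights file name and confidence threshold are illustrative, the package must be installed separately, and the model call itself is not run here.

```python
def filter_detections(detections, min_conf):
    """Keep (label, confidence) pairs at or above the confidence threshold."""
    return [(label, conf) for label, conf in detections if conf >= min_conf]

def run_yolo(source, weights="yolov8n.pt", min_conf=0.5):
    """Run a pretrained Ultralytics YOLO model on an image or video source.

    Requires `pip install ultralytics`; not invoked here."""
    from ultralytics import YOLO
    model = YOLO(weights)          # downloads/loads the named weights
    results = model(source)[0]     # first (and only) result for a single image
    dets = [(results.names[int(b.cls)], float(b.conf)) for b in results.boxes]
    return filter_detections(dets, min_conf)
```

    The same `YOLO` object also exposes `train()` and `export()` for the training and deployment steps the platform automates.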
  • 27
    Azure AI Content Safety
    Azure AI Content Safety is a content moderation platform that uses AI to keep your content safe. Create better online experiences for everyone with powerful AI models that detect offensive or inappropriate content in text and images quickly and efficiently. Language models analyze multilingual text, in both short and long form, with an understanding of context and semantics. Vision models perform image recognition and detect objects in images using state-of-the-art Florence technology. AI content classifiers identify sexual, violent, hate, and self-harm content with high levels of granularity. Content moderation severity scores indicate the level of content risk on a scale of low to high.
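    A rough sketch of calling the text-analysis route described above via REST. The endpoint and key are placeholders, and the response handling assumes the documented `text:analyze` shape (a `categoriesAnalysis` list of category/severity objects, api-version 2023-10-01); treat this as illustrative rather than authoritative.

```python
import json
import urllib.request

def analyze_text(endpoint, key, text):
    """POST text to the Content Safety text:analyze route.

    Needs a real resource endpoint and key, so it is not invoked here."""
    url = f"{endpoint}/contentsafety/text:analyze?api-version=2023-10-01"
    body = json.dumps({"text": text}).encode("utf-8")
    req = urllib.request.Request(url, data=body, headers={
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "application/json",
    })
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def max_severity(analysis):
    """Return (category, severity) for the worst-scoring category, or None."""
    worst = max(analysis.get("categoriesAnalysis", []),
                key=lambda c: c.get("severity", 0),
                default=None)
    if worst is None:
        return None
    return worst["category"], worst["severity"]
```

    A moderation pipeline would then compare the returned severity against a per-category threshold to decide whether to block, flag, or allow the content.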
  • 28
    LLaVA

    LLaVA (Large Language-and-Vision Assistant) is an innovative multimodal model that integrates a vision encoder with the Vicuna language model to facilitate comprehensive visual and language understanding. Through end-to-end training, LLaVA exhibits impressive chat capabilities, emulating the multimodal functionalities of models like GPT-4. Notably, LLaVA-1.5 has achieved state-of-the-art performance across 11 benchmarks, utilizing publicly available data and completing training in approximately one day on a single 8-A100 node, surpassing methods that rely on billion-scale datasets. The development of LLaVA involved the creation of a multimodal instruction-following dataset, generated using language-only GPT-4. This dataset comprises 158,000 unique language-image instruction-following samples, including conversations, detailed descriptions, and complex reasoning tasks. This data has been instrumental in training LLaVA to perform a wide array of visual and language tasks effectively.
  • 29
    OpenCV

    OpenCV (Open Source Computer Vision Library) is an open-source computer vision and machine learning software library. OpenCV was built to provide a common infrastructure for computer vision applications and to accelerate the use of machine perception in commercial products. Being a BSD-licensed product, OpenCV makes it easy for businesses to utilize and modify the code. The library has more than 2,500 optimized algorithms, which include a comprehensive set of both classic and state-of-the-art computer vision and machine learning algorithms. These algorithms can be used to detect and recognize faces, identify objects, classify human actions in videos, track camera movements, track moving objects, extract 3D models of objects, produce 3D point clouds from stereo cameras, and stitch images together to produce a high-resolution image of an entire scene. They can also find similar images in an image database, remove red eyes from images taken using flash, follow eye movements, recognize scenery, and more.
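    As a small example of the face-detection capability listed above, the sketch below uses OpenCV's bundled Haar cascade, a classic detector shipped with `opencv-python`. The image path is a placeholder, and the `expand_box` helper is an illustrative addition for padding detections, not part of OpenCV itself.

```python
def detect_faces(image_path):
    """Detect faces with OpenCV's bundled frontal-face Haar cascade.

    Requires `pip install opencv-python`; not invoked here."""
    import cv2
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2GRAY)
    # Returns a list of (x, y, w, h) boxes in pixel coordinates.
    return cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

def expand_box(box, margin, width, height):
    """Grow an (x, y, w, h) detection by `margin` pixels, clamped to the image."""
    x, y, w, h = box
    x0, y0 = max(0, x - margin), max(0, y - margin)
    x1, y1 = min(width, x + w + margin), min(height, y + h + margin)
    return x0, y0, x1 - x0, y1 - y0
```

    Padding detections this way is a common preprocessing step before cropping faces for a downstream recognition or classification model.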
  • 30
    VisionSense
    Real-time computer vision and advanced image processing solution that leverages advanced convolutional neural network models. Top applications of the product have been in building management, identity verification and fraud detection, and manufacturing and quality control. Winjit is one of India’s leading technology providers with over a decade of experience in innovating engineering solutions across industries.
  • 31
    inferdo

    Easily integrate our Computer Vision API to add some machine learning magic to your app. At inferdo, we pride ourselves on our ability to offer state-of-the-art, pre-trained deep learning models, and on our ability to efficiently serve them at scale. That means we can pass the savings on to you! Simply provide an image URL to our API and we'll handle the rest. Use our Content Moderation API to flag possibly inappropriate content in your images; this model is trained to detect nudity and NSFW content in images, both real and drawn. Check out our API cost comparisons versus our competitors. Use our Image Labeling API to add semantic labels to your images; this model is trained to classify thousands of unique labels across a wide variety of categories. Use our Face Detection API to detect the location of human faces in your images. Need more information? Then use our Face Details API to detect faces, gender, age, and other facial features.
    Starting Price: $0.0005 per month
  • 32
    DeepSeek-VL (DeepSeek)

    DeepSeek-VL is an open source Vision-Language (VL) model designed for real-world vision and language understanding applications. Our approach is structured around three key dimensions: We strive to ensure our data is diverse, scalable, and extensively covers real-world scenarios, including web screenshots, PDFs, OCR, charts, and knowledge-based content, aiming for a comprehensive representation of practical contexts. Further, we create a use case taxonomy from real user scenarios and construct an instruction tuning dataset accordingly. The fine-tuning with this dataset substantially improves the model's user experience in practical applications. Considering efficiency and the demands of most real-world scenarios, DeepSeek-VL incorporates a hybrid vision encoder that efficiently processes high-resolution images (1024 x 1024), while maintaining a relatively low computational overhead.
  • 33
    Azure AI Services
    Build cutting-edge, market-ready AI applications with out-of-the-box and customizable APIs and models. Quickly infuse generative AI into production workloads using studios, SDKs, and APIs. Gain a competitive edge by building AI apps powered by foundation models, including those from OpenAI, Meta, and Microsoft. Detect and mitigate harmful use with built-in responsible AI, enterprise-grade Azure security, and responsible AI tooling. Build your own copilot and generative AI applications with cutting-edge language and vision models. Retrieve the most relevant data using keyword, vector, and hybrid search. Monitor text and images to detect offensive or inappropriate content. Translate documents and text in real time across more than 100 languages.
  • 34
    Viso Suite

    Viso Suite

    Viso Suite

    Viso Suite is the world’s only end-to-end platform for computer vision. It enables teams to rapidly train, create, deploy, and manage computer vision applications – without writing code from scratch. Use Viso Suite to deliver industry-leading computer vision and real-time deep learning systems with low-code and automated software infrastructure. Traditional development methods, fragmented software tools, and a lack of experienced engineers cost organizations significant time and lead to inefficient, low-performing, and expensive computer vision systems. Build and deploy better computer vision applications faster by abstracting and automating the entire lifecycle with Viso Suite, the all-in-one enterprise vision platform. Collect data for computer vision annotation with Viso Suite. Use automated collection capabilities to gather high-quality training data. Control and secure all data collection. Enable continuous data collection to further improve your AI models.
  • 35
    GeoSpy

    GeoSpy

    GeoSpy

    GeoSpy is an AI-powered platform that transforms pixels into actionable location intelligence by converting low-context photo data into precise GPS location predictions without relying on EXIF data. Trusted by over 1,000 organizations worldwide, GeoSpy offers global coverage, deploying its services in over 120 countries. The platform processes over 200,000 images daily and can scale to billions, providing fast, secure, and accurate geolocation services. GeoSpy Pro, designed for government and law enforcement agencies, integrates advanced AI location models to deliver meter-level accuracy through state-of-the-art computer vision models in an easy-to-use interface. Additionally, GeoSpy has introduced SuperBolt, a new AI model that enhances visual place recognition, offering improved accuracy in geolocation predictions.
  • 36
    Rupert AI

    Rupert AI

    Rupert AI

    Rupert AI envisions a world where marketing is not just about reaching audiences but engaging them in the most personalized and effective way. Our AI-driven solutions are designed to make this vision a reality for businesses of all sizes.
    Key features:
    - AI model training: train your vision model on an object, style, or character.
    - AI workflows: multiple AI workflows for marketing and creative material creation.
    Benefits of AI model training:
    - Custom solutions: train models to recognize specific objects, styles, or characters that match your needs.
    - Higher accuracy: get better results tailored to your unique requirements.
    - Versatility: useful for different industries like design, marketing, and gaming.
    - Faster prototyping: quickly test new ideas and concepts.
    - Brand differentiation: build unique visual styles and assets that stand out.
  • 37
    Amazon Lookout for Vision
    Easily create a machine learning (ML) model to spot anomalies from your live process line with as few as 30 images. Identify visual anomalies in real time to reduce and prevent defects and improve product quality. Prevent unplanned downtime and reduce operational costs by using visual inspection data to spot potential issues and take corrective action. Spot damage to a product’s surface quality, color, and shape during the fabrication and assembly process. Determine what’s missing based on the absence, presence, or placement of objects, like a missing capacitor in a printed circuit board. Detect defects with repeating patterns, such as repeated scratches in the same spot on a silicon wafer. Amazon Lookout for Vision is an ML service that uses computer vision to spot defects in manufactured products at scale. Spot product defects using computer vision to automate quality inspection.
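To make the inspection flow concrete, here is a hedged sketch of consuming a Lookout for Vision DetectAnomalies response. The result shape (`IsAnomalous`, `Confidence`) follows the AWS API, but the rejection threshold is an illustrative policy choice, not an AWS default:

```python
# Decide whether to reject a part based on a DetectAnomalyResult-style dict.
# The 0.80 threshold is a hypothetical quality policy, not an AWS default.
def should_reject(result: dict, min_confidence: float = 0.80) -> bool:
    """Reject a part only when the model is confidently anomalous."""
    return bool(result["IsAnomalous"]) and result["Confidence"] >= min_confidence

sample = {"IsAnomalous": True, "Confidence": 0.93}      # e.g. a scratched wafer
ok_sample = {"IsAnomalous": False, "Confidence": 0.97}  # a clean part
```

In production, the `result` dict would come from the `DetectAnomalyResult` field of a `detect_anomalies` call (for example via the boto3 `lookoutvision` client).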
  • 38
CloudSight API

CloudSight

    Image recognition technology that provides true understanding of your digital media. With our on-device computer vision model, users can expect an average response time of less than 250ms. This is more than 4x faster than using our API and does not require an internet connection. Users can recognize objects in a space by simply scanning their phone around a room, eliminating the need to take individual pictures. This feature is unique to our on-device model. By removing the need for data to leave the end-user device, privacy concerns are virtually eliminated. While our API takes every precaution possible to protect your privacy and data, our on-device model raises the bar on security substantially. Send CloudSight your visual content, and our API will generate a natural language description in response. Filter and categorize images, monitor for inappropriate content, and automatically assign labels for all of your digital media.
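As a rough illustration of the API flow described above, the sketch below builds (but does not send) a captioning request. Treat the endpoint URL, `remote_image_url` field, and `CloudSight <key>` auth scheme as assumptions to verify against CloudSight's current documentation:

```python
# Hypothetical sketch of a CloudSight image-captioning request; the endpoint
# and auth header format are assumptions, not verified against current docs.
import json
import urllib.request

def build_caption_request(image_url: str, api_key: str) -> urllib.request.Request:
    payload = json.dumps({"remote_image_url": image_url, "locale": "en-US"})
    return urllib.request.Request(
        "https://api.cloudsight.ai/v1/images",         # assumed endpoint
        data=payload.encode("utf-8"),
        headers={
            "Authorization": f"CloudSight {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_caption_request("https://example.com/photo.jpg", "YOUR_API_KEY")
```

Sending the request would return a natural-language description of the image, which you could then use for filtering, categorization, or labeling.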
  • 39
Voxel51

FiftyOne by Voxel51 - the most powerful visual AI and computer vision data platform. Without the right data, even the smartest AI models fail. FiftyOne gives machine learning engineers the power to deeply understand and evaluate their visual datasets—across images, videos, 3D point clouds, geospatial, and medical data. With over 2.8 million open source installs and customers like Walmart, GM, Bosch, Medtronic, and the University of Michigan Health, FiftyOne is an indispensable tool for building computer vision systems that work in the real world, not just in the lab. FiftyOne streamlines visual data curation and model analysis, simplifying the labor-intensive work of visualizing datasets and surfacing insights during curation and model refinement, a major challenge in large-scale data pipelines with billions of samples. Proven impact with FiftyOne:
- ⬆️ 30% increase in model accuracy
- ⏱️ 5+ months of development time saved
- 📈 30% boost in productivity
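A plain-Python sketch of the kind of curation query FiftyOne streamlines: surface samples whose prediction disagrees with the ground truth at high confidence, which are prime candidates for label review. (This is illustrative logic, not FiftyOne's actual API, and the sample records are hypothetical.)

```python
# Find confidently-wrong samples worth a human look during dataset curation.
def samples_to_review(samples: list[dict], min_confidence: float = 0.9) -> list[str]:
    return [
        s["id"]
        for s in samples
        if s["prediction"] != s["ground_truth"] and s["confidence"] >= min_confidence
    ]

dataset = [
    {"id": "img-001", "ground_truth": "cat", "prediction": "dog", "confidence": 0.95},
    {"id": "img-002", "ground_truth": "cat", "prediction": "cat", "confidence": 0.99},
    {"id": "img-003", "ground_truth": "car", "prediction": "truck", "confidence": 0.40},
]
```

In FiftyOne itself, queries like this run over real datasets with an interactive app for visualizing the flagged samples.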
  • 40
Pixtral Large

Mistral AI

    Pixtral Large is a 124-billion-parameter open-weight multimodal model developed by Mistral AI, building upon their Mistral Large 2 architecture. It integrates a 123-billion-parameter multimodal decoder with a 1-billion-parameter vision encoder, enabling advanced understanding of documents, charts, and natural images while maintaining leading text comprehension capabilities. With a context window of 128,000 tokens, Pixtral Large can process at least 30 high-resolution images simultaneously. The model has demonstrated state-of-the-art performance on benchmarks such as MathVista, DocVQA, and VQAv2, surpassing models like GPT-4o and Gemini-1.5 Pro. Pixtral Large is available under the Mistral Research License for research and educational use, and under the Mistral Commercial License for commercial applications.
  • 41
    Intel Geti
    Intel® Geti™ software simplifies the process of building computer vision models by enabling fast, accurate data annotation and training. With capabilities like smart annotations, active learning, and task chaining, users can create models for classification, object detection, and anomaly detection without writing additional code. The platform also provides built-in optimizations, hyperparameter tuning, and production-ready models optimized for Intel’s OpenVINO™ toolkit. Designed to support collaboration, Geti™ helps teams streamline model development, from data labeling to model deployment.
  • 42
    IBM Video Explorer Platform
    Video Explorer Platform is a full functionality platform for video analytics (computer vision) application development and deployment. It provides an application framework that could be configured and customized to adapt to customers’ business requirements and further integrate with customers’ business systems. It could enable an enterprise to land a video analytics solution in a very short time. Co-worked with another asset the IBM Visual Builder (IVB), the customer could benefit from one-station video analytics application development and deployment, which include image labeling, image augmentation, training, validation, and publishing to Video Explorer Platform. Provides a full functionality platform of video analytics application development and deployment, including data source management (video devices, images, offline video materials), real-time video browsing, image / slip extraction, storage, model mapping, event processing rule configuration, etc.
  • 43
EyePop.ai

    Streamlining visual data analysis for easy, accessible AI-powered insights, regardless of industry or technical knowledge. Build your tailored AI application with EyePop. Embark on your project journey today, leveraging our advanced computer vision technology. Discover the untapped potential in your images and videos. Our platform delivers deep insights into your media, enhancing user experiences and boosting engagement. Building a custom application is a breeze with our intuitive no/low code platform. Anyone can easily create Pops that work with existing images, videos, or even real-time streams. Develop powerful, tailored computer vision solutions and make the most of your visual data. Empower decision-making with AI-driven insights, revolutionizing computer vision interaction. Build custom computer vision apps effortlessly with EyePop.ai’s no/low code platform for all skill levels.
  • 44
    GLM-5V-Turbo
    GLM-5V-Turbo is a multimodal coding foundation model designed for vision-based coding tasks, capable of natively processing inputs such as images, video, text, and files while producing text outputs. It is optimized for agent workflows, enabling a full loop of understanding environments, planning actions, and executing tasks, and integrates seamlessly with agent frameworks like Claude Code and OpenClaw. It supports long-context interactions with a context length of 200K tokens and up to 128K output tokens, making it suitable for complex, long-horizon tasks. It offers multiple thinking modes for different scenarios, strong vision comprehension across images and video, real-time streaming output for improved interaction, and advanced function-calling capabilities for integrating external tools. It also includes context caching to enhance performance in extended conversations. In practical use, it can reconstruct frontend projects from design mockups.
  • 45
Ray2

Luma AI

Ray2 is a large-scale video generative model capable of creating realistic visuals with natural, coherent motion. It has a strong understanding of text instructions and can take images and video as input. Ray2 exhibits advanced capabilities as a result of being trained on Luma’s new multi-modal architecture scaled to 10x the compute of Ray1. Ray2 marks the beginning of a new generation of video models capable of producing fast coherent motion, ultra-realistic details, and logical event sequences. This increases the success rate of usable generations and makes videos generated by Ray2 substantially more production-ready. Text-to-video generation is available in Ray2 now, with image-to-video, video-to-video, and editing capabilities coming soon. Ray2 brings a whole new level of motion fidelity; smooth, cinematic, and jaw-dropping, it transforms your vision into reality. Tell your story with stunning, cinematic visuals, and craft breathtaking scenes with precise camera movements.
    Starting Price: $9.99 per month
  • 46
Flexible Vision

Flexible Vision is an AI machine vision software and hardware solution that enables your team to quickly and easily solve difficult visual inspections. The cloud portal allows your teams to collaborate and share vision inspection programs across factory floors. Collect 5-10 images of good parts and bad parts; our software will optionally increase this sample size with augmentation. With the click of a button, model creation begins, and your model will be ready for production in a matter of minutes. Your AI model will automatically deploy and be ready for validation. Download or sync the model to as many on-prem production lines as needed. Our high-speed industrial processors quickly process your images. Simply select the AI model from a dropdown and watch the detections live on screen. Our systems are designed for either manual inspection stations or incorporation into traditional factory automation, and are IO- and fieldbus-compatible.
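The collect-then-augment step above can be sketched in plain Python. The transform names (`hflip`, `rotate90`, `brightness`) are hypothetical examples of the kinds of augmentations such tools apply, not Flexible Vision's actual options:

```python
# Illustrative sketch: expand a handful of seed inspection images with
# augmentation before training. Transform names here are hypothetical.
def augment(images: list[str], transforms: list[str]) -> list[str]:
    variants = list(images)  # keep the originals
    for img in images:
        variants.extend(f"{img}+{t}" for t in transforms)
    return variants

seed = ["good_01.jpg", "good_02.jpg", "bad_01.jpg"]
expanded = augment(seed, ["hflip", "rotate90", "brightness"])
```

Three seed images with three transforms each yield twelve training samples, which is how a 5-10 image capture session can grow into a usable training set.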
  • 47
    Palmyra LLM
    Palmyra is a suite of Large Language Models (LLMs) engineered for precise, dependable performance in enterprise applications. These models excel in tasks such as question-answering, image analysis, and support for over 30 languages, with fine-tuning available for industries like healthcare and finance. Notably, Palmyra models have achieved top rankings in benchmarks like Stanford HELM and PubMedQA, and Palmyra-Fin is the first model to pass the CFA Level III exam. Writer ensures data privacy by not using client data to train or modify their models, adopting a zero data retention policy. The Palmyra family includes specialized models such as Palmyra X 004, featuring tool-calling capabilities; Palmyra Med, tailored for healthcare; Palmyra Fin, designed for finance; and Palmyra Vision, which offers advanced image and video processing. These models are available through Writer's full-stack generative AI platform, which integrates graph-based Retrieval Augmented Generation (RAG).
    Starting Price: $18 per month
  • 48
AskUI

    AskUI is an innovative platform that enables AI agents to visually perceive and interact with any computer interface, facilitating seamless automation across various operating systems and applications. Leveraging advanced vision models, AskUI's PTA-1 prompt-to-action model allows users to execute AI-driven actions on Windows, macOS, Linux, and mobile devices without the need for jailbreaking. This technology is particularly beneficial for tasks such as desktop and mobile automation, visual testing, and document or data processing. By integrating with tools like Jira, Jenkins, GitLab, and Docker, AskUI enhances workflow efficiency and reduces the burden on developers. Companies like Deutsche Bahn have reported significant improvements in internal processes, citing over a 90% increase in efficiency through the use of AskUI's test automation capabilities.
  • 49
DecentAI

Catena Labs

DecentAI provides:
- Anonymized mobile access to hundreds of generative AI models: explore models for text, image, audio, and vision.
- Model Mixes and flexible model routing: mix and match models, choose specific favorites, or let DecentAI select the best for you. If one model is slow or unavailable, DecentAI seamlessly switches to another provider, ensuring a smooth and efficient experience.
- Privacy-first design: chats are stored on your device, not on our servers.
- AI internet access: allow models to pull in the latest information through anonymized web search.
Soon, you’ll be able to run models locally on your device and connect your own private models.
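The fallback routing described above can be sketched as: try the preferred model first, and fall through to the next provider on failure. The provider names and call interface here are hypothetical, not DecentAI's actual internals:

```python
# Illustrative provider-fallback routing: return the first successful response.
from typing import Callable

def route(prompt: str, providers: list[tuple[str, Callable[[str], str]]]) -> tuple[str, str]:
    """Return (provider_name, response) from the first provider that succeeds."""
    last_error = None
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # model slow or unavailable: try the next one
            last_error = exc
    raise RuntimeError("all providers failed") from last_error

def flaky(prompt: str) -> str:
    raise TimeoutError("model unavailable")

def stable(prompt: str) -> str:
    return f"echo: {prompt}"

name, reply = route("hello", [("model-a", flaky), ("model-b", stable)])
```

Here `model-a` times out, so the router transparently falls back to `model-b` and the caller never sees the failure.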
  • 50
GPT-4o

OpenAI

GPT-4o (“o” for “omni”) is a step towards much more natural human-computer interaction—it accepts as input any combination of text, audio, image, and video and generates any combination of text, audio, and image outputs. It can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in a conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models.
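A minimal sketch of a multimodal GPT-4o request body, following the OpenAI Chat Completions message format for mixed text and image content; the prompt and image bytes are placeholders, and sending the request would additionally require the OpenAI SDK and an API key:

```python
# Build a Chat Completions request body mixing text and a base64-encoded image.
import base64

def build_vision_request(prompt: str, image_bytes: bytes) -> dict:
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "gpt-4o",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    }

body = build_vision_request("Describe this image.", b"\xff\xd8\xff")  # placeholder JPEG bytes
```

The same `content` list can carry multiple image parts alongside the text, which is how GPT-4o's mixed text-and-image input is expressed over the API.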