CPU/GPU inference server for Hugging Face transformer models
...Then, if you spend some time, you can build something on top of ONNX Runtime and Triton inference server. You will usually get 2X to 4X faster inference compared to vanilla PyTorch. It's cool! However, if you want best-in-class performance on GPU, there is only one possible combination: Nvidia TensorRT and Triton. You will usually get 5X faster inference compared to vanilla PyTorch.
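Speedup figures like the ones quoted above depend heavily on model, hardware, and batch size, so it is usually worth measuring them on your own setup. The following is a minimal sketch of a latency comparison, not the project's own benchmark code: `median_latency` and `speedup` are hypothetical helper names, and the two lambdas are stand-ins for a baseline PyTorch call and an optimized backend call.

```python
import statistics
import time


def median_latency(fn, runs=20, warmup=3):
    """Return the median wall-clock time in seconds of calling fn()."""
    # Warm-up calls are discarded so one-off setup costs do not skew the result.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(baseline_seconds, optimized_seconds):
    """Speedup factor: how many times faster the optimized path is."""
    return baseline_seconds / optimized_seconds


# Stand-in workloads (assumptions): replace them with real inference calls.
baseline_call = lambda: sum(i * i for i in range(20000))
optimized_call = lambda: sum(i * i for i in range(5000))

factor = speedup(median_latency(baseline_call), median_latency(optimized_call))
print(f"speedup: {factor:.1f}X")
```

The median is used rather than the mean because a few slow outlier runs would otherwise distort the comparison. For GPU inference, each timed call would also need to wait for the device to finish before the clock stops.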
BGE-Large v1.5: High-accuracy English embedding model for retrieval
BAAI/bge-large-en-v1.5 is a powerful English sentence embedding model designed by the Beijing Academy of Artificial Intelligence to enhance retrieval-augmented language model systems. It uses a BERT-based architecture fine-tuned to produce high-quality dense vector representations optimized for sentence similarity, search, and retrieval. This model is part of the BGE (BAAI General Embedding) family and delivers improved similarity distribution and state-of-the-art results on the MTEB...
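To illustrate how dense vectors are used for similarity and retrieval, here is a minimal sketch that ranks documents by cosine similarity to a query. The three-dimensional vectors are made-up toy values, not output from the model, and `cosine_similarity` and `rank_by_similarity` are hypothetical helper names; in practice the vectors would come from an embedding model such as the one described above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat a zero vector as having no similarity.
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return (doc_index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy embeddings (assumption): real embeddings have far more dimensions.
query = [1.0, 0.0, 1.0]
docs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [1.0, 0.1, 0.9],  # nearly parallel to the query
    [0.5, 0.5, 0.5],  # partially aligned
]

ranking = rank_by_similarity(query, docs)
print([index for index, _ in ranking])  # [1, 2, 0]
```

If the embeddings are normalized to unit length beforehand, the cosine similarity reduces to a plain dot product, which is the form most vector search libraries rely on.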