Showing 12 open source projects for "parallel corpus"

View related business solutions
  • MongoDB Atlas runs apps anywhere Icon
    MongoDB Atlas runs apps anywhere

    Deploy in 115+ regions with the modern database for every enterprise.

    MongoDB Atlas gives you the freedom to build and run modern applications anywhere—across AWS, Azure, and Google Cloud. With global availability in over 115 regions, Atlas lets you deploy close to your users, meet compliance needs, and scale with confidence across any geography.
    Start Free
  • Full-stack observability with actually useful AI | Grafana Cloud Icon
    Full-stack observability with actually useful AI | Grafana Cloud

    Our generous forever free tier includes the full platform, including the AI Assistant, for 3 users with 10k metrics, 50GB logs, and 50GB traces.

    Built on open standards like Prometheus and OpenTelemetry, Grafana Cloud includes Kubernetes Monitoring, Application Observability, Incident Response, plus the AI-powered Grafana Assistant. Get started with our generous free tier today.
    Create free account
  • 1
    Step3-VL-10B

    Step3-VL-10B

    Multimodal model achieving SOTA performance

    Step3-VL-10B is an open-source multimodal foundation model developed by StepFun AI that pushes the boundaries of what compact models can achieve by combining visual and language understanding in a single architecture. Despite having only about 10 billion parameters, it delivers performance that rivals or even surpasses much larger models (10×–20× larger) on a wide range of multimodal benchmarks covering reasoning, perception, and complex tasks, positioning it as one of the most powerful...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 2

    PADIC

    A multilingual Parallel Arabic DIalectal Corpus

    PADIC (Parallel Arabic DIalectal Corpus) is a multi-dialectal corpus built in the framework of the National Research Project "TORJMAN", led by Scientific and Technical Research Center for the Development of Arabic Language and funded by the Algerian Ministry of Higher Education and Scientific Research. PADIC is composed of 6 dialects: two Algerian dialects (Algiers and Annaba cities), Palestinian, Syrian, Tunisian, Moroccan) and MSA.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 3
    The English-Vietnamese Bilingual Corpus (EVBCorpus) is a collection of English and Vietnamese parallel translations and bitexts.
    Downloads: 3 This Week
    Last Update:
    See Project
  • 4
    Osman Arabic Text Readability

    Osman Arabic Text Readability

    Open Source tool for Arabic text readability

    We present OSMAN (Open Source Metric for Measuring Arabic Narratives) - a novel open source Arabic readability metric and tool. The open source Java tool allows users to calculate readability for Arabic text (with and without diacritics). The tool provides methods to split the text into words and sentence, count syllables, Faseeh letters, hard and complex words in addition to adding diacritics (vocalise text). This makes the tool useful for researchers and educators working with Arabic text....
    Downloads: 0 This Week
    Last Update:
    See Project
  • Custom VMs From 1 to 96 vCPUs With 99.95% Uptime Icon
    Custom VMs From 1 to 96 vCPUs With 99.95% Uptime

    General-purpose, compute-optimized, or GPU/TPU-accelerated. Built to your exact specs.

    Live migration and automatic failover keep workloads online through maintenance. One free e2-micro VM every month.
    Try Free
  • 5

    Transml

    Phrase based Statistical Machine Transltion system for English Languag

    ...Statistical Machine Translation (SMT) is a machine translation paradigm where translations are generated on the basis of statistical models whose parameters are derived from the analysis of bilingual text corpora. The SMT is a corpus based approach, where a massive parallel corpus is required for training the SMT systems.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 6
    Various tools for creating annotated parallel corpora including pre-trained tagging and parsing models for various languages, sentence alignment tools and word alignment tools. Uplug also includes a web-based interface for interactive sentence and word alignment and scripts for indexing and querying parallel corpora using the Corpus Work Bench CWB. Download 'uplug-main' first and then add other packages.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 7

    Khmer Automatic Translation

    Khmer-English-Khmer Automatic Translation

    The project attempts to develop a parallel-corpus-based hybrid high quality English-Khmer-English automatic translation system based on statistical analysis and enhanced with part-of-speech analysis.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 8

    English-Khmer S. Machine Translation

    English-Khmer Automatic Statistic Machine Translation (SMT)

    Automatic Machine Translation from English to Khmer project is the first effort in Natural Language Processing field for translating English to Khmer (Cambodian) language. This project uses Domy CE, an open source SMT toolkit, for training parallel corpus and web technologies such as Python, Apache2, HTML, XML, and XSLT for developing web-based application. This project is developed by Ms. Kim Sokphyrum (DU) and Ms. Suos Samak (Jamia), under Supervision of Mr. Javier Sola, a Program Manager at Open Institute (OI), Cambodia, Dr. Vasudha Bhatnagar, an Assistant professor and a Head of Computer Science at University of Delhi (DU), New Delhi, India. and Dr. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 9
    CRFSharp

    CRFSharp

    CRFSharp is a .NET(C#) implementation of Conditional Random Field

    ...CRF#'s mainly algorithm is the same as CRF++ written by Taku Kudo. It encodes model parameters by L-BFGS. Moreover, it has many significant improvement than CRF++, such as totally parallel encoding, optimizing memory usage and so on. Currently, when training corpus, compared with CRF++, CRF# can make full use of multi-core CPUs and only uses very low memory, and memory grow is very smoothly and slowly while amount of training corpus, tags increase. with multi-threads process, CRF# is more suitable for large data and tags training than CRF++ now. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • Build Securely on AWS with Proven Frameworks Icon
    Build Securely on AWS with Proven Frameworks

    Lay a foundation for success with Tested Reference Architectures developed by Fortinet’s experts. Learn more in this white paper.

    Moving to the cloud brings new challenges. How can you manage a larger attack surface while ensuring great network performance? Turn to Fortinet’s Tested Reference Architectures, blueprints for designing and securing cloud environments built by cybersecurity experts. Learn more and explore use cases in this white paper.
    Download Now
  • 10
    Sanchay
    Sanchay is a collection of tools and APIs for language researchers. It has some implementations of NLP algorithms, some flexible APIs, several user friendly annotation interfaces and Sanchay Query Language for language resources.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 11
    This is a project implemented to build a parallel text search system with the main emphasize being placed on searching for the provided query within the shortest possible time. The query will be searched within the GigaWord corpus.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 12
    GigaChat 3 Ultra

    GigaChat 3 Ultra

    High-performance MoE model with MLA, MTP, and multilingual reasoning

    ...It leverages Multi-head Latent Attention to compress the KV cache into latent vectors, dramatically reducing memory demand and improving inference speed at scale. The model also employs Multi-Token Prediction, enabling multi-step token generation in a single pass for up to 40% faster output through speculative and parallel decoding techniques. Its training corpus incorporates ten languages, enriched with books, academic sources, code datasets, mathematical tasks, and more than 5.5 trillion tokens of high-quality synthetic data. This combination significantly boosts reasoning, coding, and multilingual performance across modern benchmarks. Designed for high-performance deployment, GigaChat 3 Ultra supports major inference engines and offers optimized BF16 and FP8 execution paths for cluster-grade hardware.
    Downloads: 0 This Week
    Last Update:
    See Project
  • Previous
  • You're on page 1
  • Next
MongoDB Logo MongoDB