Showing 10 open source projects for "byte on"

  • 1
    minbpe

    Minimal, clean code for the Byte Pair Encoding (BPE) algorithm

    minbpe is a minimal, clean implementation of byte-level Byte Pair Encoding (BPE), the tokenization approach widely used in modern language models. It operates on UTF-8 encoded bytes rather than Unicode characters, which makes it robust to arbitrary text inputs and avoids needing a language-specific character vocabulary. The repository is structured as a teaching-oriented implementation that shows how to train a tokenizer by learning merge rules, then apply those merges to encode text into token IDs and decode tokens back into text. ...
    Downloads: 0 This Week
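The train, encode, and decode steps described for minbpe can be illustrated with a minimal byte-level BPE sketch. This is not minbpe's actual code or API: the function names (`train`, `encode`, `decode`, `merge_pair`) are hypothetical, and the sketch assumes new token IDs start at 256 and that encoding simply replays the merges in the order they were learned.

```python
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train(text, num_merges):
    """Learn up to `num_merges` merge rules from the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))      # start from raw bytes (IDs 0..255)
    merges = {}                           # (left_id, right_id) -> new_id
    for step in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break                         # nothing left to merge
        pair = max(pairs, key=pairs.get)  # most frequent adjacent pair
        new_id = 256 + step               # assumption: new IDs follow the 256 byte IDs
        ids = merge_pair(ids, pair, new_id)
        merges[pair] = new_id
    return merges


def encode(text, merges):
    """Apply the learned merges, in training order, to new text."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():   # dicts preserve insertion order
        ids = merge_pair(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Map token IDs back to bytes, then to text."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")
```

Because the base vocabulary is the 256 possible byte values, any input string can be encoded, including text with characters never seen during training.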
  • 2
    SentencePiece

    Unsupervised text tokenizer for Neural Network-based text generation

    SentencePiece is an unsupervised text tokenizer and detokenizer, mainly for neural network-based text generation systems where the vocabulary size is fixed before the neural model is trained. SentencePiece implements subword units (e.g., byte-pair encoding (BPE) [Sennrich et al.] and the unigram language model [Kudo]) with the extension of direct training from raw sentences. SentencePiece allows us to build a purely end-to-end system that does not depend on language-specific pre/postprocessing. Purely data driven, SentencePiece trains tokenization and detokenization models from sentences. ...
    Downloads: 4 This Week
  • 3
    autoresearch-mlx

    Apple Silicon (MLX) port of Karpathy's autoresearch

    ...It maintains the core autoresearch structure, where an AI agent iteratively edits a training script, executes experiments under a fixed time budget, and evaluates results based on a defined metric such as validation bits per byte. The system is tailored for Apple hardware, leveraging unified memory and MLX capabilities to achieve efficient training on Mac devices. It includes a minimal and focused project structure consisting of data preparation utilities, a modifiable training file, and a program specification that governs the agent’s behavior. The framework logs experiment results and supports continuous iteration, enabling long-running optimization cycles that can reveal hardware-specific performance patterns.
    Downloads: 0 This Week
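The "validation bits per byte" metric mentioned for autoresearch-mlx can be sketched as a unit conversion. This is an illustration, not the project's evaluation code: it assumes the model's loss is a cross-entropy summed over the validation set and measured in nats, and the function name `bits_per_byte` is hypothetical.

```python
import math


def bits_per_byte(total_nll_nats, total_bytes):
    """Convert a summed negative log-likelihood (in nats) to bits per byte.

    total_nll_nats: cross-entropy summed over all predicted tokens, in nats
    total_bytes:    number of UTF-8 bytes in the evaluated text
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    # Dividing by ln(2) converts nats to bits; dividing by the byte count
    # normalizes by the size of the raw text rather than the token count.
    return total_nll_nats / (math.log(2) * total_bytes)
```

Normalizing by bytes instead of tokens makes the number comparable across runs that use different tokenizers or vocabulary sizes.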
  • 4
    Tiktoken

    tiktoken is a fast BPE tokeniser for use with OpenAI's models

    tiktoken is a high-performance tokenizer library (based on byte-pair encoding, BPE) designed for use with OpenAI’s models. It encodes text to token IDs and decodes token IDs back to text efficiently, with minimal overhead. Because tokenization is a fundamental step in preparing text for models, tiktoken is optimized for speed, memory, and correctness in model contexts (e.g. matching OpenAI’s internal tokenization).
    Downloads: 0 This Week
  • 5
    LaMDA-pytorch

    Open-source pre-training implementation of Google's LaMDA in PyTorch

    Open-source pre-training implementation of Google's LaMDA research paper in PyTorch. The totally not sentient AI. This repository will cover the 2B parameter implementation of the pre-training architecture as that is likely what most can afford to train. You can review Google's latest blog post from 2022 which details LaMDA here. You can also view their previous blog post from 2021 on the model.
    Downloads: 0 This Week
  • 6
    YouTokenToMe

    Unsupervised text tokenizer focused on computational efficiency

    YouTokenToMe is a fast and efficient unsupervised text tokenization library designed for training subword embeddings, particularly useful for NLP models.
    Downloads: 0 This Week
  • 7

    Merge PDF Files

    It is a Windows library that merges standard PDFs into a final PDF

    ...There are many SDKs on the market for creating (merging) PDFs, and almost all of them have limitations. Our Windows library (MergePDFByNMI.dll) merges only standard PDF files (there are several PDF formats). You can supply the input PDFs by file name or as byte arrays, and receive the final PDF either saved to a file or returned as a byte array. The library calls can be synchronous or asynchronous. As a benchmark, the library was used to build a PDF from single-page scanned images processed by an OCR SDK (not included in our library; you can use any on the market): 20,000 images (the OCR SDK produced single-page, text-searchable PDFs, running 50 threads) in 80 minutes. ...
    Downloads: 1 This Week
  • 8

    Immutable Sparse Wave Trees (WaveTree)

    Realtime bigdata tool for bit strings up to 2^63 based on AVL forest

    ...The main object is a sparse bit string (Bits) that efficiently scales up to 2^63 bits and is normally compressed, since the forest shares duplicated substrings. Bits objects support reading a bit, byte, short, int, or long (Java primitives) at any bit index in the 64-bit range. Example: instead of building a class to hold a header and then data, represent all of that as Bits, subranges of them, and ints for the sizes of its parts. Other kinds of compression can be added, since Bits is a Java interface. The main functions on bits are substring, concat, number of 0 or 1 bits, and number of bits (size). ...
    Downloads: 0 This Week
  • 9

    ANNFiD

    A forensic file identification tool using neural networks

    Just carved a bunch of bytes and have no idea what they could be? Maybe ANNFiD can help. ANNFiD uses a neural network to identify byte patterns. It can be trained and has a GUI to help in the process. The tool is still at a very early stage, but could improve exponentially with the help of the developer community.
    Downloads: 0 This Week
  • 10
    roberta-base

    Robust BERT-based model for English with improved MLM training

    roberta-base is a robustly optimized variant of BERT, pretrained on a significantly larger corpus of English text using dynamic masked language modeling. Developed by Facebook AI, RoBERTa improves on BERT by removing the Next Sentence Prediction objective, using longer training, larger batches, and more data, including BookCorpus, English Wikipedia, CC-News, OpenWebText, and Stories. It captures contextual representations of language by masking 15% of input tokens and predicting them....
    Downloads: 0 This Week
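The dynamic masking described for roberta-base (masking 15% of input tokens and predicting them) can be sketched in a simplified form. This is an assumption-laden illustration, not the model's training code: the function name `dynamic_mask` and the `<mask>` placeholder are hypothetical, and every selected token is replaced by the mask token, which is simpler than what real masked-language-model pipelines typically do.

```python
import random


def dynamic_mask(tokens, rng, mask_prob=0.15, mask_token="<mask>"):
    """Return (masked_tokens, labels) with a fresh random mask on each call.

    labels[i] holds the original token at masked positions and None elsewhere,
    so a model would only be scored on the positions it has to predict.
    """
    if not tokens:
        return [], []
    # Mask roughly `mask_prob` of the positions, and at least one.
    num_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), num_to_mask))
    masked = [mask_token if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels


# "Dynamic" here means the mask is re-sampled every time a sequence is seen,
# instead of being fixed once during preprocessing.
rng = random.Random(0)
tokens = [f"tok{i}" for i in range(20)]
first_pass, _ = dynamic_mask(tokens, rng)
second_pass, _ = dynamic_mask(tokens, rng)
```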