Showing 47 open source projects for "tokenizer"

  • 1
    Tokenizer

    A small library for converting tokenized PHP source code into XML

    A small library for converting tokenized PHP source code into XML. You can add this library as a local, per-project dependency to your project using Composer. If you only need this library during development, for instance to run your project's test suite, then you should add it as a development-time dependency.
    Downloads: 0 This Week
  • 2
    SentencePiece

    Unsupervised text tokenizer for Neural Network-based text generation

    SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabulary size is predetermined prior to the neural model training. SentencePiece implements subword units (e.g., byte-pair encoding (BPE) [Sennrich et al.] and the unigram language model [Kudo]) with the extension of direct training from raw sentences. SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre...
    Downloads: 0 This Week
  • 3
    llm

    An ecosystem of Rust libraries for working with large language models

    llm is an ecosystem of Rust libraries for working with large language models - it's built on top of the fast, efficient GGML library for machine learning. The primary entry point for developers is the llm crate, which wraps the llm-base and the supported model crates. Documentation for the released version is available on Docs.rs. For end-users, there is a CLI application, llm-cli, which provides a convenient interface for interacting with supported models. Text generation can be done as a...
    Downloads: 0 This Week
  • 4
    torchtext

    Data loaders and abstractions for text and NLP

    We recommend Anaconda as a Python package management system. Please refer to pytorch.org for the details of PyTorch installation. LTS versions are distributed through a different channel than the other versioned releases. Alternatively, you might want to use the Moses tokenizer port in SacreMoses (split from NLTK), which requires installing SacreMoses. To build torchtext from source, you need git, CMake and a C++11 compiler such as g++. When building from source, make sure that you have the same C...
    Downloads: 0 This Week
  • 5
    IK Analysis for Elasticsearch

    A plugin that integrates Lucene IK analyzer into elasticsearch

    ..., independent of the Lucene project, and at the same time provides a default optimized implementation of Lucene. In the 2012 version, IK implemented a simple word segmentation ambiguity elimination algorithm, marking the evolution of the IK tokenizer from pure dictionary word segmentation to analog semantic word segmentation.
    Downloads: 0 This Week
  • 6
    LaMDA-pytorch

    Open-source pre-training implementation of Google's LaMDA in PyTorch

    Open-source pre-training implementation of Google's LaMDA research paper in PyTorch. The totally not sentient AI. This repository will cover the 2B parameter implementation of the pre-training architecture as that is likely what most can afford to train. You can review Google's latest blog post from 2022 which details LaMDA here. You can also view their previous blog post from 2021 on the model.
    Downloads: 0 This Week
  • 7
    Tensorflow Transformers

    State of the art faster Transformer with Tensorflow 2.0

    ... speech recognition and audio classification. Faster autoregressive decoding, TFLite support, and simple creation of TFRecords. Auto-batching of tf.data.dataset or tf.ragged tensors. Everything is a dictionary (inputs and outputs). Multiple mask modes such as causal, user-defined, and prefix. tensorflow-text tokenizer support. Supports GPU, TPU, and a multi-GPU trainer with wandb, multiple callbacks, and auto TensorBoard.
    Downloads: 0 This Week
  • 8
    RE/flex lexical analyzer generator

    The regex-centric, fast lexical analyzer generator for C++

    RE/flex is the fast lexical analyzer generator (faster than Flex) with full Unicode support, indent/nodent/dedent anchors, lazy quantifiers, and many other modern features. Accepts Flex lexer specification syntax and is compatible with Bison/Yacc parsers. Generates reusable source code that is easy to understand. Supports fast scanning of UTF-8/16/32 files, strings, and streams. The reflex scanner generator tool generates clean lexer class code that is thread-safe. Generates Graphviz files...
    Downloads: 1 This Week
  • 9
    GPT Neo

    An implementation of model parallel GPT-2 and GPT-3-style models

    An implementation of model & data parallel GPT3-like models using the mesh-tensorflow library. If you're just here to play with our pre-trained models, we strongly recommend you try out the HuggingFace Transformer integration. Training and inference is officially supported on TPU and should work on GPU as well. This repository will be (mostly) archived as we move focus to our GPU-specific repo, GPT-NeoX. NB, while neo can technically run a training step at 200B+ parameters, it is very...
    Downloads: 8 This Week
  • 10
    GPT2 for Multiple Languages

    GPT2 for Multiple Languages, including pretrained models

    With just 2 clicks (not including the Colab auth process), the 1.5B pretrained Chinese model demo is ready to go. The contents of this repository are for academic research purposes, and we do not provide any conclusive remarks. Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC). Simplified GPT2 training scripts (based on Grover, supporting TPUs). Ported BERT tokenizer, multilingual corpus compatible. 1.5B GPT2 pretrained Chinese model (~15G corpus, 100k steps). Batteries...
    Downloads: 0 This Week
  • 11
    javalang

    Pure Python Java parser and tools

    javalang is a pure Python library for working with Java source code. javalang provides a lexer and parser targeting Java 8. The implementation is based on the Java language spec.
    Downloads: 0 This Week
  • 12
    Ganja.js

    JavaScript Geometric Algebra Generator for JavaScript, C++

    ... is a code generator producing classes that reify algebraic literals and expressions by using reflection, a built-in tokenizer and a simple AST translator to rewrite functions containing algebraic constructs to their procedural counterparts. ganja.js now has a Node.js-based templated source generator that allows the creation of arbitrary algebras for C++, C#, Python and Rust. The generated code provides a flat multivector format and operator overloading.
    Downloads: 0 This Week
  • 13
    The C++ String Toolkit Library (StrTk) consists of robust, optimized and portable string processing algorithms for the C++ language. StrTk is designed to be easy to use and integrate within existing code bases. http://strtk.partow.net
    Downloads: 0 This Week
  • 14
    The C++ String Toolkit Library (StrTk) consists of robust, optimized and portable string processing algorithms for the C++ language. StrTk is designed to be easy to use and integrate within existing code bases. http://strtk.partow.net
    Downloads: 0 This Week
  • 15

    cyqlite

    enhanced SQLite

    100% Upwards compatible variant of SQLite. Provides win32/win64 versions of sqlite3.dll, which work better (smaller/faster/longer paths) than the dll's provided by sqlite.org.
    Downloads: 8 This Week
  • 16
    WikiSQL

    A large annotated semantic parsing corpus for developing NL interfaces

    A large crowd-sourced dataset for developing natural language interfaces for relational databases. WikiSQL is the dataset released along with our work Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning. Regarding tokenization and Stanza: when WikiSQL was written three years ago, it relied on Stanza, a CoreNLP Python wrapper that has since been deprecated. If you'd still like to use the tokenizer, please use the Docker image. We do not anticipate switching...
    Downloads: 0 This Week
  • 17

    flex: the fast lexical analyser

    flex is a tool for generating scanners

    flex is a tool for generating scanners. A scanner, sometimes called a tokenizer, is a program which recognizes lexical patterns in text. The flex program reads user-specified input files, or its standard input if no file names are given, for a description of a scanner to generate. The description is in the form of pairs of regular expressions and C code, called rules. Flex generates a C source file named "lex.yy.c", which defines the function yylex(). The file "lex.yy.c" can be compiled...
    Downloads: 2,796 This Week
  • 18
    We have implemented a core summarizer of scientific articles written in Spanish, with the following components: a tokenizer, a grammar checker, a clarity checker, a cohesion-coherence checker, a common-topic extractor and an output formatter.
    Downloads: 0 This Week
  • 19

    Persianp Toolbox

    A toolbox for Persian texts preprocessing

    A toolbox for preprocessing Persian texts, including: normalizer, tokenizer, sentencizer, POS tagger, lemmatizer, and stopword detector. For more information, please visit: www.persianp.ir/toolbox.html
    Downloads: 0 This Week
  • 20

    bookofangelscost

    A language system like BASIC to be used in video games

    COST (Children Of the Sun Tokenizer) is a BASIC-like language, based on QuickBASIC, that is meant to be used in video games. I'm writing a shareware video game, but I wanted the language system to be open source. It is written in ISO C++ and is meant to be portable.
    Downloads: 0 This Week
  • 21

    tokenizer

    Transforms arithmetic expressions (cstrings) into a sequence of tokens

    A C string that represents an arithmetic expression is transformed into a sequence of tokens (functions, constants, variables, operators, brackets, commas) and stored on a stack.
    Downloads: 0 This Week
  • 22

    slug_dev

    Development platform for SLUG projects

    This is the platform for the developers of SLUG (Solr and Lucene user group). Here we host our projects related to Solr and Lucene. log4jSolr: a log4j appender (and more) to index all log events. solr_core: Solr analysis extensions such as filters or tokenizers.
    Downloads: 0 This Week
  • 23
    This is a simple tokenizer for converting source code in ASCII text files into a ZX Spectrum loadable image file.
    Downloads: 0 This Week
  • 24
    BASIC interpreter for the 16-bit PIC microcontroller 24FJ64GA002. The interpreter runs on the chip only; no compiler or tokenizer is needed. Communication with the PC is done via a USB-to-serial converter cable.
    Downloads: 0 This Week
  • 25
    This is a Linux command-line tokenizer utility. It accepts input from either the console or a file and asks the user to enter a "Tokenizer Index Character".
    Downloads: 0 This Week