Showing 25 open source projects for "corpus bbc arabic"

  • 1
    43 queries on various topics for the Information Retrieval Collection. The corpus is created from the OSAC corpus of journalistic texts consisting of 4763 articles retrieved from the Arabic BBC News. https://sourceforge.net/projects/ar-text-mining/files/Arabic-Corpora/
    Downloads: 0 This Week
  • 2
    Downloads: 1 This Week
  • 3

    Linguistic Analyzer

    The Linguistic Analyzer is a tool for corpus analysis and comparison

    The Linguistic Analyzer (Almuhalil Alloghawy) is a free tool designed by a team from Al-Imam Muhammad bin Saud Islamic University. It can be used for corpus analysis and comparison in terms of several linguistic characteristics, such as frequency list generation, concordances, collocation extraction, the difference between two words, and keyword identification.
    Downloads: 1 This Week
  • 4
    In this corpus: 10 essays containing 752 sentences (with a total of 4,160 words). The essays were selected from different collections of partially or fully diacritized Arabic texts, all of which are available in the Tashkeela corpus. Texts in this corpus have been used in the evaluation of the AGD checker. There are two types of texts in this corpus: 1- Texts without errors to evaluate AGD in terms of detecting and correcting errors that we do not know about before the checking process 2...
    Downloads: 0 This Week
  • 5

    KSUCCA Corpus

    A 50-million-token corpus of Classical Arabic.

    King Saud University Corpus of Classical Arabic (KSUCCA) is a pioneering 50-million-token annotated corpus of Classical Arabic texts from the pre-Islamic era until the fourth Hijri century (equivalent to the period from the seventh until the early eleventh century CE), which is the period of pure Classical Arabic. The main aim of this corpus is to support the study of the distributional lexical semantics of the words of the Quran. However, it can be used for other research purposes...
    Downloads: 8 This Week
  • 6

    Queries for OSAC (Arabic) Corpus

    43 Queries for Arabic Information Retrieval Collection

    43 queries on various topics for the Information Retrieval Collection. The corpus is created from the OSAC corpus of journalistic texts consisting of 4763 articles retrieved from the Arabic BBC News. https://sourceforge.net/projects/ar-text-mining/files/Arabic-Corpora/
    Downloads: 0 This Week
  • 7

    Arabic Corpus

    Text categorization, Arabic language processing, language modeling

    The Arabic Corpus, compiled by Dr. Mourad Abbas (http://sites.google.com/site/mouradabbas9/corpora). The Khaleej-2004 corpus contains 5690 documents divided into 4 topics (categories). The Watan-2004 corpus contains 20291 documents organized in 6 topics (categories). Researchers who use these two corpora should cite the two main references: (1) For the Watan-2004 corpus: M. Abbas, K. Smaili, D. Berkani, (2011) Evaluation of Topic Identification Methods...
    Downloads: 8 This Week
  • 8

    Tashkeela: Arabic diacritization corpus

    Tashkeela: Arabic diacritization corpus (vocalized texts)

    Tashkeela: Arabic diacritization corpus and resource of vocalized Arabic texts (نصوص عربية مشكولة). Contains vocalized Arabic text. Text format; 75.6 million words. Please cite this resource as: T. Zerrouki, A. Balla, Tashkeela: Novel corpus of Arabic vocalized texts, data for auto-diacritization systems, Data in Brief (2017), http://dx.doi.org/10.1016/j.dib.2017.01.011
    Downloads: 4 This Week
  • 9

    Arabic business corpora

    Arabic business and management corpus

    This corpus is made up of 3 sub-corpora as follows: 1) Management corpus: 400 articles by chairmen and CEOs of Arabic companies in the Middle East. 2) Economics news: 400 news articles from different Arabic online newspapers. 3) Stock market news: 400 articles collected from investing.com. The main corpus contains 1200 articles. The articles have been tagged using the Stanford Arabic Part of Speech Tagger. Both plain text and tagged corpora are available to download, check the Files...
    Downloads: 5 This Week
  • 10

    PADIC

    A multilingual Parallel Arabic DIalectal Corpus

    PADIC (Parallel Arabic DIalectal Corpus) is a multi-dialectal corpus built in the framework of the National Research Project "TORJMAN", led by the Scientific and Technical Research Center for the Development of Arabic Language and funded by the Algerian Ministry of Higher Education and Scientific Research. PADIC is composed of 6 dialects (two Algerian dialects, from the cities of Algiers and Annaba, plus Palestinian, Syrian, Tunisian, and Moroccan) and MSA. Mourad Abbas Computational Linguistics Department...
    Downloads: 5 This Week
  • 11

    Osman Arabic Text Readability

    Open Source tool for Arabic text readability

    We present OSMAN (Open Source Metric for Measuring Arabic Narratives), a novel open source Arabic readability metric and tool. The open source Java tool allows users to calculate readability for Arabic text (with and without diacritics). The tool provides methods to split the text into words and sentences and to count syllables, Faseeh letters, and hard and complex words, in addition to adding diacritics (vocalising text). This makes the tool useful for researchers and educators working with Arabic text...
    Downloads: 0 This Week
  • 12
    The AFEWC corpus is a multilingual collection of comparable articles in Arabic, French, and English. Each triple of articles is related to the same topic (aligned at article level). The AFEWC corpus is collected from Wikipedia and is available for free for research purposes only. It is composed of 40K aligned articles, 91.3M English words, 57.8M French words, 22M Arabic words, 2.8M English unique words, 1.9M French unique words, and 1.5M Arabic unique words. Wikipedia text is available...
    Downloads: 0 This Week
  • 13

    Arabic Named Entity Gazetteer

    Arabic Named Entity Gazetteer

    Arabic Named Entity Gazetteer (WIKIFANE_Gazet) is an Arabic "fine-grained" gazetteer that has been automatically compiled from the Arabic Wikipedia. The gazetteer is marked up with XML tags such as <class_name>Arabic Named Entity</class_name>. Each line has an Arabic entity (UTF-8 encoding). This release of WikiFANE_Gazet consists of 68343 entities categorised into 50 classes. To use this corpus, please cite the following publication: F. Alotaibi and M. Lee, "Automatically Developing...
    Downloads: 1 This Week
  • 14
    The Arabic corpus has been developed as part of a research project named "A New Approach of Semi-Indexing of Text Documents". This corpus consists of more than 460 Arab books. The corpus can be used for the development of language engineering applications, information retrieval, and information extraction. The total corpus size is 137 MB. It contains 23,264,785 words and more than 128,584,458 letters.
    Downloads: 1 This Week
  • 15
    “Arabic Wikipedia into Named Entity Taxonomy” is a dataset consisting of 4000 Arabic Wikipedia articles classified into a coarse-grained NE taxonomy. This dataset can be used in document classification tasks in relation to NER. To use this corpus, please cite the following publication: F. Alotaibi and M. Lee, "Mapping Arabic Wikipedia into the Named Entities Taxonomy", In Proceedings of COLING 2012: Posters, p43-52, IIT, Mumbai, India, December 8-15. 2012. Author URL: http...
    Downloads: 0 This Week
  • 16

    Fine-grained Arabic Named Entity Corpora

    Fine-grained Arabic Named Entity Corpora

    The gold-standard and automatically developed fine-grained Arabic named entity corpora are resources created by annotating named entities into 50 fine-grained classes. The annotation uses a two-level taxonomy in which each entity is annotated with coarse- and fine-grained classes. A) Manual gold standard: 1) WikiFANE_Gold: Gold standard Wikipedia-based Fine-grained Arabic Named Entity Corpus, ~500K tokens and 2) NewsFANE_Gold: Gold standard Newswire-based Fine-grained Arabic...
    Downloads: 0 This Week
  • 17

    InAra Plagiarism Detection Corpus

    A corpus for the Arabic Intrinsic Plagiarism Detection evaluation

    ARAbic INtrinsic plagiarism detection corpus (InAra Corpus 2013). The InAra corpus is the first corpus for the evaluation of Arabic intrinsic plagiarism detection. Intrinsic plagiarism detection consists in uncovering the plagiarized passages in a given suspicious document on the basis of inconsistencies in writing style. As opposed to the external approach, the intrinsic approach does not require any comparison of the suspicious document against the potential sources of plagiarism...
    Downloads: 0 This Week
  • 18

    KALIMAT Multipurpose Arabic Corpus

    A corpus that could be of help for researchers working on Arabic NLP

    KALIMAT: a Multipurpose Arabic Corpus. We are pleased to announce the immediate availability of KALIMAT 1.0. KALIMAT is an Arabic natural language resource that consists of: 1) 20,291 Arabic articles collected from the Omani newspaper Alwatan by (Abbas et al. 2011). 2) 20,291 Extractive Single-document system summaries. 3) 2,057 Extractive Multi-document system summaries. 4) 20,291 Named Entity Recognised articles. 5) 20,291 Part of Speech Tagged articles. 6) 20,291...
    Downloads: 36 This Week
  • 19

    EASC (Essex Arabic Summaries Corpus)

    Arabic natural language resources

    The EASC is an Arabic natural language resource. It contains 153 Arabic articles and 765 human-generated extractive summaries of those articles. These summaries were generated using Mechanical Turk (http://www.mturk.com/). Among the major features of EASC are: names and extensions are formatted to be compatible with current evaluation systems such as ROUGE and AutoSummENG; it is available in two encoding formats, UTF-8 and ISO-8859-6 (Arabic). The Essex Arabic Summaries Corpus (EASC) uses...
    Downloads: 7 This Week
  • 20

    Arabic Obsolete Words

    A list of obsolete words in the Buckwalter Morphological Analyser

    This is a list of obsolete words, or words that are outdated or not in contemporary use, in the Buckwalter Morphological Analyser database. The list is developed according to a frequency threshold on the web and the Arabic Gigaword corpus. It contains about 8,400 words that fell out of current use, with a margin of error of 1%. The threshold is defined as follows: all the lemmas in Buckwalter are queried in three news web sites (al-Jazeera, Arabic BBC and Arabic Wikipedia) and if the lemma...
    Downloads: 0 This Week
  • 21

    AADRTE

    Automatic Arabic Domain-Relevant Term Extraction

    In this research we propose a model for automatic domain-relevant term extraction from Arabic text corpora. The proposed model uses a hybrid approach that combines linguistic and statistical methods to extract terms relevant to specific domains, based on a prevalence and tendency term-ranking mechanism. This increases precision and recall as measures of the relevance of extracted terms to a specific domain.
    Downloads: 0 This Week
  • 22

    Arabic Multiword Expressions

    Multiword expression resources for Arabic, totalling 34,658 MWEs

    Multiword expression resources for Arabic, totalling 34,658 MWEs. These MWEs are extracted from the Arabic Wikipedia, from the Arabic Gigaword corpus (4th Edition), and from the English Princeton WordNet translated into Arabic.
    Downloads: 0 This Week
  • 23

    Arabic Broken Plurals

    List of Arabic Broken Plurals

    This is the list of Arabic broken plurals automatically extracted by Mohammed Attia from a large contemporary corpus, provided with morphological patterns for both the singular and the plural forms. It contains 2562 broken plural forms.
    Downloads: 0 This Week
  • 24
    A word count of Modern Standard Arabic from a 1 billion word corpus, sorted according to frequency counts
    Downloads: 1 This Week
  • 25
    An Arabic word corpus containing a large list of words, starting with 1.5 million words, useful for natural language processing.
    Downloads: 2 This Week