Generating n-grams from a dataset that contains different languages

Forum: General Discussion

Creator: Bohdan Baliuk

Created: 2021-05-20

Updated: 2021-05-20

Bohdan Baliuk - 2021-05-20

I want to generate n-grams from the WoS RSCI (Russian Science Citation Index) database in CiteSpace. It contains abstracts in both English and Russian, but after generating n-grams from the bibliometric records, only English n-grams appear.

I used the following order of operations:
1) configured the types of n-grams in 'Edit properties' (maximum and minimum number of words in noun phrases),
2) selected 'Noun phrases' in the 'Text processing' field on the main page,
3) selected 'Create POS-tags' in the pop-up window,
4) clicked 'Go' and processed the dataset,
5) chose 'Create List terms by tf*idf' in the Text tab on the main page.

By the way, I also tried to generate n-grams from the full text, but only unigrams are available as output. They include both English and Russian unigrams, which implies that the n-gram generation logic is shared across languages. However, I want to define the types of n-grams that are important for my research: not only unigrams but also bigrams and trigrams.

So is there an approach that would give me n-grams in Russian only, or at least in both languages rather than only in English?
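To make the desired output concrete, here is a minimal Python sketch that works outside CiteSpace. It is not how CiteSpace generates terms: it uses plain word n-grams and raw counts rather than POS-tagged noun phrases and tf*idf, and it lets n-grams span sentence boundaries. The helper names (`tokenize`, `ngrams`, `is_russian`, `count_ngrams`) and the two sample abstracts are hypothetical, chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical helpers for illustration; none of this is part of CiteSpace.
TOKEN_RE = re.compile(r"[^\W\d_]+")          # runs of letters in any script
CYRILLIC_RE = re.compile(r"[\u0400-\u04FF]")  # basic Cyrillic Unicode block


def tokenize(text):
    """Lowercase the text and split it into letter-only tokens."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


def ngrams(tokens, n):
    """Return all contiguous n-word sequences as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def is_russian(ngram):
    """True if every word in the n-gram contains a Cyrillic letter."""
    return all(CYRILLIC_RE.search(word) for word in ngram.split())


def count_ngrams(abstracts, min_n=2, max_n=3):
    """Count n-grams of length min_n..max_n over all abstracts."""
    counts = Counter()
    for text in abstracts:
        tokens = tokenize(text)
        for n in range(min_n, max_n + 1):
            counts.update(ngrams(tokens, n))
    return counts


# Made-up sample data standing in for exported abstracts.
abstracts = [
    "Citation analysis of Russian journals",
    "Анализ цитирования российских журналов",
]

counts = count_ngrams(abstracts)
russian_only = {g: c for g, c in counts.items() if is_russian(g)}

for gram, freq in sorted(russian_only.items()):
    print(freq, gram)
```

The `min_n` and `max_n` parameters play the role of the minimum and maximum word settings mentioned above, and filtering with `is_russian` gives the Russian-only list, while `counts` holds n-grams in both languages.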
