| Name | Modified | Size | Downloads / Week |
|---|---|---|---|
| JapaneseTextAnalysisTool_v5.4 | 2015-10-04 | ||
| JapaneseTextAnalysisTool_v5.3 | 2015-08-03 | ||
| JapaneseTextAnalysisTool_v5.2 | 2015-07-04 | ||
| JapaneseTextAnalysisTool_v5.1 | 2015-05-13 | ||
| JapaneseTextAnalysisTool_v5.0 | 2014-07-20 | ||
| JapaneseTextAnalysisTool_v4.4 | 2013-12-08 | ||
| JapaneseTextAnalysisTool_v4.3 | 2013-10-27 | ||
| JapaneseTextAnalysisTool_v4.2 | 2013-07-12 | ||
| JapaneseTextAnalysisTool_v4.1 | 2013-07-05 | ||
| JapaneseTextAnalysisTool_v4.0 | 2013-07-04 | ||
| JapaneseTextAnalysisTool_v3.0 | 2013-06-30 | ||
| readme.txt | 2015-10-04 | 8.9 kB | |
| Totals: 12 Items | | 8.9 kB | 4 |
cb's Japanese Text Analysis Tool
--------------------------------
Description:
------------
cb's Japanese Text Analysis Tool allows users to generate 4 kinds of reports:
1) Word Frequency Report
2) Kanji Frequency Report
3) Formula-based Readability Report
4) User-based Readability Report
You may analyze a single text file or an entire directory of text
files (including sub-directories).
Before text analysis begins, all Aozora formatting is removed (if present) and
a user-specified number of lines are trimmed from the beginning and end of the
text files.
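The cleanup step above can be sketched as follows. This is only an illustration, not the tool's actual code: the regular expressions are assumptions based on common Aozora Bunko markup (ruby readings in 《》, the ruby start marker ｜, and editor notes in ［＃...］), and the function names and the order of the two steps are hypothetical.

```python
import re

# Assumed Aozora Bunko markup patterns (not taken from the tool's source).
RUBY = re.compile(r"《[^》]*》")          # ruby readings, e.g. 吾輩《わがはい》
ANNOTATION = re.compile(r"［＃[^］]*］")   # editor notes, e.g. ［＃改ページ］
RUBY_START = "｜"                         # marks where the ruby base text begins


def strip_aozora(text):
    """Remove ruby readings, editor annotations, and ruby start markers."""
    text = RUBY.sub("", text)
    text = ANNOTATION.sub("", text)
    return text.replace(RUBY_START, "")


def trim_lines(lines, head, tail):
    """Drop `head` lines from the start and `tail` lines from the end."""
    end = len(lines) - tail
    return lines[head:end] if end > head else []


def clean_text(text, head=0, tail=0):
    """Strip markup first, then trim (the order is an assumption)."""
    lines = strip_aozora(text).splitlines()
    return "\n".join(trim_lines(lines, head, tail))
```

For example, clean_text(text, head=1, tail=1) would drop a title line and a trailing colophon line after the markup has been removed.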
All generated reports will be placed in the specified output directory.
In the Tools menu, you can launch tools that allow you to combine multiple
frequency reports or compare two frequency reports.
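As a rough idea of what combining frequency reports involves, the sketch below sums the counts for each word across several reports. It is not the tool's implementation. It assumes that report lines are tab-separated, with the count in the first field and the word in the second; the function names are hypothetical.

```python
from collections import Counter


def read_counts(lines, sep="\t"):
    """Collect {word: count} from report lines (count first, word second)."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\r\n").split(sep)
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        try:
            counts[fields[1]] += int(fields[0])
        except ValueError:
            continue  # first field is not a number
    return counts


def combine_reports(reports, sep="\t"):
    """Sum the per-word counts of several reports into one Counter."""
    combined = Counter()
    for lines in reports:
        combined.update(read_counts(lines, sep))
    return combined
```

The remaining fields (group, rank, percentages) are ignored here because they would have to be recomputed from the combined counts anyway.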
Word Frequency Report:
----------------------
Name: word_freq_report.txt
Format:
Field 1: Number of times word was encountered
Field 2: Word
Field 3: Frequency Group (see explanation below)
Field 4: Frequency Rank (see explanation below)
Field 5: Percentage (Field 1 / Total number of words)
Field 6: Cumulative percentage
Field 7: Part-of-speech
Frequency Group: All words in the analysis that share the exact same frequency (Field 1)
will be assigned to a numbered Frequency Group, with group 1 containing the most common
word(s), group 2 containing the next most common word(s), and so on.
Frequency Rank: For a given word, the Frequency Rank is the total number of words
in the analysis that are more frequent than the given word + 1. For example, if the
given word has a Frequency Rank of 500, then there are 499 other words in the analysis
that are more frequent than the given word.
Report is sorted from most frequent word to least frequent word.
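The Frequency Group, Frequency Rank, percentage, and cumulative percentage fields can be derived from raw counts roughly as shown below. This is a minimal sketch under stated assumptions, not the tool's code: the function name is hypothetical, and ties are ordered by the item itself because the tool's tie order is not documented.

```python
from collections import Counter


def build_frequency_rows(tokens):
    """Return rows of (count, item, group, rank, percent, cumulative_percent)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # Most frequent first; ties are broken by the item so output is stable.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

    rows = []
    group = 0
    rank = 0
    previous_count = None
    cumulative = 0
    for position, (item, count) in enumerate(ordered, start=1):
        if count != previous_count:
            group += 1       # a new distinct frequency starts a new group
            rank = position  # number of more frequent items, plus 1
            previous_count = count
        cumulative += count
        rows.append((count, item, group, rank,
                     100.0 * count / total, 100.0 * cumulative / total))
    return rows
```

With the tokens a, a, a, b, b, c, c, d, the items b and c share group 2 and rank 2, while d gets group 3 but rank 4, because three items are more frequent than it.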
You have two methods of generating a report: MeCab or JParser.
MeCab is a widely used morphological analyzer and is quite fast.
It utilizes a CRF statistical learning model with a Viterbi search algorithm.
This is the recommended option.
JParser is an alternate method that uses a larger dictionary (EDICT + ENAMDICT)
and thus recognizes more words and may have better support for names
and short expressions. However, it is much slower than MeCab and may not
tokenize text as accurately as MeCab.
The Part-of-speech field is printed verbatim from MeCab/JParser.
Kanji Frequency Report:
-----------------------
Name: kanji_freq_report.txt
Format:
Field 1: Number of times kanji was encountered
Field 2: Kanji
Field 3: Frequency Group (see explanation below)
Field 4: Frequency Rank (see explanation below)
Field 5: Percentage (Field 1 / Total number of kanji)
Field 6: Cumulative percentage
Frequency Group: All kanji in the analysis that share the exact same frequency (Field 1)
will be assigned to a numbered Frequency Group, with group 1 containing the most common
kanji, group 2 containing the next most common kanji, and so on.
Frequency Rank: For a given kanji, the Frequency Rank is the total number of kanji
in the analysis that are more frequent than the given kanji + 1. For example, if the
given kanji has a Frequency Rank of 500, then there are 499 other kanji in the analysis
that are more frequent than the given kanji.
Report is sorted from most frequent kanji to least frequent kanji.
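Counting kanji for a report like this can be sketched as follows. The readme does not say which characters the tool treats as kanji, so the sketch assumes the CJK Unified Ideographs block (U+4E00 to U+9FFF); the function names are hypothetical.

```python
import re
from collections import Counter

# Assumption: a "kanji" is any character in the CJK Unified Ideographs block
# (U+4E00 to U+9FFF). The tool's own definition may differ.
KANJI_PATTERN = re.compile(r"[\u4e00-\u9fff]")


def count_kanji(text):
    """Count every kanji occurrence; kana, Latin text, and punctuation are ignored."""
    return Counter(KANJI_PATTERN.findall(text))


def kanji_percentage(counts, kanji):
    """Field 5 for one kanji: its count divided by the total number of kanji."""
    total = sum(counts.values())
    return 100.0 * counts[kanji] / total if total else 0.0
```

Note that the iteration mark 々 (U+3005) falls outside this range and would not be counted under this assumption.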
Formula-based Readability Report:
---------------------------------
This report is generated using the Hayashi and OBI-2 readability calculations.
Name: formula_based_readability_report.txt
Format:
Field 1: OBI-2 Grade Level (1-13, where 1 is the most readable)
Field 2: Hayashi Score (0-100, where 100 is the most readable)
Field 3: Filename
Report is sorted from most readable to least readable.
Hayashi Score Information:
http://www.ideosity.com/ideosphere/seo-information/readability-tests#Hayashi
OBI-2 Grade Level Information:
http://kotoba.nuee.nagoya-u.ac.jp/sc/obi2/obi_e.html
User-based Readability Report:
------------------------------
Using a list of words that the user already knows, this report can help to determine readability
of a text based on the percentage of words in the text that the user already knows.
Name: user_based_readability_report.txt
Format:
Field 1: Readability expressed as a percentage (0-100) of the total number
of non-unique known words vs. the total number of non-unique words.
Field 2: Total number of non-unique words
Field 3: Total number of non-unique known words
Field 4: Total number of non-unique unknown words
Field 5: Readability expressed as a percentage (0-100) of the total number
of unique known words vs. the total number of unique words.
Field 6: Total number of unique words
Field 7: Total number of unique known words
Field 8: Total number of unique unknown words
Field 9: Filename
Report is sorted based on Readability (Field 1).
To generate this report, the "File that contains a list of words that you already know" option must
be filled in. If a line contains multiple tab-separated columns, then the word is assumed to be in
the first column.
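The numbers in this report can be derived as in the sketch below, given a text that has already been split into words. It is an illustration only: the function names and dictionary keys are hypothetical, and the tokenization itself (MeCab or JParser) is left out.

```python
def load_known_words(lines):
    """Read known words, taking the first tab-separated column of each line."""
    known = set()
    for line in lines:
        word = line.rstrip("\r\n").split("\t")[0].strip()
        if word:
            known.add(word)
    return known


def percent(part, whole):
    """Percentage in the range 0-100; 0.0 when there is nothing to divide by."""
    return 100.0 * part / whole if whole else 0.0


def user_readability(tokens, known):
    """Compute Fields 1-8 of the report for one tokenized text."""
    total = len(tokens)
    known_total = sum(1 for token in tokens if token in known)
    unique = set(tokens)
    known_unique = len(unique & known)
    return {
        "readability": percent(known_total, total),               # Field 1
        "total": total,                                           # Field 2
        "known": known_total,                                     # Field 3
        "unknown": total - known_total,                           # Field 4
        "unique_readability": percent(known_unique, len(unique)),  # Field 5
        "unique_total": len(unique),                              # Field 6
        "unique_known": known_unique,                             # Field 7
        "unique_unknown": len(unique) - known_unique,             # Field 8
    }
```

For the words 猫, 猫, 犬, 鳥 with 猫 and 犬 known, Field 1 is 75 (3 of 4 words) while Field 5 is about 66.7 (2 of 3 unique words).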
How to Install/Launch:
----------------------
1) Unzip cb's Japanese Text Analysis Tool.
2) In the unzipped directory, simply double-click JapaneseTextAnalysisTool.exe.
Note: Requires .NET Framework 4.5.
Contact:
--------
Christopher Brochtrup
cb4960@gmail.com
Official Website:
-----------------
https://sourceforge.net/projects/japanesetextana/
Support Thread:
---------------
http://forum.koohii.com/viewtopic.php?pid=176875#p176875
Special thanks to the authors of the following tools:
-----------------------------------------------------
MeCab (http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html)
Morphological analyzer used to help generate the Word Frequency report.
obi-2.304 (http://kotoba.nuee.nagoya-u.ac.jp/sc/obi2/obi_e.html)
Ruby script that calculates OBI-2.
Translation Aggregator (http://www.hongfire.com/forum/showthread.php/94395-Translation-Aggregator-v0-4-9)
Original JParser code.
az2tex.rb (http://a2k.aill.org/)
Contains table of Gaiji codes and UTF-8 equivalents.
--------------------------------------------------------------------------------
Version History:
----------------
[Version 5.4 (03-October-2015)]
- Added the "Frequency Group" and "Frequency Rank" fields to both the
Word Frequency Report and Kanji Frequency Report.
[Version 5.3 (17-July-2015)]
- Added some optimizations. Should now have fewer memory spikes.
- Added analysis time to the Complete dialog.
- Cancellation should now be faster.
- Fixed bug in user-readability report not using the Use Enhanced Dictionary option.
- Added the max_tasks option to settings.txt.
- Now targets .NET 4.5 (using async/await).
[Version 5.2 (03-July-2015)]
- Added the Use Enhanced Dictionary option when MeCab is selected.
- Removed entries from the MeCab user dictionary that may interfere with
the statistical model.
[Version 5.1 (13-May-2015)]
- Updated JParser to accept paths greater than 500 characters.
- Updated the MeCab database.
[Version 5.0 (19-July-2014)]
- Added the part-of-speech field to the Word Frequency report.
- Analysis now utilizes all cores.
- Added the Directory Search Options group (extensions textbox and sub-folder checkbox).
- Added the Word Removal option.
- Added the Remove Aozora Formatting option.
- Added the User Words List Column option.
- Updated the MeCab database.
[Version 4.4 (08-December-2013)]
- Added percentage and cumulative percentage fields to the word frequency report and kanji
frequency report.
[Version 4.3 (11-October-2013)]
- Updated the MeCab database.
[Version 4.2 (11-July-2013)]
- Fixed bug where statistics were not reset for the User-based Readability Report when performing
more than one full analysis in a row.
[Version 4.1 (5-July-2013)]
- Fixed crash bug when processing more than ~150 novels.
- Fixed crash bug when closing JTAT while it was still processing.
[Version 4.0 (4-July-2013)]
- Added the User-based Readability Report option.
- Renamed the old Readability report to Formula-based Readability report.
- Updated MeCab to version 0.996.
- Updated mecab_user_dic.dic.
[Version 3.0 (27-May-2012)]
- Added the Compare Frequency Reports Tool.
- Added the Combine Frequency Reports Tool.
- Fixed MeCab Word Frequency Report so that it uses the 'root' field of the MeCab output.
[Version 2.0 (26-May-2012)]
- Added Word Frequency Report options:
1) Parse method: MeCab or JParser
2) Always use kanji form of words
[Version 1.1 (13-May-2012)]
- Removed non-words from the Word Frequency Report.
[Version 1.0 (29-April-2012)]
- Initial version.