I've used the configuration below:
config.setConfigurationRule(WVTConfiguration.STEP_INPUT_FILTER, new WVTConfigurationFact(new XMLInputFilter()));
config.setConfigurationRule(WVTConfiguration.STEP_WORDFILTER, new WVTConfigurationFact(new StopWordsWrapper()));
config.setConfigurationRule(WVTConfiguration.STEP_STEMMER, new WVTConfigurationFact(new PorterStemmerWrapper()));
config.setConfigurationRule(WVTConfiguration.STEP_VECTOR_CREATION, new WVTConfigurationFact(new TFIDF()));
config.setConfigurationRule(WVTConfiguration.STEP_OUTPUT, new WVTConfigurationFact(new WordVectorWriter(new FileWriter("docs.txt"), true)));
Then I used the following statements to create the output:
wlista = wvtTool.createWordList(lista, config);
wlista.store(new FileWriter("words.txt"));
wvtTool.createVectors(lista, config, wlista);
The problem is:
If I have 20 files, the output contains only 18 of them.
If I have 200 files, the output contains only 195 of them.
And so on.
I'm sure that all files are being read: words.txt contains words from all of the files, but docs.txt contains fewer documents than the total.
What have I done wrong?