-
schtepf committed revision 82 to the IMS Open Corpus Workbench SVN repository, changing 2 files.
2009-12-20 20:01:07 UTC in IMS Open Corpus Workbench
-
schtepf committed revision 81 to the IMS Open Corpus Workbench SVN repository, changing 1 file.
2009-12-19 11:16:08 UTC in IMS Open Corpus Workbench
-
CQPweb stores all MySQL tables with collation utf8_general_ci, which is inappropriate for German and presumably many other languages because it ignores accents and other diacritics in addition to case-folding strings. This leads to bogus entries in collocation tables, frequency lists and frequency breakdown.
Examples: German verbs "fallen" and "fällen" are collapsed into a single entry...
2009-12-19 10:43:42 UTC in IMS Open Corpus Workbench
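The collapsing effect described above can be approximated outside MySQL. The following is a minimal Python sketch, assuming that stripping diacritics and case-folding is a reasonable stand-in for how utf8_general_ci compares strings; it is not the actual collation algorithm, and the helper name and token list are made up for illustration.

```python
import unicodedata
from collections import Counter


def fold_accents_and_case(word: str) -> str:
    """Hypothetical helper: rough stand-in for an accent- and case-insensitive key."""
    # Decompose characters so that "ä" becomes "a" plus a combining diaeresis.
    decomposed = unicodedata.normalize("NFD", word)
    # Drop the combining marks, then fold case.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


# Illustrative tokens, not taken from any real corpus.
tokens = ["fallen", "fällen", "Fallen", "fällen"]

# Grouping by the folded key merges both verbs into a single entry.
folded_counts = Counter(fold_accents_and_case(t) for t in tokens)

# Grouping by the exact string keeps them apart, as a binary comparison would.
exact_counts = Counter(tokens)

print(folded_counts)
print(exact_counts)
```

With the folded key, "fallen" and "fällen" land in the same frequency-list row, which is the kind of bogus entry reported here; comparing exact strings keeps them distinct.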
-
schtepf committed revision 80 to the IMS Open Corpus Workbench SVN repository, changing 3 files.
2009-12-19 08:15:05 UTC in IMS Open Corpus Workbench
-
schtepf committed revision 74 to the IMS Open Corpus Workbench SVN repository, changing 1 file.
2009-12-11 13:44:14 UTC in IMS Open Corpus Workbench
-
schtepf committed revision 73 to the IMS Open Corpus Workbench SVN repository, changing 5 files.
2009-12-11 13:41:19 UTC in IMS Open Corpus Workbench
-
schtepf committed revision 72 to the IMS Open Corpus Workbench SVN repository, changing 1 file.
2009-12-11 13:34:41 UTC in IMS Open Corpus Workbench
-
In my tests with a 500-million-word corpus, the MySQL indexing of frequency tables ("Create frequency tables") turned out to be much worse. It quickly ate up some 5 GB of disk space, and then kept running for hours without any tangible result until Firefox and/or Apache collapsed. As of now, MySQL still seems to be busy building the table and index (perhaps this is a side effect of the LOAD...
2009-12-01 11:10:11 UTC in IMS Open Corpus Workbench
-
Having different corpus IDs in CQP vs. MySQL seems an unnecessary complication, since the CWB name of a corpus can easily be changed (rename registry file and change ID entry there). I vote for keeping the restriction, but making sure that it's prominently documented (for those who use hyphens in CWB names, as the documentation suggests).
2009-11-29 15:06:18 UTC in IMS Open Corpus Workbench
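The renaming step mentioned above (rename the registry file and change the ID entry in it) could look roughly like this. It is only a sketch under assumptions: the function name, the example IDs and the temporary directory are hypothetical, the registry file is assumed to contain a single line of the form `ID <name>`, and the corpus data directory is deliberately left untouched.

```python
import re
import tempfile
from pathlib import Path


def rename_corpus(registry_dir: Path, old_id: str, new_id: str) -> Path:
    """Hypothetical helper: rename a registry file and rewrite its ID line."""
    old_file = registry_dir / old_id
    new_file = registry_dir / new_id
    if new_file.exists():
        raise FileExistsError(f"registry entry {new_id!r} already exists")

    # latin-1 round-trips arbitrary bytes, so other lines are preserved as-is.
    text = old_file.read_text(encoding="latin-1")

    # Touch only the ID line; HOME, INFO and attribute lines stay unchanged.
    pattern = re.compile(rf"^ID([ \t]+){re.escape(old_id)}[ \t]*$", re.MULTILINE)
    new_text, count = pattern.subn(lambda m: f"ID{m.group(1)}{new_id}", text)
    if count != 1:
        raise ValueError(f"expected exactly one ID line for {old_id!r}, found {count}")

    new_file.write_text(new_text, encoding="latin-1")
    old_file.unlink()
    return new_file


# Demo on a throwaway registry directory with an invented entry.
with tempfile.TemporaryDirectory() as tmp:
    registry = Path(tmp)
    (registry / "my-corpus").write_text(
        'NAME "Demo corpus"\nID   my-corpus\nHOME /corpora/data/my-corpus\n',
        encoding="latin-1",
    )
    renamed = rename_corpus(registry, "my-corpus", "my_corpus")
    demo_text = renamed.read_text(encoding="latin-1")
    old_file_gone = not (registry / "my-corpus").exists()
    print(demo_text)
```

Because only the ID line and the file name change, the same corpus ID can then be used on both the CQP and the MySQL side without keeping a separate mapping.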
-
schtepf committed revision 65 to the IMS Open Corpus Workbench SVN repository, changing 2 files.
2009-11-28 17:11:04 UTC in IMS Open Corpus Workbench