IMS Open Corpus Workbench Activity

Indexing and query tools for very large text corpora

Brought to you by: andrewhardie, marco_baroni, schtepf, sharoff

Activity for IMS Open Corpus Workbench

2 months ago
Andrew Hardie committed [r1906] on Code

fix tiny but horrible bug where integer data was sometimes being transmitted to the browser as strings
3 months ago
Stephanie Evert committed [r1905] on Code

Fix spurious test error in Perl CWB package
4 months ago
Philipp Heinrich created ticket #81

Missing warning message for p- and s-attributes with the same name
5 months ago
Andrew Hardie committed [r1904] on Code

I properly buggered up the try-catch logic round mysqli-rollback. so let's give that one more go, shall we.
5 months ago
Andrew Hardie committed [r1903] on Code

my 2nd effort to stop PHP8's mysqli extension spilling exceptions all over the damn place by try-catching every call to a mysqli func that might possibly throw, hopefully thus keeping the madness contained in sql-lib. [THANKS PHP8] ...plus a tiny fix to the rss generator.
5 months ago
Andrew Hardie committed [r1902] on Code

tiny fix for stupid utf8 bug in php 8
6 months ago
Stephanie Evert committed [r1901] on Code

CWB/Perl: fix option specification in cwb-make
7 months ago
Andrew Hardie committed [r1900] on Code

Add requested feature to allow disabling of email confirmation on signup.
7 months ago
Stephanie Evert committed [r1899] on Code

Make Perl/CWB utility cwb-convert-to-utf8 more compatible with iconv implementations
7 months ago
Stephanie Evert committed [r1898] on Code

Check in older modifications in CWB/Perl branch v3.0
7 months ago
Stephanie Evert committed [r1897] on Code

Fix bug in CWB::RegistryFile so that NAME with backslash-escapes round-trips correctly
10 months ago
Andrew Hardie committed [r1896] on Code

fix some bloody obvious errors in the test implementation of soft keyboards, including (!) having somehow missed adding the needed javascript file to make them actually work
10 months ago
Stephanie Evert committed [r1895] on Code

Fix ShowTargets option to work with Highlight off (shown as parenthesised numbers then)
10 months ago
Stephanie Evert committed [r1894] on Code

Document new hidden KnitWithEcho option in CQP manual
10 months ago
Stephanie Evert committed [r1893] on Code

Add hidden `KnitWithEcho` option to control whether CQP code blocks are included in markdown knit mode (-N) output.
11 months ago
Andrew Hardie committed [r1892] on Code

three minor bugs: in SQL server connection, in colleaguate config, in allowing indexing of CWB IDs that the CL reg parser can't handle
12 months ago
Stephanie Evert committed [r1891] on Code

Fix CWB::CL test that returns different result with PCRE2 (in CWB >= 3.5.1), so CWB::CL now also provides version information.
1 year ago
Stephanie Evert committed [r1890] on Code

amend previous commit
1 year ago
Stephanie Evert committed [r1889] on Code

don't install signal handlers in CQP batch mode
1 year ago
Stephanie Evert committed [r1888] on Code

fix bug in cwb-s-encode
1 year ago
Stephanie Evert committed [r1887] on Code

Document experimental markdown mode in CQP Manual (now updated to v3.5.1)
1 year ago
Stephanie Evert committed [r1886] on Code

CQP v3.5.1: experimental markdown mode, which reads markdown files and executes fenced code block of type "cqp"
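As a rough illustration of what "fenced code block of type cqp" means here, the following Python sketch pulls such blocks out of a markdown string. It is not CQP's own implementation; the function name `extract_cqp_blocks` and the handling of unclosed blocks are assumptions made for this example only.

```python
# Hypothetical helper for illustration; CQP's actual markdown mode may differ.
FENCE = "`" * 3  # three backticks, built this way so the sample can sit inside a fence


def extract_cqp_blocks(markdown_text):
    """Return the bodies of all fenced code blocks whose type is 'cqp'."""
    blocks = []
    current = None  # None while outside a cqp block, a list of lines while inside
    for line in markdown_text.splitlines():
        stripped = line.strip()
        if current is None:
            if stripped == FENCE + "cqp":
                current = []
        elif stripped == FENCE:
            blocks.append("\n".join(current))
            current = None
        else:
            current.append(line)
    # Assumption: a block that is never closed is silently dropped.
    return blocks
```

Blocks of any other type (for example `python`) are skipped, so only the text meant for CQP would be collected.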
1 year ago
Andrew Hardie committed [r1885] on Code

init implementation of soft keyboards (much work remains) and as a fun extra, a rainbow background feature
2 years ago
Stephanie Evert committed [r1884] on Code

Enhance CQP with custom SpheroscopeDebug option (experimental)
2 years ago
Andrew Hardie created ticket #74

CQPweb: more elaborate customisable header for corpus homepage
2 years ago
Stephanie Evert committed [r1883] on Code

Make sure cwb-encode aborts with meaningful error message if there are unparseable arguments
2 years ago
Andrew Hardie committed [r1882] on Code

a bundle of fairly small tweaks and bugfixes
2 years ago
Stephanie Evert committed [r1881] on Code

Encourage users to upgrade to v3.5.1
2 years ago
Andrew Hardie committed [r1880] on Code

small php 8.2 incompatibility fixed
2 years ago
Andrew Hardie committed [r1879] on Code

minor lost comment tweak
2 years ago
Andrew Hardie committed [r1878] on Code

add cronjob to check for overrunning CQP processes
2 years ago
Stephanie Evert committed [r1877] on Code

Fix bug in cl_string_canonical(), which would fail on diacritic-folding of long strings (where NFC exceeds CL_MAX_LINE_LENGTH)
2 years ago
Andrew Hardie created ticket #73

CQPweb: per-user tabulation templates
2 years ago
Andrew Hardie committed [r1876] on Code

fix to breaking bug in BasicVrtInstaller
2 years ago
Andrew Hardie posted a comment on ticket #72

No there's not, because it's (a) really far in the future and (b) not going to be remotely difficult when we actually get there.
2 years ago
ram posted a comment on ticket #72

Yes, I had to find a workaround for the attribute separator. My suggestion about FreeLing was mainly for the possibility of displaying the token and its attributes as nodes, instead of plain text. But it seems that implies a lot of fixes, and that it is something that is going to be fixed in version 4. Is that the case? Is there any draft for the XML output?
2 years ago
ram posted a comment on ticket #7

Thanks, sorry for the mistake
2 years ago
Andrew Hardie committed [r1875] on Code

1. Make code closer to PHP8-ready by declaring all config variables in the global Config object's class def. 2. Initial implementation of classification-type metadata as SQL-level enums, not varchar columns (for more compact storage).
2 years ago
Andrew Hardie committed [r1874] on Code

add ability to probe data for XML structure rather than it needing to be defined. Bump version to 3.3.18
2 years ago
Stephanie Evert posted a comment on ticket #72

A kwic concordance (where left and right context might not even contain complete tokens!) is very different from a list of sentences with pre-determined annotation as in the FreeLing output. I don't think we can learn much from it to help us address the challenges of kwic XML output. SGML print mode is really badly broken if you display s-attributes in the concordance. It also includes them (and any p-attributes) as plain text in the tokens rather than in a way that allows them to be processed e.g....
2 years ago
Stephanie Evert posted a comment on ticket #7

You seem to have forgotten to activate the corpus: info PRUEBA; but PRUEBA; show cd;
2 years ago
ram posted a comment on ticket #7

Thanks for your response! cwb-describe-corpus -s works perfectly. For show cd I still get incomplete information: ===Context Descriptor======================================= left context: 25 characters right context: 25 characters corpus position: shown target anchors: not shown Positional Attributes: <none> Structural Attributes: <none> Aligned Corpora: <none> ============================================================
2 years ago
ram posted a comment on ticket #72

About the DTD, I will use the same SGML structure but XML compliant. For more ideas about the schema, FreeLing output formats could be a useful resource.
2 years ago
Stephanie Evert posted a comment on ticket #72

Note to those not familiar with CQP print modes: Their implementation is a horrible mess, so we are reluctant to add extensions and very limited in what can be achieved. Moreover, the print modes only affect some CQP output (kwic concordances, frequency tables from group) but by far not all.
2 years ago
Stephanie Evert modified ticket #72

XML output mode for CQP
2 years ago
Stephanie Evert posted a comment on ticket #7

Because that's how the original developer decided to do things in 1994. It's a quirk that we live with for the sake of backwards compatibility. Note that the filename of the registry file also has to be in lowercase, while corpus IDs are to be specified in all caps everywhere else. You can get the list of attributes with show cd or using cwb-describe-corpus -s on the command line. Canonical attribute names (both positional and structural) should be all lowercase and only use ASCII characters. While...
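The naming conventions described in this comment can be sketched as a few small checks. This is only an illustration in Python: the function names are hypothetical, and the checks cover just the properties stated above (lowercase registry filename, all-caps corpus ID, lowercase ASCII attribute names), not the full set of rules that CWB itself enforces.

```python
# Hypothetical helpers; they encode only the conventions quoted in the comment above.

def registry_filename(corpus_id):
    """Registry files are named with the lowercase form of the corpus ID."""
    return corpus_id.lower()


def is_corpus_id_style(corpus_id):
    """Corpus IDs are written in all caps everywhere except the registry filename."""
    return bool(corpus_id) and corpus_id == corpus_id.upper()


def is_canonical_attribute_name(name):
    """Attribute names should be all lowercase and use only ASCII characters."""
    return bool(name) and name.isascii() and name == name.lower()
```

For the corpus from this ticket, `registry_filename("PRUEBA")` gives the lowercase name `prueba`, while the ID used in CQP stays `PRUEBA`.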
2 years ago
ram created ticket #7

Corpus info output
2 years ago
ram created ticket #72

XML output mode for CQP
2 years ago
ram created ticket #71

Consistent output formatting
2 years ago
Stephanie Evert modified ticket #80

SGML invalid structure
2 years ago
Stephanie Evert posted a comment on ticket #80

PS: If you want to pursue this, please add a feature request “XML output mode for CQP” for CWB v3.6.
2 years ago
Stephanie Evert posted a comment on ticket #80

Not a bug: SGML allows omission of closing tags – you just have to assume a suitable DTD for the output produced by CQP. Note that your second suggestion isn't valid SGML and would have to be written <attribute// instead. If your SGML output also included the kwic line with some s-attributes shown, you'd get many more validation errors (because nothing guarantees that open/close tags match up within a kwic line, and they can also overlap between context and match). If we ever find the nerves to implement...
2 years ago
ram created ticket #80

SGML invalid structure
2 years ago
Andrew Hardie committed [r1873] on Code

fix bug where PHP8 type stringency caused a null DB result to choke array_map() up.
2 years ago
Andrew Hardie committed [r1872] on Code

add braces to silence complaint from GCC
2 years ago
Stephanie Evert committed [r1871] on Code

cwb-scan-corpus now also reports type count before applying frequency filter
2 years ago
Stephanie Evert committed [r1870] on Code

Ziggurat design: even more B-tree algorithms
2 years ago
Stephanie Evert committed [r1869] on Code

Ziggurat design: added binsearch benchmark observations with Rust on MacOS
2 years ago
Andrew Hardie committed [r1868] on Code

fixes for a couple of tricksy bugs.
2 years ago
Andrew Hardie committed [r1867] on Code

full rewrite of the url_absolutify() function to work within user corpora, plus to rationalise its overall procedure a bit.
2 years ago
Stephanie Evert committed [r1866] on Code

Ziggurat design: try yet another binary search algo (unsuccessfully)
2 years ago
Timm Weber committed [r1865] on Code

more benchmark data
2 years ago
Timm Weber committed [r1864] on Code

more rust benchmarks
2 years ago
Stephanie Evert committed [r1863] on Code

Ziggurat design: fix uint typedef conflict (now zuint)
2 years ago
Timm Weber committed [r1862] on Code

fixed bug in rust benchmarks
2 years ago
Timm Weber committed [r1861] on Code

added rust implementations of the benchmarks in binsearch_bench
2 years ago
Stephanie Evert committed [r1860] on Code

Added binary search benchmarks for random walks and exponential search algorithm
2 years ago
Stephanie Evert committed [r1859] on Code

Convert README to Markdown + HTML for easier reading
2 years ago
Stephanie Evert committed [r1858] on Code

Ziggurat design: estimate disk size of compressed sparse inverted index
2 years ago
Stephanie Evert committed [r1857] on Code

Add HTML version of Markdown README for convenience
2 years ago
Stephanie Evert committed [r1856] on Code

Ziggurat design: benchmark results for binary search in sort index, with thorough discussion
2 years ago
Stephanie Evert committed [r1855] on Code

Ziggurat design: benchmark binary lookup in large tables vs. b-tree
2 years ago
Stephanie Evert committed [r1854] on Code

Ziggurat design: estimate size of sparse inverted index
2 years ago
Stephanie Evert committed [r1853] on Code

Fix extremely embarrassing as well as catastrophic bug in cwb-scan-corpus introduced by r1851
2 years ago
Stephanie Evert committed [r1852] on Code

minor display fix in cwb-scan-corpus
2 years ago
Stephanie Evert committed [r1851] on Code

cwb-scan-corpus now obtains total token/document counts if no regular keys are specified
3 years ago
Andrew Hardie committed [r1850] on Code

fix query strategy bug (dropdown wasn't ignored in cqp syntax mode)
3 years ago
Andrew Hardie modified ticket #79

Error when setting up speaker metadata via IDlinks for BNC2014 spoken
3 years ago
Andrew Hardie posted a comment on ticket #79

Fixed in commit 1849. This is one of those embarrassing bugs that I would have spotted years ago if I used the embiggenable forms much myself - but I normally use templates. Alas. In any case, thanks to Fabian for the bug report.
3 years ago
Andrew Hardie committed [r1849] on Code

fix UI bug on embiggenable tables
3 years ago
Fabian Vetter created ticket #79

Error when setting up speaker metadata via IDlinks for BNC2014 spoken
3 years ago
Andrew Hardie committed [r1848] on Code

some fixes for small distribution / categorised query bugs
3 years ago
Andrew Hardie modified ticket #78

CQPweb: no alert on absent or malformed text + text_id attributes in input data
3 years ago
Andrew Hardie posted a comment on ticket #78

[UNREADABLE] means that some word could not be read from the data returned by CQP. In this case, it happened because the absence of text_id mucked up the processes that break up that data for formatting. You'll note that on your installation form screenshot, there is a notice at the top of the XML table saying that <text> and its id="..." are "... compulsory". So they are added to the corpus definition even if you don't specify them on the form. Unfortunately cwb-encode doesn't issue errors...
3 years ago
ram posted a comment on ticket #6

Thanks for the response. I checked the api-lib.php because some days ago I tried to use the run_query function that in my current version is not implemented, but I see that right now it is implemented. I am going to report all this info, and since we stopped having issues with CQPweb, with my team we are going to see if we actually have to implement a whole new web platform with Django, or if we better work on a nice minimalist template for CQPweb. Thanks!
3 years ago
Andrew Hardie modified a comment on ticket #6

This is one of those rare occasions when I have to disagree with Stephanie - there are some things you can get from CQPweb via API that direct CWB access won't get you: collocations, distribution, linking queries to text or item metadata (and getting same in results...), user accounts, query history, case-folded frequency lists, saved and categorised queries, character set standardisation, ... See https://sourceforge.net/p/cwb/code/HEAD/tree/gui/cqpweb/trunk/lib/api-lib.php for where I'm at so far....
3 years ago
Andrew Hardie posted a comment on ticket #6

This is one of those rare occasions when I have to disagree with Stephanie - there are some things you can get from CQPweb via API that direct CWB access won't get you: collocations, distribution, linking queries to text or item metadata (and getting same in results...), user accounts, query history, case-folded frequency lists, saved and categorised queries, character set standardisation, ... See https://sourceforge.net/p/cwb/code/HEAD/tree/gui/cqpweb/trunk/lib/api-lib.php for where I'm at so far....
3 years ago
ram posted a comment on ticket #78

Ok, now I confirm that the [UNREADABLE] was because the text tag requires the id. With the id attribute everything works as expected. Thanks @schtepf for your patience. The only weird thing left is that I don't receive an error message or warning when I don't add the text tag.
3 years ago
ram posted a comment on ticket #78

I can confirm that the problem is because of the lack of the <text> tag. I did a test where I included that tag and it doesn't hang. But now I have these doubts: What does the [UNREADABLE] mean in the query result? Just to be sure, is the id attribute for the text tag recommended or mandatory? I attach the VRT test file and the screenshot of the result.
3 years ago
ram posted a comment on ticket #78

I didn't get any errors. I attach the process I am doing for the corpus installation. Thanks!
3 years ago
Stephanie Evert posted a comment on ticket #78

Re. 3: If you installed this particular corpus in CQPweb, you must have ignored all the error messages that it shot at you. It should have outright refused to install the corpus, but perhaps it went far enough to get its database into an inconsistent state that causes the lock-up.
3 years ago
ram posted a comment on ticket #78

Let's see: (a) Every time it returns a match. (b) No, it works if I use CQP. (c) I did :S
3 years ago
Stephanie Evert modified a comment on ticket #78

(a) Does this happen for a specific query, or for every query that returns some matches? In the latter case there's probably sth in the corpus that confuses CQPweb. (b) Are there also problems if you run the query directly in CQP? (c) You can't possibly have installed this corpus in CQPweb because it's lacking the mandatory <text id="..."> elements!
3 years ago
Stephanie Evert modified a comment on ticket #78

(a) Does this happen for a specific query, or for every query that returns some matches? In the latter case there's probably sth in the corpus that confuses CQPweb. (b) Are there also problems if you run the query directly in CQP? (c) You can't possibly have installed this corpus in CQPweb because it's lacking the mandatory <text id="..."> ... </text> elements!
3 years ago
Stephanie Evert modified a comment on ticket #78

(a) Does this happen for a specific query, or for every query that returns some matches? In the latter case there's probably sth in the corpus that confuses CQPweb. (b) Are there also problems if you run the query directly in CQP? (c) You can't possibly have installed this corpus in CQPweb because it's lacking the mandatory <text id="..."> elements!
3 years ago
Stephanie Evert posted a comment on ticket #78

(a) Does this happen for a specific query, or for every query that returns some matches? In the latter case there's probably sth in the corpus that confuses CQPweb. (b) Are there also problems if you run the query directly in CQP?
3 years ago
Stephanie Evert posted a comment on ticket #6

cwb-ccc should give you most (if not all) of what you can get from the CQPweb API, and it will be faster (as everything is directly in Python and doesn't need to be serialised and de-serialised) and more directly under your control. I think CQPweb is useful for your use case only if you need its GUI.
3 years ago
Stephanie Evert modified ticket #5

Best approach for interfacing CQP with other software
3 years ago
ram posted a comment on ticket #78

I forgot to put the versions: CQPweb: 3.2.43 CWB: 3.5.0
3 years ago
ram created ticket #78

CQPweb hangs in a specific scenario
