From: Marie G. <mag...@ps...> - 2007-02-09 12:26:18
Dear all,

While running "infomap-build -s", during the svdinterface step I get this message: FEWER THAN EXPECTED SINGULAR VALUES. Can anyone help me solve this problem?

Best regards,
Marie Gustafsson

==================================================
Building target: /home/sverker/infomap_models/test22/left
Prerequisites: /home/sverker/infomap_models/test22/coll /home/sverker/infomap_models/test22/indx
Tue Jan 23 01:18:31 PST 2007
..................................................
cd /home/sverker/infomap_models/test22 && rm -f svd_diag left \
rght sing
cd /home/sverker/infomap_models/test22 && svdinterface \
-singvals 100 \
-iter 100
This is svdinterface.
Writing to: left
Writing to: rght
Writing to: sing
Writing to: svd_diag
Reading: indx
Reading: indx
Reading: coll
FEWER THAN EXPECTED SINGULAR VALUES
..................................................
Finishing target: /home/sverker/infomap_models/test22/left
==================================================
==================================================
Building target: /home/sverker/infomap_models/test22/wordvec.bin
Prerequisites: /home/sverker/infomap_models/test22/left /home/sverker/infomap_models/test22/dic
Tue Jan 23 01:18:31 PST 2007
..................................................
encode_wordvec \
-m /home/sverker/infomap_models/test22
Opening File for "r": "/home/sverker/infomap_models/test22/left"
Opening File for "w": "/home/sverker/infomap_models/test22/wordvec.bin"
Reading the dictionary...
Opening File for "r": "/home/sverker/infomap_models/test22/dic"
Initializing row indices...Done.
..................................................
Finishing target: /home/sverker/infomap_models/test22/wordvec.bin
==================================================
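The warning in the log above typically indicates that svdinterface could not deliver the 100 singular values requested with "-singvals 100", either because the co-occurrence matrix has lower rank or because too few values converged within "-iter 100". A matrix of rank r has at most r singular values, and r is bounded by min(rows, columns), which is easily under 100 for a small test corpus. A toy illustration of that bound follows; none of the names come from the svdinterface sources.

/* Hedged illustration: the number of singular values can never exceed
 * min(rows, cols), and in practice is bounded by the matrix rank. */
#include <stdio.h>

static long max_singular_values(long rows, long cols)
{
    return rows < cols ? rows : cols;   /* rank <= min(rows, cols) */
}

int main(void)
{
    long requested = 100;               /* -singvals 100, as in the log above */
    long rows = 60, cols = 40;          /* e.g. a very small test corpus */
    long bound = max_singular_values(rows, cols);

    if (requested > bound)
        printf("expect fewer than %ld singular values (at most %ld)\n",
               requested, bound);
    return 0;
}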
From: Scott C. <ced...@gm...> - 2006-10-24 17:32:57
As a short-term non-fix, we may also want to clean up the webif code so as not to get people's hopes up. I'd also like to get the spectrum plotter released... Thanks for refreshing my memory on -p and -v. I'm not even sure whether I could find a copy of the code that has these implemented, so I'm glad to hear that you've got one. Scott On 10/24/06, Dominic Widdows <wi...@ma...> wrote: > Dear Zbynek, > > Sorry for not responding sooner. I see what your problem is now. When > we cleaned up the infomap codebase for public release, we only had a > couple of months before the project's funding ended, and decided that > the clustering and contrasting pairs options were bleeding edge and > optional. So we left them out, hoping to put them into later releases. > > We only released the web interface code more recently, and in a > "strictly beta" fashion - it hasn't been through the same cleaning up > process. It turns out that the Perl web interface code is making > calls against options in the core C infomap engine that won't be > there in your release. > > I will think about trying to solve this (and Scott, if you have any > time to help, that would be great). I was never 100% happy with the > former clustering code, it wasn't a great clustering algorithm, and > it was very wrapped in to the "associate" executable. It might be > much easier to create and maintain a more modular architecture, where > associate writes out the vectors to be clustered (e.g. the top 200) > to a file using the "-f" option, and clustering libraries read these > files. It wouldn't be as efficient, but I don't believe that this > would make too much difference in practice (not for a single user > running the algorithm once as part of a browser interface - other > latencies are greater than writing and reading a file some where on > your web server). > > This way we could write or reuse many clustering algorithms quite > easily. It's also the way that the Java word spectrum plotter works. > > If you want to try wrapping in the clustering algorithm from the > original infomap source, I could send you the files and try to help > you. But it may actually take longer, I'm not sure. > Best wishes, > Dominic > > On Oct 24, 2006, at 5:19 AM, Zbynek Studenovsky wrote: > > > Dear Scott, > > > > thank you for your prompt respond. Unfortunately, I am not able to run > > associate from the command line with the -p or -v options. The only > > available options are in the attached man file (associate_man.txt) > > and here: > > > > Usage:associate [-w | -d | -q] > > [-i type_of_input(d or w)] [-f vector_output_file] > > ( [-t] | [-m model_dir] ) > > [-c <model_tag>] > > [-n num_neighbors] [-f vector_output_file] > > <pos_term_1> [pos_term_2 ... pos_term_n] > > [NOT neg_term_1 ... neg_term_n] > > > > Task: -w associate words (DEFAULT) > > -d associate documents > > -q print query vector > > > > I have also compiled a new version of associate from CVS (1.2) and > > it runs > > with the same results. > > > > I think the problem could be an old code in the public version of > > associate > > in Infomap 0.8.6 and I think no one is able to run associate with - > > v or -p > > options. It would be nice, if you or Beate Dorow could add a > > revision of > > associate in CVS. > > > > Many thanks for your help and best regards > > > > Zbynek Studenovsky > > ----------------------------------------- > > email: zb...@ma... 
> > homepage: http://homepage.mac.com/zbynek > > > > Am 23.10.2006 18:13 Uhr schrieb "Scott Cederberg" unter > > <ced...@gm...>: > > > >> Hi Zbynek, > >> > >> I'm glad to hear that you've gotten the Web frontend working. > >> Are you able to run associate from the command line with the -p or -v > >> options? > >> > >> Unfortunately I'm not that familiar with these options... I'll > >> give them a try tonight when I get home, though, to see if they work > >> in my version of the software. > >> > >> > >> Scott > >> > >> On 10/23/06, Zbynek Studenovsky <zb...@ma...> wrote: > >>> Dear Sirs, > >>> > >>> I use a Infomap software to build a language model (Greek New > >>> Testament and > >>> LXX) for my doctoral thesis. For publishing I would like to make > >>> my model > >>> available for another researches via www. For this reason I have > >>> installed > >>> on my system (Mac OS X 10.3.9 PowerPC) an Infomap demo PERL and > >>> CGI scripts > >>> from CVS (webif) directory. All scripts are running great and I > >>> can search > >>> for nearest neighbors of related words and retrieve documents > >>> without > >>> problems, also with negative keywords. > >>> > >>> Regrettably, I am unable to search for "clustered results" and > >>> "contrasting > >>> pairs" with my version of associate (?) - my Apache server > >>> (version 1.3) > >>> records in error_log "Bad option: -v" for "clustered results" and > >>> "Bad > >>> option: -p" for "contrasting pairs". > >>> > >>> My question is: Is the 'problem' in air file code (lines 74-79): > >>> > >>> #71 sub associate(){ > >>> #72 $command = "associate -w -c " . $input{'corpus'}; > >>> #73 if( $input{'contrast'} eq 'clustered' ){ > >>> #74 $command = $command . " -v clusters " . $input{'results'} . > >>> " " > >>> #75 . $input{'clusters'}; > >>> #76 } > >>> #77 if( $input{'contrast'} eq 'pairs' ){ > >>> #78 $command = $command . " -p"; > >>> #79 } > >>> > >>> or in my installed 'old' version of associate? > >>> > >>> Many thanks for your help and best regards from Prague > >>> > >>> Zbynek Studenovsky > >>> ----------------------------------------- > >>> email: zb...@ma... > >>> homepage: http://homepage.mac.com/zbynek > >>> > >>> > >>> > > > > > > <associate_man.txt> > > |
From: Dominic W. <wi...@ma...> - 2006-10-24 12:29:07
Dear Zbynek, Sorry for not responding sooner. I see what your problem is now. When we cleaned up the infomap codebase for public release, we only had a couple of months before the project's funding ended, and decided that the clustering and contrasting pairs options were bleeding edge and optional. So we left them out, hoping to put them into later releases. We only released the web interface code more recently, and in a "strictly beta" fashion - it hasn't been through the same cleaning up process. It turns out that the Perl web interface code is making calls against options in the core C infomap engine that won't be there in your release. I will think about trying to solve this (and Scott, if you have any time to help, that would be great). I was never 100% happy with the former clustering code, it wasn't a great clustering algorithm, and it was very wrapped in to the "associate" executable. It might be much easier to create and maintain a more modular architecture, where associate writes out the vectors to be clustered (e.g. the top 200) to a file using the "-f" option, and clustering libraries read these files. It wouldn't be as efficient, but I don't believe that this would make too much difference in practice (not for a single user running the algorithm once as part of a browser interface - other latencies are greater than writing and reading a file some where on your web server). This way we could write or reuse many clustering algorithms quite easily. It's also the way that the Java word spectrum plotter works. If you want to try wrapping in the clustering algorithm from the original infomap source, I could send you the files and try to help you. But it may actually take longer, I'm not sure. Best wishes, Dominic On Oct 24, 2006, at 5:19 AM, Zbynek Studenovsky wrote: > Dear Scott, > > thank you for your prompt respond. Unfortunately, I am not able to run > associate from the command line with the -p or -v options. The only > available options are in the attached man file (associate_man.txt) > and here: > > Usage:associate [-w | -d | -q] > [-i type_of_input(d or w)] [-f vector_output_file] > ( [-t] | [-m model_dir] ) > [-c <model_tag>] > [-n num_neighbors] [-f vector_output_file] > <pos_term_1> [pos_term_2 ... pos_term_n] > [NOT neg_term_1 ... neg_term_n] > > Task: -w associate words (DEFAULT) > -d associate documents > -q print query vector > > I have also compiled a new version of associate from CVS (1.2) and > it runs > with the same results. > > I think the problem could be an old code in the public version of > associate > in Infomap 0.8.6 and I think no one is able to run associate with - > v or -p > options. It would be nice, if you or Beate Dorow could add a > revision of > associate in CVS. > > Many thanks for your help and best regards > > Zbynek Studenovsky > ----------------------------------------- > email: zb...@ma... > homepage: http://homepage.mac.com/zbynek > > Am 23.10.2006 18:13 Uhr schrieb "Scott Cederberg" unter > <ced...@gm...>: > >> Hi Zbynek, >> >> I'm glad to hear that you've gotten the Web frontend working. >> Are you able to run associate from the command line with the -p or -v >> options? >> >> Unfortunately I'm not that familiar with these options... I'll >> give them a try tonight when I get home, though, to see if they work >> in my version of the software. 
>> >> >> Scott >> >> On 10/23/06, Zbynek Studenovsky <zb...@ma...> wrote: >>> Dear Sirs, >>> >>> I use a Infomap software to build a language model (Greek New >>> Testament and >>> LXX) for my doctoral thesis. For publishing I would like to make >>> my model >>> available for another researches via www. For this reason I have >>> installed >>> on my system (Mac OS X 10.3.9 PowerPC) an Infomap demo PERL and >>> CGI scripts >>> from CVS (webif) directory. All scripts are running great and I >>> can search >>> for nearest neighbors of related words and retrieve documents >>> without >>> problems, also with negative keywords. >>> >>> Regrettably, I am unable to search for "clustered results" and >>> "contrasting >>> pairs" with my version of associate (?) - my Apache server >>> (version 1.3) >>> records in error_log "Bad option: -v" for "clustered results" and >>> "Bad >>> option: -p" for "contrasting pairs". >>> >>> My question is: Is the 'problem' in air file code (lines 74-79): >>> >>> #71 sub associate(){ >>> #72 $command = "associate -w -c " . $input{'corpus'}; >>> #73 if( $input{'contrast'} eq 'clustered' ){ >>> #74 $command = $command . " -v clusters " . $input{'results'} . >>> " " >>> #75 . $input{'clusters'}; >>> #76 } >>> #77 if( $input{'contrast'} eq 'pairs' ){ >>> #78 $command = $command . " -p"; >>> #79 } >>> >>> or in my installed 'old' version of associate? >>> >>> Many thanks for your help and best regards from Prague >>> >>> Zbynek Studenovsky >>> ----------------------------------------- >>> email: zb...@ma... >>> homepage: http://homepage.mac.com/zbynek >>> >>> >>> > > > <associate_man.txt> |
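The modular flow Dominic sketches above — associate dumps its top-N vectors to a file with "-f", and a separate, replaceable program does the clustering — might look something like the toy below. The one-vector-per-line file format is only a guess at what "-f" writes, and the "clustering" is a deliberately crude single pass that uses the first K vectors as fixed centroids, not the algorithm that used to live inside associate.

/* clusters.c -- hedged sketch, not part of the infomap release.
 * Reads "word x1 x2 ... xd" lines (an assumed format for associate -f)
 * and assigns each word to the nearest of the first K words, by cosine. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define DIM   100      /* assumed number of coordinates per vector */
#define K     5        /* number of clusters */
#define MAXW  64
#define MAXN  200      /* e.g. the "top 200" mentioned above */

static double cosine(const double *a, const double *b)
{
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < DIM; i++) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (sqrt(na) * sqrt(nb) + 1e-12);
}

int main(int argc, char **argv)
{
    static char   words[MAXN][MAXW];
    static double vecs[MAXN][DIM];
    int n = 0;

    FILE *fp = fopen(argc > 1 ? argv[1] : "vectors.txt", "r");
    if (!fp) { perror("vector file"); return 1; }

    while (n < MAXN && fscanf(fp, "%63s", words[n]) == 1) {
        for (int i = 0; i < DIM; i++)
            if (fscanf(fp, "%lf", &vecs[n][i]) != 1) { fclose(fp); return 1; }
        n++;
    }
    fclose(fp);

    /* one crude pass: the first K loaded vectors act as fixed centroids */
    for (int i = 0; i < n; i++) {
        int best = 0;
        for (int c = 1; c < K && c < n; c++)
            if (cosine(vecs[i], vecs[c]) > cosine(vecs[i], vecs[best]))
                best = c;
        printf("%s\t%d\n", words[i], best);
    }
    return 0;
}

Swapping in a real clustering library then only means replacing the second loop, which is exactly the point of writing the vectors out to a file first.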
From: Zbynek S. <zb...@ma...> - 2006-10-24 09:19:42
Dear Scott, thank you for your prompt respond. Unfortunately, I am not able to run associate from the command line with the -p or -v options. The only available options are in the attached man file (associate_man.txt) and here: Usage:associate [-w | -d | -q] [-i type_of_input(d or w)] [-f vector_output_file] ( [-t] | [-m model_dir] ) [-c <model_tag>] [-n num_neighbors] [-f vector_output_file] <pos_term_1> [pos_term_2 ... pos_term_n] [NOT neg_term_1 ... neg_term_n] Task: -w associate words (DEFAULT) -d associate documents -q print query vector I have also compiled a new version of associate from CVS (1.2) and it runs with the same results. I think the problem could be an old code in the public version of associate in Infomap 0.8.6 and I think no one is able to run associate with -v or -p options. It would be nice, if you or Beate Dorow could add a revision of associate in CVS. Many thanks for your help and best regards Zbynek Studenovsky ----------------------------------------- email: zb...@ma... homepage: http://homepage.mac.com/zbynek Am 23.10.2006 18:13 Uhr schrieb "Scott Cederberg" unter <ced...@gm...>: > Hi Zbynek, > > I'm glad to hear that you've gotten the Web frontend working. > Are you able to run associate from the command line with the -p or -v > options? > > Unfortunately I'm not that familiar with these options... I'll > give them a try tonight when I get home, though, to see if they work > in my version of the software. > > > Scott > > On 10/23/06, Zbynek Studenovsky <zb...@ma...> wrote: >> Dear Sirs, >> >> I use a Infomap software to build a language model (Greek New Testament and >> LXX) for my doctoral thesis. For publishing I would like to make my model >> available for another researches via www. For this reason I have installed >> on my system (Mac OS X 10.3.9 PowerPC) an Infomap demo PERL and CGI scripts >> from CVS (webif) directory. All scripts are running great and I can search >> for nearest neighbors of related words and retrieve documents without >> problems, also with negative keywords. >> >> Regrettably, I am unable to search for "clustered results" and "contrasting >> pairs" with my version of associate (?) - my Apache server (version 1.3) >> records in error_log "Bad option: -v" for "clustered results" and "Bad >> option: -p" for "contrasting pairs". >> >> My question is: Is the 'problem' in air file code (lines 74-79): >> >> #71 sub associate(){ >> #72 $command = "associate -w -c " . $input{'corpus'}; >> #73 if( $input{'contrast'} eq 'clustered' ){ >> #74 $command = $command . " -v clusters " . $input{'results'} . " " >> #75 . $input{'clusters'}; >> #76 } >> #77 if( $input{'contrast'} eq 'pairs' ){ >> #78 $command = $command . " -p"; >> #79 } >> >> or in my installed 'old' version of associate? >> >> Many thanks for your help and best regards from Prague >> >> Zbynek Studenovsky >> ----------------------------------------- >> email: zb...@ma... >> homepage: http://homepage.mac.com/zbynek >> >> >> |
From: Scott C. <ced...@gm...> - 2006-10-23 16:13:15
Hi Zbynek, I'm glad to hear that you've gotten the Web frontend working. Are you able to run associate from the command line with the -p or -v options? Unfortunately I'm not that familiar with these options... I'll give them a try tonight when I get home, though, to see if they work in my version of the software. Scott On 10/23/06, Zbynek Studenovsky <zb...@ma...> wrote: > Dear Sirs, > > I use a Infomap software to build a language model (Greek New Testament and > LXX) for my doctoral thesis. For publishing I would like to make my model > available for another researches via www. For this reason I have installed > on my system (Mac OS X 10.3.9 PowerPC) an Infomap demo PERL and CGI scripts > from CVS (webif) directory. All scripts are running great and I can search > for nearest neighbors of related words and retrieve documents without > problems, also with negative keywords. > > Regrettably, I am unable to search for "clustered results" and "contrasting > pairs" with my version of associate (?) - my Apache server (version 1.3) > records in error_log "Bad option: -v" for "clustered results" and "Bad > option: -p" for "contrasting pairs". > > My question is: Is the 'problem' in air file code (lines 74-79): > > #71 sub associate(){ > #72 $command = "associate -w -c " . $input{'corpus'}; > #73 if( $input{'contrast'} eq 'clustered' ){ > #74 $command = $command . " -v clusters " . $input{'results'} . " " > #75 . $input{'clusters'}; > #76 } > #77 if( $input{'contrast'} eq 'pairs' ){ > #78 $command = $command . " -p"; > #79 } > > or in my installed 'old' version of associate? > > Many thanks for your help and best regards from Prague > > Zbynek Studenovsky > ----------------------------------------- > email: zb...@ma... > homepage: http://homepage.mac.com/zbynek > > > |
From: Zbynek S. <zb...@ma...> - 2006-10-23 11:12:44
Dear Sirs,

I use the Infomap software to build a language model (Greek New Testament and LXX) for my doctoral thesis. For publication I would like to make my model available to other researchers via the web. For this reason I have installed on my system (Mac OS X 10.3.9 PowerPC) the Infomap demo Perl and CGI scripts from the CVS (webif) directory. All scripts are running great and I can search for nearest neighbors of related words and retrieve documents without problems, also with negative keywords.

Regrettably, I am unable to search for "clustered results" and "contrasting pairs" with my version of associate (?) - my Apache server (version 1.3) records "Bad option: -v" in error_log for "clustered results" and "Bad option: -p" for "contrasting pairs".

My question is: is the 'problem' in the air file code (lines 74-79):

#71 sub associate(){
#72 $command = "associate -w -c " . $input{'corpus'};
#73 if( $input{'contrast'} eq 'clustered' ){
#74 $command = $command . " -v clusters " . $input{'results'} . " "
#75 . $input{'clusters'};
#76 }
#77 if( $input{'contrast'} eq 'pairs' ){
#78 $command = $command . " -p";
#79 }

or in my installed 'old' version of associate?

Many thanks for your help and best regards from Prague

Zbynek Studenovsky
-----------------------------------------
email: zb...@ma...
homepage: http://homepage.mac.com/zbynek
From: Christian P. <pr...@gm...> - 2006-09-14 22:20:28
Dear Ted and Dominic, Thank you for all the helpful information. The "folding in" is great idea but I am concerned that with a growing amount of rows and columns the result will become increasingly imprecise (plus Perl's performance might be of a concern too). The data base used in my thesis is 700k+ terms and 200k+ documents. Eventually only a more "economical" adaptation of INFOMAP will solve the problem in my opinion. The information about the SVDPACK format are very welcome in this regard. A modification of INFOMAP is quite possible but I am uncertain if I will have the time available to do so. In case of a successful modification I will post a patch on this list but don't hold your breath yet. Cheers, Christian p.s. If anyone on this list has further ideas or hints feel free to send an email any time. ted pedersen wrote: > Hi Christian, > > I have been following your notes on the infomap mailing list, and > wanted to mention that we have used SVDPACKC a fair bit, and I think > it might scale reasonably well to your particular situation. The problem > with SVDPACK is that it uses a rather obscure input format, and then the > output format is equally obscure. :) But, we have created some programs > that try and deal with that in the SenseClusters package. > > http://senseclusters.sourceforge.net > > There are two programs that might help - the first is called > > mat2harbo.pl > > and this takes a matrix in a fairly standard adjacency matrix > representatin (sparse) and converts it to Harwell-Boeing format, > which is what SVDPACKC requires. It also helps set up the > parameters that SVDPACKC needs to run, and then goes ahead and > runs las2 (one of the types of SVD supported by SVDPACKC, and > to our mind the most standard and reliable). > > Then, a program called svdpackout.pl is run to read the binary > files generated by las2 and produce more readable output, which > allows you to see the post-svd matrix in a plain text form that > you can then use for whatever you need to do. > > I hope this might help you try out SVDPACKC. I don't know if it > will solve your problem exactly, but I think it has a good chance > of doing so. We have run matrices of approximately the size you > describe with SenseClusters. > > BTW, SVDPACKC is the C version of SVDPACK, download and install > instructions are included with SenseClusters in the INSTALL file. > > Cordially, > Ted > > On Mon, 11 Sep 2006, Dominic Widdows wrote: > > >> Dear Christian, >> >> I'm afraid the deafening silence in response to your question seems >> to suggest that there isn't a very good answer to your questions - at >> least, not one that anyone has actively used yet. >> >> In answer to your SVD question - I don't think that SVD-Pack would >> necessarily run into the same problems, because it uses a sparse >> representation. (At least, I know that it reads a fairly sparse >> column-major representation from disk, though I don't really know its >> internals.) It would certainly have scaling issues at some point, but >> I don't know how these would compare with infomap's initial matrix >> generation. >> >> Computing and writing the matrix in blocks would certainly be an >> effort - one I'd very appreciate someone doing, but not to be taken >> on lightly. >> >> Here is one sort-of solution I've used in the past for extending a >> basic model to individual rare words or phrases. Compute a basic >> infomap model within the 50k x 1k safe area. 
Once you've done this, >> you can generate word vectors for rare words using the same "folding >> in" method you might use to get context vectors, document vectors, >> etc. That is, for a single rare word W, collect the words V_1, ... , >> V_n that occur near W (using grep or some more principled method), >> take an average of those V_i that already have word vectors, and call >> this the word vector for W. In this way, you can build a framework >> from the common words, and use this as scaffolding to get vectors for >> rare words. >> >> Used naively, the method scales pretty poorly - if you wanted to >> create vectors for another 50k words, you'd be pretty sad to run >> 50,000 greps end to end. Obviously you wouldn't do this in practice, >> you'd write something to keep track of your next 50k words and their >> word vectors as you go along. For example, some data structure that >> recorded "word, vector, count_of_neighbors_used" would enable you to >> update the word vector when you encountered new neighbors in text, >> using the count to weight changes to the vector. In this case, memory >> requirements to add a lot of new words would be pretty minimal. For >> large scale work, you'd then want to find a way of adding these >> vectors to the database files you already have for the common words. >> >> So, there is work to do, but I think it's simpler than refactoring >> the matrix algebra. If you only want word vectors for a few rare >> words, it's really easy. Let me know if this is the case, I have a >> (very grubby) perl script already that might help you out. >> >> Sorry for the delay in answering, I hope this helps. >> Dominic >> >> On Sep 8, 2006, at 3:55 AM, Christian Prokopp wrote: >> >> >>> Hello, >>> >>> I am running INFOMAP on a 32bit Linux machine and have problems when I >>> try to use a large matrix, e.g. beyond 40k x 2k or 80k x 1k. My >>> suspicion is that the matrix allocated in initialize_matrix() in >>> matrix.c exits because it runs out of address space at around 3GB. >>> Does anyone have a solution besides using a 64bit system? >>> It seems very possible to rewrite the parts of INFOMAP to compute and >>> write the matrix in blocks rather than in its entirety but (a) that >>> is a >>> lot of work and (b) would SVD-Pack run into the same problem? >>> >>> Any thoughts are appreciated! >>> >>> Cheers, >>> Christian >>> >>> ---------------------------------------------------------------------- >>> --- >>> Using Tomcat but need to do more? Need to support web services, >>> security? >>> Get stuff done quickly with pre-integrated technology to make your >>> job easier >>> Download IBM WebSphere Application Server v.1.0.1 based on Apache >>> Geronimo >>> http://sel.as-us.falkag.net/sel? >>> cmd=lnk&kid=120709&bid=263057&dat=121642 >>> _______________________________________________ >>> infomap-nlp-devel mailing list >>> inf...@li... >>> https://lists.sourceforge.net/lists/listinfo/infomap-nlp-devel >>> >>> >> ------------------------------------------------------------------------- >> Using Tomcat but need to do more? Need to support web services, security? >> Get stuff done quickly with pre-integrated technology to make your job easier >> Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo >> http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642 >> _______________________________________________ >> infomap-nlp-devel mailing list >> inf...@li... >> https://lists.sourceforge.net/lists/listinfo/infomap-nlp-devel >> >> > > |
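Harwell-Boeing, the input format Ted mentions for SVDPACKC's las2, is essentially a compressed-sparse-column (column-major) matrix serialized as fixed-width text. Below is a hedged sketch of building that in-memory structure from (row, column, value) triples; the text serialization itself, which mat2harbo.pl takes care of, is left out, and all of the names here are illustrative rather than taken from SenseClusters or SVDPACKC.

/* Hedged sketch: build a compressed-sparse-column (CSC) matrix from
 * (row, col, value) triples -- the layout that Harwell-Boeing files encode.
 * Triples must already be sorted by column for this one-pass build. */
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int     rows, cols, nnz;
    int    *colptr;   /* cols+1 entries: start of each column in rowind/val */
    int    *rowind;   /* nnz entries */
    double *val;      /* nnz entries */
} csc_matrix;

static csc_matrix csc_build(int rows, int cols, int nnz,
                            const int *r, const int *c, const double *v)
{
    csc_matrix m = { rows, cols, nnz,
                     calloc(cols + 1, sizeof(int)),
                     malloc(nnz * sizeof(int)),
                     malloc(nnz * sizeof(double)) };
    for (int k = 0; k < nnz; k++) {
        m.rowind[k] = r[k];
        m.val[k]    = v[k];
        m.colptr[c[k] + 1]++;                 /* count entries per column */
    }
    for (int j = 0; j < cols; j++)            /* prefix sums -> column offsets */
        m.colptr[j + 1] += m.colptr[j];
    return m;
}

int main(void)
{
    /* a 3x3 toy co-occurrence matrix with 4 non-zeros, sorted by column */
    int    r[] = { 0, 2, 1, 0 };
    int    c[] = { 0, 0, 1, 2 };
    double v[] = { 2.0, 1.0, 5.0, 3.0 };
    csc_matrix m = csc_build(3, 3, 4, r, c, v);
    printf("column 0 holds %d non-zeros\n", m.colptr[1] - m.colptr[0]);
    free(m.colptr); free(m.rowind); free(m.val);
    return 0;
}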
From: Dominic W. <wi...@ma...> - 2006-09-12 01:56:53
Dear Christian, I'm afraid the deafening silence in response to your question seems to suggest that there isn't a very good answer to your questions - at least, not one that anyone has actively used yet. In answer to your SVD question - I don't think that SVD-Pack would necessarily run into the same problems, because it uses a sparse representation. (At least, I know that it reads a fairly sparse column-major representation from disk, though I don't really know its internals.) It would certainly have scaling issues at some point, but I don't know how these would compare with infomap's initial matrix generation. Computing and writing the matrix in blocks would certainly be an effort - one I'd very appreciate someone doing, but not to be taken on lightly. Here is one sort-of solution I've used in the past for extending a basic model to individual rare words or phrases. Compute a basic infomap model within the 50k x 1k safe area. Once you've done this, you can generate word vectors for rare words using the same "folding in" method you might use to get context vectors, document vectors, etc. That is, for a single rare word W, collect the words V_1, ... , V_n that occur near W (using grep or some more principled method), take an average of those V_i that already have word vectors, and call this the word vector for W. In this way, you can build a framework from the common words, and use this as scaffolding to get vectors for rare words. Used naively, the method scales pretty poorly - if you wanted to create vectors for another 50k words, you'd be pretty sad to run 50,000 greps end to end. Obviously you wouldn't do this in practice, you'd write something to keep track of your next 50k words and their word vectors as you go along. For example, some data structure that recorded "word, vector, count_of_neighbors_used" would enable you to update the word vector when you encountered new neighbors in text, using the count to weight changes to the vector. In this case, memory requirements to add a lot of new words would be pretty minimal. For large scale work, you'd then want to find a way of adding these vectors to the database files you already have for the common words. So, there is work to do, but I think it's simpler than refactoring the matrix algebra. If you only want word vectors for a few rare words, it's really easy. Let me know if this is the case, I have a (very grubby) perl script already that might help you out. Sorry for the delay in answering, I hope this helps. Dominic On Sep 8, 2006, at 3:55 AM, Christian Prokopp wrote: > Hello, > > I am running INFOMAP on a 32bit Linux machine and have problems when I > try to use a large matrix, e.g. beyond 40k x 2k or 80k x 1k. My > suspicion is that the matrix allocated in initialize_matrix() in > matrix.c exits because it runs out of address space at around 3GB. > Does anyone have a solution besides using a 64bit system? > It seems very possible to rewrite the parts of INFOMAP to compute and > write the matrix in blocks rather than in its entirety but (a) that > is a > lot of work and (b) would SVD-Pack run into the same problem? > > Any thoughts are appreciated! > > Cheers, > Christian > > ---------------------------------------------------------------------- > --- > Using Tomcat but need to do more? Need to support web services, > security? > Get stuff done quickly with pre-integrated technology to make your > job easier > Download IBM WebSphere Application Server v.1.0.1 based on Apache > Geronimo > http://sel.as-us.falkag.net/sel? 
> cmd=lnk&kid=120709&bid=263057&dat=121642 > _______________________________________________ > infomap-nlp-devel mailing list > inf...@li... > https://lists.sourceforge.net/lists/listinfo/infomap-nlp-devel > |
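The incremental bookkeeping Dominic describes — a record of (word, vector, count_of_neighbors_used) updated whenever a new neighbor with a known word vector turns up — reduces to keeping a running average. A minimal sketch follows, with illustrative names and a fixed toy dimensionality rather than anything from the infomap code.

/* Hedged sketch of incremental "folding in" for a rare word. */
#include <stdio.h>

#define DIM 100                     /* dimensionality of the reduced space */

typedef struct {
    char   word[64];
    double vector[DIM];             /* running average of neighbor vectors */
    long   neighbors_used;          /* how many neighbors went into it */
} rare_word;

/* Fold one more neighbor's vector into the running average:
 * new_avg = (n * old_avg + neighbor) / (n + 1). */
static void fold_in(rare_word *w, const double neighbor_vec[DIM])
{
    long n = w->neighbors_used;
    for (int i = 0; i < DIM; i++)
        w->vector[i] = (n * w->vector[i] + neighbor_vec[i]) / (double)(n + 1);
    w->neighbors_used = n + 1;
}

int main(void)
{
    rare_word w = { "LXX", {0}, 0 };        /* a rare word with no vector yet */
    double v1[DIM] = {0}, v2[DIM] = {0};
    v1[0] = 1.0;  v2[0] = 3.0;              /* toy neighbor vectors */

    fold_in(&w, v1);
    fold_in(&w, v2);
    printf("%s: first coordinate = %.1f after %ld neighbors\n",
           w.word, w.vector[0], w.neighbors_used);   /* 2.0 after 2 */
    return 0;
}

After the two calls the stored vector is the mean of the two neighbor vectors, and the count lets later updates be weighted correctly without re-reading earlier text.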
From: Christian P. <pr...@gm...> - 2006-09-08 07:55:22
Hello,

I am running INFOMAP on a 32bit Linux machine and have problems when I try to use a large matrix, e.g. beyond 40k x 2k or 80k x 1k. My suspicion is that the program exits in initialize_matrix() in matrix.c because the matrix allocation runs out of address space at around 3GB. Does anyone have a solution besides using a 64bit system?

It seems very possible to rewrite the parts of INFOMAP that compute and write the matrix so that they work in blocks rather than on the whole matrix at once, but (a) that is a lot of work and (b) would SVD-Pack run into the same problem?

Any thoughts are appreciated!

Cheers,
Christian
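A back-of-the-envelope check like the sketch below shows how a dense matrix of this kind compares with the roughly 3 GB of user address space a 32-bit Linux process has. The 3 GB cap and the 4-byte float cell are assumptions; if each cell really were a single float, 80k x 1k would only be about 305 MiB, so the exhaustion Christian sees presumably involves larger per-cell storage or several working copies that the sketch does not count.

/* Hedged sketch: estimate the dense matrix size before allocating it,
 * and refuse politely instead of dying mid-run.  The cap and the cell
 * type are assumptions, not values taken from matrix.c. */
#include <stdio.h>

#define ADDRESS_SPACE_CAP ((size_t)3u * 1024u * 1024u * 1024u)  /* ~3 GB */

static size_t dense_matrix_bytes(size_t rows, size_t cols, size_t cell)
{
    return rows * cols * cell;
}

int main(void)
{
    size_t rows = 80000, cols = 1000;           /* one of the failing sizes */
    size_t need = dense_matrix_bytes(rows, cols, sizeof(float));

    printf("single dense array: %.1f MiB\n", need / (1024.0 * 1024.0));
    if (need > ADDRESS_SPACE_CAP) {
        fprintf(stderr, "matrix will not fit in a 32-bit address space\n");
        return 1;
    }
    return 0;
}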
From: Beate D. <do...@im...> - 2005-06-02 08:20:24
Hi! Thanks very much for your bug report! I fixed the problem, committed the new code and the "-f" option of associate should now write the correct word vectors to the file specified. Best, Beate On Tue, 31 May 2005, Linuxer Wang wrote: > Hello, all > > I read part of the source codes, I have a problem with function > find_neighbors() in file neighbors.c. > > I marked the problems lines as @1 and @2 as follows: > > if( neighbor_item.score > threshold) { > if ( (neighbor_item.vector = malloc(vector_size)) == NULL ) { <---------@1 > fprintf( stderr, "neighbors.c: can't allocate vector memory.\n" ); > free_tail( list, neighbor_item_free ); > return 0; > } > list_insert( list, &last, depth, <---------@2 > (void *) &neighbor_item, sizeof( NEIGHBOR_ITEM ), > neighbor_item_cmp, neighbor_item_free ); > > list_length++; > > /* Set the threshold to the lowest value in the list */ > if( (last != NULL) && (list_length >= depth)) > threshold = ((NEIGHBOR_ITEM *) (last->data))->score; > } > > Note that, @1 clears the vector of neighbor_item, so when @2 tries to > insert neighbor_item to the list, the vector of neighbor_item is already > zero array. > So the problem is that the order of @1 and @2 should be conversed. The > current codes cause the output to file contains only zeros. > > Can anyone verify it? > Yours, > > > > ------------------------------------------------------- > This SF.Net email is sponsored by Yahoo. > Introducing Yahoo! Search Developer Network - Create apps using Yahoo! > Search APIs Find out how you can build Yahoo! directly into your own > Applications - visit http://developer.yahoo.net/?fr=offad-ysdn-ostg-q22005 > _______________________________________________ > infomap-nlp-devel mailing list > inf...@li... > https://lists.sourceforge.net/lists/listinfo/infomap-nlp-devel > |
From: Linuxer W. <lin...@gm...> - 2005-05-31 18:25:36
Hello, all

I have read part of the source code, and I have a problem with the function find_neighbors() in the file neighbors.c. I marked the problem lines as @1 and @2 as follows:

if( neighbor_item.score > threshold) {
  if ( (neighbor_item.vector = malloc(vector_size)) == NULL ) {   <---------@1
    fprintf( stderr, "neighbors.c: can't allocate vector memory.\n" );
    free_tail( list, neighbor_item_free );
    return 0;
  }
  list_insert( list, &last, depth,                                <---------@2
               (void *) &neighbor_item, sizeof( NEIGHBOR_ITEM ),
               neighbor_item_cmp, neighbor_item_free );

  list_length++;

  /* Set the threshold to the lowest value in the list */
  if( (last != NULL) && (list_length >= depth))
    threshold = ((NEIGHBOR_ITEM *) (last->data))->score;
}

Note that @1 clears the vector of neighbor_item, so when @2 tries to insert neighbor_item into the list, the vector of neighbor_item is already a zero array. So the problem is that the order of @1 and @2 should be reversed. The current code causes the output file to contain only zeros.

Can anyone verify it?

Yours,
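For readers following along: the hazard described above is that the malloc at @1 overwrites neighbor_item.vector — the pointer to the freshly computed values — before list_insert copies the struct at @2. The toy program below only illustrates that hazard and one safe ordering (allocate a private buffer, copy the values in, then insert); it is not the fix Beate actually committed, and the simplified types are invented.

/* Hedged illustration only -- not the real NEIGHBOR_ITEM, not the real fix. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    double score;
    double *vector;        /* points at the computed similarity values */
} neighbor_item_t;

/* stand-in for list_insert(): stores a shallow copy of the struct */
static neighbor_item_t stored;
static void fake_list_insert(const neighbor_item_t *item)
{
    stored = *item;
}

int main(void)
{
    double computed[3] = { 0.9, 0.5, 0.1 };          /* freshly computed values */
    neighbor_item_t item = { 0.9, computed };

    /* Safe ordering: give the item a buffer it owns, filled with the
     * computed values, BEFORE the struct is copied into the list. */
    double *owned = malloc(sizeof computed);
    if (owned == NULL) { perror("malloc"); return 1; }
    memcpy(owned, computed, sizeof computed);
    item.vector = owned;
    fake_list_insert(&item);

    printf("stored vector[0] = %.1f\n", stored.vector[0]);   /* 0.9, not 0 */
    free(owned);
    return 0;
}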
From: Dominic W. <dwi...@cs...> - 2004-08-02 14:54:00
Dear Colin and Victor, I'm afraid I can't be of much help right now because unfortunately (or rather, fortunately for me I guess) I've just started a new and very exciting job in Pittsbugh, Pennsylvania for a group of mad scientists called MAYA Design. I'm doing lots of fuzzy geometry for dealing with imprecise spatial data, and they're really interested in forming a good general representation for temporal concepts and events, so I might end up learning a lot more about the semantics of verbs than I ever managed to on the Infomap project! I don know that Shuji Yamaguchi implemented a system for turning Japanese characters into ASCII-like text so that the infomap software could build vectors from Japanese corpora. I don't know if Shuji would be able to help you with a development version of whatever he did? Sorry I can't be more help, but good luck. -Dominic On Fri, 30 Jul 2004, Viktor Tron wrote: > Dear Colin, > > > Infomap had a lot of features hardwired that influenced tokenization. > Since it was made for English, the upper half of 8-bit ascii table was > not considered as word characters. > By making this user-driven, now any 8-bit character-coding can be used. > This change has been incorporated in the program and is found > in the cvs source-tree. > > As far as I can see, what you mean is unicode or where characters are more than > one byte. > Although I know nothing about this, I reckon that C IO cannot handle these > and therefore reads characters byte-wise. > Since tokenization into words is character-based (which is now one byte I reckon), > segmentation rules (see documentation) can only be given correctly for a multibyte > language if all possible bytes that occur in word characters and non-word characters are disjoint. And there might be other problems as well I guess. > > Is that correct or complete rubbish? > > This problem as well as the rather crude nature of hard-wired character-set-based > tokenization is why I thought imporivement is in order. > My idea was to have a mode where segmentation is totally user-defined > with word tags (similar to doc/text already built in), that is > in the first tokenization stage any entity between <w> and </w> is > considered a word. > > Anyone having time to implement it? > > Best > Viktor > > On Fri, 30 Jul 2004 15:35:46 +0100, Colin J Bannard <C.J...@ed...> wrote: > > > Hi Viktor, > > > > Earlier in the year you mentioned to me some changes that you had made to the > > InfoMap code to enable it to handle multibyte languages. I have been asked > > about using InfoMap by a researcher in Japan who says that the version > > currently available from Sourceforge doesn't handle Japanese. Do you know what > > happened to the changes you made? If they haven't been included in the official > > release yet, would you be willing to provide my friend in Kobe with the improved > > version? > > > > hope you are well. > > > > see you soon, > > Colin > > > > > ------------------------------------------------------- > This SF.Net email is sponsored by OSTG. Have you noticed the changes on > Linux.com, ITManagersJournal and NewsForge in the past few weeks? Now, > one more big change to announce. We are now OSTG- Open Source Technology > Group. Come see the changes on the new OSTG site. www.ostg.com > _______________________________________________ > infomap-nlp-devel mailing list > inf...@li... > https://lists.sourceforge.net/lists/listinfo/infomap-nlp-devel > |
From: Viktor T. <v....@ed...> - 2004-07-30 18:17:47
Dear Colin,

Infomap had a lot of features hardwired that influenced tokenization. Since it was made for English, the upper half of the 8-bit ASCII table was not considered to contain word characters. By making this user-driven, any 8-bit character coding can now be used. This change has been incorporated in the program and is found in the CVS source tree.

As far as I can see, what you mean is Unicode, or encodings where characters are more than one byte. Although I know nothing about this, I reckon that C I/O cannot handle these and therefore reads characters byte-wise. Since tokenization into words is character-based (where a character is now one byte, I reckon), segmentation rules (see documentation) can only be given correctly for a multibyte language if all possible bytes that occur in word characters and in non-word characters are disjoint. And there might be other problems as well, I guess. Is that correct or complete rubbish?

This problem, as well as the rather crude nature of hard-wired character-set-based tokenization, is why I thought improvement is in order. My idea was to have a mode where segmentation is totally user-defined with word tags (similar to the doc/text tags already built in), that is, in the first tokenization stage any entity between <w> and </w> is considered a word. Anyone having time to implement it?

Best
Viktor

On Fri, 30 Jul 2004 15:35:46 +0100, Colin J Bannard <C.J...@ed...> wrote:

> Hi Viktor,
>
> Earlier in the year you mentioned to me some changes that you had made to the
> InfoMap code to enable it to handle multibyte languages. I have been asked
> about using InfoMap by a researcher in Japan who says that the version
> currently available from Sourceforge doesn't handle Japanese. Do you know what
> happened to the changes you made? If they haven't been included in the official
> release yet, would you be willing to provide my friend in Kobe with the improved
> version?
>
> hope you are well.
>
> see you soon,
> Colin
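A minimal sketch of the tag-driven mode Viktor proposes above: anything between <w> and </w> counts as one token and every other byte is ignored, so segmentation is decided entirely by whoever prepared the corpus, and multibyte encodings pass through untouched. This is an illustration of the idea, not code from the CVS tree, and it assumes well-formed tags.

/* Hedged sketch of <w>...</w> driven tokenization -- not from the infomap
 * sources.  Bytes between the tags are emitted verbatim, one token per pair. */
#include <stdio.h>

static void emit_tagged_words(FILE *in)
{
    char   token[1024];
    int    c, inside = 0;
    size_t len = 0;

    while ((c = fgetc(in)) != EOF) {
        if (!inside) {
            if (c == '<') {                         /* look for "w>" */
                int c1 = fgetc(in), c2 = fgetc(in);
                if (c1 == 'w' && c2 == '>') { inside = 1; len = 0; }
            }
        } else {
            if (c == '<') {                         /* assume this starts </w> */
                int c1 = fgetc(in), c2 = fgetc(in), c3 = fgetc(in);
                (void)c1; (void)c2; (void)c3;       /* consume "/w>" */
                token[len] = '\0';
                printf("%s\n", token);
                inside = 0;
            } else if (len + 1 < sizeof token) {
                token[len++] = (char)c;
            }
        }
    }
}

int main(void)
{
    /* e.g.:  echo '<w>foo</w> junk <w>bar</w>' | ./a.out   prints foo, bar */
    emit_tagged_words(stdin);
    return 0;
}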
From: Beate D. <do...@IM...> - 2004-05-11 13:37:28
Dear Menno, dear Dominic, I added an option -i to associate which allows a user to tell the program whether the query consists of words ( -i w) or of documents (-i d). Together with the -q option, it's now possible to return document vectors. You can check out the new code via cvs on sourceforge. In order to try out the new option you'll first have to rebuild your model from the beginning (the new option relies on a database produced by prepare_corpus). In case of a single-file corpus, "associate -i d" expects document IDs as input; in case of a multi-file corpus, it expects document names. Default is "-i w" which corresponds to associate as before, e.g. associate -w -i w -m ... -c ... word1 .. wordk will return words which are similar to word1 .. wordk. associate -w -i d -m ... -c ... doc_id1 .. doc_idk will return words which are similar to the documents corresponding to doc_id1 .. doc_idk, and associate -q -i d -m ... -c ... doc_id1 .. doc_idk will return the average over the document vectors corresponding to doc_id1 .. doc_idk. So if you wanted to look up the document vectors of a whole bunch of docs you'd have to call "associate -q -i d" for each of the docs (i.e. a loop). Could you please check if the new option works properly, please? If you have suggestions for a more convenient way of getting document vectors or any other suggestions for improvement, let me know. And could you read the new man pages, please, and check whether they are comprehensible? Once you have tested the code, we can post a new release together with the other recent changes. Another thing which I changed is print_doc. So far, it expected document IDs as input. Since "associate -d", in case of a multi-file corpus, returns document names rather than document ids, I thought it'd make more sense to pass document names to print_doc if it's a multi-file corpus. For a single-file corpus, however, print_doc still expects doc ids (since this is what "associate -d" returns in this case). Does that make sense? Best wishes, Beate On Thu, 6 May 2004, Dominic Widdows wrote: > >> There must be an easier way, but I think not many people will be >> interested in the raw document vectors (or am I wrong)? > >Hi Menno, > >It sounds like your work-around to get the document vectors is pretty >effective, though as you say there should be an easier way. > >For word and query vectors there's an "associate -q" option which simply >prints out the query vector rather than performing a search. One way I've >often used to get document vectors is simply to pass the whole document as >an argument to "associate -q", which is pretty unsatisfactory though it >does have the benefit that you can get document vectors for textfiles that >weren't in your original corpus. > >If the "associate -q" option was combined with the "associate_doc" >function Beate described, this would solve the problem properly, and I >could see benefits to making this available (eg. for work on document >clustering). It sounds as though you've already got a workable solution, >but if enough other people on the list express an interest we should look >into it. > >I'm delighted to hear about people using the infomap software as part of a >richer and more complex system of features - I'd be interested to hear >more about your work whenever you are ready. 
> >Best wishes, >Dominic > > >------------------------------------------------------- >This SF.Net email is sponsored by Sleepycat Software >Learn developer strategies Cisco, Motorola, Ericsson & Lucent use to deliver >higher performing products faster, at low TCO. >http://www.sleepycat.com/telcomwpreg.php?From=osdnemail3 >_______________________________________________ >infomap-nlp-users mailing list >inf...@li... >https://lists.sourceforge.net/lists/listinfo/infomap-nlp-users > |
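Where Beate notes that getting vectors for a whole set of documents means calling "associate -q -i d" once per document (i.e. a loop), a small driver along the lines below is one way to run that loop. The model path, corpus tag, document IDs, and the assumption that associate prints the requested vector on stdout are all illustrative, not taken from the release.

/* Hedged sketch: loop over document IDs and capture the output of
 * "associate -q -i d" for each one via popen(). */
#include <stdio.h>

int main(void)
{
    const char *doc_ids[] = { "12", "47", "103" };   /* illustrative IDs */
    char cmd[512], line[4096];

    for (size_t i = 0; i < sizeof doc_ids / sizeof doc_ids[0]; i++) {
        snprintf(cmd, sizeof cmd,
                 "associate -q -i d -m /path/to/model -c mycorpus %s",
                 doc_ids[i]);
        FILE *p = popen(cmd, "r");
        if (p == NULL) { perror("popen"); return 1; }

        printf("doc %s:\n", doc_ids[i]);
        while (fgets(line, sizeof line, p) != NULL)   /* the printed vector */
            fputs(line, stdout);
        pclose(p);
    }
    return 0;
}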
From: Shuji Y. <yam...@ya...> - 2004-04-30 18:15:10
Dear Dominic, Thank you for your advice. Sorry for not responding you sooner. My replies to your points are as follows. < "In your creation of an initial bilingual model (steps 1 and 2), is it possible that any Japanese and English words will be accidentally represented by the same strings of 8 byte characters, or is there some = way of avoiding this possibility?" A good point. I noted it so that I have appended an underscore '_' to = the transliterated (i.e. tr(/0-9a-f/g-v/); in Perl) hex string for a = Japanese word in my program. This gets around the clash problem you kindly = pointed out (assuming that it should be few case of an English natural word = starting with '_'. < "Please let me know if this interpretation is correct, and if so how = well it works - I definitely think it's worth a try. One worry I have is = using such a small training set will give very little information about most = words - many won't appear, and all those that appear with unit frequency = within the same document will be mapped to exactly the same vector. But it will = at least be relatively easy to test, by comparing English searches over the larger model to those within the small model." Thank you for your reading my lengthy explanation. Yes, you get it = right. I have gone through 5-6 cycles of the bootstrapping process. Results are mixed. Even with the small bilingual model out of 200 pairs the document similarity ranking gives good indication of original English news for a Japanese translation on the same or similar event especially when the = news is from small news editing bureaux. It however does not perform well = (with accuracy of less than 10-20%) to identify original English news when it = is reported from large editing bureaux e.g. in London and New York. This is because there are often more than 5-10 news reported from the large = bureaux on a single event throughout a day in slightly different angles and = timings, while it is not the case for news from the small bureaux. The small word vector is not fine-tuned enough to pin-point an exact original among the 5-10 similar news. > "another way might be to select fairly unambiguous names (such as = "Iraq", "Dow Jones stock exchange", "UNICEF") and artificially treat the English = and Japanese versions of these names as identical content-bearing words, now that Beate has enabled the users to choose these words for themselves." This sounds a very interesting approach. I will explore this further. Best regards, Shuji -----Original Message----- From: inf...@li... [mailto:inf...@li...] On Behalf Of = Dominic Widdows Sent: Wednesday, April 21, 2004 5:15 PM To: Shuji Yamaguchi Cc: inf...@li... Subject: Re: [infomap-nlp-devel] Use of Infomap for bootstrapping text alignment. Wondering whether someone could review my method. Dear Shuji, Thanks for this message, and for trying out the infomap software in such = a creative fashion. Please do not hesitate to ask our opinions on such matters - I'm sure we would all be delighted if our work could be put to positive use with AlertNet. My main regret is that I might not have sufficient time or expertise to give as much help as I would like, but I will gladly contribute where I can. Here are some suggestions of possible pitfalls - I don't know if any of them will actually occur. It sounds as though your approach is a very promising way of building a cross-lingual system from a comparable corpus with some seed-alignment, a development we've wished for for some time. 
In your creation of an initial bilingual model (steps 1 and 2), is it possible that any Japanese and English words will be accidentally represented by the same strings of 8 byte characters, or is there some = way of avoiding this possibility? Now, your specific stage: > 5)-1. Replace word vector files (i.e. wordvec.bin, = word2offset.dir/pag, > offset2word.dir/pag) and the dictionary file (dic) in the English = model > obtained in 2) by the ones from the bilingual training model obtained = in > 1). If I understand it correctly, you'd be replacing the vectors and dictionary for a larger English collection with a much smaller set determined just from the aligned pairs? Then in 5)-2, you compute = document vectors for the English collection using vectors from this smaller = model, and in 5)-3 you test this by seeing if query built from a Japanese document retrieves its English counterpart? And if this works well, you can feed in other Japanese documents and treat their best matches as potential translations, increasing the size of the aligned set. Please let me know if this interpretation is correct, and if so how well it works - I definitely think it's worth a try. One worry I have is = using such a small training set will give very little information about most words - many won't appear, and all those that appear with unit frequency within the same document will be mapped to exactly the same vector. But = it will at least be relatively easy to test, by comparing English searches over the larger model to those within the small model. An alternative to using the vectors from the small aligned model might = be to use the larger English model to get term vectors for the Japanese = words in the aligned documents (by averaging the vectors of the docuents these terms appear in). But you'd still have the problem that two Japanese = words of unit frequency appearing in the same documents would be mapped to the same vector. If it doesn't work well with documents, another way might be to select fairly unambiguous names (such as "Iraq", "Dow Jones stock exchange", "UNICEF") and artificially treat the English and Japanese versions of these names as identical content-bearing words, now that Beate has = enabled the users to choose these words for themselves. Can you get a list of English/Japanese term-pairs likt this farily easily? I remember you presented a PowerPoint slide once with a few of these - were they drawn from a larger collection? Please let me know how you get on. Best wishes, Dominic On Tue, 20 Apr 2004, Shuji Yamaguchi wrote: > Hi all, > > I wonder whether some of you could review and validate my = bootstrapping use > of Infomap for bilingual text alignment. I have described it in = details > below. I especially wonder whether my steps of 5)-1 and 5)-2 are (not > authentic but still) a doable short-cut for calculating document = vectors of > new additional documents under an existing model. > > My bootstrapping use of Informap > ---------------------------------------------- > Given a comparable English and Japanese news corpus from Reuters = (which I > work for), my task is to find out an English original news for a given > Japanese translated news. Roughly number of English news is 10 times = more > than Japanese. > > I use Infomap as follows to narrow down candidates of English original news > for a Japanese translation. > > 1) Initially 120 news bilingual pairs are identified rather manually = and > they are used as an initial "training" bilingual corpus. 
> > 2) Each pair of news are merged into a single text file. All of the = pairs > are fed into Infomap to come up with a pseudo bilingual training = model. > (NB: I have not yet used the unreleased bilingual InfoMap. I have converted > a Japanese 2 byte character into a special transliterated hex string = to get > over the current limitation of 8 byte-per-character assumption in = Infomap. I > have also modified count_wordvec.c locally in my copy of Infomap so = that a > whole bilingual file falls into a "context window" for co-occurrence > analysis.) > > 3) Now a few thousands of English news (reported on a particular date) = are > taken out of the rest of corpus and fed into Infomap to create another > English only monolingual model. Some of these English news are the original > for a few hundred Japanese translated news on the same date. (NB: = Actually a > small percentage of the original may have been reported on the = previous date > due to the time difference, but this is ignored for the moment.) > > 4) My basic idea here is to calculate the document vectors for all of = the > English news and a given Japanese translation in 3) above under the > bilingual training model created in 2) above, to compare the = similarity, to > look into a few English news with the highest similarity scores and to > select a real original out of them. > > 5) In order to make best use of Infomap software, I have been doing = the > following for the idea of 4) above: > > 5)-1. Replace word vector files (i.e. wordvec.bin, = word2offset.dir/pag, > offset2word.dir/pag) and the dictionary file (dic) in the English = model > obtained in 2) by the ones from the bilingual training model obtained = in 1). > > 5)-2. Recalculate document vector files (artvec.bin, = art2offset.dir/pag, > offset2art.dir/pag) of the English model by the count_artvec command. = I > suppose this calculate document vector under the bilingual model = because of > the word vector file replacement in 5)-1. > > 5)-3. Treat the given Japanese translation as a long query and = calculate > its vector by my slightly modified version of "associate -d" command (which > accepts a filename of the Japanese translation as well) running = against the > English model with the bilingual word vector created in the 5)-2 step above. > > 5)-4 The associate command nicely lists out English news documents in = the > similarity order for the Japanese translation as query so that I look = at the > list and examine the highest ones to find the real original. > > 6) By repeating 5)-3 and 5)-4 over the few hundreds of Japanese > translations, I can add additional correct pairs (say 10-20) to the initial > set of pairs and go through the 2) - 5) steps again. I hope this would > gradually improve the bilingual model with a growing number of pairs. = I can > then use the sufficiently improved bilingual model for CLIR and other > interesting tasks. > > ---------------( end of my bootstrapping use of > Infomap------------------------------------------------ > > I have looked into count_artvec.c to confirm whether the 5)-1 and 5)-2 would > still work fine, but I am not sure whether I fully understand the following > code within process_region(), which I think is a key here whether my > irregular usage would be still all right. > /* Add the vectors up */ > while( cursor <=3D region_out) { > /* If this is a row label... 
*/ > if( ( row =3D ((env->word_array)[int_buffer[cursor]]).row) >=3D 0) > for( i=3D0; i < singvals; i++) > tmpvector[i] +=3D (env->matrix)[row][i]; > cursor++; > } > My casual walk through of the codes suggests that the word_array in = the IF > statement above will work fine still with words in int_buffer[ ] from = the > English only new and that it would give the document vector for the English > news under the bilingual model. But I am not much confident about it. > > Feeling sorry for a long mail, but I would really appreciate your kind > review and advice. > Best regards, Shuji > > > > ------------------------------------------------------- > This SF.Net email is sponsored by: IBM Linux Tutorials > Free Linux tutorial presented by Daniel Robbins, President and CEO of > GenToo technologies. Learn everything from fundamentals to system > = administration.http://ads.osdn.com/?ad_id=3D1470&alloc_id=3D3638&op=3Dcli= ck > _______________________________________________ > infomap-nlp-devel mailing list > inf...@li... > https://lists.sourceforge.net/lists/listinfo/infomap-nlp-devel > ------------------------------------------------------- This SF.Net email is sponsored by: IBM Linux Tutorials Free Linux tutorial presented by Daniel Robbins, President and CEO of GenToo technologies. Learn everything from fundamentals to system administration.http://ads.osdn.com/?ad_id=3D1470&alloc_id=3D3638&op=3Dcli= ck _______________________________________________ infomap-nlp-devel mailing list inf...@li... https://lists.sourceforge.net/lists/listinfo/infomap-nlp-devel |
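The process_region() excerpt Shuji quotes arrives mangled by quoted-printable encoding ("=3D" stands for "=", and a trailing "=" marks a soft line break). Read through that, the loop he is asking about is reconstructed below; the surrounding types and the little test in main are invented scaffolding added only so the fragment compiles on its own — only the while loop is taken (de-garbled) from his mail, not from count_artvec.c itself.

/* Hedged reconstruction; stand-in declarations, original loop body. */
#include <stdio.h>

typedef struct { int row; } word_entry;             /* stand-in */
typedef struct {
    word_entry *word_array;                          /* stand-in */
    float     **matrix;                              /* stand-in */
} model_env;

static void add_region_vectors(model_env *env, const int *int_buffer,
                               int cursor, int region_out,
                               int singvals, float *tmpvector)
{
    int row, i;
    /* Add the vectors up */
    while( cursor <= region_out) {
        /* If this is a row label... */
        if( ( row = ((env->word_array)[int_buffer[cursor]]).row) >= 0)
            for( i=0; i < singvals; i++)
                tmpvector[i] += (env->matrix)[row][i];
        cursor++;
    }
}

int main(void)
{
    word_entry words[2] = { { 0 }, { -1 } };        /* second entry: no row */
    float  row0[2] = { 1.0f, 2.0f };
    float *rows[1] = { row0 };
    model_env env  = { words, rows };
    int   buffer[2] = { 0, 1 };
    float vec[2]    = { 0.0f, 0.0f };

    add_region_vectors(&env, buffer, 0, 1, 2, vec);
    printf("%.1f %.1f\n", vec[0], vec[1]);          /* 1.0 2.0 */
    return 0;
}

As the loop shows, only tokens whose dictionary entry carries a non-negative row index contribute to the document vector, which is why swapping in word vectors from a different (bilingual) model still works as long as the dictionary and row indices come from that same model.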
From: Dominic W. <dwi...@cs...> - 2004-04-23 02:26:11
Dear All,

I just uploaded and released a new version of infomap-nlp, with the following changes:

Users can opt to specify content-bearing words
Allowed characters set in separate file, to allow non-Roman characters
Fixed bug with count_artvec for small corpora

This hopefully fixes some of the problems people have been having with small corpora. Many thanks to Beate and Viktor for their recent contributions.

If I don't hear any screams of it not working, I'll send an e-mail to the users list tomorrow.

Best wishes,
Dominic
From: Dominic W. <dwi...@cs...> - 2004-04-23 00:18:36
Dear Beate, Thanks for putting together such a comprehensive set of instrcutions for Shuji to use the bilingual code. Would anyone have any objection to putting this up on the SourceForge website (maybe called "infomap-bilingual"), along with the instructions? We can say it's strictly a beta version. This way if anyone does take it upong themselves to use the code and get it as robust as the main infomap code, they have the option. Any thoughts? Dominic On Tue, 20 Apr 2004, Beate Dorow wrote: > > > Dear Shuji, > > Here is the tarball of the Bilingual Infomap code. > Unfortunately, it's not as convenient to use as the monolingual model on > sourceforge. > I prepared a very small example corpus (it's a tiny fraction of the > Canadian-Hansard), so that you can see what you'll have to change in order > to build a model from your own corpus. Since the tarball is already big, I > put the corpus on the web, so you can download it separately from there: > http://infomap.stanford.edu/shuji. > The results on this example corpus are quite bad due to its small size. In > particular, looking for documents related to a query may not result in any > document, because the similarity is below a threshold. This shouldn't > bother you, however, and the results on your own corpus should be a > lot better! > > I added two directories to the main directory, "corpora" and "data" which > are normally located at some other place. You can specify their location > in the Makefile of the main directory by changing the CORPUS_DIR, > DATA_PATH, DATA_DIR variables. > > To build a model from the example corpus, you have to do the following: > > * Download the example corpus from the web, unpack it and put it in the > BiLing/corpora directory. > > * Go into the BiLing/search directory and create the following symbolic > links: > > ln -f -s ../preprocessing/utils.c > ln -f -s ../preprocessing/utils.h > ln -f -s ../preprocessing/list.c > ln -f -s ../preprocessing/list.h > > (I couldn't get the Makefile to do this automatically, so for now, > you'll have to do it by hand). > > * Go back to the main "BiLing" directory and run "make data". You'll > then have to change into the preprocessing directory and run > "encode_wordvec" and then "count_artvec". "make data" is supposed > to build the model at once, but this is another bug which has to be > resolved at some point. > > * Move all the produced model files from data/working_data to > data_finished_data. > > For searching the model, go into the search directory. To look for similar > *words*, run "associate -w", to look for similar documents you have to use > "associate -d". > E.g. to look for *English* words which are similar to "health", the > complete associate command looks like this: > > associate -w -l A A health > (the "A" stands for language A which is English) > > If you are instead interested in *French* words associated to "health", > use: > > associate -w -l B A health > > Or to look for English documents similar to the French word "sant\351", > type > > associate -d -l A B sant\351 > > > To build your own bilingual model, you'll have to do the > following: > > * Suppose your corpus is called "reuters". Add directories named > "reuters" to both the corpora and the data/working_data and > data/finished_data direcotries. Copy your corpus (consisting of > documents and their translations) into the corpora/reuters > directory, together with two stoplists, one for each language, which > you put in a directory corpora/reuters/lists. 
> Now suppose that your corpus consists of English and German documents, > the former ending in ".eng", the latter in ".ger". The names of the > stoplists then have to be "stopeng.list" and "stopger.list". > > * Check that your corpus is in the proper format; documents which are > translations of each other have the same filename stem and differ only > in prefixes and suffixes which indicate which language a file is > written in. In case your documents are big and you want to use smaller > units for counting co-occurrences, sentence id tags (<s id=...>) can be > used (but are not required) to divide each file into smaller chunks. A > document and its translation have to have the same number of sentences, > and sentences which are translations of each other have the same id. > > * You'll then have to create a file named "reutersNames2.txt" in which > you list all the stems of the corpus files together with the number of > sentences contained in each file. There is a perl script > "count_sentences.pl" in the "BiLing/corpora/Canadian-Hansard" directory > which, after a bit of customization, you can use to build this file > automatically. > > * Then edit the Makefile and change the variables (e.g. corpus name, > corpus directory, prefixes, suffixes, ...) such that they fit your > situation. > > * Now, run "make data" to build the model. Then change into the > preprocessing directory and run "encode_wordvec" and then "count_artvec". > The model files are all put in data/working_data, and you'll now have > to move them to data/finished_data. > > * Change into the search directory. "associate ..." should now work > for your corpus. In case you build models from different corpora, you > can use the "-c" option of "associate" to specify which corpus you want > to query. > > There are two things to note: the bilingual code still uses the old > my_isalpha procedure in preprocessing/utils.c to decide which characters > to read during preprocessing, and there is only one is_alpha function for > both languages. Depending on your corpus, you may have to include other > characters than the ones specified in my_isalpha as well. > > I added the "read column labels from file" feature to the bilingual code as well. > So if, instead of taking the most frequent words, you prefer to read the > column labels from a file, you will also have to change the line > $(MY_DIR_STEM)/preprocessing/count_wordvec > in the main Makefile to > $(MY_DIR_STEM)/preprocessing/count_wordvec > -col_label_file (name_of_your_col_label_file) > Column labels are assumed to be language A words. > > I know this is a lot of info at once, and I am sorry it's not more > convenient at the moment. I hope you are successful in building your own > model, and I am happy to help in case there are problems. > > Best wishes, > Beate > > > > > > On Mon, 19 Apr 2004, Shuji Yamaguchi wrote: > > >Dear Beate, > > > >Yes, I would appreciate it as I do not have a reply back from Stanley (and > >Emma told me I should count on him for sapir). > >Could you please send it to me via mail to my CSLI account (which has larger > >quota), > > sh...@cs... > >? > >I assume it would be around 300 kb in size, judging from a gzip file of > >version 0.8.3. > > > >Many thanks for your time and support. > >Regards, Shuji > > |
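For convenience, the example-corpus walkthrough above boils down to a short shell session. The following is only a sketch of the steps in Beate's message, assuming the tarball has been unpacked into a BiLing/ directory and the example corpus into BiLing/corpora; the commands follow the message and may need adjusting on your system.

# build the example bilingual model (sketch, untested)
cd BiLing/search
ln -f -s ../preprocessing/utils.c
ln -f -s ../preprocessing/utils.h
ln -f -s ../preprocessing/list.c
ln -f -s ../preprocessing/list.h
cd ..
make data                    # currently stops early; finish the build by hand:
cd preprocessing
encode_wordvec
count_artvec
cd ..
mv data/working_data/* data/finished_data/    # move the produced model files

# query the model from the search directory
cd search
associate -w -l A A health      # English (language A) words similar to "health"
associate -w -l B A health      # French (language B) words related to "health"
associate -d -l A B sant\351    # English documents similar to the French word "sante"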
From: Dominic W. <dwi...@cs...> - 2004-04-22 00:16:07
|
Dear Shuji, Thanks for this message, and for trying out the infomap software in such a creative fashion. Please do not hesitate to ask our opinions on such matters - I'm sure we would all be delighted if our work could be put to positive use with AlertNet. My main regret is that I might not have sufficient time or expertise to give as much help as I would like, but I will gladly contribute where I can. Here are some suggestions of possible pitfalls - I don't know if any of them will actually occur. It sounds as though your approach is a very promising way of building a cross-lingual system from a comparable corpus with some seed-alignment, a development we've wished for for some time. In your creation of an initial bilingual model (steps 1 and 2), is it possible that any Japanese and English words will be accidentally represented by the same strings of 8-bit characters, or is there some way of avoiding this possibility? Now, your specific stage: > 5)-1. Replace word vector files (i.e. wordvec.bin, word2offset.dir/pag, > offset2word.dir/pag) and the dictionary file (dic) in the English model > obtained in 2) by the ones from the bilingual training model obtained in > 1). If I understand it correctly, you'd be replacing the vectors and dictionary for a larger English collection with a much smaller set determined just from the aligned pairs? Then in 5)-2, you compute document vectors for the English collection using vectors from this smaller model, and in 5)-3 you test this by seeing if a query built from a Japanese document retrieves its English counterpart? And if this works well, you can feed in other Japanese documents and treat their best matches as potential translations, increasing the size of the aligned set. Please let me know if this interpretation is correct, and if so how well it works - I definitely think it's worth a try. One worry I have is that using such a small training set will give very little information about most words - many won't appear, and all those that appear with unit frequency within the same document will be mapped to exactly the same vector. But it will at least be relatively easy to test, by comparing English searches over the larger model to those within the small model. An alternative to using the vectors from the small aligned model might be to use the larger English model to get term vectors for the Japanese words in the aligned documents (by averaging the vectors of the documents these terms appear in). But you'd still have the problem that two Japanese words of unit frequency appearing in the same documents would be mapped to the same vector. If it doesn't work well with documents, another way might be to select fairly unambiguous names (such as "Iraq", "Dow Jones stock exchange", "UNICEF") and artificially treat the English and Japanese versions of these names as identical content-bearing words, now that Beate has enabled the users to choose these words for themselves. Can you get a list of English/Japanese term-pairs like this fairly easily? I remember you presented a PowerPoint slide once with a few of these - were they drawn from a larger collection? Please let me know how you get on. Best wishes, Dominic On Tue, 20 Apr 2004, Shuji Yamaguchi wrote: > Hi all, > > I wonder whether some of you could review and validate my bootstrapping use > of Infomap for bilingual text alignment. I have described it in detail > below.
I especially wonder whether my steps of 5)-1 and 5)-2 are (not > authentic but still) a doable short-cut for calculating document vectors of > new additional documents under an existing model. > > My bootstrapping use of Informap > ---------------------------------------------- > Given a comparable English and Japanese news corpus from Reuters (which I > work for), my task is to find out an English original news for a given > Japanese translated news. Roughly number of English news is 10 times more > than Japanese. > > I use Infomap as follows to narrow down candidates of English original news > for a Japanese translation. > > 1) Initially 120 news bilingual pairs are identified rather manually and > they are used as an initial "training" bilingual corpus. > > 2) Each pair of news are merged into a single text file. All of the pairs > are fed into Infomap to come up with a pseudo bilingual training model. > (NB: I have not yet used the unreleased bilingual InfoMap. I have converted > a Japanese 2 byte character into a special transliterated hex string to get > over the current limitation of 8 byte-per-character assumption in Infomap. I > have also modified count_wordvec.c locally in my copy of Infomap so that a > whole bilingual file falls into a "context window" for co-occurrence > analysis.) > > 3) Now a few thousands of English news (reported on a particular date) are > taken out of the rest of corpus and fed into Infomap to create another > English only monolingual model. Some of these English news are the original > for a few hundred Japanese translated news on the same date. (NB: Actually a > small percentage of the original may have been reported on the previous date > due to the time difference, but this is ignored for the moment.) > > 4) My basic idea here is to calculate the document vectors for all of the > English news and a given Japanese translation in 3) above under the > bilingual training model created in 2) above, to compare the similarity, to > look into a few English news with the highest similarity scores and to > select a real original out of them. > > 5) In order to make best use of Infomap software, I have been doing the > following for the idea of 4) above: > > 5)-1. Replace word vector files (i.e. wordvec.bin, word2offset.dir/pag, > offset2word.dir/pag) and the dictionary file (dic) in the English model > obtained in 2) by the ones from the bilingual training model obtained in 1). > > 5)-2. Recalculate document vector files (artvec.bin, art2offset.dir/pag, > offset2art.dir/pag) of the English model by the count_artvec command. I > suppose this calculate document vector under the bilingual model because of > the word vector file replacement in 5)-1. > > 5)-3. Treat the given Japanese translation as a long query and calculate > its vector by my slightly modified version of "associate -d" command (which > accepts a filename of the Japanese translation as well) running against the > English model with the bilingual word vector created in the 5)-2 step above. > > 5)-4 The associate command nicely lists out English news documents in the > similarity order for the Japanese translation as query so that I look at the > list and examine the highest ones to find the real original. > > 6) By repeating 5)-3 and 5)-4 over the few hundreds of Japanese > translations, I can add additional correct pairs (say 10-20) to the initial > set of pairs and go through the 2) - 5) steps again. I hope this would > gradually improve the bilingual model with a growing number of pairs. 
I can > then use the sufficiently improved bilingual model for CLIR and other > interesting tasks. > > ---------------( end of my bootstrapping use of > Infomap------------------------------------------------ > > I have looked into count_artvec.c to confirm whether the 5)-1 and 5)-2 would > still work fine, but I am not sure whether I fully understand the following > code within process_region(), which I think is a key here whether my > irregular usage would be still all right. > /* Add the vectors up */ > while( cursor <= region_out) { > /* If this is a row label... */ > if( ( row = ((env->word_array)[int_buffer[cursor]]).row) >= 0) > for( i=0; i < singvals; i++) > tmpvector[i] += (env->matrix)[row][i]; > cursor++; > } > My casual walk through of the codes suggests that the word_array in the IF > statement above will work fine still with words in int_buffer[ ] from the > English only new and that it would give the document vector for the English > news under the bilingual model. But I am not much confident about it. > > Feeling sorry for a long mail, but I would really appreciate your kind > review and advice. > Best regards, Shuji > > > > ------------------------------------------------------- > This SF.Net email is sponsored by: IBM Linux Tutorials > Free Linux tutorial presented by Daniel Robbins, President and CEO of > GenToo technologies. Learn everything from fundamentals to system > administration.http://ads.osdn.com/?ad_id=1470&alloc_id=3638&op=click > _______________________________________________ > infomap-nlp-devel mailing list > inf...@li... > https://lists.sourceforge.net/lists/listinfo/infomap-nlp-devel > |
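To make the last suggestion concrete: one purely illustrative way to wire it up would be to rewrite each known English/Japanese name pair to a single shared token before preprocessing, and then list those shared tokens in the column label file that the "read column labels from file" option can pick up. The pair file, its format and the loop below are assumptions for the sketch, not part of the released tools, and they presuppose that the Japanese side has already been transliterated to plain ASCII as Shuji describes.

# name_pairs.txt (hypothetical): "english_token japanese_token" per line; multi-word
# names would need joining first (e.g. with "_" or "~", which the tokenizer lets through)
while read -r en ja; do
  sed -i "s/$ja/$en/g" corpus_copy/*.txt  # fold the transliterated Japanese form into the English token
  echo "$en"
done < name_pairs.txt > col_labels.txt    # then point the column-label option at col_labels.txt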
From: Shuji Y. <yam...@ya...> - 2004-04-21 05:24:59
|
Hi all, I wonder whether some of you could review and validate my bootstrapping use of Infomap for bilingual text alignment. I have described it in detail below. I especially wonder whether my steps 5)-1 and 5)-2 are (not authentic, but still) a doable short-cut for calculating document vectors of new additional documents under an existing model. My bootstrapping use of Infomap ---------------------------------------------- Given a comparable English and Japanese news corpus from Reuters (which I work for), my task is to find the English original news item for a given Japanese translated news item. Roughly, the number of English news items is 10 times the number of Japanese ones. I use Infomap as follows to narrow down candidates for the English original of a Japanese translation. 1) Initially 120 bilingual news pairs are identified rather manually and they are used as an initial "training" bilingual corpus. 2) Each pair of news items is merged into a single text file. All of the pairs are fed into Infomap to come up with a pseudo-bilingual training model. (NB: I have not yet used the unreleased bilingual Infomap. I have converted each Japanese 2-byte character into a special transliterated hex string to get around the current 8-bit-per-character assumption in Infomap. I have also modified count_wordvec.c locally in my copy of Infomap so that a whole bilingual file falls into a "context window" for co-occurrence analysis.) 3) Now a few thousand English news items (reported on a particular date) are taken out of the rest of the corpus and fed into Infomap to create another English-only monolingual model. Some of these English news items are the originals of a few hundred Japanese translated news items on the same date. (NB: Actually a small percentage of the originals may have been reported on the previous date due to the time difference, but this is ignored for the moment.) 4) My basic idea here is to calculate the document vectors for all of the English news items and a given Japanese translation in 3) above under the bilingual training model created in 2) above, to compare the similarity, to look into the few English news items with the highest similarity scores and to select the real original out of them. 5) In order to make the best use of the Infomap software, I have been doing the following for the idea of 4) above: 5)-1. Replace the word vector files (i.e. wordvec.bin, word2offset.dir/pag, offset2word.dir/pag) and the dictionary file (dic) in the English model obtained in 2) by the ones from the bilingual training model obtained in 1). 5)-2. Recalculate the document vector files (artvec.bin, art2offset.dir/pag, offset2art.dir/pag) of the English model with the count_artvec command. I suppose this calculates document vectors under the bilingual model because of the word vector file replacement in 5)-1. 5)-3. Treat the given Japanese translation as a long query and calculate its vector with my slightly modified version of the "associate -d" command (which accepts a filename of the Japanese translation as well) running against the English model with the bilingual word vector created in the 5)-2 step above. 5)-4. The associate command nicely lists out English news documents in similarity order for the Japanese translation as the query, so that I can look at the list and examine the highest ones to find the real original. 6) By repeating 5)-3 and 5)-4 over the few hundred Japanese translations, I can add additional correct pairs (say 10-20) to the initial set of pairs and go through the 2) - 5) steps again.
I hope this would gradually improve the bilingual model with a growing number of pairs. I can then use the sufficiently improved bilingual model for CLIR and other interesting tasks. ---------------( end of my bootstrapping use of Infomap )------------------------------------------------ I have looked into count_artvec.c to confirm whether 5)-1 and 5)-2 would still work fine, but I am not sure whether I fully understand the following code within process_region(), which I think is key to whether my irregular usage is still all right.

/* Add the vectors up */
while( cursor <= region_out) {
  /* If this is a row label... */
  if( ( row = ((env->word_array)[int_buffer[cursor]]).row) >= 0)
    for( i=0; i < singvals; i++)
      tmpvector[i] += (env->matrix)[row][i];
  cursor++;
}

My casual walk-through of the code suggests that the word_array lookup in the IF statement above will still work fine with words in int_buffer[ ] from the English-only news, and that it would give the document vectors for the English news under the bilingual model. But I am not very confident about it. Sorry for the long mail, but I would really appreciate your kind review and advice. Best regards, Shuji |
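In shell terms, steps 5)-1 and 5)-2 amount to roughly the following sketch. The two directory variables are placeholders for the model directories (the bilingual training model built from the aligned pairs in steps 1)-2), and the English-only model from step 3)), and the count_artvec invocation is assumed to be run the same way as in a normal build, so adjust it to your local setup.

BILING=/path/to/bilingual_training_model   # placeholder: small model built from the aligned pairs
ENG=/path/to/english_only_model            # placeholder: English-only monolingual model

# 5)-1: overwrite the English model's word vectors and dictionary with the
# ones from the small bilingual training model
for f in wordvec.bin word2offset.dir word2offset.pag \
         offset2word.dir offset2word.pag dic; do
  cp "$BILING/$f" "$ENG/$f"
done

# 5)-2: recompute the document vector files (artvec.bin, art2offset.dir/pag,
# offset2art.dir/pag) against the replaced word vectors
cd "$ENG"
count_artvec    # exact invocation depends on how your copy is pointed at the model directory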
From: Beate D. <do...@IM...> - 2004-04-20 09:23:41
|
Hi Dominic, >One question, though - is there a default location for the COL_LABEL_FILE? >It can be set in the default-params file and since it's a special option I >guess it doesn't need a default (since the default is not to have one). >Does this sound reasonable? The infomap-build script initializes a variable COL_LABELS_FROM_FILE with 0 and COL_LABEL_FILE with "". So unless the user specifies these variables otherwise (via the -D option of infomap-build or via the parameter file), column labels are "computed" automatically just as before. I think you are right, since a column label file is not necessary for the code, a default location probably doesn't make so much sense. What we could do is to initialize in analogy to the stoplist file: COL_LABELS_FROM_FILE=0 COL_LABEL_FILE="@pkgdatadir@/col.labels" If the user sets COL_LABELS_FROM_FILE to 1, then column labels will be read from the default location. It may however confuse the user that although the Boolean variable is set to 0, COL_LABEL_FILE is not empty. What do you think? Sleep well, Beate > >Best wishes, >Dominic > > >------------------------------------------------------- >This SF.Net email is sponsored by: IBM Linux Tutorials >Free Linux tutorial presented by Daniel Robbins, President and CEO of >GenToo technologies. Learn everything from fundamentals to system >administration.http://ads.osdn.com/?ad_id=1470&alloc_id=3638&op=click >_______________________________________________ >infomap-nlp-devel mailing list >inf...@li... >https://lists.sourceforge.net/lists/listinfo/infomap-nlp-devel > |
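To make the current behaviour concrete, a user who wants the feature today would put something like the following in their parameter file (or pass the same settings via -D); the file path is purely illustrative:

COL_LABELS_FROM_FILE=1                        # defaults to 0, i.e. compute column labels as before
COL_LABEL_FILE="/home/me/models/col.labels"   # illustrative path to your column label file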
From: Viktor T. <v....@ed...> - 2004-04-20 08:56:06
|
On Mon, 19 Apr 2004 16:49:10 -0700 (PDT), Dominic Widdows <dwi...@cs...> wrote: > > Dear All, > > I checked out Viktor's changes and the new valid_chars file seems to work > really well. I don't know if it will work for Japanese as well? Well, if it is encoded with some 8bit ascii, you can always compile your valid_chars file, but I guess eventually unicode seems inevitable... V > Scott - did you manage to track down Beate's problem with getting a new > version called 0.8.4? I think we should definitely get the changes we've > made released. > > Beate - do you think you might be able to update the man pages to explain > the COL_LABELS_FROM_FILE functionality? > > Thanks to everyone for what you've done so far. > Best wishes, > Dominic > > On Mon, 19 Apr 2004, Viktor Tron wrote: > >> Yes finally I have uploaded the changes. >> >> It took me a while cause I wnated to document it so I extended the manpages. >> (Nothing to the tutorial, though) >> >> *Everything* should work like before out of the box. >> >> Please check this if you can with a clean temp checkout and compile, etc. >> >> Best >> Viktor >> >> On Thu, 15 Apr 2004 11:40:14 -0700 (PDT), Dominic Widdows <dwi...@cs...> wrote: >> >> > >> > Dear Viktor, >> > >> > Did you manage to commit your changes to the infomap code to SourcForge at >> > all? >> > >> > Best wishes, >> > Dominic >> > >> > On Thu, 8 Apr 2004, Viktor Tron wrote: >> > >> >> Hello Dominic >> >> I am viktron on Sourcefourge, if you want to add me. >> >> and then I can commit changes. >> >> Or maybe you want me to add changes to the documentation as well. >> >> But then again, that makes sense only if a proper >> >> conception is crystallized concerning what we want the tokenization >> >> to do. >> >> BTW, do you know Colin Bannard? >> >> Best >> >> Viktor >> >> >> >> >> >> Quoting Dominic Widdows <dwi...@cs...>: >> >> >> >> > >> >> > Dear Viktor, >> >> > >> >> > Thanks so much for doing all of this and documenting the changes for >> >> > the >> >> > list. I agree that the my_isalpha function was long overdue an >> >> > overhaul. >> >> > It sounds like your changes are much more far reaching than just this, >> >> > though, and should enable the software to be much more >> >> > language-general. >> >> > For example, we've been hoping to enable support for Japanese and it >> >> > sounds like this will be possible now? >> >> > >> >> > It definitely makes more sense to specify what characters you want the >> >> > tokenizer to treat as alphabetic in a separate file. >> >> > >> >> > I'd definitely like to incorporate these changes to the software - >> >> > would >> >> > the best way be to add you to the project admins on SourceForge and >> >> > allow >> >> > you to commit the changes? If you sign up for an account at >> >> > https://sourceforge.net/ (or if you have one already) >> >> > we can add you as a project developer with the necessary permissions. >> >> > >> >> > Again, thanks so much for the feedback and the contributions. >> >> > Best wishes, >> >> > Dominic >> >> > >> >> > On Thu, 8 Apr 2004, Viktor Tron wrote: >> >> > >> >> > > Hello all, >> >> > > >> >> > > Your software is great, but praises should be on the user list :-). >> >> > > I subsribed to the list now, because I suggest some changes to 0.8.4 >> >> > > >> >> > > If you are interested I send you the tarball or work it out with docs >> >> > etc >> >> > > and commit in cvs. >> >> > > >> >> > > Story and summary of changes are below. 
>> >> > > Cheers >> >> > > Viktor >> >> > > >> >> > > It all started out yesterday. I wanted to use infomap on a >> >> > > Hungarian corpus. I soon figured out why things went wrong already >> >> > at >> >> > > the tokenization step. >> >> > > >> >> > > The problem was: >> >> > > utils.c >> >> > > lines 46--53 >> >> > > >> >> > > /* This is a somewhat radical approach, in that it assumes >> >> > > ASCII for efficiency and will *break* with other character >> >> > > encodings. */ >> >> > > int my_isalpha( int c) { // configured to let underscore through for >> >> > POS >> >> > > and tilda for indexing compounds >> >> > > return( ( c > 64 && c < 91) || ( c > 96 && c < 123) || ( c == '_') >> >> > || c >> >> > > == '~'); >> >> > > } >> >> > > >> >> > > This function is used by the tokenizer to determine which are the >> >> > non-word >> >> > > (breaking) characters. >> >> > > It views 8 bit ascii chars above 128 as non-word (breaking) >> >> > characters, >> >> > > These characters happen to constitute a crucial part of most >> >> > languages >> >> > > other than English >> >> > > usually encoded in ISO-8859-X coding with X>1. >> >> > > >> >> > > It is not that it is a 'radical approach' as someone appropriately >> >> > > described it, >> >> > > but actually makes the program entirely English-specific entirely >> >> > > unnecessarily. >> >> > > So I set out to fix it. >> >> > > >> >> > > The whole alpha test should be done directly by the tokenizer. This >> >> > > funciton actually >> >> > > says how to segment a stram of strings, which is an extremely >> >> > important >> >> > > *meaningful* part of the tokenizer, not an auxiliary function like >> >> > > my_fopen, etc. Fortunately my_isalpha is indeed only used by >> >> > > tokenizer.c. >> >> > > >> >> > > To correctly handle all this, I introduced an extra resource file >> >> > > containing >> >> > > a string of legitimate characters considered valid in words. >> >> > > All other characters will be considered as breaking characters by >> >> > the >> >> > > tokenizer >> >> > > and are skipped. >> >> > > >> >> > > The resource file is read in by initialize_tokenizer (appropriately >> >> > > together with the corpus filenames file) and used to initialize >> >> > > an array (details below). Then lookup from this array can >> >> > conveniently >> >> > > replace >> >> > > all uses of the previous my_isalpha test. >> >> > > >> >> > > This should give sufficiently flexible and charset-independent >> >> > control >> >> > > over simple text-based tokenization, which means it can be a proper >> >> > > multilingual software. >> >> > > Well, I checked and it worked for my Hungarian stuff. >> >> > > >> >> > > Surely I have further ideas of very simple extensions which would >> >> > perform >> >> > > tokenization of already tokenized (e.g. xml) files directly. >> >> > > With this in place the solution with valid_chars would just be >> >> > > one of the two major tokenization modes. >> >> > > Also: read-in doesn't seem to me to be optimized (characters of a line >> >> > are >> >> > > scanned over twice). Since with large corpora this takes up a great >> >> > deal >> >> > > of time, we might want to consider to rewrite it. >> >> > > >> >> > > >> >> > > Details of the changes: >> >> > > nothing in the documentation yet. 
>> >> > > >> >> > > utils.{c,h}: >> >> > > function my_isalpha no longer exists, superseded by >> >> > > more configurable method in tokenizer >> >> > > >> >> > > tokenizer.{c,h}: >> >> > > introduced an int array: valid_chars[256] to look up >> >> > > for a character c, valid_chars[c] is nonzero iff it is a valid >> >> > > word-character >> >> > > if it is 0, it is considered as breaking (and skipped) by the >> >> > tokenizer >> >> > > >> >> > > initialize_tokenizer: now also initializes valid_chars by >> >> > > reading from a file passed as an extra argument >> >> > > >> >> > > prepare_corpus.c: >> >> > > modified invocation of initialize_tokenizer accordingly >> >> > > added parsing code for extra option '-chfile' >> >> > > >> >> > > For proper invocation of prepare_corpus Makefile.data.in and >> >> > > informap-build.in >> >> > > needed to be modified and for proper configuration/installation, >> >> > some >> >> > > further changes: >> >> > > >> >> > > admin/valid_chars.en: >> >> > > new file: contains the valid chars that exactly replicate the chars >> >> > > accepted as non-breaking by the now obsolete my_isalpha (utils.c) >> >> > > I.e.: (c > 64 && c < 91) || ( c > 96 && c < 123) || ( c == '_') || c >> >> > == >> >> > > '~'); >> >> > > >> >> > > admin/default-params.in: >> >> > > line 13: added default value >> >> > > VALID_CHARS_FILE="@pkgdatadir@/valid_chars.en" >> >> > > >> >> > > admin/Makefile: >> >> > > line 216: added default valid chars file 'valid_chars.en' to >> >> > EXTRA_DIST >> >> > > list >> >> > > to be copied into central data directory >> >> > > >> >> > > admin/Makefile.data.in: >> >> > > line 119-125: quotes supplied for all arguments >> >> > > (lack of quotes caused the build procedure to stop already >> >> > at >> >> > > invoking prepare-corpus if some filenames were empty, >> >> > > rather than reaching the point where it could tell what is missing >> >> > > if at all a problem that it is missing.) >> >> > > line 125: added line for valid_chars >> >> > > >> >> > > admin/infomap-build.in: >> >> > > line 113: added line to dump value of VALID_CHARS_FILE >> >> > > >> >> > > line 44: 'cat' corrected to 'echo' (sorry I see sy spotted this >> >> > this >> >> > > morning) >> >> > > this dumps overriding command line settings (-D option) to an extra >> >> > > parameter >> >> > > file which is then sourced. >> >> > > cat expected actual setting strings (such as >> >> > "STOPLIST_FILE=my_stop_list") >> >> > > to be filenames >> >> > > >> >> > > +------------------------------------------------------------------+ >> >> > > |Viktor Tron v....@ed...| >> >> > > |3fl Rm8 2 Buccleuch Pl EH8 9LW Edinburgh Tel +44 131 650 4414| >> >> > > |European Postgraduate College www.coli.uni-sb.de/egk| >> >> > > |School of Informatics www.informatics.ed.ac.uk| >> >> > > |Theoretical and Applied Linguistics www.ling.ed.ac.uk| >> >> > > | @ University of Edinburgh, UK www.ed.ac.uk| >> >> > > |Dept of Computational Linguistics www.coli.uni-sb.de| >> >> > > | @ Saarland University (Saarbruecken, Germany) www.uni-saarland.de| >> >> > > |use LINUX and FREE Software www.linux.org| >> >> > > +------------------------------------------------------------------+ >> >> > > >> >> > > >> >> > > >> >> > > ------------------------------------------------------- >> >> > > This SF.Net email is sponsored by: IBM Linux Tutorials >> >> > > Free Linux tutorial presented by Daniel Robbins, President and CEO >> >> > of >> >> > > GenToo technologies. 
Learn everything from fundamentals to system >> >> > > >> >> > administration.http://ads.osdn.com/?ad_id=1470&alloc_id=3638&op=click >> >> > > _______________________________________________ >> >> > > infomap-nlp-devel mailing list >> >> > > inf...@li... >> >> > > https://lists.sourceforge.net/lists/listinfo/infomap-nlp-devel >> >> > > >> >> > >> >> > >> >> > ------------------------------------------------------- >> >> > This SF.Net email is sponsored by: IBM Linux Tutorials >> >> > Free Linux tutorial presented by Daniel Robbins, President and CEO of >> >> > GenToo technologies. Learn everything from fundamentals to system >> >> > administration.http://ads.osdn.com/?ad_id=1470&alloc_id=3638&op=click >> >> > _______________________________________________ >> >> > infomap-nlp-devel mailing list >> >> > inf...@li... >> >> > https://lists.sourceforge.net/lists/listinfo/infomap-nlp-devel >> >> > >> >> >> >> >> >> >> >> +------------------------------------------------------------------+ >> >> |Viktor Tron v....@ed...| >> >> |3fl Rm8. 2 Buccleuch Place Edinburgh Tel +44 131 650 4414| >> >> |European Postgraduate College www.coli.uni-sb.de/egk| >> >> |School of Informatics www.informatics.ed.ac.uk| >> >> |Theoretical and Applied Linguistics www.ling.ed.ac.uk| >> >> | @ University of Edinburgh, UK www.ed.ac.uk| >> >> |Dept of Computational Linguistics www.coli.uni-sb.de| >> >> | @ Saarland University (Saarbruecken, Germany) www.uni-saarland.de| >> >> |use LINUX and FREE Software www.linux.org| >> >> +------------------------------------------------------------------+ >> >> >> >> >> > > > ------------------------------------------------------- > This SF.Net email is sponsored by: IBM Linux Tutorials > Free Linux tutorial presented by Daniel Robbins, President and CEO of > GenToo technologies. Learn everything from fundamentals to system > administration.http://ads.osdn.com/?ad_id=1470&alloc_id=3638&op=click > _______________________________________________ > infomap-nlp-devel mailing list > inf...@li... > https://lists.sourceforge.net/lists/listinfo/infomap-nlp-devel |
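As a usage sketch of the new tokenizer setup discussed in this thread, a build for a non-English corpus might look roughly like this; the data directory, the Latin-2 example and the trailing infomap-build arguments are placeholders rather than tested commands:

PKGDATADIR=/usr/local/share/infomap-nlp    # wherever your install's data directory is

# start from the shipped English set and add the extra (e.g. ISO-8859-2) letters
cp "$PKGDATADIR/valid_chars.en" my_valid_chars
# ... edit my_valid_chars and append the accented characters your corpus uses ...

# point the build at it, either in the parameter file:
#   VALID_CHARS_FILE="/home/me/my_valid_chars"
# or on the command line:
infomap-build -D VALID_CHARS_FILE=/home/me/my_valid_chars   # plus your usual corpus/model arguments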
From: Dominic W. <dwi...@cs...> - 2004-04-20 07:55:21
|
> I changed the man pages already to explain the COL_LABELS_FROM_FILE > functionality, but let me know if it needs to be explained in more detail. Sorry Beate, I hadn't seen this before. One question, though - is there a default location for the COL_LABEL_FILE? It can be set in the default-params file and since it's a special option I guess it doesn't need a default (since the default is not to have one). Does this sound reasonable? Best wishes, Dominic |
From: Beate D. <do...@IM...> - 2004-04-20 05:22:39
|
Dear Dominic, I changed the man pages already to explain the COL_LABELS_FROM_FILE functionality, but let me know if it needs to be explained in more detail. Cheers, Beate On Mon, 19 Apr 2004, Dominic Widdows wrote: > >Dear All, > >I checked out Viktor's changes and the new valid_chars file seems to work >really well. I don't know if it will work for Japanese as well? > >Scott - did you manage to track down Beate's problem with getting a new >version called 0.8.4? I think we should definitely get the changes we've >made released. > >Beate - do you think you might be able to update the man pages to explain >the COL_LABELS_FROM_FILE functionality? > >Thanks to everyone for what you've done so far. >Best wishes, >Dominic > |
From: Dominic W. <dwi...@cs...> - 2004-04-19 23:49:29
|
Dear All, I checked out Viktor's changes and the new valid_chars file seems to work really well. I don't know if it will work for Japanese as well? Scott - did you manage to track down Beate's problem with getting a new version called 0.8.4? I think we should definitely get the changes we've made released. Beate - do you think you might be able to update the man pages to explain the COL_LABELS_FROM_FILE functionality? Thanks to everyone for what you've done so far. Best wishes, Dominic On Mon, 19 Apr 2004, Viktor Tron wrote: > Yes finally I have uploaded the changes. > > It took me a while cause I wnated to document it so I extended the manpages. > (Nothing to the tutorial, though) > > *Everything* should work like before out of the box. > > Please check this if you can with a clean temp checkout and compile, etc. > > Best > Viktor > > On Thu, 15 Apr 2004 11:40:14 -0700 (PDT), Dominic Widdows <dwi...@cs...> wrote: > > > > > Dear Viktor, > > > > Did you manage to commit your changes to the infomap code to SourcForge at > > all? > > > > Best wishes, > > Dominic > > > > On Thu, 8 Apr 2004, Viktor Tron wrote: > > > >> Hello Dominic > >> I am viktron on Sourcefourge, if you want to add me. > >> and then I can commit changes. > >> Or maybe you want me to add changes to the documentation as well. > >> But then again, that makes sense only if a proper > >> conception is crystallized concerning what we want the tokenization > >> to do. > >> BTW, do you know Colin Bannard? > >> Best > >> Viktor > >> > >> > >> Quoting Dominic Widdows <dwi...@cs...>: > >> > >> > > >> > Dear Viktor, > >> > > >> > Thanks so much for doing all of this and documenting the changes for > >> > the > >> > list. I agree that the my_isalpha function was long overdue an > >> > overhaul. > >> > It sounds like your changes are much more far reaching than just this, > >> > though, and should enable the software to be much more > >> > language-general. > >> > For example, we've been hoping to enable support for Japanese and it > >> > sounds like this will be possible now? > >> > > >> > It definitely makes more sense to specify what characters you want the > >> > tokenizer to treat as alphabetic in a separate file. > >> > > >> > I'd definitely like to incorporate these changes to the software - > >> > would > >> > the best way be to add you to the project admins on SourceForge and > >> > allow > >> > you to commit the changes? If you sign up for an account at > >> > https://sourceforge.net/ (or if you have one already) > >> > we can add you as a project developer with the necessary permissions. > >> > > >> > Again, thanks so much for the feedback and the contributions. > >> > Best wishes, > >> > Dominic > >> > > >> > On Thu, 8 Apr 2004, Viktor Tron wrote: > >> > > >> > > Hello all, > >> > > > >> > > Your software is great, but praises should be on the user list :-). > >> > > I subsribed to the list now, because I suggest some changes to 0.8.4 > >> > > > >> > > If you are interested I send you the tarball or work it out with docs > >> > etc > >> > > and commit in cvs. > >> > > > >> > > Story and summary of changes are below. > >> > > Cheers > >> > > Viktor > >> > > > >> > > It all started out yesterday. I wanted to use infomap on a > >> > > Hungarian corpus. I soon figured out why things went wrong already > >> > at > >> > > the tokenization step. 
> >> > > > >> > > The problem was: > >> > > utils.c > >> > > lines 46--53 > >> > > > >> > > /* This is a somewhat radical approach, in that it assumes > >> > > ASCII for efficiency and will *break* with other character > >> > > encodings. */ > >> > > int my_isalpha( int c) { // configured to let underscore through for > >> > POS > >> > > and tilda for indexing compounds > >> > > return( ( c > 64 && c < 91) || ( c > 96 && c < 123) || ( c == '_') > >> > || c > >> > > == '~'); > >> > > } > >> > > > >> > > This function is used by the tokenizer to determine which are the > >> > non-word > >> > > (breaking) characters. > >> > > It views 8 bit ascii chars above 128 as non-word (breaking) > >> > characters, > >> > > These characters happen to constitute a crucial part of most > >> > languages > >> > > other than English > >> > > usually encoded in ISO-8859-X coding with X>1. > >> > > > >> > > It is not that it is a 'radical approach' as someone appropriately > >> > > described it, > >> > > but actually makes the program entirely English-specific entirely > >> > > unnecessarily. > >> > > So I set out to fix it. > >> > > > >> > > The whole alpha test should be done directly by the tokenizer. This > >> > > funciton actually > >> > > says how to segment a stram of strings, which is an extremely > >> > important > >> > > *meaningful* part of the tokenizer, not an auxiliary function like > >> > > my_fopen, etc. Fortunately my_isalpha is indeed only used by > >> > > tokenizer.c. > >> > > > >> > > To correctly handle all this, I introduced an extra resource file > >> > > containing > >> > > a string of legitimate characters considered valid in words. > >> > > All other characters will be considered as breaking characters by > >> > the > >> > > tokenizer > >> > > and are skipped. > >> > > > >> > > The resource file is read in by initialize_tokenizer (appropriately > >> > > together with the corpus filenames file) and used to initialize > >> > > an array (details below). Then lookup from this array can > >> > conveniently > >> > > replace > >> > > all uses of the previous my_isalpha test. > >> > > > >> > > This should give sufficiently flexible and charset-independent > >> > control > >> > > over simple text-based tokenization, which means it can be a proper > >> > > multilingual software. > >> > > Well, I checked and it worked for my Hungarian stuff. > >> > > > >> > > Surely I have further ideas of very simple extensions which would > >> > perform > >> > > tokenization of already tokenized (e.g. xml) files directly. > >> > > With this in place the solution with valid_chars would just be > >> > > one of the two major tokenization modes. > >> > > Also: read-in doesn't seem to me to be optimized (characters of a line > >> > are > >> > > scanned over twice). Since with large corpora this takes up a great > >> > deal > >> > > of time, we might want to consider to rewrite it. > >> > > > >> > > > >> > > Details of the changes: > >> > > nothing in the documentation yet. 
> >> > > > >> > > utils.{c,h}: > >> > > function my_isalpha no longer exists, superseded by > >> > > more configurable method in tokenizer > >> > > > >> > > tokenizer.{c,h}: > >> > > introduced an int array: valid_chars[256] to look up > >> > > for a character c, valid_chars[c] is nonzero iff it is a valid > >> > > word-character > >> > > if it is 0, it is considered as breaking (and skipped) by the > >> > tokenizer > >> > > > >> > > initialize_tokenizer: now also initializes valid_chars by > >> > > reading from a file passed as an extra argument > >> > > > >> > > prepare_corpus.c: > >> > > modified invocation of initialize_tokenizer accordingly > >> > > added parsing code for extra option '-chfile' > >> > > > >> > > For proper invocation of prepare_corpus Makefile.data.in and > >> > > informap-build.in > >> > > needed to be modified and for proper configuration/installation, > >> > some > >> > > further changes: > >> > > > >> > > admin/valid_chars.en: > >> > > new file: contains the valid chars that exactly replicate the chars > >> > > accepted as non-breaking by the now obsolete my_isalpha (utils.c) > >> > > I.e.: (c > 64 && c < 91) || ( c > 96 && c < 123) || ( c == '_') || c > >> > == > >> > > '~'); > >> > > > >> > > admin/default-params.in: > >> > > line 13: added default value > >> > > VALID_CHARS_FILE="@pkgdatadir@/valid_chars.en" > >> > > > >> > > admin/Makefile: > >> > > line 216: added default valid chars file 'valid_chars.en' to > >> > EXTRA_DIST > >> > > list > >> > > to be copied into central data directory > >> > > > >> > > admin/Makefile.data.in: > >> > > line 119-125: quotes supplied for all arguments > >> > > (lack of quotes caused the build procedure to stop already > >> > at > >> > > invoking prepare-corpus if some filenames were empty, > >> > > rather than reaching the point where it could tell what is missing > >> > > if at all a problem that it is missing.) > >> > > line 125: added line for valid_chars > >> > > > >> > > admin/infomap-build.in: > >> > > line 113: added line to dump value of VALID_CHARS_FILE > >> > > > >> > > line 44: 'cat' corrected to 'echo' (sorry I see sy spotted this > >> > this > >> > > morning) > >> > > this dumps overriding command line settings (-D option) to an extra > >> > > parameter > >> > > file which is then sourced. > >> > > cat expected actual setting strings (such as > >> > "STOPLIST_FILE=my_stop_list") > >> > > to be filenames > >> > > > >> > > +------------------------------------------------------------------+ > >> > > |Viktor Tron v....@ed...| > >> > > |3fl Rm8 2 Buccleuch Pl EH8 9LW Edinburgh Tel +44 131 650 4414| > >> > > |European Postgraduate College www.coli.uni-sb.de/egk| > >> > > |School of Informatics www.informatics.ed.ac.uk| > >> > > |Theoretical and Applied Linguistics www.ling.ed.ac.uk| > >> > > | @ University of Edinburgh, UK www.ed.ac.uk| > >> > > |Dept of Computational Linguistics www.coli.uni-sb.de| > >> > > | @ Saarland University (Saarbruecken, Germany) www.uni-saarland.de| > >> > > |use LINUX and FREE Software www.linux.org| > >> > > +------------------------------------------------------------------+ > >> > > > >> > > > >> > > > >> > > ------------------------------------------------------- > >> > > This SF.Net email is sponsored by: IBM Linux Tutorials > >> > > Free Linux tutorial presented by Daniel Robbins, President and CEO > >> > of > >> > > GenToo technologies. 
Learn everything from fundamentals to system > >> > > > >> > administration.http://ads.osdn.com/?ad_id=1470&alloc_id=3638&op=click > >> > > _______________________________________________ > >> > > infomap-nlp-devel mailing list > >> > > inf...@li... > >> > > https://lists.sourceforge.net/lists/listinfo/infomap-nlp-devel > >> > > > >> > > >> > > >> > ------------------------------------------------------- > >> > This SF.Net email is sponsored by: IBM Linux Tutorials > >> > Free Linux tutorial presented by Daniel Robbins, President and CEO of > >> > GenToo technologies. Learn everything from fundamentals to system > >> > administration.http://ads.osdn.com/?ad_id=1470&alloc_id=3638&op=click > >> > _______________________________________________ > >> > infomap-nlp-devel mailing list > >> > inf...@li... > >> > https://lists.sourceforge.net/lists/listinfo/infomap-nlp-devel > >> > > >> > >> > >> > >> +------------------------------------------------------------------+ > >> |Viktor Tron v....@ed...| > >> |3fl Rm8. 2 Buccleuch Place Edinburgh Tel +44 131 650 4414| > >> |European Postgraduate College www.coli.uni-sb.de/egk| > >> |School of Informatics www.informatics.ed.ac.uk| > >> |Theoretical and Applied Linguistics www.ling.ed.ac.uk| > >> | @ University of Edinburgh, UK www.ed.ac.uk| > >> |Dept of Computational Linguistics www.coli.uni-sb.de| > >> | @ Saarland University (Saarbruecken, Germany) www.uni-saarland.de| > >> |use LINUX and FREE Software www.linux.org| > >> +------------------------------------------------------------------+ > >> > > > |
From: Dominic W. <dwi...@cs...> - 2004-04-19 17:19:34
|
Thanks so much, Victor. I'll check out your changes this afternoon and try my luck :) On Mon, 19 Apr 2004, Viktor Tron wrote: > Yes finally I have uploaded the changes. > > It took me a while cause I wnated to document it so I extended the manpages. > (Nothing to the tutorial, though) > > *Everything* should work like before out of the box. > > Please check this if you can with a clean temp checkout and compile, etc. > > Best > Viktor > > On Thu, 15 Apr 2004 11:40:14 -0700 (PDT), Dominic Widdows <dwi...@cs...> wrote: > > > > > Dear Viktor, > > > > Did you manage to commit your changes to the infomap code to SourcForge at > > all? > > > > Best wishes, > > Dominic > > > > On Thu, 8 Apr 2004, Viktor Tron wrote: > > > >> Hello Dominic > >> I am viktron on Sourcefourge, if you want to add me. > >> and then I can commit changes. > >> Or maybe you want me to add changes to the documentation as well. > >> But then again, that makes sense only if a proper > >> conception is crystallized concerning what we want the tokenization > >> to do. > >> BTW, do you know Colin Bannard? > >> Best > >> Viktor > >> > >> > >> Quoting Dominic Widdows <dwi...@cs...>: > >> > >> > > >> > Dear Viktor, > >> > > >> > Thanks so much for doing all of this and documenting the changes for > >> > the > >> > list. I agree that the my_isalpha function was long overdue an > >> > overhaul. > >> > It sounds like your changes are much more far reaching than just this, > >> > though, and should enable the software to be much more > >> > language-general. > >> > For example, we've been hoping to enable support for Japanese and it > >> > sounds like this will be possible now? > >> > > >> > It definitely makes more sense to specify what characters you want the > >> > tokenizer to treat as alphabetic in a separate file. > >> > > >> > I'd definitely like to incorporate these changes to the software - > >> > would > >> > the best way be to add you to the project admins on SourceForge and > >> > allow > >> > you to commit the changes? If you sign up for an account at > >> > https://sourceforge.net/ (or if you have one already) > >> > we can add you as a project developer with the necessary permissions. > >> > > >> > Again, thanks so much for the feedback and the contributions. > >> > Best wishes, > >> > Dominic > >> > > >> > On Thu, 8 Apr 2004, Viktor Tron wrote: > >> > > >> > > Hello all, > >> > > > >> > > Your software is great, but praises should be on the user list :-). > >> > > I subsribed to the list now, because I suggest some changes to 0.8.4 > >> > > > >> > > If you are interested I send you the tarball or work it out with docs > >> > etc > >> > > and commit in cvs. > >> > > > >> > > Story and summary of changes are below. > >> > > Cheers > >> > > Viktor > >> > > > >> > > It all started out yesterday. I wanted to use infomap on a > >> > > Hungarian corpus. I soon figured out why things went wrong already > >> > at > >> > > the tokenization step. > >> > > > >> > > The problem was: > >> > > utils.c > >> > > lines 46--53 > >> > > > >> > > /* This is a somewhat radical approach, in that it assumes > >> > > ASCII for efficiency and will *break* with other character > >> > > encodings. 
*/ > >> > > int my_isalpha( int c) { // configured to let underscore through for > >> > POS > >> > > and tilda for indexing compounds > >> > > return( ( c > 64 && c < 91) || ( c > 96 && c < 123) || ( c == '_') > >> > || c > >> > > == '~'); > >> > > } > >> > > > >> > > This function is used by the tokenizer to determine which are the > >> > non-word > >> > > (breaking) characters. > >> > > It views 8 bit ascii chars above 128 as non-word (breaking) > >> > characters, > >> > > These characters happen to constitute a crucial part of most > >> > languages > >> > > other than English > >> > > usually encoded in ISO-8859-X coding with X>1. > >> > > > >> > > It is not that it is a 'radical approach' as someone appropriately > >> > > described it, > >> > > but actually makes the program entirely English-specific entirely > >> > > unnecessarily. > >> > > So I set out to fix it. > >> > > > >> > > The whole alpha test should be done directly by the tokenizer. This > >> > > funciton actually > >> > > says how to segment a stram of strings, which is an extremely > >> > important > >> > > *meaningful* part of the tokenizer, not an auxiliary function like > >> > > my_fopen, etc. Fortunately my_isalpha is indeed only used by > >> > > tokenizer.c. > >> > > > >> > > To correctly handle all this, I introduced an extra resource file > >> > > containing > >> > > a string of legitimate characters considered valid in words. > >> > > All other characters will be considered as breaking characters by > >> > the > >> > > tokenizer > >> > > and are skipped. > >> > > > >> > > The resource file is read in by initialize_tokenizer (appropriately > >> > > together with the corpus filenames file) and used to initialize > >> > > an array (details below). Then lookup from this array can > >> > conveniently > >> > > replace > >> > > all uses of the previous my_isalpha test. > >> > > > >> > > This should give sufficiently flexible and charset-independent > >> > control > >> > > over simple text-based tokenization, which means it can be a proper > >> > > multilingual software. > >> > > Well, I checked and it worked for my Hungarian stuff. > >> > > > >> > > Surely I have further ideas of very simple extensions which would > >> > perform > >> > > tokenization of already tokenized (e.g. xml) files directly. > >> > > With this in place the solution with valid_chars would just be > >> > > one of the two major tokenization modes. > >> > > Also: read-in doesn't seem to me to be optimized (characters of a line > >> > are > >> > > scanned over twice). Since with large corpora this takes up a great > >> > deal > >> > > of time, we might want to consider to rewrite it. > >> > > > >> > > > >> > > Details of the changes: > >> > > nothing in the documentation yet. 
> >> > > > >> > > utils.{c,h}: > >> > > function my_isalpha no longer exists, superseded by > >> > > more configurable method in tokenizer > >> > > > >> > > tokenizer.{c,h}: > >> > > introduced an int array: valid_chars[256] to look up > >> > > for a character c, valid_chars[c] is nonzero iff it is a valid > >> > > word-character > >> > > if it is 0, it is considered as breaking (and skipped) by the > >> > tokenizer > >> > > > >> > > initialize_tokenizer: now also initializes valid_chars by > >> > > reading from a file passed as an extra argument > >> > > > >> > > prepare_corpus.c: > >> > > modified invocation of initialize_tokenizer accordingly > >> > > added parsing code for extra option '-chfile' > >> > > > >> > > For proper invocation of prepare_corpus Makefile.data.in and > >> > > informap-build.in > >> > > needed to be modified and for proper configuration/installation, > >> > some > >> > > further changes: > >> > > > >> > > admin/valid_chars.en: > >> > > new file: contains the valid chars that exactly replicate the chars > >> > > accepted as non-breaking by the now obsolete my_isalpha (utils.c) > >> > > I.e.: (c > 64 && c < 91) || ( c > 96 && c < 123) || ( c == '_') || c > >> > == > >> > > '~'); > >> > > > >> > > admin/default-params.in: > >> > > line 13: added default value > >> > > VALID_CHARS_FILE="@pkgdatadir@/valid_chars.en" > >> > > > >> > > admin/Makefile: > >> > > line 216: added default valid chars file 'valid_chars.en' to > >> > EXTRA_DIST > >> > > list > >> > > to be copied into central data directory > >> > > > >> > > admin/Makefile.data.in: > >> > > line 119-125: quotes supplied for all arguments > >> > > (lack of quotes caused the build procedure to stop already > >> > at > >> > > invoking prepare-corpus if some filenames were empty, > >> > > rather than reaching the point where it could tell what is missing > >> > > if at all a problem that it is missing.) > >> > > line 125: added line for valid_chars > >> > > > >> > > admin/infomap-build.in: > >> > > line 113: added line to dump value of VALID_CHARS_FILE > >> > > > >> > > line 44: 'cat' corrected to 'echo' (sorry I see sy spotted this > >> > this > >> > > morning) > >> > > this dumps overriding command line settings (-D option) to an extra > >> > > parameter > >> > > file which is then sourced. > >> > > cat expected actual setting strings (such as > >> > "STOPLIST_FILE=my_stop_list") > >> > > to be filenames > >> > > > >> > > +------------------------------------------------------------------+ > >> > > |Viktor Tron v....@ed...| > >> > > |3fl Rm8 2 Buccleuch Pl EH8 9LW Edinburgh Tel +44 131 650 4414| > >> > > |European Postgraduate College www.coli.uni-sb.de/egk| > >> > > |School of Informatics www.informatics.ed.ac.uk| > >> > > |Theoretical and Applied Linguistics www.ling.ed.ac.uk| > >> > > | @ University of Edinburgh, UK www.ed.ac.uk| > >> > > |Dept of Computational Linguistics www.coli.uni-sb.de| > >> > > | @ Saarland University (Saarbruecken, Germany) www.uni-saarland.de| > >> > > |use LINUX and FREE Software www.linux.org| > >> > > +------------------------------------------------------------------+ > >> > > > >> > > > >> > > > >> > > ------------------------------------------------------- > >> > > This SF.Net email is sponsored by: IBM Linux Tutorials > >> > > Free Linux tutorial presented by Daniel Robbins, President and CEO > >> > of > >> > > GenToo technologies. 
Learn everything from fundamentals to system > >> > > > >> > administration.http://ads.osdn.com/?ad_id=1470&alloc_id=3638&op=click > >> > > _______________________________________________ > >> > > infomap-nlp-devel mailing list > >> > > inf...@li... > >> > > https://lists.sourceforge.net/lists/listinfo/infomap-nlp-devel > >> > > > >> > > >> > > >> > ------------------------------------------------------- > >> > This SF.Net email is sponsored by: IBM Linux Tutorials > >> > Free Linux tutorial presented by Daniel Robbins, President and CEO of > >> > GenToo technologies. Learn everything from fundamentals to system > >> > administration.http://ads.osdn.com/?ad_id=1470&alloc_id=3638&op=click > >> > _______________________________________________ > >> > infomap-nlp-devel mailing list > >> > inf...@li... > >> > https://lists.sourceforge.net/lists/listinfo/infomap-nlp-devel > >> > > >> > >> > >> > >> +------------------------------------------------------------------+ > >> |Viktor Tron v....@ed...| > >> |3fl Rm8. 2 Buccleuch Place Edinburgh Tel +44 131 650 4414| > >> |European Postgraduate College www.coli.uni-sb.de/egk| > >> |School of Informatics www.informatics.ed.ac.uk| > >> |Theoretical and Applied Linguistics www.ling.ed.ac.uk| > >> | @ University of Edinburgh, UK www.ed.ac.uk| > >> |Dept of Computational Linguistics www.coli.uni-sb.de| > >> | @ Saarland University (Saarbruecken, Germany) www.uni-saarland.de| > >> |use LINUX and FREE Software www.linux.org| > >> +------------------------------------------------------------------+ > >> > > > |