From: Allan F. <fz...@te...> - 2006-06-06 02:54:21
|
Trade Date: Tuesday, June 6th, 2006
Company: BioElectronics Corporation
Symbol: BIEL
Price: $0.025

IS MOMENTUM BUILDING FOR THIS STOCK? CAN YOU MAKE SOME FAST MONEY ON IT? RADAR BIEL FOR TUESDAY'S OPEN RIGHT NOW!! THE ALERT IS ON!!!

RECENT NEWS HEADLINE: (GO READ ALL THE NEWS ON BIEL RIGHT NOW!) BioElectronics Corporation Announces New 510(k) Market Clearance Application Filed With FDA!!

About BioElectronics Corporation (Source: News 5/18/2006): BioElectronics currently manufactures and sells ActiPatch(TM), a drug-free anti-inflammatory patch with an embedded battery operated microchip that delivers weeks of continuous pulsed therapy for less than a dollar a day. The unique ActiPatch delivery system, using patented technology, provides a cost-effective, patient friendly method to reduce soft tissue pain and swelling.

GO READ ALL THE NEWS ON THIS ONE!! DO YOUR DUE DILIGENCE!! RADAR IT FOR TUESDAY'S OPEN NOW!

______________

Information within this report contains forward looking statements within the meaning of Section 27A of the Securities Act of 1933 and Section 21B of the SEC Act of 1934. Statements that involve discussions with respect to projections of future events are not statements of historical fact and may be forward looking statements. Don't rely on them to make a decision. Past performance is never indicative of future results. We received four hundred thousand free trading shares in the past for our services. All those shares have been sold. We have received an additional one million free trading shares now. We intend to sell all one million shares now, which could cause the stock to go down, resulting in losses for you. The four hundred thousand shares and one million shares were received from two different third parties, not officers, directors or affiliate shareholders. This company has: an accumulated deficit, a negative net worth, a reliance on loans from officers directors and affiliates to pay expenses, and a nominal cash position. These factors raise substantial doubt about its ability to continue as a going concern. The company and its president are a defendant in a lawsuit. The publicly available float of stock is currently increasing. URGENT: Read the company's SEC filing before you invest. This report shall not be construed as any kind of investment advice or solicitation. WARNING: You can lose all your money by investing in this stock. |
From: Neal S. <sn...@st...> - 2006-05-19 19:46:28
|
infomap gurus: I am trying to build an LSA model with the Arabic Gigaword corpus, which is actually about 4GB of text. However, I keep getting a "wordlist File size limit exceeded" error about half way through. I tried increasing the ROWS and COLUMNS parameters:

infomap-build -D ROWS=1000000 -D COLUMNS=1000000 -m ./list.txt aragiga_full2

but that doesn't help. Do you have any suggestions?

Thanks!
Neal

PS: Here's the exact output:

make: *** [/user/snider/scr/aragiga_full2/wordlist] File size limit exceeded
make: *** Deleting file `/user/snider/scr/aragiga_full2/wordlist'

------
Neal Snider
Ph.D. Student
Department of Linguistics
Stanford University
Margaret Jacks Hall, Bldg 460 - Room 118
Stanford CA 94305-2150
(650) 723-4284; Fax: (650) 723-5666
http://www.stanford.edu/~snider |
From: Steve P. <xmb...@ca...> - 2006-05-12 09:40:23
|
Watch this Company like a hawk tomorrow, May 12. De Greko Inc. DGKO

Major News for De Greko Inc, DGKO has been released: De Greko Inc. Announces Ongoing Corporate Structure; Steps Taken to Migrate to Larger Exchange (Go read the entire news release now)

Clixme Draws Tremendous Interest on Launch <-- Exciting News release, make sure to read it asap.

GLASTONBURY, CT--(MARKET WIRE)--May 2, 2006 -- De Greko Communication, a wholly owned subsidiary of De Greko Inc. (Other OTC:DGKO.PK), a holding company that specializes in consolidating revenue-generating companies, today announced that the official launch of its Clixme "click to call" service to US businesses was an overwhelming success. "Mr. Georgiadis also added that the company had received inquiries from companies in both Europe and Asia inquiring to the availability of the service in both regions. We were pleased to see that large companies were signing up for the service as well. When GE Healthcare, a division of General Electric, signed up to use the service today we were sure that adoption of Clixme in the Enterprise market was assured."

Do your research now and watch this one like a hawk tomorrow, May 12. Make sure to read the current news releases.

Information within this report contains forward looking statements within the meaning of Section 27A of the Securities Act of 1933 and Section 21B of the SEC Act of 1934. Statements that involve discussions with respect to projections of future events are not statements of historical fact and may be forward looking statements. Don't rely on them to make a decision. The Company is not a reporting company registered under the Exchange Act of 1934. We have received two million free trading shares from a third party not an officer, director or affiliate shareholder. We intend to sell all our shares now, which could cause the stock to go down, resulting in losses for you. This company has revenues in its most recent quarter with the float currently increasing. Read the Company's Annual Report if one is available and Information Statement before you invest. This report shall not be construed as any kind of investment advice or solicitation. You can lose all your money by investing in this stock. |
From: Beate D. <do...@im...> - 2006-02-21 08:19:14
|
Hi Neal,

There is a file called valid_chars.en (it's in the admin directory) which contains the characters to be kept. You can adjust this file to your needs by adding the ~, $, |, 1, etc. This will leave your words "unharmed". In addition, you might want to replace the English stoplist with an Arabic one (to ignore determiners, pronouns, etc.).

Best wishes,
Beate

On Mon, 20 Feb 2006, Neal Snider wrote:

> Does anyone know how do prevent the text processing that infomap does on its
> corpora? I'm using (trying anyway) infomap to work on a project to try to
> induce Arabic verb clusters. My data are already lemmatized and they use the
> Buckwalter transliteration system, so they look rather funny:
>
> HalAwaY_1 jan~ap_1 muriyd_1
> taHoDiyr_1 |l_2
> HalAwap_2 jaraH-a_1 muro$id_1
> taHoDiyriy~_1 |laY_1
> HalAyib_2 jaraY-i_1 muroDiy_1
> taHoSiyl_1 |lam_1
>
> but the dic file after infomap processing shows that it takes out a lot of the
> important Arabic characters:
>
> 16474 3508 0 adaf_
> 16403 3434 0 amokan_
> 15606 3404 0 ar_
> 14308 3263 0 ieotabar_
> 13134 3180 0 ra
> 12666 2933 0 daea
> 12290 3055 0 ay
> 11965 2849 0 wasal
> 11558 2824 0 hasal
> 11173 2772 0 qad~am_
> 11148 2997 0 nolemma
>
> How can I keep it from doing this?
>
> Thanks!
>
> ------
> Neal Snider
> Ph.D. Student
> Department of Linguistics
> Stanford University
> Margaret Jacks Hall, Bldg 460 - Room 118
> Stanford CA 94305-2150
> (650) 723-4284; Fax: (650) 723-5666
> http://www.stanford.edu/~snider
>
> -------------------------------------------------------
> This SF.net email is sponsored by: Splunk Inc. Do you grep through log files
> for problems? Stop! Download the new AJAX search engine that makes
> searching your log files as easy as surfing the web. DOWNLOAD SPLUNK!
> http://sel.as-us.falkag.net/sel?cmd=lnk&kid=103432&bid=230486&dat=121642
> _______________________________________________
> infomap-nlp-users mailing list
> inf...@li...
> https://lists.sourceforge.net/lists/listinfo/infomap-nlp-users |
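(A toy illustration of the filtering behaviour described above: any character not listed in the valid-characters file is dropped from a token. The filter function and the valid-character sets below are illustrative sketches only, not the actual infomap tokenizer code or the real contents of valid_chars.en.)

/* Drop every character of a token that is not in the "valid" set. */

#include <stdio.h>
#include <string.h>

static void filter(const char *token, const char *valid, char *out)
{
    size_t j = 0;
    for (size_t i = 0; token[i]; i++)
        if (strchr(valid, token[i]))
            out[j++] = token[i];
    out[j] = '\0';
}

int main(void)
{
    const char *ascii_letters =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    const char *with_buckwalter_extras =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789~$|_-";
    char out[64];

    filter("qad~am_1", ascii_letters, out);
    printf("letters only: %s\n", out);      /* "qadam" - markers lost */

    filter("qad~am_1", with_buckwalter_extras, out);
    printf("extended set: %s\n", out);      /* "qad~am_1" - unharmed  */
    return 0;
}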
From: Neal S. <sn...@st...> - 2006-02-21 06:23:26
|
Does anyone know how to prevent the text processing that infomap does on its corpora? I'm using (trying anyway) infomap to work on a project to try to induce Arabic verb clusters. My data are already lemmatized and they use the Buckwalter transliteration system, so they look rather funny:

HalAwaY_1 jan~ap_1 muriyd_1 taHoDiyr_1 |l_2
HalAwap_2 jaraH-a_1 muro$id_1 taHoDiyriy~_1 |laY_1
HalAyib_2 jaraY-i_1 muroDiy_1 taHoSiyl_1 |lam_1

but the dic file after infomap processing shows that it takes out a lot of the important Arabic characters:

16474 3508 0 adaf_
16403 3434 0 amokan_
15606 3404 0 ar_
14308 3263 0 ieotabar_
13134 3180 0 ra
12666 2933 0 daea
12290 3055 0 ay
11965 2849 0 wasal
11558 2824 0 hasal
11173 2772 0 qad~am_
11148 2997 0 nolemma

How can I keep it from doing this?

Thanks!

------
Neal Snider
Ph.D. Student
Department of Linguistics
Stanford University
Margaret Jacks Hall, Bldg 460 - Room 118
Stanford CA 94305-2150
(650) 723-4284; Fax: (650) 723-5666
http://www.stanford.edu/~snider |
From: Dominic W. <wi...@ma...> - 2006-02-16 19:31:31
|
Dear Joerg,

Apologies for not replying to your earlier message, glad you got things figured out.

> Now I managed to create a model for my 76 million word text collection
> (using a single fiel input) using standard settings of infomap. However,
> my retrieval results are quite bad for my test queries. Much worse than
> with standard IR engines such as Apache Lucene.

I'm not surprised at all that you get worse results with Infomap than with standard engines like Lucene. There's very little evidence in the literature of LSA actually improving straight text retrieval, and this is the main reason why the most interesting work in LSA-type systems in the past 10 years has been in applications like WSD, classification, lexical acquisition, etc. I wouldn't really advise anyone to think of Infomap as a top class search engine, I'd think of it as a lexical modelling tool that does some document retrieval.

That said, if you were to pursue this, the place I would start would be with term-weighting. There's a term_weight function in count_artvec.c that you could call from process_region, and I don't believe this is done by default. Experimenting with term weights (both in document vectors and query vectors?) would be one of the most sensible places to start.

I would still be surprised if you did better than Lucene for most queries using Infomap for document retrieval. However, you might get improvements using Infomap for query expansion on sparse queries. In other words, if you get low recall for a query, try putting the query terms through an Infomap "associate words" query to get neighboring terms, and add neighboring terms to the query if their cosine similarity is greater than (e.g.) 0.65. Just a suggestion. In general, I think you'd expect a kind of similarity induction / smoothing engine like Infomap to help with improving recall rather than precision.

> I wonder if I should play around with the parameters of Infomap a bit
> more. I started to do some little experiments but without any huge
> improvements.
>
> I have 76 million tokens and i just counted 740,000 different word types
> (with alphabetic characters only). The standard settings for infomap,
> however, are only 20000 rows and 1000 columns. For the first I'm not
> really sure what the difference is between "content bearing words to be
> used as features" and "words for which to learn word vectors". I'm not
> really sure what that means

The "content bearing words" are columns, the "words from which to learn vectors" are rows. After processing the corpus, each entry in the matrix records the number of times each row-word occurred near each column word. (After that, the SVD happens.)

> (sorry for asking newbie questions ...)

Don't be. I'm sorry for not having time to reply often!

> I'd like to know if somebody has experience with settings for large text
> collections. Is there any rule-of-thumb how to adjust the parameters to
> get good results? I don't know exactly where to start ... should I
> increase the numbers of rows and columns (and how much) or should I play
> with the dimensions of LSA (SINGVALS)?

My initial suggestion would be to increase the number of rows, if you're going to try increasing anything. Then you have a bigger vocabulary, but described in terms of the same "content bearing words".

> What exaclty does the pre- and
> post-context?

Pre-context is how many words before a row-word a column-word must appear to be counted, post-context is how many words after. I think this is the case, I haven't used the options myself.

Best wishes,
Dominic |
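(A minimal sketch in C of the query-expansion heuristic Dominic suggests above: keep any neighbouring term whose cosine similarity exceeds a threshold such as 0.65. The struct, function and example values here are hypothetical; in practice the neighbour terms and cosines would come from an "associate" run, which is not shown.)

#include <stdio.h>
#include <string.h>

#define EXPANSION_THRESHOLD 0.65

struct neighbor {
    const char *term;   /* candidate expansion term */
    double cosine;      /* cosine similarity to the original query term */
};

/* Append neighbors above the threshold to the expanded query string. */
static void expand_query(char *query, size_t size,
                         const struct neighbor *nbrs, int n)
{
    for (int i = 0; i < n; i++) {
        if (nbrs[i].cosine > EXPANSION_THRESHOLD) {
            strncat(query, " ", size - strlen(query) - 1);
            strncat(query, nbrs[i].term, size - strlen(query) - 1);
        }
    }
}

int main(void)
{
    /* Hypothetical neighbours of the query term "ship". */
    struct neighbor nbrs[] = {
        { "vessel", 0.81 }, { "boat", 0.74 }, { "harbour", 0.58 }
    };
    char query[256] = "ship";

    expand_query(query, sizeof query, nbrs, 3);
    printf("expanded query: %s\n", query);   /* "ship vessel boat" */
    return 0;
}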
From: Joerg T. <tie...@le...> - 2006-02-16 18:18:13
|
Dear Infomap users (and developers),

Now I managed to create a model for my 76 million word text collection (using a single file input) using standard settings of infomap. However, my retrieval results are quite bad for my test queries. Much worse than with standard IR engines such as Apache Lucene.

I wonder if I should play around with the parameters of Infomap a bit more. I started to do some little experiments but without any huge improvements.

I have 76 million tokens and I just counted 740,000 different word types (with alphabetic characters only). The standard settings for infomap, however, are only 20000 rows and 1000 columns. For the first I'm not really sure what the difference is between "content bearing words to be used as features" and "words for which to learn word vectors". I'm not really sure what that means (sorry for asking newbie questions ...)

I'd like to know if somebody has experience with settings for large text collections. Is there any rule-of-thumb how to adjust the parameters to get good results? I don't know exactly where to start ... should I increase the numbers of rows and columns (and how much) or should I play with the dimensions of LSA (SINGVALS)? What exactly do the pre- and post-context do?

Thanks in advance for any advice!

best regards,
Jörg

***********/\/\/\/\/\/\/\/\/\/\/\************************************
** Jörg Tiedemann                 tie...@le...                     **
** Alfa-Informatica               http://www.let.rug.nl/~tiedeman  **
** Rijksuniversiteit Groningen    Harmoniegebouw, room 1311-429    **
** Oude Kijk in 't Jatstraat 26   phone: +31 (0)50-363 5935        **
** 9712 EK Groningen              fax: +31 (0)50-363 6855          **
*************************************/\/\/\/\/\/\/\/\/\/\/\********** |
From: Joerg T. <tie...@le...> - 2006-02-14 14:33:54
|
hi infomap users,

I have the same problem as montse (http://sourceforge.net/mailarchive/forum.php?thread_id=8855577&forum_id=37265) - I just subscribed to this list and don't know how to reply to a previous message -

I have a corpus with >1,000,000 paragraphs that I'd like to index. but I get the same error message about memory allocation. I could put everything in a single file (actually I would prefer that) but I'd like to keep the file names when querying the index. as far as I understand I only get the byte offset of the document when building a model on a single file, and then I have to find the document name myself. or is there any smart way of doing that.

I'd like to add the filenames in the single corpus document in, let's say, <DOCID>name</DOCID> tags before the actual text of the document. would be great if the index builder could take these names and use them when replying to a query instead of giving me the internal doc-ID's with the offset. is that maybe easy to implement (I didn't check the source code yet).

thanks in advance for your reply!

best,
Jörg

***********/\/\/\/\/\/\/\/\/\/\/\************************************
** Jörg Tiedemann                 tie...@le...                     **
** Alfa-Informatica               http://www.let.rug.nl/~tiedeman  **
** Rijksuniversiteit Groningen    Harmoniegebouw, room 1311-429    **
** Oude Kijk in 't Jatstraat 26   phone: +31 (0)50-363 5935        **
** 9712 EK Groningen              fax: +31 (0)50-363 6855          **
*************************************/\/\/\/\/\/\/\/\/\/\/\********** |
From: David H. <dl...@st...> - 2006-02-10 03:48:53
|
Hi,

I'm trying to start running infomap on a 64 bit system, and it runs fine up until the svdinterface comes up, where it starts to load in "coll" and then promptly crashes in las2.c at line 336:

331       errormessage("i>=*nnzero","",NON);
332     fchuck = BADFLOAT;
333     sscanf(zeile,"%d %f",&chuck,&fchuck);
334     if (fchuck==BADFLOAT)
335       errormessage("fchuck==BADFLOAT","",NON);
336     rowind[i] = chuck;
337     value[i] = fchuck;
338   }
339   fclose(collfile);
340   if (i!=*nnzero)

rowind seems to be way out of bounds. Is this a known issue? Should I dig around to get more information?

Thanks,
David Hall

P.S. here's uname -a :
Linux XXX.XXX.XXX 2.6.14-1.1644_FC4smp #1 SMP Sun Nov 27 03:37:58 EST 2005 x86_64 x86_64 x86_64 GNU/Linux |
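(A standalone sketch of the kind of defensive check that can help localise a crash like this: validate the row index parsed from the "coll" file before it is used as a subscript. The line format, variable names and BADFLOAT value below are assumptions modelled on the snippet above, not the actual las2.c code, and this is a diagnostic aid rather than a fix.)

#include <stdio.h>

#define BADFLOAT -999999.0f   /* sentinel value, assumed for this sketch */

int main(void)
{
    const char *zeile = "42 0.173";  /* hypothetical line from "coll" */
    long nrows = 20000;              /* e.g. the ROWS model parameter */
    int chuck = -1;
    float fchuck = BADFLOAT;

    if (sscanf(zeile, "%d %f", &chuck, &fchuck) != 2 || fchuck == BADFLOAT) {
        fprintf(stderr, "malformed line: %s\n", zeile);
        return 1;
    }
    if (chuck < 0 || chuck >= nrows) {
        /* A check like this before the rowind[i] assignment would show
         * whether the value read from the file, rather than the array
         * itself, is what goes out of bounds on the 64-bit build. */
        fprintf(stderr, "row index %d out of range [0,%ld)\n", chuck, nrows);
        return 1;
    }
    printf("row %d, value %f\n", chuck, fchuck);
    return 0;
}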
From: Dominic W. <wi...@ma...> - 2006-02-08 18:01:59
|
Hi Sverker,

Sorry not to reply for a while, I'm horribly busy. (again. long story)

> Dear "infomap-nlp-users" mailing list,
>
> I have some questions:
>
> 1. How do I know which documents have which ID?

I believe there's a print_doc function that will take a document ID and return the document.

> 2. I need the coordinate vectors for *all* documents and *all* words.
> Is there a simple way to get these data?

I believe Scott implemented something that did this over the summer. I don't think it's in the released version, but should be under CVS on Sourceforge. Scott may be able to help more.

> 3. Which is the best way to exclude words with low frequency from the
> model (before the diagonalization)?

I think you set the ROWS variable in /admin/default-params

> I'm trying to use infomap to visualize relations between documents, in
> the style of Multiple Correspondence Analysis.

Very cool. Let us know if you get interesting results.

Again, apologies for not giving more detailed answers. I hope this helps somewhat.

Best wishes,
Dominic |
From: Sverker L. <sv...@de...> - 2006-02-03 22:29:15
|
Dear "infomap-nlp-users" mailing list, I have some questions: 1. How do I know which documents have which ID? 2. I need the coordinate vectors for *all* documents and *all* words. Is there a simple way to get these data? 3. Which is the best way to exclude words with low frequency from the model (before the diagonalization)? I'm trying to use infomap to visualize relations between documents, in the style of Multiple Correspondence Analysis. Yours sincerely, Sverker Lundin |
From: Beate D. <do...@im...> - 2006-01-31 14:56:19
|
Hi there,

To retrieve the documents most similar to a given document, you can simply use "associate -d -i d -m <model_path> -c <corpus> <doc_id>".

If you are interested in the pairwise similarities of a set of documents, for now, you can use "associate -q -i d" to retrieve the vector representations of the documents you want to compare. You can then compute the pairwise scalar products of the resulting document vectors to obtain doc-doc similarities. I am attaching a short perl script which computes the pairwise document similarities given a list of document ids (these are the numbers enclosed in the <f> and </f> tags in the wordlist file). I don't know how well the program performs if the number of documents to be compared is large.

Good luck!
Beate

On Tue, 24 Jan 2006, P. Kumsaikaew wrote:

> Dear all
>
> Now, I am try to use your program to identify document similarity. I have done
> all installation process. Also, I am able to find the similarity for word like
> associate -d -c testSF1 suit (get docid of suit word)
> My data file contains three documents. Is any way i can put the entire
> document to compare instead of word. For example, i want to get the similarity
> value between Doc1, Doc2 and Doc 3.
>
> Thank you
>
> -------------------------------------------------------
> This SF.net email is sponsored by: Splunk Inc. Do you grep through log files
> for problems? Stop! Download the new AJAX search engine that makes
> searching your log files as easy as surfing the web. DOWNLOAD SPLUNK!
> http://sel.as-us.falkag.net/sel?cmd=lnk&kid=103432&bid=230486&dat=121642
> _______________________________________________
> infomap-nlp-users mailing list
> inf...@li...
> https://lists.sourceforge.net/lists/listinfo/infomap-nlp-users |
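(The Perl script mentioned above is not preserved in this archive. As a rough illustration of the scalar-product step it performs - not Beate's script - here is a sketch in C that assumes the document vectors have already been extracted, e.g. from "associate -q -i d" output; the vectors shown are made up. If the vectors are normalised to unit length, the scalar product is the cosine similarity.)

#include <stdio.h>

#define NDOCS 3
#define NDIMS 4   /* real Infomap models use on the order of 100 reduced dimensions */

/* Plain scalar product of two vectors. */
static double dot(const double *a, const double *b, int n)
{
    double s = 0.0;
    for (int k = 0; k < n; k++)
        s += a[k] * b[k];
    return s;
}

int main(void)
{
    /* Made-up document vectors standing in for "associate -q -i d" output. */
    double docvec[NDOCS][NDIMS] = {
        {  0.12, -0.40, 0.33, 0.08 },
        {  0.10, -0.35, 0.30, 0.11 },
        { -0.52,  0.05, 0.01, 0.44 },
    };

    /* Print every pairwise doc-doc similarity. */
    for (int i = 0; i < NDOCS; i++)
        for (int j = i + 1; j < NDOCS; j++)
            printf("doc%d . doc%d = %f\n", i + 1, j + 1,
                   dot(docvec[i], docvec[j], NDIMS));
    return 0;
}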
From: P. K. <piy...@ia...> - 2006-01-25 05:26:48
|
Dear all

Now I am trying to use your program to identify document similarity. I have done all the installation process. Also, I am able to find the similarity for a word, like: associate -d -c testSF1 suit (get docid of suit word). My data file contains three documents. Is there any way I can put in an entire document to compare instead of a word? For example, I want to get the similarity value between Doc1, Doc2 and Doc3.

Thank you |
From: Rahul J. <rj...@ya...> - 2005-12-02 20:53:33
|
Hi All,

I am building models with large corpora (mostly >60000 words). I have the following questions (please bear with me for the long questions):

(1) The "content bearing words (CBW) - words" matrix formed (i.e., input to SVD) has dimensions: number of words (where maximum = ROWS in default-params.in) X number of CBWs (specified by COLUMNS). The dictionary identifies words that are stop words. However, when the matrix is formed, the printf statements show that the number of rows is equal to min(ROWS, the total number of words). In the case ROWS > total number of words, the matrix has rows equal to the total number of words. That is, it seems that the stop words are considered (?)

Illustration: ROWS=20000 COLUMNS=1000
Dic entries = 10608, Non-stop word types = 9951
"Entering write_matrix_svd; rows = 10608 and columns = 1000."

(2) The number of singular values is controlled by SINGVALS and SVD_ITER. As SVD_ITER is increased, we obtain more singular values (limited by SINGVALS and the actual maximum number of computed SVD singular values possible). Is there a good value of computed SINGVALS (hashcomp) that we should aim for (using more iterations) for a given number of words (rows) in the input matrix? In other words, if time is NOT a constraint, increasing singular values could increase the dimensions (rows) of the resultant matrices but it could also increase accuracy.

Example: SINGVALS = 200, SVD_ITER = 400 for ROWS=50000 COLUMNS=1000. This could give us, for example, 150 singular values.

(3) The valid chars file in the new release is effective to discard tokens with, for example, numbers if we don't include numbers as valid chars. But non-standard chars from the corpus like the copyright symbol, registered trade mark symbol, etc. still appear in the words in the dictionary. Any hints to make a quick code change?

Your replies are really, greatly appreciated.

Thanks!
Rahul.

__________________________________
Start your day with Yahoo! - Make it your home page!
http://www.yahoo.com/r/hs |
From: Dominic W. <wi...@ma...> - 2005-11-09 16:53:35
|
Dear Montse,

I am so sorry to be so long in replying, I have been travelling and have been very busy at work and at home recently.

I don't know if there is a biggest appropriate number of files for multifile corpora on different platforms. What I have done in the past to get similar behaviour is to use a single corpus file split into different documents, i.e.

<DOC>
<TEXT>
This is a sentence from the BNC.
</TEXT>
</DOC>

Pretty ugly, I'll grant you, bad case of markup making the corpus unnecessarily bigger. Also, there were problems if we ever tried to put 2 tags on the same line, which I guess is just a C parsing issue that we could fix if we tracked it down.

Sorry I can't respond to your actual question at the moment, but if you want a hack to do much the same thing, this should work.

Best wishes,
Dominic

On Oct 31, 2005, at 11:20 AM, Montse Cuadros wrote:

> Dear All,
>
> I'm trying to build a model for BNC corpus but instead of building
> for document, I want to do it for sentence. I have all the corpus
> separately sentence-by-sentence in files and then when trying to
> construct the model, it fails showing me this error:
>
> Allocating filename memory: Cannot allocate memory
> Couldn't initialize tokenizer.
> make: *** [/corpus/models//BNC_SENTENCE/wordlist] Error 1
>
> The file directory, is huge, now contains 5.009.088 files and I don't
> know if there is any problem because of the amount of files, and/or
> just because they contain a very few data and doesn't make sense to
> construct such a model or just because of the default parametres.
>
> Thanks in advance,
>
> Bests,
>
> Montse
>
> -------------------------------------------------------
> This SF.Net email is sponsored by the JBoss Inc.
> Get Certified Today * Register for a JBoss Training Course
> Free Certification Exam for All Training Attendees Through End of 2005
> Visit http://www.jboss.com/services/certification for more information
> _______________________________________________
> infomap-nlp-users mailing list
> inf...@li...
> https://lists.sourceforge.net/lists/listinfo/infomap-nlp-users |
From: Montse C. <cu...@ls...> - 2005-10-31 16:21:18
|
Dear All,

I'm trying to build a model for BNC corpus but instead of building for document, I want to do it for sentence. I have all the corpus separately sentence-by-sentence in files and then when trying to construct the model, it fails showing me this error:

Allocating filename memory: Cannot allocate memory
Couldn't initialize tokenizer.
make: *** [/corpus/models//BNC_SENTENCE/wordlist] Error 1

The file directory is huge, now containing 5.009.088 files, and I don't know if there is any problem because of the amount of files, and/or just because they contain very few data and it doesn't make sense to construct such a model, or just because of the default parameters.

Thanks in advance,

Bests,

Montse |
From: Scott C. <ced...@gm...> - 2005-10-26 06:47:34
|
Hi Rahul,

On 10/26/05, Rahul Joshi <rj...@ya...> wrote:
> Hi Dominic,
>
> > because the Infomap software relies on cooccurrence
> > within a
> > fixed-width window rather than cooccurrence within a
> > document.
>
> Does the "cooccurence within a window" mean "a window
> within the same document"?

Yes.

> In other words, the window doesn't span documents.

No, the window doesn't span documents.

> Is the width set using: PRE_CONTEXT_SIZE and
> POST_CONTEXT_SIZE?

Yes.

> Thanks,
> Rahul.

Scott |
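(A toy sketch of the windowed counting these answers describe: each occurrence of a row-word contributes counts for column-words found within PRE_CONTEXT_SIZE tokens before it and POST_CONTEXT_SIZE tokens after it, and the window never crosses a document boundary. The documents, words and window sizes below are made up for illustration; this is not the actual infomap counting code.)

#include <stdio.h>
#include <string.h>

#define PRE_CONTEXT_SIZE  3
#define POST_CONTEXT_SIZE 3

/* Two tiny pre-tokenised "documents". */
static const char *docs[][8] = {
    { "the", "ship", "entered", "the", "harbour", NULL },
    { "the", "harbour", "was", "empty", NULL },
};

int main(void)
{
    const char *row_word = "ship";      /* word we learn a vector for  */
    const char *col_word = "harbour";   /* content-bearing column word */
    int count = 0;

    for (int d = 0; d < 2; d++) {
        int len = 0;
        while (docs[d][len]) len++;     /* document length in tokens */
        for (int i = 0; i < len; i++) {
            if (strcmp(docs[d][i], row_word) != 0) continue;
            /* Clip the window to the current document's boundaries. */
            int lo = i - PRE_CONTEXT_SIZE;  if (lo < 0) lo = 0;
            int hi = i + POST_CONTEXT_SIZE; if (hi >= len) hi = len - 1;
            for (int j = lo; j <= hi; j++)
                if (j != i && strcmp(docs[d][j], col_word) == 0)
                    count++;
        }
    }
    /* "harbour" falls inside the post-context of "ship" only in doc 1. */
    printf("count(%s near %s) = %d\n", row_word, col_word, count);
    return 0;
}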
From: Rahul J. <rj...@ya...> - 2005-10-26 06:11:24
|
Hi Dominic,

> because the Infomap software relies on cooccurrence
> within a
> fixed-width window rather than cooccurrence within a
> document.

Does the "cooccurence within a window" mean "a window within the same document"? In other words, the window doesn't span documents. Is the width set using: PRE_CONTEXT_SIZE and POST_CONTEXT_SIZE?

Thanks,
Rahul.

--- Dominic Widdows <wi...@ma...> wrote:

> Dear Rahul,
>
> Sorry for not getting back to you sooner.
>
> > Is there a known limit for number of files in a
> > multi-document corpus, expecting optimal performance?
> > Is there a known break-down point?
>
> I know of no maximum number of files for a multi-document corpus, as
> far as the Infomap software is concerned. However, this isn't because
> we've stretched the system and found that it doesn't break, it's
> because we haven't really used it for this very much. I've only ever
> built a couple of multidocument models, and have had sporadic reports
> of this functionality not working at all. If we were embarking on a
> new, well-resourced project, we'd look into this straight away.
>
> > Does the uniformity (or lack of it) in the sizes of
> > individual documents in a corpus affect the quality
> > model?
>
> It certainly matters much less than in a standard search / LSA engine,
> because the Infomap software relies on cooccurrence within a
> fixed-width window rather than cooccurrence within a document.
>
> This is discussed properly in Ch6 of Geometry and Meaning. There's a
> sketch on the web at
> http://infomap.stanford.edu/book/chapters/chapter6.html
> but of course, finding yourself a copy of the book would be more useful
> ;-)
>
> Best wishes,
> Dominic
>
> -------------------------------------------------------
> This SF.Net email is sponsored by:
> Power Architecture Resource Center: Free content, downloads, discussions,
> and more.
> http://solutions.newsforge.com/ibmarch.tmpl
> _______________________________________________
> infomap-nlp-users mailing list
> inf...@li...
> https://lists.sourceforge.net/lists/listinfo/infomap-nlp-users

__________________________________
Yahoo! FareChase: Search multiple travel sites in one click.
http://farechase.yahoo.com |
From: Dominic W. <wi...@ma...> - 2005-10-22 12:49:03
|
Dear Rahul,

Sorry for not getting back to you sooner.

> Is there a known limit for number of files in a
> multi-document corpus, expecting optimal performance?
> Is there a known break-down point?

I know of no maximum number of files for a multi-document corpus, as far as the Infomap software is concerned. However, this isn't because we've stretched the system and found that it doesn't break, it's because we haven't really used it for this very much. I've only ever built a couple of multidocument models, and have had sporadic reports of this functionality not working at all. If we were embarking on a new, well-resourced project, we'd look into this straight away.

> Does the uniformity (or lack of it) in the sizes of
> individual documents in a corpus affect the quality
> model?

It certainly matters much less than in a standard search / LSA engine, because the Infomap software relies on cooccurrence within a fixed-width window rather than cooccurrence within a document.

This is discussed properly in Ch6 of Geometry and Meaning. There's a sketch on the web at http://infomap.stanford.edu/book/chapters/chapter6.html but of course, finding yourself a copy of the book would be more useful ;-)

Best wishes,
Dominic |
From: Dominic W. <wi...@ma...> - 2005-10-21 00:14:46
|
Hi Gabriel,

If it's a really big corpus, then 500 occurrences may not be enough for a word to get into the top 20000, which I think is the default number of rows in the cooccurrence matrix.

The first thing you should try is building a model with more rows, which I think you can do by changing the "admin/default_params" file.

Try a "grep -n falklands" on the ".dic" file in your model directory if you want to get a sense of how many rows you should include before you get to the word "falklands". If it looks like the word should have been in anyway, then the problem is something else.

Hope this helps.
Best wishes,
Dominic

On Oct 20, 2005, at 8:01 PM, Gabriel Murray wrote:

> I built an Infomap model using a very large corpus of newspaper
> articles (100+ million words). I can use associate to query words, but
> I find that some words that were contained in the corpus and were NOT
> stopwords are for some reason not contained in the model, i.e. I get a
> response of "no word vector for X." Is there some frequency threshold
> set? For example, "falklands" doesn't appear in the model even though
> it appeared more than 500 times in the corpus.
>
> If there is some threshold, can I turn it off?
> Thanks,
> Gabriel Murray |
From: Gabriel M. <gab...@gm...> - 2005-10-21 00:01:21
|
I built an Infomap model using a very large corpus of newspaper articles (100+ million words). I can use associate to query words, but I find that some words that were contained in the corpus and were NOT stopwords are for some reason not contained in the model, i.e. I get a response of "no word vector for X." Is there some frequency threshold set? For example, "falklands" doesn't appear in the model even though it appeared more than 500 times in the corpus. If there is some threshold, can I turn it off? Thanks, Gabriel Murray |
From: Rahul J. <rj...@ya...> - 2005-10-18 00:12:30
|
Hi:

A few days ago I saw a question on this list relating to a large corpus (large number of words). I have different but relevant questions:

Is there a known limit for the number of files in a multi-document corpus, expecting optimal performance? Is there a known break-down point?

Does the uniformity (or lack of it) in the sizes of individual documents in a corpus affect the quality of the model?

Thanks!
Rahul.

__________________________________
Yahoo! Mail - PC Magazine Editors' Choice 2005
http://mail.yahoo.com |
From: Dominic W. <wi...@ma...> - 2005-10-06 21:35:47
|
> > ii. how reliable do you want the results to be?
> Well, i want them to be reliable enough. When i said 170k words, i
> meant 170k types. So as a general rule, types with a frequency of
> greater than 10 should be added to the wordvector, right?

This is probably not a bad rule of thumb. I may go for 20 rather than 10 as a first guess.

> I haven't completely understood the SVD part of infomap. Could you
> please explain why this problem of "infrequent words cropping up in
> odd places" occurs?

This is more to do with the problem of sparse data in general than anything specifically to do with SVD. If you only have 3 or 4 occurrences of a word, it may occur with very atypical usages. Once you have 30 or 40 occurrences, you have a much better chance of having a representative sample of the available meanings.

In theory at least, SVD actually helps this situation because it makes your matrix less sparse. Though if your sample occurrences are a skewed sample to begin with, I don't really see how projecting down to fewer dimensions is likely to make your sample less skewed.

Best wishes,
Dominic

> Thanks,
> Arooj

>> From: Dominic Widdows <wi...@ma...>
>> To: "Arooj Asghar" <aro...@ho...>
>> CC: i...@li...
>> Subject: Re: [infomap-nlp-users] Number of ROWS for a large corpus
>> Date: Tue, 4 Oct 2005 10:51:12 -0400
>>
>> Dear Arooj,
>>
>> Two questions I can think of that affect your question is
>> i. is space a concern?
>> and
>> ii. how reliable do you want the results to be?
>>
>> The second question is affected crucially by the token frequency of
>> the types that are being indexed. In my (ad hoc) experience, by the
>> time you have about 20 or 30 occurrences of a word type, things are
>> looking reasonably stable. If you have fewer than 10, things are
>> looking very hit and miss. Those who have experimented with the King
>> James Bible corpus will probably have seen for themselves that
>> infrequent words like "kirioth" seem to crop up in very odd places.
>>
>> This suggests rephrasing the question as follows. In a corpus with m
>> tokens belonging to n < m types, how many of the types do you expect
>> to occur with frequency greater than some "stability threshold"? I
>> believe that the range 10 to 30 is a pretty good guess for this
>> threshold, but a guess it remains. One should be able to do a bit of
>> counting and comparing with Zipf's law to firm up the mathematics of
>> this suggestion.
>>
>> Am I answering the right question?
>> Best wishes,
>> Dominic
>>
>> On Oct 4, 2005, at 2:46 AM, Arooj Asghar wrote:
>>
>>> Hi,
>>>
>>> I've wanted to index a large corpus, with almost 170K words. The
>>> default 20K doesn't work well for this. Is there a way of telling
>>> what the optimal number of rows for a corpus of 'n' words should
>>> be?
>>>
>>> Thanks,
>>> Arooj |
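(A back-of-the-envelope version of the Zipf's-law remark above, not from the original thread: if the r-th most frequent type has frequency f(r) ~ C/r, then the number of types occurring at least k times is roughly the largest rank r* with C/r* >= k, i.e.

N(k) ~ C/k ~ f(1)/k.

So, for a hypothetical corpus whose most frequent word occurs about 1,000,000 times, a stability threshold of k = 20 suggests on the order of 1,000,000 / 20 = 50,000 types worth giving rows to - one rough way to choose ROWS.)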
From: Arooj A. <aro...@ho...> - 2005-10-05 11:12:33
|
Dear Dominic,

Thanks for your reply. You definitely are answering the right question and it certainly is very helpful. Just to elaborate things a little more...

> i. is space a concern?
No, not really.

> ii. how reliable do you want the results to be?
Well, i want them to be reliable enough. When i said 170k words, i meant 170k types. So as a general rule, types with a frequency of greater than 10 should be added to the wordvector, right?

I haven't completely understood the SVD part of infomap. Could you please explain why this problem of "infrequent words cropping up in odd places" occurs?

Thanks,
Arooj

----------------------------------------
From: Dominic Widdows <wi...@ma...>
To: "Arooj Asghar" <aro...@ho...>
CC: inf...@li...
Subject: Re: [infomap-nlp-users] Number of ROWS for a large corpus
Date: Tue, 4 Oct 2005 10:51:12 -0400

> Dear Arooj,
>
> Two questions I can think of that affect your question is
> i. is space a concern?
> and
> ii. how reliable do you want the results to be?
>
> The second question is affected crucially by the token frequency of
> the types that are being indexed. In my (ad hoc) experience, by the
> time you have about 20 or 30 occurrences of a word type, things are
> looking reasonably stable. If you have fewer than 10, things are
> looking very hit and miss. Those who have experimented with the King
> James Bible corpus will probably have seen for themselves that
> infrequent words like "kirioth" seem to crop up in very odd places.
>
> This suggests rephrasing the question as follows. In a corpus with m
> tokens belonging to n < m types, how many of the types do you expect
> to occur with frequency greater than some "stability threshold"? I
> believe that the range 10 to 30 is a pretty good guess for this
> threshold, but a guess it remains. One should be able to do a bit of
> counting and comparing with Zipf's law to firm up the mathematics of
> this suggestion.
>
> Am I answering the right question?
> Best wishes,
> Dominic
>
> On Oct 4, 2005, at 2:46 AM, Arooj Asghar wrote:
>
>> Hi,
>>
>> I've wanted to index a large corpus, with almost 170K words. The
>> default 20K doesn't work well for this. Is there a way of telling
>> what the optimal number of rows for a corpus of 'n' words should
>> be?
>>
>> Thanks,
>> Arooj |