From: Vaqalado P. <vaq...@gm...> - 2014-01-23 18:36:59
Thank you Markus, and dear SMW-user mailing list.

I'll take my chances and increase the value of the MAX_HASH_LENGTH constant,
so that I can store titles and incipits and retrieve them as excerpts of
longer texts. In my case, searching those longer texts with operators is
desirable but not yet essential.

Cheers.

2014/1/23 Markus Krötzsch <ma...@se...>

> Dear Vaqalado, dear all,
>
> With the recent changes in Text/String (merging both types), the length of
> the searchable String has been reduced compared to earlier versions:
>
> * In the past, a String was strictly limited to 255 characters, which you
> could search completely when using LIKE.
> * Now, String/Text is unlimited, but fewer characters are used for
> searching:
> ** For long strings, we store the first 40 characters, followed by 32
> characters of hash code. The rest of the content is stored in a Blob.
> ** For short strings (less than 40+32=72 characters), we store the whole
> string without using a Blob, and the search operates on that.
>
> If you just do the normal alphabetic ordering, or selection with < and >,
> you won't normally notice a difference (40 chars are usually enough to
> distinguish strings). However, LIKE is obviously affected.
>
> The magic value of 72 is set in file SMW_DIHandler_Blob.php [1]. You can
> increase it to a maximum of 255 to get the old behaviour (but this would
> increase the storage space needed, especially for the indexes, so it is not
> something we should do in general). One could also turn this constant into
> a configuration option (but note that values above 255 will not work
> correctly, since the DB will simply truncate the string at this length).
> After this change, all string data has to be refreshed to update the
> database tables accordingly.
>
> Note that this behaviour is specific to SQLStore3. SMW does not have the
> limit of 72 anywhere in its datamodel.
>
> Obviously, going from 72 to 255 still won't give you real full text search
> capabilities for SMW, which should be implemented differently. It should
> use the full text rather than the index string, but it should also use a
> sane full text search index instead of the SQL function (in general, LIKE
> currently can only be answered by a full scan of all strings in the
> database, checking the pattern on each of them). A Lucene-based search
> would be even better, but then again it is a challenge to get the other
> search features working properly on Lucene (as joining partial results from
> Lucene with results from MySQL won't be very efficient). Probably using a
> dedicated text search feature of (My)SQL would be most promising for small
> to medium configurations.
>
> Cheers,
>
> Markus
>
> [1] https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FSemanticMediaWiki/49937de6e80c6399724395c4c957ffa257308221/includes%2Fstorage%2FSQLStore%2FSMW_DIHandler_Blob.php#L35
>
>
> On 23/01/14 03:26, Vaqalado Paraservirle wrote:
>
>> Thanks Chad, and yeah,
>> apparently SMW is treating Text the way it treated the deprecated
>> String property... I haven't counted the number of characters yet, but I
>> did try increasing the number of words (from the first one up to the last)
>> in the values to retrieve through "LIKE" (~) and, certainly, the queries
>> are not reading the strings up to their end... not unless the queries deal
>> with values for the Page property, which is not convenient for my setup,
>> since I've been flirting with the idea of storing longer values as Text
>> (such as paragraphs) and not merely long titles or incipits.
>>
>>
>> 2014/1/22 Chad Spratt <int...@gm...>
>>
>>> The only extra info I found was where the documentation says 'like' (~)
>>> can only be used with data of type String or Page
>>> http://semantic-mediawiki.org/wiki/Help:Selecting_pages#Like.2C_not_like
>>>
>>> but then elsewhere it says String is a deprecated alias for Text
>>> http://semantic-mediawiki.org/wiki/Help:Type_String
>>>
>>>
>>> On Wed, Jan 22, 2014 at 4:30 PM, Chad Spratt <int...@gm...> wrote:
>>>
>>>> I'm stumped too. You may want to post your extra question to the mailing
>>>> list, I forgot to include it in my reply before.
>>>>
>>>>
>>>> On Tue, Jan 21, 2014 at 4:53 PM, Vaqalado Paraservirle <
>>>> vaq...@gm...> wrote:
>>>>
>>>>> Just checked: It doesn't, Chad,
>>>>>
>>>>> btw. I must mention another, probably unrelated, behaviour:
>>>>> while querying with strings of numbers to retrieve a text value like:
>>>>> "1810 ¡Viva el 16 de septiembre! 1912"
>>>>> as follows:
>>>>>
>>>>> A. {{#ask:[[Es íncipit de impreso::~*18*]] ...
>>>>>
>>>>> B. {{#ask:[[Es íncipit de impreso::~*16*]] ...
>>>>>
>>>>> C. {{#ask:[[Es íncipit de impreso::~*12*]] ...
>>>>>
>>>>>
>>>>> ... many of the expected texts come up, but also quite a lot of
>>>>> unexpected strings like: "Cogida de Rodolfo Gaona en la plaza de toros
>>>>> de Puebla el 13 de diciembre de 1908" !
>>>>>
>>>>>
>>>>> 2014/1/21 Chad Spratt <int...@gm...>
>>>>>
>>>>>> Does the query work if you remove the asterisk after 'dolor'?
>>>>>>
>>>>>>
>>>>>> On Mon, Jan 20, 2014 at 9:59 PM, Vaqalado Paraservirle <
>>>>>> vaq...@gm...> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>> while performing queries through #ask to retrieve values for
>>>>>>> properties of the datatype Text, the results I'm getting seem to be
>>>>>>> conditioned by the length of the string.
>>>>>>>
>>>>>>> Example:
>>>>>>>
>>>>>>> This is the stored text under the "Es íncipit de impreso" Text
>>>>>>> property:
>>>>>>>
>>>>>>> "A las infortunadas hijas del señor Don Venustiano Carranza, extinto
>>>>>>> presidente de la República, ante su acerbo dolor"
>>>>>>>
>>>>>>> Now, when I query:
>>>>>>>
>>>>>>> {{#ask:[[Es íncipit de impreso::~*infortunadas hijas*]]
>>>>>>> |?Es íncipit de impreso
>>>>>>> |format=broadtable
>>>>>>> }}
>>>>>>>
>>>>>>> ... I do get the whole text, but,
>>>>>>> when I query:
>>>>>>>
>>>>>>> {{#ask:[[Es íncipit de impreso::~*acerbo dolor*]]
>>>>>>> |?Es íncipit de impreso
>>>>>>> |format=broadtable
>>>>>>> }}
>>>>>>>
>>>>>>> ... then, no result is shown at all.
>>>>>>>
>>>>>>> I'm running MediaWiki 1.21.1 and Semantic MediaWiki 1.8.0.4.
>>>>>>>
>>>>>>> In order to retrieve the data, I've changed the datatype of the
>>>>>>> property to Page so I could ask for 'acerbo' and 'dolor' separately
>>>>>>> and intersect the results, but line breaks are not friendly with page
>>>>>>> titles (some "Has improper value for" errors emerged) and I do want
>>>>>>> to state the strings I'm working with as... text.
>>>>>>>
>>>>>>> Can you imagine any workaround for this problem?
>>>>>>>
>>>>>>> Cheers !
>>>>>>>
>>>>>>> - Rafael
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Semediawiki-user mailing list
>>>>>>> Sem...@li...
>>>>>>> https://lists.sourceforge.net/lists/listinfo/semediawiki-user
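
The storage scheme Markus describes (whole string up to 72 characters, otherwise the first 40 characters plus 32 characters of hash) can be illustrated with a small Python sketch. This is not SMW's PHP code: the names `index_key`, `like_match` and `PREFIX_LENGTH` are hypothetical, the use of MD5 for the 32-character hash is an assumption, the exact boundary at 72 characters is an assumption, and the matching is case-sensitive for simplicity. It only shows why a LIKE pattern can match text inside the first 40 characters but not text near the end of a long value.

```python
import hashlib
import re

MAX_HASH_LENGTH = 72   # constant named in the thread
PREFIX_LENGTH = 40     # 72 minus the 32 characters of hash code


def index_key(text: str, max_length: int = MAX_HASH_LENGTH) -> str:
    """Return the string that a LIKE search would operate on (sketch)."""
    # Assumption: values of exactly max_length characters are stored whole.
    if len(text) <= max_length:
        return text
    # Assumption: the 32-character hash is a hexadecimal MD5 digest.
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    return text[: max_length - len(digest)] + digest


def like_match(key: str, pattern: str) -> bool:
    """Match an #ask-style pattern where * is any run and ? is one character."""
    regex = "".join(
        ".*" if ch == "*" else "." if ch == "?" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, key, flags=re.DOTALL) is not None


if __name__ == "__main__":
    incipit = (
        "A las infortunadas hijas del señor Don Venustiano Carranza, extinto "
        "presidente de la República, ante su acerbo dolor"
    )
    key = index_key(incipit)
    print(key[:PREFIX_LENGTH])                      # searchable prefix
    print(like_match(key, "*infortunadas hijas*"))  # inside the prefix
    print(like_match(key, "*acerbo dolor*"))        # beyond the prefix
    # Raising the limit to 255 makes the whole incipit searchable again.
    print(like_match(index_key(incipit, 255), "*acerbo dolor*"))
```

If the hash is hexadecimal, as assumed here, digit patterns such as `*18*` could also match inside the hash portion of the key, which might account for the unexpected results reported for the numeric queries.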