From: Markus K. <ma...@se...> - 2014-01-23 09:16:52
Dear Vaqalado, dear all,

With the recent changes in Text/String (merging both types), the length of the searchable String has been reduced compared to earlier versions:

* In the past, a String was strictly limited to 255 characters, which you could search completely when using LIKE.
* Now, String/Text is unlimited, but fewer characters are used for searching:
** For long strings, we store the first 40 characters, followed by 32 characters of hash code. The rest of the content is stored in a Blob.
** For short strings (less than 40+32=72 characters), we store the whole string without using a Blob, and the search operates on that.

If you just do the normal alphabetic ordering, or selection with < and >, you won't normally notice a difference (40 chars are usually enough to distinguish strings). However, LIKE is obviously affected.

The magic value of 72 is set in file SMW_DIHandler_Blob.php [1]. You can increase it to a maximum of 255 to get the old behaviour (but this would increase the storage space needed, especially for the indexes, so it is not something we should do in general). One could also turn this constant into a configuration option (but note that values above 255 will not work correctly, since the DB will simply truncate the string at this length). After this change, all string data has to be refreshed to update the database tables accordingly.

Note that this behaviour is specific to SQLStore3. SMW does not have the limit of 72 anywhere in its datamodel.

Obviously, going from 72 to 255 still won't give you real full text search capabilities for SMW, which should be implemented differently. It should use the full text rather than the index string, but it should also use a sane full text search index instead of the SQL function (in general, LIKE currently can only be answered by a full scan of all strings in the database, checking the pattern on each of them). A Lucene-based search would be even better, but then again it is a challenge to get the other search features working properly on Lucene (as joining partial results from Lucene with results from MySQL won't be very efficient). Probably using a dedicated text search feature of (My)SQL would be most promising for small to medium configurations.

Cheers,

Markus

[1] https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FSemanticMediaWiki/49937de6e80c6399724395c4c957ffa257308221/includes%2Fstorage%2FSQLStore%2FSMW_DIHandler_Blob.php#L35

On 23/01/14 03:26, Vaqalado Paraservirle wrote:
> Thanks Chad, and yeah,
> apparently SMW is approaching the Text as it would the deprecated
> String property... I haven't counted the number of characters yet, but I did
> try increasing the number of words (from the first one up to the last) in
> the values to retrieve through "LIKE" (~) and, certainly, the queries are
> not reading the strings up to their end... not unless the queries deal with
> values for the Page property, which is not convenient for my setup, since
> I've been flirting with the idea of storing longer values as Text (such as
> paragraphs) and not merely long titles or incipits.
>
> 2014/1/22 Chad Spratt <int...@gm...>
>
>> The only extra info I found was where the documentation says 'like' (~)
>> can only be used with data of type String or Page
>> http://semantic-mediawiki.org/wiki/Help:Selecting_pages#Like.2C_not_like
>>
>> but then elsewhere it says String is a deprecated alias for Text
>> http://semantic-mediawiki.org/wiki/Help:Type_String
>>
>> On Wed, Jan 22, 2014 at 4:30 PM, Chad Spratt <int...@gm...> wrote:
>>
>>> I'm stumped too. You may want to post your extra question to the mailing
>>> list, I forgot to include it in my reply before.
>>>
>>> On Tue, Jan 21, 2014 at 4:53 PM, Vaqalado Paraservirle <
>>> vaq...@gm...> wrote:
>>>
>>>> Just checked: It doesn't, Chad,
>>>>
>>>> btw. I must mention another, probably unrelated, behaviour:
>>>> while querying with strings of numbers to retrieve a text value like:
>>>> "1810 ¡Viva el 16 de septiembre! 1912"
>>>> as follows:
>>>>
>>>> A. {{#ask:[[Es íncipit de impreso::~*18*]] ...
>>>>
>>>> B. {{#ask:[[Es íncipit de impreso::~*16*]] ...
>>>>
>>>> C. {{#ask:[[Es íncipit de impreso::~*12*]] ...
>>>>
>>>> ... many expected texts come up, but also quite a lot of unexpected strings
>>>> like: "Cogida de Rodolfo Gaona en la plaza de toros de Puebla el 13 de
>>>> diciembre de 1908" !
>>>>
>>>> 2014/1/21 Chad Spratt <int...@gm...>
>>>>
>>>>> Does the query work if you remove the asterisk after 'dolor'?
>>>>>
>>>>> On Mon, Jan 20, 2014 at 9:59 PM, Vaqalado Paraservirle <
>>>>> vaq...@gm...> wrote:
>>>>>
>>>>>> Hi,
>>>>>> while performing queries through #ask to retrieve values for
>>>>>> properties of the datatype Text, the results I'm getting seem to
>>>>>> depend on the length of the string.
>>>>>>
>>>>>> Example:
>>>>>>
>>>>>> This is the text stored under the "Es íncipit de impreso" Text
>>>>>> property:
>>>>>>
>>>>>> "A las infortunadas hijas del señor Don Venustiano Carranza, extinto
>>>>>> presidente de la República, ante su acerbo dolor"
>>>>>>
>>>>>> Now, when I query:
>>>>>>
>>>>>> {{#ask:[[Es íncipit de impreso::~*infortunadas hijas*]]
>>>>>> |?Es íncipit de impreso
>>>>>> |format=broadtable
>>>>>> }}
>>>>>>
>>>>>> ... I do get the whole text, but,
>>>>>> when I query:
>>>>>>
>>>>>> {{#ask:[[Es íncipit de impreso::~*acerbo dolor*]]
>>>>>> |?Es íncipit de impreso
>>>>>> |format=broadtable
>>>>>> }}
>>>>>>
>>>>>> ... then, no result is shown at all.
>>>>>>
>>>>>> I'm running MediaWiki 1.21.1 and Semantic MediaWiki 1.8.0.4.
>>>>>>
>>>>>> In order to retrieve the data, I've changed the datatype of the
>>>>>> property to Page so I could ask for 'acerbo' and 'dolor' separately
>>>>>> and intersect the results, but line breaks are not friendly with page
>>>>>> titles (some "Has improper value for" errors emerged) and I do want to
>>>>>> state the strings I'm working with as... text.
>>>>>>
>>>>>> Can you imagine any workaround for this problem?
>>>>>>
>>>>>> Cheers !
>>>>>>
>>>>>> - Rafael
>>>>>>
>>>>>> ------------------------------------------------------------------------------
>>>>>> CenturyLink Cloud: The Leader in Enterprise Cloud Services.
>>>>>> Learn Why More Businesses Are Choosing CenturyLink Cloud For
>>>>>> Critical Workloads, Development Environments & Everything In Between.
>>>>>> Get a Quote or Start a Free Trial Today.
>>>>>>
>>>>>> http://pubads.g.doubleclick.net/gampad/clk?id=119420431&iu=/4140/ostg.clktrk
>>>>>> _______________________________________________
>>>>>> Semediawiki-user mailing list
>>>>>> Sem...@li...
>>>>>> https://lists.sourceforge.net/lists/listinfo/semediawiki-user
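
The index-string scheme described in Markus's reply can be sketched in a few lines of Python and applied to the example from the original question. The values 40, 32 and 72 come from the reply. Everything else is an assumption made for illustration and is not SMW's actual code: the use of MD5 for the 32-character hash, counting length in Unicode code points (the database may count differently), and the helper names `index_string` and `like_matches`.

```python
import hashlib
from fnmatch import fnmatchcase

PREFIX_LENGTH = 40   # characters kept verbatim for long strings (from the reply)
HASH_LENGTH = 32     # characters of hash code (from the reply)
MAX_LENGTH = PREFIX_LENGTH + HASH_LENGTH  # the "magic value" of 72


def index_string(text: str) -> str:
    """Return the string a LIKE search would operate on (illustrative sketch)."""
    if len(text) < MAX_LENGTH:
        # Short strings are stored whole, without a Blob.
        return text
    # Assumption: MD5, whose hex digest is exactly 32 characters long.
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    return text[:PREFIX_LENGTH] + digest


def like_matches(pattern: str, text: str) -> bool:
    """Hypothetical stand-in for an #ask '~' query, where '*' matches any run
    of characters. The pattern is checked against the index string only."""
    return fnmatchcase(index_string(text), pattern)


incipit = (
    "A las infortunadas hijas del señor Don Venustiano Carranza, extinto "
    "presidente de la República, ante su acerbo dolor"
)

print(len(index_string(incipit)))                     # 72
print(like_matches("*infortunadas hijas*", incipit))  # True: inside the first 40 characters
print(like_matches("*acerbo dolor*", incipit))        # False: that part lives only in the Blob
```

Because the hash part consists of hexadecimal digits, a digits-only pattern such as `*18*` could in principle match inside the hash of an unrelated long string. That may be what produces the unexpected results reported for the numeric queries, though the thread does not confirm it.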