From: James H. K. <jam...@gm...> - 2012-01-14 14:34:05
Hi,

Starting with SMW 1.7 and MW 1.18, we began to convert our old legacy document system into an SMW/MW-based system. This has left us with more than 700.000 triplets stored in SMW, but at the same time it has slowed our response time on SMW-related queries.

Somewhere around 200.000 triplets (which does not mean that number is a threshold) we noticed a growing impact on query performance, and now we feel the pinch every time we execute a query. We are not talking about in-template query performance as seen in the Wikia/Familypedia example (we abandoned such practices some time ago). Nowadays we encourage users to execute all complex queries either via Special:Ask or through an input form that runs a RunQuery, and yes, we are using APC to improve caching and response time in general.

We looked at external solutions: 4Store is not supported on Windows, Virtuoso has no real documentation on making it work with SMW (at least we couldn't find any), and Jena seems to require SMW+. That leaves us with the native SMW store, and we would like to keep it that way, as every external piece of software means an additional point of failure and additional maintenance effort.

== Architectural question ==

#1 Could there be an indexing problem with one of the primary SMW table key indexes?

#2 Does SMW natively support MySQL's internal query-cache-type/query-cache-size options to improve query performance? We made sure MySQL is using the query-cache-type/query-cache-size options, but somehow this doesn't show any effect on SMW-related queries.

#3 Would a different approach to handling query data, namely storing it in a temporary in-memory table, bring advantages compared to the current approach of accessing SMW disk tables every time a query is executed? Would an in-memory concept for queried data (SMW data is mirrored into a temporary in-memory table for read purposes only during the current MySQL session, and the temporary in-memory tables have to be rebuilt every time MySQL is restarted) improve query and access performance for SMW-related triplets? I guess (I don't know) that neither MyISAM nor InnoDB would have an impact, since the bottleneck seems to be the disk access needed to execute queries on triplets stored in SMW-related tables.

Of course there is always a way to improve performance by using better hardware (RAID, SSD to improve output performance), but this is a last-resort approach which we would like to avoid for the moment.

System: MediaWiki 1.18.0, PHP 5.3.8 (apache2handler), MySQL 5.5.16, APC version 3.1.6-dev

PS: Our increased use of triplets comes from an automatic indexing process for content and document transfer, which exchanges information with Sphinx Search and identifies the 30 most used words in a document; these are written back to the wiki and stored as semantic triplets on the related NS_IMAGE object.

Cheers,

mwjames
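
The read-only in-memory mirror asked about in #3 can be illustrated with a small sketch. It uses Python's built-in sqlite3 module rather than MySQL, purely so that it runs without a server; in MySQL the closest counterpart would presumably be a table using the MEMORY storage engine. The `triples` table, its columns, and the `load_into_memory` helper are hypothetical and are not part of SMW or its schema.

```python
import os
import sqlite3
import tempfile


def load_into_memory(path):
    """Copy an on-disk SQLite database into a fresh in-memory database."""
    disk = sqlite3.connect(path)
    mem = sqlite3.connect(":memory:")
    try:
        disk.backup(mem)  # copies all tables and indexes
    finally:
        disk.close()
    return mem


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "store.db")

    # Build a small on-disk store (hypothetical layout, not the SMW schema).
    disk = sqlite3.connect(path)
    disk.execute("CREATE TABLE triples (s_id INTEGER, p_id INTEGER, o_id INTEGER)")
    disk.executemany(
        "INSERT INTO triples VALUES (?, ?, ?)",
        [(s, s % 30, s % 7) for s in range(1000)],
    )
    disk.commit()
    disk.close()

    # The mirror has to be rebuilt like this after every restart.
    mirror = load_into_memory(path)

# Reads now hit the in-memory copy; the disk file is already gone.
mirror_count = mirror.execute(
    "SELECT COUNT(*) FROM triples WHERE p_id = 3"
).fetchone()[0]
print(mirror_count)
```

The sketch only shows the mechanics (rebuild on start, read from memory). It says nothing about whether such a mirror would actually be faster than MySQL's own buffering for a given workload, and it ignores how writes to the disk tables would be propagated to the mirror.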
From: CNIT <cn...@un...> - 2012-01-16 06:30:58
On 14.01.2012 18:33, James Hong Kong wrote:
> [...]
> We looked at external solutions: 4Store is not supported on Windows, Virtuoso has no real documentation on making it work with SMW (at least we couldn't find any), and Jena seems to require SMW+. That leaves us with the native SMW store, and we would like to keep it that way, as every external piece of software means an additional point of failure and additional maintenance effort.
> [...]

I think the proper answer would be to move to Linux and use 4store, although some people recently complained that queries on internal objects do not work correctly with 4store. BTW, if you have a gigabit LAN or faster (fiber), you may try setting up 4store on a different host in your LAN while keeping SMW on Windows.

Dmitriy
From: Markus K. <ma...@se...> - 2012-01-19 12:08:48
On 14/01/12 14:33, James Hong Kong wrote:
> Hi,

Hi James,

> Starting with SMW 1.7 and MW 1.18, we began to convert our old legacy document system into an SMW/MW-based system. This has left us with more than 700.000 triplets stored in SMW, but at the same time it has slowed our response time on SMW-related queries.
>
> Somewhere around 200.000 triplets (which does not mean that number is a threshold) we noticed a growing impact on query performance, and now we feel the pinch every time we execute a query. We are not talking about in-template query performance as seen in the Wikia/Familypedia example (we abandoned such practices some time ago). Nowadays we encourage users to execute all complex queries either via Special:Ask or through an input form that runs a RunQuery, and yes, we are using APC to improve caching and response time in general.
>
> We looked at external solutions: 4Store is not supported on Windows, Virtuoso has no real documentation on making it work with SMW (at least we couldn't find any), and Jena seems to require SMW+. That leaves us with the native SMW store, and we would like to keep it that way, as every external piece of software means an additional point of failure and additional maintenance effort.

Getting Virtuoso to work properly is my next goal, but you are right that there is no official support for it yet. There are already hacks to get it to work, but they have not been integrated into SMW so far.

> == Architectural question ==
>
> #1 Could there be an indexing problem with one of the primary SMW table key indexes?

Possibly, but none that I am aware of. Did you find out anything about the queries that cause the problems and the indexes that they are using? If you think that more or different indexes would help, you can also modify the indexes manually on the SMW tables to see if this makes a difference (though running SMW_setup.php would undo these changes).
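
A rough illustration of the kind of check suggested here: compare the query plan before and after adding an index. The sketch uses Python's built-in sqlite3 module and SQLite's `EXPLAIN QUERY PLAN` so that it runs without a database server; on MySQL the corresponding tool is the `EXPLAIN SELECT ...` statement. The `triples` table, the `idx_po` index, and the `query_plan` helper are hypothetical and do not reflect the real SMW tables.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the human-readable 'detail' column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]


conn = sqlite3.connect(":memory:")

# Hypothetical triple table (subject, property, object); not the SMW schema.
conn.execute("CREATE TABLE triples (s_id INTEGER, p_id INTEGER, o_id INTEGER)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [(s, s % 30, s % 7) for s in range(10000)],
)

sql = "SELECT s_id FROM triples WHERE p_id = ? AND o_id = ?"

# Without an index the planner has to scan the whole table.
plan_before = query_plan(conn, sql, (3, 3))

# Add a candidate index by hand and look at the plan again.
conn.execute("CREATE INDEX idx_po ON triples (p_id, o_id)")
plan_after = query_plan(conn, sql, (3, 3))

print(plan_before)
print(plan_after)
```

If the second plan mentions the new index, the planner is at least willing to use it; whether it actually helps still has to be measured against the real slow queries on the real data.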

> #2 Does SMW natively support MySQL's internal query-cache-type/query-cache-size options to improve query performance? We made sure MySQL is using the query-cache-type/query-cache-size options, but somehow this doesn't show any effect on SMW-related queries.

SMW does not do anything with query-cache-type/query-cache-size, so it should not override your global settings in this respect, but maybe the performance problem is not caused there. SMW does have a simple Concept mechanism to manually manage query caches (kept in a database table). If you have a particularly common or heavily used query or query part, then this could be an option. If you have thousands of very different queries that do not share similar conditions, then this will hardly be feasible.

> #3 Would a different approach to handling query data, namely storing it in a temporary in-memory table, bring advantages compared to the current approach of accessing SMW disk tables every time a query is executed? Would an in-memory concept for queried data (SMW data is mirrored into a temporary in-memory table for read purposes only during the current MySQL session, and the temporary in-memory tables have to be rebuilt every time MySQL is restarted) improve query and access performance for SMW-related triplets? I guess (I don't know) that neither MyISAM nor InnoDB would have an impact, since the bottleneck seems to be the disk access needed to execute queries on triplets stored in SMW-related tables.

It might of course be possible to optimize the MySQL-based query engine for better performance. It would also be possible to make use of memory caches in some cases, though this needs some thought about how to manage these caches. But overall, I would not put too much development effort into optimizing MySQL query performance in particular, given that there are projects like Virtuoso and 4Store that spend all their time doing mainly that.
Connecting Virtuoso should not be so hard (mainly we are facing some protocol issues that I have not had time to look at yet; if Virtuoso supported SPARQL 1.1, it should work out of the box; we are mainly talking about proprietary tweaks in the query syntax here).

> Of course there is always a way to improve performance by using better hardware (RAID, SSD to improve output performance), but this is a last-resort approach which we would like to avoid for the moment.

Yes, I agree. I will increase the priority of finally getting Virtuoso working a bit; there are other open threads on this list related to this.

Regards,

Markus