Markus, a simple solution to the problem could be adding an id range parameter to SMW_refreshData - not only a first id to start with, but a last one as well. This way it would be easy to split the whole task into smaller chunks to avoid memory leaks and to (possibly) run them in parallel. I was thinking about writing a patch like that, but I'm not sure when I'll get back to this problem.
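
To illustrate, a rough (hypothetical) driver could split the id space into chunks and run them as separate PHP processes; here is a sketch in Python. The -e (end id) option is exactly the parameter I'm proposing and does not exist yet, and I'm assuming the existing start-id option is -s:

#!/usr/bin/env python
# Sketch only: drive SMW_refreshData.php over page id ranges in separate,
# short-lived PHP processes, so that any memory leaked in one chunk is
# released when that process exits.  The -e (end id) option is the
# hypothetical parameter proposed above; -s is assumed to be the existing
# start-id option.
import subprocess

PHP = "php"
SCRIPT = "maintenance/SMW_refreshData.php"  # adjust to your installation
MAX_ID = 200000   # highest page id on the wiki (adjust)
CHUNK = 10000     # ids per chunk; keeps each PHP process small
PARALLEL = 2      # chunks to run at once on a multi-CPU box

def main():
    chunks = [(s, min(s + CHUNK - 1, MAX_ID))
              for s in range(1, MAX_ID + 1, CHUNK)]
    running = []
    while chunks or running:
        # keep up to PARALLEL chunk processes alive
        while chunks and len(running) < PARALLEL:
            start, end = chunks.pop(0)
            cmd = [PHP, SCRIPT, "-s", str(start), "-e", str(end)]
            running.append(subprocess.Popen(cmd))
        # wait for the oldest chunk to finish before starting the next one
        running.pop(0).wait()

if __name__ == "__main__":
    main()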

P.S. The MediaWiki I'm running this on is from the latest branch, i.e. 1.11.x

           Sergey


On Nov 24, 2007 8:30 AM, Markus Krötzsch <mak@aifb.uni-karlsruhe.de> wrote:
On Tuesday, November 6, 2007, Sergey Chernyshev wrote:
> It seems that SMW_refreshData gets slower as the dataset grows.
>
> I didn't do much troubleshooting of the issue, but the first 50000 pages
> of my dataset were processed faster than the second 50000 pages.

I noticed the same on our servers, and I suspect a memory leak accounts for
that. It is possible that MediaWiki is part of the reason -- we had a
similar problem some time ago and it turned out that MediaWiki's link cache
had no size limit (so batch-processing 1 million pages really built up a
large array in memory). Similar caches may be the reason for the renewed
slowdown, but we were unable to analyse this issue in detail. In any case,
the MW version is an important piece of information for debugging here.
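
Just to illustrate the effect: an uncapped cache keeps one entry per page
touched during a batch run, so memory grows with the whole batch, while a
size-limited cache stays flat. A minimal sketch of the idea (in Python, and
of course not MediaWiki's actual LinkCache code):

# Illustration only -- not MediaWiki code.  An unbounded cache keeps one
# entry per page touched during a batch run; a capped one stays small.
class CacheSketch:
    def __init__(self, max_size=None):
        self.max_size = max_size  # None means "no limit", as in the old MW bug
        self.entries = {}

    def add(self, page_id, title):
        if self.max_size is not None and len(self.entries) >= self.max_size:
            self.entries.clear()  # crude cap: flush instead of growing forever
        self.entries[page_id] = title

# unbounded = CacheSketch()                # grows for the entire batch run
# capped    = CacheSketch(max_size=10000)  # memory use stays roughly constant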

>
> I'm going to start the upgrade over with RC2 and will try to look at the
> speed of the process, but I think the reason might be some indexes getting
> bigger with more data (which could be avoided by dropping the indexes prior
> to the refresh and rebuilding them right after) or MySQL not liking that
> many temporary tables being created so rapidly.

I would rather suspect the PHP side to be the reason, but one never knows. I
do not expect differences between the SMW 1.0 RCs. Basically, the refresh
process has not changed much for a long time, but the speed issues only
occurred recently (again suggesting that some change in MW may be the reason).
SMW also has some unbounded caches, but these are for properties and should
hardly get large enough on current wikis to be relevant here.

>
> Also, I'm wondering whether parts of the dataset could be processed in
> parallel? It seems that a single run of the script doesn't load the CPU
> that much and alternates between the PHP and MySQL processes, which is not
> optimal on multi-processor boxes where these loads could be spread across
> all the CPUs.

Possibly, but refreshing is usually a low-priority task, since the wiki should
remain usable during the refresh. So it might even be an advantage if it works
in the background without eating too many resources at a time (which, by the
above observation, is probably not really the case either ;-).


Markus

>
>          Sergey



--
Markus Krötzsch
Institut AIFB, Universität Karlsruhe (TH), 76128 Karlsruhe
phone +49 (0)721 608 7362        fax +49 (0)721 608 5998
mak@aifb.uni-karlsruhe.de        www  http://korrekt.org