Hi Eric,

I like your plan. JMeter is a great tool for simulating search queries and recording the results. Rerunning the same tests after each configuration change, under otherwise identical conditions, lets you see the true impact of that change. And yes, I agree that a nice random set of search words is a good starting point. I think the key, as you point out, is how to start that list.

In the past I have used keywords from prior search requests, gathered from the web access log files, as samples. But since you want to test before the actual site is up and running, my suggestion is to use a "seed" of sample words. I think the real question is how many keywords are enough. My experience is that the difference between 100 keywords and 1000 keywords is not that much, and 10,000 keywords should certainly be enough.

Here is one suggestion: get the first 100 from your documents, and then find similar words using a tool such as Google Sets:

http://labs.google.com/sets?hl=en&q1=civil+war&q2=lincoln&q3=gettysburg&q4=richmond&q5=&btn=Large+Set

This will return a large number of words and phrases that you can then edit and continue to refine.
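For the first step, here is a rough Python sketch. It assumes your extracted word counts are available as a word -> count mapping, and that "the first 100" means the 100 most frequent words. The function names are only for illustration. The second function makes one independently shuffled copy of the list per JMeter thread, which is one simple way to keep the threads from asking for the same word at the same time.

```python
import random
from collections import Counter

def seed_words(word_counts, n=100):
    """Return the n most frequent words as a starting seed list."""
    return [word for word, _ in Counter(word_counts).most_common(n)]

def shuffled_lists(words, threads, seed=0):
    """Return one independently shuffled copy of the word list per thread.

    A fixed seed keeps the lists identical from run to run, so the
    tests stay repeatable.
    """
    rng = random.Random(seed)
    lists = []
    for _ in range(threads):
        copy = list(words)
        rng.shuffle(copy)
        lists.append(copy)
    return lists
```

Each shuffled list could then be written to its own file and handed to one thread.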

You can then simulate these searches using tools in JMeter that play back log files, such as the AccessLogSampler. I will see if I can put together a short write-up on this.
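Until I get to that write-up, here is a minimal sketch of how a keyword list could be turned into a log file for playback. It assumes the sampler is reading entries in the common log format, and the REST path and parameter name are made up for the example, so substitute your own query URL. The host, timestamp, status and size fields are placeholders.

```python
from urllib.parse import urlencode

# Hypothetical REST path and parameter name: replace with your own.
SEARCH_PATH = "/exist/rest/db/civilwar/search.xq"
PARAM = "q"

def log_line(keyword, path=SEARCH_PATH, param=PARAM):
    """Format one keyword search as a common-log-format GET entry."""
    query = urlencode({param: keyword})  # encodes spaces as '+'
    # Host, timestamp, status and size are placeholder values.
    return ('127.0.0.1 - - [28/Oct/2010:14:34:00 -0400] '
            f'"GET {path}?{query} HTTP/1.1" 200 0')

def write_log(keywords, filename):
    """Write one log entry per keyword, in list order."""
    with open(filename, "w", encoding="utf-8") as out:
        for keyword in keywords:
            out.write(log_line(keyword) + "\n")
```

Calling write_log once per thread, with a differently ordered keyword list each time, would give every thread its own playback file.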

- Dan

On Thu, Oct 28, 2010 at 2:34 PM, Palmer, Eric <epalmer@richmond.edu> wrote:
Hello all,

We are building a digital libraries collection application around some TEI versions of USA civil war documents. The application for some time will run on a VMware ESX VM (RHEL). Users will perform a variety of full text, n-gram and maybe range queries, some of which are moderately complex (they analyze related information: for example, if we search for “loyal”, they count how many speeches each speaker made that also said loyal, and do the same for dates and for the locations that the speakers were from).

Today many of the queries run in <1 sec and some in a couple of seconds.

We are going to construct a JMeter test that will run concurrent XQuery REST queries on the collection. We want to understand eXist-db memory and CPU usage and the effect on performance of concurrent queries running.

This is an informal test, not a formal one. Ultimately, we expect to move this application to a physical server and off the VM. But until we can get that server purchased and operational, we want to be able to get a sense of performance.

We have extracted word counts for all non-stop words and we have nodes that are also uniquely identified with an attribute like <noename xml:id=sp123>

What we want to understand is how careful I need to be about randomizing the concurrent threads' use of keywords so that I minimize cache performance bias. In the wild, the users are likely to search for anything and not the same things over and over again.

I can take the list of word counts, create multiple randomized lists from it, and feed each one into a JMeter thread, but that could be a lot of work. Just wondering if that is needed.

For completeness some of the query results are moderately large (30,000 bytes+).

Once the application is live we will send a URL out for everyone to look at it. For those of you that know US Civil War history, our University’s President, Dr. Ayers, will demo this application on the 147th anniversary of President Lincoln's Gettysburg Address and he will do so in Gettysburg.


Thanks in advance for your help.

Eric Palmer
U of Richmond




--
Dan McCreary
Semantic Solutions Architect
office: (952) 931-9198
cell: (612) 986-1552