Thread: Re: [OpenSTA-users] Severe load ramp-up limitation encountered
From: Dan D. <ddo...@me...> - 2007-07-12 19:34:05
> Dan Downing wrote:
>> The Test: Two load servers (dual 1 GHz P4s, 1 GB memory). On each, the test had
>> 20 Task Groups with this one script, in v-user "burst" groups of 50, 100,
>> 150, ... 1000, doing a single iteration. Each group to be launched 3 minutes
>> apart (allowing the web server to settle from the previous burst). Ramp rate of
>> 300 users per minute (batch settings: 1/5/1 -- interval between batches /
>> v-users per batch / batch ramp time).

> Danny Faught wrote:
> I'm curious if you tried to set up this scenario using a single task group. I
> would guess that if you could get what you want that way, it would put less
> stress on your load generators.
> --
> Danny R. Faught

Danny, thanks for your suggestion. While I did not try modeling the *entire set*
of Task Groups into a single Task Group, I *did* (just this morning) re-run *just
one* Task Group with 400 users -- and managed to peg the CPU and drive up the
pages/sec just the same.

So, whatever resources I am taxing, it occurs even with a single Task Group.

...Dan

Dan Downing
www.mentora.com
From: Dan D. <ddo...@me...> - 2007-07-12 20:47:10
> Dan Downing wrote:
>>> The Test: Two load servers (dual 1 GHz P4s, 1 GB memory). On each, the test
>>> had 20 Task Groups with this one script, in v-user "burst" groups of 50, 100,
>>> 150, ... 1000, doing a single iteration. Each group to be launched 3 minutes
>>> apart (allowing the web server to settle from the previous burst). Ramp rate
>>> of 300 users per minute (batch settings: 1/5/1 -- interval between batches /
>>> v-users per batch / batch ramp time).
>
>> Danny Faught wrote:
>> I'm curious if you tried to set up this scenario using a single task group. I
>> would guess that if you could get what you want that way, it would put less
>> stress on your load generators.
>> --
>> Danny R. Faught
>
> Danny, thanks for your suggestion. While I did not try modeling the *entire
> set* of Task Groups into a single Task Group, I *did* (just this morning)
> re-run *just one* Task Group with 400 users -- and managed to peg the CPU and
> drive up the pages/sec just the same.
>
> So, whatever resources I am taxing, it occurs even with a single Task Group.

Bernie wrote:
> I'm having a little trouble visualizing what is happening. Could you reply with
> the VU parameters (i.e. batch start options, total number of users, number of
> virtual users for timers and http results) for the single task group test?

A picture is worth a few hundred words (depreciation). See attached.
Hmmm... can't send an attachment!#@?
OK; verbal it is.

1 task group, 400 v-users; 5 VUs per batch, 1 second between batches, 1 second
batch ramp-up.

...Dan

Dan Downing
www.mentora.com
From: Dan D. <ddo...@me...> - 2007-07-15 13:07:17
Dan Sutcliffe wrote:
> So, my questions then would be: what are the similarities of the 2 failures?
> and, what are the differences between these and the tests that worked as you
> expected?

The similarities between the two examples:
1 - Run on the same W2K load server
2 - In the first attempt, the script had only the *small* (sub-second) WAITs
    between PRIMARY GETs and GET URIs (commented out any longer than 1 second)
3 - Similar aggressive load ramp per Task Group -- 100 vu/group, 50 users/batch,
    3 seconds/batch, 1 sec. batch ramp-up (later reduced to 1 vu/batch,
    3 sec/batch, 1 sec. batch ramp-up)

The differences:
1 - Many pages, each with many resources, using multiple conids, lots of
    LOAD RESPONSEs
2 - Reading a file with 26 comma-delimited script parameters, then calling a
    routine that parsed these out into local variables

>> In the second 'cpu-pegging' example that I did not describe -- a much more
>> complex script with minimal millisecond WAITs -- the cpu-pegging problem was
>> solved by inserting randomized 10-20 second WAITs between the 26 script steps.
> But then you weren't creating as much load ... ?

Correct; 1/20th the vusers, about 1 minute end-to-end response time for the
script.

> It is an interesting fact that the (larger) WAITs did help though. I wonder if
> the smaller WAITs were seen by the executor as just not enough to cause it to
> pause (once it got behind) and therefore the potential context switches were
> few and far between. Did you try this technique on your other failure and it
> didn't help?

Definitely, it is the 10-20 second WAITs that solved the cpu-pegging problem
(this after we worked on tuning the data-parsing code, which we thought might be
the problem -- it wasn't).

> I just went back and had another look at your script - the interesting point
> is that you only use a single connection id, was the script actually recorded
> this way?

Yes, it was recorded this way; I did not notice there was a single conid for all
the GETs till you two mentioned it.

> Because the requests all occur on a single connection then the chances are
> once load ramps up all of your WAITs will be totally ignored and you have
> absolutely no need for any SYNCHRONIZE between them. The final SYNCHRONIZE at
> the end of the script serves absolutely no purpose as there are no connections
> open at that point to synchronize with. If anything, I would have made your
> script end:
>
> Although I don't think any of this is the source of your issue just from the
> fact that Bernie has run this exact script without issues.
>
> Your script is also using HTTP/1.0 but with Keep-Alive - can you compare this
> (with the connection usage) to the way that the LR replacement was scripted?
> Just out of interest.

Yeah. I will have to retest this on my other laptop from home with my Verizon
FIOS 15 Megabit connection -- and will send another report.

>> Good suggestion about looping, though I resist looping in the script because
>> the Summary Results monitor only refreshes when the script completes a
>> Commander-controlled iteration -- and you lose run-time feedback.
> Interesting you mention run-time feedback - what are you watching?

Was watching the OpenSTA Summary Results, plus perfmon on our load driver.

> Bernie: were you monitoring your test at runtime?

>>> The "Failed processing TOF" error is usually accompanied by another, more
>>> meaningful error - I'd be very interested if there is one and if so, what
>>> it is ...
>> There was no other error reported.
> Might just be memory shortage being hit, or could be some sort of corruption.
> My gut feeling is that once 'something' in your test goes 'pear shaped' then
> it's all downhill from there - what the original problem is holds the most
> interest for me.

Roger this.

...Dan

Dan Downing
www.mentora.com
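The script ending Daniel suggested after "I would have made your script end:" did
not survive in the archive. Given his point that a SYNCHRONIZE only has effect
while connections are still open, it was presumably along these lines -- an
illustrative SCL sketch only, with a hypothetical timer name and conid, not
Daniel's actual text:

    ! Last secondary GET of the page ...
    GET URI "http://www.example.com/footer.gif HTTP/1.0" ON 1

    ! Synchronize while connection 1 is still open, stop the timer,
    ! and only then close the connection
    SYNCHRONIZE REQUESTS
    End Timer T_PAGE
    DISCONNECT FROM 1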
From: Bernie V. <Ber...@iP...> - 2007-07-12 19:40:37
Dan,

I'm having a little trouble visualizing what is happening. Could you reply with
the VU parameters (i.e. batch start options, total number of users, number of
virtual users for timers and http results) for the single task group test?

-Bernie

----- Original Message -----
From: "Dan Downing" <ddo...@me...>
To: <ope...@li...>
Sent: Thursday, July 12, 2007 3:34 PM
Subject: Re: [OpenSTA-users] Severe load ramp-up limitation encountered

>> Dan Downing wrote:
>>> The Test: Two load servers (dual 1 GHz P4s, 1 GB memory). On each, the test
>>> had 20 Task Groups with this one script, in v-user "burst" groups of 50,
>>> 100, 150, ... 1000, doing a single iteration. Each group to be launched
>>> 3 minutes apart (allowing the web server to settle from the previous burst).
>>> Ramp rate of 300 users per minute (batch settings: 1/5/1 -- interval between
>>> batches / v-users per batch / batch ramp time).
>
>> Danny Faught wrote:
>> I'm curious if you tried to set up this scenario using a single task group.
>> I would guess that if you could get what you want that way, it would put less
>> stress on your load generators.
>> --
>> Danny R. Faught
>
> Danny, thanks for your suggestion. While I did not try modeling the *entire
> set* of Task Groups into a single Task Group, I *did* (just this morning)
> re-run *just one* Task Group with 400 users -- and managed to peg the CPU and
> drive up the pages/sec just the same.
>
> So, whatever resources I am taxing, it occurs even with a single Task Group.
>
> ...Dan
>
> Dan Downing
> www.mentora.com
From: Bernie V. <Ber...@iP...> - 2007-07-12 20:06:08
> Dan,
>
> I'm having a little trouble visualizing what is happening. Could you reply
> with the VU parameters (i.e. batch start options, total number of users,
> number of virtual users for timers and http results) for the single task
> group test?

I hate to respond to my own posts, but "never mind" -- you included the info I
asked for in a prior post.

I can't replicate your problem, but I am currently sitting behind a 10Mb cable
modem and am therefore running a different test. I was able to ramp up to 1000
users with no errors (except some 10060 network timeouts, which seem reasonable
given my slow link).

My runtime parameters were:

    Total number of users for this Task Group: 1000
    # of virtual users for timer results:      1000
    # of virtual users for http results:       1
    Batch start options:
        Interval between batches:        1
        Number of virt users per batch:  5
        Batch ramp up time (seconds):    1

My script was set to loop for 1 hour and consisted of:

    Start Timer T_DANDOWNING

    PRIMARY GET URI "http://syn1.sellpoint.net/QA/smloadtest.html HTTP/1.0" ON 1 &
        HEADER DEFAULT_HEADERS &
        ,WITH {"Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, " &
        "application/x-shockwave-flash, application/vnd.ms-excel, application/vnd.ms-powerpoint, " &
        "application/msword, */*", &
        "Accept-Language: en-us", &
        "Connection: Keep-Alive"}

    SYNCHRONIZE REQUESTS

    End Timer T_DANDOWNING

You don't have any large local variables defined, do you?

FYI - I forced my 3.06 GHz P4 processor to run at 1.5 GHz; I have 2 GB of memory.
CPU utilization never got much over 20% and I saw no abnormal paging.

-Bernie
From: Danny R. F. <fa...@te...> - 2007-07-12 20:46:28
Dan Downing wrote:
> So, whatever resources I am taxing, it occurs even with a single Task Group.

You didn't mention any think time in your script. If you don't have any delays,
then I would anticipate that even a single VU would peg your CPU.

--
Danny R. Faught
Tejas Software Consulting
http://tejasconsulting.com/
From: Dan D. <ddo...@me...> - 2007-07-13 13:41:47
> Dan Downing wrote:
>> So, whatever resources I am taxing, it occurs even with a single Task Group.

> Danny Faught wrote:
> You didn't mention any think time in your script. If you don't have any
> delays, then I would anticipate that even a single VU would peg your CPU.

Well that is true -- no WAITs in this one -- given it is a single page; and only
one iteration, so no pacing WAITs between iterations either.

That said, LR had no cpu-pegging with these bursts of load... and it returned
reasonable times (0.5 seconds at the low end).

...Dan

Dan Downing
www.mentora.com
From: Daniel S. <da...@Op...> - 2007-07-13 15:14:16
Danny Faught wrote:
>> You didn't mention any think time in your script. If you don't have any
>> delays, then I would anticipate that even a single VU would peg your CPU.

Dan Downing wrote:
> Well that is true -- no WAITs in this one -- given it is a single page; and
> only one iteration, so no pacing WAITs between iterations either.

Well, we know this is not exactly true now from your declaration and full script
posting in the reply to Bernie. That said, Danny's advice really needs to be
reiterated: removing all of your WAITs is almost never a good idea, unless all
you want to do is stress all your servers to breaking point and you are not
interested in the timing, just in what breaks when this happens ...

> That said, LR had no cpu-pegging with these bursts of load... and it returned
> reasonable times (0.5 seconds at the low end).

I don't think we're comparing like with like though - I've never really used LR,
but from what I understand from discussions I've had with people who do and have
used it, its replay actually works quite differently from the OpenSTA Executor,
and may actually be self-throttling to some extent, i.e. when the primary return
starts to slow down then the secondaries will get relevant delays before they
are sent. The SCL replay will just attempt to keep sending those secondary GETs
at the interval that is given in the script.

This is one of the reasons it is OK to put a SYNCHRONIZE after your primary GET
- it 'sort of' simulates the total slowdown as the primary response times get
longer ... although it isn't ideal, because you lose the actual simulation of
secondary GETs starting before the primary has completely finished, which may
well happen in a real browser.

Not sure if this helps you solve your problem but ...

Cheers
/dan
--
Daniel Sutcliffe <Da...@Op...>
OpenSTA part-time caretaker - http://OpenSTA.org/
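A minimal SCL sketch of the SYNCHRONIZE-after-primary pattern Daniel describes
here -- the URL, timer name, and conid are hypothetical, not taken from Dan's
actual script:

    Start Timer T_HOMEPAGE

    ! Request the page itself, then block until the response arrives;
    ! this roughly mimics a browser slowing as primary times grow
    PRIMARY GET URI "http://www.example.com/index.html HTTP/1.0" ON 1
    SYNCHRONIZE REQUESTS

    ! Secondary GETs are then issued at the intervals given in the script
    GET URI "http://www.example.com/logo.gif HTTP/1.0" ON 1
    GET URI "http://www.example.com/style.css HTTP/1.0" ON 1

    End Timer T_HOMEPAGE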
From: Danny R. F. <fa...@te...> - 2007-07-13 16:16:26
Dan Downing wrote:
> Well that is true -- no WAITs in this one -- given it is a single page; and
> only one iteration, so no pacing WAITs between iterations either.

Okay, you're right - for a scenario where you're accessing only a single page on
the site, there is no think time to simulate. But there is still the browser
processing time between the requests to load the secondary elements of the page.
When you take a recording, these are the sub-second delays inserted between
secondary gets. I think delays of even a tenth of a second would make a huge
impact on your CPU usage.

I haven't thought through how the use of synchronization affects this, and
whether it could be used in lieu of artificial waits.

--
Danny R. Faught
Tejas Software Consulting
http://tejasconsulting.com/
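To illustrate what restoring those recorded delays might look like -- a
hypothetical fragment, not Dan's script; SCL WAIT values are in milliseconds, so
a tenth of a second is WAIT 100:

    PRIMARY GET URI "http://www.example.com/index.html HTTP/1.0" ON 1

    ! Recorded browser processing delay before the first secondary GET
    WAIT 100
    GET URI "http://www.example.com/logo.gif HTTP/1.0" ON 1

    ! ... and another sub-second pause before the next resource
    WAIT 150
    GET URI "http://www.example.com/style.css HTTP/1.0" ON 1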
From: Bernie V. <Ber...@iP...> - 2007-07-13 19:10:24
Daniel wrote:
>> That said, LR had no cpu-pegging with these bursts of load... and it
>> returned reasonable times (0.5 seconds at the low end).
>
> I don't think we're comparing like with like though

I am sure we are not. Notice the connection ids are all 1 in the full script Dan
posted. LR is surely sending the secondary gets in parallel, and most likely
prior to the primary get finishing. That would alert me to the possibility that
the response times reported by the two tools may very well be different.

> - I've never really used LR but from what I understand from discussions I've
> had with people who do and have used it, its replay actually works quite
> differently than the OpenSTA Executor, and may actually be self-throttling to
> some extent, i.e. when the primary return starts to slow down then the
> secondaries will get relevant delays before they are sent. The SCL replay
> will just attempt to keep sending those secondary GETs at the interval that
> is given in the script.
>
> This is one of the reasons it is OK to put a SYNCHRONIZE after your primary
> GET - it 'sort of' simulates the total slowdown as the primary response times
> get longer ... although it isn't ideal because you lose the actual simulation
> of secondary GETs starting before the primary has completely finished, which
> may well happen in a real browser.

I suspect Dan edited the connection ids to be all 1, or perhaps the script is
synthetic and was not the product of a recording. In any event, I'd consider
putting the primary get on id 1, followed by a SYNCHRONIZE command, followed by
the secondary gets each having a separate connection id, followed by a
SYNCHRONIZE and then an END TIMER command.

Of course I could be talking out of my hat here... not knowing all the details,
and this could all be putting too fine a point on things given that Dan has a
much bigger problem on his hands in that the test won't run!

Good luck Dan.

-Bernie
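A sketch of the restructuring Bernie proposes, in SCL -- placeholder URLs, timer
name, and conids; illustrative only since, as he says, this may be putting too
fine a point on things:

    Start Timer T_PAGE

    ! Primary get on its own connection id; wait for its response
    PRIMARY GET URI "http://www.example.com/index.html HTTP/1.0" ON 1
    SYNCHRONIZE REQUESTS

    ! Secondary gets, each on a separate connection id so they can be
    ! serviced in parallel
    GET URI "http://www.example.com/logo.gif HTTP/1.0" ON 2
    GET URI "http://www.example.com/style.css HTTP/1.0" ON 3
    GET URI "http://www.example.com/script.js HTTP/1.0" ON 4

    ! Wait for all of them to complete before stopping the timer
    SYNCHRONIZE REQUESTS
    End Timer T_PAGE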
From: Dan D. <ddo...@me...> - 2007-07-13 22:13:49
Bernie wrote:
> I suspect Dan edited the connection ids to be all 1, or perhaps the script is
> synthetic and was not the product of a recording. In any event, I'd consider
> putting the primary get on id 1, followed by a SYNCHRONIZE command, followed
> by the secondary gets each having a separate connection id, followed by a
> SYNCHRONIZE and then an END TIMER command. Of course I could be talking out
> of my hat here... not knowing all the details, and this could all be putting
> too fine a point on things given that Dan has a much bigger problem on his
> hands in that the test won't run!

Bernie, this script was recorded; the connection ids were not edited. But I will
be trying your suggestions anyway... given how perplexing this behavior was, and
how suspect it made me that perhaps I've been reporting inaccurate response
times on *other* projects where the CPU was *not* pegged.

...Dan

Dan Downing
www.mentora.com
From: Daniel S. <da...@Op...> - 2007-07-16 14:14:20
Dan Downing wrote:
> given how perplexing this behavior was, and how suspect it made me that
> perhaps I've been reporting inaccurate response times on *other* projects
> where the CPU was *not* pegged.

Sorry, but I don't follow your logic here. Because the timings you take when the
timing system is overloaded are proven inaccurate, you then suspect that the
timings may be inaccurate when the system is running normally ... ???

I think that any timings you see will be affected by 2 possible overload
problems:
- Processing delays in the time from the script saying 'END TIMER' to when the
  system actually gets around to timestamping the record for the results.
- Processing delays from the actual tasks you are timing taking longer, not
  because the loaded system is slower but because the virtual client can't
  handle its stages in the process as quickly as it should be able to.

Inevitably you get some of both added in to increase any recorded time, but
NOTHING in this would lead me to worry about the same system in a non-overloaded
state. I believe that ANY toolset that measures timing data cannot have that
data trusted when it is running at or near any of its limits.

The problem we need to find for you, though, is WHY the load-generating system
suddenly became overloaded when it should have easily been able to cope with the
tasks at hand. The results after the overload point are worthless and should be
ignored.

Also, comparing timing results between different tools is notoriously difficult
and shouldn't be taken as evidence of anything unless you are 100% sure that
exactly the same tasks were being measured and the start/stop timers were being
activated at exactly the same time by event. Given the known differences between
LR and OpenSTA this is very difficult to achieve - best just compare results to
previous runs of the same test using the same toolset.

Cheers
/dan
--
Daniel Sutcliffe <Da...@Op...>
OpenSTA part-time caretaker - http://OpenSTA.org