This project demonstrates the Random Pause method of performance tuning. In it we take a program distilled from the author's experience and, over a series of iterations, increase its performance by about 2-3 orders of magnitude.
It is called CIM, for Computer Integrated Manufacturing. Imagine a factory containing a number of workstations, for painting, machining, etc., and a material handling system that moves pieces of work (JOBs) from one workstation to the next. Each JOB consists of a series of batch operations (BOPs), and each BOP consists of a sequence of TASKs to be done at the workstations, interleaved with material handling (MH) steps to move each piece from one workstation to the next. These are done in parallel, to keep the equipment busy. Entry of new JOBs is throttled, so the system doesn't get clogged up with work.
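To make the terminology concrete, here is a minimal sketch of such a data model in C++. The names (Step, Bop, Job, countTasks) are illustrative inventions, not the project's actual classes:

```cpp
#include <string>
#include <vector>

// Hypothetical sketch of the CIM data model described above.
struct Step { bool isMH; std::string workstation; };  // a TASK or an MH move
struct Bop  { std::vector<Step> steps; };             // one batch operation
struct Job  { int id; std::vector<Bop> bops; };       // one piece of work

// Count the workstation TASKs in a job (MH steps excluded).
int countTasks(const Job& job) {
    int n = 0;
    for (const Bop& b : job.bops)
        for (const Step& s : b.steps)
            if (!s.isMH) ++n;
    return n;
}
```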
This program simulates this system. The TASKs and MH movements don't actually happen; they are just enumerated, to simulate the whole system. Each time a JOB finishes, a message "Ack Job xxxx" is printed. Then at the end it prints out the average number of microseconds needed to simulate one job.
Remember, the purpose of this is to demonstrate how to speed up software. Most programs will have less room than this for speedup. Some will have more. Regardless, with this technique any program can be brought near to optimal.
In CIM2, the initial simulation, on my machine, the time is 2700 us/job, or just shy of 3 milliseconds. This is an average taken over 10,000 JOBs, so the actual wall clock time is 27 seconds. During that time, I randomly hit the pause button a few times, and each time I captured the call stack. The stacks I captured are in the ReadMe.txt file in the project. There were 5 stack samples. Looking at the samples, it was evident that 3 of the 5 were in the use of std::vector<T>. Two were from one function, and one was from another. They were all from different lines.
What can you conclude from this? Simply that roughly 60% of the time was going into vector<T> operations.
My first thought for improving that was to switch to a different vector class, so CIM3 was the result of using CTypedPtrArray instead of vector. CIM3 achieves 1800 usec/job, roughly a 50% speedup, or a factor of 1.5.
Then I did the same thing to CIM3, taking 10 stack samples. 4 of them (roughly 40%) were spent calling the indexing operator on CTypedPtrArray, even though they were all in different places. That suggests doing array indexing directly, not going through the indexing operator function. (It's fair to say that if I were doing a Release build, those calls would be inlined. The problem is that further sampling would be hard with a Release build, so I stayed with Debug. Clearly that's just a matter of strategy.)
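To illustrate the change, here is the idea in miniature, using std::vector as a stand-in for CTypedPtrArray (which is MFC-specific): fetch the underlying raw pointer once and index it directly, rather than calling an indexing function on every access. Both function names are hypothetical:

```cpp
#include <cstddef>
#include <vector>

// One indexing-function call per access (not inlined in a Debug build).
int sumViaOperator(const std::vector<int>& a) {
    int s = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        s += a[i];              // operator[] call on every iteration
    return s;
}

// The raw pointer is fetched once; each access is plain pointer arithmetic.
int sumDirect(const std::vector<int>& a) {
    const int* p = a.data();
    std::size_t n = a.size();
    int s = 0;
    for (std::size_t i = 0; i < n; ++i)
        s += p[i];              // no function call per access
    return s;
}
```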
So I did that, and that was CIM4. Now the performance was 1500 usec/job. That eliminated 300 out of 1800 usec, or about a 1/6 reduction in time. That's not as large as the roughly 40% that might have been eliminated, but I'll take it. That's 1800/1500 = 1.2, or a 20% speedup.
Again I took 10 samples. 5 of them were in new/delete. 3 of them were in adding/removing items in the arrays. I thought I could save the new/delete time by keeping pools of used objects, and the add/remove time by switching to do-it-yourself linked lists rather than arrays. The result was CIM5. I actually did it in two stages. First I made the do-it-yourself linked lists, and the time dropped to 1300 usec/job. That's 1500/1300 = 1.15, or a 15% speedup. In the second stage I implemented the pooled objects. Then the time dropped to 440 usec/job.
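The pooling idea can be sketched as a simple free list: released objects are chained onto the pool and handed back on the next acquire, so new is only hit when the pool is empty. Node and NodePool are illustrative names, not the project's classes:

```cpp
// Minimal free-list object pool, assuming a singly-linked node type.
struct Node { int value; Node* next; };

class NodePool {
    Node* free_ = nullptr;   // chain of recycled nodes
public:
    Node* acquire(int v) {
        Node* n;
        if (free_) { n = free_; free_ = free_->next; }  // reuse a node
        else       { n = new Node; }                    // pool empty: allocate
        n->value = v;
        n->next = nullptr;
        return n;
    }
    void release(Node* n) {  // push back onto the free list; no delete
        n->next = free_;
        free_ = n;
    }
};
```

Released nodes are never returned to the heap during the run, which is exactly the point: the new/delete cost shows up only the first time each node is created.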
That's 1300/440 = 2.95 = 195% speedup! (Overall now, the speedup is 2700/440 = 6.14)
Again, I took 5 samples. Three of them were in a macro called NTH that I had written for extracting the nth element of a linked list. That suggested keeping a pointer into the list, rather than keeping an index into it as if it were an array. The pointer only needs to be advanced one step at a time along the list. No more NTH.
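The difference can be sketched as follows; Node, nth, and sumWithCursor are hypothetical stand-ins for the project's own types and the NTH macro:

```cpp
// Hypothetical list node, standing in for the project's own types.
struct Node { int value; Node* next; };

// What an NTH(list, i) macro effectively does: walk from the head on
// every call, so visiting all n elements by index costs O(n^2) total.
Node* nth(Node* head, int i) {
    while (i-- > 0) head = head->next;
    return head;
}

// The fix: keep a cursor into the list and advance it one step per
// visit, so a full traversal is O(n).
int sumWithCursor(Node* head) {
    int s = 0;
    for (Node* cur = head; cur; cur = cur->next)
        s += cur->value;
    return s;
}
```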
The result was CIM6, with 140 usec/job. That's 440/140 = 3.14. (Overall speedup: 2700/140 = 19.3)
Now I took 4 samples, and every single one was somewhere in the "cout" call. So I thought I would see what would happen if I simply suppressed the output. When I did that, the time dropped to 3.7 usec/job. Now the overall speedup is 2700/3.7 = 730 !!
I could have kept going, but I stopped there.
What's the lesson to learn? This is a totally manual method, and it does not require fancy measurement. It does require reading the samples and understanding them. The profilers I've tried have been little or no help in suggesting to me what might profitably be changed. They seem to have the idea that performance opportunities are localized to specific routines, and if you agree with that, you would not have made more than a fraction of the changes that were made here. So if you are fond of a profiler, make sure it finds you enough compounded speedups to earn its keep.