From: Christian E. <ehr...@li...> - 2008-04-15 07:34:08
Hollis Blanchard wrote:
> On Monday 14 April 2008 07:23:56 ehr...@li... wrote:
>> From: Christian Ehrhardt <ehr...@li...>
>>
>> This extends the kvm_stat reports for kvm on embedded powerpc. Since kvmppc
>> is using emulation (no hardware support yet) this gives people interested
>> in performance a detailed split of the emul_instruction counter already
>> available. This statistic does not cover e.g. operands of the commands, but
>> that way it should have only a small perf impact (never break what you want
>> to measure). This feature is configurable in .config under the kvmppc
>> virtualization itself.
>
> This array-based approach seems to add a lot of lines of code, and it's also
> copy/paste stuff that is just begging for a typo (e.g. miscounting STHX in
> the STHUX bucket) or being forgotten entirely when adding new emulation code.
>
> A more general approach would be to record just the opcode/extended opcode in
> a variable-sized structure, allocating a new bucket as we encounter a
> previously unseen instruction. I'm thinking of something like this:

I already thought about some bucket-like counting, but I put small runtime
overhead above everything else here by hijacking the already existing
switch/case and avoiding any new ifs and function calls. I agree that the
array-based solution is messy in terms of lines of code and is more error
prone, since entries can easily be forgotten.

> log_instruction(inst) {
>         bucket = hash_u32(inst, 5);
>         list_for_each(l, vcpu->instlog[bucket]) {
>                 struct instlog = list_entry(l);
>                 if (instlog->inst == inst) {
>                         instlog->count++;
>                         break;
>                 }
>         }
> }
>
> emulate(inst) {
>         log = get_op(inst);
>         switch (get_op(inst)) {
>         ...
>         case 31:
>                 log |= get_xop(inst);
>                 switch (inst) {
>                 ...
>                 }
>         }
>         log_instruction(log);
> }

I agree that this looks better, but per emulated instruction it adds (see the
sketch below):
- 1x hash function
- a loop which runs ~1-2 times
- an if (do we actually have long pipelines that suffer from wrong branch
  prediction?)
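Just so we compare the same thing: in real code I picture that logging path
looking roughly like the untested sketch below. All names (kvmppc_instlog,
kvmppc_log_instruction, the 32-entry table) are made-up placeholders, not code
from either patch:

/*
 * Untested sketch only -- structure and function names are placeholders.
 * The caller would own a per-vcpu table of 1 << INSTLOG_HASH_BITS
 * hlist_heads, initialized with INIT_HLIST_HEAD().
 */
#include <linux/hash.h>
#include <linux/list.h>
#include <linux/slab.h>
#include <linux/types.h>

#define INSTLOG_HASH_BITS 5 /* 32 buckets */

struct kvmppc_instlog {
        struct hlist_node node;
        u32 inst;   /* opcode, or'ed with the extended opcode */
        u64 count;
};

static void kvmppc_log_instruction(struct hlist_head *table, u32 inst)
{
        /* 1x hash function to pick the bucket */
        struct hlist_head *bucket = &table[hash_long(inst, INSTLOG_HASH_BITS)];
        struct hlist_node *pos;
        struct kvmppc_instlog *log;

        /* the loop that runs ~1-2 times, plus the compare/branch */
        hlist_for_each(pos, bucket) {
                log = hlist_entry(pos, struct kvmppc_instlog, node);
                if (log->inst == inst) {
                        log->count++;
                        return;
                }
        }

        /* previously unseen instruction: allocate a new entry */
        log = kzalloc(sizeof(*log), GFP_ATOMIC);
        if (!log)
                return; /* drop the sample rather than fail the emulation */
        log->inst = inst;
        log->count = 1;
        hlist_add_head(&log->node, bucket);
}

Not much code, but it is exactly the hash + list walk + compare per emulated
instruction that I would like to keep out of the path that is meant to be the
cheap one.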
When we add more functions here, like mapping sprn's and whatever else comes
to mind later, this might get even more complex, while these statistics should
be the low-overhead ones.

The thing I want to point out is that the tracing with the relay channel also
only has (speaking of the reduced version discussed in the other thread with
only 3 integers) 3 integers to move into a buffer, plus some overhead to
select the right buffer. The userspace application reads 4096 bytes => 341
records per scheduling of the reading app, and we might even increase that
size. I really like the "full trace" feeling of the relay-based patch, so I
don't want to increase the functionality & overhead of this patch, which was
intended to be low-overhead.

As stated above I agree with you about the lines of code and the error
proneness of that patch. What about this:
- for integration into our upstream code only the relay-based approach is
  taken into account, giving a full trace while it has some overhead
- we keep the slight-overhead but error-prone kvm_stat based patch in our
  private queues and only use it for our own measurements, until we either
  need it or change it once we have more experience with the relay-based
  tracing (e.g. know its perf. overhead better)
- that way we would have two very similar solutions that share code and are
  configurable, so developers can use them as they want
- additionally I'm thinking again about Arnd's comments and whether we might
  need/want a runtime switch, e.g. a proc/sys interface to enable/disable the
  tracing functions on the fly

Comments welcome

> It looks like we could build a hash table pretty easily with hash_long() and
> list_entry stuff (see fs/mbcache.c for example). So far you've found 17
> different instruction types emulated in the "boot" workload, so 32 entries
> would probably be a reasonable hash size. The same approach could be used for
> SPR accesses, where you've hit 31 different registers. Really I think those
> numbers won't vary much by workload, but rather by guest kernel...

--
Grüsse / regards,
Christian Ehrhardt

IBM Linux Technology Center, Open Virtualization