## [Lse-tech] Re: [PATCH] 2.6.3-rc3-mm1: sched-group-power

[Lse-tech] Re: [PATCH] 2.6.3-rc3-mm1: sched-group-power From: Nick Piggin - 2004-02-21 01:45:35
```
Rick Lindsley wrote:

>So let me try a diagram.  Each of these groups of numbers represent a
>cpu_group, and the labels to the left are individual sched_domains.
>
>SD1       01234567
>SD2-SD3   0123      4567
>SD4-SD7   01   23   45   67
>SD8-SD15  0 1  2 3  4 5  6 7
>
>Currently, we assume each cpu has a power of 1, so each cpu group in
>domains SD8-SD15 would have a power of 1, each cpu group in SD4-SD7
>would have a power of 2, each of SD2 and SD3 would have a power of 4,
>and collectively, all CPUs as represented in SD1 would have a power of 8.
>Of course, we don't really make use of this assumption but this just
>enumerates our assumption that all nodes, all cpus are created equal.
>

Well we used to sum up the number of CPUs in each group, so it
wasn't quite that bad. We assumed all CPUs are created equal.

>Your new power code would assign each cpu group a static power other
>than this, making SMT pairs, for instance, 1.2 instead of 2.  In the
>case of four siblings, 1.4 instead of 4.  Correct?  In the example above,
>SD2 and SD3 would have a power rating of 2.4, and SD1 would have a power
>rating of 4*1.2 or 4.8, right?
>

Right.

>With your current code, we only consult the power ratings if we've already
>decided that we are currently "balanced enough".
>

Well we do work out the per group loads by dividing with the power
rating instead of cpus-in-the-group too.

> I'd go one step further
>and say that manipulating for power only makes sense if you have an idle
>processor somewhere.  If all processors are busy, then short of some
>quality-of-process assessment, how can you improve power?  (You could
>improve fairness, I suppose, but that would require lots more stats and
>history than we have here.)  If one set of procs is slower than another,
>won't that make itself apparent by a longer queue developing there? (or
>shorter queues forming somewhere else?) and it being load-balanced
>by the existing algorithm?  Seems to me we only need to make power
>decisions when we want to consider an idle processor stealing a task (a
>possibly *running* task) from another processor because this processor
>is faster/stronger/better.
>

Yeah, probably we could change that test to:

	if (*imbalance <= SCHED_LOAD_SCALE / 2 && this_load < SCHED_LOAD_SCALE)

Either way, the calculation should be done in such a way that if
your CPUs are not idle, then it wouldn't predict a performance
increase.

No doubt there is room for improvement, but hopefully it is now
at a "good enough" stage...
```
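The two ideas in this reply (per-group load divided by the power rating, and the proposed small-imbalance test) can be sketched in C roughly as follows. This is a minimal illustration, not the patch itself: the value 128 for `SCHED_LOAD_SCALE`, the helper names `scaled_group_load` and `small_imbalance_with_spare_capacity`, and the assumption that one runnable task contributes `SCHED_LOAD_SCALE` of load are all chosen here for the example.

```c
#include <assert.h>

/* Assumed for this sketch: fixed-point unit, i.e. the power of one full CPU. */
#define SCHED_LOAD_SCALE 128UL

/*
 * Hypothetical helper: a group's load normalised by its power rating
 * rather than by the number of CPUs in the group.
 */
static unsigned long scaled_group_load(unsigned long total_load,
				       unsigned long group_power)
{
	assert(group_power > 0);
	return total_load * SCHED_LOAD_SCALE / group_power;
}

/*
 * Hypothetical helper wrapping the condition quoted above: the
 * imbalance is small and the local group is not fully loaded.
 */
static int small_imbalance_with_spare_capacity(unsigned long imbalance,
					       unsigned long this_load)
{
	return imbalance <= SCHED_LOAD_SCALE / 2 &&
	       this_load < SCHED_LOAD_SCALE;
}
```

With these assumptions, two tasks on two full CPUs (power 256) give a scaled load of 128, while the same two tasks on an SMT pair rated at 1.2 (power 153 after integer truncation) give 214, so the SMT pair looks busier even though both groups contain two CPUs.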

[Lse-tech] Re: [PATCH] 2.6.3-rc3-mm1: sched-group-power From: Rick Lindsley - 2004-02-20 01:26:15
```
Nick, I'm not sure what capability this patch adds .. perhaps some words
of explanation.

So we have SMT/HT situations where we'd prefer to balance across cores;
that is, if 0, 1, 2, and 3 share a core and 4, 5, 6, and 7 share a core,
you'd like two processes to arrange themselves so one is on [0123] and
another is on [4567].  This is what the SD_IDLE flag indicated before.

With this patch, we can "weight" the load imposed by certain cpus, right?
What advantage does this give us?  On a given machine, won't the "weight"
of any one set of SMT siblings and cores be uniform with respect to all
the cores and siblings anyway?

Rick
```
[Lse-tech] Re: [PATCH] 2.6.3-rc3-mm1: sched-group-power From: Rick Lindsley - 2004-02-20 23:56:17
```
So let me try a diagram.  Each of these groups of numbers represent a
cpu_group, and the labels to the left are individual sched_domains.

SD1       01234567
SD2-SD3   0123      4567
SD4-SD7   01   23   45   67
SD8-SD15  0 1  2 3  4 5  6 7

Currently, we assume each cpu has a power of 1, so each cpu group in
domains SD8-SD15 would have a power of 1, each cpu group in SD4-SD7
would have a power of 2, each of SD2 and SD3 would have a power of 4,
and collectively, all CPUs as represented in SD1 would have a power of 8.
Of course, we don't really make use of this assumption but this just
enumerates our assumption that all nodes, all cpus are created equal.

Your new power code would assign each cpu group a static power other
than this, making SMT pairs, for instance, 1.2 instead of 2.  In the
case of four siblings, 1.4 instead of 4.  Correct?  In the example above,
SD2 and SD3 would have a power rating of 2.4, and SD1 would have a power
rating of 4*1.2 or 4.8, right?

With your current code, we only consult the power ratings if we've already
decided that we are currently "balanced enough".  I'd go one step further
and say that manipulating for power only makes sense if you have an idle
processor somewhere.  If all processors are busy, then short of some
quality-of-process assessment, how can you improve power?  (You could
improve fairness, I suppose, but that would require lots more stats and
history than we have here.)  If one set of procs is slower than another,
won't that make itself apparent by a longer queue developing there? (or
shorter queues forming somewhere else?) and it being load-balanced
by the existing algorithm?  Seems to me we only need to make power
decisions when we want to consider an idle processor stealing a task (a
possibly *running* task) from another processor because this processor
is faster/stronger/better.

Rick
```
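The arithmetic in the diagram (1.2 per SMT pair, 2.4 for SD2 and SD3, 4.8 for SD1) can be reproduced with a small fixed-point sketch. Everything here is illustrative: the value 128 for `SCHED_LOAD_SCALE` and the helpers `smt_pair_power` and `group_power_sum` are assumptions made for the example, and the 1.2 figure is simply the one quoted in the message, not a formula taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed for this sketch: fixed-point unit, i.e. the power of one full CPU. */
#define SCHED_LOAD_SCALE 128UL

/*
 * Hypothetical helper: power of an SMT pair, using the 1.2 figure
 * from the message (truncated by integer division).
 */
static unsigned long smt_pair_power(void)
{
	return SCHED_LOAD_SCALE * 12 / 10;
}

/*
 * Hypothetical helper: a larger group's power taken as the sum of the
 * powers of the child groups it spans.
 */
static unsigned long group_power_sum(const unsigned long *child_power,
				     size_t n)
{
	unsigned long sum = 0;
	size_t i;

	assert(child_power != NULL || n == 0);
	for (i = 0; i < n; i++)
		sum += child_power[i];
	return sum;
}
```

Summing two pairs gives the 2.4 rating of a four-CPU group (306 in these units), and summing two such groups gives the 4.8 rating for all eight CPUs (612), compared with 512 and 1024 when every CPU counts as a full 1.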
[Lse-tech] Re: [PATCH] 2.6.3-rc3-mm1: sched-group-power From: Nick Piggin - 2004-02-20 01:44:01
```
Rick Lindsley wrote:

>Nick, I'm not sure what capability this patch adds .. perhaps some words
>of explanation.
>
>So we have SMT/HT situations where we'd prefer to balance across cores;
>that is, if 0, 1, 2, and 3 share a core and 4, 5, 6, and 7 share a core,
>you'd like two processes to arrange themselves so one is on [0123] and
>another is on [4567].  This is what the SD_IDLE flag indicated before.
>
>With this patch, we can "weight" the load imposed by certain cpus, right?
>What advantage does this give us?  On a given machine, won't the "weight"
>of any one set of SMT siblings and cores be uniform with respect to all
>the cores and siblings anyway?
>

It is difficult to propagate the SD_FLAG_IDLE attribute up
multiple domains.

For example, with SMT + CPU + NODE domains you can get into the
following situation:

01, 23 are 4 siblings in 2 cores on node 0,
45, 67 are 4 siblings in 2 cores on node 1.

The top level balancing domain now spans 01234567, and wants to
balance between groups 0123, and 4567. We don't want SD_FLAG_IDLE
semantics here, because that would mean if two tasks were running
on node 0, one would be migrated to node 1. We want to migrate
1 task if one node is idle, and the other has 3 processes running
for example.

Also this copes with siblings becoming much more powerful, or
some groups with SMT turned off, some on (think hotplug cpu),
different speed CPUs, etc.
```
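The node example can be checked with a deliberately simplified throughput model. This is not the calculation used in the patch: it assumes each task contributes `SCHED_LOAD_SCALE` (taken as 128 here) of load, that a group can get through no more work than its power rating, and that each node of two SMT cores is rated 2.4 (306 in these units). The helpers `group_throughput` and `move_one_task_helps` are hypothetical names for this sketch.

```c
#include <assert.h>

/* Assumed for this sketch: fixed-point unit, i.e. the power of one full CPU. */
#define SCHED_LOAD_SCALE 128UL

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* Simplified model: work a group gets through is its load capped by its power. */
static unsigned long group_throughput(unsigned long nr_tasks,
				      unsigned long power)
{
	assert(power > 0);
	return min_ul(nr_tasks * SCHED_LOAD_SCALE, power);
}

/*
 * Returns 1 if moving one task from the busier group to the other
 * group raises the estimated total throughput, 0 otherwise.
 */
static int move_one_task_helps(unsigned long busy_tasks,
			       unsigned long busy_power,
			       unsigned long other_tasks,
			       unsigned long other_power)
{
	unsigned long now, moved;

	if (busy_tasks == 0)
		return 0;
	now = group_throughput(busy_tasks, busy_power) +
	      group_throughput(other_tasks, other_power);
	moved = group_throughput(busy_tasks - 1, busy_power) +
		group_throughput(other_tasks + 1, other_power);
	return moved > now;
}
```

Under this model, two tasks on node 0 with node 1 idle predict no gain from a migration, because each task already has a core to itself, while three tasks on node 0 do predict a gain. At the core level, two tasks sharing one SMT pair rated 1.2 (153) next to an idle pair also predict a gain, which is the case SD_FLAG_IDLE was handling.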