|
From: Rik v. R. <ri...@re...> - 2011-01-21 00:18:03
|
On 01/07/2011 05:03 PM, Satoru Moriya wrote:

> The result is as follows.
>
>                   | default | case 1 | case 2 |
> ----------------------------------------------------------
> wmark_min_kbytes  |    5752 |   5752 |   5752 |
> wmark_low_kbytes  |    7190 |  16384 |  32768 | (KB)
> wmark_high_kbytes |    8628 |  20480 |  40960 |
> ----------------------------------------------------------
> real              |     503 |    364 |    337 |
> user              |       3 |      5 |      4 | (msec)
> sys               |     153 |    149 |    146 |
> ----------------------------------------------------------
> page fault        |   32768 |  32768 |  32768 |
> kswapd_wakeup     |    1809 |    335 |    228 | (times)
> direct reclaim    |       5 |      0 |      0 |
>
> As you can see, direct reclaim was performed 5 times and
> its exec time was 503 msec in the default case. On the other
> hand, in case 1 (large delta case) no direct reclaim was
> performed and its exec time was 364 msec.

Saving 1.5 seconds on a one-off workload is probably not
worth the complexity of giving a system administrator
yet another set of tunables to mess with.

However, I suspect it may be a good idea if the kernel
could adjust these watermarks automatically, since direct
reclaim could lead to quite a big performance penalty.

I do not know which events should be used to increase and
decrease the watermarks, but I have some ideas:
- direct reclaim (increase)
- kswapd has trouble freeing pages (increase)
- kswapd frees enough memory at DEF_PRIORITY (decrease)
- next to no direct reclaim events in the last N (1000?)
  reclaim events (decrease)

I guess we will also need to be sure that the watermarks
are never raised above some sane upper threshold. Maybe
4x or 5x the default?

--
All rights reversed
|
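The heuristic outlined in this message could be sketched roughly as follows. This is only an illustration under stated assumptions, not kernel code: the event names, the step size (a quarter of the default) and the helper `adjust_low_kb()` are all hypothetical, and the 4x cap is just the lower of the two bounds floated above.

```c
#include <assert.h>

/* Hypothetical event classes, taken from the ideas listed in the mail. */
enum reclaim_event {
	EV_DIRECT_RECLAIM,      /* direct reclaim happened: increase */
	EV_KSWAPD_STRUGGLING,   /* kswapd has trouble freeing pages: increase */
	EV_KSWAPD_EASY,         /* kswapd freed enough at DEF_PRIORITY: decrease */
	EV_NO_DIRECT_RECLAIM    /* next to no direct reclaim lately: decrease */
};

#define WMARK_MAX_SCALE 4       /* assumption: cap at 4x the default */

/*
 * Return the new low watermark (in KB) after one event.
 * The result always stays within [default_kb, WMARK_MAX_SCALE * default_kb].
 */
static unsigned long adjust_low_kb(unsigned long cur_kb,
				   unsigned long default_kb,
				   enum reclaim_event ev)
{
	unsigned long step = default_kb / 4;    /* assumption: arbitrary step */
	unsigned long max_kb = default_kb * WMARK_MAX_SCALE;

	assert(cur_kb >= default_kb);

	switch (ev) {
	case EV_DIRECT_RECLAIM:
	case EV_KSWAPD_STRUGGLING:
		cur_kb += step;
		if (cur_kb > max_kb)
			cur_kb = max_kb;
		break;
	case EV_KSWAPD_EASY:
	case EV_NO_DIRECT_RECLAIM:
		if (cur_kb > default_kb + step)
			cur_kb -= step;
		else
			cur_kb = default_kb;
		break;
	}
	return cur_kb;
}
```

Starting from the default low watermark of 7190 KB, a direct reclaim event would raise it by 1797 KB, and repeated events would stop at 28760 KB. Which events to count, and over what window, is exactly the open question raised above.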
|
From: David R. <rie...@go...> - 2011-01-13 22:25:02
|
On Thu, 13 Jan 2011, Satoru Moriya wrote:

> Currently watermark[low,high] are set by following calculation (lowmem case).
>
> watermark[low] = watermark[min] * 1.25
> watermark[high] = watermark[min] * 1.5
>
> So the difference between watermarks are following:
>
> min <-- min/4 --> low <-- min/4 --> high
>
> I think the differences, "min/4", are too small in my case.
> Of course I can make them bigger if I set min_free_kbytes to bigger value.
> But it means kernel keeps more free memory for PF_MEMALLOC case unnecessarily.
>
> So I suggest changing coefficients(1.25, 1.5). Also it's better
> to make them accessible from user space to tune in response to application
> requirements.
>

Userspace can't possibly be held responsible for tuning internal VM
parameters in response to certain workloads like this; if you have
evidence that different coefficients work better in different
circumstances, then present the criteria for which you intend to change
them from the command line via your new tunables and let's work to make
the VM more extendable to serve those workloads well. This should be
done by showing how background reclaim is ineffective, we enter direct
compaction or reclaim too aggressively, we don't wait for writeout long
enough, we prematurely kill applications when unnecessary, etc, which
would undoubtedly have if you're going to make any sane adjustments via
these new tunables.
|
|
From: David R. <rie...@go...> - 2011-01-13 22:20:43
|
On Thu, 13 Jan 2011, Satoru Moriya wrote:

> > You didn't mention why it wouldn't be possible to modify
> > setup_per_zone_wmarks() in some way for your configuration so this happens
> > automatically. If you can find a deterministic way to set these
> > watermarks from userspace, you should be able to do it in the kernel as
> > well based on the configuration.
>
> Do you mean that we should introduce a mechanism into kernel that changes
> watermarks dynamically depending on its loads (such as cpu frequency control)
> or we should change the calculation method in setup_per_zone_wmarks()?
>

The watermarks you're exposing through this patchset to userspace for
the first time are meant to be internal to the VM. Userspace is not
intended to manipulate them in an effort to cover-up deficiencies within
the memory manager itself.

If you have actual cases where tuning the watermarks from userspace is
helpful, then it logically means:

 - the VM is acting incorrectly in response to situations where it
   approaches the tunable min watermark (all watermarks are a function
   of the min watermark) which shouldn't be representative in just a
   handful of cases, and

 - you can deterministically do the same calculation within the kernel
   itself.

I'm skeptical that any tuning is actually helpful to your workload that
doesn't also indicate a problem internal to the VM itself. I think what
would be more helpful is if you would show how the watermarks currently
don't trigger fast enough (or aggressive enough) and then address the
issue in the kernel itself so everyone can benefit from your work,
whether that's adjusting where the watermarks are based on external
factors or whether the semantics of those watermarks are to slightly
change.
|
|
From: Satoru M. <sat...@hd...> - 2011-01-13 22:09:08
|
Hi David,

Thank you for your comments.

On 01/07/2011 05:23 PM, David Rientjes wrote:
> On Fri, 7 Jan 2011, Satoru Moriya wrote:
>>
>> [Problem]
>> The thresholds kswapd/direct reclaim starts(ends) depend on
>> watermark[min,low,high] and currently all watermarks are set
>> based on min_free_kbytes. min_free_kbytes is the amount of
>> free memory that Linux VM should keep at least.
>>
>
> Not completely, it also depends on the amount of lowmem (because of the
> reserve setup next) and the amount of memory in each zone.

Right. Thanks.

>> [Solution]
>> To avoid the situation above, this patch set introduces new
>> tunables /proc/sys/vm/wmark_min_kbytes, wmark_low_kbytes and
>> wmark_high_kbytes. Each entry controls watermark[min],
>> watermark[low] and watermark[high] separately.
>> By using these parameters one can make the difference between
>> min and low bigger than the amount of memory which applications
>> require.
>>
>
> I really dislike this because it adds additional tunables that should
> already be handled correctly by the VM and it's very difficult for users
> to know what to tune these values to; these watermarks (with the exception
> of min) are supposed to be internal to the VM implementation.

The patchset targeted enterprise system and in that area users expect
that they can tune the system by themselves to fulfill their
requirements.

> You didn't mention why it wouldn't be possible to modify
> setup_per_zone_wmarks() in some way for your configuration so this happens
> automatically. If you can find a deterministic way to set these
> watermarks from userspace, you should be able to do it in the kernel as
> well based on the configuration.

Do you mean that we should introduce a mechanism into kernel that
changes watermarks dynamically depending on its loads (such as cpu
frequency control) or we should change the calculation method in
setup_per_zone_wmarks()?

I think it is difficult to control watermarks automatically in kernel
because required memory varies widely among applications. On the other
hand, sysctl parameters help us fit the kernel to each system's
requirement flexibly.

> I think we should invest time in making sure the VM works for any type of
> workload thrown at it instead of relying on userspace making lots of
> adjustments.
|
|
From: Satoru M. <sat...@hd...> - 2011-01-13 22:08:59
|
On 01/07/2011 05:39 PM, David Rientjes wrote:
> The semantics of any watermark is to trigger events to happen at a
> specific level, so they should be static with respect to a frame of
> reference (which in the VM case is the min watermark with respect to the
> size of the zone). If you're going to adjust the min watermark, it's then
> _mandatory_ to adjust the others to that frame of reference, you shouldn't
> need to tune them independently.

Currently watermark[low,high] are set by following calculation (lowmem case).

watermark[low] = watermark[min] * 1.25
watermark[high] = watermark[min] * 1.5

So the difference between watermarks are following:

min <-- min/4 --> low <-- min/4 --> high

I think the differences, "min/4", are too small in my case.
Of course I can make them bigger if I set min_free_kbytes to bigger value.
But it means kernel keeps more free memory for PF_MEMALLOC case unnecessarily.

So I suggest changing coefficients(1.25, 1.5). Also it's better
to make them accessible from user space to tune in response to application
requirements.

> The problem that Satoru is reporting probably has nothing to do with the
> watermarks themselves but probably requires more aggressive action by
> kswapd and/or memory compaction.

More aggressive action may reduce the possibility of the problem reported.
But we can't avoid the problem completely because applications may
allocate/access faster than reclaiming/compaction.
|
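For reference, the default calculation described in this message can be written out as a small standalone sketch. The struct and function names here are hypothetical; only the coefficients (1.25 and 1.5, expressed as shifts) come from the thread, and the real kernel works per zone in pages rather than on a single KB total.

```c
#include <assert.h>

/* Hypothetical container for the three watermarks, in KB. */
struct wmarks_kb {
	unsigned long min;
	unsigned long low;
	unsigned long high;
};

/* Derive low and high from min: low = min * 1.25, high = min * 1.5. */
static struct wmarks_kb default_wmarks_kb(unsigned long min_kb)
{
	struct wmarks_kb w;

	w.min  = min_kb;
	w.low  = min_kb + (min_kb >> 2);        /* min + min/4 */
	w.high = min_kb + (min_kb >> 1);        /* min + min/2 */
	return w;
}

/* Gap between two adjacent watermarks; min/4 with the defaults above. */
static unsigned long wmark_gap_kb(unsigned long lower_kb,
				  unsigned long upper_kb)
{
	assert(upper_kb >= lower_kb);
	return upper_kb - lower_kb;
}
```

With min = 5752 KB this gives low = 7190 KB and high = 8628 KB, so both gaps are 1438 KB, which is the "min/4" spacing being discussed.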
|
From: David R. <rie...@go...> - 2011-01-07 22:39:36
|
On Fri, 7 Jan 2011, Ying Han wrote:

> On the other hand, having the low/high wmark consider more characters
> other than the
> size of the zone sounds useful.

The semantics of any watermark is to trigger events to happen at a
specific level, so they should be static with respect to a frame of
reference (which in the VM case is the min watermark with respect to the
size of the zone). If you're going to adjust the min watermark, it's then
_mandatory_ to adjust the others to that frame of reference, you shouldn't
need to tune them independently.

The problem that Satoru is reporting probably has nothing to do with the
watermarks themselves but probably requires more aggressive action by
kswapd and/or memory compaction.
|
|
From: Ying H. <yi...@go...> - 2011-01-07 22:35:42
|
On Fri, Jan 7, 2011 at 2:23 PM, David Rientjes <rie...@go...> wrote:
> On Fri, 7 Jan 2011, Satoru Moriya wrote:
>
>> This patchset introduces a new knob to control each watermark
>> separately.
>>
>> [Purpose]
>> To control the timing at which kswapd/direct reclaim starts(ends)
>> based on memory pressure and/or application characteristics
>> because direct reclaim makes a memory alloc/access latency worse.
>> (We'd like to avoid direct reclaim to keep latency low even if
>> under the high memory pressure.)
>>
>> [Problem]
>> The thresholds kswapd/direct reclaim starts(ends) depend on
>> watermark[min,low,high] and currently all watermarks are set
>> based on min_free_kbytes. min_free_kbytes is the amount of
>> free memory that Linux VM should keep at least.
>>
>
> Not completely, it also depends on the amount of lowmem (because of the
> reserve setup next) and the amount of memory in each zone.
>
>> This means the difference between thresholds at which kswapd
>> starts and direct reclaim starts depends on the amount of free
>> memory.
>>
>> On the other hand, the amount of required memory depends on
>> applications. Therefore when it allocates/access memory more
>> than the difference between watermark[low] and watermark[min],
>> kernel sometimes runs direct reclaim before allocation and
>> it makes application latency bigger.
>>
>> [Solution]
>> To avoid the situation above, this patch set introduces new
>> tunables /proc/sys/vm/wmark_min_kbytes, wmark_low_kbytes and
>> wmark_high_kbytes. Each entry controls watermark[min],
>> watermark[low] and watermark[high] separately.
>> By using these parameters one can make the difference between
>> min and low bigger than the amount of memory which applications
>> require.
>>
>
> I really dislike this because it adds additional tunables that should
> already be handled correctly by the VM and it's very difficult for users
> to know what to tune these values to; these watermarks (with the exception
> of min) are supposed to be internal to the VM implementation.
>
> You didn't mention why it wouldn't be possible to modify
> setup_per_zone_wmarks() in some way for your configuration so this happens
> automatically. If you can find a deterministic way to set these
> watermarks from userspace, you should be able to do it in the kernel as
> well based on the configuration.
>
> I think we should invest time in making sure the VM works for any type of
> workload thrown at it instead of relying on userspace making lots of
> adjustments.

I agree in general that adding the APIs to each wmarks sounds like a
over-kill, and hard for user to configure most of the time.

On the other hand, having the low/high wmark consider more characters
other than the size of the zone sounds useful. But I am not sure how to
approach that entirely in the kernel if we like the reclaim behavior to
be reflected from the different workload.

--Ying
|
|
From: David R. <rie...@go...> - 2011-01-07 22:28:02
|
On Fri, 7 Jan 2011, Satoru Moriya wrote:

> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index 30289fa..e10b279 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -349,7 +349,8 @@ min_free_kbytes:
>
>  This is used to force the Linux VM to keep a minimum number
>  of kilobytes free. The VM uses this number to compute a
> -watermark[WMARK_MIN] value for each lowmem zone in the system.
> +watermark[WMARK_MIN] for each lowmem zone and
> +watermark[WMARK_LOW/WMARK_HIGH] for each zone in the system.
>  Each lowmem zone gets a number of reserved free pages based
>  proportionally on its size.
>

WMARK_MIN is changed for all zones.
|
|
From: David R. <rie...@go...> - 2011-01-07 22:24:12
|
On Fri, 7 Jan 2011, Satoru Moriya wrote:

> This patchset introduces a new knob to control each watermark
> separately.
>
> [Purpose]
> To control the timing at which kswapd/direct reclaim starts(ends)
> based on memory pressure and/or application characteristics
> because direct reclaim makes a memory alloc/access latency worse.
> (We'd like to avoid direct reclaim to keep latency low even if
> under the high memory pressure.)
>
> [Problem]
> The thresholds kswapd/direct reclaim starts(ends) depend on
> watermark[min,low,high] and currently all watermarks are set
> based on min_free_kbytes. min_free_kbytes is the amount of
> free memory that Linux VM should keep at least.
>

Not completely, it also depends on the amount of lowmem (because of the
reserve setup next) and the amount of memory in each zone.

> This means the difference between thresholds at which kswapd
> starts and direct reclaim starts depends on the amount of free
> memory.
>
> On the other hand, the amount of required memory depends on
> applications. Therefore when it allocates/access memory more
> than the difference between watermark[low] and watermark[min],
> kernel sometimes runs direct reclaim before allocation and
> it makes application latency bigger.
>
> [Solution]
> To avoid the situation above, this patch set introduces new
> tunables /proc/sys/vm/wmark_min_kbytes, wmark_low_kbytes and
> wmark_high_kbytes. Each entry controls watermark[min],
> watermark[low] and watermark[high] separately.
> By using these parameters one can make the difference between
> min and low bigger than the amount of memory which applications
> require.
>

I really dislike this because it adds additional tunables that should
already be handled correctly by the VM and it's very difficult for users
to know what to tune these values to; these watermarks (with the exception
of min) are supposed to be internal to the VM implementation.

You didn't mention why it wouldn't be possible to modify
setup_per_zone_wmarks() in some way for your configuration so this happens
automatically. If you can find a deterministic way to set these
watermarks from userspace, you should be able to do it in the kernel as
well based on the configuration.

I think we should invest time in making sure the VM works for any type of
workload thrown at it instead of relying on userspace making lots of
adjustments.
|
|
From: Satoru M. <sat...@hd...> - 2011-01-07 22:09:37
|
This patch introduces three new sysctls to /proc/sys/vm:
wmark_min_kbytes, wmark_low_kbytes and wmark_high_kbytes.
Each entry is used to compute watermark[min], watermark[low]
and watermark[high] for each zone.
These parameters are also updated when min_free_kbytes is
changed, because they are initially derived from min_free_kbytes.
Conversely, min_free_kbytes is updated when wmark_min_kbytes
changes.
By using the parameters one can adjust the difference among
watermark[min], watermark[low] and watermark[high] and as a result
one can tune the kernel reclaim behaviour to fit their requirement.
Signed-off-by: Satoru Moriya <sat...@hd...>
---
Documentation/sysctl/vm.txt | 37 +++++++++++++++
include/linux/mmzone.h | 6 ++
kernel/sysctl.c | 28 +++++++++++-
mm/page_alloc.c | 109 +++++++++++++++++++++++++++++++++++++++++++
4 files changed, 179 insertions(+), 1 deletions(-)
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index e10b279..674681d 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -55,6 +55,9 @@ Currently, these files are in /proc/sys/vm:
- stat_interval
- swappiness
- vfs_cache_pressure
+- wmark_high_kbytes
+- wmark_low_kbytes
+- wmark_min_kbytes
- zone_reclaim_mode
==============================================================
@@ -360,6 +363,8 @@ become subtly broken, and prone to deadlock under high loads.
Setting this too high will OOM your machine instantly.
+This is also updated when wmark_min_kbytes changes.
+
=============================================================
min_slab_ratio:
@@ -664,6 +669,38 @@ causes the kernel to prefer to reclaim dentries and inodes.
==============================================================
+wmark_high_kbytes
+
+Contains the amount of free memory above which kswapd stops reclaiming pages.
+
+The Linux VM uses this number to compute a watermark[WMARK_HIGH] value for
+each zone in the system. This is also updated when min_free_kbytes is updated.
+The minimum is wmark_low_kbytes.
+
+==============================================================
+
+wmark_low_kbytes
+
+Contains the amount of free memory below which kswapd starts to reclaim pages.
+
+The Linux VM uses this number to compute a watermark[WMARK_LOW] value for
+each zone in the system. This is also updated when min_free_kbytes changes.
+The minimum is wmark_min_kbytes and maximum is wmark_high_kbytes.
+
+==============================================================
+
+wmark_min_kbytes
+
+Contains the amount of minimum free memory which Linux VM keep. If the amount
+of free memory is less than it, the VM reclaims memory first and then
+allocates (except PF_MEMALLOC allocations).
+
+The Linux VM uses this number to compute a watermark[WMARK_MIN] value for
+each lowmem zone in the system. This is also updated when min_free_kbytes is
+updated. The minimum is 0 and maximum is wmark_low_kbytes.
+
+==============================================================
+
zone_reclaim_mode:
Zone_reclaim_mode allows someone to set more or less aggressive approaches to
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 39c24eb..d2f4b40 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -771,6 +771,12 @@ int min_free_kbytes_sysctl_handler(struct ctl_table *, int,
extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1];
int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
+int wmark_min_kbytes_sysctl_handler(struct ctl_table *, int,
+ void __user *, size_t *, loff_t *);
+int wmark_low_kbytes_sysctl_handler(struct ctl_table *, int,
+ void __user *, size_t *, loff_t *);
+int wmark_high_kbytes_sysctl_handler(struct ctl_table *, int,
+ void __user *, size_t *, loff_t *);
int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index ae5cbb1..060244d 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -94,6 +94,7 @@ extern char core_pattern[];
extern unsigned int core_pipe_limit;
extern int pid_max;
extern int min_free_kbytes;
+extern int wmark_min_kbytes, wmark_low_kbytes, wmark_high_kbytes;
extern int pid_max_min, pid_max_max;
extern int sysctl_drop_caches;
extern int percpu_pagelist_fraction;
@@ -1326,7 +1327,32 @@ static struct ctl_table vm_table[] = {
.extra2 = &one,
},
#endif
-
+ {
+ .procname = "wmark_min_kbytes",
+ .data = &wmark_min_kbytes,
+ .maxlen = sizeof(wmark_min_kbytes),
+ .mode = 0644,
+ .proc_handler = wmark_min_kbytes_sysctl_handler,
+ .extra1 = &zero,
+ .extra2 = &wmark_low_kbytes,
+ },
+ {
+ .procname = "wmark_low_kbytes",
+ .data = &wmark_low_kbytes,
+ .maxlen = sizeof(wmark_low_kbytes),
+ .mode = 0644,
+ .proc_handler = wmark_low_kbytes_sysctl_handler,
+ .extra1 = &wmark_min_kbytes,
+ .extra2 = &wmark_high_kbytes,
+ },
+ {
+ .procname = "wmark_high_kbytes",
+ .data = &wmark_high_kbytes,
+ .maxlen = sizeof(wmark_high_kbytes),
+ .mode = 0644,
+ .proc_handler = wmark_high_kbytes_sysctl_handler,
+ .extra1 = &wmark_low_kbytes,
+ },
/*
* NOTE: do not add new entries to this table unless you have read
* Documentation/sysctl/ctl_unnumbered.txt
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ff7e158..7cd9cbf 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -172,6 +172,9 @@ static char * const zone_names[MAX_NR_ZONES] = {
};
int min_free_kbytes = 1024;
+int wmark_min_kbytes = 1024;
+int wmark_low_kbytes = 1024;
+int wmark_high_kbytes = 1024;
static unsigned long __meminitdata nr_kernel_pages;
static unsigned long __meminitdata nr_all_pages;
@@ -4926,10 +4929,77 @@ void setup_per_zone_wmarks(void)
spin_unlock_irqrestore(&zone->lock, flags);
}
+ wmark_min_kbytes = min_free_kbytes;
+ wmark_low_kbytes = min_free_kbytes + (min_free_kbytes >> 2);
+ wmark_high_kbytes = min_free_kbytes + (min_free_kbytes >> 1);
+
/* update totalreserve_pages */
calculate_totalreserve_pages();
}
+/**
+ * setup_per_zone_wmark - called when wmark_{min|low|high}_kbytes changes
+ *
+ * The watermark[min,low,high] values for each zone are set with respect
+ * to wmark_min_kbytes, wmark_low_kbytes and wmark_high_kbytes.
+ */
+void setup_per_zone_wmark(int wmark)
+{
+ unsigned long pages;
+ unsigned long lowmem_pages = 0;
+ struct zone *zone;
+ unsigned long flags;
+
+ switch (wmark) {
+ case WMARK_MIN:
+ pages = wmark_min_kbytes >> (PAGE_SHIFT - 10);
+ min_free_kbytes = wmark_min_kbytes;
+ break;
+ case WMARK_LOW:
+ pages = wmark_low_kbytes >> (PAGE_SHIFT - 10);
+ break;
+ case WMARK_HIGH:
+ pages = wmark_high_kbytes >> (PAGE_SHIFT - 10);
+ break;
+ default:
+ return;
+ }
+
+ /* Calculate total number of !ZONE_HIGHMEM pages */
+ for_each_zone(zone) {
+ if (!is_highmem(zone))
+ lowmem_pages += zone->present_pages;
+ }
+
+ for_each_zone(zone) {
+ u64 tmp;
+
+ spin_lock_irqsave(&zone->lock, flags);
+ tmp = (u64)pages * zone->present_pages;
+ do_div(tmp, lowmem_pages);
+
+ if (wmark == WMARK_MIN && is_highmem(zone)) {
+ int min_pages;
+
+ min_pages = zone->present_pages / 1024;
+ if (min_pages < SWAP_CLUSTER_MAX)
+ min_pages = SWAP_CLUSTER_MAX;
+ if (min_pages > 128)
+ min_pages = 128;
+ zone->watermark[wmark] = min_pages;
+ } else {
+ zone->watermark[wmark] = tmp;
+ }
+
+ if (wmark == WMARK_MIN)
+ setup_zone_migrate_reserve(zone);
+ spin_unlock_irqrestore(&zone->lock, flags);
+ }
+
+ if (wmark == WMARK_HIGH)
+ calculate_totalreserve_pages();
+}
+
/*
* The inactive anon list should be small enough that the VM never has to
* do too much work, but large enough that each inactive page has a chance
@@ -5029,6 +5099,45 @@ int min_free_kbytes_sysctl_handler(ctl_table *table, int write,
return 0;
}
+int wmark_min_kbytes_sysctl_handler(ctl_table *table, int write,
+ void __user *buffer, size_t *length, loff_t *ppos)
+{
+ int ret;
+
+ ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
+ if (ret < 0 || !write)
+ return ret;
+
+ setup_per_zone_wmark(WMARK_MIN);
+ return ret;
+}
+
+int wmark_low_kbytes_sysctl_handler(ctl_table *table, int write,
+ void __user *buffer, size_t *length, loff_t *ppos)
+{
+ int ret;
+
+ ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
+ if (ret < 0 || !write)
+ return ret;
+
+ setup_per_zone_wmark(WMARK_LOW);
+ return ret;
+}
+
+int wmark_high_kbytes_sysctl_handler(ctl_table *table, int write,
+ void __user *buffer, size_t *length, loff_t *ppos)
+{
+ int ret;
+
+ ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
+ if (ret < 0 || !write)
+ return ret;
+
+ setup_per_zone_wmark(WMARK_HIGH);
+ return ret;
+}
+
#ifdef CONFIG_NUMA
int sysctl_min_unmapped_ratio_sysctl_handler(ctl_table *table, int write,
void __user *buffer, size_t *length, loff_t *ppos)
--
1.7.1
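The core arithmetic of setup_per_zone_wmark() in the patch above, converting a KB value to pages and then giving each zone a share proportional to its size, can be tried out in userspace with a sketch like the one below. It is a simplified illustration: `EXAMPLE_PAGE_SHIFT` (4 KiB pages) and both helper names are assumptions, plain 64-bit division stands in for do_div(), and the highmem special case and locking are left out.

```c
#include <assert.h>
#include <stdint.h>

#define EXAMPLE_PAGE_SHIFT 12   /* assumption: 4 KiB pages */

/* KB -> pages, the same shift the patch uses: kbytes >> (PAGE_SHIFT - 10). */
static unsigned long kbytes_to_pages(unsigned long kbytes)
{
	return kbytes >> (EXAMPLE_PAGE_SHIFT - 10);
}

/*
 * Share of a system-wide watermark assigned to one zone:
 * pages * zone_present_pages / lowmem_pages, computed in 64 bits
 * so that the multiplication cannot overflow on 32-bit systems.
 */
static unsigned long zone_wmark_pages(unsigned long wmark_kbytes,
				      unsigned long zone_present_pages,
				      unsigned long lowmem_pages)
{
	uint64_t tmp;

	assert(lowmem_pages > 0);
	tmp = (uint64_t)kbytes_to_pages(wmark_kbytes) * zone_present_pages;
	return (unsigned long)(tmp / lowmem_pages);
}
```

For example, 5752 KB is 1438 pages with 4 KiB pages, and a zone holding half of the lowmem pages would receive a watermark of 719 pages.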
|
|
From: Satoru M. <sat...@hd...> - 2011-01-07 22:07:37
|
Document that changing min_free_kbytes affects not only
watermark[min] but also watermark[low,high].
Signed-off-by: Satoru Moriya <sat...@hd...>
---
Documentation/sysctl/vm.txt | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 30289fa..e10b279 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -349,7 +349,8 @@ min_free_kbytes:
 This is used to force the Linux VM to keep a minimum number
 of kilobytes free. The VM uses this number to compute a
-watermark[WMARK_MIN] value for each lowmem zone in the system.
+watermark[WMARK_MIN] for each lowmem zone and
+watermark[WMARK_LOW/WMARK_HIGH] for each zone in the system.
 Each lowmem zone gets a number of reserved free pages based
 proportionally on its size.
--
1.7.1
|
|
From: Satoru M. <sat...@hd...> - 2011-01-07 22:07:09
|
This patchset introduces a new knob to control each watermark
separately.
[Purpose]
To control the timing at which kswapd/direct reclaim starts(ends)
based on memory pressure and/or application characteristics
because direct reclaim makes a memory alloc/access latency worse.
(We'd like to avoid direct reclaim to keep latency low even if
under the high memory pressure.)
[Problem]
The thresholds kswapd/direct reclaim starts(ends) depend on
watermark[min,low,high] and currently all watermarks are set
based on min_free_kbytes. min_free_kbytes is the amount of
free memory that Linux VM should keep at least.
This means the difference between thresholds at which kswapd
starts and direct reclaim starts depends on the amount of free
memory.
On the other hand, the amount of required memory depends on
applications. Therefore when an application allocates/accesses more
memory than the difference between watermark[low] and watermark[min],
the kernel sometimes runs direct reclaim before the allocation, which
increases application latency.
[Solution]
To avoid the situation above, this patch set introduces new
tunables /proc/sys/vm/wmark_min_kbytes, wmark_low_kbytes and
wmark_high_kbytes. Each entry controls watermark[min],
watermark[low] and watermark[high] separately.
By using these parameters one can make the difference between
min and low bigger than the amount of memory which applications
require.
[Example]
This is an example of the problem and solution above.
- System Memory: 2GB
- High memory pressure
In this case, min_free_kbytes and watermarks are automatically
set as follows.
(Here, each watermark value is the sum of the per-zone watermarks.)
min_free_kbytes: 5752
watermark[min] : 5752
watermark[low] : 7190
watermark[high]: 8628
If an application allocates/accesses 2000 kbytes of memory (more
than 1438 (= 7190 - 5752)), direct reclaim may occur.
With this patch, one can set watermark[low] to more than 7752,
which makes the difference between min and low bigger than 2000.
This avoids direct reclaim without changing watermark[min].
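The arithmetic in this example can be checked with a small sketch. It is a deliberately simplified model, and the helper names are hypothetical: it only compares an allocation burst against the gap between watermark[low] and watermark[min], and ignores whatever kswapd manages to free while the burst is in progress.

```c
#include <assert.h>
#include <stdbool.h>

/* Headroom (KB) between the kswapd wake-up point and the min watermark. */
static unsigned long low_min_gap_kb(unsigned long min_kb,
				    unsigned long low_kb)
{
	assert(low_kb >= min_kb);
	return low_kb - min_kb;
}

/*
 * True if a burst of burst_kb, starting with free memory at
 * watermark[low], stays above watermark[min] even when kswapd
 * frees nothing in the meantime.
 */
static bool burst_fits_in_gap(unsigned long min_kb, unsigned long low_kb,
			      unsigned long burst_kb)
{
	return low_min_gap_kb(min_kb, low_kb) > burst_kb;
}
```

With the default values the gap is 7190 - 5752 = 1438 KB, so a 2000 KB burst does not fit. It fits once watermark[low] is above 5752 + 2000 = 7752 KB, for instance with the 16384 KB setting used in the test below.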
[Test]
I ran a simple test like below:
System memory: 2GB
$ dd if=/dev/zero of=/tmp/tmp_file &
$ time mapped-file-stream 1 $((1024 * 1024 * 64))
The result is as follows.
                  | default | case 1 | case 2 |
----------------------------------------------------------
wmark_min_kbytes  |    5752 |   5752 |   5752 |
wmark_low_kbytes  |    7190 |  16384 |  32768 | (KB)
wmark_high_kbytes |    8628 |  20480 |  40960 |
----------------------------------------------------------
real              |     503 |    364 |    337 |
user              |       3 |      5 |      4 | (msec)
sys               |     153 |    149 |    146 |
----------------------------------------------------------
page fault        |   32768 |  32768 |  32768 |
kswapd_wakeup     |    1809 |    335 |    228 | (times)
direct reclaim    |       5 |      0 |      0 |
As you can see, in the default case direct reclaim was performed
5 times and the execution time was 503 msec. In case 1 (the large
delta case), on the other hand, no direct reclaim was performed
and the execution time was 364 msec.
(*) mapped-file-stream
This is a micro benchmark from Johannes Weiner that accesses a
large sparse-file through mmap().
http://lkml.org/lkml/2010/8/30/226
Any comments or suggestions are welcome.
Satoru Moriya (2):
Add explanation about min_free_kbytes to clarify its effect
Make watermarks tunable separately
Documentation/sysctl/vm.txt | 40 +++++++++++++++-
include/linux/mmzone.h | 6 ++
kernel/sysctl.c | 28 +++++++++++-
mm/page_alloc.c | 109 +++++++++++++++++++++++++++++++++++++++++++
4 files changed, 181 insertions(+), 2 deletions(-)
|
|
From: Hidetoshi S. <set...@jp...> - 2010-12-27 02:03:49
|
(2010/12/23 8:35), Seiji Aguchi wrote:
> Hi,
>
> [Purpose]
> Kexec may trigger additional hardware errors and multiply the damage
> if it works after MCE occurred because there are some hardware-related
> operations in kexec as follows.
> - Sending NMI to cpus
> - Initializing hardware during boot process of second kernel.
> - Accessing to memory and dumping it to disks.
>
> So, I propose adding a new option controlling kexec behaviour when MCE
> occurred.
> This patch prevents unnecessary hardware errors and avoid expanding
> the damage.
>
> [Patch Description]
> I added a sysctl option, kernel.kexec_on_mce, controlling kexec behaviour
> when MCE occurred.
>
> - Permission
> - 0644
> - Value(default is "1")
> - non-zero: Kexec is enabled regardless of MCE.
> - 0: Kexec is disabled when MCE occurred.
>
> Matrix of kernel.kexec_on_mce value, MCE and kexec behaviour
>
> --------------------------------------------------
> kernel.kexec_on_mce | MCE          | kexec behaviour
> --------------------------------------------------
> non-zero            | occurred     | enabled
>                     |------------------------------
>                     | not occurred | enabled
> --------------------------------------------------
> 0                   | occurred     | disabled
>                     |------------------------------
>                     | not occurred | enabled
> --------------------------------------------------
>
> Any comments and suggestions are welcome.
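The decision matrix quoted above reduces to a single boolean expression. The sketch below is an illustration only: `kexec_allowed()` and the self-check are hypothetical names, not taken from the proposed patch, whose actual hook points are not shown in this message.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * kexec is blocked only when kernel.kexec_on_mce is 0 and an MCE
 * has occurred; every other combination leaves it enabled.
 */
static bool kexec_allowed(int kexec_on_mce, bool mce_occurred)
{
	return kexec_on_mce != 0 || !mce_occurred;
}

/* Walk the four rows of the matrix. */
static void kexec_matrix_selfcheck(void)
{
	assert(kexec_allowed(1, true));         /* non-zero, occurred     */
	assert(kexec_allowed(1, false));        /* non-zero, not occurred */
	assert(!kexec_allowed(0, true));        /* 0, occurred            */
	assert(kexec_allowed(0, false));        /* 0, not occurred        */
}
```

Since the default value is 1, behaviour is unchanged unless an administrator explicitly writes 0 to the sysctl.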

This reminds me of a very similar patch that I wrote a long time ago
but never posted.
The following is what I found still sitting in a branch of my private
git tree.
I guess it cannot be applied without a rebase, but I think the description
of my patch could give you a different point of view.
Feel free to use this debris to improve yours.

Thanks,
H.Seto
<*__NOTE_THIS_PATCH_IS_NOT_READY_TO_APPLY__*>
=====
From: Hidetoshi Seto <set...@jp...>
Date: Fri, 10 Jul 2009 15:55:42 +0900
Subject: [PATCH] kdump, sysctl: kdump_on_safe
This patch adds a sysctl, kdump_on_safe, to restrict kdump to running
only in safe situations.
Quote from the documentation in this patch:
> kdump_on_safe:
>
> When the system experiences panic, kdump will be triggered if
> crash kernel is configured. However the kdump might fail if
> the panic was caused by fatal error, such as hardware error
> reported by machine check exception. It should be rare case,
> but in the worst case, it will result in data corruption and/or
> fatal damage on the hardware.
>
> If this flag is 1, it prevents kdump from running on such
> unstable system situation. Default is 0.
This is a possible option if your hardware can provide good error
reports (in the SEL etc.) and/or the kernel can provide enough other
data for error investigation (console log, mcelog on x86 etc.), and you
would like to reduce downtime by skipping kdump in such situations.
Signed-off-by: Hidetoshi Seto <set...@jp...>
---
Documentation/sysctl/kernel.txt | 15 +++++++++++++++
arch/x86/kernel/cpu/mcheck/mce.c | 3 +++
include/linux/kexec.h | 3 +++
kernel/kexec.c | 8 ++++++++
kernel/sysctl.c | 13 +++++++++++++
5 files changed, 42 insertions(+), 0 deletions(-)
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 3894eaa..9d66ab9 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -33,6 +33,7 @@ show up in /proc/sys/kernel:
- hotplug
- java-appletviewer [ binfmt_java, obsolete ]
- java-interpreter [ binfmt_java, obsolete ]
+- kdump_on_safe [ kexec ]
- kstack_depth_to_print [ X86 only ]
- l2cr [ PPC only ]
- modprobe ==> Documentation/debugging-modules.txt
@@ -247,6 +248,20 @@ This flag controls the L2 cache of G3 processor boards. If
==============================================================
+kdump_on_safe:
+
+When the system experiences panic, kdump will be triggered if
+crash kernel is configured. However the kdump might fail if
+the panic was caused by fatal error, such as hardware error
+reported by machine check exception. It should be rare case,
+but in the worst case, it will result in data corruption and/or
+fatal damage on the hardware.
+
+If this flag is 1, it prevents kdump from running on such
+unstable system situation. Default is 0.
+
+==============================================================
+
kstack_depth_to_print: (X86 only)
Controls the number of words to print when dumping the raw
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 3e2ab18..c93bb38 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -23,6 +23,7 @@
#include <linux/sysdev.h>
#include <linux/delay.h>
#include <linux/ctype.h>
+#include <linux/kexec.h>
#include <linux/sched.h>
#include <linux/sysfs.h>
#include <linux/types.h>
@@ -291,6 +292,8 @@ static void mce_panic(char *msg, struct mce *final, char *exp)
int cpu;
if (!fake_panic) {
+ set_kdump_might_fail();
+
/*
* Make sure only one CPU runs in machine check panic
*/
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 03e8e8d..41e9ab0 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -209,10 +209,13 @@ int __init parse_crashkernel(char *cmdline, unsigned long long system_ram,
int crash_shrink_memory(unsigned long new_size);
size_t crash_get_memory_size(void);
+extern int kdump_might_fail;
+static inline void set_kdump_might_fail(void) { kdump_might_fail = 1; }
#else /* !CONFIG_KEXEC */
struct pt_regs;
struct task_struct;
static inline void crash_kexec(struct pt_regs *regs) { }
static inline int kexec_should_crash(struct task_struct *p) { return 0; }
+static inline void set_kdump_might_fail(void) { }
#endif /* CONFIG_KEXEC */
#endif /* LINUX_KEXEC_H */
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 87ebe8a..182c2f3 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -40,6 +40,9 @@
#include <asm/system.h>
#include <asm/sections.h>
+int kdump_on_safe;
+int kdump_might_fail;
+
/* Per cpu memory for storing cpu states in case of system crash. */
note_buf_t __percpu *crash_notes;
@@ -1064,6 +1067,11 @@ asmlinkage long compat_sys_kexec_load(unsigned long entry,
void crash_kexec(struct pt_regs *regs)
{
+ if (kdump_on_safe && kdump_might_fail) {
+ printk(KERN_EMERG "kexec cancelled due to unstable system.\n");
+ return;
+ }
+
/* Take the kexec_mutex here to prevent sys_kexec_load
* running on one cpu from replacing the crash kernel
* we are using after a panic on a different cpu.
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 8686b0f..8564e5c 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -156,6 +156,10 @@ extern int unaligned_dump_stack;
extern struct ratelimit_state printk_ratelimit_state;
+#ifdef CONFIG_KEXEC
+extern int kdump_on_safe;
+#endif
+
#ifdef CONFIG_PROC_SYSCTL
static int proc_do_cad_pid(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp, loff_t *ppos);
@@ -926,6 +930,15 @@ static struct ctl_table kern_table[] = {
.proc_handler = proc_dointvec,
},
#endif
+#ifdef CONFIG_KEXEC
+ {
+ .procname = "kdump_on_safe",
+ .data = &kdump_on_safe,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+#endif
/*
* NOTE: do not add new entries to this table unless you have read
* Documentation/sysctl/ctl_unnumbered.txt
--
1.7.3.2
</*__NOTE_THIS_PATCH_IS_NOT_READY_TO_APPLY__*>
|
|
From: <ebi...@xm...> - 2010-12-25 21:40:40
|
"H. Peter Anvin" <hp...@zy...> writes:

> On 12/25/2010 09:19 AM, Eric W. Biederman wrote:
>>>
>>> So, kdump may receive wrong identifier when it starts after MCE
>>> occurred, because MCE is reported by memory, cache, and TLB errors
>>>
>>> In the worst case, kdump will overwrite user data if it recognizes a
>>> disk saving user data as a dump disk.
>>
>> Absurdly unlikely. There is a sha256 checksum verified over the
>> kdump kernel before it starts booting. If you have very broken
>> memory it is possible, but absurdly unlikely that the machine will
>> even boot if you are having enough uncorrectable memory errors
>> an hour to get past the sha256 checksum and then be corrupt.
>>
>
> That wouldn't be the likely scenario (passing a sha256 checksum with the
> wrong data due to a random event will never happen for all the computers
> on Earth before the Sun destroys the planet). However, in a
> failing-memory scenario, the much more likely scenario is that kdump
> starts up, verifies the signature, and *then* has corruption causing it
> to write to the wrong disk or whatnot. This is inherent in any scheme
> that allows writing to hard media after a failure (as opposed to, say,
> dumping to the network.)

Then the kdump kernel should also panic if we detect an uncorrected ECC
error. So this doesn't appear to open any new holes for disk corruption.

kexec on panic can also be used for taking crash dumps over the network.

What happens with the data is totally defined by userspace code in an
initrd. Which is why extra policy knobs should be where they can be used.

Eric
|
|
From: H. P. A. <hp...@zy...> - 2010-12-25 18:37:28
|
On 12/25/2010 09:19 AM, Eric W. Biederman wrote:
>>
>> So, kdump may receive wrong identifier when it starts after MCE
>> occurred, because MCE is reported by memory, cache, and TLB errors
>>
>> In the worst case, kdump will overwrite user data if it recognizes a
>> disk saving user data as a dump disk.
>
> Absurdly unlikely. There is a sha256 checksum verified over the
> kdump kernel before it starts booting. If you have very broken
> memory it is possible, but absurdly unlikely that the machine will
> even boot if you are having enough uncorrectable memory errors
> an hour to get past the sha256 checksum and then be corrupt.
>

That wouldn't be the likely scenario (passing a sha256 checksum with the
wrong data due to a random event will never happen for all the computers
on Earth before the Sun destroys the planet). However, in a
failing-memory scenario, the much more likely scenario is that kdump
starts up, verifies the signature, and *then* has corruption causing it
to write to the wrong disk or whatnot. This is inherent in any scheme
that allows writing to hard media after a failure (as opposed to, say,
dumping to the network.)

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
|
|
From: <ebi...@xm...> - 2010-12-25 17:20:29
|
Seiji Aguchi <sei...@hd...> writes:

> Hi,
>
> Thank you for giving your comments.
>
>>So what is the problem you are trying to avoid, and why can't we do
>>something in the kernel's initialization path to avoid initializing
>>when there is a problem?
>
> Kdump gets a dump disk identifier based on information from memory.
>
> So, kdump may receive wrong identifier when it starts after MCE
> occurred, because MCE is reported by memory, cache, and TLB errors
>
> In the worst case, kdump will overwrite user data if it recognizes a
> disk saving user data as a dump disk.

Absurdly unlikely. There is a sha256 checksum verified over the
kdump kernel before it starts booting. If you have very broken
memory it is possible, but absurdly unlikely that the machine will
even boot if you are having enough uncorrectable memory errors
an hour to get past the sha256 checksum and then be corrupt.

> Kdump shouldn't write any data to disk when information from
> hardware is not credible because saving user data is always first
> priority.

Which is what is already implemented.

It looks to me like you are jumping at shadows, and adding complexity
to the kernel with no gain, and significant cost.

Eric
|
|
From: Seiji A. <sei...@hd...> - 2010-12-25 15:24:34
|
Hi,

Thank you for giving your comments.

>So what is the problem you are trying to avoid, and why can't we do
>something in the kernel's initialization path to avoid initializing
>when there is a problem?

Kdump gets a dump disk identifier based on information from memory.

So, kdump may receive wrong identifier when it starts after MCE
occurred, because MCE is reported by memory, cache, and TLB errors

In the worst case, kdump will overwrite user data if it recognizes a
disk saving user data as a dump disk.

Kdump shouldn't write any data to disk when information from
hardware is not credible because saving user data is always first
priority.

Seiji
|
|
From: <ebi...@xm...> - 2010-12-23 19:57:02
|
Seiji Aguchi <sei...@hd...> writes:

> Hi,
>
> I agree with Borislav that kexec shouldn't start at all because we can't guarantee
> a stable system anymore when MCE is reported.

In the case of kexec on panic we can never guarantee a stable system.
But the odds are much better of executing non-corrupt code and of
telling people you had a hardware error if you go through the kexec on
panic process.

If I read Andi's patch correctly he was suggesting to not allow any
more MCEs to be reported on that path.

> On the other hand, I understand there are people like Andi who want to start kexec
> even if MCE occurred.
>
> That is why I propose adding a new option controlling kexec behaviour
> when MCE occurred.

What do you gain by not doing the kexec on panic, when you have the
system configured to take one? We already have the big policy knobs
to enable or disable this kind of behavior.

> I don't stick to "sysctl".

I think adding a sysctl in this path or any unnecessary code will make
things less reliable.

Last time this happened to me (about a week ago), the kexec on panic
from an ECC-reported memory error worked just fine. Aka in the real
world it seems to work.

So what is the problem you are trying to avoid, and why can't we do
something in the kernel's initialization path to avoid initializing
when there is a problem?

Eric
|
|
From: Seiji A. <sei...@hd...> - 2010-12-23 17:49:41
|
Hi,

I agree with Borislav that kexec shouldn't start at all because we can't
guarantee a stable system anymore when MCE is reported.

On the other hand, I understand there are people like Andi who want to
start kexec even if MCE occurred.

That is why I propose adding a new option controlling kexec behaviour
when MCE occurred.

I don't stick to "sysctl".
I suggest to add a new boot parameter instead of sysctl because users
can't change their configuration once the boot parameter is set.

I will resend the patch if it is acceptable.

Regards,

Seiji
|
|
From: Borislav P. <bp...@al...> - 2010-12-23 09:38:19
|
On Thu, Dec 23, 2010 at 08:43:39AM +0100, Andi Kleen wrote:
> >
> > - Accessing to memory and dumping it to disks.
>
> A better solution for this is
>
> http://git.kernel.org/?p=linux/kernel/git/ak/linux-mce-2.6.git;a=commitdiff;h=fe61906edce9e70d02481a77a617ba1397573dce
> and
> http://git.kernel.org/?p=linux/kernel/git/ak/linux-mce-2.6.git;a=commit;h=cb58f049ae6709ddbab71be199390dc6852018cd
>
> I'm not a big friend of sysctls for things like this -- either the behaviour
> makes sense and should be default or not.

This doesn't add up. AFAICT, you're disabling MCE reporting for crash
dumps and the original patch's intention was to control whether kexec
should run after a machine check.

And I agree with Greg that this shouldn't be configurable but instead on
by default - if you get a critical error and you cannot guarantee a
stable system anymore, kexec shouldn't start at all. That simple.

Thanks.

--
Regards/Gruss,
Boris.
|
|
From: Andi K. <an...@fi...> - 2010-12-23 07:43:52
|
> - Accessing to memory and dumping it to disks.

A better solution for this is

http://git.kernel.org/?p=linux/kernel/git/ak/linux-mce-2.6.git;a=commitdiff;h=fe61906edce9e70d02481a77a617ba1397573dce
and
http://git.kernel.org/?p=linux/kernel/git/ak/linux-mce-2.6.git;a=commit;h=cb58f049ae6709ddbab71be199390dc6852018cd

I'm not a big friend of sysctls for things like this -- either the behaviour
makes sense and should be default or not.

-Andi
|
|
From: Greg KH <gr...@su...> - 2010-12-23 00:30:34
|
On Wed, Dec 22, 2010 at 06:35:40PM -0500, Seiji Aguchi wrote:
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -81,6 +81,9 @@
> #include <linux/nmi.h>
> #endif
>
> +#ifdef CONFIG_X86_MCE
> +#include <asm/mce.h>
> +#endif

Please don't put ifdefs in .c files, you do that a lot for this option.
Just make it work on all platforms and then you will not need the
#ifdef.

thanks,

greg k-h
|
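One common way to follow Greg's advice is to keep the conditional in a header and give callers a no-op stub when the feature is not configured; Seto's kexec.h hunk uses the same idea for set_kdump_might_fail(). The following is only a user-space sketch of that pattern. Every name in it (CONFIG_DEMO_MCE, demo_note_mce, demo_mce_occurred, demo_handle_machine_check, demo_selftest) is hypothetical and does not come from either patch.

```c
#include <assert.h>

/*
 * Hypothetical stand-in for a Kconfig symbol. In a kernel build this comes
 * from the configuration, not from a #define in the source file.
 */
#define CONFIG_DEMO_MCE 1

/* ---- header part (would live in something like a demo_mce.h) ---- */
#ifdef CONFIG_DEMO_MCE
static int demo_mce_seen;   /* a real header would declare this extern */
static inline void demo_note_mce(void) { demo_mce_seen = 1; }
static inline int demo_mce_occurred(void) { return demo_mce_seen; }
#else
/* Stubs: callers compile unchanged and the calls optimize away. */
static inline void demo_note_mce(void) { }
static inline int demo_mce_occurred(void) { return 0; }
#endif

/* ---- .c part: no #ifdef at the call sites ---- */
void demo_handle_machine_check(void)
{
    demo_note_mce();
}

/* Usage check; the expected values hold when CONFIG_DEMO_MCE is defined. */
void demo_selftest(void)
{
    assert(demo_mce_occurred() == 0);
    demo_handle_machine_check();
    assert(demo_mce_occurred() == 1);
}
```

With this layout the .c file calls demo_note_mce() unconditionally, and only the header knows whether the feature exists.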
|
From: Seiji A. <sei...@hd...> - 2010-12-23 00:16:57
|
Hi,
[Purpose]
Kexec may trigger additional hardware errors and multiply the damage
if it runs after an MCE has occurred, because kexec performs several
hardware-related operations:
 - Sending NMIs to CPUs.
 - Initializing hardware during the boot process of the second kernel.
 - Accessing memory and dumping it to disk.

So, I propose adding a new option that controls kexec behaviour when an
MCE has occurred.
This patch prevents unnecessary hardware errors and avoids expanding
the damage.

[Patch Description]
I added a sysctl option, kernel.kexec_on_mce, that controls kexec
behaviour when an MCE has occurred.

 - Permission
   - 0644
 - Value (default is "1")
   - non-zero: Kexec is enabled regardless of MCE.
   - 0: Kexec is disabled when MCE occurred.

Matrix of kernel.kexec_on_mce value, MCE and kexec behaviour

--------------------------------------------------------
kernel.kexec_on_mce | MCE          | kexec behaviour
--------------------------------------------------------
non-zero            | occurred     | enabled
                    |----------------------------------
                    | not occurred | enabled
--------------------------------------------------------
0                   | occurred     | disabled
                    |----------------------------------
                    | not occurred | enabled
--------------------------------------------------------
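
The matrix above reduces to a single predicate: kexec is suppressed only when the sysctl is 0 and an MCE has been seen. A minimal user-space sketch of that rule follows. It is not the kernel code from this patch, and the names kexec_allowed and kexec_matrix_selftest are made up for illustration.

```c
#include <assert.h>

/*
 * User-space model of the matrix (illustrative names, not kernel API).
 * kexec_on_mce: value of the proposed sysctl (default 1).
 * mce_occurred: non-zero once a machine check has been seen.
 * Returns 1 if kexec may run, 0 if it is suppressed.
 */
int kexec_allowed(int kexec_on_mce, int mce_occurred)
{
    /* Only "sysctl == 0 and an MCE occurred" blocks kexec. */
    return !(kexec_on_mce == 0 && mce_occurred);
}

/* Walks the four rows of the matrix. */
void kexec_matrix_selftest(void)
{
    assert(kexec_allowed(1, 1) == 1);  /* non-zero, occurred     -> enabled  */
    assert(kexec_allowed(1, 0) == 1);  /* non-zero, not occurred -> enabled  */
    assert(kexec_allowed(0, 1) == 0);  /* 0,        occurred     -> disabled */
    assert(kexec_allowed(0, 0) == 1);  /* 0,        not occurred -> enabled  */
}
```

The check added to crash_kexec() in the diff below, `!kexec_on_mce && mce_flag`, is the negation of this predicate.
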
Any comments and suggestions are welcome.
Signed-off-by: Seiji Aguchi <sei...@hd...>
---
Documentation/sysctl/kernel.txt | 12 ++++++++++++
arch/x86/include/asm/mce.h | 2 ++
arch/x86/kernel/cpu/mcheck/mce.c | 4 ++++
include/linux/sysctl.h | 1 +
kernel/kexec.c | 7 +++++++
kernel/sysctl.c | 12 ++++++++++++
kernel/sysctl_binary.c | 1 +
mm/memory-failure.c | 9 +++++++++
8 files changed, 48 insertions(+), 0 deletions(-)
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 209e158..ce3240e 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -34,6 +34,7 @@ show up in /proc/sys/kernel:
- hotplug
- java-appletviewer [ binfmt_java, obsolete ]
- java-interpreter [ binfmt_java, obsolete ]
+- kexec_on_mce [ X86 only ]
- kstack_depth_to_print [ X86 only ]
- l2cr [ PPC only ]
- modprobe ==> Documentation/debugging-modules.txt
@@ -261,6 +262,17 @@ This flag controls the L2 cache of G3 processor boards. If
==============================================================
+kexec_on_mce: (X86 only)
+
+Controls the kexec behaviour when MCE occurred.
+Default value is 1.
+
+0: Kexec is disabled when MCE occurred.
+non-zero: Kexec is enabled regardless of MCE.
+
+
+==============================================================
+
kstack_depth_to_print: (X86 only)
Controls the number of words to print when dumping the raw
diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index c62c13c..062dabd 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -123,6 +123,8 @@ extern struct atomic_notifier_head x86_mce_decoder_chain;
extern int mce_disabled;
extern int mce_p5_enabled;
+extern int kexec_on_mce;
+extern int mce_flag;
#ifdef CONFIG_X86_MCE
int mcheck_init(void);
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 7a35b72..edbaf77 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -85,6 +85,8 @@ static int mce_dont_log_ce __read_mostly;
int mce_cmci_disabled __read_mostly;
int mce_ignore_ce __read_mostly;
int mce_ser __read_mostly;
+int kexec_on_mce = 1;
+int mce_flag;
struct mce_bank *mce_banks __read_mostly;
@@ -944,6 +946,8 @@ void do_machine_check(struct pt_regs *regs, long error_code)
percpu_inc(mce_exception_count);
+ mce_flag = 1;
+
if (notify_die(DIE_NMI, "machine check", regs, error_code,
18, SIGKILL) == NOTIFY_STOP)
goto out;
diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index 7bb5cb6..0ebe708 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -153,6 +153,7 @@ enum
KERN_MAX_LOCK_DEPTH=74, /* int: rtmutex's maximum lock depth */
KERN_NMI_WATCHDOG=75, /* int: enable/disable nmi watchdog */
KERN_PANIC_ON_NMI=76, /* int: whether we will panic on an unrecovered */
+ KERN_KEXEC_ON_MCE=77, /* int: whether we will dump memory on mce */
};
diff --git a/kernel/kexec.c b/kernel/kexec.c
index b55045b..3e5c41a 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -39,6 +39,7 @@
#include <asm/io.h>
#include <asm/system.h>
#include <asm/sections.h>
+#include <asm/mce.h>
/* Per cpu memory for storing cpu states in case of system crash. */
note_buf_t __percpu *crash_notes;
@@ -1074,6 +1075,12 @@ void crash_kexec(struct pt_regs *regs)
* of memory the xchg(&kexec_crash_image) would be
* sufficient. But since I reuse the memory...
*/
+#ifdef CONFIG_X86_MCE
+ if (!kexec_on_mce && mce_flag) {
+ printk(KERN_WARNING "Kexec is disabled because MCE occurred\n");
+ return;
+ }
+#endif
if (mutex_trylock(&kexec_mutex)) {
if (kexec_crash_image) {
struct pt_regs fixed_regs;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 5abfa15..3a64cd6 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -81,6 +81,9 @@
#include <linux/nmi.h>
#endif
+#ifdef CONFIG_X86_MCE
+#include <asm/mce.h>
+#endif
#if defined(CONFIG_SYSCTL)
@@ -963,6 +966,15 @@ static struct ctl_table kern_table[] = {
.proc_handler = proc_dointvec,
},
#endif
+#if defined(CONFIG_X86_MCE)
+ {
+ .procname = "kexec_on_mce",
+ .data = &kexec_on_mce,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+#endif
/*
* NOTE: do not add new entries to this table unless you have read
* Documentation/sysctl/ctl_unnumbered.txt
diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c
index 1357c57..a25f971 100644
--- a/kernel/sysctl_binary.c
+++ b/kernel/sysctl_binary.c
@@ -138,6 +138,7 @@ static const struct bin_table bin_kern_table[] = {
{ CTL_INT, KERN_MAX_LOCK_DEPTH, "max_lock_depth" },
{ CTL_INT, KERN_NMI_WATCHDOG, "nmi_watchdog" },
{ CTL_INT, KERN_PANIC_ON_NMI, "panic_on_unrecovered_nmi" },
+ { CTL_INT, KERN_KEXEC_ON_MCE, "kexec_on_mce" },
{}
};
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 46ab2c0..3ec075a 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -52,6 +52,11 @@
#include <linux/swapops.h>
#include <linux/hugetlb.h>
#include <linux/memory_hotplug.h>
+
+#ifdef CONFIG_X86_MCE
+#include <asm/mce.h>
+#endif
+
#include "internal.h"
int sysctl_memory_failure_early_kill __read_mostly = 0;
@@ -949,6 +954,10 @@ int __memory_failure(unsigned long pfn, int trapno, int flags)
int res;
unsigned int nr_pages;
+#ifdef CONFIG_X86_MCE
+ mce_flag = 1;
+#endif
+
if (!sysctl_memory_failure_recovery)
panic("Memory failure from trap %d on page %lx", trapno, pfn);
--
1.7.1
|
|
From: Masami H. <mas...@hi...> - 2010-12-16 13:27:54
|
Hi,
Here is a combined patch of the kprobes jump optimization (a.k.a. Djprobe)
for the RHEL6.0 kernel 2.6.32-71.el6. With this patch, some kprobes can
be optimized into a jump, which reduces the probing overhead drastically.
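
As background, an optimized probe uses a 5-byte relative jump (opcode 0xe9, named RELATIVEJUMP_OPCODE later in this mail) instead of the 1-byte breakpoint (0xcc). The sketch below shows only how such a jump is encoded. It is a user-space illustration and not code from the patch; encode_reljump and the DEMO_* macros are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_RELJUMP_OPCODE 0xe9   /* x86 "jmp rel32" */
#define DEMO_RELJUMP_SIZE   5      /* 1 opcode byte + 4 displacement bytes */

/*
 * Encode "jmp rel32" into buf as if the instruction were located at
 * address from and should land at address to. The displacement is
 * relative to the end of the 5-byte instruction. Returns 0 on success,
 * or -1 if the target does not fit in a signed 32-bit displacement.
 */
int encode_reljump(uint8_t buf[DEMO_RELJUMP_SIZE], uint64_t from, uint64_t to)
{
    /* Unsigned subtraction wraps; the cast recovers the signed distance. */
    int64_t disp = (int64_t)(to - (from + DEMO_RELJUMP_SIZE));
    uint32_t rel;

    assert(buf != NULL);
    if (disp < INT32_MIN || disp > INT32_MAX)
        return -1;

    rel = (uint32_t)(int32_t)disp;
    buf[0] = DEMO_RELJUMP_OPCODE;
    /* x86 stores the displacement in little-endian byte order. */
    buf[1] = (uint8_t)(rel & 0xff);
    buf[2] = (uint8_t)((rel >> 8) & 0xff);
    buf[3] = (uint8_t)((rel >> 16) & 0xff);
    buf[4] = (uint8_t)((rel >> 24) & 0xff);
    return 0;
}
```

The range check matters because the jump can only reach targets within roughly 2 GB of the probed address.
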
This patch includes the following commits:
b46b3d70c9c017d7c4ec49f7f3ffd0af5a622277
89ae465b0ee470f7d3f8a1c61353445c3acbbe2a
24851d2447830e6cba4c4b641cb73e713f312373
e9afe9e1b3fdbd56cca53959a2519e70db9c8095
f5ad31158d60946b9fd18c8a79c283a6bc432430
65e234ec2c4a0659ca22531dc1372a185f088517
a00e817f42663941ea0aa5f85a9d1c4f8b212839
1f0ab40976460bc4673fa204ce917a725185d8f2
98272ed0d2e6509fe7dc571e77956c99bf653bb6
4dae560f97fa438f373b53e14b30149c9e44a600
c2ef6661ce62e26a8c0978e521fab646128a144b
615d0ebbc782b67296e3226c293f520f93f93515
2cfa19780d61740f65790c5bae363b759d7c96fa
076dc4a65a6d99a16979e2c7917e669fb8c91ee5
4554dbcb85a4ed2abaa2b6fa15649b796699ec89
5ecaafdbf44b1ba400b746c60c401d54c7ee0863
d498f763950703c724c650db1d34a1c8679f9ca8
4610ee1d3638fa05ba8e87ccfa971db8e4033ae7
afd66255b9a48f5851326ddae50e2203fbf71dc9
b2be84df99ebc93599c69e931a3c4a5105abfabc
0f94eb634ef7af736dee5639aac1c2fe9635d089
f007ea2685692bafb386820144cf73a14016fc7c
3d55cc8a058ee96291d6d45b1e35121b9920eca3
c0f7ac3a9edde786bc129d37627953a8b8abefdf
e5a11016643d1ab7172193591506d33a844734cc
83ff56f46a8532488ee364bb93a9cb2a59490d33
c0614829c16ab9d31f1b7d40516decfbf3d32102
a197479848a2f1a2a5c07cffa6c31ab5e8c82797
737480a0d525dae13306296da08029dff545bc72
edbaadbe42b0b790618ec49d29626223529d8195
05662bdb64c746079de7ac4dc4fb4caa5e8e119f
6376b2297502e72255b7eb2893c6044ad5a7b5f4
6abded71d730322df96c5b7f4ab952ffd5a0080d
635c17c2b2b4e5cd34f5dcba19d751b4e58533c2
43948f50276eca010a22726860dfe9a4e8130136
404ba5d7bb958d3d788bdaa0debc0bdf60f13ffe
The commits below have been changed. I have added some comments
about those changes.
[89ae465b0ee470f7d3f8a1c61353445c3acbbe2a]
diff --git a/arch/x86/kernel/kprobes.c b/arch/x86/kernel/kprobes.c
@@ -94,32 +95,11 @@
- s64 disp;
- int need_modrm;
-
-- /* Skip legacy instruction prefixes. */
-- while (1) {
-- switch (*insn) {
-- case 0x66:
-- case 0x67:
-- case 0x2e:
-- case 0x3e:
-- case 0x26:
-- case 0x64:
-- case 0x65:
-- case 0x36:
-- case 0xf0:
-- case 0xf3:
-- case 0xf2:
-- ++insn;
-- continue;
-- }
-- break;
-- }
+- /* Skip prefixes */
+- insn = skip_prefixes(insn);
+ struct insn insn;
+ kernel_insn_init(&insn, p->ainsn.insn);
-- /* Skip REX instruction prefix. */
-- if (is_REX_prefix(insn))
-- ++insn;
--
- if (*insn == 0x0f) {
- /* Two-byte opcode. */
- ++insn;
RHEL6 already includes the "skip_prefixes" bugfix patch, which fixes
BZ#607215. Since it changes this part of the kprobes code, this
commit had to be modified.
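
For reference, the open-coded loop removed in the hunk above can be modelled in user space as follows. This is only a sketch built from the prefix bytes listed in that hunk, not the RHEL6 skip_prefixes() implementation. The function names and the explicit length bound are additions for illustration, and REX prefix handling is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if byte is one of the legacy prefixes listed in the hunk. */
int is_legacy_prefix(uint8_t byte)
{
    switch (byte) {
    case 0x66: /* operand-size override */
    case 0x67: /* address-size override */
    case 0x2e: /* CS segment override */
    case 0x3e: /* DS segment override */
    case 0x26: /* ES segment override */
    case 0x64: /* FS segment override */
    case 0x65: /* GS segment override */
    case 0x36: /* SS segment override */
    case 0xf0: /* LOCK */
    case 0xf3: /* REP/REPE */
    case 0xf2: /* REPNE */
        return 1;
    default:
        return 0;
    }
}

/*
 * Advance past leading legacy prefixes without reading beyond len bytes.
 * Returns a pointer to the first non-prefix byte, or insn + len if the
 * buffer holds nothing but prefixes.
 */
const uint8_t *skip_legacy_prefixes(const uint8_t *insn, size_t len)
{
    assert(insn != NULL);
    while (len > 0 && is_legacy_prefix(*insn)) {
        ++insn;
        --len;
    }
    return insn;
}
```

The removed kernel loop had no length bound because it operated on a copied instruction buffer of known size; the bound here just keeps the sketch safe on arbitrary input.
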
[4554dbcb85a4ed2abaa2b6fa15649b796699ec89]
@@ -36,18 +38,20 @@
return -EILSEQ;
/* insn: must be on special executable page on x86. */
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
-index 9907a03..c3340e8 100644
+index 9907a03..95d5787 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
-@@ -44,6 +44,7 @@
- #include <linux/debugfs.h>
- #include <linux/kdebug.h>
- #include <linux/memory.h>
-+#include <linux/ftrace.h>
+@@ -685,6 +685,9 @@ static inline int check_kprobe_rereg(struct kprobe *p)
+ return ret;
+ }
- #include <asm-generic/sections.h>
- #include <asm/cacheflush.h>
-@@ -703,7 +704,8 @@ int __kprobes register_kprobe(struct kprobe *p)
++/* Don't include ftrace.h since kabi will be broken. */
++extern int ftrace_text_reserved(void *start, void *end);
++
+ int __kprobes register_kprobe(struct kprobe *p)
+ {
+ int ret = 0;
+@@ -703,7 +706,8 @@ int __kprobes register_kprobe(struct kprobe *p)
preempt_disable();
if (!kernel_text_address((unsigned long) p->addr) ||
I'm not sure why, but including ftrace.h broke kABI compatibility,
so I removed the "#include <linux/ftrace.h>" line and declared
ftrace_text_reserved() directly instead.
[d498f763950703c724c650db1d34a1c8679f9ca8]
-@@ -32,7 +32,7 @@ struct kprobe;
+@@ -33,7 +33,7 @@ struct kprobe;
typedef u8 kprobe_opcode_t;
#define BREAKPOINT_INSTRUCTION 0xcc
-#define RELATIVEJUMP_INSTRUCTION 0xe9
+#define RELATIVEJUMP_OPCODE 0xe9
- #define MAX_INSN_SIZE 16
#define MAX_STACK_SIZE 64
#define MIN_STACK_SIZE(ADDR) \
+ (((MAX_STACK_SIZE) < (((unsigned long)current_thread_info()) + \
diff --git a/arch/x86/kernel/kprobes.c b/arch/x86/kernel/kprobes.c
A patch which moves the MAX_INSN_SIZE definition out of asm/kprobes.h has
already been applied to the RHEL6.0 kernel. Thus this commit had to be modified.
[b2be84df99ebc93599c69e931a3c4a5105abfabc]
@@ -200,12 +203,13 @@
#include <asm/uaccess.h>
#include <asm/processor.h>
-@@ -1450,6 +1451,17 @@ static struct ctl_table debug_table[] = {
- .proc_handler = proc_dointvec
+@@ -1691,6 +1692,18 @@ static struct ctl_table debug_table[] = {
+ .extra1 = &zero,
},
#endif
+#if defined(CONFIG_OPTPROBES)
+ {
++ .ctl_name = CTL_UNNUMBERED,
+ .procname = "kprobes-optimization",
+ .data = &sysctl_kprobes_optimization,
+ .maxlen = sizeof(int),
@@ -215,6 +219,6 @@
+ .extra2 = &one,
+ },
+#endif
- { }
+ { .ctl_name = 0 }
};
Recent kernels changed the sysctl table format, while RHEL6.0 still uses
the older one (with .ctl_name). This commit also had to be modified
according to that difference.
[0f94eb634ef7af736dee5639aac1c2fe9635d089]
@@ -113,6 +114,8 @@
+ setup_singlestep(p, regs, kcb, 0);
return 1;
}
+ } else if (*addr != BREAKPOINT_INSTRUCTION) {
+@@ -549,7 +553,7 @@ static int __kprobes kprobe_handler(struct pt_regs *regs)
} else if (kprobe_running()) {
p = __get_cpu_var(current_kprobe);
if (p->break_handler && p->break_handler(p, regs)) {
RHEL6.0 includes the BZ#585400 bugfix patch
(829e92458532b1dbfeb972435d45bb060cdbf5a3 upstream), which changes
arch/x86/kernel/kprobes.c. This commit is affected by that change.
That's all.
Thank you,
--
Masami HIRAMATSU
2nd Dept. Linux Technology Center
Hitachi, Ltd., Systems Development Laboratory
E-mail: mas...@hi...
|
|
From: Satoru M. <sat...@hd...> - 2010-12-03 23:27:13
|
Hi,

This is a test for kprobes module (init) function probe support.
For details, please see README in tarball.

Thanks,

Satoru
|