You can subscribe to this list here.
| 2009 |
Jan
|
Feb
|
Mar
|
Apr
|
May
(32) |
Jun
(66) |
Jul
(102) |
Aug
(78) |
Sep
(106) |
Oct
(137) |
Nov
(147) |
Dec
(147) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2010 |
Jan
(71) |
Feb
(139) |
Mar
(86) |
Apr
(76) |
May
(57) |
Jun
(10) |
Jul
(12) |
Aug
(6) |
Sep
(8) |
Oct
(12) |
Nov
(12) |
Dec
(18) |
| 2011 |
Jan
(16) |
Feb
(19) |
Mar
(3) |
Apr
(1) |
May
(16) |
Jun
(17) |
Jul
(74) |
Aug
(22) |
Sep
(18) |
Oct
(24) |
Nov
(21) |
Dec
(30) |
| 2012 |
Jan
(31) |
Feb
(16) |
Mar
(22) |
Apr
(25) |
May
(18) |
Jun
(13) |
Jul
(83) |
Aug
(49) |
Sep
(20) |
Oct
(60) |
Nov
(35) |
Dec
(28) |
| 2013 |
Jan
(39) |
Feb
(61) |
Mar
(35) |
Apr
(21) |
May
(45) |
Jun
(56) |
Jul
(20) |
Aug
(9) |
Sep
(10) |
Oct
(31) |
Nov
(8) |
Dec
(4) |
| 2014 |
Jan
(6) |
Feb
(7) |
Mar
(7) |
Apr
(6) |
May
(4) |
Jun
(8) |
Jul
(5) |
Aug
(2) |
Sep
(4) |
Oct
(4) |
Nov
(11) |
Dec
(5) |
| 2015 |
Jan
(4) |
Feb
(4) |
Mar
(3) |
Apr
(4) |
May
(9) |
Jun
(4) |
Jul
(15) |
Aug
(8) |
Sep
(16) |
Oct
(18) |
Nov
(15) |
Dec
(7) |
| 2016 |
Jan
(20) |
Feb
(9) |
Mar
(15) |
Apr
(24) |
May
(16) |
Jun
(28) |
Jul
(22) |
Aug
(23) |
Sep
(18) |
Oct
(30) |
Nov
(40) |
Dec
(9) |
| 2017 |
Jan
(1) |
Feb
(8) |
Mar
(37) |
Apr
(26) |
May
(25) |
Jun
(46) |
Jul
(24) |
Aug
(9) |
Sep
|
Oct
|
Nov
|
Dec
|
|
From: Don Z. <dz...@re...> - 2012-01-11 17:26:25
|
On Wed, Jan 11, 2012 at 03:28:11PM +0800, Chen Gong wrote: > 于 2012/1/11 4:29, Seiji Aguchi 写道: > > > >>I agree with you. How about adding macros or something like WARN_ON(XX_ARCH) or > >>Kconfig to limit its scope? > > > >Thank you for giving me your idea. > >Your suggestions above will work for me because I'm a x86 user. > >If Tony agrees to it, I can update my patch. > > > >But, I'm hesitating to add WARN_ON() or change Kconfig only for specific arch > >because pstore aims for generic interface and this is related to its design. > >Also, ramoops is going to use pstore now. It doesn't depend on x86. > >I'm worried that ramoops users will complain about this change. > > > >So, I think a reasonable solution at this time is just adding some explanations > >about smp_send_stop() to documentation as follows. > > > >Users can use pstore with their own responsibility and ask developers > >if smp_send_stop() is reliable enough in panic situation on architecture they want to run. > > > >What do you think? > > > >--- > > Documentation/ABI/testing/pstore | 8 ++++++++ > > 1 files changed, 8 insertions(+), 0 deletions(-) > > > >diff --git a/Documentation/ABI/testing/pstore b/Documentation/ABI/testing/pstore > >index ff1df4e..5583729 100644 > >--- a/Documentation/ABI/testing/pstore > >+++ b/Documentation/ABI/testing/pstore > >@@ -11,6 +11,14 @@ Description: Generic interface to platform dependent persistent storage. > > of the console log is captured, but other interesting > > data can also be saved. > > > >+ In case of panic, pstore is invoked after smp_send_stop() > >+ ,a function call stopping other cpus, so that we can get > >+ logs simpler and cleaner with just one running cpu. > >+ > >+ As for x86, smp_send_stop() is reliable enough to work in > >+ panic situation. But we are not guaranteed that it works > >+ reliably on other architectures. > >+ > > # mount -t pstore -o kmsg_bytes=8000 - /dev/pstore > > > > $ ls -l /dev/pstore > > The explanation is great. but In my opinion, I still insist that > a WARN_ON() is necessary. What do you think, Tony and Don? I guess I still don't understand why. Who uses kmsg_dump besides x86? It seems like there was only 3 or 4 subsystems that were registering. Cheers, Don |
|
From: Chen G. <gon...@li...> - 2012-01-11 08:02:43
|
于 2012/1/11 4:29, Seiji Aguchi 写道: > >> I agree with you. How about adding macros or something like WARN_ON(XX_ARCH) or >> Kconfig to limit its scope? > > Thank you for giving me your idea. > Your suggestions above will work for me because I'm a x86 user. > If Tony agrees to it, I can update my patch. > > But, I'm hesitating to add WARN_ON() or change Kconfig only for specific arch > because pstore aims for generic interface and this is related to its design. > Also, ramoops is going to use pstore now. It doesn't depend on x86. > I'm worried that ramoops users will complain about this change. > > So, I think a reasonable solution at this time is just adding some explanations > about smp_send_stop() to documentation as follows. > > Users can use pstore with their own responsibility and ask developers > if smp_send_stop() is reliable enough in panic situation on architecture they want to run. > > What do you think? > > --- > Documentation/ABI/testing/pstore | 8 ++++++++ > 1 files changed, 8 insertions(+), 0 deletions(-) > > diff --git a/Documentation/ABI/testing/pstore b/Documentation/ABI/testing/pstore > index ff1df4e..5583729 100644 > --- a/Documentation/ABI/testing/pstore > +++ b/Documentation/ABI/testing/pstore > @@ -11,6 +11,14 @@ Description: Generic interface to platform dependent persistent storage. > of the console log is captured, but other interesting > data can also be saved. > > + In case of panic, pstore is invoked after smp_send_stop() > + ,a function call stopping other cpus, so that we can get > + logs simpler and cleaner with just one running cpu. > + > + As for x86, smp_send_stop() is reliable enough to work in > + panic situation. But we are not guaranteed that it works > + reliably on other architectures. > + > # mount -t pstore -o kmsg_bytes=8000 - /dev/pstore > > $ ls -l /dev/pstore The explanation is great. but In my opinion, I still insist that a WARN_ON() is necessary. What do you think, Tony and Don? |
|
From: Seiji A. <sei...@hd...> - 2012-01-10 20:30:20
|
>I agree with you. How about adding macros or something like WARN_ON(XX_ARCH) or >Kconfig to limit its scope? Thank you for giving me your idea. Your suggestions above will work for me because I'm a x86 user. If Tony agrees to it, I can update my patch. But, I'm hesitating to add WARN_ON() or change Kconfig only for specific arch because pstore aims for generic interface and this is related to its design. Also, ramoops is going to use pstore now. It doesn't depend on x86. I'm worried that ramoops users will complain about this change. So, I think a reasonable solution at this time is just adding some explanations about smp_send_stop() to documentation as follows. Users can use pstore with their own responsibility and ask developers if smp_send_stop() is reliable enough in panic situation on architecture they want to run. What do you think? --- Documentation/ABI/testing/pstore | 8 ++++++++ 1 files changed, 8 insertions(+), 0 deletions(-) diff --git a/Documentation/ABI/testing/pstore b/Documentation/ABI/testing/pstore index ff1df4e..5583729 100644 --- a/Documentation/ABI/testing/pstore +++ b/Documentation/ABI/testing/pstore @@ -11,6 +11,14 @@ Description: Generic interface to platform dependent persistent storage. of the console log is captured, but other interesting data can also be saved. + In case of panic, pstore is invoked after smp_send_stop() + ,a function call stopping other cpus, so that we can get + logs simpler and cleaner with just one running cpu. + + As for x86, smp_send_stop() is reliable enough to work in + panic situation. But we are not guaranteed that it works + reliably on other architectures. + # mount -t pstore -o kmsg_bytes=8000 - /dev/pstore $ ls -l /dev/pstore -- Seiji |
|
From: Chen G. <gon...@li...> - 2012-01-10 03:40:52
|
于 2012/1/10 1:59, Seiji Aguchi 写道: >> >> I don't know how to prove something is hardened other than not seeing any >> hangs or false reboots on in that piece of code. >> > > I don't know how we can prove reliability of smp_send_stop() of all archs as well. > > Another possible solution is replacing smp_send_stop() with > machine_crash_shutdown(). > This is a function call stopping other cpus reliably in panic situation > so that kdump can work . > But, I don't know whether it works in all archs. machine_crash_shutdown is defined in only some Archs so it is obvious that it can't be used in all platforms, BTW, this function is bracketed by CONFIG_KEXEC, which means it can't be used without this macro. So it is not suitable for this scenario. > We need further discussion with Eric for checking this idea is feasible... > > > I would like to go forward step by step rather than doing tough work at once. > I think just moving kmsg_dump() below smp_send_stop() is a good start for improving pstore. I agree with you. How about adding macros or something like WARN_ON(XX_ARCH) or Kconfig to limit its scope? |
|
From: Seiji A. <sei...@hd...> - 2012-01-09 18:00:38
|
> >I don't know how to prove something is hardened other than not seeing any >hangs or false reboots on in that piece of code. > I don't know how we can prove reliability of smp_send_stop() of all archs as well. Another possible solution is replacing smp_send_stop() with machine_crash_shutdown(). This is a function call stopping other cpus reliably in panic situation so that kdump can work . But, I don't know whether it works in all archs. We need further discussion with Eric for checking this idea is feasible... I would like to go forward step by step rather than doing tough work at once. I think just moving kmsg_dump() below smp_send_stop() is a good start for improving pstore. Seiji |
|
From: Don Z. <dz...@re...> - 2012-01-05 21:01:43
|
On Thu, Jan 05, 2012 at 03:10:25PM -0500, Seiji Aguchi wrote: > > >Aren't you worried about the comment about smp_send_stop() not > >being hardened to work in a panic situation? > > /* > > * Note smp_send_stop is the usual smp shutdown function, which > > * unfortunately means it may not be hardened to work in a panic > > This comment is wrong because Don improved smp_send_stop() by switching REBOOT_VECTOR to NMI. > And his patch has already merged to linux-next tree. I only fixed x86. Who knows what the other arches do.. I don't know how to prove something is hardened other than not seeing any hangs or false reboots on in that piece of code. Cheers, Don > > http://git.kernel.org/?p=linux/kernel/git/next/linux-next.git;a=commitdiff;h=3603a2512f9e69dc87914ba922eb4a0812b21cd6 > > So, current smp_send_stop() is hardened to work in a panic situation. > > I will remove this wrong comment. > > Seiji > > >-----Original Message----- > >From: Luck, Tony [mailto:ton...@in...] > >Sent: Thursday, January 05, 2012 2:07 PM > >To: Seiji Aguchi; Don Zickus > >Cc: lin...@vg...; Matthew Garrett; Vivek Goyal; Chen, Gong; ak...@li...; Brown, Len; > >'yin...@in...'; 'ak...@li...'; 'hu...@ch...'; 'mi...@el...'; jm...@na...; > >a.p...@ch...; nam...@gm...; dle...@li...; Satoru Moriya > >Subject: RE: [RFC][PATCH v4 -next 1/4] Move kmsg_dump(KMSG_DUMP_PANIC) below smp_send_stop() > > > >- kmsg_dump(KMSG_DUMP_PANIC); > >- > > /* > > * Note smp_send_stop is the usual smp shutdown function, which > > * unfortunately means it may not be hardened to work in a panic > >@@ -117,6 +115,8 @@ void panic(const char *fmt, ...) > > */ > > smp_send_stop(); > > > >+ kmsg_dump(KMSG_DUMP_PANIC); > >+ > > > >Aren't you worried about the comment about smp_send_stop() not > >being hardened to work in a panic situation? > > > >If it does work - we are clearly much better off moving the > >kmsg_dump() call down like this. It makes life much simpler > >and cleaner to work with just one running cpu. > > > >But if something goes wrong - we might not see the dump at all! > > > >How do we compare these cases and decide that it is better to > >trust that smp_send_stop() will return? > > > >-Tony |
|
From: Seiji A. <sei...@hd...> - 2012-01-05 20:11:10
|
>Aren't you worried about the comment about smp_send_stop() not >being hardened to work in a panic situation? > /* > * Note smp_send_stop is the usual smp shutdown function, which > * unfortunately means it may not be hardened to work in a panic This comment is wrong because Don improved smp_send_stop() by switching REBOOT_VECTOR to NMI. And his patch has already merged to linux-next tree. http://git.kernel.org/?p=linux/kernel/git/next/linux-next.git;a=commitdiff;h=3603a2512f9e69dc87914ba922eb4a0812b21cd6 So, current smp_send_stop() is hardened to work in a panic situation. I will remove this wrong comment. Seiji >-----Original Message----- >From: Luck, Tony [mailto:ton...@in...] >Sent: Thursday, January 05, 2012 2:07 PM >To: Seiji Aguchi; Don Zickus >Cc: lin...@vg...; Matthew Garrett; Vivek Goyal; Chen, Gong; ak...@li...; Brown, Len; >'yin...@in...'; 'ak...@li...'; 'hu...@ch...'; 'mi...@el...'; jm...@na...; >a.p...@ch...; nam...@gm...; dle...@li...; Satoru Moriya >Subject: RE: [RFC][PATCH v4 -next 1/4] Move kmsg_dump(KMSG_DUMP_PANIC) below smp_send_stop() > >- kmsg_dump(KMSG_DUMP_PANIC); >- > /* > * Note smp_send_stop is the usual smp shutdown function, which > * unfortunately means it may not be hardened to work in a panic >@@ -117,6 +115,8 @@ void panic(const char *fmt, ...) > */ > smp_send_stop(); > >+ kmsg_dump(KMSG_DUMP_PANIC); >+ > >Aren't you worried about the comment about smp_send_stop() not >being hardened to work in a panic situation? > >If it does work - we are clearly much better off moving the >kmsg_dump() call down like this. It makes life much simpler >and cleaner to work with just one running cpu. > >But if something goes wrong - we might not see the dump at all! > >How do we compare these cases and decide that it is better to >trust that smp_send_stop() will return? > >-Tony |
|
From: Luck, T. <ton...@in...> - 2012-01-05 19:07:29
|
- kmsg_dump(KMSG_DUMP_PANIC); - /* * Note smp_send_stop is the usual smp shutdown function, which * unfortunately means it may not be hardened to work in a panic @@ -117,6 +115,8 @@ void panic(const char *fmt, ...) */ smp_send_stop(); + kmsg_dump(KMSG_DUMP_PANIC); + Aren't you worried about the comment about smp_send_stop() not being hardened to work in a panic situation? If it does work - we are clearly much better off moving the kmsg_dump() call down like this. It makes life much simpler and cleaner to work with just one running cpu. But if something goes wrong - we might not see the dump at all! How do we compare these cases and decide that it is better to trust that smp_send_stop() will return? -Tony |
|
From: Seiji A. <sei...@hd...> - 2012-01-05 17:42:22
|
This patch skips spin_lock of efi_pstore_write() in panic path for these reasons. - efi_pstore_write() is serialized via smp_send_stop(). - SetVariable runtime service is guaranteed to work correctly if you call it a second time when you have taken an exception in trying to execute them already. We can avoid deadlock of spin_lock(&efivars->lock) in panic path and get panic log reliably. Signed-off-by: Seiji Aguchi <sei...@hd...> --- drivers/firmware/efivars.c | 13 +++++++++++-- 1 files changed, 11 insertions(+), 2 deletions(-) diff --git a/drivers/firmware/efivars.c b/drivers/firmware/efivars.c index d25599f..657e6f1 100644 --- a/drivers/firmware/efivars.c +++ b/drivers/firmware/efivars.c @@ -510,7 +510,15 @@ static int efi_pstore_write(enum pstore_type_id type, sprintf(stub_name, "dump-type%u-%u-", type, part); sprintf(name, "%s%lu", stub_name, get_seconds()); - spin_lock(&efivars->lock); + /* + * We don't need to take any locks in panic path for these reasons. + * - efi_pstore_write() is serialized via smp_send_stop(). + * - SetVariable runtime service is guaranteed to work correctly + * if you call it a second time when you have taken an exception + * in trying to execute them already. + */ + if (reason != KMSG_DUMP_PANIC) + spin_lock(&efivars->lock); for (i = 0; i < DUMP_NAME_LEN; i++) efi_name[i] = stub_name[i]; @@ -548,7 +556,8 @@ static int efi_pstore_write(enum pstore_type_id type, efivars->ops->set_variable(efi_name, &vendor, PSTORE_EFI_ATTRIBUTES, size, psi->buf); - spin_unlock(&efivars->lock); + if (reason != KMSG_DUMP_PANIC) + spin_unlock(&efivars->lock); if (found) efivar_unregister(found); -- 1.7.1 |
|
From: Seiji A. <sei...@hd...> - 2012-01-05 17:40:43
|
This patch skips subsequent kmsg_dump() function calls in panic path With this patch, we can avoid deadlock due to the subsequent calls. Actually, kmsg_dump(KMSG_DUMP_EMREG) is called after kmsg_dump(KMSG_DUMP_PANIC) when panic_timeout variable is set. Signed-off-by: Seiji Aguchi <sei...@hd...> --- kernel/printk.c | 10 ++++++++++ 1 files changed, 10 insertions(+), 0 deletions(-) diff --git a/kernel/printk.c b/kernel/printk.c index 5294426..186d792 100644 --- a/kernel/printk.c +++ b/kernel/printk.c @@ -1731,6 +1731,16 @@ void kmsg_dump(enum kmsg_dump_reason reason) const char *s1, *s2; unsigned long l1, l2; unsigned long flags; + static bool panicked; + + /* + * kmsg_dump() is skipped because we already got panic log. + */ + if (panicked) + return; + + if (reason == KMSG_DUMP_PANIC) + panicked = true; /* * Currently, "in_nmi() && reason != KMSG_DUMP_PANIC" case never happens -- 1.7.1 |
|
From: Seiji A. <sei...@hd...> - 2012-01-05 17:38:32
|
This patch does followings.
- Skip spin_locks in panic case in both kmsg_dump() and pstore_dump() because they are
serialized via smp_send_stop
- Add WARN_ON() in "in_nmi() and !panic" case into kmsg_dump(). Currently, this case never
happens because only kmsg_dump(KMSG_DUMP_PANIC) is called in NMI case.
But if someone adds new kmsg_dump() into NMI path in the future, kmsg_dump() may deadlock.
We can trap it and complain with this WARN_ON().
With this patch, kmsg_dump()/pstore_dump() work as follows.
panic case (KMSG_DUMP_PANIC):
- don't take lock because they are serialized.
not panic case (KMSG_DUMP_OOPS/KMSG_DUMP_EMERG/KMSG_DUMP_RESTART/KMSG_DUMP_HALT):
- take locks normally
Regarding as NMI case,
- kmsg_dump()/pstore_dump() don't take locks, so deadlock issue will not happen
because kmsg_dump() is called in just panic case with current implementation.
- If someone adds new kmsg_dump() into NMI path, WARN_ON() is called.
So we can trap it and ask to fix it.
Signed-off-by: Seiji Aguchi <sei...@hd...>
---
fs/pstore/platform.c | 16 ++++++----------
kernel/printk.c | 20 ++++++++++++++++++--
2 files changed, 24 insertions(+), 12 deletions(-)
diff --git a/fs/pstore/platform.c b/fs/pstore/platform.c
index 9ec22d3..426e77d 100644
--- a/fs/pstore/platform.c
+++ b/fs/pstore/platform.c
@@ -90,18 +90,17 @@ static void pstore_dump(struct kmsg_dumper *dumper,
int hsize, ret;
unsigned int part = 1;
unsigned long flags = 0;
- int is_locked = 0;
if (reason < ARRAY_SIZE(reason_str))
why = reason_str[reason];
else
why = "Unknown";
- if (in_nmi()) {
- is_locked = spin_trylock(&psinfo->buf_lock);
- if (!is_locked)
- pr_err("pstore dump routine blocked in NMI, may corrupt error record\n");
- } else
+ /*
+ * pstore_dump() is serialized in panic path.
+ * So, we don't need to take any locks.
+ */
+ if (reason != KMSG_DUMP_PANIC)
spin_lock_irqsave(&psinfo->buf_lock, flags);
oopscount++;
while (total < kmsg_bytes) {
@@ -131,10 +130,7 @@ static void pstore_dump(struct kmsg_dumper *dumper,
total += l1_cpy + l2_cpy;
part++;
}
- if (in_nmi()) {
- if (is_locked)
- spin_unlock(&psinfo->buf_lock);
- } else
+ if (reason != KMSG_DUMP_PANIC)
spin_unlock_irqrestore(&psinfo->buf_lock, flags);
}
diff --git a/kernel/printk.c b/kernel/printk.c
index 13c0a11..5294426 100644
--- a/kernel/printk.c
+++ b/kernel/printk.c
@@ -1732,13 +1732,29 @@ void kmsg_dump(enum kmsg_dump_reason reason)
unsigned long l1, l2;
unsigned long flags;
+ /*
+ * Currently, "in_nmi() && reason != KMSG_DUMP_PANIC" case never happens
+ * because ,in NMI path, only kmsg_dump(KMSG_DUMP_PANIC) is called.
+ * But if someone adds new kmsg_dump() into NMI path in the future,
+ * kmsg_dump() may deadlock. We can trap it and complain
+ * with this WARN_ON().
+ */
+ WARN_ON(in_nmi() && reason != KMSG_DUMP_PANIC);
+
/* Theoretically, the log could move on after we do this, but
there's not a lot we can do about that. The new messages
will overwrite the start of what we dump. */
- raw_spin_lock_irqsave(&logbuf_lock, flags);
+
+ /*
+ * kmsg_dump(KMSG_DUMP_PANIC) is serialized.
+ * So, we don't need to take any locks.
+ */
+ if (reason != KMSG_DUMP_PANIC)
+ raw_spin_lock_irqsave(&logbuf_lock, flags);
end = log_end & LOG_BUF_MASK;
chars = logged_chars;
- raw_spin_unlock_irqrestore(&logbuf_lock, flags);
+ if (reason != KMSG_DUMP_PANIC)
+ raw_spin_unlock_irqrestore(&logbuf_lock, flags);
if (chars > end) {
s1 = log_buf + log_buf_len - chars + end;
--
1.7.1
|
|
From: Seiji A. <sei...@hd...> - 2012-01-05 17:37:23
|
This patch just moves kmsg_dump(KMSG_DUMP_PANIC) below smp_send_stop for serializing logging process via smp_send_stop. Signed-off-by: Seiji Aguchi <sei...@hd...> Acked-by: Don Zickus <dz...@re...> --- kernel/panic.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/panic.c b/kernel/panic.c index 80aed44..da585b8 100644 --- a/kernel/panic.c +++ b/kernel/panic.c @@ -108,8 +108,6 @@ void panic(const char *fmt, ...) */ crash_kexec(NULL); - kmsg_dump(KMSG_DUMP_PANIC); - /* * Note smp_send_stop is the usual smp shutdown function, which * unfortunately means it may not be hardened to work in a panic @@ -117,6 +115,8 @@ void panic(const char *fmt, ...) */ smp_send_stop(); + kmsg_dump(KMSG_DUMP_PANIC); + atomic_notifier_call_chain(&panic_notifier_list, 0, buf); bust_spinlocks(0); -- 1.7.1 |
|
From: Seiji A. <sei...@hd...> - 2012-01-05 17:36:58
|
Hi, Discussion: As Don mentioned in following thread, it would be nice for pstore/kmsg_dump to serialize panic path because they can log messages reliably. https://lkml.org/lkml/2011/10/13/427 This patchset is based on his proposal switching smp_send_stop() from REBOOT_VECTOR to NMI. Change Log: v3 -> v4 - Add comment for explaining the purpose of WARN_ON() based on Don's comment (patch 2/4) https://lkml.org/lkml/2011/12/12/296 - Skip spin_lock of efi_pstore_write() in panic case based on discussion with Tony (patch 4/4) https://lkml.org/lkml/2012/1/3/151 - Apply this patchset to -next tree instead of linus -tree so that "patch 4/4" makes simple v2 -> v3 - Skip spin_locks in panic case in both kmsg_dump() and pstore_dump() instead of calling spin_lock_init() to avoid potential issues due to spin_lock_init() - Add WARN_ON() in "in_nmi() and !panic" case into kmsg_dump() so that we trap when someone adds new kmsg_dump() in NMI path in the future - Skip subsequent kmsg_dump() function calls to avoid deadlock. v1 -> v2 - Add trylocks to kmsg_dump()/pstore_dump() so that they can work in NMI context. - Divide a patch into two First one is just moving kmsg_dump(KMSG_DUMP_PANIC) below smp_send_stop() Second one is changing lock operations in kmsg_dump()/pstore_dump() v1 - Move kmsg_dump(KMSG_DUMP_PANIC) below smp_send_stop - Bust logbuf_lock of kmsg_dump() in panic path for avoiding deadlock - Bust psinfo->buf_lock of pstore_dump() in panic path for avoiding deadlock Patch Description: [RFC][PATCH v4 -next 1/4] Move kmsg_dump(KMSG_DUMP_PANIC) below smp_send_stop() - Just move kmsg_dump(KMSG_DUMP_PANIC) below smp_send_stop() [RFC][PATCH v4 -next 2/4] Skip spin_locks in panic case and Add WARN_ON() - Skip spin_locks in panic case in both kmsg_dump() and pstore_dump() - Add WARN_ON() in "in_nmi() and !panic" case into kmsg_dump() [RFC][PATCH v4 -next 3/4] Skip subsequent kmsg_dump() - Skip subsequent kmsg_dump() function calls in panic path [RFC][PATCH v4 -next 4/4] Skip spin_lock of efi_pstore_write() in panic case - Skip spin_lock of efi_pstore_write() in panic case to avoid deadlock Seiji |
|
From: Seiji A. <sei...@hd...> - 2012-01-03 18:22:53
|
>But this seems pretty clear that we don't need to self-NMI. Thank you for your explanation. Now I understand that we don't need self-NMI and efi_pstore can blindly continue in panic path because setvariable() will succeed even if the previous call that took the exception didn't succeed. I will submit an updated patch by removing spin_lock of efi_pstore_write() in panic path. Seiji |
|
From: Luck, T. <ton...@in...> - 2012-01-03 17:49:31
|
> I'll ask an Intel EFI guru whether I'm on target. Got a short note back from the EFI people: ---- The reason the paragraph exists in 7.1.1 is that certain services (as listed in table 31) are guaranteed to work if you call them a second time when you have taken an exception in trying to execute them already -- say SetVariable() face plants with an exception...because of the way our code is designed you can call it a second time even if the first call that took the exception didn't succeed. At any rate, I checked with a couple of the other experts around us and none of us can see any reason to send yourself an NMI just to preface a call to a UEFI runtime service. Thus, our collective opinion is that you are correct in your interpretation. ---- If you want to discuss this more - they apparently have an external mailing list some place on TianoCore.org. But this seems pretty clear that we don't need to self-NMI. -Tony |
|
From: Seiji A. <sei...@hd...> - 2012-01-03 16:05:02
|
>So I wonder whether there is any value in the cpu sending an NMI to itself to >meet the exact wording of this section in the spec. > >I'll ask an Intel EFI guru whether I'm on target. Thanks, Tony. I wait for their reply. Seiji |
|
From: Luck, T. <ton...@in...> - 2012-01-02 19:42:03
|
> 7.1.1 Exception for Machine Check, INIT, and NMI. I wonder if this section is poorly named ... I suspect that it may really mean "Exception for fatal situations". I don't think that the EFI code has any way to know that one of these events has happened (in theory it could tell that it was executing a machine check handler if it looked at MCIP bit in MCG_STATUS, but I doubt that it does that ... INIT/NMI would seem even harder to detect). > My suggestion is that panicked cpu sends NMI to itself and then it calls kmsg_dump(KMSG_DUMP_PANIC) in NMI. So I wonder whether there is any value in the cpu sending an NMI to itself to meet the exact wording of this section in the spec. I'll ask an Intel EFI guru whether I'm on target. -Tony |
|
From: Eric D. <eri...@gm...> - 2011-12-21 07:05:35
|
Le mardi 20 décembre 2011 à 18:40 -0500, Satoru Moriya a écrit : > We use trace_kfree_skb() which 63e03724b51 uses to detect packet > drop event. In addition to that, we would like to detect errors > in TCP layer for better trouble shooting. > If you feel we miss a counter in a particular spot, just add it, so that even a newbie can report "netstat -s" values in a message for analysis. Not re-sending a frame because of a temporary shortage of memory is not an error per se, TCP will recover eventually, but keeping track of these events might be useful (once we have not percpu counters for this, but regular atomic counters) |
|
From: Satoru M. <sat...@hd...> - 2011-12-20 23:40:50
|
On 12/17/2011 07:49 PM, Hagen Paul Pfeifer wrote: >> Sometimes network packets are dropped for some reason. In enterprise >> systems which require strict RAS functionality, we must know the >> reason why it happened and explain it to our customers even if using >> TCP. When we investigate the incidents, at first we try to find out >> whether the problem is in the server(kernel, application) or else >> (router, hub etc). And next we try to find out which layer >> (application/middleware/kernel(IP/TCP/UDP/..)etc.) the problem >> occurs. > > For the first question tcpdump may the right tool. We'd like to keep records on memory and save it to file when we detect problems so that we can keep tracing overhead low. We can also keep the amount of trace data lower than with tcpdump because we only record data when retransmission occurs. Capturing all the packets and saving them to file cannot satisfy our requirements. I should have written them in the cover-letter. I'll fix it. Also, we can analyze incidents in combination with the data from this tracepoint and from others easily. > For the later systemtap can be used. I mean we now have the > possibility to instrument the kernel at runtime, without bloating the > source. Yes, we can use systemtap to get the data we need. But systemtap is not included in kernel and we must maintain systemtap scripts to follow kernel modification. By adding tracepoint here, we can get useful data via ftrace/perf without any instruments which is not included in kernel. Of course, systemtap can insert a probe with this tracepoint too. > Anyway: is 63e03724b51 not suitable to gather the required information > easily? We use trace_kfree_skb() which 63e03724b51 uses to detect packet drop event. In addition to that, we would like to detect errors in TCP layer for better trouble shooting. Regards, Satoru |
|
From: Satoru M. <sat...@hd...> - 2011-12-20 18:13:20
|
On 12/16/2011 07:17 PM, Stephen Hemminger wrote: > >> Sometimes network packets are dropped for some reason. In enterprise >> systems which require strict RAS functionality, we must know the >> reason why it happened and explain it to our customers even if using >> TCP. When we investigate the incidents, at first we try to find out >> whether the problem is in the server(kernel, application) or else >> (router, hub etc). And next we try to find out which layer >> (application/middleware/kernel(IP/TCP/UDP/..)etc.) the problem >> occurs. > > I feel sorry for you, your users don't understand TCP. TCP > intentionally induces loss to measure capacity. This is one of the > fundamental principles of loss based congestion control. Maybe my explanation was not enough, I think... Yes, as you said above, TCP induces loss in principle. Actually, customers doesn't think that is a problem. But, at the same time packet drop occurs everywhere in network due to BUG, wrong configuration, broken hardware and/or etc. and sometimes it causes serious problems to customers' system. We provide Linux support service in our business and when a serious problem occurs, we collect and analyze logs and find where the problem is. With this tracepoint, we're able to know whether the problem is in OS or somewhere else. (Negative return value means packet drop happened inside OS.) That's a great help for us narrow down the root cause of the problem. I'll rewrite the cover letter when I post v2. Regards, Satoru |
|
From: Satoru M. <sat...@hd...> - 2011-12-20 16:56:12
|
On 12/19/2011 07:10 AM, David Laight wrote: > >> This is just a cleanup patch for making easy to hook return value >> with a tracepoint. > > Another way to temporarily add such a trace, is to temporarily > add a wrapper function. I'm afraid that I don't know how to add a wrapper function temporarily. Could you please tell me how to do that? Jprobe? Regards, Satoru |
|
From: Seiji A. <sei...@hd...> - 2011-12-19 23:37:51
|
Hi, Thank you for coordinating discussion. >Hitachi's whole goal behind >kmsg_dump is to record every possible reason for rebooting the machine for >inspection on reboot, I think if I understood Seiji privately. Exactly. >Now we have a scenario (though rare), where kmsg_dump can fail to save a >dump because someone already has the lock. >Seiji, has does this failure impact kmsg_dump's goal? When we use EFI-enabled system, we can avoid this failure because ,as you can see below(spec of EFI 2.3), kernel in NMI can call set_variable runtime service even if previous call is busy. <snip> 7.1 Runtime Services Rules and Restrictions 7.1.1 Exception for Machine Check, INIT, and NMI. . . . If the OS determines that the Machine Check is non-recoverable, the OS level handler may ignore the normal restrictions and may invoke the runtime services described in Table 31 even in the case where a previous call was busy. The system firmware will honor the new runtime service call(s) and the operation of the previous interrupted call is not guaranteed. Any interrupted runtime functions will not be restarted. The INIT and NMI events follow the same restrictions. <snip> So, next steps we should take are as follows. - Fix efi_pstore so "efivars->lock" in efi_pstore_write() skips in panic path. - Fix smp_send_stop() (or introduce new function call) so that kmsg_dump(KMSG_DUMP_PANIC) is always called in NMI. Currently, panicked cpu sends NMI to all cpus other than itself in smp_send_stop() (with Don's patch). My suggestion is that panicked cpu sends NMI to itself and then it calls kmsg_dump(KMSG_DUMP_PANIC) in NMI. Seiji |
|
From: Don Z. <dz...@re...> - 2011-12-19 21:01:15
|
On Mon, Dec 12, 2011 at 10:48:52AM -0800, Luck, Tony wrote:
> > if (reason == KMSG_DUMP_PANIC) {
> > if(is_spin_locked(&psinfo->buf_lock))
> > pr_err("lock is taken.\n");
> > else {
> > spin_lock_irqsave(&psinfo->buf_lock, flags);
> > }
> >
> > However, this won't work for this reason.
> > - printk() must not be called in serialized path because deadlock of logbuf_lock may cause.
>
> There is also the issue that kmsg has already computed the addresses
> of useful pieces in __log_buf at this point - so any printk() we
> add here is not going to be saved into pstore. So a user relying
> on pstore to see what happened, won't see any messages we add here.
Ah. Good point.
Now we have a scenario (though rare), where kmsg_dump can fail to save a
dump because some one already has the lock. Hitachi's whole goal behind
kmsg_dump is to record every possible reason for rebooting the machine for
inspection on reboot, I think if I understood Seiji privately.
Seiji, has does this failure impact kmsg_dump's goal?
Cheers,
Don
|
|
From: David L. <Dav...@AC...> - 2011-12-19 12:11:50
|
> This is just a cleanup patch for making easy to hook return value > with a tracepoint. Another way to temporarily add such a trace, is to temporarily add a wrapper function. David |
|
From: Huang Y. <yin...@in...> - 2011-12-19 01:24:38
|
On Fri, 2011-12-16 at 00:03 +0800, Seiji Aguchi wrote: > Hi, > > > > >No. I have different understanding. > > > >Because error log address range resides in NVRAM, the contents will be > >reserved even after reboot. So we do not need read/clear operations at > >all, and should place all records in error log address range. > > I have a quick question about your comment above. > Do you know whether this optimization of read/clear operations works > on machines which APEI is enabled by WHEA _OSC call? I have no machine with !!(erst_erange.attr & ERST_RANGE_NVRAM) == 1 Do you have? Best Regards, Huang Ying |