dle-develop Mailing List for Dependable Linux Effort (Page 31)

Brought to you by: h-aoki, hiramatu, s-oshima, y-sugita

dle-develop — Dependable Linux Effort mailing list for developers

You can subscribe to this list here.

2009	Jan	Feb	Mar	Apr	May (32)	Jun (66)	Jul (102)	Aug (78)	Sep (106)	Oct (137)	Nov (147)	Dec (147)
2010	Jan (71)	Feb (139)	Mar (86)	Apr (76)	May (57)	Jun (10)	Jul (12)	Aug (6)	Sep (8)	Oct (12)	Nov (12)	Dec (18)
2011	Jan (16)	Feb (19)	Mar (3)	Apr (1)	May (16)	Jun (17)	Jul (74)	Aug (22)	Sep (18)	Oct (24)	Nov (21)	Dec (30)
2012	Jan (31)	Feb (16)	Mar (22)	Apr (25)	May (18)	Jun (13)	Jul (83)	Aug (49)	Sep (20)	Oct (60)	Nov (35)	Dec (28)
2013	Jan (39)	Feb (61)	Mar (35)	Apr (21)	May (45)	Jun (56)	Jul (20)	Aug (9)	Sep (10)	Oct (31)	Nov (8)	Dec (4)
2014	Jan (6)	Feb (7)	Mar (7)	Apr (6)	May (4)	Jun (8)	Jul (5)	Aug (2)	Sep (4)	Oct (4)	Nov (11)	Dec (5)
2015	Jan (4)	Feb (4)	Mar (3)	Apr (4)	May (9)	Jun (4)	Jul (15)	Aug (8)	Sep (16)	Oct (18)	Nov (15)	Dec (7)
2016	Jan (20)	Feb (9)	Mar (15)	Apr (24)	May (16)	Jun (28)	Jul (22)	Aug (23)	Sep (18)	Oct (30)	Nov (40)	Dec (9)
2017	Jan (1)	Feb (8)	Mar (37)	Apr (26)	May (25)	Jun (46)	Jul (24)	Aug (9)	Sep	Oct	Nov	Dec

Flat | Threaded

<< < 1 .. 29 30 31 32 33 .. 99 > >> (Page 31 of 99)

Re: [Dle-develop] [RFC][PATCH 2/2] efi_pstore: add a sequence counter to a variable name

From: Mike W. <mi...@go...> - 2012-10-18 23:48:08

On Fri, Oct 5, 2012 at 12:02 PM, Seiji Aguchi <sei...@hd...> wrote:
> [Problem]
>
> Currently, a variable name, which distinguishes each entry, consists of type, id and ctime.
> But if multiple events happens in a short time, a second/third event may fail to log because
> efi_pstore can't distinguish each event with current variable name.
>
> [Solution]
>
>   A reasonable way to identify all events precisely is introducing a sequence counter to
>   the variable name.
>
> [Patch Description]
>
>   The sequence counter has already supported in a pstore layer with "oopscount".
>   So, this patch adds it to a variable name.
>   Also, it is passed to read/erase callbacks of platform drivers in accordance with
>   the modification of the variable name.
>
>   <before applying this patch>
>  a variable name of first event: dump-type0-1-12345678
>  a variable name of second event: dump-type0-1-12345678
>
>   type:0
>   id:1
>   ctime:12345678
>
>  If multiple events happen in a short time, efi_pstore can't distinguish them because
>  variable names are same among them.
>
>   <after applying this patch>
>
>  it can be distinguishable by adding a sequence counter as follows.
>
>  a variable name of first event: dump-type0-1-1-12345678
>  a variable name of Second event: dump-type0-1-2-12345678
>
>   type:0
>   id:1
>   sequence counter: 1(first event), 2(second event)
>   ctime:12345678

This patch plumbs the count through the same way as you plumbed the
ctime through in patch 1/2, so I'm going to hold off on commenting on
this patch until we better understand i_ctime use is resolved there.
I suspect that we don't want to continue adding more arguments through
pstore ops this way.

Re: [Dle-develop] [RFC][PATCH 1/2] efi_pstore: hold multiple logs

From: Mike W. <mi...@go...> - 2012-10-18 23:42:20

Resending as text.  See below.

On Fri, Oct 5, 2012 at 12:01 PM, Seiji Aguchi <sei...@hd...> wrote:
> [Problem]
>
>   Currently, efi_pstore driver simply overwrites existing panic messages in NVRAM.
>   So, in the following scenario, we will lose 1st panic messages.
>
>     1. kernel panics.
>     2. efi_pstore is kicked and write panic messages to NVRAM.
>     3. system reboots.
>     4. kernel panics again before a user checks the 1st panic messages in NVRAM.
>
> [Solution]
>
>   This patch takes an approach just holding multiple logs to avoid overwriting existing
>   entries.
>
> [Patch Description]
>
>   This patch changes write/erase/read callbacks as follows.
>
>   - Write callback
>     - Check if there are enough spaces to write logs with QueryVariableInfo()
>       to avoid handling out of space situation. (It is suggested by Matthew Garrett.)
>

I would prefer to see the exposing query_variable_info callback as a
separate patch.  The check in the patch looks good.

>     - Remove a logic erasing existing entries.
>
>   - Erase callback
>     - Freshly create a logic rather than sharing it with a write callback because
>       erasing entries doesn't need in a write callback.
>

This sort of change is a lot easier to review if you copy and paste
the routine in a patch separate from this one.

>     - Add ctime to an argument.
>       Currently, a variable name, which is used to distinguish each log entry, consists of type,
>       id and ctime. But an erase callback does not use ctime.

I do not like the way this is plumbed. See below.

>
>       If efi_pstore supported just one log, type and id were enough.
>
>       On the other hand, in case of supporting multiple logs, ctime is needed to precisely
>       distinguish each entry at erasing time.
>       Therefore this patch adds ctime to an argument of an erase callback is needed.
>
>    - Read callback
>     - Add ctime to an argument.
>        This change is needed to identify the entry at reading time.
>
>   <Example>
>
>  As you can see below, we can distinguish first and second events with ctime.
>
>  a variable name of first event: dump-type0-1-12345678
>  a variable name of second event: dump-type0-1-23456789
>
>   type:0
>   id:1
>   ctime:12345678, 23456789
>
> Signed-off-by: Seiji Aguchi <sei...@hd...>
> ---
>  drivers/acpi/apei/erst.c   |    4 +-
>  drivers/firmware/efivars.c |   87 ++++++++++++++++++++++++++++----------------
>  fs/pstore/inode.c          |    3 +-
>  fs/pstore/ram.c            |    2 +-
>  include/linux/efi.h        |    1 +
>  include/linux/pstore.h     |    2 +-
>  6 files changed, 63 insertions(+), 36 deletions(-)
>
> diff --git a/drivers/acpi/apei/erst.c b/drivers/acpi/apei/erst.c
> index e4d9d24..0bd6ae4 100644
> --- a/drivers/acpi/apei/erst.c
> +++ b/drivers/acpi/apei/erst.c
> @@ -938,7 +938,7 @@ static int erst_writer(enum pstore_type_id type, enum kmsg_dump_reason reason,
>                        u64 *id, unsigned int part,
>                        size_t size, struct pstore_info *psi);
>  static int erst_clearer(enum pstore_type_id type, u64 id,
> -                       struct pstore_info *psi);
> +                       struct timespec time, struct pstore_info *psi);
>
>  static struct pstore_info erst_info = {
>         .owner          = THIS_MODULE,
> @@ -1102,7 +1102,7 @@ static int erst_writer(enum pstore_type_id type, enum kmsg_dump_reason reason,
>  }
>
>  static int erst_clearer(enum pstore_type_id type, u64 id,
> -                       struct pstore_info *psi)
> +                       struct timespec time, struct pstore_info *psi)
>  {
>         return erst_clear(id);
>  }
> diff --git a/drivers/firmware/efivars.c b/drivers/firmware/efivars.c
> index d10c987..0eaaba3 100644
> --- a/drivers/firmware/efivars.c
> +++ b/drivers/firmware/efivars.c
> @@ -701,23 +701,70 @@ static int efi_pstore_write(enum pstore_type_id type,
>                 unsigned int part, size_t size, struct pstore_info *psi)
>  {
>         char name[DUMP_NAME_LEN];
> -       char stub_name[DUMP_NAME_LEN];
>         efi_char16_t efi_name[DUMP_NAME_LEN];
>         efi_guid_t vendor = LINUX_EFI_CRASH_GUID;
>         struct efivars *efivars = psi->data;
> -       struct efivar_entry *entry, *found = NULL;
>         int i, ret = 0;
> +       u64 storage_space, remaining_space, max_variable_size;
> +       efi_status_t status = EFI_NOT_FOUND;
> +
> +       spin_lock(&efivars->lock);
> +
> +       /*
> +        * Check if there is a space enough to log.
> +        * size: a size of logging data
> +        * DUMP_NAME_LEN * 2: a maximum size of variable name
> +        *

extra line

> +        */
> +       status = efivars->ops->query_variable_info(PSTORE_EFI_ATTRIBUTES,
> +                                                  &storage_space,
> +                                                  &remaining_space,
> +                                                  &max_variable_size);
> +       if (status || remaining_space < size + DUMP_NAME_LEN * 2) {
> +               spin_unlock(&efivars->lock);
> +               *id = part;
> +               return -ENOSPC;
> +       }
> +
> +       sprintf(name, "dump-type%u-%u-%lu", type, part, get_seconds());
> +
> +       for (i = 0; i < DUMP_NAME_LEN; i++)
> +               efi_name[i] = name[i];
> +
> +       efivars->ops->set_variable(efi_name, &vendor, PSTORE_EFI_ATTRIBUTES,
> +                                  size, psi->buf);
> +
> +       spin_unlock(&efivars->lock);
> +
> +       if (size)
> +               ret = efivar_create_sysfs_entry(efivars,
> +                                         utf16_strsize(efi_name,
> +                                                       DUMP_NAME_LEN * 2),
> +                                         efi_name, &vendor);

What is happening here?  Why is size checked for non-zero?

> +
> +       *id = part;
> +       return ret;
> +};
> +
> +static int efi_pstore_erase(enum pstore_type_id type, u64 id,
> +                           struct timespec time, struct pstore_info *psi)
> +{
> +       char name[DUMP_NAME_LEN];
> +       efi_char16_t efi_name[DUMP_NAME_LEN];
> +       efi_guid_t vendor = LINUX_EFI_CRASH_GUID;
> +       struct efivars *efivars = psi->data;
> +       struct efivar_entry *entry, *found = NULL;
> +       int i;
>
> -       sprintf(stub_name, "dump-type%u-%u-", type, part);
> -       sprintf(name, "%s%lu", stub_name, get_seconds());
> +       sprintf(name, "dump-type%u-%llu-%lu", type, id, time.tv_sec);
>
>         spin_lock(&efivars->lock);
>
>         for (i = 0; i < DUMP_NAME_LEN; i++)
> -               efi_name[i] = stub_name[i];
> +               efi_name[i] = name[i];
>
>         /*
> -        * Clean up any entries with the same name
> +        * Clean up an entry with the same name
>          */
>
>         list_for_each_entry(entry, &efivars->list, list) {
> @@ -728,9 +775,6 @@ static int efi_pstore_write(enum pstore_type_id type,
>                 if (utf16_strncmp(entry->var.VariableName, efi_name,
>                                   utf16_strlen(efi_name)))
>                         continue;
> -               /* Needs to be a prefix */
> -               if (entry->var.VariableName[utf16_strlen(efi_name)] == 0)
> -                       continue;
>
>                 /* found */
>                 found = entry;
> @@ -738,37 +782,17 @@ static int efi_pstore_write(enum pstore_type_id type,
>                                            &entry->var.VendorGuid,
>                                            PSTORE_EFI_ATTRIBUTES,
>                                            0, NULL);
> +               break;
>         }
>
>         if (found)
>                 list_del(&found->list);
>
> -       for (i = 0; i < DUMP_NAME_LEN; i++)
> -               efi_name[i] = name[i];
> -
> -       efivars->ops->set_variable(efi_name, &vendor, PSTORE_EFI_ATTRIBUTES,
> -                                  size, psi->buf);
> -
>         spin_unlock(&efivars->lock);
>
>         if (found)
>                 efivar_unregister(found);
>
> -       if (size)
> -               ret = efivar_create_sysfs_entry(efivars,
> -                                         utf16_strsize(efi_name,
> -                                                       DUMP_NAME_LEN * 2),
> -                                         efi_name, &vendor);
> -
> -       *id = part;
> -       return ret;
> -};
> -
> -static int efi_pstore_erase(enum pstore_type_id type, u64 id,
> -                           struct pstore_info *psi)
> -{
> -       efi_pstore_write(type, 0, &id, (unsigned int)id, 0, psi);
> -
>         return 0;
>  }
>  #else
> @@ -797,7 +821,7 @@ static int efi_pstore_write(enum pstore_type_id type,
>  }
>
>  static int efi_pstore_erase(enum pstore_type_id type, u64 id,
> -                           struct pstore_info *psi)
> +                           struct timespec time, struct pstore_info *psi)
>  {
>         return 0;
>  }
> @@ -1237,6 +1261,7 @@ efivars_init(void)
>         ops.get_variable = efi.get_variable;
>         ops.set_variable = efi.set_variable;
>         ops.get_next_variable = efi.get_next_variable;
> +       ops.query_variable_info = efi.query_variable_info;
>         error = register_efivars(&__efivars, &ops, efi_kobj);
>         if (error)
>                 goto err_put;
> diff --git a/fs/pstore/inode.c b/fs/pstore/inode.c
> index 4ab572e..4300af6 100644
> --- a/fs/pstore/inode.c
> +++ b/fs/pstore/inode.c
> @@ -175,7 +175,8 @@ static int pstore_unlink(struct inode *dir, struct dentry *dentry)
>         struct pstore_private *p = dentry->d_inode->i_private;
>
>         if (p->psi->erase)
> -               p->psi->erase(p->type, p->id, p->psi);
> +               p->psi->erase(p->type, p->id, dentry->d_inode->i_ctime,
> +                             p->psi);

This doesn't look right.  What guarantees that the i_ctime means
anything across reboots?

>
>         return simple_unlink(dir, dentry);
>  }
> diff --git a/fs/pstore/ram.c b/fs/pstore/ram.c
> index 0b311bc..177b281 100644
> --- a/fs/pstore/ram.c
> +++ b/fs/pstore/ram.c
> @@ -237,7 +237,7 @@ static int ramoops_pstore_write_buf(enum pstore_type_id type,
>  }
>
>  static int ramoops_pstore_erase(enum pstore_type_id type, u64 id,
> -                               struct pstore_info *psi)
> +                               timespec time, struct pstore_info *psi)
>  {
>         struct ramoops_context *cxt = psi->data;
>         struct persistent_ram_zone *prz;
> diff --git a/include/linux/efi.h b/include/linux/efi.h
> index 8670eb1..c47ec36 100644
> --- a/include/linux/efi.h
> +++ b/include/linux/efi.h
> @@ -643,6 +643,7 @@ struct efivar_operations {
>         efi_get_variable_t *get_variable;
>         efi_get_next_variable_t *get_next_variable;
>         efi_set_variable_t *set_variable;
> +       efi_query_variable_info_t *query_variable_info;
>  };
>
>  struct efivars {
> diff --git a/include/linux/pstore.h b/include/linux/pstore.h
> index c892587..71f43e0 100644
> --- a/include/linux/pstore.h
> +++ b/include/linux/pstore.h
> @@ -60,7 +60,7 @@ struct pstore_info {
>                         unsigned int part, const char *buf, size_t size,
>                         struct pstore_info *psi);
>         int             (*erase)(enum pstore_type_id type, u64 id,
> -                       struct pstore_info *psi);
> +                       struct timespec time, struct pstore_info *psi);
>         void            *data;
>  };
>
> -- 1.7.1

Re: [Dle-develop] [RFC][PATCH 1/2] efi_pstore: hold multiple logs

From: Mike W. <mi...@go...> - 2012-10-18 23:11:19

On Fri, Oct 5, 2012 at 12:01 PM, Seiji Aguchi <sei...@hd...> wrote:

> [Problem]
>
>   Currently, efi_pstore driver simply overwrites existing panic messages
> in NVRAM.
>   So, in the following scenario, we will lose 1st panic messages.
>
>     1. kernel panics.
>     2. efi_pstore is kicked and write panic messages to NVRAM.
>     3. system reboots.
>     4. kernel panics again before a user checks the 1st panic messages in
> NVRAM.
>
> [Solution]
>
>   This patch takes an approach just holding multiple logs to avoid
> overwriting existing
>   entries.
>
> [Patch Description]
>
>   This patch changes write/erase/read callbacks as follows.
>
>   - Write callback
>     - Check if there are enough spaces to write logs with
> QueryVariableInfo()
>       to avoid handling out of space situation. (It is suggested by
> Matthew Garrett.)
>

I would prefer to see the exposing query_variable_info callback as a
separate patch.  The check in the patch looks good.


>
>     - Remove a logic erasing existing entries.
>
>   - Erase callback
>     - Freshly create a logic rather than sharing it with a write callback
> because
>       erasing entries doesn't need in a write callback.
>

This sort of change is a lot easier to review if you copy and paste the
routine in a patch separate from this one.


>     - Add ctime to an argument.
>       Currently, a variable name, which is used to distinguish each log
> entry, consists of type,
>       id and ctime. But an erase callback does not use ctime.
>

I do not like the way this is plumbed. See below.

      If efi_pstore supported just one log, type and id were enough.
>
>       On the other hand, in case of supporting multiple logs, ctime is
> needed to precisely
>       distinguish each entry at erasing time.
>       Therefore this patch adds ctime to an argument of an erase callback
> is needed.
>
>    - Read callback
>     - Add ctime to an argument.
>        This change is needed to identify the entry at reading time.
>
>   <Example>
>
>  As you can see below, we can distinguish first and second events with
> ctime.
>
>  a variable name of first event: dump-type0-1-12345678
>  a variable name of second event: dump-type0-1-23456789
>
>   type:0
>   id:1
>   ctime:12345678, 23456789
>
> Signed-off-by: Seiji Aguchi <sei...@hd...>
> ---
>  drivers/acpi/apei/erst.c   |    4 +-
>  drivers/firmware/efivars.c |   87
> ++++++++++++++++++++++++++++----------------
>  fs/pstore/inode.c          |    3 +-
>  fs/pstore/ram.c            |    2 +-
>  include/linux/efi.h        |    1 +
>  include/linux/pstore.h     |    2 +-
>  6 files changed, 63 insertions(+), 36 deletions(-)
>
> diff --git a/drivers/acpi/apei/erst.c b/drivers/acpi/apei/erst.c
> index e4d9d24..0bd6ae4 100644
> --- a/drivers/acpi/apei/erst.c
> +++ b/drivers/acpi/apei/erst.c
> @@ -938,7 +938,7 @@ static int erst_writer(enum pstore_type_id type, enum
> kmsg_dump_reason reason,
>                        u64 *id, unsigned int part,
>                        size_t size, struct pstore_info *psi);
>  static int erst_clearer(enum pstore_type_id type, u64 id,
> -                       struct pstore_info *psi);
> +                       struct timespec time, struct pstore_info *psi);
>
>  static struct pstore_info erst_info = {
>         .owner          = THIS_MODULE,
> @@ -1102,7 +1102,7 @@ static int erst_writer(enum pstore_type_id type,
> enum kmsg_dump_reason reason,
>  }
>
>  static int erst_clearer(enum pstore_type_id type, u64 id,
> -                       struct pstore_info *psi)
> +                       struct timespec time, struct pstore_info *psi)
>  {
>         return erst_clear(id);
>  }
> diff --git a/drivers/firmware/efivars.c b/drivers/firmware/efivars.c
> index d10c987..0eaaba3 100644
> --- a/drivers/firmware/efivars.c
> +++ b/drivers/firmware/efivars.c
> @@ -701,23 +701,70 @@ static int efi_pstore_write(enum pstore_type_id type,
>                 unsigned int part, size_t size, struct pstore_info *psi)
>  {
>         char name[DUMP_NAME_LEN];
> -       char stub_name[DUMP_NAME_LEN];
>         efi_char16_t efi_name[DUMP_NAME_LEN];
>         efi_guid_t vendor = LINUX_EFI_CRASH_GUID;
>         struct efivars *efivars = psi->data;
> -       struct efivar_entry *entry, *found = NULL;
>         int i, ret = 0;
> +       u64 storage_space, remaining_space, max_variable_size;
> +       efi_status_t status = EFI_NOT_FOUND;
> +
> +       spin_lock(&efivars->lock);
> +
> +       /*
> +        * Check if there is a space enough to log.
> +        * size: a size of logging data
> +        * DUMP_NAME_LEN * 2: a maximum size of variable name
> +        *
>
extra line


> +        */
> +       status = efivars->ops->query_variable_info(PSTORE_EFI_ATTRIBUTES,
> +                                                  &storage_space,
> +                                                  &remaining_space,
> +                                                  &max_variable_size);
> +       if (status || remaining_space < size + DUMP_NAME_LEN * 2) {
> +               spin_unlock(&efivars->lock);
> +               *id = part;
> +               return -ENOSPC;
> +       }
> +
> +       sprintf(name, "dump-type%u-%u-%lu", type, part, get_seconds());
> +
> +       for (i = 0; i < DUMP_NAME_LEN; i++)
> +               efi_name[i] = name[i];
> +
> +       efivars->ops->set_variable(efi_name, &vendor,
> PSTORE_EFI_ATTRIBUTES,
> +                                  size, psi->buf);
> +
> +       spin_unlock(&efivars->lock);
> +
> +       if (size)
> +               ret = efivar_create_sysfs_entry(efivars,
> +                                         utf16_strsize(efi_name,
> +                                                       DUMP_NAME_LEN * 2),
> +                                         efi_name, &vendor);
>

What is happening here?  Why is size checked for non-zero?


> +
> +       *id = part;
> +       return ret;
> +};
> +
> +static int efi_pstore_erase(enum pstore_type_id type, u64 id,
> +                           struct timespec time, struct pstore_info *psi)
> +{
> +       char name[DUMP_NAME_LEN];
> +       efi_char16_t efi_name[DUMP_NAME_LEN];
> +       efi_guid_t vendor = LINUX_EFI_CRASH_GUID;
> +       struct efivars *efivars = psi->data;
> +       struct efivar_entry *entry, *found = NULL;
> +       int i;
>
> -       sprintf(stub_name, "dump-type%u-%u-", type, part);
> -       sprintf(name, "%s%lu", stub_name, get_seconds());
> +       sprintf(name, "dump-type%u-%llu-%lu", type, id, time.tv_sec);
>
>         spin_lock(&efivars->lock);
>
>         for (i = 0; i < DUMP_NAME_LEN; i++)
> -               efi_name[i] = stub_name[i];
> +               efi_name[i] = name[i];
>
>         /*
> -        * Clean up any entries with the same name
> +        * Clean up an entry with the same name
>          */
>
>         list_for_each_entry(entry, &efivars->list, list) {
> @@ -728,9 +775,6 @@ static int efi_pstore_write(enum pstore_type_id type,
>                 if (utf16_strncmp(entry->var.VariableName, efi_name,
>                                   utf16_strlen(efi_name)))
>                         continue;
> -               /* Needs to be a prefix */
> -               if (entry->var.VariableName[utf16_strlen(efi_name)] == 0)
> -                       continue;
>
>                 /* found */
>                 found = entry;
> @@ -738,37 +782,17 @@ static int efi_pstore_write(enum pstore_type_id type,
>                                            &entry->var.VendorGuid,
>                                            PSTORE_EFI_ATTRIBUTES,
>                                            0, NULL);
> +               break;
>         }
>
>         if (found)
>                 list_del(&found->list);
>
> -       for (i = 0; i < DUMP_NAME_LEN; i++)
> -               efi_name[i] = name[i];
> -
> -       efivars->ops->set_variable(efi_name, &vendor,
> PSTORE_EFI_ATTRIBUTES,
> -                                  size, psi->buf);
> -
>         spin_unlock(&efivars->lock);
>
>         if (found)
>                 efivar_unregister(found);
>
> -       if (size)
> -               ret = efivar_create_sysfs_entry(efivars,
> -                                         utf16_strsize(efi_name,
> -                                                       DUMP_NAME_LEN * 2),
> -                                         efi_name, &vendor);
> -
> -       *id = part;
> -       return ret;
> -};
> -
> -static int efi_pstore_erase(enum pstore_type_id type, u64 id,
> -                           struct pstore_info *psi)
> -{
> -       efi_pstore_write(type, 0, &id, (unsigned int)id, 0, psi);
> -
>         return 0;
>  }
>  #else
> @@ -797,7 +821,7 @@ static int efi_pstore_write(enum pstore_type_id type,
>  }
>
>  static int efi_pstore_erase(enum pstore_type_id type, u64 id,
> -                           struct pstore_info *psi)
> +                           struct timespec time, struct pstore_info *psi)
>  {
>         return 0;
>  }
> @@ -1237,6 +1261,7 @@ efivars_init(void)
>         ops.get_variable = efi.get_variable;
>         ops.set_variable = efi.set_variable;
>         ops.get_next_variable = efi.get_next_variable;
> +       ops.query_variable_info = efi.query_variable_info;
>         error = register_efivars(&__efivars, &ops, efi_kobj);
>         if (error)
>                 goto err_put;
> diff --git a/fs/pstore/inode.c b/fs/pstore/inode.c
> index 4ab572e..4300af6 100644
> --- a/fs/pstore/inode.c
> +++ b/fs/pstore/inode.c
> @@ -175,7 +175,8 @@ static int pstore_unlink(struct inode *dir, struct
> dentry *dentry)
>         struct pstore_private *p = dentry->d_inode->i_private;
>
>         if (p->psi->erase)
> -               p->psi->erase(p->type, p->id, p->psi);
> +               p->psi->erase(p->type, p->id, dentry->d_inode->i_ctime,
> +                             p->psi);
>

This doesn't look right.  What guarantees that the i_ctime means anything
across reboots?


>
>         return simple_unlink(dir, dentry);
>  }
> diff --git a/fs/pstore/ram.c b/fs/pstore/ram.c
> index 0b311bc..177b281 100644
> --- a/fs/pstore/ram.c
> +++ b/fs/pstore/ram.c
> @@ -237,7 +237,7 @@ static int ramoops_pstore_write_buf(enum
> pstore_type_id type,
>  }
>
>  static int ramoops_pstore_erase(enum pstore_type_id type, u64 id,
> -                               struct pstore_info *psi)
> +                               timespec time, struct pstore_info *psi)
>
struct


>  {
>         struct ramoops_context *cxt = psi->data;
>         struct persistent_ram_zone *prz;
> diff --git a/include/linux/efi.h b/include/linux/efi.h
> index 8670eb1..c47ec36 100644
> --- a/include/linux/efi.h
> +++ b/include/linux/efi.h
> @@ -643,6 +643,7 @@ struct efivar_operations {
>         efi_get_variable_t *get_variable;
>         efi_get_next_variable_t *get_next_variable;
>         efi_set_variable_t *set_variable;
> +       efi_query_variable_info_t *query_variable_info;
>  };
>
>  struct efivars {
> diff --git a/include/linux/pstore.h b/include/linux/pstore.h
> index c892587..71f43e0 100644
> --- a/include/linux/pstore.h
> +++ b/include/linux/pstore.h
> @@ -60,7 +60,7 @@ struct pstore_info {
>                         unsigned int part, const char *buf, size_t size,
>                         struct pstore_info *psi);
>         int             (*erase)(enum pstore_type_id type, u64 id,
> -                       struct pstore_info *psi);
> +                       struct timespec time, struct pstore_info *psi);
>         void            *data;
>  };
>
> -- 1.7.1
>

Re: [Dle-develop] [RFC][PATCH v5]trace, x86: add x86 irq vector tracepoints

From: Seiji A. <sei...@hd...> - 2012-10-18 18:41:48

Any comment?

Seiji

> -----Original Message-----
> From: Seiji Aguchi
> Sent: Thursday, October 11, 2012 1:25 PM
> To: 'H. Peter Anvin'; 'Steven Rostedt'
> Cc: 'Thomas Gleixner (tg...@li...)'; 'lin...@vg...'; ''mi...@el...' (mi...@el...)'; 'x8...@ke...'; 'dle-
> de...@li...'; Satoru Moriya; 'Borislav Petkov'
> Subject: [RFC][PATCH v5]trace,x86: add x86 irq vector tracepoints
> 
> Change log
> 
>  v4 -> v5
>  - Rebased to 3.6.0
> 
>  - Introduce a logic switching IDT at enabling/disabling TP time
>    so that a time penalty makes a zero when tracepoints are disabled.
>    This IDT is created only when CONFIG_TRACEPOINTS is enabled.
> 
>  - Remove arch_irq_vector_entry/exit and add followings again
>    so that we can add each tracepoint in a generic way.
>    - error_apic_vector
>    - thermal_apic_vector
>    - threshold_apic_vector
>    - spurious_apic_vector
>    - x86_platform_ipi_vector
> 
>  - Drop nmi tracepoints to begin with apic interrupts and discuss a logic switching
>    IDT first.
> 
>  - Move irq_vectors.h in the directory of arch/x86/include/asm/trace because
>    I'm not sure if a logic switching IDT is sharable with other architectures.
> 
>  v3 -> v4
>  - Add a latency measurement of each tracepoint
>  - Rebased to 3.6-rc6
> 
>  v2 -> v3
>  - Remove an invalidate_tlb_vector event because it was replaced by a call function vector
>    in a following commit.
>    http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=52aec3308db85f4e9f5c8b9f5dc4fbd0138c6fa4
> 
>  v1 -> v2
>  - Modify variable name from irq to vector.
>  - Merge arch-specific tracepoints below to an arch_irq_vector_entry/exit.
>    - error_apic_vector
>    - thermal_apic_vector
>    - threshold_apic_vector
>    - spurious_apic_vector
>    - x86_platform_ipi_vector
> 
> [Purpose of this patch]
> 
> As Vaibhav explained in the thread below, tracepoints for irq vectors are useful.
> 
> http://www.spinics.net/lists/mm-commits/msg85707.html
> 
> <snip>
> The current interrupt traces from irq_handler_entry and irq_handler_exit provide when an interrupt is handled.  They provide good
> data about when the system has switched to kernel space and how it affects the currently running processes.
> 
> There are some IRQ vectors which trigger the system into kernel space, which are not handled in generic IRQ handlers.  Tracing such
> events gives us the information about IRQ interaction with other system events.
> 
> The trace also tells where the system is spending its time.  We want to know which cores are handling interrupts and how they are
> affecting other processes in the system.  Also, the trace provides information about when the cores are idle and which interrupts are
> changing that state.
> <snip>
> 
> On the other hand, my usecase is tracing just local timer event and getting a value of instruction pointer.
> 
>   I suggested to add an argument local timer event to get instruction pointer before.
>   But there is another way to get it with external module like systemtap.
>   So, I don't need to add any argument to irq vector tracepoints now.
> 
> [Patch Description]
> 
> Vaibhav's patch shared a trace point ,irq_vector_entry/irq_vector_exit, in all events.
> But there is an above use case to trace specific irq_vector rather than tracing all events.
> In this case, we are concerned about overhead due to unwanted events.
> 
> This patch adds following tracepoints instead of introducing irq_vector_entry/exit.
> so that we can enable them independently.
>    - local_timer_vector
>    - reschedule_vector
>    - call_function_vector
>    - call_function_single_vector
>    - irq_work_entry_vector
>    - error_apic_vector
>    - thermal_apic_vector
>    - threshold_apic_vector
>    - spurious_apic_vector
>    - x86_platform_ipi_vector
> 
> Also, it introduces a logic switching IDT at enabling/disabling time so that a time penalty makes a complete zero when tracepoints are
> disabled. Detailed explanations are as follows.
>  - Create new irq handlers inserted tracepoints by using macros.
>  - Create a new IDT, trace_idt_table, at boot time by duplicating original IDT, idt table, and
>    registering the new handers for tracpoints.
>  - Switch IDT to new one at enabling TP time.
>  - Restore to an original IDT at disabling TP time.
> The new IDT is created only when CONFIG_TRACEPOINTS is enabled to avoid being used for other purposes.
> 
> Signed-off-by: Seiji Aguchi <sei...@hd...>
> ---
>  arch/x86/include/asm/desc.h              |   27 +++++
>  arch/x86/include/asm/entry_arch.h        |   32 +++++
>  arch/x86/include/asm/hw_irq.h            |   14 +++
>  arch/x86/include/asm/trace/irq_vectors.h |  153 ++++++++++++++++++++++++
>  arch/x86/kernel/Makefile                 |    1 +
>  arch/x86/kernel/apic/apic.c              |  186 +++++++++++++++++-------------
>  arch/x86/kernel/cpu/mcheck/therm_throt.c |   26 +++--
>  arch/x86/kernel/cpu/mcheck/threshold.c   |   27 +++--
>  arch/x86/kernel/entry_64.S               |   33 ++++++
>  arch/x86/kernel/head_64.S                |    6 +
>  arch/x86/kernel/irq.c                    |   44 ++++---
>  arch/x86/kernel/irq_work.c               |   22 +++-
>  arch/x86/kernel/irqinit.c                |    2 +
>  arch/x86/kernel/smp.c                    |   68 ++++++++----
>  arch/x86/kernel/tracepoint.c             |  102 ++++++++++++++++
>  15 files changed, 600 insertions(+), 143 deletions(-)  create mode 100644 arch/x86/include/asm/trace/irq_vectors.h
>  create mode 100644 arch/x86/kernel/tracepoint.c
> 
> diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h index 8bf1c06..52becf4 100644
> --- a/arch/x86/include/asm/desc.h
> +++ b/arch/x86/include/asm/desc.h
> @@ -345,6 +345,33 @@ static inline void set_intr_gate(unsigned int n, void *addr)
>  	_set_gate(n, GATE_INTERRUPT, addr, 0, 0, __KERNEL_CS);  }
> 
> +#ifdef CONFIG_TRACEPOINTS
> +extern gate_desc trace_idt_table[];
> +extern void trace_idt_table_init(void); static inline void
> +_trace_set_gate(int gate, unsigned type, void *addr,
> +				   unsigned dpl, unsigned ist, unsigned seg) {
> +	gate_desc s;
> +
> +	pack_gate(&s, type, (unsigned long)addr, dpl, ist, seg);
> +	/*
> +	 * does not need to be atomic because it is only done once at
> +	 * setup time
> +	 */
> +	write_idt_entry(trace_idt_table, gate, &s); }
> +
> +static inline void trace_set_intr_gate(unsigned int n, void *addr) {
> +	BUG_ON((unsigned)n > 0xFF);
> +	_trace_set_gate(n, GATE_INTERRUPT, addr, 0, 0, __KERNEL_CS); } #else
> +static inline void trace_idt_table_init(void) { } #endif
> +
>  extern int first_system_vector;
>  /* used_vectors is BITMAP for irq is not managed by percpu vector_irq */  extern unsigned long used_vectors[]; diff --git
> a/arch/x86/include/asm/entry_arch.h b/arch/x86/include/asm/entry_arch.h
> index 40afa00..8ef3900 100644
> --- a/arch/x86/include/asm/entry_arch.h
> +++ b/arch/x86/include/asm/entry_arch.h
> @@ -45,3 +45,35 @@ BUILD_INTERRUPT(threshold_interrupt,THRESHOLD_APIC_VECTOR)
>  #endif
> 
>  #endif
> +
> +#ifdef CONFIG_TRACEPOINTS
> +#ifdef CONFIG_SMP
> +BUILD_INTERRUPT(trace_reschedule_interrupt, RESCHEDULE_VECTOR)
> +BUILD_INTERRUPT(trace_call_function_interrupt, CALL_FUNCTION_VECTOR)
> +BUILD_INTERRUPT(trace_call_function_single_interrupt,
> +		CALL_FUNCTION_SINGLE_VECTOR)
> +#endif
> +
> +BUILD_INTERRUPT(trace_x86_platform_ipi, X86_PLATFORM_IPI_VECTOR)
> +
> +#ifdef CONFIG_X86_LOCAL_APIC
> +
> +BUILD_INTERRUPT(trace_apic_timer_interrupt, LOCAL_TIMER_VECTOR)
> +BUILD_INTERRUPT(trace_error_interrupt, ERROR_APIC_VECTOR)
> +BUILD_INTERRUPT(trace_spurious_interrupt, SPURIOUS_APIC_VECTOR)
> +
> +#ifdef CONFIG_IRQ_WORK
> +BUILD_INTERRUPT(trace_irq_work_interrupt, IRQ_WORK_VECTOR) #endif
> +
> +#ifdef CONFIG_X86_THERMAL_VECTOR
> +BUILD_INTERRUPT(trace_thermal_interrupt, THERMAL_APIC_VECTOR) #endif
> +
> +#ifdef CONFIG_X86_MCE_THRESHOLD
> +BUILD_INTERRUPT(trace_threshold_interrupt, THRESHOLD_APIC_VECTOR)
> +#endif
> +
> +#endif
> +
> +#endif /* CONFIG_TRACEPOINTS */
> diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h index eb92a6e..4472a78 100644
> --- a/arch/x86/include/asm/hw_irq.h
> +++ b/arch/x86/include/asm/hw_irq.h
> @@ -76,6 +76,20 @@ extern void threshold_interrupt(void);  extern void call_function_interrupt(void);  extern void
> call_function_single_interrupt(void);
> 
> +#ifdef CONFIG_TRACEPOINTS
> +/* Interrupt handlers registered during init_IRQ */ extern void
> +trace_apic_timer_interrupt(void);
> +extern void trace_x86_platform_ipi(void); extern void
> +trace_error_interrupt(void); extern void
> +trace_irq_work_interrupt(void); extern void
> +trace_spurious_interrupt(void); extern void
> +trace_thermal_interrupt(void); extern void
> +trace_reschedule_interrupt(void);
> +extern void trace_threshold_interrupt(void); extern void
> +trace_call_function_interrupt(void);
> +extern void trace_call_function_single_interrupt(void);
> +#endif /* CONFIG_TRACEPOINTS */
> +
>  /* IOAPIC */
>  #define IO_APIC_IRQ(x) (((x) >= NR_IRQS_LEGACY) || ((1<<(x)) & io_apic_irqs))  extern unsigned long io_apic_irqs; diff --git
> a/arch/x86/include/asm/trace/irq_vectors.h b/arch/x86/include/asm/trace/irq_vectors.h
> new file mode 100644
> index 0000000..47858f1
> --- /dev/null
> +++ b/arch/x86/include/asm/trace/irq_vectors.h
> @@ -0,0 +1,153 @@
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM irq_vectors
> +
> +#if !defined(_TRACE_IRQ_VECTORS_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_IRQ_VECTORS_H
> +
> +#include <linux/tracepoint.h>
> +
> +extern void trace_irq_vector_regfunc(void); extern void
> +trace_irq_vector_unregfunc(void);
> +
> +#define DECLARE_IRQ_VECTOR_EVENT(name)				\
> +TRACE_EVENT_FN(name,						\
> +	TP_PROTO(int vector),					\
> +								\
> +	TP_ARGS(vector),					\
> +								\
> +	TP_STRUCT__entry(					\
> +		__field(	int,	vector	)		\
> +	),							\
> +								\
> +	TP_fast_assign(						\
> +		__entry->vector = vector;			\
> +	),							\
> +								\
> +	TP_printk("vector=%d", __entry->vector),		\
> +	trace_irq_vector_regfunc, trace_irq_vector_unregfunc	\
> +);
> +
> +/*
> + * local_timer_entry - called before enterring a local timer interrupt
> + * vector handler
> + */
> +DECLARE_IRQ_VECTOR_EVENT(local_timer_entry)
> +
> +/*
> + * local_timer_exit - called immediately after the interrupt vector
> + * handler returns
> + */
> +DECLARE_IRQ_VECTOR_EVENT(local_timer_exit)
> +
> +/*
> + * reschedule_entry - called before enterring a reschedule vector
> +handler  */
> +DECLARE_IRQ_VECTOR_EVENT(reschedule_entry)
> +
> +/*
> + * reschedule_exit - called immediately after the interrupt vector
> + * handler returns
> + */
> +DECLARE_IRQ_VECTOR_EVENT(reschedule_exit)
> +
> +/*
> + * spurious_apic_entry - called before enterring a spurious apic vector
> +handler  */
> +DECLARE_IRQ_VECTOR_EVENT(spurious_apic_entry)
> +
> +/*
> + * spurious_apic_exit - called immediately after the interrupt vector
> + * handler returns
> + */
> +DECLARE_IRQ_VECTOR_EVENT(spurious_apic_exit)
> +
> +/*
> + * error_apic_entry - called before enterring an error apic vector
> +handler  */
> +DECLARE_IRQ_VECTOR_EVENT(error_apic_entry)
> +
> +/*
> + * error_apic_exit - called immediately after the interrupt vector
> + * handler returns
> + */
> +DECLARE_IRQ_VECTOR_EVENT(error_apic_exit)
> +
> +/*
> + * x86_platform_ipi_entry - called before enterring a x86 platform ipi
> +interrupt
> + * vector handler
> + */
> +DECLARE_IRQ_VECTOR_EVENT(x86_platform_ipi_entry)
> +
> +/*
> + * x86_platform_ipi_exit - called immediately after the interrupt
> +vector
> + * handler returns
> + */
> +DECLARE_IRQ_VECTOR_EVENT(x86_platform_ipi_exit)
> +
> +/*
> + * irq_work_entry - called before enterring a irq work interrupt
> + * vector handler
> + */
> +DECLARE_IRQ_VECTOR_EVENT(irq_work_entry)
> +
> +/*
> + * irq_work_exit - called immediately after the interrupt vector
> + * handler returns
> + */
> +DECLARE_IRQ_VECTOR_EVENT(irq_work_exit)
> +
> +/*
> + * call_function_entry - called before enterring a call function
> +interrupt
> + * vector handler
> + */
> +DECLARE_IRQ_VECTOR_EVENT(call_function_entry)
> +
> +/*
> + * call_function_exit - called immediately after the interrupt vector
> + * handler returns
> + */
> +DECLARE_IRQ_VECTOR_EVENT(call_function_exit)
> +
> +/*
> + * call_function_single_entry - called before enterring a call function
> + * single interrupt vector handler
> + */
> +DECLARE_IRQ_VECTOR_EVENT(call_function_single_entry)
> +
> +/*
> + * call_function_single_exit - called immediately after the interrupt
> +vector
> + * handler returns
> + */
> +DECLARE_IRQ_VECTOR_EVENT(call_function_single_exit)
> +
> +/*
> + * threshold_apic_entry - called before enterring a threshold apic
> +interrupt
> + * vector handler
> + */
> +DECLARE_IRQ_VECTOR_EVENT(threshold_apic_entry)
> +
> +/*
> + * threshold_apic_exit - called immediately after the interrupt vector
> + * handler returns
> + */
> +DECLARE_IRQ_VECTOR_EVENT(threshold_apic_exit)
> +
> +/*
> + * thermal_apic_entry - called before enterring a thermal apic
> +interrupt
> + * vector handler
> + */
> +DECLARE_IRQ_VECTOR_EVENT(thermal_apic_entry)
> +
> +/*
> + * thrmal_apic_exit - called immediately after the interrupt vector
> + * handler returns
> + */
> +DECLARE_IRQ_VECTOR_EVENT(thermal_apic_exit)
> +
> +#undef TRACE_INCLUDE_PATH
> +#define TRACE_INCLUDE_PATH ../../arch/x86/include/asm/trace #define
> +TRACE_INCLUDE_FILE irq_vectors #endif /*  _TRACE_IRQ_VECTORS_H */
> +
> +/* This part must be outside protection */ #include
> +<trace/define_trace.h>
> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile index 91ce48f..fe4635d 100644
> --- a/arch/x86/kernel/Makefile
> +++ b/arch/x86/kernel/Makefile
> @@ -100,6 +100,7 @@ obj-$(CONFIG_OF)			+= devicetree.o
>  obj-$(CONFIG_UPROBES)			+= uprobes.o
> 
>  obj-$(CONFIG_PERF_EVENTS)		+= perf_regs.o
> +obj-$(CONFIG_TRACEPOINTS)		+= tracepoint.o
> 
>  ###
>  # 64 bit specific files
> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c index b17416e..abbee29 100644
> --- a/arch/x86/kernel/apic/apic.c
> +++ b/arch/x86/kernel/apic/apic.c
> @@ -55,6 +55,9 @@
>  #include <asm/tsc.h>
>  #include <asm/hypervisor.h>
> 
> +#define CREATE_TRACE_POINTS
> +#include <asm/trace/irq_vectors.h>
> +
>  unsigned int num_processors;
> 
>  unsigned disabled_cpus __cpuinitdata;
> @@ -879,27 +882,34 @@ static void local_apic_timer_interrupt(void)
>   * [ if a single-CPU system runs an SMP kernel then we call the local
>   *   interrupt as well. Thus we cannot inline the local irq ... ]
>   */
> -void __irq_entry smp_apic_timer_interrupt(struct pt_regs *regs) -{
> -	struct pt_regs *old_regs = set_irq_regs(regs);
> -
> -	/*
> -	 * NOTE! We'd better ACK the irq immediately,
> -	 * because timer handling can be slow.
> -	 */
> -	ack_APIC_irq();
> -	/*
> -	 * update_process_times() expects us to have done irq_enter().
> -	 * Besides, if we don't timer interrupts ignore the global
> -	 * interrupt lock, which is the WrongThing (tm) to do.
> -	 */
> -	irq_enter();
> -	exit_idle();
> -	local_apic_timer_interrupt();
> -	irq_exit();
> -
> -	set_irq_regs(old_regs);
> -}
> +#define SMP_APIC_TIMER_INTERRUPT(trace, trace_enter, trace_exit)	\
> +void __irq_entry smp_##trace##apic_timer_interrupt(struct pt_regs *regs)\
> +{									\
> +	struct pt_regs *old_regs = set_irq_regs(regs);			\
> +									\
> +	/*								\
> +	 * NOTE! We'd better ACK the irq immediately,			\
> +	 * because timer handling can be slow.				\
> +	 */								\
> +	ack_APIC_irq();							\
> +	/*								\
> +	 * update_process_times() expects us to have done irq_enter().	\
> +	 * Besides, if we don't timer interrupts ignore the global	\
> +	 * interrupt lock, which is the WrongThing (tm) to do.		\
> +	 */								\
> +	irq_enter();							\
> +	exit_idle();							\
> +	trace_enter;							\
> +	local_apic_timer_interrupt();					\
> +	trace_exit;							\
> +	irq_exit();							\
> +									\
> +	set_irq_regs(old_regs);						\
> +}
> +
> +SMP_APIC_TIMER_INTERRUPT(,,)
> +SMP_APIC_TIMER_INTERRUPT(trace_, trace_local_timer_entry(LOCAL_TIMER_VECTOR),
> +			 trace_local_timer_exit(LOCAL_TIMER_VECTOR))
> 
>  int setup_profiling_timer(unsigned int multiplier)  { @@ -1875,71 +1885,91 @@ int __init APIC_init_uniprocessor(void)
>  /*
>   * This interrupt should _never_ happen with our APIC/SMP architecture
>   */
> -void smp_spurious_interrupt(struct pt_regs *regs) -{
> -	u32 v;
> -
> -	irq_enter();
> -	exit_idle();
> -	/*
> -	 * Check if this really is a spurious interrupt and ACK it
> -	 * if it is a vectored one.  Just in case...
> -	 * Spurious interrupts should not be ACKed.
> -	 */
> -	v = apic_read(APIC_ISR + ((SPURIOUS_APIC_VECTOR & ~0x1f) >> 1));
> -	if (v & (1 << (SPURIOUS_APIC_VECTOR & 0x1f)))
> -		ack_APIC_irq();
> -
> -	inc_irq_stat(irq_spurious_count);
> -
> -	/* see sw-dev-man vol 3, chapter 7.4.13.5 */
> -	pr_info("spurious APIC interrupt on CPU#%d, "
> -		"should never happen.\n", smp_processor_id());
> -	irq_exit();
> -}
> +#define SMP_SPURIOUS_INTERRUPT(trace, trace_enter, trace_exit)		\
> +void smp_##trace##spurious_interrupt(struct pt_regs *regs)		\
> +{									\
> +	u32 v;								\
> +									\
> +	irq_enter();							\
> +	exit_idle();							\
> +	trace_enter;							\
> +	/*								\
> +	 * Check if this really is a spurious interrupt and ACK it	\
> +	 * if it is a vectored one.  Just in case...			\
> +	 * Spurious interrupts should not be ACKed.			\
> +	 */								\
> +	v = apic_read(APIC_ISR + ((SPURIOUS_APIC_VECTOR & ~0x1f) >> 1));\
> +	if (v & (1 << (SPURIOUS_APIC_VECTOR & 0x1f)))			\
> +		ack_APIC_irq();						\
> +									\
> +	inc_irq_stat(irq_spurious_count);				\
> +									\
> +	/* see sw-dev-man vol 3, chapter 7.4.13.5 */			\
> +	pr_info("spurious APIC interrupt on CPU#%d, "			\
> +		"should never happen.\n", smp_processor_id());		\
> +	trace_exit;							\
> +	irq_exit();							\
> +}
> +
> +SMP_SPURIOUS_INTERRUPT(,,)
> +SMP_SPURIOUS_INTERRUPT(trace_, trace_spurious_apic_entry(SPURIOUS_APIC_VECTOR),
> +		       trace_spurious_apic_exit(SPURIOUS_APIC_VECTOR))
> 
>  /*
>   * This interrupt should never happen with our APIC/SMP architecture
>   */
> -void smp_error_interrupt(struct pt_regs *regs) -{
> -	u32 v0, v1;
> -	u32 i = 0;
> -	static const char * const error_interrupt_reason[] = {
> -		"Send CS error",		/* APIC Error Bit 0 */
> -		"Receive CS error",		/* APIC Error Bit 1 */
> -		"Send accept error",		/* APIC Error Bit 2 */
> -		"Receive accept error",		/* APIC Error Bit 3 */
> -		"Redirectable IPI",		/* APIC Error Bit 4 */
> -		"Send illegal vector",		/* APIC Error Bit 5 */
> -		"Received illegal vector",	/* APIC Error Bit 6 */
> -		"Illegal register address",	/* APIC Error Bit 7 */
> -	};
> -
> -	irq_enter();
> -	exit_idle();
> -	/* First tickle the hardware, only then report what went on. -- REW */
> -	v0 = apic_read(APIC_ESR);
> -	apic_write(APIC_ESR, 0);
> -	v1 = apic_read(APIC_ESR);
> -	ack_APIC_irq();
> -	atomic_inc(&irq_err_count);
> -
> -	apic_printk(APIC_DEBUG, KERN_DEBUG "APIC error on CPU%d: %02x(%02x)",
> -		    smp_processor_id(), v0 , v1);
> -
> -	v1 = v1 & 0xff;
> -	while (v1) {
> -		if (v1 & 0x1)
> -			apic_printk(APIC_DEBUG, KERN_CONT " : %s", error_interrupt_reason[i]);
> -		i++;
> -		v1 >>= 1;
> -	}
> -
> -	apic_printk(APIC_DEBUG, KERN_CONT "\n");
> -
> -	irq_exit();
> -}
> +#define SMP_ERROR_INTERRUPT(trace, trace_enter, trace_exit)		\
> +void smp_##trace##error_interrupt(struct pt_regs *regs)		\
> +{									\
> +	u32 v0, v1;							\
> +	u32 i = 0;							\
> +	static const char * const error_interrupt_reason[] = {		\
> +		"Send CS error",		/* APIC Error Bit 0 */	\
> +		"Receive CS error",		/* APIC Error Bit 1 */	\
> +		"Send accept error",		/* APIC Error Bit 2 */	\
> +		"Receive accept error",		/* APIC Error Bit 3 */	\
> +		"Redirectable IPI",		/* APIC Error Bit 4 */	\
> +		"Send illegal vector",		/* APIC Error Bit 5 */	\
> +		"Received illegal vector",	/* APIC Error Bit 6 */	\
> +		"Illegal register address",	/* APIC Error Bit 7 */	\
> +	};								\
> +									\
> +	irq_enter();							\
> +	exit_idle();							\
> +	trace_enter;							\
> +	/*								\
> +	 * First tickle the hardware, only then report what went on.	\
> +	 * -- REW							\
> +	 */								\
> +	v0 = apic_read(APIC_ESR);					\
> +	apic_write(APIC_ESR, 0);					\
> +	v1 = apic_read(APIC_ESR);					\
> +	ack_APIC_irq();							\
> +	atomic_inc(&irq_err_count);					\
> +									\
> +	apic_printk(APIC_DEBUG,						\
> +		    KERN_DEBUG "APIC error on CPU%d: %02x(%02x)",	\
> +		    smp_processor_id(), v0 , v1);			\
> +									\
> +	v1 = v1 & 0xff;							\
> +	while (v1) {							\
> +		if (v1 & 0x1)						\
> +			apic_printk(APIC_DEBUG, KERN_CONT " : %s",	\
> +				    error_interrupt_reason[i]);		\
> +		i++;							\
> +		v1 >>= 1;						\
> +	}								\
> +									\
> +	apic_printk(APIC_DEBUG, KERN_CONT "\n");			\
> +									\
> +	trace_exit;							\
> +	irq_exit();							\
> +}
> +
> +
> +SMP_ERROR_INTERRUPT(,,)
> +SMP_ERROR_INTERRUPT(trace_, trace_error_apic_entry(ERROR_APIC_VECTOR),
> +		    trace_error_apic_exit(ERROR_APIC_VECTOR))
> 
>  /**
>   * connect_bsp_APIC - attach the APIC to the interrupt system diff --git a/arch/x86/kernel/cpu/mcheck/therm_throt.c
> b/arch/x86/kernel/cpu/mcheck/therm_throt.c
> index 47a1870..a1c86ab 100644
> --- a/arch/x86/kernel/cpu/mcheck/therm_throt.c
> +++ b/arch/x86/kernel/cpu/mcheck/therm_throt.c
> @@ -23,6 +23,7 @@
>  #include <linux/init.h>
>  #include <linux/smp.h>
>  #include <linux/cpu.h>
> +#include <asm/trace/irq_vectors.h>
> 
>  #include <asm/processor.h>
>  #include <asm/apic.h>
> @@ -378,17 +379,24 @@ static void unexpected_thermal_interrupt(void)
> 
>  static void (*smp_thermal_vector)(void) = unexpected_thermal_interrupt;
> 
> -asmlinkage void smp_thermal_interrupt(struct pt_regs *regs) -{
> -	irq_enter();
> -	exit_idle();
> -	inc_irq_stat(irq_thermal_count);
> -	smp_thermal_vector();
> -	irq_exit();
> -	/* Ack only at the end to avoid potential reentry */
> -	ack_APIC_irq();
> +#define SMP_THERMAL_INTERRUPT(trace, trace_enter, trace_exit)		\
> +asmlinkage void smp_##trace##thermal_interrupt(struct pt_regs *regs)	\
> +{									\
> +	irq_enter();							\
> +	exit_idle();							\
> +	trace_enter;							\
> +	inc_irq_stat(irq_thermal_count);				\
> +	smp_thermal_vector();						\
> +	trace_exit;							\
> +	irq_exit();							\
> +	/* Ack only at the end to avoid potential reentry */		\
> +	ack_APIC_irq();							\
>  }
> 
> +SMP_THERMAL_INTERRUPT(,,)
> +SMP_THERMAL_INTERRUPT(trace_, trace_thermal_apic_entry(THERMAL_APIC_VECTOR),
> +		      trace_thermal_apic_exit(THERMAL_APIC_VECTOR))
> +
>  /* Thermal monitoring depends on APIC, ACPI and clock modulation */  static int intel_thermal_supported(struct cpuinfo_x86 *c)
> { diff --git a/arch/x86/kernel/cpu/mcheck/threshold.c b/arch/x86/kernel/cpu/mcheck/threshold.c
> index aa578ca..b7a95c5 100644
> --- a/arch/x86/kernel/cpu/mcheck/threshold.c
> +++ b/arch/x86/kernel/cpu/mcheck/threshold.c
> @@ -4,6 +4,7 @@
>  #include <linux/interrupt.h>
>  #include <linux/kernel.h>
> 
> +#include <asm/trace/irq_vectors.h>
>  #include <asm/irq_vectors.h>
>  #include <asm/apic.h>
>  #include <asm/idle.h>
> @@ -17,13 +18,21 @@ static void default_threshold_interrupt(void)
> 
>  void (*mce_threshold_vector)(void) = default_threshold_interrupt;
> 
> -asmlinkage void smp_threshold_interrupt(void) -{
> -	irq_enter();
> -	exit_idle();
> -	inc_irq_stat(irq_threshold_count);
> -	mce_threshold_vector();
> -	irq_exit();
> -	/* Ack only at the end to avoid potential reentry */
> -	ack_APIC_irq();
> +#define SMP_THRESHOLD_INTERRUPT(trace, trace_enter, trace_exit)	\
> +asmlinkage void smp_##trace##threshold_interrupt(void)			\
> +{									\
> +	irq_enter();							\
> +	exit_idle();							\
> +	trace_enter;							\
> +	inc_irq_stat(irq_threshold_count);				\
> +	mce_threshold_vector();						\
> +	trace_exit;							\
> +	irq_exit();							\
> +	/* Ack only at the end to avoid potential reentry */		\
> +	ack_APIC_irq();							\
>  }
> +
> +SMP_THRESHOLD_INTERRUPT(,,)
> +SMP_THRESHOLD_INTERRUPT(trace_,
> +			trace_threshold_apic_entry(THRESHOLD_APIC_VECTOR),
> +			trace_threshold_apic_exit(THRESHOLD_APIC_VECTOR))
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S index cdc790c..20faa26 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -1187,6 +1187,39 @@ apicinterrupt IRQ_WORK_VECTOR \
>  	irq_work_interrupt smp_irq_work_interrupt  #endif
> 
> +#ifdef CONFIG_TRACEPOINTS
> +
> +apicinterrupt LOCAL_TIMER_VECTOR \
> +	trace_apic_timer_interrupt smp_trace_apic_timer_interrupt
> +apicinterrupt X86_PLATFORM_IPI_VECTOR \
> +	trace_x86_platform_ipi smp_trace_x86_platform_ipi
> +
> +apicinterrupt THRESHOLD_APIC_VECTOR \
> +	trace_threshold_interrupt smp_trace_threshold_interrupt apicinterrupt
> +THERMAL_APIC_VECTOR \
> +	trace_thermal_interrupt smp_trace_thermal_interrupt
> +
> +#ifdef CONFIG_SMP
> +apicinterrupt CALL_FUNCTION_SINGLE_VECTOR \
> +	trace_call_function_single_interrupt \
> +	smp_trace_call_function_single_interrupt
> +apicinterrupt CALL_FUNCTION_VECTOR \
> +	trace_call_function_interrupt smp_trace_call_function_interrupt
> +apicinterrupt RESCHEDULE_VECTOR \
> +	trace_reschedule_interrupt smp_trace_reschedule_interrupt #endif
> +
> +apicinterrupt ERROR_APIC_VECTOR \
> +	trace_error_interrupt smp_trace_error_interrupt apicinterrupt
> +SPURIOUS_APIC_VECTOR \
> +	trace_spurious_interrupt smp_trace_spurious_interrupt
> +
> +#ifdef CONFIG_IRQ_WORK
> +apicinterrupt IRQ_WORK_VECTOR \
> +	trace_irq_work_interrupt smp_trace_irq_work_interrupt #endif #endif /*
> +CONFIG_TRACEPOINTS */
> +
>  /*
>   * Exception entry points.
>   */
> diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S index 94bf9cc..cc32708 100644
> --- a/arch/x86/kernel/head_64.S
> +++ b/arch/x86/kernel/head_64.S
> @@ -455,6 +455,12 @@ ENTRY(idt_table)
>  ENTRY(nmi_idt_table)
>  	.skip IDT_ENTRIES * 16
> 
> +#ifdef CONFIG_TRACEPOINTS
> +	.align L1_CACHE_BYTES
> +ENTRY(trace_idt_table)
> +	.skip IDT_ENTRIES * 16
> +#endif
> +
>  	__PAGE_ALIGNED_BSS
>  	.align PAGE_SIZE
>  ENTRY(empty_zero_page)
> diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c index e4595f1..9fd70ad 100644
> --- a/arch/x86/kernel/irq.c
> +++ b/arch/x86/kernel/irq.c
> @@ -18,6 +18,8 @@
>  #include <asm/mce.h>
>  #include <asm/hw_irq.h>
> 
> +#include <asm/trace/irq_vectors.h>
> +
>  atomic_t irq_err_count;
> 
>  /* Function pointer for generic interrupt vector handling */ @@ -208,26 +210,32 @@ unsigned int __irq_entry do_IRQ(struct pt_regs
> *regs)
>  /*
>   * Handler for X86_PLATFORM_IPI_VECTOR.
>   */
> -void smp_x86_platform_ipi(struct pt_regs *regs) -{
> -	struct pt_regs *old_regs = set_irq_regs(regs);
> -
> -	ack_APIC_irq();
> -
> -	irq_enter();
> -
> -	exit_idle();
> -
> -	inc_irq_stat(x86_platform_ipis);
> -
> -	if (x86_platform_ipi_callback)
> -		x86_platform_ipi_callback();
> -
> -	irq_exit();
> -
> -	set_irq_regs(old_regs);
> +#define SMP_X86_PLATFORM_IPI(trace, trace_enter, trace_exit)		\
> +void smp_##trace##x86_platform_ipi(struct pt_regs *regs)		\
> +{									\
> +	struct pt_regs *old_regs = set_irq_regs(regs);			\
> +									\
> +	ack_APIC_irq();							\
> +									\
> +	irq_enter();							\
> +									\
> +	exit_idle();							\
> +	trace_enter;							\
> +	inc_irq_stat(x86_platform_ipis);				\
> +									\
> +	if (x86_platform_ipi_callback)					\
> +		x86_platform_ipi_callback();				\
> +	trace_exit;							\
> +	irq_exit();							\
> +									\
> +	set_irq_regs(old_regs);						\
>  }
> 
> +SMP_X86_PLATFORM_IPI(,,)
> +SMP_X86_PLATFORM_IPI(trace_,
> +		     trace_x86_platform_ipi_entry(X86_PLATFORM_IPI_VECTOR),
> +		     trace_x86_platform_ipi_exit(X86_PLATFORM_IPI_VECTOR))
> +
>  EXPORT_SYMBOL_GPL(vector_used_by_percpu_irq);
> 
>  #ifdef CONFIG_HOTPLUG_CPU
> diff --git a/arch/x86/kernel/irq_work.c b/arch/x86/kernel/irq_work.c index ca8f703..a669b94 100644
> --- a/arch/x86/kernel/irq_work.c
> +++ b/arch/x86/kernel/irq_work.c
> @@ -8,16 +8,24 @@
>  #include <linux/irq_work.h>
>  #include <linux/hardirq.h>
>  #include <asm/apic.h>
> +#include <asm/trace/irq_vectors.h>
> 
> -void smp_irq_work_interrupt(struct pt_regs *regs) -{
> -	irq_enter();
> -	ack_APIC_irq();
> -	inc_irq_stat(apic_irq_work_irqs);
> -	irq_work_run();
> -	irq_exit();
> +#define SMP_IRQ_WORK_INTERRUPT(trace, trace_enter, trace_exit)		\
> +void smp_##trace##irq_work_interrupt(struct pt_regs *regs)		\
> +{									\
> +	irq_enter();							\
> +	ack_APIC_irq();							\
> +	trace_enter;							\
> +	inc_irq_stat(apic_irq_work_irqs);				\
> +	irq_work_run();							\
> +	trace_exit;							\
> +	irq_exit();							\
>  }
> 
> +SMP_IRQ_WORK_INTERRUPT(,,)
> +SMP_IRQ_WORK_INTERRUPT(trace_, trace_irq_work_entry(IRQ_WORK_VECTOR),
> +		       trace_irq_work_exit(IRQ_WORK_VECTOR))
> +
>  void arch_irq_work_raise(void)
>  {
>  #ifdef CONFIG_X86_LOCAL_APIC
> diff --git a/arch/x86/kernel/irqinit.c b/arch/x86/kernel/irqinit.c index 6e03b0d..cf76128 100644
> --- a/arch/x86/kernel/irqinit.c
> +++ b/arch/x86/kernel/irqinit.c
> @@ -251,4 +251,6 @@ void __init native_init_IRQ(void)
> 
>  	irq_ctx_init(smp_processor_id());
>  #endif
> +
> +	trace_idt_table_init();
>  }
> diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c index 48d2b7d..d8e1a2c 100644
> --- a/arch/x86/kernel/smp.c
> +++ b/arch/x86/kernel/smp.c
> @@ -23,6 +23,7 @@
>  #include <linux/interrupt.h>
>  #include <linux/cpu.h>
>  #include <linux/gfp.h>
> +#include <asm/trace/irq_vectors.h>
> 
>  #include <asm/mtrr.h>
>  #include <asm/tlbflush.h>
> @@ -249,34 +250,57 @@ finish:
>  /*
>   * Reschedule call back.
>   */
> -void smp_reschedule_interrupt(struct pt_regs *regs) -{
> -	ack_APIC_irq();
> -	inc_irq_stat(irq_resched_count);
> -	scheduler_ipi();
> -	/*
> -	 * KVM uses this interrupt to force a cpu out of guest mode
> -	 */
> +#define SMP_RESCHEDULE_INTERRUPT(trace, trace_enter, trace_exit)	\
> +void smp_##trace##reschedule_interrupt(struct pt_regs *regs)		\
> +{									\
> +	ack_APIC_irq();							\
> +	trace_enter;							\
> +	inc_irq_stat(irq_resched_count);				\
> +	scheduler_ipi();						\
> +	trace_exit;							\
> +	/*								\
> +	 * KVM uses this interrupt to force a cpu out of guest mode	\
> +	 */								\
>  }
> 
> -void smp_call_function_interrupt(struct pt_regs *regs) -{
> -	ack_APIC_irq();
> -	irq_enter();
> -	generic_smp_call_function_interrupt();
> -	inc_irq_stat(irq_call_count);
> -	irq_exit();
> +SMP_RESCHEDULE_INTERRUPT(,,)
> +SMP_RESCHEDULE_INTERRUPT(trace_, trace_reschedule_entry(RESCHEDULE_VECTOR),
> +			 trace_reschedule_exit(RESCHEDULE_VECTOR))
> +
> +#define SMP_CALL_FUNCTION_INTERRUPT(trace, trace_enter, trace_exit)	\
> +void smp_##trace##call_function_interrupt(struct pt_regs *regs)	\
> +{									\
> +	ack_APIC_irq();							\
> +	irq_enter();							\
> +	trace_enter;							\
> +	generic_smp_call_function_interrupt();				\
> +	inc_irq_stat(irq_call_count);					\
> +	trace_exit;							\
> +	irq_exit();							\
>  }
> 
> -void smp_call_function_single_interrupt(struct pt_regs *regs) -{
> -	ack_APIC_irq();
> -	irq_enter();
> -	generic_smp_call_function_single_interrupt();
> -	inc_irq_stat(irq_call_count);
> -	irq_exit();
> +SMP_CALL_FUNCTION_INTERRUPT(,,)
> +SMP_CALL_FUNCTION_INTERRUPT(trace_,
> +			    trace_call_function_entry(CALL_FUNCTION_VECTOR),
> +			    trace_call_function_exit(CALL_FUNCTION_VECTOR))
> +
> +#define SMP_CALL_FUNCTION_SINGLE_INTERRUPT(trace, trace_enter, trace_exit)\
> +void smp_##trace##call_function_single_interrupt(struct pt_regs *regs)	\
> +{									\
> +	ack_APIC_irq();							\
> +	irq_enter();							\
> +	trace_enter;							\
> +	generic_smp_call_function_single_interrupt();			\
> +	inc_irq_stat(irq_call_count);					\
> +	trace_exit;							\
> +	irq_exit();							\
>  }
> 
> +SMP_CALL_FUNCTION_SINGLE_INTERRUPT(,,)
> +SMP_CALL_FUNCTION_SINGLE_INTERRUPT(trace_,
> +		trace_call_function_single_entry(CALL_FUNCTION_SINGLE_VECTOR),
> +		trace_call_function_single_exit(CALL_FUNCTION_SINGLE_VECTOR))
> +
>  static int __init nonmi_ipi_setup(char *str)  {
>  	smp_no_nmi_ipi = true;
> diff --git a/arch/x86/kernel/tracepoint.c b/arch/x86/kernel/tracepoint.c new file mode 100644 index 0000000..d7c96ba
> --- /dev/null
> +++ b/arch/x86/kernel/tracepoint.c
> @@ -0,0 +1,102 @@
> +/*
> + * Code for supporting irq vector tracepoints.
> + *
> + * Copyright (C) 2012 Seiji Aguchi <sei...@hd...>
> + *
> + */
> +#include <asm/hw_irq.h>
> +#include <asm/desc.h>
> +
> +static struct desc_ptr trace_idt_descr = { NR_VECTORS * 16 - 1,
> +				    (unsigned long) trace_idt_table };
> +
> +#ifndef CONFIG_X86_64
> +gate_desc trace_idt_table[NR_VECTORS] __page_aligned_data
> +					= { { { { 0, 0 } } }, };
> +#endif
> +
> +void __init trace_idt_table_init(void)
> +{
> +	memcpy(&trace_idt_table, &idt_table, IDT_ENTRIES * 16);
> +	/*
> +	 * The reschedule interrupt is a CPU-to-CPU reschedule-helper
> +	 * IPI, driven by wakeup.
> +	 */
> +	trace_set_intr_gate(RESCHEDULE_VECTOR, trace_reschedule_interrupt);
> +
> +	/* IPI for generic function call */
> +	trace_set_intr_gate(CALL_FUNCTION_VECTOR,
> +			    trace_call_function_interrupt);
> +
> +	/* IPI for generic single function call */
> +	trace_set_intr_gate(CALL_FUNCTION_SINGLE_VECTOR,
> +			    trace_call_function_single_interrupt);
> +
> +#ifdef CONFIG_X86_THERMAL_VECTOR
> +	trace_set_intr_gate(THERMAL_APIC_VECTOR, trace_thermal_interrupt);
> +#endif #ifdef CONFIG_X86_MCE_THRESHOLD
> +	trace_set_intr_gate(THRESHOLD_APIC_VECTOR, trace_threshold_interrupt);
> +#endif
> +
> +#if defined(CONFIG_X86_64) || defined(CONFIG_X86_LOCAL_APIC)
> +	/* self generated IPI for local APIC timer */
> +	trace_set_intr_gate(LOCAL_TIMER_VECTOR, trace_apic_timer_interrupt);
> +
> +	/* IPI for X86 platform specific use */
> +	trace_set_intr_gate(X86_PLATFORM_IPI_VECTOR, trace_x86_platform_ipi);
> +
> +	/* IPI vectors for APIC spurious and error interrupts */
> +	trace_set_intr_gate(SPURIOUS_APIC_VECTOR, trace_spurious_interrupt);
> +	trace_set_intr_gate(ERROR_APIC_VECTOR, trace_error_interrupt);
> +
> +	/* IRQ work interrupts: */
> +# ifdef CONFIG_IRQ_WORK
> +	trace_set_intr_gate(IRQ_WORK_VECTOR, trace_irq_work_interrupt); #
> +endif # endif }
> +
> +static struct desc_ptr orig_idt_descr[NR_CPUS]; static int
> +trace_irq_vector_refcount;
> +
> +static void switch_trace_idt(void *arg) {
> +	store_idt(&orig_idt_descr[smp_processor_id()]);
> +	load_idt(&trace_idt_descr);
> +
> +	return;
> +}
> +
> +static void restore_original_idt(void *arg) {
> +	if (orig_idt_descr[smp_processor_id()].address) {
> +		load_idt(&orig_idt_descr[smp_processor_id()]);
> +		memset(&orig_idt_descr[smp_processor_id()], 0,
> +		       sizeof(struct desc_ptr));
> +	}
> +
> +	return;
> +}
> +
> +void trace_irq_vector_regfunc(void)
> +{
> +	if (!trace_irq_vector_refcount) {
> +		smp_call_function(switch_trace_idt, NULL, 0);
> +		local_irq_disable();
> +		switch_trace_idt(NULL);
> +		local_irq_enable();
> +	}
> +	trace_irq_vector_refcount++;
> +}
> +
> +void trace_irq_vector_unregfunc(void)
> +{
> +	trace_irq_vector_refcount--;
> +	if (!trace_irq_vector_refcount) {
> +		smp_call_function(restore_original_idt, NULL, 0);
> +		local_irq_disable();
> +		restore_original_idt(NULL);
> +		local_irq_enable();
> +	}
> +}
> -- 1.7.1

[Dle-develop] [RFC][PATCH v5]trace, x86: add x86 irq vector tracepoints

From: Seiji A. <sei...@hd...> - 2012-10-11 17:26:32

Change log 

 v4 -> v5
 - Rebased to 3.6.0

 - Introduce a logic switching IDT at enabling/disabling TP time 
   so that a time penalty makes a zero when tracepoints are disabled.
   This IDT is created only when CONFIG_TRACEPOINTS is enabled.

 - Remove arch_irq_vector_entry/exit and add followings again
   so that we can add each tracepoint in a generic way.
   - error_apic_vector
   - thermal_apic_vector
   - threshold_apic_vector
   - spurious_apic_vector
   - x86_platform_ipi_vector

 - Drop nmi tracepoints to begin with apic interrupts and discuss a logic switching
   IDT first.

 - Move irq_vectors.h in the directory of arch/x86/include/asm/trace because
   I'm not sure if a logic switching IDT is sharable with other architectures.

 v3 -> v4
 - Add a latency measurement of each tracepoint
 - Rebased to 3.6-rc6

 v2 -> v3
 - Remove an invalidate_tlb_vector event because it was replaced by a call function vector
   in a following commit.
   http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=52aec3308db85f4e9f5c8b9f5dc4fbd0138c6fa4

 v1 -> v2
 - Modify variable name from irq to vector.
 - Merge arch-specific tracepoints below to an arch_irq_vector_entry/exit.
   - error_apic_vector
   - thermal_apic_vector
   - threshold_apic_vector
   - spurious_apic_vector
   - x86_platform_ipi_vector

[Purpose of this patch]

As Vaibhav explained in the thread below, tracepoints for irq vectors
are useful.

http://www.spinics.net/lists/mm-commits/msg85707.html

<snip>
The current interrupt traces from irq_handler_entry and irq_handler_exit
provide when an interrupt is handled.  They provide good data about when
the system has switched to kernel space and how it affects the currently
running processes.

There are some IRQ vectors which trigger the system into kernel space,
which are not handled in generic IRQ handlers.  Tracing such events gives
us the information about IRQ interaction with other system events.

The trace also tells where the system is spending its time.  We want to
know which cores are handling interrupts and how they are affecting other
processes in the system.  Also, the trace provides information about when
the cores are idle and which interrupts are changing that state.
<snip>

On the other hand, my usecase is tracing just local timer event and 
getting a value of instruction pointer.

  I suggested to add an argument local timer event to get instruction pointer before.
  But there is another way to get it with external module like systemtap.
  So, I don't need to add any argument to irq vector tracepoints now.

[Patch Description]

Vaibhav's patch shared a trace point ,irq_vector_entry/irq_vector_exit, in all events.
But there is an above use case to trace specific irq_vector rather than tracing all events.
In this case, we are concerned about overhead due to unwanted events.

This patch adds following tracepoints instead of introducing irq_vector_entry/exit.
so that we can enable them independently.
   - local_timer_vector
   - reschedule_vector
   - call_function_vector
   - call_function_single_vector 
   - irq_work_entry_vector
   - error_apic_vector
   - thermal_apic_vector
   - threshold_apic_vector
   - spurious_apic_vector
   - x86_platform_ipi_vector

Also, it introduces a logic switching IDT at enabling/disabling time so that a time penalty makes 
a complete zero when tracepoints are disabled. Detailed explanations are as follows.
 - Create new irq handlers inserted tracepoints by using macros.
 - Create a new IDT, trace_idt_table, at boot time by duplicating original IDT, idt table, and 
   registering the new handers for tracpoints.
 - Switch IDT to new one at enabling TP time.
 - Restore to an original IDT at disabling TP time.
The new IDT is created only when CONFIG_TRACEPOINTS is enabled to avoid being used for other purposes.

Signed-off-by: Seiji Aguchi <sei...@hd...>
---
 arch/x86/include/asm/desc.h              |   27 +++++
 arch/x86/include/asm/entry_arch.h        |   32 +++++
 arch/x86/include/asm/hw_irq.h            |   14 +++
 arch/x86/include/asm/trace/irq_vectors.h |  153 ++++++++++++++++++++++++
 arch/x86/kernel/Makefile                 |    1 +
 arch/x86/kernel/apic/apic.c              |  186 +++++++++++++++++-------------
 arch/x86/kernel/cpu/mcheck/therm_throt.c |   26 +++--
 arch/x86/kernel/cpu/mcheck/threshold.c   |   27 +++--
 arch/x86/kernel/entry_64.S               |   33 ++++++
 arch/x86/kernel/head_64.S                |    6 +
 arch/x86/kernel/irq.c                    |   44 ++++---
 arch/x86/kernel/irq_work.c               |   22 +++-
 arch/x86/kernel/irqinit.c                |    2 +
 arch/x86/kernel/smp.c                    |   68 ++++++++----
 arch/x86/kernel/tracepoint.c             |  102 ++++++++++++++++
 15 files changed, 600 insertions(+), 143 deletions(-)
 create mode 100644 arch/x86/include/asm/trace/irq_vectors.h
 create mode 100644 arch/x86/kernel/tracepoint.c

diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index 8bf1c06..52becf4 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -345,6 +345,33 @@ static inline void set_intr_gate(unsigned int n, void *addr)
 	_set_gate(n, GATE_INTERRUPT, addr, 0, 0, __KERNEL_CS);
 }
 
+#ifdef CONFIG_TRACEPOINTS
+extern gate_desc trace_idt_table[];
+extern void trace_idt_table_init(void);
+static inline void _trace_set_gate(int gate, unsigned type, void *addr,
+				   unsigned dpl, unsigned ist, unsigned seg)
+{
+	gate_desc s;
+
+	pack_gate(&s, type, (unsigned long)addr, dpl, ist, seg);
+	/*
+	 * does not need to be atomic because it is only done once at
+	 * setup time
+	 */
+	write_idt_entry(trace_idt_table, gate, &s);
+}
+
+static inline void trace_set_intr_gate(unsigned int n, void *addr)
+{
+	BUG_ON((unsigned)n > 0xFF);
+	_trace_set_gate(n, GATE_INTERRUPT, addr, 0, 0, __KERNEL_CS);
+}
+#else
+static inline void trace_idt_table_init(void)
+{
+}
+#endif
+
 extern int first_system_vector;
 /* used_vectors is BITMAP for irq is not managed by percpu vector_irq */
 extern unsigned long used_vectors[];
diff --git a/arch/x86/include/asm/entry_arch.h b/arch/x86/include/asm/entry_arch.h
index 40afa00..8ef3900 100644
--- a/arch/x86/include/asm/entry_arch.h
+++ b/arch/x86/include/asm/entry_arch.h
@@ -45,3 +45,35 @@ BUILD_INTERRUPT(threshold_interrupt,THRESHOLD_APIC_VECTOR)
 #endif
 
 #endif
+
+#ifdef CONFIG_TRACEPOINTS
+#ifdef CONFIG_SMP
+BUILD_INTERRUPT(trace_reschedule_interrupt, RESCHEDULE_VECTOR)
+BUILD_INTERRUPT(trace_call_function_interrupt, CALL_FUNCTION_VECTOR)
+BUILD_INTERRUPT(trace_call_function_single_interrupt,
+		CALL_FUNCTION_SINGLE_VECTOR)
+#endif
+
+BUILD_INTERRUPT(trace_x86_platform_ipi, X86_PLATFORM_IPI_VECTOR)
+
+#ifdef CONFIG_X86_LOCAL_APIC
+
+BUILD_INTERRUPT(trace_apic_timer_interrupt, LOCAL_TIMER_VECTOR)
+BUILD_INTERRUPT(trace_error_interrupt, ERROR_APIC_VECTOR)
+BUILD_INTERRUPT(trace_spurious_interrupt, SPURIOUS_APIC_VECTOR)
+
+#ifdef CONFIG_IRQ_WORK
+BUILD_INTERRUPT(trace_irq_work_interrupt, IRQ_WORK_VECTOR)
+#endif
+
+#ifdef CONFIG_X86_THERMAL_VECTOR
+BUILD_INTERRUPT(trace_thermal_interrupt, THERMAL_APIC_VECTOR)
+#endif
+
+#ifdef CONFIG_X86_MCE_THRESHOLD
+BUILD_INTERRUPT(trace_threshold_interrupt, THRESHOLD_APIC_VECTOR)
+#endif
+
+#endif
+
+#endif /* CONFIG_TRACEPOINTS */
diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
index eb92a6e..4472a78 100644
--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -76,6 +76,20 @@ extern void threshold_interrupt(void);
 extern void call_function_interrupt(void);
 extern void call_function_single_interrupt(void);
 
+#ifdef CONFIG_TRACEPOINTS
+/* Interrupt handlers registered during init_IRQ */
+extern void trace_apic_timer_interrupt(void);
+extern void trace_x86_platform_ipi(void);
+extern void trace_error_interrupt(void);
+extern void trace_irq_work_interrupt(void);
+extern void trace_spurious_interrupt(void);
+extern void trace_thermal_interrupt(void);
+extern void trace_reschedule_interrupt(void);
+extern void trace_threshold_interrupt(void);
+extern void trace_call_function_interrupt(void);
+extern void trace_call_function_single_interrupt(void);
+#endif /* CONFIG_TRACEPOINTS */
+
 /* IOAPIC */
 #define IO_APIC_IRQ(x) (((x) >= NR_IRQS_LEGACY) || ((1<<(x)) & io_apic_irqs))
 extern unsigned long io_apic_irqs;
diff --git a/arch/x86/include/asm/trace/irq_vectors.h b/arch/x86/include/asm/trace/irq_vectors.h
new file mode 100644
index 0000000..47858f1
--- /dev/null
+++ b/arch/x86/include/asm/trace/irq_vectors.h
@@ -0,0 +1,153 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM irq_vectors
+
+#if !defined(_TRACE_IRQ_VECTORS_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_IRQ_VECTORS_H
+
+#include <linux/tracepoint.h>
+
+extern void trace_irq_vector_regfunc(void);
+extern void trace_irq_vector_unregfunc(void);
+
+#define DECLARE_IRQ_VECTOR_EVENT(name)				\
+TRACE_EVENT_FN(name,						\
+	TP_PROTO(int vector),					\
+								\
+	TP_ARGS(vector),					\
+								\
+	TP_STRUCT__entry(					\
+		__field(	int,	vector	)		\
+	),							\
+								\
+	TP_fast_assign(						\
+		__entry->vector = vector;			\
+	),							\
+								\
+	TP_printk("vector=%d", __entry->vector),		\
+	trace_irq_vector_regfunc, trace_irq_vector_unregfunc	\
+);
+
+/*
+ * local_timer_entry - called before enterring a local timer interrupt
+ * vector handler
+ */
+DECLARE_IRQ_VECTOR_EVENT(local_timer_entry)
+
+/*
+ * local_timer_exit - called immediately after the interrupt vector
+ * handler returns
+ */
+DECLARE_IRQ_VECTOR_EVENT(local_timer_exit)
+
+/*
+ * reschedule_entry - called before enterring a reschedule vector handler
+ */
+DECLARE_IRQ_VECTOR_EVENT(reschedule_entry)
+
+/*
+ * reschedule_exit - called immediately after the interrupt vector
+ * handler returns
+ */
+DECLARE_IRQ_VECTOR_EVENT(reschedule_exit)
+
+/*
+ * spurious_apic_entry - called before enterring a spurious apic vector handler
+ */
+DECLARE_IRQ_VECTOR_EVENT(spurious_apic_entry)
+
+/*
+ * spurious_apic_exit - called immediately after the interrupt vector
+ * handler returns
+ */
+DECLARE_IRQ_VECTOR_EVENT(spurious_apic_exit)
+
+/*
+ * error_apic_entry - called before enterring an error apic vector handler
+ */
+DECLARE_IRQ_VECTOR_EVENT(error_apic_entry)
+
+/*
+ * error_apic_exit - called immediately after the interrupt vector
+ * handler returns
+ */
+DECLARE_IRQ_VECTOR_EVENT(error_apic_exit)
+
+/*
+ * x86_platform_ipi_entry - called before enterring a x86 platform ipi interrupt
+ * vector handler
+ */
+DECLARE_IRQ_VECTOR_EVENT(x86_platform_ipi_entry)
+
+/*
+ * x86_platform_ipi_exit - called immediately after the interrupt vector
+ * handler returns
+ */
+DECLARE_IRQ_VECTOR_EVENT(x86_platform_ipi_exit)
+
+/*
+ * irq_work_entry - called before enterring a irq work interrupt
+ * vector handler
+ */
+DECLARE_IRQ_VECTOR_EVENT(irq_work_entry)
+
+/*
+ * irq_work_exit - called immediately after the interrupt vector
+ * handler returns
+ */
+DECLARE_IRQ_VECTOR_EVENT(irq_work_exit)
+
+/*
+ * call_function_entry - called before enterring a call function interrupt
+ * vector handler
+ */
+DECLARE_IRQ_VECTOR_EVENT(call_function_entry)
+
+/*
+ * call_function_exit - called immediately after the interrupt vector
+ * handler returns
+ */
+DECLARE_IRQ_VECTOR_EVENT(call_function_exit)
+
+/*
+ * call_function_single_entry - called before enterring a call function
+ * single interrupt vector handler
+ */
+DECLARE_IRQ_VECTOR_EVENT(call_function_single_entry)
+
+/*
+ * call_function_single_exit - called immediately after the interrupt vector
+ * handler returns
+ */
+DECLARE_IRQ_VECTOR_EVENT(call_function_single_exit)
+
+/*
+ * threshold_apic_entry - called before enterring a threshold apic interrupt
+ * vector handler
+ */
+DECLARE_IRQ_VECTOR_EVENT(threshold_apic_entry)
+
+/*
+ * threshold_apic_exit - called immediately after the interrupt vector
+ * handler returns
+ */
+DECLARE_IRQ_VECTOR_EVENT(threshold_apic_exit)
+
+/*
+ * thermal_apic_entry - called before enterring a thermal apic interrupt
+ * vector handler
+ */
+DECLARE_IRQ_VECTOR_EVENT(thermal_apic_entry)
+
+/*
+ * thrmal_apic_exit - called immediately after the interrupt vector
+ * handler returns
+ */
+DECLARE_IRQ_VECTOR_EVENT(thermal_apic_exit)
+
+#undef TRACE_INCLUDE_PATH
+#define TRACE_INCLUDE_PATH ../../arch/x86/include/asm/trace
+#define TRACE_INCLUDE_FILE irq_vectors
+#endif /*  _TRACE_IRQ_VECTORS_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 91ce48f..fe4635d 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -100,6 +100,7 @@ obj-$(CONFIG_OF)			+= devicetree.o
 obj-$(CONFIG_UPROBES)			+= uprobes.o
 
 obj-$(CONFIG_PERF_EVENTS)		+= perf_regs.o
+obj-$(CONFIG_TRACEPOINTS)		+= tracepoint.o
 
 ###
 # 64 bit specific files
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index b17416e..abbee29 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -55,6 +55,9 @@
 #include <asm/tsc.h>
 #include <asm/hypervisor.h>
 
+#define CREATE_TRACE_POINTS
+#include <asm/trace/irq_vectors.h>
+
 unsigned int num_processors;
 
 unsigned disabled_cpus __cpuinitdata;
@@ -879,27 +882,34 @@ static void local_apic_timer_interrupt(void)
  * [ if a single-CPU system runs an SMP kernel then we call the local
  *   interrupt as well. Thus we cannot inline the local irq ... ]
  */
-void __irq_entry smp_apic_timer_interrupt(struct pt_regs *regs)
-{
-	struct pt_regs *old_regs = set_irq_regs(regs);
-
-	/*
-	 * NOTE! We'd better ACK the irq immediately,
-	 * because timer handling can be slow.
-	 */
-	ack_APIC_irq();
-	/*
-	 * update_process_times() expects us to have done irq_enter().
-	 * Besides, if we don't timer interrupts ignore the global
-	 * interrupt lock, which is the WrongThing (tm) to do.
-	 */
-	irq_enter();
-	exit_idle();
-	local_apic_timer_interrupt();
-	irq_exit();
-
-	set_irq_regs(old_regs);
-}
+#define SMP_APIC_TIMER_INTERRUPT(trace, trace_enter, trace_exit)	\
+void __irq_entry smp_##trace##apic_timer_interrupt(struct pt_regs *regs)\
+{									\
+	struct pt_regs *old_regs = set_irq_regs(regs);			\
+									\
+	/*								\
+	 * NOTE! We'd better ACK the irq immediately,			\
+	 * because timer handling can be slow.				\
+	 */								\
+	ack_APIC_irq();							\
+	/*								\
+	 * update_process_times() expects us to have done irq_enter().	\
+	 * Besides, if we don't timer interrupts ignore the global	\
+	 * interrupt lock, which is the WrongThing (tm) to do.		\
+	 */								\
+	irq_enter();							\
+	exit_idle();							\
+	trace_enter;							\
+	local_apic_timer_interrupt();					\
+	trace_exit;							\
+	irq_exit();							\
+									\
+	set_irq_regs(old_regs);						\
+}
+
+SMP_APIC_TIMER_INTERRUPT(,,)
+SMP_APIC_TIMER_INTERRUPT(trace_, trace_local_timer_entry(LOCAL_TIMER_VECTOR),
+			 trace_local_timer_exit(LOCAL_TIMER_VECTOR))
 
 int setup_profiling_timer(unsigned int multiplier)
 {
@@ -1875,71 +1885,91 @@ int __init APIC_init_uniprocessor(void)
 /*
  * This interrupt should _never_ happen with our APIC/SMP architecture
  */
-void smp_spurious_interrupt(struct pt_regs *regs)
-{
-	u32 v;
-
-	irq_enter();
-	exit_idle();
-	/*
-	 * Check if this really is a spurious interrupt and ACK it
-	 * if it is a vectored one.  Just in case...
-	 * Spurious interrupts should not be ACKed.
-	 */
-	v = apic_read(APIC_ISR + ((SPURIOUS_APIC_VECTOR & ~0x1f) >> 1));
-	if (v & (1 << (SPURIOUS_APIC_VECTOR & 0x1f)))
-		ack_APIC_irq();
-
-	inc_irq_stat(irq_spurious_count);
-
-	/* see sw-dev-man vol 3, chapter 7.4.13.5 */
-	pr_info("spurious APIC interrupt on CPU#%d, "
-		"should never happen.\n", smp_processor_id());
-	irq_exit();
-}
+#define SMP_SPURIOUS_INTERRUPT(trace, trace_enter, trace_exit)		\
+void smp_##trace##spurious_interrupt(struct pt_regs *regs)		\
+{									\
+	u32 v;								\
+									\
+	irq_enter();							\
+	exit_idle();							\
+	trace_enter;							\
+	/*								\
+	 * Check if this really is a spurious interrupt and ACK it	\
+	 * if it is a vectored one.  Just in case...			\
+	 * Spurious interrupts should not be ACKed.			\
+	 */								\
+	v = apic_read(APIC_ISR + ((SPURIOUS_APIC_VECTOR & ~0x1f) >> 1));\
+	if (v & (1 << (SPURIOUS_APIC_VECTOR & 0x1f)))			\
+		ack_APIC_irq();						\
+									\
+	inc_irq_stat(irq_spurious_count);				\
+									\
+	/* see sw-dev-man vol 3, chapter 7.4.13.5 */			\
+	pr_info("spurious APIC interrupt on CPU#%d, "			\
+		"should never happen.\n", smp_processor_id());		\
+	trace_exit;							\
+	irq_exit();							\
+}
+
+SMP_SPURIOUS_INTERRUPT(,,)
+SMP_SPURIOUS_INTERRUPT(trace_, trace_spurious_apic_entry(SPURIOUS_APIC_VECTOR),
+		       trace_spurious_apic_exit(SPURIOUS_APIC_VECTOR))
 
 /*
  * This interrupt should never happen with our APIC/SMP architecture
  */
-void smp_error_interrupt(struct pt_regs *regs)
-{
-	u32 v0, v1;
-	u32 i = 0;
-	static const char * const error_interrupt_reason[] = {
-		"Send CS error",		/* APIC Error Bit 0 */
-		"Receive CS error",		/* APIC Error Bit 1 */
-		"Send accept error",		/* APIC Error Bit 2 */
-		"Receive accept error",		/* APIC Error Bit 3 */
-		"Redirectable IPI",		/* APIC Error Bit 4 */
-		"Send illegal vector",		/* APIC Error Bit 5 */
-		"Received illegal vector",	/* APIC Error Bit 6 */
-		"Illegal register address",	/* APIC Error Bit 7 */
-	};
-
-	irq_enter();
-	exit_idle();
-	/* First tickle the hardware, only then report what went on. -- REW */
-	v0 = apic_read(APIC_ESR);
-	apic_write(APIC_ESR, 0);
-	v1 = apic_read(APIC_ESR);
-	ack_APIC_irq();
-	atomic_inc(&irq_err_count);
-
-	apic_printk(APIC_DEBUG, KERN_DEBUG "APIC error on CPU%d: %02x(%02x)",
-		    smp_processor_id(), v0 , v1);
-
-	v1 = v1 & 0xff;
-	while (v1) {
-		if (v1 & 0x1)
-			apic_printk(APIC_DEBUG, KERN_CONT " : %s", error_interrupt_reason[i]);
-		i++;
-		v1 >>= 1;
-	}
-
-	apic_printk(APIC_DEBUG, KERN_CONT "\n");
-
-	irq_exit();
-}
+#define SMP_ERROR_INTERRUPT(trace, trace_enter, trace_exit)		\
+void smp_##trace##error_interrupt(struct pt_regs *regs)		\
+{									\
+	u32 v0, v1;							\
+	u32 i = 0;							\
+	static const char * const error_interrupt_reason[] = {		\
+		"Send CS error",		/* APIC Error Bit 0 */	\
+		"Receive CS error",		/* APIC Error Bit 1 */	\
+		"Send accept error",		/* APIC Error Bit 2 */	\
+		"Receive accept error",		/* APIC Error Bit 3 */	\
+		"Redirectable IPI",		/* APIC Error Bit 4 */	\
+		"Send illegal vector",		/* APIC Error Bit 5 */	\
+		"Received illegal vector",	/* APIC Error Bit 6 */	\
+		"Illegal register address",	/* APIC Error Bit 7 */	\
+	};								\
+									\
+	irq_enter();							\
+	exit_idle();							\
+	trace_enter;							\
+	/*								\
+	 * First tickle the hardware, only then report what went on.	\
+	 * -- REW							\
+	 */								\
+	v0 = apic_read(APIC_ESR);					\
+	apic_write(APIC_ESR, 0);					\
+	v1 = apic_read(APIC_ESR);					\
+	ack_APIC_irq();							\
+	atomic_inc(&irq_err_count);					\
+									\
+	apic_printk(APIC_DEBUG,						\
+		    KERN_DEBUG "APIC error on CPU%d: %02x(%02x)",	\
+		    smp_processor_id(), v0 , v1);			\
+									\
+	v1 = v1 & 0xff;							\
+	while (v1) {							\
+		if (v1 & 0x1)						\
+			apic_printk(APIC_DEBUG, KERN_CONT " : %s",	\
+				    error_interrupt_reason[i]);		\
+		i++;							\
+		v1 >>= 1;						\
+	}								\
+									\
+	apic_printk(APIC_DEBUG, KERN_CONT "\n");			\
+									\
+	trace_exit;							\
+	irq_exit();							\
+}
+
+
+SMP_ERROR_INTERRUPT(,,)
+SMP_ERROR_INTERRUPT(trace_, trace_error_apic_entry(ERROR_APIC_VECTOR),
+		    trace_error_apic_exit(ERROR_APIC_VECTOR))
 
 /**
  * connect_bsp_APIC - attach the APIC to the interrupt system
diff --git a/arch/x86/kernel/cpu/mcheck/therm_throt.c b/arch/x86/kernel/cpu/mcheck/therm_throt.c
index 47a1870..a1c86ab 100644
--- a/arch/x86/kernel/cpu/mcheck/therm_throt.c
+++ b/arch/x86/kernel/cpu/mcheck/therm_throt.c
@@ -23,6 +23,7 @@
 #include <linux/init.h>
 #include <linux/smp.h>
 #include <linux/cpu.h>
+#include <asm/trace/irq_vectors.h>
 
 #include <asm/processor.h>
 #include <asm/apic.h>
@@ -378,17 +379,24 @@ static void unexpected_thermal_interrupt(void)
 
 static void (*smp_thermal_vector)(void) = unexpected_thermal_interrupt;
 
-asmlinkage void smp_thermal_interrupt(struct pt_regs *regs)
-{
-	irq_enter();
-	exit_idle();
-	inc_irq_stat(irq_thermal_count);
-	smp_thermal_vector();
-	irq_exit();
-	/* Ack only at the end to avoid potential reentry */
-	ack_APIC_irq();
+#define SMP_THERMAL_INTERRUPT(trace, trace_enter, trace_exit)		\
+asmlinkage void smp_##trace##thermal_interrupt(struct pt_regs *regs)	\
+{									\
+	irq_enter();							\
+	exit_idle();							\
+	trace_enter;							\
+	inc_irq_stat(irq_thermal_count);				\
+	smp_thermal_vector();						\
+	trace_exit;							\
+	irq_exit();							\
+	/* Ack only at the end to avoid potential reentry */		\
+	ack_APIC_irq();							\
 }
 
+SMP_THERMAL_INTERRUPT(,,)
+SMP_THERMAL_INTERRUPT(trace_, trace_thermal_apic_entry(THERMAL_APIC_VECTOR),
+		      trace_thermal_apic_exit(THERMAL_APIC_VECTOR))
+
 /* Thermal monitoring depends on APIC, ACPI and clock modulation */
 static int intel_thermal_supported(struct cpuinfo_x86 *c)
 {
diff --git a/arch/x86/kernel/cpu/mcheck/threshold.c b/arch/x86/kernel/cpu/mcheck/threshold.c
index aa578ca..b7a95c5 100644
--- a/arch/x86/kernel/cpu/mcheck/threshold.c
+++ b/arch/x86/kernel/cpu/mcheck/threshold.c
@@ -4,6 +4,7 @@
 #include <linux/interrupt.h>
 #include <linux/kernel.h>
 
+#include <asm/trace/irq_vectors.h>
 #include <asm/irq_vectors.h>
 #include <asm/apic.h>
 #include <asm/idle.h>
@@ -17,13 +18,21 @@ static void default_threshold_interrupt(void)
 
 void (*mce_threshold_vector)(void) = default_threshold_interrupt;
 
-asmlinkage void smp_threshold_interrupt(void)
-{
-	irq_enter();
-	exit_idle();
-	inc_irq_stat(irq_threshold_count);
-	mce_threshold_vector();
-	irq_exit();
-	/* Ack only at the end to avoid potential reentry */
-	ack_APIC_irq();
+#define SMP_THRESHOLD_INTERRUPT(trace, trace_enter, trace_exit)	\
+asmlinkage void smp_##trace##threshold_interrupt(void)			\
+{									\
+	irq_enter();							\
+	exit_idle();							\
+	trace_enter;							\
+	inc_irq_stat(irq_threshold_count);				\
+	mce_threshold_vector();						\
+	trace_exit;							\
+	irq_exit();							\
+	/* Ack only at the end to avoid potential reentry */		\
+	ack_APIC_irq();							\
 }
+
+SMP_THRESHOLD_INTERRUPT(,,)
+SMP_THRESHOLD_INTERRUPT(trace_,
+			trace_threshold_apic_entry(THRESHOLD_APIC_VECTOR),
+			trace_threshold_apic_exit(THRESHOLD_APIC_VECTOR))
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index cdc790c..20faa26 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1187,6 +1187,39 @@ apicinterrupt IRQ_WORK_VECTOR \
 	irq_work_interrupt smp_irq_work_interrupt
 #endif
 
+#ifdef CONFIG_TRACEPOINTS
+
+apicinterrupt LOCAL_TIMER_VECTOR \
+	trace_apic_timer_interrupt smp_trace_apic_timer_interrupt
+apicinterrupt X86_PLATFORM_IPI_VECTOR \
+	trace_x86_platform_ipi smp_trace_x86_platform_ipi
+
+apicinterrupt THRESHOLD_APIC_VECTOR \
+	trace_threshold_interrupt smp_trace_threshold_interrupt
+apicinterrupt THERMAL_APIC_VECTOR \
+	trace_thermal_interrupt smp_trace_thermal_interrupt
+
+#ifdef CONFIG_SMP
+apicinterrupt CALL_FUNCTION_SINGLE_VECTOR \
+	trace_call_function_single_interrupt \
+	smp_trace_call_function_single_interrupt
+apicinterrupt CALL_FUNCTION_VECTOR \
+	trace_call_function_interrupt smp_trace_call_function_interrupt
+apicinterrupt RESCHEDULE_VECTOR \
+	trace_reschedule_interrupt smp_trace_reschedule_interrupt
+#endif
+
+apicinterrupt ERROR_APIC_VECTOR \
+	trace_error_interrupt smp_trace_error_interrupt
+apicinterrupt SPURIOUS_APIC_VECTOR \
+	trace_spurious_interrupt smp_trace_spurious_interrupt
+
+#ifdef CONFIG_IRQ_WORK
+apicinterrupt IRQ_WORK_VECTOR \
+	trace_irq_work_interrupt smp_trace_irq_work_interrupt
+#endif
+#endif /* CONFIG_TRACEPOINTS */
+
 /*
  * Exception entry points.
  */
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 94bf9cc..cc32708 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -455,6 +455,12 @@ ENTRY(idt_table)
 ENTRY(nmi_idt_table)
 	.skip IDT_ENTRIES * 16
 
+#ifdef CONFIG_TRACEPOINTS
+	.align L1_CACHE_BYTES
+ENTRY(trace_idt_table)
+	.skip IDT_ENTRIES * 16
+#endif
+
 	__PAGE_ALIGNED_BSS
 	.align PAGE_SIZE
 ENTRY(empty_zero_page)
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index e4595f1..9fd70ad 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -18,6 +18,8 @@
 #include <asm/mce.h>
 #include <asm/hw_irq.h>
 
+#include <asm/trace/irq_vectors.h>
+
 atomic_t irq_err_count;
 
 /* Function pointer for generic interrupt vector handling */
@@ -208,26 +210,32 @@ unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
 /*
  * Handler for X86_PLATFORM_IPI_VECTOR.
  */
-void smp_x86_platform_ipi(struct pt_regs *regs)
-{
-	struct pt_regs *old_regs = set_irq_regs(regs);
-
-	ack_APIC_irq();
-
-	irq_enter();
-
-	exit_idle();
-
-	inc_irq_stat(x86_platform_ipis);
-
-	if (x86_platform_ipi_callback)
-		x86_platform_ipi_callback();
-
-	irq_exit();
-
-	set_irq_regs(old_regs);
+#define SMP_X86_PLATFORM_IPI(trace, trace_enter, trace_exit)		\
+void smp_##trace##x86_platform_ipi(struct pt_regs *regs)		\
+{									\
+	struct pt_regs *old_regs = set_irq_regs(regs);			\
+									\
+	ack_APIC_irq();							\
+									\
+	irq_enter();							\
+									\
+	exit_idle();							\
+	trace_enter;							\
+	inc_irq_stat(x86_platform_ipis);				\
+									\
+	if (x86_platform_ipi_callback)					\
+		x86_platform_ipi_callback();				\
+	trace_exit;							\
+	irq_exit();							\
+									\
+	set_irq_regs(old_regs);						\
 }
 
+SMP_X86_PLATFORM_IPI(,,)
+SMP_X86_PLATFORM_IPI(trace_,
+		     trace_x86_platform_ipi_entry(X86_PLATFORM_IPI_VECTOR),
+		     trace_x86_platform_ipi_exit(X86_PLATFORM_IPI_VECTOR))
+
 EXPORT_SYMBOL_GPL(vector_used_by_percpu_irq);
 
 #ifdef CONFIG_HOTPLUG_CPU
diff --git a/arch/x86/kernel/irq_work.c b/arch/x86/kernel/irq_work.c
index ca8f703..a669b94 100644
--- a/arch/x86/kernel/irq_work.c
+++ b/arch/x86/kernel/irq_work.c
@@ -8,16 +8,24 @@
 #include <linux/irq_work.h>
 #include <linux/hardirq.h>
 #include <asm/apic.h>
+#include <asm/trace/irq_vectors.h>
 
-void smp_irq_work_interrupt(struct pt_regs *regs)
-{
-	irq_enter();
-	ack_APIC_irq();
-	inc_irq_stat(apic_irq_work_irqs);
-	irq_work_run();
-	irq_exit();
+#define SMP_IRQ_WORK_INTERRUPT(trace, trace_enter, trace_exit)		\
+void smp_##trace##irq_work_interrupt(struct pt_regs *regs)		\
+{									\
+	irq_enter();							\
+	ack_APIC_irq();							\
+	trace_enter;							\
+	inc_irq_stat(apic_irq_work_irqs);				\
+	irq_work_run();							\
+	trace_exit;							\
+	irq_exit();							\
 }
 
+SMP_IRQ_WORK_INTERRUPT(,,)
+SMP_IRQ_WORK_INTERRUPT(trace_, trace_irq_work_entry(IRQ_WORK_VECTOR),
+		       trace_irq_work_exit(IRQ_WORK_VECTOR))
+
 void arch_irq_work_raise(void)
 {
 #ifdef CONFIG_X86_LOCAL_APIC
diff --git a/arch/x86/kernel/irqinit.c b/arch/x86/kernel/irqinit.c
index 6e03b0d..cf76128 100644
--- a/arch/x86/kernel/irqinit.c
+++ b/arch/x86/kernel/irqinit.c
@@ -251,4 +251,6 @@ void __init native_init_IRQ(void)
 
 	irq_ctx_init(smp_processor_id());
 #endif
+
+	trace_idt_table_init();
 }
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 48d2b7d..d8e1a2c 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -23,6 +23,7 @@
 #include <linux/interrupt.h>
 #include <linux/cpu.h>
 #include <linux/gfp.h>
+#include <asm/trace/irq_vectors.h>
 
 #include <asm/mtrr.h>
 #include <asm/tlbflush.h>
@@ -249,34 +250,57 @@ finish:
 /*
  * Reschedule call back.
  */
-void smp_reschedule_interrupt(struct pt_regs *regs)
-{
-	ack_APIC_irq();
-	inc_irq_stat(irq_resched_count);
-	scheduler_ipi();
-	/*
-	 * KVM uses this interrupt to force a cpu out of guest mode
-	 */
+#define SMP_RESCHEDULE_INTERRUPT(trace, trace_enter, trace_exit)	\
+void smp_##trace##reschedule_interrupt(struct pt_regs *regs)		\
+{									\
+	ack_APIC_irq();							\
+	trace_enter;							\
+	inc_irq_stat(irq_resched_count);				\
+	scheduler_ipi();						\
+	trace_exit;							\
+	/*								\
+	 * KVM uses this interrupt to force a cpu out of guest mode	\
+	 */								\
 }
 
-void smp_call_function_interrupt(struct pt_regs *regs)
-{
-	ack_APIC_irq();
-	irq_enter();
-	generic_smp_call_function_interrupt();
-	inc_irq_stat(irq_call_count);
-	irq_exit();
+SMP_RESCHEDULE_INTERRUPT(,,)
+SMP_RESCHEDULE_INTERRUPT(trace_, trace_reschedule_entry(RESCHEDULE_VECTOR),
+			 trace_reschedule_exit(RESCHEDULE_VECTOR))
+
+#define SMP_CALL_FUNCTION_INTERRUPT(trace, trace_enter, trace_exit)	\
+void smp_##trace##call_function_interrupt(struct pt_regs *regs)	\
+{									\
+	ack_APIC_irq();							\
+	irq_enter();							\
+	trace_enter;							\
+	generic_smp_call_function_interrupt();				\
+	inc_irq_stat(irq_call_count);					\
+	trace_exit;							\
+	irq_exit();							\
 }
 
-void smp_call_function_single_interrupt(struct pt_regs *regs)
-{
-	ack_APIC_irq();
-	irq_enter();
-	generic_smp_call_function_single_interrupt();
-	inc_irq_stat(irq_call_count);
-	irq_exit();
+SMP_CALL_FUNCTION_INTERRUPT(,,)
+SMP_CALL_FUNCTION_INTERRUPT(trace_,
+			    trace_call_function_entry(CALL_FUNCTION_VECTOR),
+			    trace_call_function_exit(CALL_FUNCTION_VECTOR))
+
+#define SMP_CALL_FUNCTION_SINGLE_INTERRUPT(trace, trace_enter, trace_exit)\
+void smp_##trace##call_function_single_interrupt(struct pt_regs *regs)	\
+{									\
+	ack_APIC_irq();							\
+	irq_enter();							\
+	trace_enter;							\
+	generic_smp_call_function_single_interrupt();			\
+	inc_irq_stat(irq_call_count);					\
+	trace_exit;							\
+	irq_exit();							\
 }
 
+SMP_CALL_FUNCTION_SINGLE_INTERRUPT(,,)
+SMP_CALL_FUNCTION_SINGLE_INTERRUPT(trace_,
+		trace_call_function_single_entry(CALL_FUNCTION_SINGLE_VECTOR),
+		trace_call_function_single_exit(CALL_FUNCTION_SINGLE_VECTOR))
+
 static int __init nonmi_ipi_setup(char *str)
 {
 	smp_no_nmi_ipi = true;
diff --git a/arch/x86/kernel/tracepoint.c b/arch/x86/kernel/tracepoint.c
new file mode 100644
index 0000000..d7c96ba
--- /dev/null
+++ b/arch/x86/kernel/tracepoint.c
@@ -0,0 +1,102 @@
+/*
+ * Code for supporting irq vector tracepoints.
+ *
+ * Copyright (C) 2012 Seiji Aguchi <sei...@hd...>
+ *
+ */
+#include <asm/hw_irq.h>
+#include <asm/desc.h>
+
+static struct desc_ptr trace_idt_descr = { NR_VECTORS * 16 - 1,
+				    (unsigned long) trace_idt_table };
+
+#ifndef CONFIG_X86_64
+gate_desc trace_idt_table[NR_VECTORS] __page_aligned_data
+					= { { { { 0, 0 } } }, };
+#endif
+
+void __init trace_idt_table_init(void)
+{
+	memcpy(&trace_idt_table, &idt_table, IDT_ENTRIES * 16);
+	/*
+	 * The reschedule interrupt is a CPU-to-CPU reschedule-helper
+	 * IPI, driven by wakeup.
+	 */
+	trace_set_intr_gate(RESCHEDULE_VECTOR, trace_reschedule_interrupt);
+
+	/* IPI for generic function call */
+	trace_set_intr_gate(CALL_FUNCTION_VECTOR,
+			    trace_call_function_interrupt);
+
+	/* IPI for generic single function call */
+	trace_set_intr_gate(CALL_FUNCTION_SINGLE_VECTOR,
+			    trace_call_function_single_interrupt);
+
+#ifdef CONFIG_X86_THERMAL_VECTOR
+	trace_set_intr_gate(THERMAL_APIC_VECTOR, trace_thermal_interrupt);
+#endif
+#ifdef CONFIG_X86_MCE_THRESHOLD
+	trace_set_intr_gate(THRESHOLD_APIC_VECTOR, trace_threshold_interrupt);
+#endif
+
+#if defined(CONFIG_X86_64) || defined(CONFIG_X86_LOCAL_APIC)
+	/* self generated IPI for local APIC timer */
+	trace_set_intr_gate(LOCAL_TIMER_VECTOR, trace_apic_timer_interrupt);
+
+	/* IPI for X86 platform specific use */
+	trace_set_intr_gate(X86_PLATFORM_IPI_VECTOR, trace_x86_platform_ipi);
+
+	/* IPI vectors for APIC spurious and error interrupts */
+	trace_set_intr_gate(SPURIOUS_APIC_VECTOR, trace_spurious_interrupt);
+	trace_set_intr_gate(ERROR_APIC_VECTOR, trace_error_interrupt);
+
+	/* IRQ work interrupts: */
+# ifdef CONFIG_IRQ_WORK
+	trace_set_intr_gate(IRQ_WORK_VECTOR, trace_irq_work_interrupt);
+# endif
+# endif
+}
+
+static struct desc_ptr orig_idt_descr[NR_CPUS];
+static int trace_irq_vector_refcount;
+
+static void switch_trace_idt(void *arg)
+{
+	store_idt(&orig_idt_descr[smp_processor_id()]);
+	load_idt(&trace_idt_descr);
+
+	return;
+}
+
+static void restore_original_idt(void *arg)
+{
+	if (orig_idt_descr[smp_processor_id()].address) {
+		load_idt(&orig_idt_descr[smp_processor_id()]);
+		memset(&orig_idt_descr[smp_processor_id()], 0,
+		       sizeof(struct desc_ptr));
+	}
+
+	return;
+}
+
+void trace_irq_vector_regfunc(void)
+{
+	if (!trace_irq_vector_refcount) {
+		smp_call_function(switch_trace_idt, NULL, 0);
+		local_irq_disable();
+		switch_trace_idt(NULL);
+		local_irq_enable();
+	}
+	trace_irq_vector_refcount++;
+}
+
+void trace_irq_vector_unregfunc(void)
+{
+	trace_irq_vector_refcount--;
+	if (!trace_irq_vector_refcount) {
+		smp_call_function(restore_original_idt, NULL, 0);
+		local_irq_disable();
+		restore_original_idt(NULL);
+		local_irq_enable();
+	}
+}
-- 1.7.1

Re: [Dle-develop] [PATCH v4] trace, x86: add x86 irq vector tracepoints

From: Seiji A. <sei...@hd...> - 2012-10-11 17:04:52

> >
> > I didn't say anything magic, but a table of pointers that are very
> > critical for the system running. Should we implement it with a single
> > switch, like we discussed in San Diego to do with the system call table?
> >
> > That is, have a "normal" table, and a "trace" table. The trace table
> > points to functions that have tracepoints. The first enabler of
> > tracing switches the table to use the tracepoints, and the last
> > disabler switches it back?
> >
> 
> That is certainly a reasonable implementation option.  It is slightly less usable than it is for system calls, though, because the vectors in
> the IDTs are somewhat scrambled and so you can't just do an indirect jump to the original vector content.  This does get messy
> because you also want to preserve registers...
> 

Peter, Steven, 

Thank you for explaining the reason why you think a time penalty should be zero
and discussing its implementation.

I will update my patch so that a time penalty makes zero and submit it shortly.

Seiji

[Dle-develop] system call trace test program

From: Satoru M. <sat...@hd...> - 2012-10-08 20:05:06

Attachments: syscall_trace_test.tar.bz2

This is a set of programs for testing systemcall tracepoints. It includes
test programs which call x86_64 and ia32 system calls. We can test all
system call entry path which described below.

 1. syscall (arch/x86_64/kernel/entry.S)
 2. ia32_sysenter_target (arch/x86_64/ia32/ia32entry.S)
 3. ia32_syscall (arch/x86_64/ia32/ia32entry.S)
 4. ia32_cstar_target (arxh/x86_64/ia32/ia32entry.S)

Also it includes the kernel module which adds probe functions to
tracepoint and system call entry points with kprobes. This module
traces the process which specified with pid module parameter.

As a result, We can test system call tracing function for all entry
paths with this test programs.

Regards,
Satoru Moriya

Re: [Dle-develop] [PATCH v4] trace, x86: add x86 irq vector tracepoints

From: H. P. A. <hp...@zy...> - 2012-10-07 15:25:46

On 10/06/2012 10:57 AM, Steven Rostedt wrote:
> 
> I didn't say anything magic, but a table of pointers that are very
> critical for the system running. Should we implement it with a single
> switch, like we discussed in San Diego to do with the system call table?
> 
> That is, have a "normal" table, and a "trace" table. The trace table
> points to functions that have tracepoints. The first enabler of tracing
> switches the table to use the tracepoints, and the last disabler
> switches it back?
> 

That is certainly a reasonable implementation option.  It is slightly
less usable than it is for system calls, though, because the vectors in
the IDTs are somewhat scrambled and so you can't just do an indirect
jump to the original vector content.  This does get messy because you
also want to preserve registers...

>>
>>> You are the maintainer and are responsible for the outcome of changes to
>>> the x86 arch, thus you do have final say. And if you think there's
>>> nothing to worry about with an IDT change then Seiji should implement
>>> it.
>>>
>>> I just want to point out some possible repercussions of doing it in a
>>> more complex way. As tracepoints use nops, and I may be pushing to even
>>> out-of-line the tracepoint unlikely part even more, I'm not sure the
>>> complexity is worth the amount of savings it would be against just
>>> adding the tracepoint in the code.
>>
>> The problem I'm seeing is the constant "oh, just a little bit more."  My
>> experience over the years is that there is always demand for "just one
>> more debug feature", each of which has negible cost, because they always
>> use the previous thing as a baseline... noone ever looks at the grand
>> total cost of the package (and by the time that happens, it is too late.)
> 
> Now I can turn this back at you ;-)  We can implement the simple "just
> add the tracepoints in the code" first, and then later implement the
> table swap version and we can say "hey! we just made the code faster!".

Can we?  My understanding how tracepoints are is that they export a
bunch of data structures via debugfs (a.k.a. sploitfs -- any system with
debugfs mounted really should be considered compromised from the start)
and that they are intimately tied to how they are implemented.

>>
>> tracepoints in particular are something I'm getting concerned about,
>> because they are one of those things that turn kernel internals into an
>> ABI, which means misdesigned tracepoints can actually make kernel
>> internal changes very hard to do.  The cost of those kinds of issues go
>> up over time as the strain between where we'd like the code to be vs.
>> where the code is increases.
> 
> Honestly, I'm extremely concerned about this too. In fact, I've bitched
> about this so many times in the past, but it just fell to deaf ears:
> 
> http://lwn.net/Articles/412685/
> http://lwn.net/Articles/415591/
> http://lwn.net/Articles/416665/
> http://lwn.net/Articles/416684/
> 

Slightly different complaint, but in the same general vein, yes.

	-hpa

Re: [Dle-develop] [PATCH v4] trace, x86: add x86 irq vector tracepoints

From: Borislav P. <bp...@al...> - 2012-10-06 23:33:32

On Sat, Oct 06, 2012 at 02:26:17PM -0400, Steven Rostedt wrote:
> On Sat, 2012-10-06 at 19:32 +0200, Borislav Petkov wrote:
> 
> > > 2) Are the tracepoints done in a way that it's not going to cause "ABI"
> > > issues. If not then we need to redesign the tracepoints.
> > 
> > Btw, this we should be asking ourselves about *all* TPs, especially if
> > they're in generic code.
> 
> I agree, and I'm starting to think I shouldn't have given free reign
> over the TPs to system maintainers. That is, I should have pushed harder
> to understand all tracepoints added to code to make sure the maintainer
> knows that it can become an ABI.
> 
> Some maintainers don't worry about it. But I can see it coming back to
> haunt them. In the end, it will hurt the maintainer of the code, which
> is why I gave the ownership of tracepoints to locations where they are
> at (instead of a "joint" ownership). But I probably should have been a
> TP cop for a while to allow them to understand the consequences first.

Yeah, even if you were the TP cop and had a shiny uniform with a badge
8-), do you think you'd have the time to review all the code adding TPs?

I think maybe it would've been better to add some text to Documentation
explaining with what care TPs should be designed, have checkpatch warn
on all new tracepoints, hope for the best and prepare for the worst. In
addition maybe review all TPs added to generic or arch-you-care-about
code. Maybe...

-- 
Regards/Gruss,
    Boris.

Re: [Dle-develop] [PATCH v4] trace, x86: add x86 irq vector tracepoints

From: Steven R. <ro...@go...> - 2012-10-06 18:26:25

On Sat, 2012-10-06 at 19:32 +0200, Borislav Petkov wrote:

> > 2) Are the tracepoints done in a way that it's not going to cause "ABI"
> > issues. If not then we need to redesign the tracepoints.
> 
> Btw, this we should be asking ourselves about *all* TPs, especially if
> they're in generic code.

I agree, and I'm starting to think I shouldn't have given free reign
over the TPs to system maintainers. That is, I should have pushed harder
to understand all tracepoints added to code to make sure the maintainer
knows that it can become an ABI.

Some maintainers don't worry about it. But I can see it coming back to
haunt them. In the end, it will hurt the maintainer of the code, which
is why I gave the ownership of tracepoints to locations where they are
at (instead of a "joint" ownership). But I probably should have been a
TP cop for a while to allow them to understand the consequences first.

-- Steve

Re: [Dle-develop] [PATCH v4] trace, x86: add x86 irq vector tracepoints

From: Borislav P. <bp...@al...> - 2012-10-06 17:32:41

On Sat, Oct 06, 2012 at 10:51:45AM -0400, Steven Rostedt wrote:
> On Sat, 2012-10-06 at 15:05 +0200, Borislav Petkov wrote:
> > What I'm missing with all those patches on LKML is: here's a patch that
> > doesn't add a new feature but gives us n% improv with this and that
> > workload. I wish we had more of those instead of the gazillion new
> > features each 3 months.
> 
> That would be nice too. But we can also add a patch that gives us
> negligible improvement that makes things a little more complex and also
> opens the possibility of a security hole.
> 
> Thus my question is, is the swap IDT really worth it? I'm fine if
> someone goes ahead and implements it. Heck, I'd love to implement it
> when I have time, as it refreshes my knowledge of how intel archs do
> interrupt processing.
> 
> I'm just worried that we are adding more complexity to code for very
> little gain.
> 
> I think we need to take another look at this.
> 
> 1) Are the tracepoints that Seiji worth adding? If not then we can stop
> the discussion here.

I like straight talk - it saves everybody a lot of time :-)

But seriously, I was adressing the general observation how a lot of new
features get added because "it would be cool if we could do X and Y" and
how we're progressively getting fatter and slowing down over time.

And I like how you're giving that feature a hard look - something we
should be doing always, btw.

So http://marc.info/?l=linux-kernel&m=134827445716419 talks about how it
is good to know which cores handle IRQs and how this affects the system,
and IRQ interaction and yadda yadda... But frankly speaking, that still
doesn't give me a hard-on; I gotta say - it sounds more like a debugging
feature which people can enable, with certain overhead like most of
those in "Kernel hacking" but the general public doesn't need it.

So, do we really really need this or are we better off with a debugging
design where we don't care about overhead?

Hmm, I'd say make it off by default and let people who want it enable it
and go crazy.

> 2) Are the tracepoints done in a way that it's not going to cause "ABI"
> issues. If not then we need to redesign the tracepoints.

Btw, this we should be asking ourselves about *all* TPs, especially if
they're in generic code.

[ … ]

Thanks.

-- 
Regards/Gruss,
    Boris.

Re: [Dle-develop] [PATCH v4] trace, x86: add x86 irq vector tracepoints

From: Steven R. <ro...@go...> - 2012-10-06 14:51:53

On Sat, 2012-10-06 at 15:05 +0200, Borislav Petkov wrote:

> What I'm missing with all those patches on LKML is: here's a patch that
> doesn't add a new feature but gives us n% improv with this and that
> workload. I wish we had more of those instead of the gazillion new
> features each 3 months.

That would be nice too. But we can also add a patch that gives us
negligible improvement that makes things a little more complex and also
opens the possibility of a security hole.

Thus my question is, is the swap IDT really worth it? I'm fine if
someone goes ahead and implements it. Heck, I'd love to implement it
when I have time, as it refreshes my knowledge of how intel archs do
interrupt processing.

I'm just worried that we are adding more complexity to code for very
little gain.

I think we need to take another look at this.

1) Are the tracepoints that Seiji worth adding? If not then we can stop
the discussion here.

2) Are the tracepoints done in a way that it's not going to cause "ABI"
issues. If not then we need to redesign the tracepoints.

3) If we are here, then we have tracepoints that are worth adding and
are done in a way that they should be stable for the long term. OK, how
to implement them?  The question really is, should we keep it 0% impact
when not active by the IDT switch or allow for the negligible impact by
adding the tracepoints into the code directly and not worrying about it.

  a) The tracepoints are going in the code regardless. Even with a 
  switch we need to have a duplicate of the calls, one with and one
  without the tracepoints.

  b) It can be done with one big change: add the tracepoints and do the
  duplicate with and without versions for the IDT switch. Or we can
  break it into two parts. First, add the tracepoints, then add the
  switch with the duplicates. I prefer this method if we are doing
  the switch.

I expect that if we do the switch we would have something like this:

void __irq_entry smp_apic_timer_interrupt(struct pt_regs *regs)
{
	struct pt_regs *old_regs = set_irq_regs(regs);

	/*
	 * NOTE! We'd better ACK the irq immediately,
	 * because timer handling can be slow.
	 */
	ack_APIC_irq();
	/*
	 * update_process_times() expects us to have done irq_enter().
	 * Besides, if we don't timer interrupts ignore the global
	 * interrupt lock, which is the WrongThing (tm) to do.
	 */
	irq_enter();
	exit_idle();
	local_apic_timer_interrupt();
	irq_exit();

	set_irq_regs(old_regs);
}

void __irq_entry trace_smp_apic_timer_interrupt(struct pt_regs *regs)
{
	struct pt_regs *old_regs = set_irq_regs(regs);

	/*
	 * NOTE! We'd better ACK the irq immediately,
	 * because timer handling can be slow.
	 */
	ack_APIC_irq();
	/*
	 * update_process_times() expects us to have done irq_enter().
	 * Besides, if we don't timer interrupts ignore the global
	 * interrupt lock, which is the WrongThing (tm) to do.
	 */
	irq_enter();
	exit_idle();
	trace_local_timer_entry(LOCAL_TIMER_VECTOR);
	local_apic_timer_interrupt();
	trace_local_timer_exit(LOCAL_TIMER_VECTOR);
	irq_exit();

	set_irq_regs(old_regs);
}

Now we have two functions accomplishing the same task. Any change to one
must be done to change the other. Due to rcu idle, the tracepoint needs
to be after the exit_idle() and before irq_exit().

We could force the two to be in step by using ugly macro magic:

#define APIC_TIMER_INTERRUPT(trace, trace_enter, trace_exit)	\
void __irq_entry trace##_smp_apic_timer_interrupt(struct pt_regs *regs) \
{ \
	struct pt_regs *old_regs = set_irq_regs(regs); \
\
	/* \
	 * NOTE! We'd better ACK the irq immediately, \
	 * because timer handling can be slow. \
	 */ \
	ack_APIC_irq(); \
	/* \
	 * update_process_times() expects us to have done irq_enter(). \
	 * Besides, if we don't timer interrupts ignore the global \
	 * interrupt lock, which is the WrongThing (tm) to do. \
	 */ \
	irq_enter(); \
	exit_idle(); \
	trace_enter; \
	local_apic_timer_interrupt(); \
	trace_exit; \
	irq_exit(); \
 \
	set_irq_regs(old_regs); \
}

APIC_TIMER_INTERRUPT(,,)
APIC_TIMER_INTERRUPT(trace,trace_local_timer_entry(LOCAL_TIMER_VECTOR), trace_local_timer_exit(LOCAL_TIMER_VECTOR))

But I'm not sure we want to go there.

-- Steve

Re: [Dle-develop] [PATCH v4] trace, x86: add x86 irq vector tracepoints

From: Borislav P. <bp...@al...> - 2012-10-06 13:24:37

On Fri, Oct 05, 2012 at 10:57:41PM -0400, Steven Rostedt wrote:
> > The problem I'm seeing is the constant "oh, just a little bit more."  My
> > experience over the years is that there is always demand for "just one
> > more debug feature", each of which has negible cost, because they always
> > use the previous thing as a baseline... noone ever looks at the grand
> > total cost of the package (and by the time that happens, it is too late.)
> 
> Now I can turn this back at you ;-)  We can implement the simple "just
> add the tracepoints in the code" first, and then later implement the
> table swap version and we can say "hey! we just made the code faster!".

I absolutely agree with hpa here - it's like he's reading my mind. The
sum of the total cost of all those features simply and surely slows down
the kernel with time and if we don't pay attention, we might get bogged
down with fat no matter the IPC improvements hw guys give us. Which are
not endless, btw, in case someone wonders.

What I'm missing with all those patches on LKML is: here's a patch that
doesn't add a new feature but gives us n% improv with this and that
workload. I wish we had more of those instead of the gazillion new
features each 3 months.

> > tracepoints in particular are something I'm getting concerned about,
> > because they are one of those things that turn kernel internals into an
> > ABI, which means misdesigned tracepoints can actually make kernel
> > internal changes very hard to do.  The cost of those kinds of issues go
> > up over time as the strain between where we'd like the code to be vs.
> > where the code is increases.
> 
> Honestly, I'm extremely concerned about this too. In fact, I've bitched
> about this so many times in the past, but it just fell to deaf ears:
> 
> http://lwn.net/Articles/412685/
> http://lwn.net/Articles/415591/
> http://lwn.net/Articles/416665/
> http://lwn.net/Articles/416684/

Absolutely agreed too. This is why we had such a long discussion about
the RAS tracepoint format recently, for example.

Thanks.

-- 
Regards/Gruss,
    Boris.

Re: [Dle-develop] [PATCH v4] trace, x86: add x86 irq vector tracepoints

From: Steven R. <ro...@go...> - 2012-10-06 02:57:49

On Fri, 2012-10-05 at 17:16 -0700, H. Peter Anvin wrote:
> On 10/05/2012 07:13 AM, Steven Rostedt wrote:
> > 
> > Peter,
> > 
> > I agree that the IDT version is a zero cost in performance, where as the
> > tracepoint version is a negligible cost in performance. But my worry is
> > the complexity (read error prone and possible openings of security
> > exploits) worth it?
> > 
> > Switching of the IDT is not that trivial, and to make it something for
> > common activities such as reading tracepoints by tools like ftrace and
> > perf, that do it often, even on production machines, may lead to issues
> > if its not done right.
> > 
> 
> It's a table of pointers... there really isn't anything magic about it
> (except perhaps the slightly weird format.)

I didn't say anything magic, but a table of pointers that are very
critical for the system running. Should we implement it with a single
switch, like we discussed in San Diego to do with the system call table?

That is, have a "normal" table, and a "trace" table. The trace table
points to functions that have tracepoints. The first enabler of tracing
switches the table to use the tracepoints, and the last disabler
switches it back?

> 
> > You are the maintainer and are responsible for the outcome of changes to
> > the x86 arch, thus you do have final say. And if you think there's
> > nothing to worry about with an IDT change then Seiji should implement
> > it.
> > 
> > I just want to point out some possible repercussions of doing it in a
> > more complex way. As tracepoints use nops, and I may be pushing to even
> > out-of-line the tracepoint unlikely part even more, I'm not sure the
> > complexity is worth the amount of savings it would be against just
> > adding the tracepoint in the code.
> 
> The problem I'm seeing is the constant "oh, just a little bit more."  My
> experience over the years is that there is always demand for "just one
> more debug feature", each of which has negible cost, because they always
> use the previous thing as a baseline... noone ever looks at the grand
> total cost of the package (and by the time that happens, it is too late.)

Now I can turn this back at you ;-)  We can implement the simple "just
add the tracepoints in the code" first, and then later implement the
table swap version and we can say "hey! we just made the code faster!".

> 
> tracepoints in particular are something I'm getting concerned about,
> because they are one of those things that turn kernel internals into an
> ABI, which means misdesigned tracepoints can actually make kernel
> internal changes very hard to do.  The cost of those kinds of issues go
> up over time as the strain between where we'd like the code to be vs.
> where the code is increases.

Honestly, I'm extremely concerned about this too. In fact, I've bitched
about this so many times in the past, but it just fell to deaf ears:

http://lwn.net/Articles/412685/
http://lwn.net/Articles/415591/
http://lwn.net/Articles/416665/
http://lwn.net/Articles/416684/

-- Steve

Re: [Dle-develop] [PATCH v4] trace, x86: add x86 irq vector tracepoints

From: H. P. A. <hp...@zy...> - 2012-10-06 00:29:01

On 10/05/2012 07:13 AM, Steven Rostedt wrote:
> 
> Peter,
> 
> I agree that the IDT version is a zero cost in performance, where as the
> tracepoint version is a negligible cost in performance. But my worry is
> the complexity (read error prone and possible openings of security
> exploits) worth it?
> 
> Switching of the IDT is not that trivial, and to make it something for
> common activities such as reading tracepoints by tools like ftrace and
> perf, that do it often, even on production machines, may lead to issues
> if its not done right.
> 

It's a table of pointers... there really isn't anything magic about it
(except perhaps the slightly weird format.)

> You are the maintainer and are responsible for the outcome of changes to
> the x86 arch, thus you do have final say. And if you think there's
> nothing to worry about with an IDT change then Seiji should implement
> it.
> 
> I just want to point out some possible repercussions of doing it in a
> more complex way. As tracepoints use nops, and I may be pushing to even
> out-of-line the tracepoint unlikely part even more, I'm not sure the
> complexity is worth the amount of savings it would be against just
> adding the tracepoint in the code.

The problem I'm seeing is the constant "oh, just a little bit more."  My
experience over the years is that there is always demand for "just one
more debug feature", each of which has negible cost, because they always
use the previous thing as a baseline... noone ever looks at the grand
total cost of the package (and by the time that happens, it is too late.)

tracepoints in particular are something I'm getting concerned about,
because they are one of those things that turn kernel internals into an
ABI, which means misdesigned tracepoints can actually make kernel
internal changes very hard to do.  The cost of those kinds of issues go
up over time as the strain between where we'd like the code to be vs.
where the code is increases.

	-hpa

[Dle-develop] [RFC][PATCH 2/2] efi_pstore: add a sequence counter to a variable name

From: Seiji A. <sei...@hd...> - 2012-10-05 19:02:53

[Problem]

Currently, a variable name, which distinguishes each entry, consists of type, id and ctime.
But if multiple events happens in a short time, a second/third event may fail to log because
efi_pstore can't distinguish each event with current variable name.

[Solution]

  A reasonable way to identify all events precisely is introducing a sequence counter to
  the variable name.

[Patch Description]

  The sequence counter has already supported in a pstore layer with "oopscount".
  So, this patch adds it to a variable name.
  Also, it is passed to read/erase callbacks of platform drivers in accordance with
  the modification of the variable name.

  <before applying this patch>
 a variable name of first event: dump-type0-1-12345678
 a variable name of second event: dump-type0-1-12345678

  type:0
  id:1
  ctime:12345678

 If multiple events happen in a short time, efi_pstore can't distinguish them because
 variable names are same among them.

  <after applying this patch>

 it can be distinguishable by adding a sequence counter as follows.

 a variable name of first event: dump-type0-1-1-12345678
 a variable name of Second event: dump-type0-1-2-12345678

  type:0
  id:1
  sequence counter: 1(first event), 2(second event)
  ctime:12345678

Signed-off-by: Seiji Aguchi <sei...@hd...>
---
 drivers/acpi/apei/erst.c   |   12 ++++++------
 drivers/firmware/efivars.c |   25 ++++++++++++++++---------
 fs/pstore/inode.c          |    8 +++++---
 fs/pstore/internal.h       |    2 +-
 fs/pstore/platform.c       |   11 ++++++-----
 fs/pstore/ram.c            |    7 +++----
 include/linux/pstore.h     |    8 +++++---
 7 files changed, 42 insertions(+), 31 deletions(-)

diff --git a/drivers/acpi/apei/erst.c b/drivers/acpi/apei/erst.c
index 0bd6ae4..6d894bf 100644
--- a/drivers/acpi/apei/erst.c
+++ b/drivers/acpi/apei/erst.c
@@ -931,13 +931,13 @@ static int erst_check_table(struct acpi_table_erst *erst_tab)
 
 static int erst_open_pstore(struct pstore_info *psi);
 static int erst_close_pstore(struct pstore_info *psi);
-static ssize_t erst_reader(u64 *id, enum pstore_type_id *type,
+static ssize_t erst_reader(u64 *id, enum pstore_type_id *type, int *count,
 			   struct timespec *time, char **buf,
 			   struct pstore_info *psi);
 static int erst_writer(enum pstore_type_id type, enum kmsg_dump_reason reason,
-		       u64 *id, unsigned int part,
+		       u64 *id, unsigned int part, int count,
 		       size_t size, struct pstore_info *psi);
-static int erst_clearer(enum pstore_type_id type, u64 id,
+static int erst_clearer(enum pstore_type_id type, u64 id, int count,
 			struct timespec time, struct pstore_info *psi);
 
 static struct pstore_info erst_info = {
@@ -987,7 +987,7 @@ static int erst_close_pstore(struct pstore_info *psi)
 	return 0;
 }
 
-static ssize_t erst_reader(u64 *id, enum pstore_type_id *type,
+static ssize_t erst_reader(u64 *id, enum pstore_type_id *type, int *count,
 			   struct timespec *time, char **buf,
 			   struct pstore_info *psi)
 {
@@ -1055,7 +1055,7 @@ out:
 }
 
 static int erst_writer(enum pstore_type_id type, enum kmsg_dump_reason reason,
-		       u64 *id, unsigned int part,
+		       u64 *id, unsigned int part, int count,
 		       size_t size, struct pstore_info *psi)
 {
 	struct cper_pstore_record *rcd = (struct cper_pstore_record *)
@@ -1101,7 +1101,7 @@ static int erst_writer(enum pstore_type_id type, enum kmsg_dump_reason reason,
 	return ret;
 }
 
-static int erst_clearer(enum pstore_type_id type, u64 id,
+static int erst_clearer(enum pstore_type_id type, u64 id, int count,
 			struct timespec time, struct pstore_info *psi)
 {
 	return erst_clear(id);
diff --git a/drivers/firmware/efivars.c b/drivers/firmware/efivars.c
index 0eaaba3..511d6ee 100644
--- a/drivers/firmware/efivars.c
+++ b/drivers/firmware/efivars.c
@@ -658,13 +658,14 @@ static int efi_pstore_close(struct pstore_info *psi)
 }
 
 static ssize_t efi_pstore_read(u64 *id, enum pstore_type_id *type,
-			       struct timespec *timespec,
+			       int *count, struct timespec *timespec,
 			       char **buf, struct pstore_info *psi)
 {
 	efi_guid_t vendor = LINUX_EFI_CRASH_GUID;
 	struct efivars *efivars = psi->data;
 	char name[DUMP_NAME_LEN];
 	int i;
+	int cnt;
 	unsigned int part, size;
 	unsigned long time;
 
@@ -674,8 +675,10 @@ static ssize_t efi_pstore_read(u64 *id, enum pstore_type_id *type,
 			for (i = 0; i < DUMP_NAME_LEN; i++) {
 				name[i] = efivars->walk_entry->var.VariableName[i];
 			}
-			if (sscanf(name, "dump-type%u-%u-%lu", type, &part, &time) == 3) {
+			if (sscanf(name, "dump-type%u-%u-%d-%lu",
+				   type, &part, &cnt, &time) == 4) {
 				*id = part;
+				*count = cnt;
 				timespec->tv_sec = time;
 				timespec->tv_nsec = 0;
 				get_var_data_locked(efivars, &efivars->walk_entry->var);
@@ -698,7 +701,8 @@ static ssize_t efi_pstore_read(u64 *id, enum pstore_type_id *type,
 
 static int efi_pstore_write(enum pstore_type_id type,
 		enum kmsg_dump_reason reason, u64 *id,
-		unsigned int part, size_t size, struct pstore_info *psi)
+		unsigned int part, int count, size_t size,
+		struct pstore_info *psi)
 {
 	char name[DUMP_NAME_LEN];
 	efi_char16_t efi_name[DUMP_NAME_LEN];
@@ -726,7 +730,8 @@ static int efi_pstore_write(enum pstore_type_id type,
 		return -ENOSPC;
 	}
 
-	sprintf(name, "dump-type%u-%u-%lu", type, part, get_seconds());
+	sprintf(name, "dump-type%u-%u-%d-%lu", type, part, count,
+		get_seconds());
 
 	for (i = 0; i < DUMP_NAME_LEN; i++)
 		efi_name[i] = name[i];
@@ -746,7 +751,7 @@ static int efi_pstore_write(enum pstore_type_id type,
 	return ret;
 };
 
-static int efi_pstore_erase(enum pstore_type_id type, u64 id,
+static int efi_pstore_erase(enum pstore_type_id type, u64 id, int count,
 			    struct timespec time, struct pstore_info *psi)
 {
 	char name[DUMP_NAME_LEN];
@@ -756,7 +761,8 @@ static int efi_pstore_erase(enum pstore_type_id type, u64 id,
 	struct efivar_entry *entry, *found = NULL;
 	int i;
 
-	sprintf(name, "dump-type%u-%llu-%lu", type, id, time.tv_sec);
+	sprintf(name, "dump-type%u-%llu-%d-%lu", type, id, count,
+		time.tv_sec);
 
 	spin_lock(&efivars->lock);
 
@@ -806,7 +812,7 @@ static int efi_pstore_close(struct pstore_info *psi)
 	return 0;
 }
 
-static ssize_t efi_pstore_read(u64 *id, enum pstore_type_id *type,
+static ssize_t efi_pstore_read(u64 *id, enum pstore_type_id *type, int *count,
 			       struct timespec *timespec,
 			       char **buf, struct pstore_info *psi)
 {
@@ -815,12 +821,13 @@ static ssize_t efi_pstore_read(u64 *id, enum pstore_type_id *type,
 
 static int efi_pstore_write(enum pstore_type_id type,
 		enum kmsg_dump_reason reason, u64 *id,
-		unsigned int part, size_t size, struct pstore_info *psi)
+		unsigned int part, int count, size_t size,
+		struct pstore_info *psi)
 {
 	return 0;
 }
 
-static int efi_pstore_erase(enum pstore_type_id type, u64 id,
+static int efi_pstore_erase(enum pstore_type_id type, u64 id, int count,
 			    struct timespec time, struct pstore_info *psi)
 {
 	return 0;
diff --git a/fs/pstore/inode.c b/fs/pstore/inode.c
index 4300af6..ed1d8c7 100644
--- a/fs/pstore/inode.c
+++ b/fs/pstore/inode.c
@@ -49,6 +49,7 @@ struct pstore_private {
 	struct pstore_info *psi;
 	enum pstore_type_id type;
 	u64	id;
+	int	count;
 	ssize_t	size;
 	char	data[];
 };
@@ -175,8 +176,8 @@ static int pstore_unlink(struct inode *dir, struct dentry *dentry)
 	struct pstore_private *p = dentry->d_inode->i_private;
 
 	if (p->psi->erase)
-		p->psi->erase(p->type, p->id, dentry->d_inode->i_ctime,
-			      p->psi);
+		p->psi->erase(p->type, p->id, p->count,
+			      dentry->d_inode->i_ctime, p->psi);
 
 	return simple_unlink(dir, dentry);
 }
@@ -271,7 +272,7 @@ int pstore_is_mounted(void)
  * Load it up with "size" bytes of data from "buf".
  * Set the mtime & ctime to the date that this record was originally stored.
  */
-int pstore_mkfile(enum pstore_type_id type, char *psname, u64 id,
+int pstore_mkfile(enum pstore_type_id type, char *psname, u64 id, int count,
 		  char *data, size_t size, struct timespec time,
 		  struct pstore_info *psi)
 {
@@ -307,6 +308,7 @@ int pstore_mkfile(enum pstore_type_id type, char *psname, u64 id,
 		goto fail_alloc;
 	private->type = type;
 	private->id = id;
+	private->count = count;
 	private->psi = psi;
 
 	switch (type) {
diff --git a/fs/pstore/internal.h b/fs/pstore/internal.h
index 0d0d3b7..e170a66 100644
--- a/fs/pstore/internal.h
+++ b/fs/pstore/internal.h
@@ -44,7 +44,7 @@ extern struct pstore_info *psinfo;
 extern void	pstore_set_kmsg_bytes(int);
 extern void	pstore_get_records(int);
 extern int	pstore_mkfile(enum pstore_type_id, char *psname, u64 id,
-			      char *data, size_t size,
+			      int count, char *data, size_t size,
 			      struct timespec time, struct pstore_info *psi);
 extern int	pstore_is_mounted(void);
 
diff --git a/fs/pstore/platform.c b/fs/pstore/platform.c
index 29996e8..5a07c9e 100644
--- a/fs/pstore/platform.c
+++ b/fs/pstore/platform.c
@@ -136,7 +136,7 @@ static void pstore_dump(struct kmsg_dumper *dumper,
 			break;
 
 		ret = psinfo->write(PSTORE_TYPE_DMESG, reason, &id, part,
-				    hsize + len, psinfo);
+				    oopscount, hsize + len, psinfo);
 		if (ret == 0 && reason == KMSG_DUMP_OOPS && pstore_is_mounted())
 			pstore_new_entry = 1;
 
@@ -190,7 +190,7 @@ static void pstore_register_console(void) {}
 
 static int pstore_write_compat(enum pstore_type_id type,
 			       enum kmsg_dump_reason reason,
-			       u64 *id, unsigned int part,
+			       u64 *id, unsigned int part, int count,
 			       size_t size, struct pstore_info *psi)
 {
 	return psi->write_buf(type, reason, id, part, psinfo->buf, size, psi);
@@ -259,6 +259,7 @@ void pstore_get_records(int quiet)
 	char			*buf = NULL;
 	ssize_t			size;
 	u64			id;
+	int			count;
 	enum pstore_type_id	type;
 	struct timespec		time;
 	int			failed = 0, rc;
@@ -270,9 +271,9 @@ void pstore_get_records(int quiet)
 	if (psi->open && psi->open(psi))
 		goto out;
 
-	while ((size = psi->read(&id, &type, &time, &buf, psi)) > 0) {
-		rc = pstore_mkfile(type, psi->name, id, buf, (size_t)size,
-				  time, psi);
+	while ((size = psi->read(&id, &type, &count, &time, &buf, psi)) > 0) {
+		rc = pstore_mkfile(type, psi->name, id, count, buf,
+				  (size_t)size, time, psi);
 		kfree(buf);
 		buf = NULL;
 		if (rc && (rc != -EEXIST || !quiet))
diff --git a/fs/pstore/ram.c b/fs/pstore/ram.c
index 177b281..403a5fc 100644
--- a/fs/pstore/ram.c
+++ b/fs/pstore/ram.c
@@ -131,9 +131,8 @@ ramoops_get_next_prz(struct persistent_ram_zone *przs[], uint *c, uint max,
 }
 
 static ssize_t ramoops_pstore_read(u64 *id, enum pstore_type_id *type,
-				   struct timespec *time,
-				   char **buf,
-				   struct pstore_info *psi)
+				   int *count, struct timespec *time,
+				   char **buf, struct pstore_info *psi)
 {
 	ssize_t size;
 	struct ramoops_context *cxt = psi->data;
@@ -236,7 +235,7 @@ static int ramoops_pstore_write_buf(enum pstore_type_id type,
 	return 0;
 }
 
-static int ramoops_pstore_erase(enum pstore_type_id type, u64 id,
+static int ramoops_pstore_erase(enum pstore_type_id type, u64 id, int count,
 				timespec time, struct pstore_info *psi)
 {
 	struct ramoops_context *cxt = psi->data;
diff --git a/include/linux/pstore.h b/include/linux/pstore.h
index 71f43e0..c768758 100644
--- a/include/linux/pstore.h
+++ b/include/linux/pstore.h
@@ -50,17 +50,19 @@ struct pstore_info {
 	int		(*open)(struct pstore_info *psi);
 	int		(*close)(struct pstore_info *psi);
 	ssize_t		(*read)(u64 *id, enum pstore_type_id *type,
-			struct timespec *time, char **buf,
+			int *count, struct timespec *time, char **buf,
 			struct pstore_info *psi);
 	int		(*write)(enum pstore_type_id type,
 			enum kmsg_dump_reason reason, u64 *id,
-			unsigned int part, size_t size, struct pstore_info *psi);
+			unsigned int part, int count, size_t size,
+			struct pstore_info *psi);
 	int		(*write_buf)(enum pstore_type_id type,
 			enum kmsg_dump_reason reason, u64 *id,
 			unsigned int part, const char *buf, size_t size,
 			struct pstore_info *psi);
 	int		(*erase)(enum pstore_type_id type, u64 id,
-			struct timespec time, struct pstore_info *psi);
+			int count, struct timespec time,
+			struct pstore_info *psi);
 	void		*data;
 };
 
-- 1.7.1

[Dle-develop] [RFC][PATCH 1/2] efi_pstore: hold multiple logs

From: Seiji A. <sei...@hd...> - 2012-10-05 19:02:10

[Problem]

  Currently, efi_pstore driver simply overwrites existing panic messages in NVRAM.
  So, in the following scenario, we will lose 1st panic messages.

    1. kernel panics.
    2. efi_pstore is kicked and write panic messages to NVRAM.
    3. system reboots.
    4. kernel panics again before a user checks the 1st panic messages in NVRAM.

[Solution]

  This patch takes an approach just holding multiple logs to avoid overwriting existing
  entries.

[Patch Description]

  This patch changes write/erase/read callbacks as follows.

  - Write callback
    - Check if there are enough spaces to write logs with QueryVariableInfo()
      to avoid handling out of space situation. (It is suggested by Matthew Garrett.)

    - Remove a logic erasing existing entries.

  - Erase callback
    - Freshly create a logic rather than sharing it with a write callback because
      erasing entries doesn't need in a write callback.

    - Add ctime to an argument.
      Currently, a variable name, which is used to distinguish each log entry, consists of type, 
      id and ctime. But an erase callback does not use ctime.

      If efi_pstore supported just one log, type and id were enough.

      On the other hand, in case of supporting multiple logs, ctime is needed to precisely
      distinguish each entry at erasing time.
      Therefore this patch adds ctime to an argument of an erase callback is needed.

   - Read callback
    - Add ctime to an argument.
       This change is needed to identify the entry at reading time.

  <Example>

 As you can see below, we can distinguish first and second events with ctime.

 a variable name of first event: dump-type0-1-12345678
 a variable name of second event: dump-type0-1-23456789

  type:0
  id:1
  ctime:12345678, 23456789

Signed-off-by: Seiji Aguchi <sei...@hd...>
---
 drivers/acpi/apei/erst.c   |    4 +-
 drivers/firmware/efivars.c |   87 ++++++++++++++++++++++++++++----------------
 fs/pstore/inode.c          |    3 +-
 fs/pstore/ram.c            |    2 +-
 include/linux/efi.h        |    1 +
 include/linux/pstore.h     |    2 +-
 6 files changed, 63 insertions(+), 36 deletions(-)

diff --git a/drivers/acpi/apei/erst.c b/drivers/acpi/apei/erst.c
index e4d9d24..0bd6ae4 100644
--- a/drivers/acpi/apei/erst.c
+++ b/drivers/acpi/apei/erst.c
@@ -938,7 +938,7 @@ static int erst_writer(enum pstore_type_id type, enum kmsg_dump_reason reason,
 		       u64 *id, unsigned int part,
 		       size_t size, struct pstore_info *psi);
 static int erst_clearer(enum pstore_type_id type, u64 id,
-			struct pstore_info *psi);
+			struct timespec time, struct pstore_info *psi);
 
 static struct pstore_info erst_info = {
 	.owner		= THIS_MODULE,
@@ -1102,7 +1102,7 @@ static int erst_writer(enum pstore_type_id type, enum kmsg_dump_reason reason,
 }
 
 static int erst_clearer(enum pstore_type_id type, u64 id,
-			struct pstore_info *psi)
+			struct timespec time, struct pstore_info *psi)
 {
 	return erst_clear(id);
 }
diff --git a/drivers/firmware/efivars.c b/drivers/firmware/efivars.c
index d10c987..0eaaba3 100644
--- a/drivers/firmware/efivars.c
+++ b/drivers/firmware/efivars.c
@@ -701,23 +701,70 @@ static int efi_pstore_write(enum pstore_type_id type,
 		unsigned int part, size_t size, struct pstore_info *psi)
 {
 	char name[DUMP_NAME_LEN];
-	char stub_name[DUMP_NAME_LEN];
 	efi_char16_t efi_name[DUMP_NAME_LEN];
 	efi_guid_t vendor = LINUX_EFI_CRASH_GUID;
 	struct efivars *efivars = psi->data;
-	struct efivar_entry *entry, *found = NULL;
 	int i, ret = 0;
+	u64 storage_space, remaining_space, max_variable_size;
+	efi_status_t status = EFI_NOT_FOUND;
+
+	spin_lock(&efivars->lock);
+
+	/*
+	 * Check if there is a space enough to log.
+	 * size: a size of logging data
+	 * DUMP_NAME_LEN * 2: a maximum size of variable name
+	 *
+	 */
+	status = efivars->ops->query_variable_info(PSTORE_EFI_ATTRIBUTES,
+						   &storage_space,
+						   &remaining_space,
+						   &max_variable_size);
+	if (status || remaining_space < size + DUMP_NAME_LEN * 2) {
+		spin_unlock(&efivars->lock);
+		*id = part;
+		return -ENOSPC;
+	}
+
+	sprintf(name, "dump-type%u-%u-%lu", type, part, get_seconds());
+
+	for (i = 0; i < DUMP_NAME_LEN; i++)
+		efi_name[i] = name[i];
+
+	efivars->ops->set_variable(efi_name, &vendor, PSTORE_EFI_ATTRIBUTES,
+				   size, psi->buf);
+
+	spin_unlock(&efivars->lock);
+
+	if (size)
+		ret = efivar_create_sysfs_entry(efivars,
+					  utf16_strsize(efi_name,
+							DUMP_NAME_LEN * 2),
+					  efi_name, &vendor);
+
+	*id = part;
+	return ret;
+};
+
+static int efi_pstore_erase(enum pstore_type_id type, u64 id,
+			    struct timespec time, struct pstore_info *psi)
+{
+	char name[DUMP_NAME_LEN];
+	efi_char16_t efi_name[DUMP_NAME_LEN];
+	efi_guid_t vendor = LINUX_EFI_CRASH_GUID;
+	struct efivars *efivars = psi->data;
+	struct efivar_entry *entry, *found = NULL;
+	int i;
 
-	sprintf(stub_name, "dump-type%u-%u-", type, part);
-	sprintf(name, "%s%lu", stub_name, get_seconds());
+	sprintf(name, "dump-type%u-%llu-%lu", type, id, time.tv_sec);
 
 	spin_lock(&efivars->lock);
 
 	for (i = 0; i < DUMP_NAME_LEN; i++)
-		efi_name[i] = stub_name[i];
+		efi_name[i] = name[i];
 
 	/*
-	 * Clean up any entries with the same name
+	 * Clean up an entry with the same name
 	 */
 
 	list_for_each_entry(entry, &efivars->list, list) {
@@ -728,9 +775,6 @@ static int efi_pstore_write(enum pstore_type_id type,
 		if (utf16_strncmp(entry->var.VariableName, efi_name,
 				  utf16_strlen(efi_name)))
 			continue;
-		/* Needs to be a prefix */
-		if (entry->var.VariableName[utf16_strlen(efi_name)] == 0)
-			continue;
 
 		/* found */
 		found = entry;
@@ -738,37 +782,17 @@ static int efi_pstore_write(enum pstore_type_id type,
 					   &entry->var.VendorGuid,
 					   PSTORE_EFI_ATTRIBUTES,
 					   0, NULL);
+		break;
 	}
 
 	if (found)
 		list_del(&found->list);
 
-	for (i = 0; i < DUMP_NAME_LEN; i++)
-		efi_name[i] = name[i];
-
-	efivars->ops->set_variable(efi_name, &vendor, PSTORE_EFI_ATTRIBUTES,
-				   size, psi->buf);
-
 	spin_unlock(&efivars->lock);
 
 	if (found)
 		efivar_unregister(found);
 
-	if (size)
-		ret = efivar_create_sysfs_entry(efivars,
-					  utf16_strsize(efi_name,
-							DUMP_NAME_LEN * 2),
-					  efi_name, &vendor);
-
-	*id = part;
-	return ret;
-};
-
-static int efi_pstore_erase(enum pstore_type_id type, u64 id,
-			    struct pstore_info *psi)
-{
-	efi_pstore_write(type, 0, &id, (unsigned int)id, 0, psi);
-
 	return 0;
 }
 #else
@@ -797,7 +821,7 @@ static int efi_pstore_write(enum pstore_type_id type,
 }
 
 static int efi_pstore_erase(enum pstore_type_id type, u64 id,
-			    struct pstore_info *psi)
+			    struct timespec time, struct pstore_info *psi)
 {
 	return 0;
 }
@@ -1237,6 +1261,7 @@ efivars_init(void)
 	ops.get_variable = efi.get_variable;
 	ops.set_variable = efi.set_variable;
 	ops.get_next_variable = efi.get_next_variable;
+	ops.query_variable_info = efi.query_variable_info;
 	error = register_efivars(&__efivars, &ops, efi_kobj);
 	if (error)
 		goto err_put;
diff --git a/fs/pstore/inode.c b/fs/pstore/inode.c
index 4ab572e..4300af6 100644
--- a/fs/pstore/inode.c
+++ b/fs/pstore/inode.c
@@ -175,7 +175,8 @@ static int pstore_unlink(struct inode *dir, struct dentry *dentry)
 	struct pstore_private *p = dentry->d_inode->i_private;
 
 	if (p->psi->erase)
-		p->psi->erase(p->type, p->id, p->psi);
+		p->psi->erase(p->type, p->id, dentry->d_inode->i_ctime,
+			      p->psi);
 
 	return simple_unlink(dir, dentry);
 }
diff --git a/fs/pstore/ram.c b/fs/pstore/ram.c
index 0b311bc..177b281 100644
--- a/fs/pstore/ram.c
+++ b/fs/pstore/ram.c
@@ -237,7 +237,7 @@ static int ramoops_pstore_write_buf(enum pstore_type_id type,
 }
 
 static int ramoops_pstore_erase(enum pstore_type_id type, u64 id,
-				struct pstore_info *psi)
+				timespec time, struct pstore_info *psi)
 {
 	struct ramoops_context *cxt = psi->data;
 	struct persistent_ram_zone *prz;
diff --git a/include/linux/efi.h b/include/linux/efi.h
index 8670eb1..c47ec36 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -643,6 +643,7 @@ struct efivar_operations {
 	efi_get_variable_t *get_variable;
 	efi_get_next_variable_t *get_next_variable;
 	efi_set_variable_t *set_variable;
+	efi_query_variable_info_t *query_variable_info;
 };
 
 struct efivars {
diff --git a/include/linux/pstore.h b/include/linux/pstore.h
index c892587..71f43e0 100644
--- a/include/linux/pstore.h
+++ b/include/linux/pstore.h
@@ -60,7 +60,7 @@ struct pstore_info {
 			unsigned int part, const char *buf, size_t size,
 			struct pstore_info *psi);
 	int		(*erase)(enum pstore_type_id type, u64 id,
-			struct pstore_info *psi);
+			struct timespec time, struct pstore_info *psi);
 	void		*data;
 };
 
-- 1.7.1

[Dle-develop] [RFC][PATCH 0/2] efi_pstore: multiple event logging support

From: Seiji A. <sei...@hd...> - 2012-10-05 19:00:27

[Problem]

    Currently, efi_pstore driver simply overwrites existing panic messages in NVRAM.
    So, in the following scenario, we will lose 1st panic messages.

    1. kernel panics.
    2. efi_pstore is kicked and writes panic messages to NVRAM.
    3. system reboots.
    4. kernel panics again before a user checks the 1st panic messages in NVRAM.

[Solution]
   
   Solutions of this problem has been discussed among Tony, Matthew, Don, Mike and me.

   http://marc.info/?l=linux-kernel&m=134273270704586&w=2

   And there are two possible solutions right now.
     - First one is introducing some policy overwriting existing logs. 
     - Second one is simply holding multiple log without overwriting any entries. 

   We haven't decided the overwriting policy which is reasonable to all users yet. 
   But I believe we agree that just holding multiple logs is a reasonable way.

   We may need further discussions to find the possibility of introducing overwriting 
   policy, especially getting critical messages in multiple oops case.
   But I would like to begin with a simple and reasonable way to everyone.
   So, this patch takes an approach just holding multiple logs.

[Patch Description]

(1/2) efi_pstore: hold multiple logs
  This patch makes efi_pstore hold multiple logs.
  Once a critical message is logged, users can see it via /dev/pstore 
  without being influenced by subsequent events.

(2/2) efi_pstore: add a sequence counter to a variable name
  This patch fix an issue in case where multiple events happens in a short time  

Detailed explanations are written in each patch.

 drivers/acpi/apei/erst.c   |   16 +++---
 drivers/firmware/efivars.c |  106 ++++++++++++++++++++++++++++---------------
 fs/pstore/inode.c          |    7 ++-
 fs/pstore/internal.h       |    2 +-
 fs/pstore/platform.c       |   11 +++--
 fs/pstore/ram.c            |    9 ++--
 include/linux/efi.h        |    1 +
 include/linux/pstore.h     |    6 ++-
 8 files changed, 98 insertions(+), 60 deletions(-)

Re: [Dle-develop] [PATCH v4] trace, x86: add x86 irq vector tracepoints

From: Steven R. <ro...@go...> - 2012-10-05 14:13:55

On Tue, 2012-10-02 at 19:10 +0000, Seiji Aguchi wrote:
> > >
> > > If I misunderstand something, please let me know.
> > >
> > 
> > Quite.
> > 
> > These functions are being invoked from the IDT, which is an indirect pointer structure.  When not being traced, there is absolutely no
> > reason why it should go through a thunk with tracepoints.
> 
> I agree that the cost can be absolutely zero by switching  each interrupt hander when turning on/off the tracepoint.
> 

Peter,

I agree that the IDT version is a zero cost in performance, where as the
tracepoint version is a negligible cost in performance. But my worry is
the complexity (read error prone and possible openings of security
exploits) worth it?

Switching of the IDT is not that trivial, and to make it something for
common activities such as reading tracepoints by tools like ftrace and
perf, that do it often, even on production machines, may lead to issues
if its not done right.

You are the maintainer and are responsible for the outcome of changes to
the x86 arch, thus you do have final say. And if you think there's
nothing to worry about with an IDT change then Seiji should implement
it.

I just want to point out some possible repercussions of doing it in a
more complex way. As tracepoints use nops, and I may be pushing to even
out-of-line the tracepoint unlikely part even more, I'm not sure the
complexity is worth the amount of savings it would be against just
adding the tracepoint in the code.

-- Steve

Re: [Dle-develop] [PATCH v4] trace, x86: add x86 irq vector tracepoints

From: Seiji A. <sei...@hd...> - 2012-10-02 19:11:09

> >
> > If I misunderstand something, please let me know.
> >
> 
> Quite.
> 
> These functions are being invoked from the IDT, which is an indirect pointer structure.  When not being traced, there is absolutely no
> reason why it should go through a thunk with tracepoints.

I agree that the cost can be absolutely zero by switching  each interrupt hander when turning on/off the tracepoint.

But I would like to talk about a time penalty of a tracepoint more.

When not being traced, the tracepoint is just nop.
And it is described in the documentation below.
So, the tracepoint seems to be designed to add to performance critical paths.

Documentation/trace/tracepoints.txt
<snip>
* Purpose of tracepoints

A tracepoint placed in code provides a hook to call a function (probe) that you can provide at runtime. A tracepoint can be "on" (a probe is connected to it) or "off" (no probe is attached). When a tracepoint is "off" it has no effect, except for adding a tiny time penalty (checking a condition for a branch) and space penalty (adding a few bytes for the function call at the end of the instrumented function and adds a data structure in a separate section).  When a tracepoint is "on", the function you provide is called each time the tracepoint is executed, in the execution context of the caller. When the function provided ends its execution, it returns to the caller (continuing from the tracepoint site).

You can put tracepoints at important locations in the code. They are lightweight hooks that can pass an arbitrary number of parameters, which prototypes are described in a tracepoint declaration placed in a header file.
<snip>

Also, as I submitted an actual latency measurement, the time penalty is almost zero.

<snip>
(1-1) local_timer_entry

 - 3.6-rc6 original
   	<...>-2788    2dNh. 894834337us : exit_idle <-smp_apic_timer_interrupt
   	<...>-2788    2dNh. 894834337us : hrtimer_interrupt <-smp_apic_timer_interrupt

 - 3.6-rc6 + this patch + trace off
   	<...>-1981    0d.h. 210538us : exit_idle <-smp_apic_timer_interrupt
   	<...>-1981    0d.h. 210538us : hrtimer_interrupt <-smp_apic_timer_interrupt
<snip>

When switching each interrupt handler, all cpus have to be interrupt-disable with smp_call_funciton() or something like that.

IMO, rather than doing such a complex thing, just adding a tracepoint is reasonable.

What do you think?

Seiji

Re: [Dle-develop] [PATCH] Add option to mlock guest and qemu memory

From: Satoru M. <sat...@hd...> - 2012-10-01 21:24:29

Hi Jan,

Thank you for reviewing.

On 09/28/2012 04:05 AM, Jan Kiszka wrote:
> On 2012-09-28 01:21, Satoru Moriya wrote:
>> We have some plans to migrate old enterprise systems which require
>> low latency (msec order) to kvm virtualized environment. Usually,
>> we uses mlock to preallocate and pin down process memory in order
>> to avoid page allocation in latency critical path. On the other
>> hand, in kvm environment, mlocking in guests is not effective
>> because it can't avoid page reclaim in host. Actually, to avoid
>> guest memory reclaim, qemu has "mem-path" option that is actually
>> for using hugepage. But a memory region of qemu is not allocated
>> on hugepage, so it may be reclaimed. That may cause a latency
>> problem.
>>
>> To avoid guest and qemu memory reclaim, this patch introduces
>> a new "mlock" option. With this option, we can preallocate and
>> pin down guest and qemu memory before booting guest OS.
>
> I guess this reduces the likeliness of multi-millisecond latencies for
> you but not eliminate them. Of course, mlockall is part of our local
> changes for real-time QEMU/KVM, but it is just one of the many pieces
> required. I'm wondering how the situation is on your side.

You're right. I think this is a first step toward solving latency issue
on qemu/kvm.

> I think mlockall should once be enabled automatically as soon as you ask
> for real-time support for QEMU guests. How that should be controlled is
> another question. I'm currently carrying a top-level switch "-rt
> maxprio=x[,policy=y]" here, likely not the final solution. I'm not

Could you please tell me what that option actually do?
Do you have any public repositories or something for me to look at your
real-time qemu/kvm changes?

> really convinced we need to control memory locking separately. And as we
> are very reluctant to add new top-level switches, this is even more
> important.

Regards,
Satoru

Re: [Dle-develop] [Qemu-devel] [PATCH] Add option to mlock guest and qemu memory

From: Anthony L. <an...@co...> - 2012-09-28 15:54:25

Jan Kiszka <jan...@si...> writes:

> On 2012-09-28 14:33, Anthony Liguori wrote:
>> Jan Kiszka <jan...@si...> writes:
>> 
>>> On 2012-09-28 01:21, Satoru Moriya wrote:
>>>> This is a first time for me to post a patch to qemu-devel.
>>>> If there is something missing/wrong, please let me know.
>>>>
>>>> We have some plans to migrate old enterprise systems which require
>>>> low latency (msec order) to kvm virtualized environment. Usually,
>>>> we uses mlock to preallocate and pin down process memory in order
>>>> to avoid page allocation in latency critical path. On the other
>>>> hand, in kvm environment, mlocking in guests is not effective
>>>> because it can't avoid page reclaim in host. Actually, to avoid
>>>> guest memory reclaim, qemu has "mem-path" option that is actually
>>>> for using hugepage. But a memory region of qemu is not allocated
>>>> on hugepage, so it may be reclaimed. That may cause a latency
>>>> problem.
>>>>
>>>> To avoid guest and qemu memory reclaim, this patch introduces
>>>> a new "mlock" option. With this option, we can preallocate and
>>>> pin down guest and qemu memory before booting guest OS.
>>>
>>> I guess this reduces the likeliness of multi-millisecond latencies for
>>> you but not eliminate them. Of course, mlockall is part of our local
>>> changes for real-time QEMU/KVM, but it is just one of the many pieces
>>> required. I'm wondering how the situation is on your side.
>>>
>>> I think mlockall should once be enabled automatically as soon as you ask
>>> for real-time support for QEMU guests. How that should be controlled is
>>> another question. I'm currently carrying a top-level switch "-rt
>>> maxprio=x[,policy=y]" here, likely not the final solution. I'm not
>>> really convinced we need to control memory locking separately. And as we
>>> are very reluctant to add new top-level switches, this is even more
>>> important.
>> 
>> I think you're right here although I'd suggest not abbreviating.
>
> You mean the sense of "-realtime" instead of "-rt"?

Yes.  Or any other word that makes sense.

Regards,

Anthony Liguori

>
> Jan
>
> -- 
> Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
> Corporate Competence Center Embedded Linux

Re: [Dle-develop] [Qemu-devel] [PATCH] Add option to mlock guest and qemu memory

From: Jan K. <jan...@si...> - 2012-09-28 13:14:36

On 2012-09-28 14:33, Anthony Liguori wrote:
> Jan Kiszka <jan...@si...> writes:
> 
>> On 2012-09-28 01:21, Satoru Moriya wrote:
>>> This is a first time for me to post a patch to qemu-devel.
>>> If there is something missing/wrong, please let me know.
>>>
>>> We have some plans to migrate old enterprise systems which require
>>> low latency (msec order) to kvm virtualized environment. Usually,
>>> we uses mlock to preallocate and pin down process memory in order
>>> to avoid page allocation in latency critical path. On the other
>>> hand, in kvm environment, mlocking in guests is not effective
>>> because it can't avoid page reclaim in host. Actually, to avoid
>>> guest memory reclaim, qemu has "mem-path" option that is actually
>>> for using hugepage. But a memory region of qemu is not allocated
>>> on hugepage, so it may be reclaimed. That may cause a latency
>>> problem.
>>>
>>> To avoid guest and qemu memory reclaim, this patch introduces
>>> a new "mlock" option. With this option, we can preallocate and
>>> pin down guest and qemu memory before booting guest OS.
>>
>> I guess this reduces the likeliness of multi-millisecond latencies for
>> you but not eliminate them. Of course, mlockall is part of our local
>> changes for real-time QEMU/KVM, but it is just one of the many pieces
>> required. I'm wondering how the situation is on your side.
>>
>> I think mlockall should once be enabled automatically as soon as you ask
>> for real-time support for QEMU guests. How that should be controlled is
>> another question. I'm currently carrying a top-level switch "-rt
>> maxprio=x[,policy=y]" here, likely not the final solution. I'm not
>> really convinced we need to control memory locking separately. And as we
>> are very reluctant to add new top-level switches, this is even more
>> important.
> 
> I think you're right here although I'd suggest not abbreviating.

You mean the sense of "-realtime" instead of "-rt"?

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux

Re: [Dle-develop] [Qemu-devel] [PATCH] Add option to mlock guest and qemu memory

From: Anthony L. <an...@co...> - 2012-09-28 13:04:52

Jan Kiszka <jan...@si...> writes:

> On 2012-09-28 01:21, Satoru Moriya wrote:
>> This is a first time for me to post a patch to qemu-devel.
>> If there is something missing/wrong, please let me know.
>> 
>> We have some plans to migrate old enterprise systems which require
>> low latency (msec order) to kvm virtualized environment. Usually,
>> we uses mlock to preallocate and pin down process memory in order
>> to avoid page allocation in latency critical path. On the other
>> hand, in kvm environment, mlocking in guests is not effective
>> because it can't avoid page reclaim in host. Actually, to avoid
>> guest memory reclaim, qemu has "mem-path" option that is actually
>> for using hugepage. But a memory region of qemu is not allocated
>> on hugepage, so it may be reclaimed. That may cause a latency
>> problem.
>> 
>> To avoid guest and qemu memory reclaim, this patch introduces
>> a new "mlock" option. With this option, we can preallocate and
>> pin down guest and qemu memory before booting guest OS.
>
> I guess this reduces the likeliness of multi-millisecond latencies for
> you but not eliminate them. Of course, mlockall is part of our local
> changes for real-time QEMU/KVM, but it is just one of the many pieces
> required. I'm wondering how the situation is on your side.
>
> I think mlockall should once be enabled automatically as soon as you ask
> for real-time support for QEMU guests. How that should be controlled is
> another question. I'm currently carrying a top-level switch "-rt
> maxprio=x[,policy=y]" here, likely not the final solution. I'm not
> really convinced we need to control memory locking separately. And as we
> are very reluctant to add new top-level switches, this is even more
> important.

I think you're right here although I'd suggest not abbreviating.

Regards,

Anthony Liguori

>
>> 
>> Tested on Linux, x86_64 (fedora 17).
>> 
>> Signed-off-by: Satoru Moriya <sat...@hd...>
>> ---
>>  cpu-all.h       | 1 +
>>  exec.c          | 3 +++
>>  qemu-options.hx | 8 ++++++++
>>  vl.c            | 4 ++++
>>  4 files changed, 16 insertions(+)
>> 
>> diff --git a/cpu-all.h b/cpu-all.h
>> index 74d3681..e12e5d5 100644
>> --- a/cpu-all.h
>> +++ b/cpu-all.h
>> @@ -503,6 +503,7 @@ extern RAMList ram_list;
>>  
>>  extern const char *mem_path;
>>  extern int mem_prealloc;
>> +extern int mem_lock;
>>  
>>  /* Flags stored in the low bits of the TLB virtual address.  These are
>>     defined so that fast path ram access is all zeros.  */
>> diff --git a/exec.c b/exec.c
>> index bb6aa4a..de13bc9 100644
>> --- a/exec.c
>> +++ b/exec.c
>> @@ -2572,6 +2572,9 @@ ram_addr_t qemu_ram_alloc_from_ptr(ram_addr_t size, void *host,
>>              }
>>              memory_try_enable_merging(new_block->host, size);
>>          }
>> +        if (mem_lock && mlockall(MCL_CURRENT | MCL_FUTURE)) {
>> +            perror("mlockall");
>> +        }
>
> This belongs to the OS abstraction layer (it's POSIX). And you only need
> to call it once per process lifetime.
>
>>      }
>>      new_block->length = size;
>>  
>> diff --git a/qemu-options.hx b/qemu-options.hx
>> index 7d97f96..9d82f15 100644
>> --- a/qemu-options.hx
>> +++ b/qemu-options.hx
>> @@ -427,6 +427,14 @@ Preallocate memory when using -mem-path.
>>  ETEXI
>>  #endif
>>  
>> +DEF("mlock", 0, QEMU_OPTION_mlock,
>> +    "-mlock          mlock guest and qemu memory\n",
>> +    QEMU_ARCH_ALL)
>> +STEXI
>> +@item -mlock
>> +mlock guest and qemu memory.
>> +ETEXI
>> +
>>  DEF("k", HAS_ARG, QEMU_OPTION_k,
>>      "-k language     use keyboard layout (for example 'fr' for French)\n",
>>      QEMU_ARCH_ALL)
>> diff --git a/vl.c b/vl.c
>> index 8d305ca..c902084 100644
>> --- a/vl.c
>> +++ b/vl.c
>> @@ -187,6 +187,7 @@ const char *mem_path = NULL;
>>  #ifdef MAP_POPULATE
>>  int mem_prealloc = 0; /* force preallocation of physical target memory */
>>  #endif
>> +int mem_lock;
>>  int nb_nics;
>>  NICInfo nd_table[MAX_NICS];
>>  int autostart;
>> @@ -2770,6 +2771,9 @@ int main(int argc, char **argv, char **envp)
>>                  mem_prealloc = 1;
>>                  break;
>>  #endif
>> +            case QEMU_OPTION_mlock:
>> +                mem_lock = 1;
>> +                break;
>>              case QEMU_OPTION_d:
>>                  log_mask = optarg;
>>                  break;
>> 
>
> Jan
>
> -- 
> Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
> Corporate Competence Center Embedded Linux

Re: [Dle-develop] [PATCH] Add option to mlock guest and qemu memory

From: Jan K. <jan...@si...> - 2012-09-28 08:05:26

On 2012-09-28 01:21, Satoru Moriya wrote:
> This is a first time for me to post a patch to qemu-devel.
> If there is something missing/wrong, please let me know.
> 
> We have some plans to migrate old enterprise systems which require
> low latency (msec order) to kvm virtualized environment. Usually,
> we uses mlock to preallocate and pin down process memory in order
> to avoid page allocation in latency critical path. On the other
> hand, in kvm environment, mlocking in guests is not effective
> because it can't avoid page reclaim in host. Actually, to avoid
> guest memory reclaim, qemu has "mem-path" option that is actually
> for using hugepage. But a memory region of qemu is not allocated
> on hugepage, so it may be reclaimed. That may cause a latency
> problem.
> 
> To avoid guest and qemu memory reclaim, this patch introduces
> a new "mlock" option. With this option, we can preallocate and
> pin down guest and qemu memory before booting guest OS.

I guess this reduces the likeliness of multi-millisecond latencies for
you but not eliminate them. Of course, mlockall is part of our local
changes for real-time QEMU/KVM, but it is just one of the many pieces
required. I'm wondering how the situation is on your side.

I think mlockall should once be enabled automatically as soon as you ask
for real-time support for QEMU guests. How that should be controlled is
another question. I'm currently carrying a top-level switch "-rt
maxprio=x[,policy=y]" here, likely not the final solution. I'm not
really convinced we need to control memory locking separately. And as we
are very reluctant to add new top-level switches, this is even more
important.

> 
> Tested on Linux, x86_64 (fedora 17).
> 
> Signed-off-by: Satoru Moriya <sat...@hd...>
> ---
>  cpu-all.h       | 1 +
>  exec.c          | 3 +++
>  qemu-options.hx | 8 ++++++++
>  vl.c            | 4 ++++
>  4 files changed, 16 insertions(+)
> 
> diff --git a/cpu-all.h b/cpu-all.h
> index 74d3681..e12e5d5 100644
> --- a/cpu-all.h
> +++ b/cpu-all.h
> @@ -503,6 +503,7 @@ extern RAMList ram_list;
>  
>  extern const char *mem_path;
>  extern int mem_prealloc;
> +extern int mem_lock;
>  
>  /* Flags stored in the low bits of the TLB virtual address.  These are
>     defined so that fast path ram access is all zeros.  */
> diff --git a/exec.c b/exec.c
> index bb6aa4a..de13bc9 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -2572,6 +2572,9 @@ ram_addr_t qemu_ram_alloc_from_ptr(ram_addr_t size, void *host,
>              }
>              memory_try_enable_merging(new_block->host, size);
>          }
> +        if (mem_lock && mlockall(MCL_CURRENT | MCL_FUTURE)) {
> +            perror("mlockall");
> +        }

This belongs to the OS abstraction layer (it's POSIX). And you only need
to call it once per process lifetime.

>      }
>      new_block->length = size;
>  
> diff --git a/qemu-options.hx b/qemu-options.hx
> index 7d97f96..9d82f15 100644
> --- a/qemu-options.hx
> +++ b/qemu-options.hx
> @@ -427,6 +427,14 @@ Preallocate memory when using -mem-path.
>  ETEXI
>  #endif
>  
> +DEF("mlock", 0, QEMU_OPTION_mlock,
> +    "-mlock          mlock guest and qemu memory\n",
> +    QEMU_ARCH_ALL)
> +STEXI
> +@item -mlock
> +mlock guest and qemu memory.
> +ETEXI
> +
>  DEF("k", HAS_ARG, QEMU_OPTION_k,
>      "-k language     use keyboard layout (for example 'fr' for French)\n",
>      QEMU_ARCH_ALL)
> diff --git a/vl.c b/vl.c
> index 8d305ca..c902084 100644
> --- a/vl.c
> +++ b/vl.c
> @@ -187,6 +187,7 @@ const char *mem_path = NULL;
>  #ifdef MAP_POPULATE
>  int mem_prealloc = 0; /* force preallocation of physical target memory */
>  #endif
> +int mem_lock;
>  int nb_nics;
>  NICInfo nd_table[MAX_NICS];
>  int autostart;
> @@ -2770,6 +2771,9 @@ int main(int argc, char **argv, char **envp)
>                  mem_prealloc = 1;
>                  break;
>  #endif
> +            case QEMU_OPTION_mlock:
> +                mem_lock = 1;
> +                break;
>              case QEMU_OPTION_d:
>                  log_mask = optarg;
>                  break;
> 

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux

454 messages has been excluded from this view by a project administrator.

Flat | Threaded

<< < 1 .. 29 30 31 32 33 .. 99 > >> (Page 31 of 99)