nmon for Linux Questions, Bugs & Ideas

Help
2009-07-30
2013-05-28
1 2 > >> (Page 1 of 2)
  • This is the place for questions, bug reports, fixes and ideas for new features related to nmon for Linux.

     
  • poet_imp
    poet_imp
    2009-09-24

    12d is not working with the latest kernel. I tried the precompiled binaries and tried compiling it myself. The result was the same:

    me@my-laptop:~/Downloads/nmon$ uname -a
    Linux paul-laptop 2.6.31-10-generic-pae #34-Ubuntu SMP Wed Sep 16 01:20:33 UTC 2009 i686 GNU/Linux
    me@my-laptop:~/Downloads/nmon$ nmon
    nmon: malloc.c:3074: sYSMALLOc: Assertion `(old_top == (((mbinptr) (((char *) &((av)->bins)) - __builtin_offsetof (struct malloc_chunk, fd)))) && old_size == 0) || ((unsigned long) (old_size) >= (unsigned long)((((__builtin_offsetof (struct malloc_chunk, fd_nextsize))+((2 * (sizeof(size_t))) - 1)) & ~((2 * (sizeof(size_t))) - 1))) && ((old_top)->size & 0x1) && ((unsigned long)old_end & pagemask) == 0)' failed.
    Aborted (core dumped)

     
  • Gerrie Fisk
    Gerrie Fisk
    2009-10-02

    Is nmon Xen-aware?  If it isn't, which statistic is the CPU % reporting, and is it meaningful?  My goal is to calculate CPU utilization for the virtual CPUs in the VM.  Thanks.

     
  • Martha Centeno
    Martha Centeno
    2009-10-19

    I am trying to run nmon to collect performance data on a 96-core IBM System x3950 M2.  However, the CPU utilization report just shows columns full of "nan".  I've downloaded and compiled the 12d version and it still shows the same results.  Is there something else I need to change in the source code in order to capture the CPU stats?

    Thank you,
    Martha Centeno
    mcenteno@us.ibm.com

     
  • Marck
    Marck
    2009-11-15

    Hello,

    I compiled the latest source on openSUSE 11.1, where I can run nmon
    without problems. On openSUSE 11.2 I get the following error message:

    ./nmon_x86_opensuse
    nmon_x86_opensuse: malloc.c:3096: sYSMALLOc: Assertion `(old_top ==
    (((mbinptr) (((char *) &((av)->bins)) - __builtin_offsetof
    (struct malloc_chunk, fd)))) && old_size == 0) || ((unsigned long)
    (old_size) >= (unsigned long)((((__builtin_offsetof (struct malloc_chunk,
    fd_nextsize))+((2 * (sizeof(size_t))) - 1)) & ~((2 * (sizeof(size_t))) -
    1))) && ((old_top)->size & 0x1) && ((unsigned long)old_end & pagemask) ==
    0)' failed.
    Aborted

    Regards,

    Marck

     
  • Found the bug and it's fixed in lmon12e.c
    Tested on Ubuntu 9.10, which showed exactly the same problem.
    Can you try lmon12e.c on OpenSUSE?
    I am confident it should work - also note the new colour graphs.

    Thanks Nigel

     
  • Oh forgot to say that  lmon12e.c should also fix the 96 CPU issue. Please test and report back if OK.

    The problem is the size of /proc/cpuinfo on Intel machines = ~420 bytes per CPU - and 99% of it is just repeating uninteresting stuff, so it's 40KB of junk and 1K of facts.

    Does anyone make an Intel machine with 16 or more CPUs where the CPUs can be different?

    thanks Nigel
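
    A note on the numbers above: 96 CPUs at roughly 420 bytes each is about 40 KB of /proc/cpuinfo text, so any fixed-size buffer smaller than that would be overrun or truncated. The thread does not show how lmon12e.c actually handles this. The following is only a minimal sketch of one common approach, reading a stream into a buffer that grows on demand; read_all is a hypothetical helper, not a function from nmon.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not from nmon): read an entire stream into a
 * heap buffer that doubles in size whenever it fills up.
 * Returns a NUL-terminated buffer that the caller must free, or NULL
 * on error. If len_out is non-NULL it receives the byte count. */
char *read_all(FILE *fp, size_t *len_out)
{
    size_t cap = 4096, len = 0;
    char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    for (;;) {
        /* Always keep one byte spare for the terminating NUL. */
        size_t n = fread(buf + len, 1, cap - len - 1, fp);
        len += n;
        if (n == 0)
            break;                      /* end of file or read error */
        if (len == cap - 1) {           /* buffer full: double it */
            char *bigger = realloc(buf, cap * 2);
            if (bigger == NULL) {
                free(buf);
                return NULL;
            }
            buf = bigger;
            cap *= 2;
        }
    }
    if (ferror(fp)) {
        free(buf);
        return NULL;
    }
    buf[len] = '\0';
    if (len_out != NULL)
        *len_out = len;
    return buf;
}
```

    A caller could open /proc/cpuinfo with fopen(), pass the stream to read_all(), and parse the returned text, whatever the CPU count.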

     
  • Marck
    Marck
    2009-11-16

    Nigel,

    I compiled the source after installing ncurses-devel, and it works on openSUSE 11.1. I will test it on openSUSE 11.2. What is the warning below?

    make nmon_x86_opensuse10 lmon.c
    cc -o nmon_x86_opensuse10 lmon.c -g -O2 -D JFS -D GETUSER -Wall -D LARGEMEM -lncurses -g
    lmon.c: In function ‘proc_disk’:
    lmon.c:1521: warning: call to function ‘proc_partitions’ without a real prototype
    lmon.c:1429: note: ‘proc_partitions’ was declared here
    make: Nothing to be done for `lmon.c'.

    Thanks,

    Marck
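
    On the warning quoted above: nobody in the thread answers it, so treat this as a best guess. A "without a real prototype" warning usually means the function was declared with empty parentheses, which in C before C23 leaves the parameters unspecified, so the compiler cannot check or convert the arguments at the call site. The actual declaration of proc_partitions is not shown here. The sketch below uses a hypothetical function, rate_per_second, to show the difference between the two declaration styles.

```c
/* Old-style declaration, shown only in this comment:
 *
 *     double rate_per_second();
 *
 * Empty parentheses say nothing about the parameters, so a call such
 * as rate_per_second(10, 2) would pass two ints unchecked. */

/* Prototype form: the parameter types are part of the declaration, so
 * the compiler checks each call and converts the arguments. */
double rate_per_second(long count, double elapsed);

/* Hypothetical example function (not from nmon): events per second,
 * returning 0 when no time has elapsed. */
double rate_per_second(long count, double elapsed)
{
    return elapsed > 0.0 ? (double)count / elapsed : 0.0;
}
```

    With the prototype in scope, rate_per_second(10, 2) converts the int 2 to a double before the call. For a function that takes no arguments, the prototype form is name(void).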

     
  • Marck
    Marck
    2009-11-16

    Nigel,

    Thanks!!!

    It works on OpenSuse 11.2!!

    Marck

     
  • poet_imp
    poet_imp
    2009-11-17

    I just downloaded and compiled 12f and it works on kubuntu 9.10. Thank you!

     
  • Davey r.
    Davey r.
    2010-01-06

    Hi

    I am running nmon on a virtual server and I want nmon to collect data. But the network setup is done by the physical machine. Is it possible to disable collecting network data?
    How can I configure the data-collection mode to collect only CPU and RAM information?

    Regards,

    Davey

     
  • David Baril
    David Baril
    2010-01-12

    Hi Nigel,

    I believe I found a bug in version 12a, with regard to disk groups.
    On line 4525, just after the
    #endif /* JFS */

    if (show_disk || show_verbose || show_diskmap) {
                            proc_read(P_STAT);
                            proc_disk(elapsed);

    This section displays the various disk information screens.  Note that the statistics are read and disk stats processed only if one of the disk modes or verbose modes is selected.  If only disk groups are selected, the disk statistics are not guaranteed to be read and processed.  The "show_dgroup" boolean probably needs to be added to this if statement to ensure that the disk statistics are read and processed … even if the normal, non-group disk screens are NOT being displayed.

    Further down on line 4719:
    if ((show_dgroup || (!cursed && dgroup_loaded))) {

    This continues into the code to display the disk group.  In this if clause, there is no code to read or process the disk statistics.

    The end result is that the disk group screen will stay unchanged with zero activity, unless some other command caused disk statistics to be collected.  On large systems, especially with multipath running, there can be hundreds of paths … that need to be mapped into a smaller number of disk groups … and the un-grouped disk information will often not need to be displayed at all.

    Thank you for a great tool.

    Dave B.

     
  • David Baril
    David Baril
    2010-01-12

    Nigel,

    I believe there is a bug in the code to display disk groups ONLY when you are running in "." mode (showing only active disks).  When running in this mode, the upper portion of the screen is updated properly for the current screen update, but the previous screen contents in the lower portion of the screen are not properly erased, and the next section is not pulled up to display at the next line after the "Groups=xxx TOTALS" line.
    I could not find the section in the code, but it appears that the logical end of section on the screen is not being changed when the number of disk group lines to be displayed varies due to activity/lack of activity.  The number of rows displayed for the disk groups always appears to be the same, with the lower portion being stale.

    This is for nmon 12a 64-bit on RHEL 5.4 64-bit.

    Regards,

    Dave B

     
  • ixbrian
    ixbrian
    2010-01-19

    Dave,
    I looked in the disk group section of the code.  It looks like it is half coded to be a fixed size and half coded to resize dynamically if the "show only active disks ('.')" mode is enabled.

    I checked on the topas_nmon version on AIX, and on this version the "show only active disks" mode causes no change with the disk group display. 

    If you want to have behaviour similar to the AIX version, you can comment out lines 4755 and 4756:

        //if (!show_all && (disk_read < 1.0 && disk_write < 1.0))
        //      continue;

    And also change line 4783
    From:
        DISPLAY(paddg, 2 + dgroup_total_groups);
    To:
        DISPLAY(paddg, 3 + dgroup_total_groups);

    Brian

     
  • David Baril
    David Baril
    2010-01-19

    Hi Brian,

    Thank you for the quick reply.  I will try to make the changes to my local copy of nmon and post the results.  Would you agree that I would also need to make the changes that I noted in posting #12 with regard to line 4522:

    Original:
            if (show_disk || show_verbose || show_diskmap) {
                            proc_read(P_STAT);
                            proc_disk(elapsed);
             }
    Proposed:
            if (show_disk || show_verbose || show_diskmap  || show_dgroup) {
                            proc_read(P_STAT);
                            proc_disk(elapsed);
             }

    This will ensure that the disk statistics are gathered and processed if disk groups are being displayed without any of the other disk-related screens.  Would there be any unexpected side effects of adding "show_dgroup" to this if clause?

    Dave B

     
  • ixbrian
    ixbrian
    2010-01-19

    Dave,
    I was able to reproduce the same bug that you had in post #12.  I agree with your solution to the problem, and I just tried it out and this fixes the bug. 

    Brian

     
  • David Baril
    David Baril
    2010-01-22

    Hi Brian,
    I was able to enter the changes in post #12/15 and #14, compiled and started testing.  This is the first time I have really looked at the disk group numbers under Linux, since before they weren't collected properly.

    Well, with these fixes in, it appears that the BlockSizeKB field for the disk group is in error by about a factor of 2, and the read/write KB/sec is also off by ~2-fold, and so are the total KB/sec.

    I looked at the source, and I have not yet been able to identify where the error is occurring. 

    You can most easily detect the issue by displaying the active disks only (the "." command) and the disk groups.  The BlockSizeKB for the group will be much different than the underlying disk paths.

    My current case is complex.  I have a GPFS file system striped across 10 data LUNs and 10 Metadata LUNs, and each of these LUNs have 8 paths.  There is also a GPFS scratch file system striped across 10 small LUNs.  In total there are 30 LUNs, with 8 paths each = 240 paths … plus a few disk entries for the boot disk.

    GPFS is configured to use a 4MB block size, and I have run tests with large files that validate that the large IO is occurring.  Using nmon, iostat, or sar, you can see that the average IO size approaches 4MB for the data LUNs.  Using nmon disk groups to aggregate the statistics from the multiple paths yields incorrect results.

    I also wrote a script to take the output of the iostat command and map the individual paths into disk groups via a mapping table, somewhat similar to nmon.  The iostat script output correlates with the individual statistics, including the totals and IO sizes, and also correlates with nmon individual disk path statistics. 

    Only the disk group statistics under nmon are different.  If I missed disk entries in the nmon disk group file, my overall totals would be low, but the average blocksize would be about the same.  Nmon shows the number of disks in each group, and they all show up correctly having the proper number of underlying disks.

    Next week, I'll look at the issue again.  I'll try running some controlled tests that may help identify where the math is going wrong.  I'll revisit the source.  My first thought was that there might be a problem with 512-byte sectors vs. 1 KB units, but the code looked consistent.  The sector count is divided by 2 to yield KB very early, just after the stats are read, and then KB is used.

    My thought now has to do with disk paths without current activity still being included in the divisor to compute the average block size.  That might explain the blocksize being incorrect, but it does not answer why the kb/sec, MB/sec and xfers/sec are low.
    I've also compiled nmon as both a 32-bit and 64-bit program, with the same behavior.  I am running RHEL 64-bit 5.4
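
    To illustrate the divisor hypothesis above: if idle paths are counted when averaging per-path block sizes, the group block size comes out too small, whereas dividing total KB by total transfers gives the true figure. This is only a sketch of the arithmetic, not nmon's code. The struct and function names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical per-path figures for one sampling interval. */
struct path_stats {
    double kb;     /* KB transferred on this path */
    double xfers;  /* number of transfers on this path */
};

/* Group block size computed from the group totals: idle paths add
 * nothing to either sum, so they do not distort the result. */
double group_block_kb(const struct path_stats *p, size_t n)
{
    double kb = 0.0, xfers = 0.0;
    for (size_t i = 0; i < n; i++) {
        kb += p[i].kb;
        xfers += p[i].xfers;
    }
    return xfers > 0.0 ? kb / xfers : 0.0;
}

/* The suspected mistake: average the per-path block sizes over ALL
 * paths, so every idle path drags the result toward zero. */
double mean_of_path_block_kb(const struct path_stats *p, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        if (p[i].xfers > 0.0)
            sum += p[i].kb / p[i].xfers;
    return n > 0 ? sum / (double)n : 0.0;
}
```

    With 8 paths of which 4 are idle and 4 each move 100 transfers of 4096 KB, the first function returns 4096 and the second returns 2048, a factor-of-2 error. As the post notes, this would not by itself explain low KB/sec or transfers/sec figures.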

     
  • David Baril
    David Baril
    2010-01-22

    Hi,

    I found a latent problem.  For the 12f source, on line 2024:
    fgets(line, 2048, gp) != NULL && dgroup_total_groups < DGROUPS;
    There is the read of the disk group line from the disk group file.  The maximum allowable length is 2048 bytes.  Nmon allows 512 members in a disk group, and the disk names could be in the form sd?? plus a space character.  This is 5 characters per entry * 512 entries = 2560 characters plus the disk group name.  With the 2048 limit, the practical limit is around 408 assuming there is a 10 character disk group name.  This is not causing a problem for me yet, but it will in the future. 

    For nmon's current 512 disk group member limit, I would suggest that the allowable line size needs to be increased to about 3000 characters.

    Dave B
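
    The sizing arithmetic above can be checked with a short sketch. It assumes, as the post does, that each member takes 5 characters (a 4-character name such as sd?? plus a space) and that the line starts with the group name. The function names are hypothetical and are not part of nmon.

```c
#include <stddef.h>

/* Bytes needed to hold one disk group line in memory: the group name,
 * every member entry, the trailing newline and the terminating NUL. */
size_t dgroup_line_bytes(size_t name_len, size_t members, size_t member_len)
{
    return name_len + members * member_len + 2;
}

/* Number of member entries that fit when the line is read by fgets()
 * into a buffer of bufsize bytes. fgets() stores at most bufsize - 1
 * characters, and one of those is the newline. */
size_t dgroup_members_that_fit(size_t bufsize, size_t name_len, size_t member_len)
{
    if (member_len == 0 || bufsize < name_len + 2)
        return 0;
    return (bufsize - 2 - name_len) / member_len;
}
```

    For a 10-character group name, a 2048-byte buffer holds 407 members under these assumptions, in line with the estimate of around 408, and a full 512-member line needs 2572 bytes, so the suggested 3000 leaves some headroom.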

     
  • ixbrian
    ixbrian
    2010-01-31

    David,
    In reply to post #18:  I did some testing and verified that this is a bug.  I recently set up a new project (http://elmon.sourceforge.net/) based on the nmon code and have included the bug fixes that you have posted here on the nmon forum, including your fix for this issue.

    In reply to post #17:  I see the same behaviour.  I will look into this.

     
  • David Baril
    David Baril
    2010-02-01

    Hi Brian,
    Thanks for the feedback.  elmon looks interesting, and I'll need to spend some time to check it out.

    Back to nmon.  I found another incorrect behavior with regard to calculating disk busy.

    At lines 1318-1322, in the function "proc_disk_io", there is code to calculate a "fudged_busy", with the assumption that a disk can do 200 IO/sec.  This is inaccurate and not necessary.  For disk stats lines (not for partitions), nmon is already collecting dk_rmsec, dk_wmsec, and dk_time.  The dk_time field was already converted to a percentage at line 1411 in function proc_diskstats.

    I suspect that this "fudged_busy" code at 1318-1322 may be trying to generate a pseudo busy for partitions.  There is a challenge with disk partitions, since Linux does not provide a full set of disk stats for the partition entries.  Specifically, the dk_rmsec, dk_wmsec, dk_time and other fields are not available.  So for these cases, just set the values to zero (which the initialization code for each diskstats line read does anyway).  For partitions, the nmon user will need to look at the parent disk to get disk busy information, unless nmon wants to apportion the parent's disk busy onto the child partitions, which won't be 100% accurate.  If nmon wants to "fudge" the busy of child partitions … ok … but don't throw away the available busy information for full disks.

    I would recommend commenting out lines 1318-1322 to restore accurate busy values for full disks, and set the busy metrics for partitions to zero, based on no available busy statistics for partitions.

    Dave B.
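
    For reference, the busy figure that can be derived from the kernel's own counter looks roughly like this. The Linux kernel documents the tenth statistics field of a whole-disk line in /proc/diskstats as the number of milliseconds spent doing I/O. The sketch below turns two samples of that counter into a percentage. It is not nmon's code, and disk_busy_pct and its parameter names are hypothetical.

```c
/* Percentage of the sampling interval during which the disk was busy,
 * from two readings of the "milliseconds spent doing I/O" counter.
 * Unsigned subtraction keeps the delta correct if the counter wraps. */
double disk_busy_pct(unsigned long io_ms_prev, unsigned long io_ms_now,
                     double elapsed_sec)
{
    double busy;

    if (elapsed_sec <= 0.0)
        return 0.0;                     /* no interval, nothing to report */

    busy = (double)(io_ms_now - io_ms_prev) / (elapsed_sec * 1000.0) * 100.0;
    if (busy > 100.0)
        busy = 100.0;                   /* clamp sampling jitter */
    return busy;
}
```

    For example, a counter that advances by 500 ms over a 1-second interval gives 50% busy, with no assumption about how many I/Os per second the disk can sustain.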

     
  • David Baril
    David Baril
    2010-02-05

    I would like to submit this source code fix for posting #18 … incorrect line size for disk group input.

    Change blank line 1992:

    New:
    #define MAX_LINE_SIZE   3500

    Change line 2003
    Orig:
    char line[2048];

    New:
    char line[MAX_LINE_SIZE];

    Change line 2024:
    Orig:
    fgets(line, 2048, gp) != NULL && dgroup_total_groups < DGROUPS;

    New:
    fgets(line, MAX_LINE_SIZE, gp) != NULL && dgroup_total_groups < DGROUPS;

    Dave B

     
  • David Baril
    David Baril
    2010-02-05

    I would like to submit this source code fix for posting #20 … incorrect use of "fudged_busy".  Comment out the following lines:

    Line 1284
    Orig:
    int fudged_busy;

    New:
    //int fudged_busy;

    Line 1319-1322
    Orig:
    fudged_busy = (p->dk[i].dk_reads + p->dk[i].dk_writes)/2;
    if(fudged_busy > 100*elapsed)
    p->dk[i].dk_time += 100*elapsed;
    p->dk[i].dk_time = fudged_busy;

    New:
    //fudged_busy = (p->dk[i].dk_reads + p->dk[i].dk_writes)/2;
    //if(fudged_busy > 100*elapsed)
    // p->dk[i].dk_time += 100*elapsed;
    //p->dk[i].dk_time = fudged_busy;

    Dave B

     
  • On replies 20 and 22
    The fudge code is to work around missing data in the /proc/stat file.
    This is used in some of the very oldest Linux versions as better data is not available.
    In the Linux versions you are using you are extracting the disk data from /proc/diskstats
    So you are patching the wrong code.

    And it is best that it stays or guys with these older Linux versions in production could get annoyed!
    thanks Nigel

     
  • When I stated "This is the place for questions, bug reports, fixes and ideas for new features related to nmon for Linux."

    I meant the forum - not this one particular Topic - let's break out different problems into different Topics
    Thanks Nigel

     
  • ThoSil
    ThoSil
    2010-04-23

    I found a bug: a segfault while getting NFS stats on RH 5 with an NFS server:
    diff lmon13g.c lmon13g.c.orig
    1267c1267
    < if(proc.line_ == ' ') {
    --
    > if(proc.line == ' ') {

    for the rest, thank you Nigel :-)

    ThoSil
     