Menu

#2739 gatherer reposd SIGFPE during calculation of interval metric

Function
closed-fixed
reposd (1)
gather
5
2015-01-20
2014-10-06
Dave Heller
No

A bug was found where the gatherer repository daemon "reposd" was crashing with SIGFPE on the calculation of certain interval metrics. The root cause of the crash was determined to be a duplicate data value in the repository, i.e. two data points with the same timestamp, resulting in a zero value for the time interval and hence, the floating point exception during the calculation of the rate metric. This resulted in a core with a backtrace like the following:

Program terminated with signal 8, Arithmetic exception.
#0  0x00007fe9fe1703de in metricCalcPageInRate (mv=0x16b7fd0, mnum=<value optimized out>, v=0x16df5f0, vlen=<value optimized out>) at plugin/repositoryUnixProcess.c:611
(gdb) bt
#0  0x00007fe9fe1703de in metricCalcPageInRate (mv=0x16b7fd0, mnum=<value optimized out>, v=0x16df5f0, vlen=<value optimized out>) at plugin/repositoryUnixProcess.c:611
#1  0x0000003cf0602ee9 in reposvalue_get (vs=0x7fe9fcb3c018, ch=0x168b5a0) at repos.c:385
#2  0x00000000004020af in main (argc=<value optimized out>, argv=<value optimized out>) at reposd.c:474
(gdb) l
606     #ifdef DEBUG
607       fprintf(stderr,"--- %s(%i) : Calculate PageInRate\n",
608               __FILE__,__LINE__);
609     #endif
610       if ( mv && (vlen>=sizeof(unsigned long long)) && (mnum>=2) ) {
611         total = (*(unsigned long long*)mv[0].mvData - *(unsigned long long*)mv[mnum-1].mvData) /
612                 (mv[0].mvTimeStamp - mv[mnum-1].mvTimeStamp);
613         memcpy(v, &total, sizeof(unsigned long long));
614         return sizeof(unsigned long long);
615       }

It was confirmed there was indeed a duplicate value in the repository. The root cause of this has still not been determined. In the environment where it was seen, distributed gatherer was in use (i.e. many gatherd nodes reporting to a single reposd node), so this likely has something to do with it.

To mitigate the crash, and to facilitate debugging, the following patch was developed. This detects the duplicate data point and discards the resulting interval calculation, and logs an error to the syslog, reporting the timestamp and the metric id on which the dup was encountered. When this happens you will see like:

May  1 04:50:02 mybox reposd[2455]: Hit repeated value on calculation of PageOutRate (mid=198)
May  1 04:50:02 mybox reposd[2455]: 0: remotebox.blah.com res=7726.processname value=0 at 1398934202
May  1 04:50:02 mybox reposd[2455]: 1: remotebox.blah.com res=7726.processname value=0 at 1398934202
May  1 08:10:00 mybox reposd[2455]: Hit repeated value on calculation of KernelModeTime (mid=180)
May  1 08:10:00 mybox reposd[2455]: 0: remotebox.blah.com res=15280.processname value=892745776 at 1398946200
May  1 08:10:00 mybox reposd[2455]: 1: remotebox.blah.com res=15280.processname value=892745776 at 1398946200

Until the root cause of the duplicate data point is found, the mitigation patch will prevent reposd from crashing when this condition occurs.

The patch also fixes a bug where .Linux_UnixProcessMetricValue PageInRate & PageInRate were being calculated incorrectly.

Discussion

  • Dave Heller

    Dave Heller - 2014-10-06

    Commit [adb18f] for gather 2.2

     

    Related

    Commit: [adb18f]

  • Dave Heller

    Dave Heller - 2014-10-10
    • status: open --> closed-fixed
     

Log in to post a comment.