Corrupt mumps.dat?

2003-02-17
2012-12-29
  • Ben Mehling

    Ben Mehling - 2003-02-17

    This week we encountered our first mumps.dat corruption.  I thought we'd share our findings (from a mupip integ) and see if anyone else has run into this problem.

    We've looked through the newest programmers guide for any references to 'fixing' or 'maintaining' globals in the mumps.dat file.  Couldn't find anything... 

    Any recommendations for either fixing this OR avoiding it in the future?  Nothing significant happened to this file or the host it is on... no notable crashes were logged.

    Thanks! 

    - Ben

    Here's the output from mupip integ (sorry, the formatting will no doubt be tweaked):

    12:14pm root@dev ~vista/dev/vista> mupip integ /home/vista/dev/vista/g/mumps.dat

    Block:Offset Level
    %GTM-I-DBTNTOOLG,
           1:0      1  Block transaction number too large
                       Directory Path:  1:0
    Keys from ^ to the end are suspect.
    %GTM-I-DBTN, Block TN is 0x00000000
    %GTM-I-DBTNTOOLG,
           2:0      0  Block transaction number too large
                       Directory Path:  1:8, 2:0
    Keys from ^ to the end are suspect.
    %GTM-I-DBTN, Block TN is 0x187061C6
    %GTM-I-DBTNTOOLG,
         602:0      1  Block transaction number too large
                       Directory Path:  1:8, 2:8
                       Path:  602:0
    Keys from ^%Z to the end are suspect.
    %GTM-I-DBTN, Block TN is 0x0000008A
    %GTM-I-DBTNTOOLG,
         603:0      0  Block transaction number too large
                       Directory Path:  1:8, 2:8
                       Path:  602:8, 603:0
    Keys from ^%Z to ^%Z("BREAK") are suspect.
    %GTM-I-DBTN, Block TN is 0x00000038
    %GTM-I-DBTNTOOLG,
         604:0      0  Block transaction number too large
                       Directory Path:  1:8, 2:8
                       Path:  602:1B, 604:0
    Keys from ^%Z("BREAK") to ^%Z("F12") are suspect.
    %GTM-I-DBTN, Block TN is 0x00000066
    %GTM-I-DBTNTOOLG,
         605:0      0  Block transaction number too large
                       Directory Path:  1:8, 2:8
                       Path:  602:28, 605:0
    Keys from ^%Z("F12") to ^%Z("REMOVE") are suspect.
    %GTM-I-DBTN, Block TN is 0x0000008A
    %GTM-I-DBTNTOOLG,
         601:0      0  Block transaction number too large
                       Directory Path:  1:8, 2:8
                       Path:  602:38, 601:0
    Keys from ^%Z("REMOVE") to the end are suspect.
    %GTM-I-DBTN, Block TN is 0x00000098
    %GTM-I-DBTNTOOLG,
         607:0      1  Block transaction number too large
                       Directory Path:  1:8, 2:14
                       Path:  607:0
    Keys from ^%ZIS to the end are suspect.
    %GTM-I-DBTN, Block TN is 0x00A9E7B4
    %GTM-I-DBTNTOOLG,
        DD07:0      0  Block transaction number too large
                       Directory Path:  1:8, 2:14
                       Path:  607:8, DD07:0
    Keys from ^%ZIS to ^%ZIS(1,22,1) are suspect.
    %GTM-I-DBTN, Block TN is 0x00A9E7B4
    %GTM-I-DBTNTOOLG,
         608:0      0  Block transaction number too large
                       Directory Path:  1:8, 2:14
                       Path:  607:1F, 608:0
    Keys from ^%ZIS(1,22,1) to ^%ZIS(1,2*) are suspect.
    %GTM-I-DBTN, Block TN is 0x00A9E7B4

    Total error count from integ:           55092.

    Type           Blocks         Records          % Used      Adjacent

    Directory           2             276          40.722            NA
    Index             510           54707          46.350             9
    Data            54472        11118788          99.135         54082
    Free            45016              NA              NA            NA
    Total          100000        11173771              NA         54091
    %GTM-E-DBTNLTCTN, Transaction numbers greater than the current transaction were found
    Maximum number of transaction number errors to display:  10, was exceeded
    55092 transaction number errors encountered.

    Largest transaction number found in database was FFFFFFFF
    Current transaction number is                    0
    %GTM-E-INTEGERRS, Database integrity errors
    12:15pm root@dev ~vista/dev/vista>
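The two summary lines at the end of the integ output can also be compared by a script, which is handy if integs run unattended. The following is only a minimal sketch: the function name `tn_summary` is hypothetical, the line wording is taken from the output pasted above, and it assumes both values are printed in hexadecimal as the block TNs are.

```python
import re

def tn_summary(integ_output):
    """Return (largest block TN, current header TN) parsed from integ text, or None."""
    # Assumption: the wording matches the integ output quoted in this thread.
    largest = re.search(
        r"Largest transaction number found in database was\s+([0-9A-Fa-f]+)",
        integ_output,
    )
    current = re.search(
        r"Current transaction number is\s+([0-9A-Fa-f]+)", integ_output
    )
    if not (largest and current):
        return None
    # Assumption: both numbers are hexadecimal.
    return int(largest.group(1), 16), int(current.group(1), 16)

sample = """\
Largest transaction number found in database was FFFFFFFF
Current transaction number is                    0
"""

largest_tn, current_tn = tn_summary(sample)
if largest_tn > current_tn:
    # A block TN ahead of the header TN is what integ is complaining about here.
    print(f"block TN 0x{largest_tn:08X} is ahead of header TN 0x{current_tn:08X}")
```

A check like this only flags the symptom; it says nothing about whether the header or the blocks hold the bad value.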

     
    • K.S. Bhaskar

      K.S. Bhaskar - 2003-02-17

      Ben --

      Under normal operation, database damage is very unusual.

      How did you shut down the system?  Could you have shut down the system without shutting GT.M down cleanly and/or a mupip rundown?  Did you perhaps kill GT.M processes with a kill -9?  Was shared memory removed with an ipcrm?

      To protect yourself against database damage due to system crashes (i.e., to recover from crashes) use journaling.  However, journaling won't protect you from shooting processes with kill -9, or removing shared memory.

      Regards
      -- Bhaskar

       
      • Bob Isch

        Bob Isch - 2003-02-18

        It looked to me like virtually every block in the database had a transaction number that was too large: 55092 transaction number errors were encountered, against 54472 data blocks plus 500-some pointer blocks.

        In fact, the
        "Largest transaction number found in database was FFFFFFFF" and
        "Current transaction number is 0" messages are interesting too.

        This leads one to ask a couple of questions:

        1) Has this database perhaps been running for a very long time with very high activity?
        and
        2) What does GT.M do when the transaction number wraps?  4 billion transactions is really not that many over a significant period (well, 1.2yrs at 100/sec?  Of course, you could bump the transaction count up manually for some reason.)
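As a rough check on that estimate, the time for a transaction counter to wrap can be worked out directly. This sketch assumes a 32-bit counter (suggested by the FFFFFFFF maximum in the integ output), one increment per update, and a steady update rate; the function name `years_to_wrap` is made up for illustration.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

def years_to_wrap(updates_per_second, counter_bits=32):
    """Years until a counter of the given width wraps at a steady update rate."""
    return (2 ** counter_bits) / updates_per_second / SECONDS_PER_YEAR

# Assumption: the rates below are illustrative, not measured on any real system.
for rate in (1, 100, 1000):
    print(f"{rate:>5} updates/sec -> {years_to_wrap(rate):.2f} years")
```

Using the full 2^32 range, 100 updates per second comes out at roughly 1.36 years, the same order of magnitude as the figure guessed above.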

        It may be more likely that the header got clobbered, so I would first take a look at dump -file (in dse).  However, if the transaction count did wrap, one way to reset it might be a simple extract, recreate the DB, and reload.  The extract will probably still work with the file in its current state.  If not, you could try changing the transaction count in the header back to FFFFFFFF.

        Hm... Actually, one could also look at the documentation for MUPIP -TN_RESET (GT.M Admin and Operations Guide, p.79) where all this (and a bit more) is explained too.  (Kind of wishing I had...)

        Good luck,
        -bi

         
    • K.S. Bhaskar

      K.S. Bhaskar - 2003-02-18

      Bob --

      Since GT.M doesn't flush the fileheader with every update to the database, the signature of the problem is something along the lines of an unclean shutdown.

      TN_RESET is indeed what should be done if the transaction number wraps, but GT.M starts putting messages in the operator log well before that happens (I think something like 300 million transactions short of 4 billion).

      -- Bhaskar

       
      • Bob Isch

        Bob Isch - 2003-02-18

        Wouldn't you expect a (relatively) few blocks to have the TN too large in that case and not most of them?  Also, wouldn't the largest TN found in the data/pointer blocks be a more reasonable number (not FFFFFFFF but something you might want to plop into the header to fix the problem?)  Also, why would the header currently have a last TN of 0? 

        I suppose the header could, of course, have been overwritten with garbage (thus the 0) and there could be some garbage in one of the data blocks, thus the FFFFFFFF.  Also, it did look like they were doing regular integs, so they should have been forewarned as you point out... (or does that just go into the syslog? -- who reads the log files until something goes wrong? :-)

        I'm sure you've seen many more of these than me but it just seemed a little unique...

        -bi

         
    • Dr. Martin Lehr

      Dr. Martin Lehr - 2003-02-18

      When the Transaction number wraps (from FFFFFFFF to 0)
      the system crashes, because the Transaction number
      does not increase from 0 to 1 upon the next update.

      Perhaps this is the result of some
      endless loop situation

      the command:
      mupip integ -TN -FU -FILE mumps.dat

      should correct all these errors

      Regards

      -Martin

       
    • Ben Mehling

      Ben Mehling - 2003-02-18

      1) There's no way this database rolled over the TN count...  I'd guess easily less than a million transactions as this is just a development copy running on a dev server and only up for a few weeks.

      Once the database was corrupted, any attempt to write to a global would cause the gtm process to loop -- the only way out was a kill -9 of the gtm process.  It's hard to say which came first the corrupted DB or the kill -9.  It's possible that a gtm process got shutdown dirty at some point.  I seriously doubt this is "gtm's fault".  :)

      We are going to play w/ journaling going forward, thank you for the suggestion Bhaskar.

      2) The mupip command suggested above fixed the DB. (Thanks Martin!)  What's the danger of using this DB going forward?  Should we manually export and import into a new, clean DB?  I've used other database products that suggest running these types of 'fix' commands several times back-to-back -- is that recommended with GT.M?

      Thanks very much for the quick response everyone, very helpful and educational.

      - Ben

       
    • v7i

      v7i - 2006-07-10

      What if we don't use "kill -9" but the situation described above still occurs?  Are there any preventive measures for avoiding similar situations?

       
      • K.S. Bhaskar

        K.S. Bhaskar - 2006-07-18

        Your question is too abstract to be able to troubleshoot it.  Are your errors exactly the same as Ben's?  What version of GT.M are you running?  What unusual events occurred?  Was there a system shutdown that may have left GT.M processes running?

        Random database damage is almost unheard of, but is of course always a possibility.

        -- Bhaskar

         
