
Need help restoring database integrity

2007-04-05
2012-12-29
  • Kevin Toppenberg

    Hello all.  I haven't posted much here before, but I have used GT.M for several years now, using the VistA EMR application.

    I have recently discovered some database integrity errors that have accumulated over time, that I need to clean up.  I am hoping someone here can help.

    Here is the result of my database integrity scan:

    MUPIP> integ
    File or Region: /var/local/OpenVistA_UserData/g/mumps.dat
    %GTM-W-MUTNWARN, Database file xxxx/mumps.dat is approaching 4G transaction number limit.  Renew database with MUPIP INTEG TN_RESET
    %GTM-W-DBLOCMBINC,
       11400:0     FF  Local bit map incorrect
    %GTM-W-DBMRKBUSY,
       11573:0     FF  Block incorrectly marked busy
    %GTM-W-DBMBPFLDLBM,
       11400:0     FF  Master bit map shows this map full, agreeing with disk local map
    %GTM-W-DBLOCMBINC,
       1B000:0     FF  Local bit map incorrect
    %GTM-W-DBMRKBUSY,
       1B1FC:0     FF  Block incorrectly marked busy
    %GTM-W-DBMBPFLDLBM,
       1B000:0     FF  Master bit map shows this map full, agreeing with disk local map
    %GTM-W-DBLOCMBINC,
       1B200:0     FF  Local bit map incorrect
    %GTM-W-DBMRKBUSY,
       1B2A1:0     FF  Block incorrectly marked busy
    %GTM-W-DBMBPFLDLBM,
       1B200:0     FF  Master bit map shows this map full, agreeing with disk local map
    %GTM-W-DBLOCMBINC,
       1CA00:0     FF  Local bit map incorrect
    %GTM-W-DBMRKBUSY,
       1CAF1:0     FF  Block incorrectly marked busy
    %GTM-W-DBMBPFLDLBM,
       1CA00:0     FF  Master bit map shows this map full, agreeing with disk local map
    %GTM-W-DBLOCMBINC,
       1D200:0     FF  Local bit map incorrect
    %GTM-W-DBMRKBUSY,
       1D3BB:0     FF  Block incorrectly marked busy
    %GTM-W-DBMBPFLDLBM,
       1D200:0     FF  Master bit map shows this map full, agreeing with disk local map
    %GTM-W-DBLOCMBINC,
       28800:0     FF  Local bit map incorrect
    %GTM-W-DBMRKBUSY,
       289AC:0     FF  Block incorrectly marked busy
    %GTM-W-DBMBPFLDLBM,
       28800:0     FF  Master bit map shows this map full, agreeing with disk local map
    %GTM-W-DBLOCMBINC,
       80E00:0     FF  Local bit map incorrect
    %GTM-W-DBMRKBUSY,
       80F75:0     FF  Block incorrectly marked busy
    %GTM-W-DBMBPFLDLBM,
       80E00:0     FF  Master bit map shows this map full, agreeing with disk local map
    %GTM-W-DBLOCMBINC,
       86400:0     FF  Local bit map incorrect
    %GTM-W-DBMRKBUSY,
       865E7:0     FF  Block incorrectly marked busy
    %GTM-W-DBMRKBUSY,
       865E9:0     FF  Block incorrectly marked busy
    %GTM-W-DBMRKBUSY,
       865EA:0     FF  Block incorrectly marked busy
    Maximum number of incorrectly busy errors to display:  10, has been exceeded
    25 incorrectly busy errors encountered

    Total error count from integ:           43.

    Type           Blocks         Records          % Used      Adjacent

    Directory           4             356          26.422            NA
    Index           56229          965061           7.800         48187
    Data           909185       120771731          72.597        485706
    Free            23582              NA              NA            NA
    Total          989000       121737148              NA        533893
    %GTM-E-INTEGERRS, Database integrity errors
    [kdt0p@poweredge ~]$ sh D
    Desktop/   Downloads/ DSE        DSE~
    [kdt0p@poweredge ~]$ sh D
    Desktop/   Downloads/ DSE        DSE~
    [kdt0p@poweredge ~]$ sh DSE

    File    /var/local/OpenVistA_UserData/g/mumps.dat
    Region  DEFAULT

    DSE> quit

    I have read about restoring database integrity in the GT.M Admin and Operation guide, found here:
    http://www.fidelityinfoservices.com/user_documentation/AdminOpsUNIX/UNIX_A_O/index.html

    I have read about the data structure, and I think I understand that there is a bitmap that specifies whether blocks are busy or available.  I think I need to use the MAPS command to change this.  But I can't even navigate to the erroneous blocks.  Here is a screen log:

    ...
    DSE> find -b=11400 <-- 11400:0 is specified in the first error message
    Error: invalid block number.
    DSE> find -b=1b000
    Error: invalid block number.
    ...

    DSE> maps -master
    DSE> maps

    Block 1 is marked busy in its local bit map.

    Can anyone help?

    Thanks!
    Kevin

     
    • Kevin Toppenberg

      Well, I am making some progress.  For the error that says:

      %GTM-W-DBMRKBUSY,
         11573:0     FF  Block incorrectly marked busy
      ...
      %GTM-W-DBMRKBUSY,
         1B1FC:0     FF  Block incorrectly marked busy

      The following seems to work.

      DSE> maps -block=11573

      Block 11573 is marked busy in its local bit map.

      DSE> maps -block=11573 -free
      DSE> maps -block=11573

      Block 11573 is marked free in its local bit map.

      DSE> maps -block=1b1fc

      Block 1B1FC is marked busy in its local bit map.

      DSE> maps -block=1b1fc -free
      DSE>                    
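
The same two-step check could be generated for every block that INTEG reports. Below is a minimal Python sketch, not a GT.M tool: the function name `dse_free_commands` is made up for illustration, and it only turns "Block incorrectly marked busy" lines into the `maps -block=... -free` commands shown above. It assumes each block has already been inspected and is known to hold nothing worth keeping. Note that INTEG displays only the first 10 incorrectly busy blocks, so several passes may be needed.

```python
import re

# Matches the detail line that follows %GTM-W-DBMRKBUSY in the INTEG output,
# e.g. "   11573:0     FF  Block incorrectly marked busy".
BUSY_LINE = re.compile(
    r"^\s*([0-9A-Fa-f]+):[0-9A-Fa-f]+\s+\S+\s+Block incorrectly marked busy\s*$"
)


def dse_free_commands(integ_output: str) -> list[str]:
    """Return one DSE 'maps -block=X -free' command per reported busy block."""
    commands = []
    for line in integ_output.splitlines():
        match = BUSY_LINE.match(line)
        if match:
            block = match.group(1).upper()
            commands.append(f"maps -block={block} -free")
    return commands


sample = """\
%GTM-W-DBLOCMBINC,
   11400:0     FF  Local bit map incorrect
%GTM-W-DBMRKBUSY,
   11573:0     FF  Block incorrectly marked busy
%GTM-W-DBMBPFLDLBM,
   11400:0     FF  Master bit map shows this map full, agreeing with disk local map
%GTM-W-DBMRKBUSY,
   1B1FC:0     FF  Block incorrectly marked busy
"""

for command in dse_free_commands(sample):
    print(command)
```

The output is meant to be reviewed and typed (or pasted) at the DSE prompt by hand, rather than piped in blindly.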

      I'll post my progress here, but I would still appreciate any available input.

      Thanks
      Kevin

       
      • Kevin Toppenberg

        Success!  I have resolved all the integrity problems.  Thanks for your help everyone.  I hope I can return the favor sometime.

        Kevin

         
        • Steven Estes

          Steven Estes - 2007-04-06

          Easily done! Run journaling and buy GT.M Support! :-)

          Glad you are back up and running..

          Steve

           
          • Kevin Toppenberg

            Steve,

            These are very good suggestions.

            I have been afraid to turn on journaling because I was afraid the log files would fill up my hard drive.  If my mumps.dat file is 4 GB, can I expect that my journal files would be that size or bigger?

            Regarding mumps support, I did purchase it last year.  I was in a larger group then.  When I went into business by myself, I found that the support contracts were really not scalable down to a single-person setup such as mine.  Thus I might be willing to pay a few hundred dollars for an annual support contract, but not the $1000+ minimum that I was told was required.

            Thanks again
            Kevin

             
            • K.S. Bhaskar

              K.S. Bhaskar - 2007-04-12

              Kevin --

              You *must* run journaling in a production environment.

              Do you encourage your patients to come in for an annual / periodic checkup?  Do you get your car inspected annually and/or serviced regularly?  Do you check your server's operator log for messages of soft disk errors that may presage a head crash?  Well, in the same manner, you would periodically check your journal files, make sure you have room on disk, archive/delete old files, etc.  Whether it is 400KB, 400MB or 400GB, the process is routine.  So, apropos journal files, you have nothing to fear but fear itself.

              You can't directly compare the journal file size to the database file size.  Database files store the current values of your data - the more the data, the more the disk used.  Journal files store changes in your data.  So, if you had only a single global variable, but it was churned at the rate of thousands of sets a minute, your database file would be minuscule, but the journal files would be huge.  Conversely, if you had a practice with a million patients, but they all stay healthy and rarely come to see you, your database would be huge and your journal files would be minuscule.  But in either case, you would have to monitor disk usage.
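
To make the contrast concrete, here is a rough back-of-the-envelope sketch in Python. Every number in it is hypothetical, and the model (updates per day times an average journal record size) is a deliberate simplification that ignores before-images and per-file overhead. The only point it illustrates is that journal volume follows the update rate, not the database size.

```python
def journal_bytes_per_day(updates_per_day: int, avg_record_bytes: int) -> int:
    """Crude estimate: journal growth is proportional to the number of updates."""
    return updates_per_day * avg_record_bytes


# Hypothetical average size of one journal record, in bytes.
AVG_RECORD_BYTES = 100

# Tiny database, one global churned 5,000 times a minute, all day.
busy_small_db = journal_bytes_per_day(5_000 * 60 * 24, AVG_RECORD_BYTES)

# Huge database, but only 2,000 updates on a typical day.
quiet_large_db = journal_bytes_per_day(2_000, AVG_RECORD_BYTES)

print(f"busy small database:  {busy_small_db / 1e6:.1f} MB of journal per day")
print(f"quiet large database: {quiet_large_db / 1e6:.1f} MB of journal per day")
```

Measuring the real growth of the journal files over a few ordinary days would replace both guesses with actual figures.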

              I feel that Source Forge forums are not appropriate for discussions about commercial support contracts - anyone who wants to discuss one, please write to me at ks dot bhaskar at fnis dot com.

              Regards
              -- Bhaskar

               
    • Dr. Martin Lehr

      Dr. Martin Lehr - 2007-04-05

      Hello Kevin,

      all local bitmap errors can be corrected by rebuilding the bitmaps:

      DSE>m -r

      This is fast and simple.

      warning: your transaction number is approaching the transaction number limit.
      you should run MUPIP INTEG TN_RESET

      regards
      Martin

       
      • Kevin Toppenberg

        Martin,

        Thanks so much for your response.

        ...
        all local bitmap errors can be corrected by rebuilding the bitmaps:
        DSE>m -r 
        This is fast and simple.

        The manual says that this can result in data loss, so I am hesitant to use it, as per another post in this thread.

        Thanks again so much,

        Kevin

         
    • Steven Estes

      Steven Estes - 2007-04-05

      Kevin,

      I can give you some general guidelines here but how to proceed is something you will need to decide based on your database.

      First off, before I go any further, the message at the top of the listing warning you that your transaction numbers are about to wrap needs to be addressed "very soon". I don't know exactly how soon "very soon" is: it depends on how long you have been seeing this message, the current transaction number (obtained from DSE DUMP -FILEHEADER), and how many transactions you do per day. Databases have been destroyed by wrapped transaction numbers so it is something you *REALLY* want to avoid. In seeing this message, I am assuming you are running a V4 version. A potential alternative to the TN_RESET (if you have enough transaction numbers left) is to upgrade to GT.M V5 which has a 64 bit transaction number which you probably won't live long enough to wrap on any hardware existing today. I mention transaction numbers left because the actual upgrade procedure (which requires a database upgrade) requires one transaction number for each block in the database and the 64 bit TN doesn't take effect till that is complete (see the V5 database upgrade procedure document in the user documentation on www.fis-gtm.com).
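
The arithmetic behind "how soon" can be sketched as follows. This is only an illustration in Python: it assumes the limit is exactly 2**32 (the "4G" named in the MUTNWARN message), the current transaction number and daily update count are made-up values, and the block total is the one from the INTEG report earlier in the thread. The real current transaction number has to be read from DSE DUMP -FILEHEADER.

```python
TN_LIMIT = 2**32  # assumption: the "4G" limit named in the MUTNWARN message


def tn_headroom(current_tn: int, updates_per_day: int, total_blocks: int):
    """Return (transaction numbers left, days until the limit, upgrade fits?).

    'upgrade fits' reflects the point above that the V5 database upgrade
    consumes one transaction number per block in the database.
    """
    remaining = TN_LIMIT - current_tn
    days_left = remaining / updates_per_day
    upgrade_fits = remaining > total_blocks
    return remaining, days_left, upgrade_fits


# Hypothetical current TN and daily update count; 989000 is the block total
# shown in the INTEG output.
remaining, days_left, upgrade_fits = tn_headroom(
    current_tn=0xF0000000, updates_per_day=500_000, total_blocks=989_000
)
print(f"transaction numbers left: {remaining}")
print(f"days until the limit at this rate: {days_left:.0f}")
print(f"enough left for a one-TN-per-block upgrade: {upgrade_fits}")
```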

      On the integrity issues: There are two bitmaps in question. One is the master bit map which is stored as part of the file header. It has one bit for each of the second kind of bit map - the local bit map. When a master bitmap's bit is set to 1, that means there is some space available in the associated local bit map. Each local bit map contains usage information for 512 blocks (0x200). It tells whether each block is in use, never used, or reusable.
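
As a small illustration of that layout, here is a Python sketch. Two things in it are assumptions rather than statements from the GT.M documentation: that a local bitmap block is itself the first of the 512 blocks it describes, and that master-map bits are ordered by local bitmap position. The function names are made up for this example.

```python
BLOCKS_PER_LOCAL_MAP = 0x200  # 512 blocks per local bitmap, as stated above


def covered_range(local_map_block: int) -> tuple[int, int]:
    """First and last block whose status is recorded in this local bitmap."""
    if local_map_block % BLOCKS_PER_LOCAL_MAP != 0:
        raise ValueError(f"{local_map_block:X} is not a local bitmap block")
    return local_map_block, local_map_block + BLOCKS_PER_LOCAL_MAP - 1


def master_bit_index(block: int) -> int:
    """Index of the master-map bit for the local bitmap covering `block`.

    Assumption: master-map bits are ordered by local bitmap position.
    """
    return block // BLOCKS_PER_LOCAL_MAP


first, last = covered_range(0x11400)
print(f"local map 11400 covers {first:X} through {last:X}")
print(f"block 11573 belongs to master-map bit {master_bit_index(0x11573)}")
```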

      When you have errors like what you report, use of the maps -master or -restore_all commands can be dangerous because you don't really know which is correct. The reason is behind the origin of these errors. When a database update is done, the blocks are changed in shared memory but may not make it out to the disk any time soon and the order in which they make it out is not defined. So if you had a system crash, or otherwise the system was shutdown before all GT.M processes had become quiescent, some updated blocks and/or updates to the fileheader (including the master bit map) may not have become permanent.

      In your situation, both the master map and the local bitmap agree that the blocks in question should be allocated but mupip integ was not able to find the block being used anywhere in the global variable tree. The most likely cause is that data was deleted and the global variable tree updates made it out to permanent disk but the bitmap changes didn't. But there are possible scenarios where data was added and the bitmap changes made it out but the global variable tree changes didn't. Knowing your application and your data is key to determining which is correct. You can also dump the blocks in question and see if there is any data in them and if it is something you might like to have back. Again though, since there are no other integrity errors, the best bet (without looking at it) is that it is deleted data that didn't get bitmap updates committed in which case your approach of marking the bitmaps free one at a time is the correct approach.

      I'd have to check but one immediate thought I have is that the find issue you were getting was perhaps because you were trying to "find" a bitmap block. The find command is used to locate blocks in the various trees and a bitmap block is never in any tree. For example, in the first integ error "cluster", you have the local bit map block 11400 and the block 11573 (whose status is in the 11400 bitmap block). If you were to find 11573, you should get the results you were looking for. References to that block's local bitmap block would go to 11400 (which covers blocks 11400 through 115ff).

      Hope this helps..

      Steve

       
      • Kevin Toppenberg

        Steve,

        Thank you for your very helpful post.  See comments below.

        ...

        >>First off, before I go any further, the message at the top of the listing warning you that your transaction numbers are about to wrap needs to be addressed "very soon".

        I agree, but I can't reset the transaction numbers until I restore complete integrity.  Otherwise it complains about the integrity problem and quits.

        >>A potential alternative to the TN_RESET (if you have enough transaction numbers left) is to upgrade to GT.M V5 which has a 64 bit transaction number which you probably won't live long enough to wrap on any hardware existing today....

        Yes, I probably should upgrade to version 5.  I had even gotten v5 installed on my system, but then got chicken when it came time to convert the database.  I guess I was thinking "If it isn't broken, don't fix it."  I am also concerned about it breaking some of my code.  I routinely use variable names that are longer than 8 characters (but make sure the first 8 characters are unique).  I am not sure whether making more characters significant will introduce some bugs that I hadn't anticipated.  Still, I need to bite the bullet and do this some day...

        >>On the integrity issues: There are two bitmaps in question. One is the master bit map which is stored as part of the file header. It has one bit for each of the second kind of bit map - the local bit map. When a master bitmap's bit is set to 1, that means there is some space available in the associated local bit map. Each local bit map contains usage information for 512 blocks (0x200). It tells whether each block is in use, never used, or reusable.

        >>When you have errors like what you report, use of the maps -master or -restore_all commands can be dangerous because you don't really know which is correct.

        Good to hear this.  I was afraid to use that command, but thought that I was being foolish to do it all by hand.  I'll continue the manual approach.

        >>The reason is behind the origin of these errors. When a database update is done, the blocks are changed in shared memory but may not make it out to the disk any time soon and the order in which they make it out is not defined. So if you had a system crash, or otherwise the system was shutdown before all GT.M processes had become quiescent, some updated blocks and/or updates to the fileheader (including the master bit map) may not have become permanent.

        Do you know if shutting down a GT.M process via MUPIP will do this?  I have had very few crashes of the entire computer, and I usually try to get all processes shut down via MUPIP before rebooting.  So I'm not sure where these errors came from...

        >>In your situation, both the master map and the local bitmap agree that the blocks in question should be allocated but mupip integ was not able to find the block being used anywhere in the global variable tree.

        OK.  Maybe this explains what I had noticed.  Namely that the errors seemed to come in groups of three, with the middle one being the only one I could do something about:

        %GTM-W-DBLOCMBINC,   11400:0     FF  Local bit map incorrect
        %GTM-W-DBMRKBUSY,    11573:0     FF  Block incorrectly marked busy
        %GTM-W-DBMBPFLDLBM,  11400:0     FF  Master bit map shows this map full, agreeing with disk local map

        So hopefully fixing the middle error will solve the other two.  I am re-running another INTEG right now to test this theory.

        ...
        >> Again though, since there are no other integrity errors, the best bet (without looking at it) is that it is deleted data that didn't get bitmap updates committed in which case your approach of marking the bitmaps free one at a time is the correct approach.

        Actually, there was corruption of the data in those blocks, and I had manually removed them and fixed all relevant pointers before getting to this point.  So it looks like I just need to finish the job of completely removing the block (including updating the bitmap).
         
        >>I'd have to check but one immediate thought I have is that the find issue you were getting was perhaps because you were trying to "find" a bitmap block. The find command is used to locate blocks in the various trees and a bitmap block is never in any tree.

        Ahh, that makes sense.  I found that I can DUMP a block, even though I can't FIND it.  Now I see why.

        >>For example, in the first integ error "cluster", you have the local bit map block 11400 and the block 11573 (whose status is in the 11400 bitmap block). If you were to find 11573, you should get the results you were looking for. References to that block's local bitmap block would go to 11400 (which covers blocks 11400 through 115ff).

        How do you know that 11400 holds the bitmap for 11573?  Is it because the errors are clustered?

        >>Hope this helps..   Steve

        Yes, it has been very helpful.  I appreciate your help. 
        I'll post back further results.

        Addendum
        I have re-run INTEG, and fixing the "Block incorrectly marked busy" did seem to take care of the other grouped "Local bit map incorrect" and "Master bit map shows this map full, agreeing with disk local map".

        Thanks so much.
        Kevin

         
        • Steven Estes

          Steven Estes - 2007-04-06

          Just a clarification regarding maps -r : As Martin says, this will get rid of all the bitmap errors. What it does is throw away all the bitmaps and recreate them from the blocks it finds used in the global trees. The issue here is that if there is anything wrong in the global tree, the bitmaps won't be correct and missing data can be very difficult to find. If you are sure about the integrity of the global trees (i.e. only bitmap errors remain and you are satisfied that the bitmaps are what need correcting), maps -r can be very handy. You did make a backup of the database before you started repairing it, didn't you? :-) You can always do the maps -r and check things out to make sure nothing is missing and if it is, then go back to the backup before proceeding with repair again.

          You asked about shutting down the database. GT.M traps signals and errors, and does a clean rundown of the database, with these exceptions: if kill -9 (which GT.M cannot intercept) is used, or if the system is suddenly powered off or otherwise crashes without giving GT.M a chance to clean up, database damage is possible. After any system crash it is a good idea to run an integ on your databases. If you were running before-image journaling on the databases, you could instead run MUPIP RECOVER and it would just do the right thing.

          Note that there is a gtmstop script in the distribution which should stop all active GT.M processes (in a safe way). It may be worth incorporating this into your system shutdown scripts.

          On the question about block 11400, yes, it is because the errors are clustered and because I know that the bitmap blocks cover 0x200 blocks. Every 0x200th block is a bitmap block. So to find the bitmap block for any given block, AND the block number (in hex of course) with 0xFFFFFE00 and the result is the relevant bitmap block number.
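
That masking step is easy to check by hand or in a few lines of code. The following is a minimal Python sketch; the function name is made up for illustration and is not part of GT.M or DSE. The three block numbers are taken from the INTEG output at the top of the thread.

```python
BITMAP_MASK = 0xFFFFFE00  # clears the low 9 bits, i.e. rounds down to a multiple of 0x200


def local_bitmap_block(block: int) -> int:
    """Return the block number of the local bitmap that covers `block`."""
    return block & BITMAP_MASK


for reported in (0x11573, 0x1B1FC, 0x865E7):
    print(f"{reported:X} -> {local_bitmap_block(reported):X}")
```

Each result should match the bitmap block that INTEG lists just before the corresponding "incorrectly marked busy" block.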

          Steve

           
    • Kevin Toppenberg

      OK, using maps -block=xxx -free, I have been able to correct all the "incorrectly busy" errors, and the others have cleared as well.

      So now I want to do the INTEG TN_RESET.  But it is not letting me.
      Here is the screen log.  The script mupipPROMPT just sets up some environmental variables and then launches mupip.

      # sh mupipPROMPT
      MUPIP> INTEG
      File or Region: /var/local/OpenVistA_UserData/g/mumps.dat
      %GTM-W-MUTNWARN, Database file /var/local/OpenVistA_UserData/g/mumps.dat is approaching 4G transaction number limit.  Renew database with MUPIP INTEG TN_RESET

      Total error count from integ:           1.

      Type           Blocks         Records          % Used      Adjacent

      Directory           4             356          26.422            NA
      Index           56234          965122           7.800         48187
      Data           909241       120774094          72.595        485721
      Free            23521              NA              NA            NA
      Total          989000       121739572              NA        533908
      %GTM-E-INTEGERRS, Database integrity errors
      [root@poweredge kdt0p]# sh mupipPROMPT
      MUPIP> INTEG TN_RESET
      %GTM-E-DBOPNERR, Error opening database file TN_RESET
      %SYSTEM-E-ENO2, No such file or directory
      %GTM-I-MUSTANDALONE, Could not get exclusive access to TN_RESET
      %GTM-E-INTEGERRS, Database integrity errors
      [root@poweredge kdt0p]#               

      First, is it normal that I am getting dropped back to the Linux prompt (i.e. not the mupip prompt) after I ran INTEG?

      Second, why does it say it can't get exclusive access?  I don't have any other GT.M processes running.  I checked with ps.  Also, it won't run INTEG to begin with unless it has exclusive access to the database file.

      Thanks
      Kevin

       
      • Steven Estes

        Steven Estes - 2007-04-06

        Most of the error messages you get out of GT.M are common amongst several platforms -- including the VMS platform which has a different syntax for options and such. Consequently, the messages are somewhat generic in nature. The MUTNWARN message was not giving you the exact syntax that you need, it was telling you the operation you needed to do. I don't have the exact syntax off the top of my head but I know the TN_RESET option is specified as -TN_RESET. Check the manual for exact syntax..

        Steve

         
