Re: [Exist-open] Fwd: Problems restoring backup


On 21 July 2011 21:59, Adam Retter <ad...@ex...> wrote:
>> I agree wholeheartedly.  I am kicking myself for allowing them to be
>> overwritten.  For one thing there may have been a clue as to what started
>> this whole mess in the first place.  But, like I said, the first time I
>> restarted the database it went into an automatic re-indexing (or rather
>> indexing as it wasn't removing docs then storing them, but just storing
>> them).  This caused a great deal of log activity and by the time I thought
>> of preserving the logs they had rolled off the edge of the earth.
>
> So it sounds like there was a journal that it needed to process at startup.
>
>> LESSON: At the first sign of trouble copy the current logs.
>
> Actually, a better lesson, change the logging settings in log4j.xml,
> so that you keep all logs and they are never (or rarely) overwritten -
> assuming you have the disk space.
>
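The "copy the current logs" lesson can also be scripted, so the snapshot is taken before the service is bounced and the startup activity rolls the files over. A minimal sketch, assuming all log files sit in one directory; the function name and both paths are hypothetical and are not eXist settings:

```python
import shutil
import time
from pathlib import Path


def snapshot_logs(log_dir, archive_root):
    """Copy everything in log_dir into a new timestamped folder under archive_root."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = Path(archive_root) / f"logs-{stamp}"
    # copytree creates the target directory itself and raises an error if it
    # already exists, so an earlier snapshot is never silently overwritten.
    shutil.copytree(Path(log_dir), target)
    return target
```

Calling something like `snapshot_logs("path/to/exist/logs", "path/to/log-archive")` before each restart leaves one folder per incident. It complements, rather than replaces, raising the retention limits in log4j.xml.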
>> Initial playing around with 1.4.1dev leads me to believe the problem with
>> requiring a restart of the database in order to do a backup (either through
>> the web admin or by using system:trigger-system-task) has been fixed.  Or at
>> least I've been able to trigger more backups sequentially than I've ever
>> been able to do before.  Can anyone confirm that this is fixed in 1.4.1dev?
>>  If so this is a big boon and means I have two more instances to upgrade.
>
> I'm not aware that there ever was an issue; did you report it previously?
>
>> Thanks,
>> Anthony
>>
>> On Thu, Jul 21, 2011 at 10:08 AM, Adam Retter <ad...@ex...> wrote:
>>>
>>> Just a quick note to say, please please please keep your log files.
>>> When something goes wrong and we need to understand what, these are
>>> always our first port of call.
>>>
>>> Cheers Adam.
>>>
>>> On 20 July 2011 22:31, Anthony Mohrenweiser <hi...@sp...>
>>> wrote:
>>> > Let me tell you my tale of woe.  Perhaps some of you might make
>>> > suggestions
>>> > about what I could have done differently so that I would have avoided
>>> > some
>>> > of these problems.
>>> > Monday, I came into work to be told that the database was down.  I have
>>> > unfortunately lost the logs, so I have to report this from memory, but
>>> > the
>>> > last entry in the log was something to the effect that
>>> > "exist could not allocate a broker".
>>> > I am (actually was) running 1.4.0-rev10440-20091111 on a Windows Vista
>>> > machine as a service.  I have 512MB allocated to the JVM, and have (had)
>>> > 96M
>>> > (since increased to 128M) allocated to buffers.
>>> > I bounced the service, and eXist immediately went into a full re-index of
>>> > the
>>> > database before it started.  While my database is not "large" it doesn't
>>> > really qualify as "small" either.  The last full backup zip file weighs
>>> > in
>>> > at around 4.5GB and 212000 files.  This is split up into a large number
>>> > of
>>> > collections averaging around 200-300 files per collection.  The re-index
>>> > took about 2 hours at which time the database appeared to come up, but:
>>> > a) didn't seem able to find anything using a query, it could return
>>> > entire
>>> > documents through the REST interface, but any queries into the documents
>>> > returned nothing.
>>> > b) In the log file, would return large numbers (several hundred I would
>>> > say, and I'm doing this from memory because the log files are gone now)
>>> > of
>>> > "collection buffers have exceeded max size".  This is almost certainly
>>> > not
>>> > right, but the log files have been overwritten and I don't remember the
>>> > precise wording.  I would get several hundred of these entries in the
>>> > log
>>> > every time I tried anything that used the database.
>>> > Based on this, I concluded (don't know if I was right or not) that the
>>> > database was corrupt, and that the collections.dbx file was probably the
>>> > source of the corruption.  Because I can't delete the collections.dbx
>>> > file
>>> > and re-index, I decided that the only recourse was a full restore from
>>> > backup.
>>> > That's when the fun began.  This was the first time I had attempted a
>>> > full
>>> > restore (I've done some partial restores before).  From the Admin Client
>>> > I
>>> > attempted to restore my last full backup.  There was some mysterious
>>> > behavior here, I would press the restore button and nothing would happen
>>> > for
>>> > literally minutes (I timed it once for 4 minutes 30 seconds), then the
>>> > dialog box would appear and it would be unresponsive for approximately
>>> > the
>>> > same period of time.  Then it would allow me to select a file.  Selected
>>> > the
>>> > file, pressed the button, and got a huge error message saying that it
>>> > could
>>> > not load the backup.  I attempted to load the backup in WinZip and was
>>> > told
>>> > that the file was corrupt.  In desperation, I started a new full backup.
>>> >  This appeared to be working, so I let it run.  It took several hours.
>>> >  That
>>> > night from home, I attempted to restore this backup, and got similar
>>> > results.
>>> > That was Monday.
>>> > Tuesday morning, I tried other earlier backups with the same result.  I
>>> > decided that with the database completely down anyway, now would be a
>>> > good
>>> > time to upgrade the database to the 1.4.1 pre-release, maybe that would
>>> > be
>>> > able to handle the zip files.  So I installed 1.4.1 but had the same results
>>> > with the restore.  7zip was able to open my original last backup, so I
>>> > started extracting chunks (collections) from it and restoring the
>>> > collections from the filesystem (__contents__.xml files rather than
>>> > zips).
>>> >  However I soon found that about 50% of the collections could not be
>>> > extracted by 7zip.  I found a zip repair utility on the web, and the
>>> > trial
>>> > said that it was repairing stuff, so I purchased it and let it run on
>>> > the
>>> > backup (creating a new zip file).  5 hours later, I had a new "repaired"
>>> > zip
>>> > file.  I tried to restore that (with multi-minute delays) and it failed.
>>> >  The repair utility had two options - one to create a repaired zip file
>>> > and
>>> > another to extract all files, so I re-ran it to extract all files from
>>> > the
>>> > zip.  This appeared to be working and I saw that it was recovering files
>>> > from collections that I couldn't extract with 7zip previously and that
>>> > they
>>> > appeared to be non-corrupt, so I let this run.  Late that night, when it
>>> > completed, I started a restore from home from these files.  That
>>> > appeared to
>>> > work.
>>> > That was Tuesday.
>>> > Wednesday, I got into work and timed the progress of the restore.
>>> >  Roughly
>>> > speaking it was taking about a second per file.  At 212000 files I was
>>> > looking at a significant amount of time for the restore to finish.  We
>>> > needed the database back up as soon as possible.  So, in what was
>>> > probably a
>>> > fit of utter stupidity, I started triggering system:restore calls on
>>> > selected collections (ones that we really needed) while the Admin Client
>>> > continued the full restore.  Actually, this was working rather well,
>>> > until
>>> > it stopped.  Literally.  Database appeared completely non-responsive.
>>> >  Activity on server appeared to be flat-line.  No disk or processor
>>> > activity.  Admin Client was frozen.  Last entry in log was:
>>> > (NativeBroker.java [removeXMLResource]:2246) - Removing document
>>> > users.xml
>>> > (1)...
>>> > Talk about making your blood run cold.  I don't remember _ever_ seeing
>>> > something like this before.  I waited for another 10 minutes with
>>> > absolutely
>>> > no activity on server and bounced the database.  On restart, it tried to
>>> > do
>>> > a recovery and failed.  I deleted the .lck and .log files and bounced it
>>> > again.  This time it came up, and initially looked ok, except that it
>>> > wouldn't run queries and I couldn't save to the database.  I checked the
>>> > users.xml file and it was blank.  Actually, I think that was when I
>>> > discovered that I couldn't write to the database.  Finally, I was able
>>> > to
>>> > get the Admin Client to save a users.xml file.
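For scale, the restore rate reported above (about one second per file, 212000 files) can be turned into a rough completion estimate. This is plain arithmetic on the figures in the message; it assumes the rate stays constant, which a real restore may not:

```python
def estimate_restore_hours(file_count, seconds_per_file):
    """Return the estimated restore duration in hours at a constant per-file rate."""
    return file_count * seconds_per_file / 3600.0


hours = estimate_restore_hours(212_000, 1.0)
print(f"{hours:.1f} hours, about {hours / 24:.1f} days")
# prints "58.9 hours, about 2.5 days"
```

At that rate a full restore runs for well over two days, which is useful to know before deciding whether to restore the most urgent collections first.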
>>> > I now have about 80% of the database restored, and I am manually doing
>>> > it a
>>> > collection at a time.  At this point I am really skittish.
>>> > I am concerned that none of the full backups (either those made from the
>>> > Admin Client or from the web interface) could be loaded.  I suspect (but
>>> > don't know) that they are not really corrupt, but just too damn big.
>>> >  The
>>> > zip repair utility was able to recover all the data in the file (as far
>>> > as I
>>> > know, and I have no reason to suspect otherwise) and it appears to be
>>> > intact.
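On the "too big rather than corrupt" suspicion: the classic ZIP format stores sizes and offsets in 32 bits and the entry count in 16 bits, so an archive of about 4.5GB with 212000 entries is past both limits and needs the ZIP64 extensions. That does not prove that any of the tools involved lacked ZIP64 support; that part is an assumption. A minimal sketch using Python's standard zipfile module (which reads ZIP64) to check a backup independently of the restore tooling; the function and report keys are hypothetical names:

```python
import os
import zipfile

ZIP32_MAX_BYTES = 0xFFFFFFFF   # largest size/offset a classic ZIP field can hold
ZIP32_MAX_ENTRIES = 0xFFFF     # largest entry count a classic ZIP field can hold


def check_backup_zip(path):
    """Report whether the archive exceeds classic ZIP limits and whether it reads cleanly."""
    size = os.path.getsize(path)
    report = {"size_bytes": size,
              "over_classic_size_limit": size > ZIP32_MAX_BYTES}
    try:
        with zipfile.ZipFile(path) as zf:
            entries = len(zf.infolist())
            report["entries"] = entries
            report["over_classic_entry_limit"] = entries > ZIP32_MAX_ENTRIES
            # testzip() reads every member and returns the name of the first
            # one that fails its CRC check, or None if all members are fine.
            report["first_bad_entry"] = zf.testzip()
    except zipfile.BadZipFile as exc:
        # The file could not be opened as a ZIP archive at all.
        report["error"] = str(exc)
    return report
```

Running this right after each backup completes would show whether the archive is readable by at least one ZIP64-capable reader before it is ever needed for a restore. Note that testzip() reads the whole archive, so it will take a while on a multi-gigabyte file.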
>>> > I am also concerned about the "Removing document users.xml..".  As I
>>> > said, I
>>> > don't ever recall seeing that before, but now (in 1.4.1dev) I am seeing
>>> > it
>>> > quite often.  Apparently it is removing and putting it back, but at
>>> > least
>>> > once it didn't and brought everything to a screeching halt.
>>> > If someone can explain the behavior of the multi-minute delays on the
>>> > Restore button of the Admin client I'd like to hear it.
>>> > If anyone has the patience to wade through this tome and has ideas on
>>> > what I
>>> > should have done, I'd really like to hear them.  I'm in a receptive mood
>>> > right now.
>>> >
>>> > Thanks,
>>> >
>>>
>>>
>>>
>>> --
>>> Adam Retter
>>>
>>> eXist Developer
>>> { United Kingdom }
>>> ad...@ex...
>>> irc://irc.freenode.net/existdb
>>
>>
>
>
>
>


-- 
Adam Retter

eXist Developer
{ United Kingdom }
ad...@ex...
irc://irc.freenode.net/existdb
