From: Adam R. <ad...@ex...> - 2011-07-21 20:00:13
On 21 July 2011 21:59, Adam Retter <ad...@ex...> wrote:
>> I agree wholeheartedly. I am kicking myself for allowing them to be
>> overwritten. For one thing there may have been a clue as to what started
>> this whole mess in the first place. But, like I said, the first time I
>> restarted the database it went into an automatic re-indexing (or rather
>> indexing as it wasn't removing docs then storing them, but just storing
>> them). This caused a great deal of log activity and by the time I thought
>> of preserving the logs they had rolled off the edge of the earth.
>
> So it sounds like there was a journal that it needed to process at startup.
>
>> LESSON: At the first sign of trouble copy the current logs.
>
> Actually, a better lesson: change the logging settings in log4j.xml
> so that you keep all logs and they are never (or rarely) overwritten -
> assuming you have the disk space.
>
>> Initial playing around with 1.4.1dev leads me to believe the problem with
>> requiring a restart of the database in order to do a backup (either through
>> the web admin or by using system:trigger-system-task) has been fixed. Or at
>> least I've been able to trigger more backups sequentially than I've ever
>> been able to do before. Can anyone confirm that this is fixed in 1.4.1dev?
>> If so this is a big boon and means I have two more instances to upgrade.
>
> I'm not aware that there ever was an issue; did you report it previously?
>
>> Thanks,
>> Anthony
>>
>> On Thu, Jul 21, 2011 at 10:08 AM, Adam Retter <ad...@ex...> wrote:
>>>
>>> Just a quick note to say, please please please keep your log files.
>>> When something goes wrong and we need to understand what, these are
>>> always our first port of call.
>>>
>>> Cheers Adam.
>>>
>>> On 20 July 2011 22:31, Anthony Mohrenweiser <hi...@sp...> wrote:
>>> > Let me tell you my tale of woe.
>>> > Perhaps some of you might make suggestions about what I could have
>>> > done differently so that I would have avoided some of these problems.
>>> > Monday, I came into work to be told that the database was down. I have
>>> > unfortunately lost the logs, so I have to report this from memory, but
>>> > the last entry in the log was something to the effect that
>>> > "exist could not allocate a broker".
>>> > I am (actually was) running 1.4.0-rev10440-20091111 on a Windows Vista
>>> > machine as a service. I have 512MB allocated to the JVM, and have (had)
>>> > 96M (since increased to 128M) allocated to buffers.
>>> > I bounced the service, and eXist immediately went into a full re-index
>>> > of the database before it started. While my database is not "large" it
>>> > doesn't really qualify as "small" either. The last full backup zip file
>>> > weighs in at around 4.5GB and 212000 files. This is split up into a
>>> > large number of collections averaging around 200-300 files per
>>> > collection. The re-index took about 2 hours at which time the database
>>> > appeared to come up, but:
>>> > a) didn't seem able to find anything using a query, it could return
>>> > entire documents through the REST interface, but any queries into the
>>> > documents returned nothing.
>>> > b) In the log file, would return large numbers (several hundred I would
>>> > say, and I'm doing this from memory because the log files are gone now)
>>> > of "collection buffers have exceeded max size". This is almost
>>> > certainly not right, but the log files have been overwritten and I
>>> > don't remember the precise wording. I would get several hundred of
>>> > these entries in the log every time I tried anything that used the
>>> > database.
>>> > Based on this, I concluded (don't know if I was right or not) that the
>>> > database was corrupt, and that the collections.dbx file was probably
>>> > the source of the corruption.
>>> > Because I can't delete the collections.dbx file and re-index, I decided
>>> > that the only recourse was a full restore from backup.
>>> > That's when the fun began. This was the first time I had attempted a
>>> > full restore (I've done some partial restores before). From the Admin
>>> > Client I attempted to restore my last full backup. There was some
>>> > mysterious behavior here, I would press the restore button and nothing
>>> > would happen for literally minutes (I timed it once for 4 minutes 30
>>> > seconds), then the dialog box would appear and it would be unresponsive
>>> > for approximately the same period of time. Then it would allow me to
>>> > select a file. Selected the file, pressed the button, and got a huge
>>> > error message saying that it could not load the backup. I attempted to
>>> > load the backup in WinZip and was told that the file was corrupt. In
>>> > desperation, I started a new full backup. This appeared to be working,
>>> > so I let it run. It took several hours. That night from home, I
>>> > attempted to restore this backup, and got similar results.
>>> > That was Monday.
>>> > Tuesday morning, I tried other earlier backups with the same result. I
>>> > decided that with the database completely down anyway, now would be a
>>> > good time to upgrade the database to the 1.4.1 pre-release, maybe that
>>> > would be able to handle the zip files. So I installed 1.4.1 but had the
>>> > same results with the restore. 7zip was able to open my original last
>>> > backup, so I started extracting chunks (collections) from it and
>>> > restoring the collections from the filesystem (__contents__.xml files
>>> > rather than zips). However I soon found that about 50% of the
>>> > collections could not be extracted by 7zip.
>>> > I found a zip repair utility on the web, and the trial said that it was
>>> > repairing stuff, so I purchased it and let it run on the backup
>>> > (creating a new zip file). 5 hours later, I had a new "repaired" zip
>>> > file. I tried to restore that (with multi-minute delays) and it failed.
>>> > The repair utility had two options - one to create a repaired zip file
>>> > and another to extract all files, so I re-ran it to extract all files
>>> > from the zip. This appeared to be working and I saw that it was
>>> > recovering files from collections that I couldn't extract with 7zip
>>> > previously and that they appeared to be non-corrupt, so I let this run.
>>> > Late that night, when it completed, I started a restore from home from
>>> > these files. That appeared to work.
>>> > That was Tuesday.
>>> > Wednesday, I got into work and timed the progress of the restore.
>>> > Roughly speaking it was taking about a second per file. At 212000 files
>>> > I was looking at a significant amount of time for the restore to
>>> > finish. We needed the database back up as soon as possible. So, in what
>>> > was probably a fit of utter stupidity, I started triggering
>>> > system:restore calls on selected collections (ones that we really
>>> > needed) while the Admin Client continued the full restore. Actually,
>>> > this was working rather well, until it stopped. Literally. Database
>>> > appeared completely non-responsive. Activity on server appeared to be
>>> > flat-line. No disk or processor activity. Admin Client was frozen. Last
>>> > entry in log was:
>>> > (NativeBroker.java [removeXMLResource]:2246) - Removing document users.xml (1)...
>>> > Talk about making your blood run cold. I don't remember _ever_ seeing
>>> > something like this before. I waited for another 10 minutes with
>>> > absolutely no activity on server and bounced the database.
>>> > On restart, it tried to do a recovery and failed. I deleted the .lck
>>> > and .log files and bounced it again. This time it came up, and
>>> > initially looked ok, except that it wouldn't run queries and I couldn't
>>> > save to the database. I checked the users.xml file and it was blank.
>>> > Actually, I think that was when I discovered that I couldn't write to
>>> > the database. Finally, I was able to get the Admin Client to save a
>>> > users.xml file.
>>> > I now have about 80% of the database restored, and I am manually doing
>>> > it a collection at a time. At this point I am really skittish.
>>> > I am concerned that none of the full backups (either those made from
>>> > the Admin Client or from the web interface) could be loaded. I suspect
>>> > (but don't know) that they are not really corrupt, but just too damn
>>> > big. The zip repair utility was able to recover all the data in the
>>> > file (as far as I know, and I have no reason to suspect otherwise) and
>>> > it appears to be intact.
>>> > I am also concerned about the "Removing document users.xml..". As I
>>> > said, I don't ever recall seeing that before, but now (in 1.4.1dev) I
>>> > am seeing it quite often. Apparently it is removing and putting it
>>> > back, but at least once it didn't and brought everything to a
>>> > screeching halt.
>>> > If someone can explain the behavior of the multi-minute delays on the
>>> > Restore button of the Admin client I'd like to hear it.
>>> > If anyone has the patience to wade through this tome and has ideas on
>>> > what I should have done, I'd really like to hear them. I'm in a
>>> > receptive mood right now.
>>> >
>>> > Thanks,
>>> >
>>> > _______________________________________________
>>> > Exist-open mailing list
>>> > Exi...@li...
>>> > https://lists.sourceforge.net/lists/listinfo/exist-open
>>> >
>>>
>>> --
>>> Adam Retter
>>>
>>> eXist Developer
>>> { United Kingdom }
>>> ad...@ex...
>>> irc://irc.freenode.net/existdb
>>
>
> --
> Adam Retter
>
> eXist Developer
> { United Kingdom }
> ad...@ex...
> irc://irc.freenode.net/existdb
>

--
Adam Retter

eXist Developer
{ United Kingdom }
ad...@ex...
irc://irc.freenode.net/existdb