From: tech.vronk <te...@vr...> - 2012-05-16 20:07:35
|
Hi,

I have experienced quite problematic behaviour of exist-db lately.
I have the 2.0-tech-preview version running.

After an (obviously) inefficient query that ran for ever, on the next start the 'exist' instance could not be found.

It was not one specific query; different queries on different datasets cause it. Some of them use XUpdate (but not heavily, just to store the result of the processing). Perhaps the only thing they have in common is that they use eval (heavily). And the database grew rather big: with around 200,000 documents, dom.dbx was 2.6 GB at the last crash.

The first two times I lost patience and first tried to kill the query in admin and then tried to stop the server. The last time I left it running, and it eventually returned HTTP 500:

javax.servlet.ServletException: An error occurred: null
    at org.exist.http.servlets.EXistServlet.doGet(EXistServlet.java:279)
    ....

with error:

java.lang.IllegalStateException
    at org.eclipse.jetty.server.session.AbstractSession.checkValid(AbstractSession.java:91)
    at org.eclipse.jetty.server.session.HashedSession.checkValid(HashedSession.java:55)
    at org.eclipse.jetty.server.session.AbstractSession.setAttribute(AbstractSession.java:408)
    at org.exist.http.servlets.HttpSessionWrapper.setAttribute(HttpSessionWrapper.java:120)
    at org.exist.xquery.functions.xmldb.XMLDBAuthenticate.cacheUserInHttpSession(XMLDBAuthenticate.java:208)
    at org.exist.xquery.functions.xmldb.XMLDBAuthenticate.eval(XMLDBAuthenticate.java:183)
    at org.exist.xquery.BasicFunction.eval(BasicFunction.java:68)
    at org.exist.xquery.InternalFunctionCall.eval(InternalFunctionCall.java:55)

I tried the basic recovery steps (stop, remove *.lck, *.log and *.dbx, except the three). Sometimes this did not help at all and the instance stayed unrecoverable. In some cases I was able to start the database again, but the security information seemed to have been lost: I could access the admin area (only) with an empty password.

(Probably therefore) I also wasn't able to access the database with the Java client. I got either a "Wrong password" error when trying to use the "old" password, or (when trying with an empty password):

XMLDBException occurred while retrieving collection: Failed to invoke method getPermissions in class org.exist.xmlrpc.RpcConnection: null

Therefore I also wasn't able to run a reindex, or manipulate the users and similar... (Or is there any other way to

In this state I also got all kinds of errors in the admin area, for example when trying to access the Indexes or Collections panel, and further work with the database was impossible.

Until now, the only solution for me was to empty the whole data folder and run a restore from the last backup. But this costs hours every time, and makes the actual work quite impossible.

I would be grateful for any ideas or advice.

best,
Matej |
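[Editorial sketch] The recovery steps described in the message above (stop the database, then clear out *.lck, *.log and all but three *.dbx files) could be scripted roughly as follows. This is only a sketch under assumptions: the message does not name "the three" files, so `KEEP` assumes they are dom.dbx, collections.dbx and symbols.dbx; the function name and the idea of moving files to a side directory instead of deleting them are illustrative, not part of eXist. Run it only while the database is stopped.

```python
import shutil
from pathlib import Path

# Assumption: these are "the three" files the recovery steps keep.
KEEP = {"dom.dbx", "collections.dbx", "symbols.dbx"}
PATTERNS = ("*.lck", "*.log", "*.dbx")


def set_aside_recovery_files(data_dir, aside_dir):
    """Move lock, log and rebuildable index files out of data_dir.

    Files are moved, not deleted, so the step can be undone by hand.
    Returns the sorted names of the files that were moved.
    """
    data_dir, aside_dir = Path(data_dir), Path(aside_dir)
    aside_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for pattern in PATTERNS:
        # list() so that moving files does not disturb the directory scan
        for path in list(data_dir.glob(pattern)):
            if path.is_file() and path.name not in KEEP:
                shutil.move(str(path), str(aside_dir / path.name))
                moved.append(path.name)
    return sorted(moved)
```

Moving rather than deleting keeps the old transaction log around in case a later recovery attempt needs it. Index files set aside this way come back empty on the next start, so a reindex is still required afterwards.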
From: Wolfgang M. <wol...@ex...> - 2012-05-17 08:52:46
|
Hi,

> I have experienced quite problematic behaviour of the exist-db lately.
>
> I have the 2.0-tech-preview version running.
>
> After an (obviously) inefficient query that ran for ever,
> on next start
> the 'exist' instance could not be found.

I had a similar issue the other day: probably a deadlocked thread which prevented eXist from creating a checkpoint for several days. Upon restart I had a huge transaction log, which fortunately was ok and the data could be restored by recovery.

Normally eXist should create a checkpoint when the last write transaction completes, but it seems we have a problem here in 2.x, resulting in a risky recovery run whenever you have to kill the db. I'm looking into this today.

Wolfgang |
From: tech.vronk <te...@vr...> - 2012-05-17 14:54:37
|
Hi Wolfgang,

thank you for the information.

Is there anything I can do meanwhile preemptively, to prevent this from happening again, like restarting the database or running the consistency checks more often?

Or meanwhile just backup often enough, check the queries twice before running and pray?

thanks again.

matej |
From: Wolfgang M. <wol...@ex...> - 2012-05-18 09:03:43
|
Hi Matej,

I found one issue and fixed it:

http://exist.svn.sourceforge.net/exist/?rev=16411&view=rev

eXist did not write a checkpoint if a long running thread had to be killed during shutdown, thus causing a recovery run every time. I have now changed the transaction manager to write a checkpoint if all write transactions returned properly. It should thus be safer to kill a blocking db.

This could help in your case, though I suspect there must be something else because above issue does not fully explain the failure I have observed on one of my servers last week. There might be a problem with transactions not being properly aborted. I will add some code to force an abort on a transaction if it has been completely inactive for 10 minutes or so. This will at least help to locate the issue.

Wolfgang |
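[Editorial sketch] The two ideas in the message above, writing a checkpoint only once all write transactions have returned and force-aborting a transaction that has been inactive for about 10 minutes, can be illustrated with a toy model. This is not eXist's transaction manager: the class, its method names and the 600-second default are hypothetical and only show the bookkeeping involved.

```python
import time


class TransactionRegistry:
    """Toy registry that aborts transactions idle longer than max_idle seconds."""

    def __init__(self, max_idle=600.0, clock=time.monotonic):
        self.max_idle = max_idle
        self.clock = clock
        self._last_active = {}  # transaction id -> time of last activity
        self.aborted = []       # ids of transactions that were force-aborted

    def begin(self, txn_id):
        self._last_active[txn_id] = self.clock()

    def touch(self, txn_id):
        """Record activity on a transaction that is still open."""
        if txn_id in self._last_active:
            self._last_active[txn_id] = self.clock()

    def commit(self, txn_id):
        self._last_active.pop(txn_id, None)

    def abort_idle(self):
        """Abort every transaction idle for more than max_idle; return their ids."""
        now = self.clock()
        idle = [t for t, seen in self._last_active.items()
                if now - seen > self.max_idle]
        for txn_id in idle:
            del self._last_active[txn_id]
            self.aborted.append(txn_id)
        return idle

    def can_checkpoint(self):
        """In this model a checkpoint is allowed only when no transaction is open."""
        return not self._last_active
```

Recording which transactions were force-aborted is the part that helps to locate the issue: a stuck transaction shows up by id in `aborted` instead of silently blocking the checkpoint.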
From: tech.vronk <te...@vr...> - 2012-05-19 08:07:03
|
Hi,

I had the(?) issue again tonight.

The last errors in the log were:

2012-05-19 08:15:18,779 [WrapperListener_start_runner] WARN (BFile.java [append]:217) - Key length exceeds page size! Skipping key ...
2012-05-19 08:15:18,779 [WrapperListener_start_runner] WARN (NativeValueIndex.java [flush]:469) - Could not append index data for key 'org.exist.storage.NativeValueIndex$QNameKey@617d7c7f'

I did the recovery steps, and again the database seemed to work, but in this inconsistent state (mainly having lost the security information). I also observed that the recreated *.dbx files were at their initial size.
Doesn't the database rebuild them automatically on startup?
Do I have to explicitly trigger a reindex?

If yes, there is the next problem: with the admin password lost (empty), I cannot access the database via the client.
Is there some other way (script?) to start the reindexing?

matej |
From: Wolfgang M. <wol...@ex...> - 2012-05-19 08:49:13
|
> Last errors in the log were:
> 2012-05-19 08:15:18,779 [WrapperListener_start_runner] WARN (BFile.java
> [append]:217) - Key length exceeds page size! Skipping key ...
> 2012-05-19 08:15:18,779 [WrapperListener_start_runner] WARN
> (NativeValueIndex.java [flush]:469) - Could not append index data for key
> 'org.exist.storage.NativeValueIndex$QNameKey@617d7c7f'

Those messages are normal and you'll see them every time you index an element or attribute with text > 4k (a range index is not efficient for strings that large and you might want to consider an ngram or lucene index instead).

> I did the recovery steps, and again, the database seemed to work again,
> but with this inconsistent state (mainly having lost the security
> information)
> but I observed, that the recreated *.dbx files were at their initial size.
> Doesn't the database recreate them automatically on startup?
> Do I have to explicitly hit a reindex?

If you deleted the index files, they will be created empty and you need to do a reindex afterwards.

> If yes, there is the next problem, that admin-pwd being lost (empty)
> I cannot access the database via the client.

There were some bug fixes with respect to checks on non-existing users in trunk. Did you try to access your db with trunk? It might help to get into the db and restore the users.

Wolfgang |
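[Editorial sketch] To find out which elements or attributes trigger the "Key length exceeds page size" warning, one could scan the source documents for values over 4k before deciding where a range index should give way to an ngram or lucene index. The sketch below works on the XML text outside the database; the function name is hypothetical, and measuring "4k" as 4096 UTF-8 bytes is an assumption, since the message does not say how the limit is counted.

```python
import xml.etree.ElementTree as ET

PAGE_LIMIT = 4096  # assumption: the "4k" from the message, in UTF-8 bytes


def oversized_values(xml_text, limit=PAGE_LIMIT):
    """Return (kind, name, byte_length) for every value larger than limit.

    For elements the full string value (all descendant text) is measured,
    so an ancestor of a large element is reported as well.
    """
    hits = []
    for elem in ET.fromstring(xml_text).iter():
        size = len("".join(elem.itertext()).encode("utf-8"))
        if size > limit:
            hits.append(("element", elem.tag, size))
        for name, value in elem.attrib.items():
            size = len(value.encode("utf-8"))
            if size > limit:
                hits.append(("attribute", name, size))
    return hits
```

Only the names that actually carry a range index in the collection configuration matter; the rest of the report can be ignored.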
From: tech.vronk <te...@vr...> - 2012-05-19 08:23:13
|
and - sorry for slicing the information - in the logs i also observed, what looked like (at least) two autonomous database restarts in the night.

Is this plausible?
(i start the server via wrapper.)

matej |
From: Wolfgang M. <wol...@ex...> - 2012-05-19 08:34:30
|
> in the logs i also observed, what looked like
> (at least) two autonomous database restarts in the night.
>
> Is this plausible?
> (i start the server via wrapper.)

Aha. By default the wrapper restarts the jvm if it doesn't respond for a certain time or runs into an out of memory. I disabled this in trunk because it can really corrupt your data if the wrapper tries over and over again. So the first thing I would do is to add the following lines to tools/wrapper/conf/wrapper.conf:

    wrapper.filter.action.1000=SHUTDOWN
    wrapper.filter.message.1000=The JVM has run out of memory.

    # Do not allow the wrapper to restart the JVM.
    # A restart could cause issues if eXist-db wasn't shut down completely.
    wrapper.disable_restarts.automatic=TRUE

Also, please check tools/wrapper/logs/ for any indication of why it restarted (probably an OutOfMemory error somewhere).

Wolfgang |
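[Editorial sketch] Whether a given wrapper.conf already contains the settings from the message above can be checked with a few lines of Python. The property names are the ones quoted in the message; the helper names are hypothetical, and the parser only handles plain key=value lines with # comments, which is a simplification of the real file format.

```python
def parse_wrapper_conf(text):
    """Parse 'key=value' lines, ignoring blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def automatic_restarts_disabled(props):
    """True if wrapper.disable_restarts.automatic is set to TRUE."""
    return props.get("wrapper.disable_restarts.automatic", "").upper() == "TRUE"


def out_of_memory_action(props):
    """Return the action of the filter matching 'out of memory', or None."""
    prefix = "wrapper.filter.message."
    for key, message in props.items():
        if key.startswith(prefix) and "out of memory" in message.lower():
            return props.get("wrapper.filter.action." + key[len(prefix):])
    return None
```

With the lines from the message in place, `automatic_restarts_disabled` returns True and `out_of_memory_action` returns "SHUTDOWN".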