From: William P. <wil...@ya...> - 2010-01-08 16:15:41
One of the unit tests that is giving Youjun trouble is one that checks for orphans (or in this case "childless" records) in the following chain of tables:

    study --- submission --- sub_treeblock --- treeblock >-- phylotree

Frankly, I'm not clear why we have sub_* tables to begin with, but it turns out that there are about 4,500 orphaned treeblocks (i.e. treeblock records that lack any related phylotree records) that each have a sub_treeblock id, and almost all of them belong to a submission that belongs to a study with study_id 22. So for some strange reason, study_id 22 has thousands of sub_treeblock records that each have a treeblock record, yet they lack phylotrees. Looks to me like these were created as part of Mark Jason's last migration effort -- probably purely spurious.

I tried deleting these simply by deleting study_id 22, but after many hours that resulted in the "data access failure" (below). Don't know if anyone has some ideas re. why it doesn't work.

The other approach is to run a SQL query like so:

    DELETE FROM treeblock WHERE treeblock_id IN (
        SELECT tb.treeblock_id
        FROM phylotree pt RIGHT JOIN treeblock tb ON (pt.treeblock_id = tb.treeblock_id)
        JOIN sub_treeblock stb ON (tb.treeblock_id = stb.treeblock_id)
        JOIN submission sub ON (stb.submission_id = sub.submission_id)
        JOIN study st ON (sub.study_id = st.study_id)
        WHERE pt.phylotree_id IS NULL
        AND st.study_id = 22
    )

... but the trouble is that foreign key constraints don't allow these to be deleted (and PostgreSQL only cascades deletes when the foreign keys are declared with ON DELETE CASCADE, which would require altering the tables). One way around this would be to first build a list of orphaned sub_treeblock, treeblock, etc. records, and then delete them table by table, using the saved list of IDs in place of the cascade.

Anyway... I wanted to touch base with everyone in case there's an easy way to do this the proper way -- i.e. to have our java code & hibernate delete study_id 22 without running into the "DataIntegrityViolationException" below.
bp

Data Access Failure

could not delete: [org.cipres.treebase.domain.taxon.TaxonLabel#67521]; nested exception is org.hibernate.exception.ConstraintViolationException: could not delete: [org.cipres.treebase.domain.taxon.TaxonLabel#67521]

org.springframework.dao.DataIntegrityViolationException: could not delete: [org.cipres.treebase.domain.taxon.TaxonLabel#67521]; nested exception is org.hibernate.exception.ConstraintViolationException: could not delete: [org.cipres.treebase.domain.taxon.TaxonLabel#67521]
Caused by: org.hibernate.exception.ConstraintViolationException: could not delete: [org.cipres.treebase.domain.taxon.TaxonLabel#67521]
    at org.hibernate.exception.SQLStateConverter.convert(SQLStateConverter.java:71)
    at org.hibernate.exception.JDBCExceptionHelper.convert(JDBCExceptionHelper.java:43)
    at org.hibernate.persister.entity.AbstractEntityPersister.delete(AbstractEntityPersister.java:2546)
    at org.hibernate.persister.entity.AbstractEntityPersister.delete(AbstractEntityPersister.java:2702)
    at org.hibernate.action.EntityDeleteAction.execute(EntityDeleteAction.java:77)
    at org.hibernate.engine.ActionQueue.execute(ActionQueue.java:279)
    at org.hibernate.engine.ActionQueue.executeActions(ActionQueue.java:263)
    at org.hibernate.engine.ActionQueue.executeActions(ActionQueue.java:172)
    at org.hibernate.event.def.AbstractFlushingEventListener.performExecutions(AbstractFlushingEventListener.java:298)
    at org.hibernate.event.def.DefaultFlushEventListener.onFlush(DefaultFlushEventListener.java:27)
    at org.hibernate.impl.SessionImpl.flush(SessionImpl.java:1000)
    at org.springframework.orm.hibernate3.HibernateTemplate$27.doInHibernate(HibernateTemplate.java:811)
    at org.springframework.orm.hibernate3.HibernateTemplate.execute(HibernateTemplate.java:372)
    at org.springframework.orm.hibernate3.HibernateTemplate.flush(HibernateTemplate.java:809)
    at org.cipres.treebase.dao.AbstractDAO.flush(AbstractDAO.java:158)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:616)
    at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:304)
    at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:198)
    at $Proxy50.flush(Unknown Source)
    at org.cipres.treebase.service.study.SubmissionServiceImpl.deleteSubmission(SubmissionServiceImpl.java:428)
    at org.cipres.treebase.service.study.SubmissionServiceImpl.deleteSubmission(SubmissionServiceImpl.java:900)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:616)
    at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:304)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:182)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:149)
    at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:106)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:171)
    at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:204)
    at $Proxy80.deleteSubmission(Unknown Source)
    at org.cipres.treebase.web.controllers.DeleteStudyController.onSubmit(DeleteStudyController.java:71)
    at org.springframework.web.servlet.mvc.SimpleFormController.processFormSubmission(SimpleFormController.java:267)
    at org.springframework.web.servlet.mvc.CancellableFormController.processFormSubmission(CancellableFormController.java:140)
    at org.springframework.web.servlet.mvc.AbstractFormController.handleRequestInternal(AbstractFormController.java:265)
    at org.springframework.web.servlet.mvc.AbstractController.handleRequest(AbstractController.java:153)
    at org.springframework.web.servlet.mvc.SimpleControllerHandlerAdapter.handle(SimpleControllerHandlerAdapter.java:48)
    at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:858)
    at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:792)
    at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:476)
    at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:441)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:710)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:269)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
    at org.displaytag.filter.ResponseOverrideFilter.doFilter(ResponseOverrideFilter.java:125)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
    at org.springframework.orm.hibernate3.support.OpenSessionInViewFilter.doFilterInternal(OpenSessionInViewFilter.java:198)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:75)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
    at com.opensymphony.module.sitemesh.filter.PageFilter.parsePage(PageFilter.java:119)
    at com.opensymphony.module.sitemesh.filter.PageFilter.doFilter(PageFilter.java:55)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
    at org.acegisecurity.util.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:264)
    at org.acegisecurity.intercept.web.FilterSecurityInterceptor.invoke(FilterSecurityInterceptor.java:107)
    at org.acegisecurity.intercept.web.FilterSecurityInterceptor.doFilter(FilterSecurityInterceptor.java:72)
    at org.acegisecurity.util.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:274)
    at org.acegisecurity.ui.ExceptionTranslationFilter.doFilter(ExceptionTranslationFilter.java:110)
    at org.acegisecurity.util.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:274)
    at org.acegisecurity.wrapper.SecurityContextHolderAwareRequestFilter.doFilter(SecurityContextHolderAwareRequestFilter.java:81)
    at org.acegisecurity.util.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:274)
    at org.acegisecurity.ui.AbstractProcessingFilter.doFilter(AbstractProcessingFilter.java:217)
    at org.acegisecurity.util.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:274)
    at org.acegisecurity.context.HttpSessionContextIntegrationFilter.doFilter(HttpSessionContextIntegrationFilter.java:191)
    at org.acegisecurity.util.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:274)
    at org.acegisecurity.util.FilterChainProxy.doFilter(FilterChainProxy.java:148)
    at org.acegisecurity.util.FilterToBeanProxy.doFilter(FilterToBeanProxy.java:90)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:210)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:151)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:870)
    at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
    at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
    at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
    at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:685)
    at java.lang.Thread.run(Thread.java:636)
Caused by: java.sql.BatchUpdateException: Batch entry 13 delete from TAXONLABEL where TAXONLABEL_ID=71834 and VERSION=2 was aborted. Call getNextException to see the cause.
    at org.postgresql.jdbc2.AbstractJdbc2Statement$BatchResultHandler.handleError(AbstractJdbc2Statement.java:2537)
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1328)
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:351)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.executeBatch(AbstractJdbc2Statement.java:2674)
    at com.mchange.v2.c3p0.impl.NewProxyPreparedStatement.executeBatch(NewProxyPreparedStatement.java:1723)
    at org.hibernate.jdbc.BatchingBatcher.doExecuteBatch(BatchingBatcher.java:48)
    at org.hibernate.jdbc.BatchingBatcher.addToBatch(BatchingBatcher.java:34)
    at org.hibernate.persister.entity.AbstractEntityPersister.delete(AbstractEntityPersister.java:2525)
    ... 85 more
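For reference, the table alteration alluded to above (making PostgreSQL cascade the delete automatically) would look roughly like this. A minimal sketch, not a tested recipe: the constraint name comes from the error reported later in this thread, the replacement constraint name is made up, and any other foreign keys into treeblock would need the same treatment.

    -- Swap the plain FK on sub_treeblock for a cascading one, so that
    -- deleting a treeblock also deletes its sub_treeblock rows.
    ALTER TABLE sub_treeblock DROP CONSTRAINT fk94d50830bfd107c3;
    ALTER TABLE sub_treeblock
      ADD CONSTRAINT sub_treeblock_treeblock_fkey  -- hypothetical name
      FOREIGN KEY (treeblock_id) REFERENCES treeblock (treeblock_id)
      ON DELETE CASCADE;
    -- ... run the DELETE FROM treeblock query above ...
    -- then drop and re-add the constraint without ON DELETE CASCADE
    -- to restore the original behavior.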
From: Hilmar L. <hl...@ne...> - 2010-01-08 17:30:47
Bill - two questions first:

1) Do you still have the original data file that would clearly prove that these treeblocks must be spurious?

2) Why would we not delete only the spurious treeblocks and sub_treeblocks, rather than the entire study? Are you suspicious that the entire study is corrupt in the way it is represented in the database?

Note that your join below won't work the way I think you intend it to work. Either put the outer join last, or parenthesize the series of inner joins that you join with.

If the WHERE clause below were really correct (i.e., only hit treeblock records that do not have any phylotree dependent on them) then you should not get any foreign key violation, should you? Or are there other tables that have a foreign key to treeblock, and which have rows pointing to the so-called orphaned treeblocks (which might not be so orphaned after all)?

-hilmar

On Jan 8, 2010, at 11:15 AM, William Piel wrote:

> One of the unit tests that is giving Youjun trouble is one that checks for orphans (or in this case "childless" records) in the following chain of tables:
>
>     study --- submission --- sub_treeblock --- treeblock >-- phylotree
>
> Frankly, I'm not clear why we have sub_* tables to begin with, but it turns out that there are about 4,500 orphaned treeblocks (i.e. treeblock records that lack any related phylotree records) that each have a sub_treeblock id, and almost all of them belong to a submission that belongs to a study with study_id 22. So for some strange reason, study_id 22 has thousands of sub_treeblock records that each have a treeblock record, yet they lack phylotrees. Looks to me like these were created as part of Mark Jason's last migration effort -- probably purely spurious.
>
> I tried deleting these simply by deleting study_id 22, but after many hours that resulted in the "data access failure" (below). Don't know if anyone has some ideas re. why it doesn't work.
>
> The other approach is to run a SQL query like so:
>
>     DELETE FROM treeblock WHERE treeblock_id IN (
>         SELECT tb.treeblock_id
>         FROM phylotree pt RIGHT JOIN treeblock tb ON (pt.treeblock_id = tb.treeblock_id)
>         JOIN sub_treeblock stb ON (tb.treeblock_id = stb.treeblock_id)
>         JOIN submission sub ON (stb.submission_id = sub.submission_id)
>         JOIN study st ON (sub.study_id = st.study_id)
>         WHERE pt.phylotree_id IS NULL
>         AND st.study_id = 22
>     )
>
> ... but the trouble is that foreign key constraints don't allow these to be deleted (and PostgreSQL only cascades deletes when the foreign keys are declared with ON DELETE CASCADE, which would require altering the tables). One way around this would be to first build a list of orphaned sub_treeblock, treeblock, etc. records, and then delete them table by table, using the saved list of IDs in place of the cascade.
>
> Anyway... I wanted to touch base with everyone in case there's an easy way to do this the proper way -- i.e. to have our java code & hibernate delete study_id 22 without running into the "DataIntegrityViolationException" below.
> bp
>
> [stack trace snipped]

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- informatics.nescent.org :
===========================================================
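Hilmar's last question -- whether other tables hold foreign keys into treeblock -- can be answered directly from the PostgreSQL system catalogs. A minimal sketch (a standard catalog query; nothing schema-specific is assumed):

    -- List every foreign key constraint that references treeblock,
    -- together with the table that holds it.
    SELECT c.conname,
           c.conrelid::regclass AS referencing_table
    FROM pg_constraint c
    WHERE c.contype = 'f'
      AND c.confrelid = 'treeblock'::regclass;

Any table listed there can block a plain DELETE FROM treeblock.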
From: William P. <wil...@ya...> - 2010-01-08 19:27:56
On Jan 8, 2010, at 12:30 PM, Hilmar Lapp wrote:

> Or are there other tables that have a foreign key to treeblock

Yes -- the error is:

    ERROR: update or delete on table "treeblock" violates foreign key constraint "fk94d50830bfd107c3" on table "sub_treeblock"
    DETAIL: Key (treeblock_id)=(2848) is still referenced from table "sub_treeblock".

so... another approach is to first delete the sub_treeblock record and then delete the treeblock record. Can I express that in a single delete query? I'm guessing not. But if I do it in two delete queries, deleting sub_treeblock first will cause treeblock to lose the connection with the study table (all of this hangs off of study_id = 22). Which is why I'm thinking of building a big list of treeblock_ids, and then running two delete queries:

1. Get [big list] like so:

    SELECT tb.treeblock_id
    FROM study st
    JOIN submission sub ON (st.study_id = sub.study_id)
    JOIN sub_treeblock stb ON (sub.submission_id = stb.submission_id)
    JOIN treeblock tb ON (stb.treeblock_id = tb.treeblock_id)
    LEFT JOIN phylotree pt ON (pt.treeblock_id = tb.treeblock_id)
    WHERE pt.phylotree_id IS NULL
    AND st.study_id = 22

The result is a list of 4492 distinct treeblock_ids -- reformat the result as comma-separated numbers.

2. First delete from the table that references treeblock.treeblock_id:

    DELETE FROM sub_treeblock WHERE treeblock_id IN ( [big list] );

3. Then delete from treeblock:

    DELETE FROM treeblock WHERE treeblock_id IN ( [big list] );

How does that sound?

bp
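A temporary table would save reformatting 4,492 IDs by hand and guarantees both deletes work from the same list. A minimal sketch under the same schema assumptions as the queries above, untested:

    BEGIN;

    -- Materialize the orphan IDs once.
    CREATE TEMP TABLE orphan_treeblocks AS
    SELECT tb.treeblock_id
    FROM study st
    JOIN submission sub ON (st.study_id = sub.study_id)
    JOIN sub_treeblock stb ON (sub.submission_id = stb.submission_id)
    JOIN treeblock tb ON (stb.treeblock_id = tb.treeblock_id)
    LEFT JOIN phylotree pt ON (pt.treeblock_id = tb.treeblock_id)
    WHERE pt.phylotree_id IS NULL
      AND st.study_id = 22;

    -- Delete in dependency order: the referencing table first.
    DELETE FROM sub_treeblock
    WHERE treeblock_id IN (SELECT treeblock_id FROM orphan_treeblocks);

    DELETE FROM treeblock
    WHERE treeblock_id IN (SELECT treeblock_id FROM orphan_treeblocks);

    COMMIT;

Running both deletes in one transaction also means that a failure partway through (say, yet another referencing table) rolls everything back cleanly.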
From: William P. <wil...@ya...> - 2010-01-08 18:28:51
On Jan 8, 2010, at 12:30 PM, Hilmar Lapp wrote:

> two questions first: 1) Do you still have the original data file that would clearly prove that these treeblocks must be spurious?

The 4,000+ orphaned treeblocks belong to study_id 22, which is not published, lacks a citation, and is owned by user "tb1" (which is for testing). Don't know how it acquired all these records. So no, I don't have any "original" data file, but then there is nothing original about this artifact.

> and 2) why would we not delete only the spurious treeblocks and sub_treeblocks, rather than the entire study?

It seemed easier to me to click once (delete study_id 22) rather than making 4,000+ mouse clicks to delete the treeblocks.

> Note that your join below won't work the way I think you intend it to work

Thanks -- I'll check it over. It seemed to work (in that doing a count(treeblock_id) produced the correct number of records).

bp
From: Rutger V. <rut...@gm...> - 2010-01-11 10:53:54
Just for context: study 22 is a dummy study that Mark Jason Dominus created when running the batch import Java programs the last time he ran them. The logic of the programs should have been that all dangling references to this study are updated, but apparently some of them were overlooked. In the same process, many dummy taxonlabels and taxonlabelsets with pointers to study 22 were created -- an issue which we have discussed elsewhere on the list.

On Fri, Jan 8, 2010 at 6:28 PM, William Piel <wil...@ya...> wrote:

> [quoted message snipped]

--
Dr. Rutger A. Vos
School of Biological Sciences
Philip Lyle Building, Level 4
University of Reading
Reading RG6 6BX
United Kingdom
Tel: +44 (0) 118 378 7535
http://www.nexml.org
http://rutgervos.blogspot.com
From: Hilmar L. <hl...@ne...> - 2010-01-08 19:01:31
I agree, deleting the study should work. Did you submit it as a bug report?

-hilmar

Sent from away

On Jan 8, 2010, at 1:28 PM, William Piel <wil...@ya...> wrote:

> [quoted message snipped]
From: William P. <wil...@ya...> - 2010-01-08 19:32:04
On Jan 8, 2010, at 2:01 PM, Hilmar Lapp wrote:

> I agree, deleting the study should work. Did you submit it as a bug report?

I'll do that. Give it a priority of "8"?

For those who want to test it: the user "tb1" (where study_id 22 sits) has the password "tb1". -- I won't put the password in the bug report, for obvious reasons.

bp
From: Hilmar L. <hl...@ne...> - 2010-01-08 23:12:20
On Jan 8, 2010, at 2:31 PM, William Piel wrote:

> On Jan 8, 2010, at 2:01 PM, Hilmar Lapp wrote:
>
>> I agree, deleting the study should work. Did you submit it as a bug report?
>
> I'll do that. Give it a priority of "8"?
>
> For those who want to test it: the user "tb1" (where study_id 22 sits) has the password "tb1". -- I won't put the password in the bug report, for obvious reasons.

Well, it's now in the mailing list archives, which are public, so you might as well put it into the bug report. Or change it ...

-hilmar

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- informatics.nescent.org :
===========================================================
From: William P. <wil...@ya...> - 2010-01-09 15:15:18
On Jan 8, 2010, at 6:12 PM, Hilmar Lapp wrote:

> which are public

whoops! good to know.

bp
From: William P. <wil...@ya...> - 2010-01-11 21:16:50
The citation metadata are now available here:

http://www.treebase.org/treebase/migration/Dec-09/citations_utf8.zip

Much of the metadata still needs looking up (e.g. volume, issue, pages, DOIs), but these can be fixed by our student help directly in the database after the migration is done.

This completes the data files required for the Jan09 thru Dec09 migration. Once the record-id-sequence problem is fixed, the migration scripts can be run.

Naturally, I continue to edit and accession new submissions to TreeBASE1, which means that we will need to do a small additional Jan2010 migration after the Dec09 migration is complete. But this should be easy since, fingers crossed, all teething issues re. migration will have been resolved by then.

bp
From: Ryan S. <rsc...@ne...> - 2010-01-12 19:07:04
Does this include the updates that Rosie did last fall? She filled out the majority of the volume, issue, pages, and DOIs for content that TreeBASE and Dryad have in common. The Endnote file she created is in Subversion, at trunk/treebase-curation/studyCitations.enl

-- Ryan

On Jan 11, 2010, at 4:16 PM, William Piel wrote:

> [quoted message snipped]
From: William P. <wil...@ya...> - 2010-01-12 19:52:15
I don't think so, because I believe she was working with the citation data up until Jan 2009. This Endnote file is for Jan 09 through Dec 09. But indeed, let's fuse Rosie's pre-09 data with this 2009 data and use the fused file to update all data in TreeBASE2.

bp

On Jan 12, 2010, at 2:06 PM, Ryan Scherle wrote:

> Does this include the updates that Rosie did last fall? She filled out the majority of the volume, issue, pages, and DOIs for content that TreeBASE and Dryad have in common. The Endnote file she created is in Subversion, at trunk/treebase-curation/studyCitations.enl
>
> [rest of quoted message snipped]
From: Hilmar L. <hl...@ne...> - 2010-01-12 19:57:21
Wouldn't that complicate the data import in some way? I'm OK if it doesn't, but if it does I think we may need to defer until after the release.

-hilmar

On Jan 12, 2010, at 2:52 PM, William Piel wrote:

> I don't think so, because I believe she was working with the citation data up until Jan 2009. This Endnote file is for Jan 09 through Dec 09. But indeed, let's fuse Rosie's pre-09 data with this 2009 data and use the fused file to update all data in TreeBASE2.
>
> bp
>
> [rest of quoted message snipped]

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- informatics.nescent.org :
===========================================================
From: William P. <wil...@ya...> - 2010-01-12 20:14:47
On Jan 12, 2010, at 2:57 PM, Hilmar Lapp wrote:

> Wouldn't that complicate the data import in some way? I'm OK if it doesn't, but if it does I think we may need to defer until after the release.

If the citation-update script can take the Endnote output and update 284 records, I don't see why it can't take a bigger Endnote output and update 2,350 records (which is the total number of citations now in TreeBASE1). I don't recall that this is a slow process, and it should have been designed for multiple and incremental updates.

If Mark didn't design it for multiple & incremental updating (e.g. if each time you run it it creates more and more duplicate author records), then yes, we should only do the 284-record update. Perhaps Vladimir can examine the code and confirm that it is designed for multiple runs of the same script on the same set of records.

bp
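Whether the script really is idempotent could be checked after a trial run with a duplicate count on the person table. A sketch -- the column names first_name, last_name, and email are assumptions about the schema:

    -- Authors that appear more than once under an identical name + email.
    SELECT first_name, last_name, email, count(*) AS copies
    FROM person
    GROUP BY first_name, last_name, email
    HAVING count(*) > 1
    ORDER BY copies DESC;

If a second run of the importer inflates these counts, it is not safe for incremental updates.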
From: Hilmar L. <hl...@ne...> - 2010-01-12 20:17:43
On Jan 12, 2010, at 3:14 PM, William Piel wrote:

> If Mark didn't design it for multiple & incremental updating (e.g. if each time you run it it creates more and more duplicate author records), then yes, we should only do the 284-record update.

That was exactly my concern, not the size of the file.

-hilmar

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- informatics.nescent.org :
===========================================================
From: Rutger V. <rut...@gm...> - 2010-01-18 11:10:10
Actually, I do seem to recall that the import program attached more and more authors over incremental runs.

On Tue, Jan 12, 2010 at 8:14 PM, William Piel <wil...@ya...> wrote:

> [quoted message snipped]

--
Dr. Rutger A. Vos
School of Biological Sciences
Philip Lyle Building, Level 4
University of Reading
Reading RG6 6BX
United Kingdom
Tel: +44 (0) 118 378 7535
http://www.nexml.org
http://rutgervos.blogspot.com
From: Vladimir G. <vla...@du...> - 2010-02-02 20:23:22
On Jan 11, 2010, at 4:16 PM, William Piel wrote:

> The citation metadata are now available here:
>
> http://www.treebase.org/treebase/migration/Dec-09/citations_utf8.zip

Bill, could you remind me what tools are available to handle this citation metadata? I just went through the emails since December and through the instructions at https://sourceforge.net/apps/mediawiki/treebase/index.php?title=DataDumps, and neither seemed to mention anything relevant. (I only see instructions for trees, characters, dump_utf8.txt, and taxon intelligence.)

Thanks,
--Vladimir
From: William P. <wil...@ya...> - 2010-02-02 22:05:12
On Feb 2, 2010, at 3:23 PM, Vladimir Gapeyev wrote:

> I just went through ... the instructions at https://sourceforge.net/apps/mediawiki/treebase/index.php?title=DataDumps

It should have been described there; I guess Mark J. D. forgot to add that step. Can this be inferred from code that Mark has written? (e.g. perhaps in his perl directory). Or do we need to contact Mark again?

I'll go over the basics as follows:

Step 1 - after the dump.txt has been processed, what you have is a set of study records and a set of author records (plus all the other tables: trees, matrices, etc). Each study record links to a citation record where (1) the entire citation is stored as a single string in the title field, and (2) the abstract is stored in the abstract field. Additionally, for each citation record there is a set of author records in which author first names are written out in full and author emails are included where available (I guess this would be in the "person" table) -- unfortunately the order of the authors is not known (or does not reflect the real order of the names in the associated paper).

Step 2 - the citations_utf8.zip file contains a tab-separated text file of all the citation data in much more granular form: i.e. different columns contain different fields -- title, journal, page numbers, etc. So the easy part is to update the citation table, so that instead of storing the entire citation in one field, all the bits are parsed out into separate fields. Since one column in the citations_utf8.zip file contains the legacy study_id, and a field in the study table also contains the legacy study_id, you can use the matching between these two to know which citation record to update with which row of data. That's the easy part.

Step 3 - the difficult part is to update the citation_author and citation_editor tables with the correct order of the authors (these are bridge tables between the person table and the citation table). The citations_utf8.zip file contains a column that lists the primary authors separated with semicolons, like so:

    "Lapp, H.; Piel, W. H.; Gapeyev, V."

while the person table has the following records, in no particular order, for example:

    1  William H. Piel    wil...@ya...
    2  Hilmar Lapp        hl...@ne...
    3  Vladimir Gapeyev   vla...@du...

What's needed is a script that separates the string of abbreviated names ("Lapp, H.; Piel, W. H.; Gapeyev, V.") at the semicolons, clips out the last name (i.e. the part before the comma), learns the order of the names, and then uses that order to rearrange the full names + email addresses by updating the citation_author bridge table. Likewise, the secondary authors column in citations_utf8.zip needs to be used to reorder the related records in the citation_editor bridge table. Unfortunately, it is not uncommon for two authors to have the same last name (e.g. the husband-and-wife team, Barbara and Mike Wingfield, have tons of records in TreeBASE), so in those cases you need to match the last names plus the first initial in order to know how to reorder the citation_author and citation_editor tables.

Some caveats:

1. It may be, in fact, that Mark designed his dump.txt parser to divine the author order from the full citation string, in which case the citations_utf8.zip step is trivial (you just stop at step 2 because the authors are already in the correct order). But I'm going to guess that he used citations_utf8.zip for the reordering of authors.

2. The person table is supposed to be a "one" table -- meaning each person gets one unique record. Unfortunately, the person table does not seem to store the legacy author_id from TreeBASE1, so there is not an obvious way to ensure that new publications by existing authors don't create duplicate person records. Please make sure that somewhere in Mark's migration scripts the author names are somehow matched. For example, if the first, last, and email fields match, they must be the same author.

3. I seem to remember that Mark's scripts created twice as many authors per publication (i.e. all authors were duplicated). This may have been fixed by running another script, instead of fixing the original bug. So we need to beware of this.

I will list the meaning of the columns in citations_utf8.zip below.

regards,
Bill

Here are the columns for citations_utf8.zip:

1. pub_type

The choices are: Book, Book Section, Conference Proceedings, Electronic Source, Journal Article, Thesis. (These are standard Endnote categories -- I think we use fewer, so we should treat "Conference Proceedings" as "Book Section", "Electronic Source" as "Journal Article", and "Thesis" as "Book" -- or something like that.)

2. author

These are the primary authors, listed like so: "Aanen, D. K.; Kuyper, T. W.; Boekhout, T.; Hoekstra, R. F."

3. year

All are given a year, even those that are "in press".

4. title

Primary title. This field comes with its own punctuation at the end.

5. s_author

Secondary authors (e.g. book editors). Same format as authors above.

6. s_title

Secondary title. For Journal Articles, this column holds the journal name (punctuation not included). For Book Sections, this holds the title of the book (punctuation included).

7. place_pub

Only for books and book sections.

8. publisher

Only for books and book sections.

9. volume

Update the citation table with this.

10. num_of_vols

This is an Endnote field that we don't use (no data in this column).

11. number

Update the citation table with this. (Same as the "issue" number.)

12. pages

Update the citation table with this.

13. section

This is an Endnote field that we don't use (no data in this column).

14. edition

Likewise, no data in this column.

15. isbn

Likewise, no data in this column.

16. label

This contains either nothing or "in press". The "in press" label means that the volume, number, and pages data are missing -- unfortunately over 1,000 records have this problem. Let's preserve this so that it can be searched on. Later, when we have some work-study students, we can have them search for "in press", look up the full citation, and update the records accordingly.

17. keywords

Needs updating in the citation table.

18. abstract

Probably not needed if the abstract field in the citation table already contains text. But if it doesn't, best to update with this version.

19. study_id

This is the legacy ID needed to match these rows with the correct study record.

20. url

If not empty, it contains the correct prefix (e.g. "http://") as needed.

21. doi

Does not contain an "http://" prefix -- i.e. it starts with 10. (etc).
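For step 3, the author order can be recovered in SQL before any bridge-table update. A minimal sketch using the sample string above (string_to_array, split_part, and generate_series are all available in the PostgreSQL 8.x series; the ordering column in citation_author is not named in this thread, so the final UPDATE is left out):

    -- Split an Endnote-style author string and recover each author's
    -- position, last name, and initials. The result can be joined against
    -- person on last name plus first initial to reorder citation_author.
    SELECT i AS author_order,
           trim(split_part(t.names[i], ',', 1)) AS last_name,
           trim(split_part(t.names[i], ',', 2)) AS initials
    FROM (SELECT string_to_array('Lapp, H.; Piel, W. H.; Gapeyev, V.', '; ') AS names) t,
         generate_series(1, 100) AS i  -- upper bound comfortably above any author count
    WHERE i <= array_upper(t.names, 1);

This yields (1, Lapp, H.), (2, Piel, W. H.), (3, Gapeyev, V.) -- exactly the ordering information the citation_author update needs.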
From: Rutger V. <rut...@gm...> - 2010-02-03 14:37:29
Data import is not handled by perl scripts but by Java programs. Step 1 is handled by org.cipres.treebase.util.AuxiliaryDataImporter. Steps 2 and 3 are handled by org.cipres.treebase.util.CitationDataImporter. Both are standalone Java programs (i.e. with a main() method). They, and related programs, are located in treebase-core/src/main/java/org/cipres/treebase/util/*, including a barebones package.html documentation file. These programs aren't yet documented on the wiki.

On Tue, Feb 2, 2010 at 11:05 PM, William Piel <wil...@ya...> wrote:

> [quoted message snipped]

--
Dr. Rutger A. Vos
School of Biological Sciences
Philip Lyle Building, Level 4
University of Reading
Reading RG6 6BX
United Kingdom
Tel: +44 (0) 118 378 7535
http://www.nexml.org
http://rutgervos.blogspot.com