biomed_XML_parsers - Browse Files at SourceForge.net


Name	Modified	Size
xmlToDatabase_source.zip	2015-12-09	49.3 kB
ctti_mysqlscript.txt	2015-12-07	13.2 kB
cttiparser.zip	2015-12-07	10.3 MB
README.txt	2015-11-13	2.7 kB
ClinicalXmlToDatabase-0.zip	2015-11-13	10.5 MB
clinicalparser.jar	2015-11-13	54.8 kB
Totals: 6 Items		21.0 MB

Data Source:

This parser has been designed to extract data and metadata derived from XML files downloaded from the following source:

https://www.clinicaltrials.gov/ct2/resources/download



Example Data:

In its current form, the parser properly handles most NCT XML records. A handful of such records are provided as examples in the subfolder xml within clinical.zip.

An investigation is underway to identify XML records that produce SQL exceptions during parsing.  These records will be stored in the subfolder xml_problems.



Configuration:

The file ClinTrialXmlToDatabase.ini must be modified to reflect the name and configuration of the database, as well as the location of files to be processed.



Source code:

All source files for non-generic code are provided in ClinicalXmlToDatabase-0.zip.

Compiled generic libraries are available as subfolders within the above zip file.



Jar file:

The all-inclusive executable is ./executations/clinicalparser.jar



Issues:

The key refinement required at this point is to identify strategies by which the parser can reliably handle all XML files whose processing currently produces SQL exceptions.  This can be done, but typically requires making key decisions regarding the SQL schema (in particular, table and column naming conventions).

Key decisions must be made for the following issues:

  - The default table and column naming convention produces unique names by having each name reflect the full hierarchical position of the entity in the XML.  While this produces a fairly unambiguous naming convention, it frequently yields names that exceed the 64-character limit imposed by MySQL.  In the current implementation, object names are truncated so that only the three right-most underscore ("_") characters and the fields they separate (that is, the last four underscore-separated fields) are kept.  For example, the table name:
   clinical_study_milestone_participants_list_participants
becomes:
   milestone_participants_list_participants
While this generally reduces object names to fewer than 64 characters, it is unknown whether this convention is compatible with existing schemas for clinical trials data storage.

  - Another issue is that some XML records have multiple tags whose labels are identical (e.g., multiple <group> tags) but whose content differs.  The parser will need to synchronize with existing schema specifications in order to find a consistent protocol for column specification that does not violate the SQL prohibition on having multiple columns with the same name in the same table.
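
The two naming issues above can be illustrated with a minimal Java sketch. It is not taken from the parser source: the class and method names (SqlNameSketch, keepRightmostFields, disambiguate) are hypothetical. The first method reproduces the truncation rule as described above. The second shows one possible way to avoid duplicate column names, by appending a numeric suffix to repeats. That suffix scheme is an assumption made for illustration only, not a decision made by this project.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative helpers only; all names here are hypothetical. */
final class SqlNameSketch {

    /**
     * Keeps the right-most `underscores` underscore characters of `name`
     * together with the fields they separate. Names that contain no more
     * than that many underscores are returned unchanged.
     */
    static String keepRightmostFields(String name, int underscores) {
        int idx = name.length();
        // Walk left past (underscores + 1) underscores; the last one found
        // is the cut point.
        for (int i = 0; i <= underscores; i++) {
            idx = name.lastIndexOf('_', idx - 1);
            if (idx < 0) {
                return name; // nothing to cut
            }
        }
        return name.substring(idx + 1);
    }

    /**
     * Assumed strategy: the first occurrence of a column name is kept as is,
     * and later occurrences get "_2", "_3", ... appended. This does not guard
     * against an input that already contains such a suffixed name.
     */
    static List<String> disambiguate(List<String> columns) {
        Map<String, Integer> seen = new HashMap<>();
        List<String> result = new ArrayList<>();
        for (String column : columns) {
            int count = seen.merge(column, 1, Integer::sum);
            result.add(count == 1 ? column : column + "_" + count);
        }
        return result;
    }
}
```

Calling keepRightmostFields("clinical_study_milestone_participants_list_participants", 3) yields milestone_participants_list_participants, matching the example above. Whether a suffix scheme such as the one in disambiguate is acceptable depends on the existing schema specifications mentioned above.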

Source: README.txt, updated 2015-11-13

biomed_XML_parsers Files

Parsers to extract biomedical XML and inject into SQL
