[Linkbat-devel] Data format, XML to DB conversion

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

Hi All!

I think we have reached the point where we need to be talking about the actual 
data/table structure in CSV text or a database. For now let's just work on 
the database model, to make things simpler. However, we all need to keep in 
mind that I eventually want to have the system work with CSV text files.

Rama has volunteered to work on the code to convert the XML files into 
something the database can handle. Shanta is working on the "presentation 
layer", so obviously you two will be working closely together. When the code 
is done for the CSV->XML conversion, I would like Luu to start working on the 
code to validate the XML files. That is, making sure the right tags are 
there, tags match, etc. 

Following are some thoughts and ideas about how the tables can be created. I 
have no particular affection for any of these ideas, so feel free to tell me 
they stink.

The first thing we need to decided for the data model is how to store the KUs 
themselves. Since different KU types have different attributes, I do not see 
a way of storing them all in a single table. Instead, I would see one table 
for each KU type and then a central table that contains the unique KU ID and 
then the KU type. When making a cross reference from one KU to another, we 
would then go through this central table. 

Next is the inter-relationship between the KUs, which is a core aspect of 
linkbat. There is a potential that each KU (regardless of type) can reference 
every other type of KU. However, I cannot really imagine an index for each 
relationship. For example, the Concept KU has an index listing all of the 
Concept KUs and with pointers to all of the referenced Glossary KUs.  Then 
another index with the MoreInfo KUs. In my mind that would mean too many 
tables. If there are five KU types and each has a index for the relationship 
to the others, then we would have 20 tables (5x4). With 6 KU tables, 30 
tables (6x5). Is my math right?

However, the question is whether 20 or 30 indexes is "too many".  Does having 
seperate indexes for each relationship provide any advantages? Quicker 
access? If so, does it compensate for the extra work to manage 20-30 tables??

Alternatively, I could imagine one index that keeps tracks of the type of KUs, 
as well:

UNIQUE_KU_ID:REFERENCED_TYPE:REFERENCED_KU_ID

UNIQUE_KU_ID - The ID of the current KU
REFERENCED_TYPE - The type of KU that this KU is referencing
REFERENCED_KU_ID - The ID of that KU that is being refereced. 

A query could be done sorted by UNIQUE_KU_ID  and then REFERENCED_TYPE. Once 
we have that, the code to present a list of referenced KUs grouped by KU type 
would be easy.  Currently we are only dealing with a few thousands KUs. 
However, what about when we get to tens of thousands?? Hopefully MySQL should 
be in a position to deal with that **few** records. So, I would not see it as 
a problem to have all of the KUs referenced in a single table like that. 

Is the REFERENCED_TYPE even necessary here? We obviously need a table that 
contains the attributes of each KU and a unique ID, so the REFERENCED_KU_ID 
can be used to get the type through the central table that I mentioned above 
and we have the KU type. On the other hand, to get a list of all referenced 
KUs in the correct order we could have a query like this:

select REFERENCED_KU_ID from WHATEVER_TABLE where UNIQUE_KU_ID = 
'CURRENT_KU_ID" sort by REFERENCED_TYPE

This is a single table access and is obviously faster than mutliple access. 
Therefore, I think this would more than make up for having an extra field in 
the index. 

Since we can expect that new data files will have mixed KU types, I am 
thinking that a single index which also contains the KU type would be more 
efficient. Each time we read a new KU, it is inserted into that single index, 
so we don't have to deal with mulitple indexes.  Comments?

Although not KUs in the true sense, I think we need an index by topic and 
skill level (if we decide to implement that). This would be used in 
generateing the technical FAQs or lists of all info on a particular topic. 

What about the reverse references. For example a list of all Concept KUs that 
reference a particular Glossary KU. This is not the same as the list of 
Glossary KUs that this Concept KU references. We click on a link that 
displays a Glossary KU, with a link to all the KUs that it references. I 
would also like links to all of the KUs that reference this glossary term. 
Although it seems logical that each Concept KU that references a particular 
Glossary KU should also be referenced by that Glossary KU, it is likely that 
we will forget the reverse references when create the XML files. Think about 
the glossary terms on the content pages. Each time we add a new page we have 
to update the referenced glossary KUs to point back to this page.

I'm sure there is a lot more, but I should get this off to start the 
discussion going.

Regards,

jimmo
-- 
---------------------------------------
"Be more concerned with your character than with your reputation. Your
character is what you really are while your reputation is merely what others
think you are." -- John Wooden
---------------------------------------
Be sure to visit the Linux Tutorial:  http://www.linux-tutorial.info