[Linkbat-devel] Re: question on moreinfo.data (Everyone please read)
From: James M. <lin...@ji...> - 2002-11-19 08:34:44
(Note: this was sent to the list.)

Hey everyone,

The conversion is almost done (well, at least the code for it). Thanks, Luu! However, there are some important questions to answer NOW, before we continue. PLEASE, please, please read this and give me your input.

On Tuesday 19 November 2002 00:31, Luan Luu wrote:
> according the the moreinfo.data, the format:
> ID#:TYPE:DESCRIPTION:LOCATION
>
> the XML are
>
> <KnowledgeUnit>
> <Atrributes>
> <Type sub-type="[TYPE]" location="[LOCATION]">MoreInfo</Type>
> <Text>[DESCRIPTION]</Text>
> </Atrributes>
> </KnowledgeUnit>
>
> In the reference to the brackets, is the pointer the the type, location,
> and description like that?

Perfect. The only question is whether we should actually do it that way or not. That is, should the sub-type and location be attributes within the <Type> tag, or should they be separate tags, i.e. <SubType> and <Location>? My gut feeling is that they should be attributes within the <Type> tag. They are not necessarily attributes of the KU, but rather provide additional info for the type. Comments, anyone? This needs to be answered before we continue.

> is the url be absolute path with the http infront right?

Yes. You will note that in the data file they just begin with // and not http://. This is because I made an unwise decision to use the colon (:) as the field separator. However, the colon appears frequently in Linux (especially in URLs), so it became a problem. I was going to change to something like a pipe (|), which occurs less frequently. Regardless, we will have a problem, since the odds are that whatever character we use will appear in text somewhere in the data. Obviously this is not a problem when we import directly into a database. As I said to Shanta, I don't have a problem with importing the files directly into a database for the first release.
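
To make the mapping and the colon problem concrete, here is a minimal Python sketch (not the project's actual conversion code) that turns one colon-separated moreinfo.data line into the KnowledgeUnit XML quoted above. The function name, the sample TYPE value "URL", and the example.com address are made up for illustration. It assumes LOCATION is the last field and contains no colon of its own (which holds only because the "http:" prefix was dropped), and it spells the Atrributes tag exactly as in the quoted message. The ID# field is dropped, since the quoted XML has no place for it.

```python
import xml.etree.ElementTree as ET


def moreinfo_line_to_xml(line: str) -> str:
    """Convert one 'ID#:TYPE:DESCRIPTION:LOCATION' line to KnowledgeUnit XML."""
    # The first two fields are split from the left, the location from the
    # right, so colons inside the description survive (an assumption about
    # the data, not something the file format guarantees).
    _ident, sub_type, rest = line.rstrip("\n").split(":", 2)
    description, location = rest.rsplit(":", 1)

    # The data file stores URLs as //host/path; restore the scheme.
    if location.startswith("//"):
        location = "http:" + location

    ku = ET.Element("KnowledgeUnit")
    # Tag spelling follows the quoted message.
    attrs = ET.SubElement(ku, "Atrributes")
    type_el = ET.SubElement(
        attrs, "Type", {"sub-type": sub_type, "location": location}
    )
    type_el.text = "MoreInfo"
    # ElementTree escapes <, > and & in text when serializing.
    ET.SubElement(attrs, "Text").text = description
    return ET.tostring(ku, encoding="unicode")


# Hypothetical sample line; the description contains a colon and angle brackets.
sample = "12:URL:Pipes: using < and >://www.example.com/pipes.html"
print(moreinfo_line_to_xml(sample))
```

Splitting from both ends only works while the first two fields and the location are colon-free, so it is a workaround for the existing files rather than an answer to the separator question.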
However, eventually I want the system to be independent of the data source (CSV, database) and independent of the presentation (eXtropia, other portal). Therefore, we need to consider a new separator. Suggestions?

> Inside the tyk xml question tags, there is a topicRef tag, which is the
> reference of the PAGE_ID. So, do you want to put the Page_id in the
> topicRef tags or the actual topic name in the page.data ?

The TopicRef is a reference to a topic, such as Administration, Networking, Security, etc. This is just text. There are also PageRef tags, and these contain the page *name* from page.data. However, I am no longer sure we should do it that way (see below). This needs to be changed in tykToXML.pl.

Keep in mind that the questions will exist only within a KU. This KU will have a primary page, so we automatically have the primary page for the question. However, the question could reference multiple topics.

We have a problem with some of the questions where an angle bracket is one of the answers ("What symbol is used to "pipe" two commands together?"). This means we have two angle brackets together (<< or >>), which could confuse the XML parser. There are only a few, and we can change them by hand. We just need to be aware of them.

Also watch the format of the answers, even for the T/F questions:

<Correct>
  <Text>T</Text>
  <Reason>why this answer is correct</Reason>
</Correct>

not just

<Correct>T</Correct>

I think that, as much as possible, it is better to have the same format for all types of questions. More than likely, "fill in the blank" type questions won't have an <Incorrect> answer, but I still want to have a <Reason> tag to provide an explanation of why the answer is correct. However, since we do not yet have the reasons, I think you should simply leave it as <Reason></Reason>. I think if we leave the text as "put your reason why is correct/incorrect." we might forget to change it, and then displaying that text would look silly.
If the <Reason> tag is empty, we can just ignore it. OR we could simply not include the <Reason> tag at this point. What do you think?

With the Glossary KUs, please create a <GlossaryTerms> container with the GlossaryRefs to the other terms. These are the numbers at the end of each line in glossary.data. They are the ID numbers of the other glossary terms. Therefore, instead of reading in each line from glossary.data and processing it, you will need to read it all at once and put it into an array, then parse that array.

EVERYONE PLEASE READ AND COMMENT:
Currently the glossary.idx file contains a list of pages that contain each glossary item. This is created by an external script and is **not** done when the glossary item is loaded. That would take way too much time. The question is whether we should have PageRefs within the Glossary KU. Personally, I do not think so. We can create the index of glossary-page_id along with everything else. If we include the page ID/page name within the Glossary KU and add a new glossary item, then we would need to go looking for all of the pages that have that glossary term. Obviously we need to search for the pages to add the <Glossary> tag within the page. However, I just see it as unnecessary work to add PageRefs within the Glossary KU, since we can create the index by other, more efficient means.

EVERYONE PLEASE READ AND COMMENT:
It just hit me that we might be building a trap for ourselves. If we use the full path instead of the ID number, we will have problems if we ever rename the file, move it to a different directory, etc. I **expect** to be moving files to different directories real soon! I want to change the order of the files and their locations. As we get more content, I can imagine that we will change locations again.

I see three options:

Use the full path as the PageRef:
- Easy to find/insert the reference we want.
- Tracking down the actual page from a KU is easy.
- In the display code we don't need to do a look-up to display the page.
- PROBLEM: moving/renaming the file. Since the XML files are text, we can use sed/perl to make a global change.

Use just the page name without the path:
- Once the file name is defined, it is less likely that the page name will be changed.
- PROBLEM: We must have *completely unique* page names. We cannot have a "Known Problems" in the Network section AND in the Printing section. They must be named "Known Problems-Network" and "Known Problems-Printing" (or something like that).

Use the ID as the PageRef:
- Remains constant, independent of the actual name of the file.
- PROBLEM: Need to do a look-up to find the correct file. However, since page.data is currently sorted by chapter/section, I have found that it is not all that hard.

For the existing moreinfo and DYK entries, one PageRef can be inserted automatically. Still, if we want to include more PageRefs, we will have to do it by hand and look up the ID, but we would have to look it up anyway to get the full path. So whether we look up and insert the page name or the page ID, it's the same amount of work.

I still like the idea of using the full path and NOT an ID number. You need to do a look-up anyway to find the ID or the correct text for the full path. Tracking down the original page from the XML file is straightforward. Making a change would be a simple matter of running a sed/perl script. We could even write it in advance so that it becomes a part of our "utility" package:

rename_page.pl [-f filename] original_name new_name

It then scans all PageRefs in the named file and changes them accordingly.

Using an ID number bothers me because it makes the construct dependent on an external file, or we are imposing a structure on it unnecessarily. Therefore, the knowledge base would not be self-contained.

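To give a feel for how small the rename utility could be, here is a rough sketch of its core, written in Python purely for illustration (the real thing would presumably be the sed/perl script mentioned above). The function name is made up, and it assumes each reference is written as <PageRef>full/path</PageRef> with the path as plain element text; the page paths in the usage lines are hypothetical.

```python
import re
from xml.sax.saxutils import escape


def rename_page_refs(xml_text: str, original_name: str, new_name: str):
    """Replace every PageRef equal to original_name; return (text, count)."""
    # Match the whole element so that a page name appearing in ordinary
    # text, or as part of a longer path, is left alone.
    pattern = re.compile(
        r"(<PageRef>\s*)"
        + re.escape(escape(original_name))
        + r"(\s*</PageRef>)"
    )
    # A function replacement avoids backslash surprises in new_name.
    return pattern.subn(
        lambda m: m.group(1) + escape(new_name) + m.group(2), xml_text
    )


# Hypothetical KU fragment with two references.
doc = (
    "<KnowledgeUnit>"
    "<PageRef>Networking/Known Problems</PageRef>"
    "<PageRef>Printing/Known Problems</PageRef>"
    "</KnowledgeUnit>"
)
updated, changed = rename_page_refs(
    doc, "Networking/Known Problems", "Network/Known Problems"
)
print(changed)
print(updated)
```

Returning the number of changed references makes it easy to spot a rename that matched nothing, which would usually mean a typo in the original name.
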
EVERYONE PLEASE READ AND COMMENT:
We have a similar problem with the MoreInfoRefs for the Page KUs. Currently they are referenced by their ID number, and Luu did the same thing in her code. However, once again, I am not happy with the idea of using ID numbers instead of text. So, do we reference the text of the MoreInfo KUs??

I have pretty much decided to go through the existing data files and add up to three topics. I will add these to the **end** of each line for all of the data files. So, Luu, could you change the code to create a <Topics> container and <TopicRefs> for all of the data files? Note that I will probably not list three topics for everything. Therefore, the code will need to be smart enough to recognize this. Since you are probably asleep already, I can work on it today and send you at least one file with the topics, so you will see the format.

Regards,

jimmo

--
---------------------------------------
"Be more concerned with your character than with your reputation. Your character is what you really are while your reputation is merely what others think you are." -- John Wooden
---------------------------------------
Be sure to visit the Linux Tutorial: http://www.linux-tutorial.info