Re: [Linkbat-devel] Re: question on moreinfo.data (Everyone please read)
From: Luan L. <lu...@ho...> - 2002-11-20 20:05:47
Hi All,

The separator ":" is good for now in my opinion, since the other candidate characters also appear in some of the other data files.

The Reason tag should be there and stay empty, not removed, because when we extract the XML into the DB, I guess we expect a REASON tag to be there.

Is it possible for the tyk.data file to be modified a little bit? As you mentioned, there are a couple of characters in there which cannot be converted into valid XML, such as the lone '&' on line 45, which needs to change to '&amp;'. The other '&' characters in the data file were already in converted form, so they are all right; only the one mentioned causes a problem. Also, I tried converting all the < and > into '&lt;' and '&gt;' respectively. Is that OK?

You mentioned that the Topics and TopicRef tags should be inserted into all the data files. Where do you think they should go? The new dyk.data.NEW you sent me would produce this:

<KnowledgeUnit>
  <Attributes>
    <Type>Concept</Type>
    <Text>Linux can be started from any partition.</Text>
  </Attributes>
  <Pages>
    <PageRef primary="true">106</PageRef>
  </Pages>
  <Questions>
  </Questions>
</KnowledgeUnit>

Where should the Topics and TopicRef go?

Thanks.

Best Regards
-Luu

>From: James Mohr <lin...@ji...>
>Reply-To: lin...@ji...
>To: lin...@li...
>Subject: [Linkbat-devel] Re: question on moreinfo.data (Everyone please read)
>Date: Tue, 19 Nov 2002 11:03:51 +0100
>
>(Note this was sent to the list.)
>
>Hey everyone, the conversion is almost done (well, at least the code for
>it). Thanks Luu! However, there are some important questions to answer
>NOW, before we continue. PLEASE, please, please read this and give me
>your input.
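On the escaping point in the reply above (the lone '&' and the angle brackets), here is a minimal sketch in Python of one way to escape a data field without double-escaping the ampersands that are already in converted form. The function name escape_field and the list of recognized entities are assumptions for illustration; this is not taken from the project's conversion scripts.

```python
import re

# An '&' that does NOT already start an entity such as &amp;, &lt; or &#38;
# (assumption: only the five predefined XML entities and numeric references
# occur in the data files).
_BARE_AMP = re.compile(r"&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_field(text):
    """Escape a text field for use as XML character data.

    Ampersands that are already part of an entity are left alone, so a
    file that is partly converted is not escaped twice.
    """
    text = _BARE_AMP.sub("&amp;", text)
    return text.replace("<", "&lt;").replace(">", "&gt;")


if __name__ == "__main__":
    # The first '&' is bare; the second is already converted.
    print(escape_field("cut & paste, then use &amp; or < and >"))
    # prints: cut &amp; paste, then use &amp; or &lt; and &gt;
```

Running this kind of pass over every field before it is written would avoid hand-editing the data file for each stray character, at the cost of assuming that anything that looks like an entity was meant as one.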
>
>On Tuesday 19 November 2002 00:31, Luan Luu wrote:
> > According to moreinfo.data, the format is:
> > ID#:TYPE:DESCRIPTION:LOCATION
> >
> > The XML is:
> >
> > <KnowledgeUnit>
> >   <Attributes>
> >     <Type sub-type="[TYPE]" location="[LOCATION]">MoreInfo</Type>
> >     <Text>[DESCRIPTION]</Text>
> >   </Attributes>
> > </KnowledgeUnit>
> >
> > Regarding the brackets, do they point to the type, location,
> > and description like that?
>
>Perfect. The only question is whether we should actually do it that way
>or not. That is, should the sub-type and location be attributes within
>the <Type> tag, or should they be separate tags, i.e. <SubType> and
><Location>?
>
>My gut feeling is that they should be attributes within the <Type> tag.
>They are not necessarily attributes of the KU, but rather provide
>additional info for the type.
>
>Comments anyone? This needs to be answered before we continue.
>
> > Is the URL an absolute path with the http in front?
>
>Yes. You will note that in the data file, they just begin with // and
>not http://. This was because I made an unwise decision to use the colon
>(:) as the field separator. However, the colon appears frequently in
>Linux (especially with URLs), so it became a problem.
>
>I was going to change to something like a pipe (|), which occurs less
>frequently. Regardless, we will have a problem, since the odds are that
>whatever character we use, it will appear in text somewhere in the data.
>
>Obviously this is not a problem when we import directly into a database.
>As I said to Shanta, I don't have a problem with importing the files
>directly into a database for the first release. However, eventually I
>want the system to be independent of the data source (CSV, database) and
>independent of the presentation (eXtropia, other portal). Therefore, we
>need to consider a new separator.
>
>Suggestions?
>
> > Inside the tyk XML question tags, there is a topicRef tag, which is the
> > reference of the PAGE_ID.
> > So, do you want to put the Page_id in the
> > topicRef tags, or the actual topic name from page.data?
>
>The TopicRef is a reference to a topic, such as Administration,
>Networking, Security, etc. This is just text. There are PageRef tags,
>and these contain the page *name* from page.data. However, I am no
>longer sure we should do it that way (see below). This needs to be
>changed in tykToXML.pl.
>
>Keep in mind that the questions will exist only within a KU. This KU
>will have a primary page, so we automatically have the primary page for
>the question. However, the question could reference multiple topics.
>
>We have a problem with some of the questions where an angle bracket is
>one of the answers ("What symbol is used to "pipe" two commands
>together?"). This means we have two angle brackets together (<< or >>),
>which could confuse the XML parser. There are only a few and we can
>change them by hand. We just need to be aware of them.
>
>Also watch the format of the answers, even for the T/F questions:
>
>  <Correct>
>    <Text>T</Text>
>    <Reason>why this answer is correct</Reason>
>  </Correct>
>
>not just
>
>  <Correct>T</Correct>
>
>I think that as much as possible it is better to have the same format
>for all types of questions. More than likely, "fill in the blank" type
>questions won't have an <Incorrect> answer, but I still want to have a
><Reason> tag to provide an explanation of why the answer is correct.
>
>However, since we do not yet have the reasons, I think you should simply
>leave it as <Reason></Reason>. I think if we leave the text as "put your
>reason why is correct/incorrect." we might forget to change it, and then
>displaying that text would look silly. If the <Reason> tag is empty, we
>can just ignore it. OR we could simply not include the <Reason> tag at
>this point. What do you think?
>
>With the Glossary KUs, please create a <GlossaryTerms> container with
>the GlossaryRefs to the other terms.
>These are the numbers at the end of each
>line in glossary.data. They are the ID numbers of the other glossary
>terms. Therefore, instead of reading in each line from glossary.data and
>processing it, you will need to read it all at once and put it into an
>array, then parse that array.
>
>EVERYONE PLEASE READ AND COMMENT:
>Currently the glossary.idx file contains a list of pages that contain
>each glossary item. This is created by an external script and is **not**
>done when the glossary item is loaded. That would take way too much
>time. The question is whether we should have PageRefs within the
>Glossary KU.
>
>Personally, I do not think so. We can create the index of
>glossary-page_id along with everything else. If we include the page
>ID/page name within the Glossary KU and add a new glossary item, then we
>would need to go looking for all of the pages that have that glossary
>term. Obviously we need to search for the pages to add the <Glossary>
>tag within the page. However, I just see it as unnecessary work to add
>PageRefs within the Glossary KU, since we can create the index by other,
>more efficient means.
>
>
>EVERYONE PLEASE READ AND COMMENT:
>It just hit me that we might be building a trap for ourselves. If we use
>the full path instead of the ID number, we will have problems if we ever
>rename the file, move it to a different directory, etc. I **expect** to
>be moving files to different directories real soon! I want to change the
>order of the files and their locations. As we get more content, I can
>imagine that we change locations again.
>
>I see three options:
>
>Use the full path as the PageRef:
>- Easy to find/insert the reference we want
>- Tracking down the actual page from a KU is easy
>- In the display code we don't need to do a look-up to display the page.
>- PROBLEM: moving/renaming the file. Since the XML files are text, we
>  can use sed/perl to make a global change.
>
>Use just the page name without the path:
>- Once the file name is defined, it is less likely that the page name
>  will be changed.
>- PROBLEM: We must have *completely unique* page names. We cannot have a
>  "Known Problems" in the Network section AND in the Printing section.
>  They must be named "Known Problems-Network" and "Known
>  Problems-Printing" (or something like that).
>
>Use the ID as the PageRef:
>- Remains constant, independent of the actual name of the file.
>- PROBLEM: Need to do a look-up to find the correct file. However, since
>  page.data is currently sorted by chapter/section, I have found that it
>  is not all that hard. For the existing moreinfo and DYK entries one
>  PageRef can be inserted automatically. Still, if we want to include
>  more PageRefs, we will have to do it by hand and look up the ID, but
>  we will have to look it up anyway to get the full path. So whether we
>  look up and insert the page name or the page ID, it's the same amount
>  of work.
>
>I still like the idea of using the full path and NOT an ID number. You
>need to do a look-up anyway to find the ID or the correct text for the
>full path. Tracking down the original page from the XML file is
>straightforward. Making a change would be a simple matter of running a
>sed/perl script. We could even write it in advance and it becomes a part
>of our "utility" package:
>
>rename_page.pl [-f filename] original_name new_name
>
>It then scans all PageRefs in the named file and changes them
>accordingly.
>
>Using an ID number bothers me because it makes the construct dependent
>on an external file, or we are imposing a structure on it unnecessarily.
>Therefore, the knowledge base is not self-contained.
>
>EVERYONE PLEASE READ AND COMMENT:
>We have a similar problem with the MoreInfoRefs for the Page KUs.
>Currently they are referenced by their ID number and Luu did the same
>thing in her code. However, once again, I am not happy with the idea of
>using ID numbers instead of text.
>So, do we reference the text of the MoreInfo KUs??
>
>I have pretty much decided to go through the existing data files and add
>up to three topics. I will add these to the **end** of each line for all
>of the data files. So, Luu, could you change the code to create a
><Topics> container and <TopicRefs> for all of the data files? Note that
>I will probably not list three topics for everything. Therefore, the
>code will need to be smart enough to recognize this. Since you are
>probably asleep already, I can work on it today and send you at least
>one file with the topics, so you will see the format.
>
>Regards,
>
>jimmo
>--
>---------------------------------------
>"Be more concerned with your character than with your reputation. Your
>character is what you really are while your reputation is merely what
>others think you are." -- John Wooden
>---------------------------------------
>Be sure to visit the Linux Tutorial: http://www.linux-tutorial.info
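For the rename utility proposed in the quoted message (rename_page.pl original_name new_name), the core step could look roughly like the sketch below. It is written in Python rather than Perl purely for illustration; the function name rename_page_refs is hypothetical, and it assumes PageRef values are stored as plain text between the tags, optionally with attributes such as primary="true" on the opening tag.

```python
import re


def rename_page_refs(xml_text, original_name, new_name):
    """Return (new_text, count) with matching PageRef values renamed.

    Only <PageRef> elements whose whole text equals original_name are
    changed; other tags and other PageRefs are left untouched.
    """
    pattern = re.compile(
        r"(<PageRef\b[^>]*>)\s*" + re.escape(original_name) + r"\s*(</PageRef>)"
    )
    # A function replacement keeps backslashes in new_name literal.
    return pattern.subn(lambda m: m.group(1) + new_name + m.group(2), xml_text)


if __name__ == "__main__":
    sample = (
        '<Pages><PageRef primary="true">Known Problems-Network</PageRef>'
        "<PageRef>Known Problems-Printing</PageRef></Pages>"
    )
    updated, changed = rename_page_refs(
        sample, "Known Problems-Network", "Known Problems-Networking"
    )
    print(changed)
    print(updated)
```

Because the match is anchored to the full element text, renaming "Known Problems-Network" would not touch "Known Problems-Printing" or a page whose name merely contains the old name. A real utility would still need to read and write the file named with -f and should probably report how many references it changed.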