[Linkbat-devel] Re: questions (CSV to XML conversion)
From: James M. <lin...@ji...> - 2002-11-18 21:00:43
Hi Luan!

(For the rest of you, please take a look at this as well; I would like
some feedback.)

Wow! That was quick. First, note that I replied to the linkbat-devel
mailing list. I think all of these discussions should be on the list.
The rest of the comments are below.

On Monday 18 November 2002 21:20, Luan Luu wrote:
> 1. Do you want the code automatic create the output file, or we could
> manually in the command line to concatenate to new xml file?

I think you should just write to standard output. This gives us greater
flexibility to specify any file name we choose.

> 2. Only one top level is allowed, so, i put another <KUs> tag wrap around
> the <knowledgeUnit> tags. is that ok?

In the specification, I had defined the top level as being
<KnowledgeUnits> (with the 's' at the end). In general, I had it so that
the containers were all plurals, like <KnowledgeUnits> and <MoreInfos>.

> check the first one out, see if it is correct. let me know. thanks.

So far that looks great!!! However, a couple of things. The <PageRef>
should contain the file name out of the page.data file. These numbers
will more or less disappear once the conversion is made. Also, please
put in an empty line between the KnowledgeUnits.

What I was thinking about was going through the original data files and
adding topics. It would be fairly straightforward to define a handful of
topics (administration, network, security, users, hardware, etc.) and
add them to the CSV files. That would mean an extra field to parse, but
the code is pretty much written. In fact, I could add multiple topics
and, if the field is empty, just not print a topic tag.

How does that sound? I am thinking that we could save a fair bit of time
that way. If at least one (or maybe even 2 or 3) topics are already
present and can be added automatically, then we don't need to do it by
hand. Granted, sub-topics will probably need to be added later. But we
will have saved some work.
In the CSV state right now, it is a lot easier to add information in
bulk.

Comments, anyone? Are there other references that we could add in bulk
now?

Regards,

Jim

=================
NOTE: I snipped most of the stuff. This is just an example to see the
results and the original file.

<KnowledgeUnit>
  <Attributes>
    <Type>Concept</Type>
    <Text>Linux can be started from any partition.</Text>
  </Attributes>
  <Pages>
    <PageRef primary="true">106</PageRef>
  </Pages>
  <Questions>
  </Questions>
</KnowledgeUnit>
<KnowledgeUnit>
  <Attributes>
    <Type>Concept</Type>
    <Text>Linux can combine multiple drives into a single RAID system, even if the drives are of different types.</Text>
  </Attributes>
  <Pages>
    <PageRef primary="true">127</PageRef>
  </Pages>
  <Questions>
  </Questions>
</KnowledgeUnit>
<KnowledgeUnit>
  <Attributes>
    <Type>Concept</Type>
    <Text>Unwanted cron output can be redirected just like any other command.</Text>
  </Attributes>
  <Pages>
    <PageRef primary="true">21</PageRef>
  </Pages>
  <Questions>
  </Questions>
</KnowledgeUnit>
<KnowledgeUnit>
  <Attributes>
    <Type>Concept</Type>
    <Text>Cron is started through an rc-script like most system daemons. </Text>
  </Attributes>
  <Pages>
    <PageRef primary="true">67</PageRef>
  </Pages>
  <Questions>
  </Questions>
</KnowledgeUnit>
<KnowledgeUnit>
  <Attributes>
    <Type>Concept</Type>
    <Text>Cron can be disabled through the /etc/rc.config file.</Text>
  </Attributes>
  <Pages>
    <PageRef primary="true">67</PageRef>
  </Pages>
  <Questions>

1:106:Linux can be started from any partition.
2:127:Linux can combine multiple drives into a single RAID system, even if the drives are of different types.
3:21:Unwanted cron output can be redirected just like any other command.
4:67:Cron is started through an rc-script like most system daemons.
5:67:Cron can be disabled through the /etc/rc.config file.

#!/usr/bin/perl
# Convert the dyk.data into xml data.
use strict;
use warnings;

# Stop with a clear message if the data file cannot be read.
open(my $data, '<', 'dyk.data') or die "Cannot open dyk.data: $!";

my $TYPE = "Concept";    # hardcoded.

# Required, only one top level is allowed.
print "<KUs>\n";
while (my $READLINE = <$data>) {
    chomp $READLINE;
    # Limit the split to 3 fields so that colons inside the text survive.
    my ($ID, $PAGE_ID, $TEXT) = split(/:/, $READLINE, 3);
    # Print the output...
    print "  <KnowledgeUnit>\n";
    print "    <Attributes>\n";
    print "      <Type>$TYPE</Type>\n";
    print "      <Text>$TEXT</Text>\n";
    print "    </Attributes>\n";
    print "    <Pages>\n";
    print "      <PageRef primary=\"true\">$PAGE_ID</PageRef>\n";
    print "    </Pages>\n";
    print "    <Questions>\n";
    print "    \n";    # blank for now, insert the questions later.
    print "    </Questions>\n";
    print "  </KnowledgeUnit>\n";
}
print "</KUs>\n";
close $data;
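
The topic idea raised in the message (an extra field in the data file, several topics allowed, and no tag printed when the field is empty) could be sketched roughly as follows. Everything specific here is an assumption for illustration and not part of the Linkbat specification: the record layout ID:PAGE:TOPICS:TEXT, the comma as topic separator, the helper name topic_tags, and the <Topics>/<Topic> tag names (chosen only to follow the plural-container convention described above). The topic values attached to the sample records are made up as well.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn a comma-separated topic field into <Topic> tags.
# Returns an empty string for an empty field, so that no tag is printed.
sub topic_tags {
    my ($field) = @_;
    return '' unless defined $field;

    my @topics;
    for my $topic (split /,/, $field) {
        $topic =~ s/^\s+//;    # trim leading whitespace
        $topic =~ s/\s+$//;    # trim trailing whitespace
        push @topics, $topic if length $topic;
    }
    return '' unless @topics;

    my $xml = "    <Topics>\n";
    $xml .= "      <Topic>$_</Topic>\n" for @topics;
    $xml .= "    </Topics>\n";
    return $xml;
}

# Sample records in the assumed layout ID:PAGE:TOPICS:TEXT.
# The second record has an empty topic field.
my @records = (
    "1:106:administration,hardware:Linux can be started from any partition.",
    "3:21::Unwanted cron output can be redirected just like any other command.",
);

for my $record (@records) {
    # Limit the split to 4 fields so that colons inside the text survive.
    my ($id, $page, $topics, $text) = split /:/, $record, 4;
    print "  <KnowledgeUnit>\n";
    print "    <Text>$text</Text>\n";
    print topic_tags($topics);    # prints nothing when there are no topics
    print "  </KnowledgeUnit>\n";
}
```

Keeping the text as the last field is a design choice of this sketch: with a split limit, any colon that appears in the text itself stays part of the text instead of being read as a field separator.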