From: Maarten t. H. <ma...@tr...> - 2007-07-14 22:48:54
Hi,

This was already discussed a while ago, but we never reached a conclusion.
It's about an addition to the software DB XML format used by blueMSX and
openMSX. I'd like to design something small and simple, which might not work
in 100% of the cases, but will at least allow us to get some things working.

A software ID is a string that uniquely identifies a piece of software. Using
the software ID, we can match other data with the software in the software
DB. For example:
- offer cheats for the game the user is currently playing
- in a front-end, show screenshots when browsing the list of available games
- offer the ability to look up more info about a software title on the web

In the XML format, it would look like this:

<software>
  <id type="genmsx">1234</id>
  ...
</software>

From an XML perspective, these are the rules:
- a <software> entry can have any number of <id> tags
- for each <software> entry, each ID type should occur at most once (otherwise
  it wouldn't be a unique ID)
- the ID value is a string (a specific ID scheme might interpret it as for
  example an integer, but at the XML level it is a string)

The XML format supports any number of ID schemes. However, as a policy we'll
use the "genmsx" type for MSX software, which means the IDs of the Generation
MSX software database. This is similar to the <hash> tags, where the XML
supports any hashing algorithm, but we decided to use only SHA1. blueMSX also
supports non-MSX software; they are free to set policies for the ID schemes
to be used for the other systems.
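
To make the rules above concrete, here is a rough Python sketch of how a tool could check the "each ID type at most once" rule and build a lookup table keyed on the "genmsx" ID. This is only an illustration, not how blueMSX or openMSX read the DB: the <softwaredb> root element, the <title> tag, the sample entries and the function names are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: the <softwaredb> root and <title> tag are
# assumptions for this sketch; only <software> and <id type="..."> come
# from the proposal itself.
SAMPLE = """
<softwaredb>
  <software>
    <title>Example Game</title>
    <id type="genmsx">1234</id>
  </software>
  <software>
    <title>Broken Entry</title>
    <id type="genmsx">42</id>
    <id type="genmsx">43</id>
  </software>
</softwaredb>
"""


def collect_ids(software):
    """Return {id_type: id_value} for one <software> element.

    Raises ValueError if an ID type occurs more than once, since the ID
    would then no longer be unique for that entry.
    """
    ids = {}
    for id_elem in software.findall("id"):
        id_type = id_elem.get("type")
        if id_type in ids:
            raise ValueError(f"duplicate ID type {id_type!r}")
        # At the XML level the value stays a string, even if it looks numeric.
        ids[id_type] = (id_elem.text or "").strip()
    return ids


def build_index(root, id_type="genmsx"):
    """Map ID value -> <software> element for a single ID scheme.

    Entries that break the uniqueness rule are skipped and reported.
    """
    index = {}
    errors = []
    for software in root.iter("software"):
        title = software.findtext("title", default="?")
        try:
            ids = collect_ids(software)
        except ValueError as exc:
            errors.append(f"{title}: {exc}")
            continue
        if id_type in ids:
            index[ids[id_type]] = software
    return index, errors


root = ET.fromstring(SAMPLE)
index, errors = build_index(root)
```

With one agreed ID type, matching external data (cheats, screenshots, web links) against the DB is a single dictionary lookup such as index["1234"], rather than one lookup per ID scheme.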

The motivation for using only one type of ID is that matching two data sets
becomes less efficient and less effective if multiple schemes are used:
- if there is no overlap in stored ID types, no match will be made, even
  though both data sets contain information about the same piece of software
- a lookup will have to be done for each ID type until a match is found, this
  will slow down the search for a match
- storing multiple IDs will increase the data size
- inconsistencies between ID schemes can occur, for example one ID contains a
  typo and another does not, so depending on the order in which ID schemes are
  tried you will get the right or wrong result

The motivation for using Generation MSX IDs as the preferred ID scheme:
- they have a large number of titles in their DB already
- they have a lot of useful information about MSX software that we could link
  to at some point in the future
- we have the ability to add missing entries (at least Manuel can do that; if
  needed probably more people could get access)
- Sandy (Generation MSX admin) is interested in cooperating

Unresolved issues:

What exactly is one piece of software? In Generation MSX some games have
multiple releases, for example the original Japanese version and a Korean
version. Should we consider this as one or two entries in the software DB?

How are we going to get all the data in the software DB? About a year ago,
Patrick made a program to do a fuzzy match between our software DB and the
Generation MSX DB. I think it would be useful to repeat this process, since a
lot of titles have been added to Generation MSX in the last year. The titles
that can not be automatically matched should either be matched by hand (if
the title is in GenMSX) or added to GenMSX. Any volunteers?

Bye,
Maarten