[openMSX-devel] Software IDs

Hi,

This was already discussed a while ago, but we never reached a conclusion.
It's about an addition to the software DB XML format used by blueMSX and
openMSX. I'd like to design something small and simple, which might not work
in 100% of the cases, but will at least allow us to get some things working.

A software ID is a string that uniquely identifies a piece of software. Using
the software ID, we can match other data with the software in the software
DB. For example:
- offer cheats for the game the user is currently playing
- in a front-end, show screenshots when browsing the list of available games
- offer the ability to look up more info about a software title on the web

In the XML format, it would look like this:
<software>
	<id type="genmsx">1234</id>
	...
</software>

From an XML perspective, these are the rules:
- a <software> entry can have any number of <id> tags
- for each <software> entry, each ID type should occur at most once
  (otherwise it wouldn't be a unique ID)
- the ID value is a string (a specific ID scheme might interpret it as, for
  example, an integer, but at the XML level it is a string)
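
As a rough illustration, the rules above could be checked with something like
the following Python sketch. The <softwaredb> root element and the function
names are assumptions made for the example; they are not part of the proposal.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def software_ids(software):
    """Map ID type -> ID value; values stay strings, as at the XML level."""
    return {e.get("type", ""): (e.text or "").strip()
            for e in software.findall("id")}

def duplicate_id_types(software):
    """Return the ID types that occur more than once in one <software> entry."""
    counts = Counter(e.get("type", "") for e in software.findall("id"))
    return sorted(t for t, n in counts.items() if n > 1)

# Hypothetical sample data; the root element name is an assumption.
SAMPLE = """
<softwaredb>
  <software>
    <id type="genmsx">1234</id>
  </software>
  <software>
    <id type="genmsx">1234</id>
    <id type="genmsx">5678</id>
  </software>
</softwaredb>
"""

root = ET.fromstring(SAMPLE)
entries = root.findall("software")
for entry in entries:
    # The second entry violates the "at most once per type" rule.
    print(software_ids(entry), duplicate_id_types(entry))
```

Note that software_ids() silently keeps only the last value when a type is
duplicated, which is one more reason to reject such entries up front.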

The XML format supports any number of ID schemes. However, as a policy we'll
use the "genmsx" type for MSX software, which means the IDs of the Generation
MSX software database. This is similar to the <hash> tags, where the XML
supports any hashing algorithm, but we decided to use only SHA1. blueMSX also
supports non-MSX software; they are free to set policies for the ID schemes
to be used for the other systems.

The motivation for using only one type of ID is that matching two data sets
becomes less efficient and less effective if multiple schemes are used:
- if there is no overlap in stored ID types, no match will be made, even
  though both data sets contain information about the same piece of software
- a lookup will have to be done for each ID type until a match is found,
  which will slow down the search for a match
- storing multiple IDs will increase the data size
- inconsistencies between ID schemes can occur: for example, one ID contains
  a typo and another does not, so depending on the order in which ID schemes
  are tried you will get the right or wrong result
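
To make the last point concrete, here is a small Python sketch of a lookup
that tries several ID schemes in order. The "otherdb" scheme, the IDs and the
record contents are all made up for the example.

```python
def lookup_multi(entry_ids, indexes, scheme_order):
    """Try each ID scheme in turn and return the first record found.

    entry_ids:    dict mapping scheme name -> ID string for one software entry
    indexes:      dict mapping scheme name -> {ID string -> record}
    scheme_order: the order in which schemes are tried
    """
    for scheme in scheme_order:
        key = entry_ids.get(scheme)
        record = indexes.get(scheme, {}).get(key)
        if record is not None:
            return record
    return None

# Hypothetical cheat data, indexed per ID scheme.
indexes = {
    "genmsx": {"1234": "cheats for Game A"},
    "otherdb": {"77": "cheats for Game A", "78": "cheats for Game B"},
}

# Game A's entry; its "otherdb" ID contains a typo (78 instead of 77).
game_a_ids = {"genmsx": "1234", "otherdb": "78"}

right = lookup_multi(game_a_ids, indexes, ["genmsx", "otherdb"])
wrong = lookup_multi(game_a_ids, indexes, ["otherdb", "genmsx"])
print(right)
print(wrong)
```

With a single scheme, the loop collapses to one dictionary lookup and the
order dependence disappears.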

The motivation for using Generation MSX IDs as the preferred ID scheme:
- they have a large number of titles in their DB already
- they have a lot of useful information about MSX software that we could
  link to at some point in the future
- we have the ability to add missing entries (at least Manuel can do that;
  if needed probably more people could get access)
- Sandy (Generation MSX admin) is interested in cooperating

Unresolved issues:

What exactly is one piece of software? In Generation MSX some games have
multiple releases, for example the original Japanese version and a Korean
version. Should we consider this as one or two entries in the software DB?

How are we going to get all the data in the software DB? About a year ago,
Patrick made a program to do a fuzzy match between our software DB and the
Generation MSX DB. I think it would be useful to repeat this process, since a
lot of titles have been added to Generation MSX in the last year. The titles
that cannot be automatically matched should either be matched by hand (if
the title is in GenMSX) or added to GenMSX. Any volunteers?
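
For whoever picks this up: the following is not Patrick's program, just a
minimal sketch of what a fuzzy title match could look like, using Python's
difflib. The titles, the IDs and the 0.85 cutoff are made up for the example.

```python
import difflib

def normalize(title):
    """Lowercase and collapse whitespace so trivial differences don't matter."""
    return " ".join(title.lower().split())

def fuzzy_match(our_titles, genmsx_titles, cutoff=0.85):
    """Match our titles against a {title: id} dict of Generation MSX entries.

    Returns (matched, unmatched): matched maps our title -> GenMSX ID,
    unmatched lists the titles that need to be handled by hand.
    """
    by_normalized = {normalize(t): gid for t, gid in genmsx_titles.items()}
    candidates = list(by_normalized)
    matched, unmatched = {}, []
    for title in our_titles:
        best = difflib.get_close_matches(
            normalize(title), candidates, n=1, cutoff=cutoff)
        if best:
            matched[title] = by_normalized[best[0]]
        else:
            unmatched.append(title)
    return matched, unmatched

# Hypothetical data: neither the titles nor the IDs are real GenMSX entries.
ours = ["Example Racer", "Exampel Quest", "Unlisted Demo"]
genmsx = {"Example Racer": "1001", "Example Quest": "1002"}

matched, unmatched = fuzzy_match(ours, genmsx)
print(matched)
print(unmatched)
```

A lower cutoff matches more titles automatically but also produces more
wrong matches, so the automatic results would still need a quick review.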

Bye,
		Maarten