From: Kym F. (Chess) <che...@in...> - 2003-01-25 22:37:59
The key issue is the ability to spellcheck and normalize the player names.

I have 3,091,076 games in a single DB. I have taken a very aggressive strategy to removing duplicates (it was 3.8 million games :-). I'm still getting some duplicates.

Shane helped with a script to remove (XXX), where XXX is the country, from the names. (Previously posted - see archive.)

The trouble is that if you tweak the code to match zero characters of the player name, you end up with a full cartesian join to check for dupes. 3,091,076 x 3,091,076 = don't bother. (Well, not quite that bad: 3,091,075 + 3,091,074 + ... + 1, about half as many, but you get the picture.)

Maybe there needs to be a central repository of spellcheck results that can have fixes added over time??

Grant wrote:
| Subject: [Scid-users] Database clean up
|
| I, like many others, am trying to sort through
| and clean up my database of 1 million games.
| To standardize the
| names of a database this size is a daunting task
| which I accept, but in order for Scid to recognize a
| duplicate game the players' names have to be the same
| (or at least the first 4 letters). Therefore if I have
| two identical games (J Smith vs J Jones and Smith, J vs
| Jones, J) I must first correct the format of their
| names before Scid can detect a twin. A duplicated
| game serves no purpose in my database, but to find and
| delete them all is going to take a long time. If there
| are others that would agree with these sentiments then
| the facility to delete duplicated games regardless
| of player names would be a welcome addition to what is
| already an excellent piece of software.
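
For what it's worth, the cartesian join can be avoided by bucketing games on a key instead of comparing every pair. Below is a rough Python sketch of that idea, not how Scid does it: the `(game_id, white, black, moves)` tuple layout and the function names `normalize_name`, `move_key` and `find_duplicate_groups` are all made up for illustration, and the name rule (surname plus first initial, country suffix dropped) is only an assumption about what "normalized" should mean.

```python
import hashlib
import re
from collections import defaultdict

# Assumption: a trailing "(XXX)" of 2-3 letters is a country tag to drop.
COUNTRY_SUFFIX = re.compile(r"\s*\([A-Za-z]{2,3}\)\s*$")
# Move numbers such as "1." or "12..." carry no information about the game.
MOVE_NUMBER = re.compile(r"\d+\.+\s*")


def normalize_name(name):
    """Map 'J Smith', 'Smith, J' and 'Smith, J (ENG)' to the same key."""
    name = COUNTRY_SUFFIX.sub("", name.strip())
    if "," in name:
        # 'Surname, Given' order
        surname, _, given = name.partition(",")
    else:
        # Assumption: 'Given Surname' order, last token is the surname
        parts = name.split()
        if not parts:
            return ""
        surname, given = parts[-1], " ".join(parts[:-1])
    initial = given.strip()[:1]
    return f"{surname.strip().lower()},{initial.lower()}"


def move_key(moves):
    """Hash the move text with move numbers and extra spacing removed."""
    tokens = MOVE_NUMBER.sub("", moves).split()
    return hashlib.sha1(" ".join(tokens).encode("utf-8")).hexdigest()


def find_duplicate_groups(games):
    """One pass over the games; only games sharing a bucket are candidates."""
    buckets = defaultdict(list)
    for game_id, white, black, moves in games:
        buckets[move_key(moves)].append(game_id)
    return [ids for ids in buckets.values() if len(ids) > 1]


# Hypothetical sample data, echoing the example from Grant's post.
games = [
    (1, "J Smith", "J Jones", "1. e4 e5 2. Nf3 Nc6"),
    (2, "Smith, J", "Jones, J", "1.e4 e5 2.Nf3 Nc6"),
    (3, "A Player", "B Player", "1. d4 d5"),
]
duplicate_groups = find_duplicate_groups(games)
```

Identical moves are only a hint that two games are twins (short draws by different players can share the same moves), so a group found this way would still need checking, for example by comparing `normalize_name` of the players within each bucket.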