[open-stats-talk] DB Files Uploaded
Status: Planning
Brought to you by:
siddfinch
From: Mat K. <me...@me...> - 2007-01-05 17:02:40
|
I have uploaded database dumps for MySQL (http://www.mysql.org) and Postgres (http://www.postgresql.org). They should import without problems into current (and previous) releases of both databases on Windows, Mac OS X, and Unix systems. I have also uploaded a binary Sqlite3 (http://www.sqlite.org) database. This database was created with Sqlite 3.3.8 but should work with all versions of Sqlite3. Files are available for download from Sourceforge.net <http://sourceforge.net/project/showfiles.php?group_id=184914&package_id=217079&release_id=476073> MyTurl: <http://myturl.com/08QMm> Files have been zipped with gzip. (Learn more about gzip from http://www.gzip.org/). If you are running Windows you'll need to have a unzipper that works with gzip. If you are a curious Windows user and want to play around with the data a bit, but don't feel comfortable running a database, I would recommend using the Sqlite3 database. Here is a step by step example. I store all the files in a folder 'c:\retrosheet'. 1. Download the Retrosheet Sqlite3 database: http://easynews.dl.sourceforge.net/sourceforge/open-stats/retrosheet-sqlite3-20070101.db.gz 2. Make you can unzip the .gz file: ``PowerArchiver 6.1, 7-zip and Winzip include the gzip compression code and can decompress .gz and tar.gz files. Win-GZ can compress and decompress files in gzip format. Please note that gzip, 7-zip, PowerArchiver 6.1 and Win-GZ are freeware but you must register Winzip and PowerArchiver > 6.1 if you use them regularly.'' Personally I recommend 7-zip <http://www.7-zip.org>, but if you have WinZip that will work fine. 3. Download Sqlite3, unzip it and stop the files in c:\retrosheet http://www.sqlite.org/sqlite-3_3_9.zip With the extracted retrosheet Sqlite database sqlite3.exe in the c:\retrosheet directory. From the start menu, goto Run... and type cmd.exe in the dialog box, you'll pull up a command line window. Type cd c:\retrosheet then type dir, and it should look like this: Microsoft Windows XP [Version 5.1.2600] (C) Copyright 1985-2001 Microsoft Corp. C:\Documents and Settings\Administrator>cd c:\retrosheet C:\retrosheet>dir Volume in drive C has no label. Volume Serial Number is 8723-79C9 Directory of C:\retrosheet 01/05/2007 09:57 AM <DIR> . 01/05/2007 09:57 AM <DIR> .. 01/04/2007 10:55 AM 4,461,889,536 retrosheet-sqlite3-20070101.db 01/04/2007 12:01 PM 400,589 sqlite3.exe 2 File(s) 4,462,290,125 bytes 2 Dir(s) 9,422,229,504 bytes free Now lets open the database. As an example, let's pull out the six grand slams that Tribe DH Travis Hafner hit in 2006 and list the pitchers he hit them off and display the pitch sequence, if available. To do this we'll use a the programming language, SQL, available to us in Sqlite. First we'll find out what Retrosheet's id for Mr. Hafner. We'll need to know a bit about the table rosters first. Here schema gives us the "recipe" for the table. The first column is the name and the second column is the type of data we can expect to find for that name. What we'll be concerned about it yearid, retroid, firstname and lastname. The schema tells us that yearid will be an integer (a number without a decimal point) while retroid and lastname will be characters (letters, number, symbols, etc.) Varying just means the length will be no large than the number in between (..). You don't need to worry about that part. We want to select the retroid from the rosters table where the last name is Hafner and the year is 2006. We'll include the first name in case there are two Hafners and we want to make sure we get the right one. To make things a bit easier, we'll tell Sqlite to print headers on the date and then do some searching in the rosters table: **Note: When using the where command, Character items need to have a single quote around them, numbers to not. **Note: You need to end the command with a semicolon, that will tell Sqlite to run the command. :\retrosheet>sqlite3 retrosheet-sqlite3-20070101.db SQLite version 3.3.9 Enter ".help" for instructions sqlite> .tables boxscores events eventtypes games rosters teams sqlite> .schema rosters CREATE TABLE rosters ( yearid INTEGER, retroid CHARACTER VARYING(8), lastname CHARACTER VARYING(25), firstname CHARACTER VARYING(25), bats CHARACTER VARYING(1), throws CHARACTER VARYING(1), team CHARACTER VARYING(3), pos CHARACTER VARYING(5) ); CREATE INDEX rosters_retroid_idx ON rosters (retroid); sqlite> .header on sqlite> select retroid,yearid,firstname,lastname from rosters where lastname = 'Hafner' and yearid = 2006; retroid|yearid|firstname|lastname hafnt001|2006|Travis|Hafner Travis Hafner's retroid is hafnt001. To show you why we include the first name, let's look for all the Smith's in the roster table for 2006. sqlite> select retroid,yearid,firstname,lastname from rosters where lastname = 'Smith' and yearid = 2006; retroid|yearid|firstname|lastname smitj001|2006|Jason|Smith smitm006|2006|Matt|Smith smitt001|2006|Travis|Smith smitm006|2006|Matt|Smith smitm005|2006|Mike|Smith As you can see, including the firstname item can be helpful. To look at Hafner's grand slams, we need to look at the events table. This is a really big table. It holds all the event information for each game from 1957 too 2006, excluding 1999. This over 7 million records. sqlite> .schema events CREATE TABLE events ( gameid CHARACTER VARYING(12), visitingteam CHARACTER VARYING(3), inning INTEGER, battingteam INTEGER, outs INTEGER, balls INTEGER, strikes INTEGER, pitchsequence CHARACTER VARYING(40), visscore INTEGER, homescore INTEGER, batter CHARACTER VARYING(8), batterhand CHARACTER VARYING(1), resbatter CHARACTER VARYING(8), resbatterhand CHARACTER VARYING(1), pitcher CHARACTER VARYING(8), pitcherhand CHARACTER VARYING(1), respitcher CHARACTER VARYING(8), respitcherhand CHARACTER VARYING(1), catcher CHARACTER VARYING(8), firstbase CHARACTER VARYING(8), secondbase CHARACTER VARYING(8), thirdbase CHARACTER VARYING(8), shortstop CHARACTER VARYING(8), leftfield CHARACTER VARYING(8), centerfield CHARACTER VARYING(8), rightfield CHARACTER VARYING(8), firstrunner CHARACTER VARYING(8), secondrunner CHARACTER VARYING(8), thirdrunner CHARACTER VARYING(8), eventtext CHARACTER VARYING(100), leadoffflag CHARACTER VARYING(1), pinchhitflag CHARACTER VARYING(1), defensiveposition CHARACTER VARYING(2), lineupposition INTEGER, eventtype INTEGER, battereventflag CHARACTER VARYING(1), abflag CHARACTER VARYING(1), hitvalue INTEGER, shflag character varying(1), sfflag character varying(1), outsonplay INTEGER, doubleplayflag CHARACTER VARYING(1), tripleplayflag CHARACTER VARYING(1), rbionplay INTEGER, wildpitchflag CHARACTER VARYING(1), passedballflag CHARACTER VARYING(1), fieldedby INTEGER, battedballtype CHARACTER VARYING(1), buntflag CHARACTER VARYING(1), foulflag CHARACTER VARYING(1), hitlocation CHARACTER VARYING(5), numerrors INTEGER, firsterrorplayer CHARACTER VARYING(8), firsterrortype CHARACTER VARYING(1), seconderrorplayer CHARACTER VARYING(8), seconderrortype CHARACTER VARYING(1), thirderrorplayer CHARACTER VARYING(8), thirderrortype CHARACTER VARYING(1), batterdest INTEGER, runneronfirstdest INTEGER, runneronseconddest INTEGER, runneronthirddest INTEGER, playonbatter CHARACTER VARYING(8), playonrunneronfirst CHARACTER VARYING(15), playonrunneronsecond CHARACTER VARYING(15), playonrunneronthird CHARACTER VARYING(15), sbforrunneronfirstflag CHARACTER VARYING(1), sbforrunneronsecondflag CHARACTER VARYING(1), sbforrunneronthirdflag CHARACTER VARYING(1), csforrunneronfirstflag CHARACTER VARYING(1), csforrunneronsecondflag CHARACTER VARYING(1), csforrunneronthirdflag CHARACTER VARYING(1), poforrunneronfirstflag CHARACTER VARYING(1), poforrunneronsecondflag CHARACTER VARYING(1), poforrunneronthirdflag CHARACTER VARYING(1), responsiblepitcherforrunneratfirst CHARACTER VARYING(8), responsiblepitcherforrunneratsecond CHARACTER VARYING(8), responsiblepitcherforrunneratthird CHARACTER VARYING(8), newgameflag CHARACTER VARYING(1), endgameflag CHARACTER VARYING(1), pinchrunneronfirst CHARACTER VARYING(1), pinchrunneronsecond CHARACTER VARYING(1), pinchrunneronthird CHARACTER VARYING(1), runnerremovedforpinchrunnerfirst CHARACTER VARYING(8), runnerremovedforpinchrunneronsecond CHARACTER VARYING(8), runnerremovedforpinchrunneronthird CHARACTER VARYING(8), batterremovedforpinchhitter CHARACTER VARYING(8), positionofbatterremovedforpinchhitter INTEGER, fielderwithfirstputout INTEGER, fielderwithsecondputout INTEGER, fielderwiththirdputout INTEGER, fielderwithfirstassist INTEGER, fielderwithsecondassist INTEGER, fielderwiththirdassist INTEGER, fielderwithfourthassist INTEGER, fielderwithfifthassist INTEGER, eventnum INTEGER ); CREATE INDEX events_batter_idx ON events (batter); CREATE INDEX events_gameid_idx ON events (gameid); CREATE INDEX events_pitcher_idx ON events (pitcher); To find the grand slams, we need to examine two items; eventtype and rbionplay. If the event type is a Home Run and the RBI's on the play is 4, we have a grand slam. The event types are describe in the eventtypes table, so we'll just select all of the event types. We'll tell Sqlite to select ALL the information from the eventtypes table: sqlite> select * from eventtypes; eventtype|eventname 0|Unknown event 1|No event 2|Generic out 3|Strikeout 4|Stolen base 5|Defensive indifference 6|Caught stealing 7|Pickoff error 8|Pickoff 9|Wild pitch 10|Passed ball 11|Balk 12|Other advance 13|Foul error 14|Walk 15|Intentional walk 16|Hit by 17|Interference 18|Error 19|Fielder's choice 20|Single 21|Double 22|Triple 23|Home run 24|Missing play Event type 23 is a Home Run. Now let's find all the home runs from Hanfer that had 4 RBI's record. We'll display the game id, pitcher, batter, eventtext, and pitch sequence. sqlite> select gameid,pitcher,batter,eventtext,pitchsequence from events where batter = 'hafnt001' and eventtype = 23 and rbionplay = 4; gameid|pitcher|batter|eventtext|pitchsequence CLE200404120|lohsk001|hafnt001|HR/9/F.3-H;2-H;1-H|BX BOS200506280|foulk001|hafnt001|HR/9/L.3-H;2-H;1-H|CSX CIN200607010|maysj001|hafnt001|HR/8/F.3-H;2-H;1-H|.*BFX CLE200605010|mccab001|hafnt001|HR/9/F.3-H;2-H;1-H|B*BX CLE200605170|affej001|hafnt001|HR/9/F.3-H(UR);2-H;1-H|X CLE200606030|donnb001|hafnt001|HR/8/F.3-H(UR);2-H(UR);1-H(UR);B-H(UR)|SBX CLE200607070|bensk001|hafnt001|HR/9/L.3-H;2-H;1-H|X CLE200608130|hudsl001|hafnt001|HR/9/F.3-H;2-H;1-H|BSX Opps, we included ALL his grand slams, not just the ones from 2006! To correct that, we'll add another where team that says find all the gameids that contain something2006something. SQL uses '%' to indicated 'something'. sqlite> select gameid,pitcher,batter,eventtext,pitchsequence from events where batter = 'hafnt001' and eventtype = 23 and rbionplay = 4 and gameid like '%2006%'; gameid|pitcher|batter|eventtext|pitchsequence CIN200607010|maysj001|hafnt001|HR/8/F.3-H;2-H;1-H|.*BFX CLE200605010|mccab001|hafnt001|HR/9/F.3-H;2-H;1-H|B*BX CLE200605170|affej001|hafnt001|HR/9/F.3-H(UR);2-H;1-H|X CLE200606030|donnb001|hafnt001|HR/8/F.3-H(UR);2-H(UR);1-H(UR);B-H(UR)|SBX CLE200607070|bensk001|hafnt001|HR/9/L.3-H;2-H;1-H|X CLE200608130|hudsl001|hafnt001|HR/9/F.3-H;2-H;1-H|BSX Now we know we can find grand slams, let's see all the grand slams for last year. We'll use the previous command and just remove one search term ... batter. This will take a bit longer than the previous searches. Don't worry. sqlite> select gameid,pitcher,batter,eventtext,pitchsequence from events where eventtype = 23 and rbionplay = 4 and gameid like '%2006%'; gameid|pitcher|batter|eventtext|pitchsequence HOU200604240|lidgb001|garcn001|HR/78/F.3-H;2-H;1-H|BBFSX HOU200605150|bucht001|felip001|HR/7/F.3-H;2-H;1-H|X HOU200608290|kolbd001|lanej001|HR/7/F.3-H;2-H;1-H|.*BBB.X HOU200609150|clemr001|burrp001|HR/7/F.3-H;2-H;1-H|FBBCB>X OAK200604030|zitob001|rodra001|HR/7/F.3-H;2-H;1-H|BCX OAK200604220|weavj002|swisn001|HR/9/L.3-H;2-H;1-H|BX OAK200605160|hernf002|melha001|HR/89/F.3-H(UR);2-H(UR);1-H(UR);B-H(UR)|FBBX OAK200607290|zitob001|glaut001|HR/78/F.3-H;2-H;1-H|BX [ Only showing a partial of this, it is quite long ] Now, you'll probably want to save the data to a file. Sqlite can do that for you. The Sqlite command '.show' will show you various settings. One of them is output, which is currently set to stdout. Stdout means standard output, which is the screen. You also see a setting call separator, which is set to | (I call it pipe). This is why all the outputs have had |'s in between the values. Instead of using a comma (,) to separate the values, I recommend using |. Since we are using numbers here, you might encounter a comma or two in the numbers. Let's save the output to a file that can be loaded into Excel. We'll add three more items the runners that were on first, second, and third saving the data in the file 2006grandslams.csv. This query will take some time to run and remember, the output is going to a file so you won't see anything on the screen. sqlite> .output 2006grandslams.csv sqlite> select gameid,pitcher,batter,firstrunner,secondrunner,thirdrunner,eventtext,pitchsequence from events where eventtype = 23 and rbionplay = 4 and gameid like '%2006%'; I have place a copy of the output file at http://kuiper.alal.com/~mkovach/2006grandslams.csv.txt if you want to try this yourself and compare the results. You may also just be concerned about the number (or count) of events, you can do that using the SQL command count. We'll run the same command but select count(*) which really just tells the system to count the number of instances returned from the query. Remember to send the output back to the screen, stdout (I say this from experience, as I wiped out the file running the count command!) sqlite> .output stdout sqlite> select count(*) from events where eventtype = 23 and rbionplay = 4 and gameid like '%2006%'; count(*) 132 sqlite> .quit 132 Grand Slams for 2006 if we did everything right! I'll be adding a few things, editing and putting this tutorial on the web site. If you are incline to try it, please feel free to point out any mistakes, problems, and things that are confusing. I have a few things that should be ready this month. One is a program that will take the output in CSV format (with the headers at the top) and attempted to parse out a few things to make them more readable. It will replace the retroid with the actual name, convert the pitch sequence into a count (3 balls, 2 strikes) and also convert the eventext to a phrase (Batter hit a flyball to center field. The man on third scored, RBI and Sacrifice to the Batter. The man on second was tag out advancing). Thanks, Mat Kovach |