Name                            Modified     Size     Downloads / Week      Status
Totals: 7 Items                              7.9 kB   1
Semantic Web                    2010-11-17            11 weekly downloads
Datastore Catalogue Harvester   2010-11-17            22 weekly downloads
Police API Harvester            2010-11-11            22 weekly downloads
Excel CSV Toolkit               2010-10-15            11 weekly downloads
Data Quality                    2010-10-07            0
XML Generation                  2010-10-07            0
readme.txt                      2010-10-15   7.9 kB   11 weekly downloads
SOURCEFORGE RELEASES
BRIAN ARNOLD
GREATER LONDON AUTHORITY
07-OCT-2010

The London Datastore (http://data.london.gov.uk) was created by the Greater
London Authority (GLA) as an innovation towards freeing London's data.

This SourceForge Project will be used to Open Source our development efforts
surrounding data formats.

CODE STRUCTURE
--------------

Code will be released into the following directories:

DATA QUALITY - This is code to help improve the Data Quality and Homogeneity
that is necessary for Semantic Translation.

DATASTORE CATALOGUE HARVESTER - This is code to implement the harvesting of
the London Datastore catalogue (the dataset metadata that is the core of the
datastore). Both a simple 'Web Scraper' and a direct query from the MySQL
Database are supplied. Please don't generate unnecessary traffic to our site;
the catalogue CSV is available from the datastore
(http://data.london.gov.uk/datastore/package/npia-police-api-datasets).

EXCEL CSV TOOLKIT - This is an Excel 2003 Spreadsheet containing VBA code and
an attached toolbar to implement the easy creation of CSV format from Excel,
which is a common base format for us. This enables non-technical staff to
create CSV files from within Excel using a common tool that can be controlled
centrally. It strives to get around cell formatting issues by formatting cells
as text and preserving contents as seen on the screen.

1) Headers
   This collapses merged column headers by appending contents together to
   give the single row of column headers necessary for CSV export. Just
   select the rows requiring 'flattening' (NB check for proper merge
   formatting first) then press the button.
2) Data
   Contains functionality such as removing blank rows and columns, solving
   formatting issues, removing annotations and replacing cell contents such
   as 'N/A', '-', '*' etc. with NULL (zeroes are preserved, because the fact
   that something was measured as zero is seen as significant).
3) .csv(s)
   This exports the Workbook to CSV file(s). If multiple Worksheets are
   present, each is exported in turn.

Each of the above steps should be performed in order.

POLICE API HARVESTER - This is code to implement the harvesting of XML data
from a REST API of police data. NB This code has been supplied as a working
example that can be adapted for other specific purposes of the same generic
type. Registration is required to obtain an API key. The data harvested is
being posted to the London Datastore.

The Police Data (http://policeapi.rkh.co.uk) is updated every month (around
the 21st). A fetch is performed each day to find out when it was last
updated. If it becomes necessary to update the data held at the GLA
(indicated by the last updated date from the API being later than our last
updated dates for each of the different kinds of data held) then as much work
as possible is performed each day (we are limited to 1,000 API fetches a
day). As progress is made on the many thousands of API fetches necessary to
extract all of the data, it is recorded, and the process resumes the
following day from where it got to. It takes around 10 days to harvest all of
the data from the XML returned from the various API fetches. Data is then
extracted in CSV format. Additionally, automated copying of the generated
files to the datastore may be implemented using SCP.

SEMANTIC WEB - This is code to support the production of n3 format linked
data for our Proof of Concept Virtuoso Quad Store and SPARQL endpoint. This
was written by Kai Van-Duuren in Python.

XML GENERATION - This is code to support the production of datasets in XML
format.
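ILLUSTRATION (EXCEL CSV TOOLKIT) - The toolkit itself is VBA inside an Excel
workbook. The following is only a rough Python sketch of the 'Headers',
'Data' and '.csv(s)' steps described above, not the released code. The
function names are hypothetical. It assumes that a merged header cell shows
up as a value in its leftmost cell followed by blanks, that blank cells are
treated like the placeholder markers, and that NULL is written to CSV as an
empty field.

```python
import csv
import io

# Placeholder strings treated as missing; taken from the README's examples.
NULL_MARKERS = {"N/A", "-", "*"}


def flatten_headers(header_rows, sep=" "):
    """Collapse several header rows into a single row of column names."""
    filled = []
    for i, row in enumerate(header_rows):
        is_last = i == len(header_rows) - 1
        out, current = [], ""
        for cell in row:
            cell = cell.strip()
            if cell:
                current = cell
            # Assumption: a blank cell in an upper row belongs to the merged
            # cell on its left; blanks in the bottom row stay blank.
            out.append(cell if is_last else current)
        filled.append(out)
    # Append the parts of each column together, skipping empty parts.
    return [sep.join(p for p in parts if p) for parts in zip(*filled)]


def clean_rows(rows):
    """Drop blank rows and map placeholder markers to NULL (None)."""
    cleaned = []
    for row in rows:
        cells = []
        for cell in row:
            text = cell.strip()
            # "0" is neither blank nor a marker, so zeroes are preserved.
            cells.append(None if text == "" or text in NULL_MARKERS else text)
        if any(c is not None for c in cells):
            cleaned.append(cells)
    return cleaned


def to_csv(header, rows):
    """Write one header row plus data rows as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)  # csv.writer writes None as an empty field
    return buf.getvalue()
```

As in the toolkit, the steps are meant to run in order: flatten the header
rows, clean the data rows, then export.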
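ILLUSTRATION (POLICE API HARVESTER) - The released harvester is described as
PL/SQL. The following is only a minimal Python sketch of the resumable,
quota-limited daily run described above, not the released code. The function
names, the JSON progress file and the 'fetch' callable are hypothetical;
only the 1,000 fetches a day limit comes from this readme.

```python
import json
from datetime import date
from pathlib import Path

DAILY_LIMIT = 1000  # the readme states a limit of 1,000 API fetches a day


def needs_refresh(api_last_updated, local_last_updated):
    """True if the API reports data newer than what is held locally."""
    return local_last_updated is None or api_last_updated > local_last_updated


def load_progress(path):
    """Read the saved position, or start from the beginning."""
    p = Path(path)
    if p.exists():
        return json.loads(p.read_text())
    return {"next_index": 0}


def run_daily_batch(work_items, fetch, progress_path, limit=DAILY_LIMIT):
    """Fetch up to `limit` pending items, recording progress after each one.

    `fetch` is a caller-supplied function that performs one API call.
    Returns True once every item in `work_items` has been fetched.
    """
    progress = load_progress(progress_path)
    start = progress["next_index"]
    end = min(start + limit, len(work_items))
    for i in range(start, end):
        fetch(work_items[i])
        # Save after every fetch so an interrupted run resumes correctly.
        progress["next_index"] = i + 1
        Path(progress_path).write_text(json.dumps(progress))
    return end >= len(work_items)
```

A scheduler would call needs_refresh once a day and, while a refresh is
pending, call run_daily_batch until it returns True. The daily last-updated
check presumably counts against the same quota, so a real run might pass a
slightly lower limit.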
RELEASE SCHEDULE
----------------

The 'Police API Harvester', 'Excel CSV Toolkit' and 'Datastore Catalogue
Harvester' code will be released first (for the middle of November 2010).
Some simple 'Semantic Web', 'Data Quality' and 'XML Generation' code will
then be released and actively developed.

There is NO intention to release the code of the London Datastore itself,
which has been developed in Drupal.

TECHNOLOGIES USED
-----------------

The overwhelming majority of the code will be implemented in the Oracle
Database 10g r2 environment using PL/SQL. Oracle Express
(http://www.oracle.com/technetwork/database/express-edition/overview/index.html)
is a free release (within limits) of the Oracle database that should allow
you to use all the code. However, this has not been specifically tested.
Oracle Express is available on both Windows and Linux platforms.

Some APEX applications may also eventually be developed and posted
(especially to support the 'Semantic Web', 'Data Quality' and 'XML
Generation'). Again, this is a free release, available from
http://apex.oracle.com

Other code posted will most likely be in Python or Java, again free and
cross-platform environments.

TERMS AND CONDITIONS
--------------------

All the code is released under a Creative Commons Attribution 3.0 Unported
license (http://creativecommons.org/licenses/by/3.0/). This basically means
you can do anything with it, especially copy, share, distribute, change and
use it anywhere, for any purpose, as long as you continue to state the
original creator(s):

- Brian Arnold
- Kai Van-Duuren

Both of GLA ITU - Information Systems and Development

Also, please be aware of these additional caveats:

- The code is supplied AS IS: if it breaks, you get to keep both pieces
  (however, it is in production at the GLA)
- Not all of the code may work as is i.e.
  the 'Police API Harvester'
- Commenting and documentation may well be lacking
- Unfortunately the support we are able to offer is severely limited. We
  CANNOT, for example:
    - Assist you in getting the code working at your site
    - Answer questions about the code or discuss it
    - Enter into partnered development
    - Get involved in re-implementation in other technologies
    - Accept suggestions for improvement or requests for new features
    - Respond to bug reports

We have released the code on the understanding that it may well be of use to
other developers attempting to do the same. Due to the nature of the GLA our
primary concern must be developing the code.

FINAL NOTE
----------

Not all of the code presented here may be entirely original (i.e. of our own
authorship). As all developers will know, code is often adapted from that
found on the Internet. This code is either

1) Taken as a working example to serve as a starting point, or
2) Re-used in its entirety.

We may have done either of the above without giving credit to the source. If
you are an original developer who recognises your code within ours, firstly
please accept our apologies; it is not our intention to detract from your
contribution. If you find yourself in this situation and are unhappy, please
feel free to contact us and we'll discuss what can be done to remedy the
situation.

CONTACT DETAILS
---------------

Follow the London Datastore on:

Twitter            http://twitter.com/londondatastore
Our Blog           http://data.london.gov.uk/blog
Our Google Group   http://groups.google.com/group/londondatastore

or Email: datastore@london.gov.uk
Source: readme.txt, updated 2010-10-15