From: <ku...@us...> - 2009-02-26 10:49:32
Revision: 331
          http://mypyspace.svn.sourceforge.net/mypyspace/?rev=331&view=rev
Author:   kurtjx
Date:     2009-02-26 10:49:27 +0000 (Thu, 26 Feb 2009)

Log Message:
-----------
fixed another bug - this one in mpsSong where the except argument causes some problems in line 350 or so - also checking if avas is None in line 180ish

Modified Paths:
--------------
    musicGrabber/branches/webserv-branch/myspace2rdf.py

Modified: musicGrabber/branches/webserv-branch/myspace2rdf.py
===================================================================
--- musicGrabber/branches/webserv-branch/myspace2rdf.py	2009-02-25 10:15:29 UTC (rev 330)
+++ musicGrabber/branches/webserv-branch/myspace2rdf.py	2009-02-26 10:49:27 UTC (rev 331)
@@ -177,16 +177,18 @@
                 #availableAs = song.getAttribute('durl')'''
                 thisSong = mpsSong(self, song, 'downloadprefix')
                 thisSong.getUri()
-                availableAs = thisSong.uri
+
                 track = mopy.mo.Track()
                 track.title.set(thisSong.title)
-
-                avas = mopy.mo.MusicalItem(availableAs)
-                track.available_as.set(avas)
+                availableAs = thisSong.uri
+                if availableAs:
+                    avas = mopy.mo.MusicalItem(availableAs)
+                    track.available_as.set(avas)
+                    self.mi.add(avas)
                 #track.available_as.set(mopy.rdfs.Resource(availableAs))
                 self.subject.made.add(track)
                 self.mi.add(track)
-                self.mi.add(avas)
+

         self.createCommonRDF()
         self.scrapeGenre()
@@ -346,8 +348,9 @@
         try:
             self.uri = self.exhaustiveXML.getElementsByTagName('link')[0].firstChild.nodeValue
         except AttributeError, err:
-            logging.info("mpsUser::getUri ran into a problem finding the download link for a song by artist with uid: " +
-                         str(self.parent().uid) + " link will be left blank.\n\tError msg: " + str(err))
+            #logging.info("mpsUser::getUri ran into a problem finding the download link for a song by artist with uid: " +
+            #             str(self.parent().uid) + " link will be left blank.\n\tError msg: " + str(err))
+            pass
             self.uri = ''

     def setTrackNum(self, trackNumber, totalTracks):

This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.
From: <ku...@us...> - 2009-02-25 10:15:32
Revision: 330
          http://mypyspace.svn.sourceforge.net/mypyspace/?rev=330&view=rev
Author:   kurtjx
Date:     2009-02-25 10:15:29 +0000 (Wed, 25 Feb 2009)

Log Message:
-----------
fixed myspace ontology prefix in myspaceuris.py

Modified Paths:
--------------
    musicGrabber/branches/webserv-branch/myspaceuris.py

Modified: musicGrabber/branches/webserv-branch/myspaceuris.py
===================================================================
--- musicGrabber/branches/webserv-branch/myspaceuris.py	2009-02-25 10:13:56 UTC (rev 329)
+++ musicGrabber/branches/webserv-branch/myspaceuris.py	2009-02-25 10:15:29 UTC (rev 330)
@@ -5,7 +5,7 @@
 #
 ### append user id to this ###
 rdfStoreURL = "http://myrdfspace.com/alpha/"
-myspaceOntology = 'http://purl.org/ontology/myspace.owl#'
+myspaceOntology = 'http://purl.org/ontology/myspace#'

 #########################################################################################################
 #########################################################################################################
From: <ku...@us...> - 2009-02-25 10:14:00
Revision: 329
          http://mypyspace.svn.sourceforge.net/mypyspace/?rev=329&view=rev
Author:   kurtjx
Date:     2009-02-25 10:13:56 +0000 (Wed, 25 Feb 2009)

Log Message:
-----------
fixed myspace ontology prefix in myspaceuris.py

Modified Paths:
--------------
    musicGrabber/branches/webserv-branch/myspaceuris.py

Modified: musicGrabber/branches/webserv-branch/myspaceuris.py
===================================================================
--- musicGrabber/branches/webserv-branch/myspaceuris.py	2009-02-25 10:06:09 UTC (rev 328)
+++ musicGrabber/branches/webserv-branch/myspaceuris.py	2009-02-25 10:13:56 UTC (rev 329)
@@ -5,7 +5,7 @@
 #
 ### append user id to this ###
 rdfStoreURL = "http://myrdfspace.com/alpha/"
-myspaceOntology = 'http://grasstunes.net/ontology/myspace.owl#'
+myspaceOntology = 'http://purl.org/ontology/myspace.owl#'

 #########################################################################################################
 #########################################################################################################
From: <ku...@us...> - 2009-02-25 10:06:19
Revision: 328
          http://mypyspace.svn.sourceforge.net/mypyspace/?rev=328&view=rev
Author:   kurtjx
Date:     2009-02-25 10:06:09 +0000 (Wed, 25 Feb 2009)

Log Message:
-----------
fixed REM bug where an empty xmlPage made a crash at line 168 - now we just skip song info if xmlPage is None

Modified Paths:
--------------
    musicGrabber/branches/webserv-branch/myspace2rdf.py

Modified: musicGrabber/branches/webserv-branch/myspace2rdf.py
===================================================================
--- musicGrabber/branches/webserv-branch/myspace2rdf.py	2009-02-18 12:56:50 UTC (rev 327)
+++ musicGrabber/branches/webserv-branch/myspace2rdf.py	2009-02-25 10:06:09 UTC (rev 328)
@@ -160,35 +160,34 @@
 #            self.subject.sameAs.set(thing2)
 #            self.mi.add(thing2)

-        idx=0
         xmlPage = try_open(mediaBase[0] + str(self.artistID) + mediaBase[1] + str(self.playlistID) + mediaBase[2] + str(self.uid) + mediaBase[3])
         #print mediaBase[0] + str(self.artistID) + mediaBase[1] + str(self.playlistID) + mediaBase[2] + str(self.uid) + mediaBase[3]
-        self.xmlStruct = dom.parseString(''.join(xmlPage.readlines()))
-
-        songList = self.xmlStruct.getElementsByTagName('song')
-        for song in songList:
-            '''try:
-                songTitle = unicodedata.normalize('NFKC',song.getAttribute('title')).encode('ascii','ignore')
-            except AttributeError, err:
-                songTitle = str(None)
-            except IndexError, err:
-                songTitle = str(None)
-            #availableAs = song.getAttribute('durl')'''
-            thisSong = mpsSong(self, song, 'downloadprefix')
-            thisSong.getUri()
-            availableAs = thisSong.uri
-            track = mopy.mo.Track()
-            track.title.set(thisSong.title)
+
+        if xmlPage:
+            self.xmlStruct = dom.parseString(''.join(xmlPage.readlines()))
+            songList = self.xmlStruct.getElementsByTagName('song')
+            for song in songList:
+                '''try:
+                    songTitle = unicodedata.normalize('NFKC',song.getAttribute('title')).encode('ascii','ignore')
+                except AttributeError, err:
+                    songTitle = str(None)
+                except IndexError, err:
+                    songTitle = str(None)
+                #availableAs = song.getAttribute('durl')'''
+                thisSong = mpsSong(self, song, 'downloadprefix')
+                thisSong.getUri()
+                availableAs = thisSong.uri
+                track = mopy.mo.Track()
+                track.title.set(thisSong.title)

-            avas = mopy.mo.MusicalItem(availableAs)
-            track.available_as.set(avas)
-            #track.available_as.set(mopy.rdfs.Resource(availableAs))
-            self.subject.made.add(track)
-            self.mi.add(track)
-            self.mi.add(avas)
+                avas = mopy.mo.MusicalItem(availableAs)
+                track.available_as.set(avas)
+                #track.available_as.set(mopy.rdfs.Resource(availableAs))
+                self.subject.made.add(track)
+                self.mi.add(track)
+                self.mi.add(avas)

-        idx+=1
         self.createCommonRDF()
         self.scrapeGenre()
         self.mi.add(self.subject)
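Note on r328: the fix is a guard that skips song parsing when no page came back. A minimal standalone sketch of that pattern follows. It is written for Python 3 (the committed code is Python 2), and `song_titles` is a hypothetical helper for illustration, not a function from myspace2rdf.py.

```python
from xml.dom import minidom


def song_titles(xml_text):
    """Return the title attribute of every <song> element.

    Hypothetical helper: an empty or missing page yields an empty list
    instead of a crash inside the XML parser.
    """
    if not xml_text:
        return []
    doc = minidom.parseString(xml_text)
    return [song.getAttribute("title") for song in doc.getElementsByTagName("song")]
```

Returning an empty list rather than None lets callers loop over the result without a second check.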
From: <ku...@us...> - 2009-02-18 12:56:54
|
Revision: 327 http://mypyspace.svn.sourceforge.net/mypyspace/?rev=327&view=rev Author: kurtjx Date: 2009-02-18 12:56:50 +0000 (Wed, 18 Feb 2009) Log Message: ----------- added some regex stuff to get rid of bad genre tags, sometime 1324123.rdf was set as a theme which was a bug in the old code i guess Modified Paths: -------------- graphRDF/branches/old2sparul/old2sparul.py Modified: graphRDF/branches/old2sparul/old2sparul.py =================================================================== --- graphRDF/branches/old2sparul/old2sparul.py 2009-02-18 12:51:29 UTC (rev 326) +++ graphRDF/branches/old2sparul/old2sparul.py 2009-02-18 12:56:50 UTC (rev 327) @@ -3,8 +3,10 @@ """ old2sparul.py +This is an ad hoc script for taking data from myrdfspace.com, cleaning it, and putting in sparql endpoint + Created by Kurtis Random on 2009-02-03. -Copyright (c) 2009 __MyCompanyName__. All rights reserved. +Copyright (c) 2009 C4DM QMUL. All rights reserved. """ import sys @@ -12,21 +14,22 @@ from logging import log, error, warning, info, debug import logging import ftplib -#from SPARQLWrapper import SPARQLWrapper import SPARQLWrapper import mopy import urllib2 +import re from time import sleep help_message = ''' take old myrdfspace files and add to the sparql endpoint... 
-b --base <uri base from myrdfspace> + -s --start <uid to start from> useful after a crash ;-) ''' failedList = [] badQueryList = [] -defaultGraph = "http://dbtune.org/myspace-fj-set-2008" +defaultGraph = "http://dbtune.org/myspace-fj-2008" sparqlEndPoint = "http://dbtune.org/cmn/sparql" myspaceBase = "http://dbtune.org/myspace/uid" myspaceOnt = "http://purl.org/ontology/myspace" @@ -50,7 +53,8 @@ sleep(1.0) attempt+=1 tryImportRDF(filename, attempt) - return mi + else: + return mi debug("import failed after tries: " + str(attempt)) return None @@ -58,45 +62,55 @@ '''parse the rdf and return a sparql update query''' sparqlU='' mi = tryImportRDF(base+filename, 0) - keys = mi.PersonIdx.keys() - for key in keys: - person = mi.PersonIdx[key] - if person.name: - # if we find the name, this is the main subject - suid = person.URI.split(base)[1] - subject = "<"+myspaceBase+"/"+suid+">" - name = person.name.pop() - sparqlU = sparqlU + '\n'+subject+' rdf:type mo:MusicArtist .' - sparqlU = sparqlU + '\n'+subject+' myspace:myspaceID "'+filename.rstrip('.rdf')+'"^^xsd:int .' - sparqlU = sparqlU + """\n"""+subject+' foaf:name "' + urllib2.quote(name)+'"@en . ' + if mi: + keys = mi.PersonIdx.keys() + for key in keys: + person = mi.PersonIdx[key] + if person.name: + # if we find the name, this is the main subject + suid = person.URI.split(base)[1] + subject = "<"+myspaceBase+"/"+suid+">" + name = person.name.pop() + sparqlU = sparqlU + '\n'+subject+' rdf:type mo:MusicArtist .' + sparqlU = sparqlU + '\n'+subject+' myspace:myspaceID "'+filename.rstrip('.rdf')+'"^^xsd:int .' + sparqlU = sparqlU + """\n"""+subject+' foaf:name "' + urllib2.quote(name)+'"@en . ' - # get all the top friends - while(1): - try: - p = person.knows.pop() - ouid = p.URI.split(base)[1] - obj = "<"+myspaceBase+"/"+ouid+">" - sparqlU=sparqlU+ "\n"+subject+" foaf:knows "+ obj+ ' . ' "\n"+subject+" myspace:topFriend "+obj+ ' . ' - sparqlU = sparqlU + '\n'+obj+' rdf:type mo:MusicArtist .' 
- except: - break + # get all the top friends + while(1): + try: + p = person.knows.pop() + except: + break + else: + ouid = p.URI.split(base)[1] + obj = "<"+myspaceBase+"/"+ouid+">" + sparqlU=sparqlU+ "\n"+subject+" foaf:knows "+ obj+ ' . ' "\n"+subject+" myspace:topFriend "+obj+ ' . ' + sparqlU = sparqlU + '\n'+obj+' rdf:type mo:MusicArtist .' - while(1): + while(1): + try: + thm = person.theme.pop() + except: + debug("breaking from genre pops") + break + else: + thm = thm.URI.split(base)[1] + # do some cleaning, bad genres in there like 35123543.rdf instead of hip hop + if not re.match(".*\.rdf",thm): + debug("adding genre: "+thm) + genre = "<"+myspaceOnt + "#"+urllib2.quote(thm)+">" + sparqlU=sparqlU+ "\n"+subject+ " myspace:genreTag "+ genre+ ' . ' + try: - thm = person.theme.pop() - genre = "<"+myspaceOnt + "#"+urllib2.quote(thm.URI.split(base)[1])+">" - sparqlU=sparqlU+ "\n"+subject+ " myspace:genreTag "+ genre+ ' . ' + playcount = person.tipjar.pop().URI.split(base)[1] + sparqlU=sparqlU+ "\n"+subject+ ' myspace:totalPlays "'+ playcount+'"^^xsd:int . ' except: - break - - try: - playcount = person.tipjar.pop().URI.split(base)[1] - sparqlU=sparqlU+ "\n"+subject+ ' myspace:totalPlays "'+ playcount+'"^^xsd:int . 
' - except: - pass + pass - sparqlU=sparqlU+'}' - return sparqlU + sparqlU=sparqlU+'}' + return sparqlU + else: + return None def setLogger(): '''just set the logger''' @@ -216,17 +230,20 @@ sparul = parseRDF(f, base) sparql = SPARQLWrapper.SPARQLWrapper(sparqlEndPoint) sparql.addDefaultGraph(defaultGraph) - - # we have to deal w/ queries that are too long - if len(sparul) > apacheLimit: - debug('query too long, splitting...') - splitSparul = splitQuery(sparul) - for split in splitSparul: - sparql.setQuery(prefixes+split) + if sparul: + # we have to deal w/ queries that are too long + if len(sparul) > apacheLimit: + debug('query too long, splitting...') + splitSparul = splitQuery(sparul) + for split in splitSparul: + sparql.setQuery(prefixes+split) + trySparql(sparql, 0, f) + else: + sparql.setQuery(prefixes+insert+sparul) trySparql(sparql, 0, f) else: - sparql.setQuery(prefixes+insert+sparul) - trySparql(sparql, 0, f) + debug('failure on '+str(f)) + failedList.append(f) This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
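Note on r327: the commit filters out genre tags that are really file names such as 35123543.rdf. The same diff still uses `filename.rstrip('.rdf')`, which strips any trailing run of the characters `.`, `r`, `d`, `f` rather than the suffix; that is harmless for all-digit names but not in general. A minimal sketch in Python 3, where `is_real_genre` and `strip_rdf_suffix` are hypothetical names and the pattern is anchored with `$` (a tightening that the commit does not make):

```python
import re

# Matches tags that end in ".rdf", e.g. "35123543.rdf".
RDF_NAME = re.compile(r".*\.rdf$")


def is_real_genre(tag):
    """True when the tag does not look like a stray RDF file name."""
    return RDF_NAME.match(tag) is None


def strip_rdf_suffix(filename):
    """Remove a trailing '.rdf' as a suffix, not as a character set."""
    if filename.endswith(".rdf"):
        return filename[:-len(".rdf")]
    return filename
```

For a name like `fred.rdf`, `rstrip('.rdf')` leaves `fre`, while the suffix check leaves `fred`.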
From: <ku...@us...> - 2009-02-18 12:51:33
Revision: 326
          http://mypyspace.svn.sourceforge.net/mypyspace/?rev=326&view=rev
Author:   kurtjx
Date:     2009-02-18 12:51:29 +0000 (Wed, 18 Feb 2009)

Log Message:
-----------
added foaf:primaryTopic to rdf, mostly cuz kingsley said to

Modified Paths:
--------------
    musicGrabber/branches/webserv-branch/myspace2rdf.py

Modified: musicGrabber/branches/webserv-branch/myspace2rdf.py
===================================================================
--- musicGrabber/branches/webserv-branch/myspace2rdf.py	2009-02-05 15:06:38 UTC (rev 325)
+++ musicGrabber/branches/webserv-branch/myspace2rdf.py	2009-02-18 12:51:29 UTC (rev 326)
@@ -105,8 +105,10 @@
         genrePresent = scrapePage(self.page, [genreTag[0]], genreTag[1])
         if genrePresent:
             self.subject = mopy.mo.MusicArtist(dbtuneMyspace+'uid/'+str(self.uid))
-            #self.subjecttwo = mopy.foaf.Person('http://dbtune.org/myspace/uid/'+str(self.uid))
-            #self.subject = mopy.mo.MusicArtist('http://dbtune.org/myspace/uid/'+str(self.uid))
+            # add foaf:primaryTopic
+            ppd = mopy.foaf.PersonalProfileDocument("")
+            ppd.primaryTopic.set(self.subject)
+            self.mi.add(ppd)
             self.name = scrapePage(self.page, [nameTag[0]], nameTag[1])
             if self.name:
                 self.subject.name.set(self.name)
@@ -117,6 +119,11 @@
         else:
             #self.subject = mopy.mo.Agent('http://dbtune.org/myspace/uid/'+str(self.uid))
             self.subject = mopy.foaf.Person(dbtuneMyspace+'uid/'+str(self.uid))
+            self.subject = mopy.mo.MusicArtist(dbtuneMyspace+'uid/'+str(self.uid))
+            # add foaf:primaryTopic
+            ppd = mopy.foaf.PersonalProfileDocument("")
+            ppd.primaryTopic.set(self.subject)
+            self.mi.add(ppd)
             self.name = scrapePage(self.page, [nameTag[0]], nameTag[1])
             #print self.name
             if self.name:
From: <ku...@us...> - 2009-02-05 15:06:42
Revision: 325
          http://mypyspace.svn.sourceforge.net/mypyspace/?rev=325&view=rev
Author:   kurtjx
Date:     2009-02-05 15:06:38 +0000 (Thu, 05 Feb 2009)

Log Message:
-----------
some additional error handling for fails on importRDFFile and an arguement to restart mid directory

Modified Paths:
--------------
    graphRDF/branches/old2sparul/old2sparul.py

Modified: graphRDF/branches/old2sparul/old2sparul.py
===================================================================
--- graphRDF/branches/old2sparul/old2sparul.py	2009-02-04 17:20:12 UTC (rev 324)
+++ graphRDF/branches/old2sparul/old2sparul.py	2009-02-05 15:06:38 UTC (rev 325)
@@ -40,10 +40,24 @@
     def __init__(self, msg):
         self.msg = msg

+def tryImportRDF(filename, attempt):
+    if attempt < 5:
+        debug("importing rdf")
+        try:
+            mi = mopy.importRDFFile(filename)
+        except urllib2.URLError:
+            debug("URLError importing RDF, retrying")
+            sleep(1.0)
+            attempt+=1
+            tryImportRDF(filename, attempt)
+        return mi
+    debug("import failed after tries: " + str(attempt))
+    return None
+
 def parseRDF(filename, base):
     '''parse the rdf and return a sparql update query'''
     sparqlU=''
-    mi = mopy.importRDFFile(base+filename)
+    mi = tryImportRDF(base+filename, 0)
     keys = mi.PersonIdx.keys()
     for key in keys:
         person = mi.PersonIdx[key]
@@ -156,12 +170,13 @@
         argv = sys.argv
     try:
         try:
-            opts, args = getopt.getopt(argv[1:], "ho:b:v", ["help", "output=","base="])
+            opts, args = getopt.getopt(argv[1:], "ho:b:s:v", ["help", "output=","base=", "start="])
         except getopt.error, msg:
             raise Usage(msg)

         # option processing
         base = None
+        start = None
         for option, value in opts:
             if option == "-v":
                 verbose = True
@@ -171,6 +186,8 @@
                 output = value
             if option in ("-b", "--base"):
                 base = value
+            if option in ("-s", "--start"):
+                start = value
             '''if option in ("-g", '--graph'):
                 defaultGraph = value
                 insert = """ \ninsert into graph <"""+defaultGraph+"""> {"""'''
@@ -186,7 +203,14 @@
         fileList = getFileListing(folder)
         debug('got list of files')
         #fileList = ['238729309.rdf', '13280592.rdf', '26412401.rdf', '8557307.rdf', '176635064.rdf', '12656647.rdf']
-        for f in fileList:
+        startIndex=0
+        if start:
+            try:
+                startIndex=fileList.index(start)
+            except:
+                debug("not a valid start file, not in list")
+
+        for f in fileList[startIndex:]:
             debug('parsing on file: '+str(f))
             #parse each file and do a sparql update to the repository
             sparul = parseRDF(f, base)
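Note on r325: as committed, `tryImportRDF` retries by calling itself but discards the result of that call, so after a failed first attempt the following `return mi` refers to a name that was never bound. A loop avoids the problem. The sketch below is Python 3 and generic: `retry` is a hypothetical helper, and `OSError` stands in for the URL error caught in the diff (in Python 3, `urllib.error.URLError` is a subclass of `OSError`).

```python
import time


def retry(func, attempts=5, delay=1.0, exceptions=(OSError,)):
    """Call func() up to `attempts` times and return its first successful result.

    Returns None when every attempt raises one of `exceptions`.
    """
    for attempt in range(attempts):
        try:
            return func()
        except exceptions:
            # Wait before the next attempt, but not after the final failure.
            if attempt + 1 < attempts:
                time.sleep(delay)
    return None
```

Under these assumptions the import call would be wrapped as `retry(lambda: mopy.importRDFFile(url))`, with the caller checking for None before using the result.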
From: <ku...@us...> - 2009-02-04 17:20:16
Revision: 324
          http://mypyspace.svn.sourceforge.net/mypyspace/?rev=324&view=rev
Author:   kurtjx
Date:     2009-02-04 17:20:12 +0000 (Wed, 04 Feb 2009)

Log Message:
-----------
old2sparul working properly :-)

Modified Paths:
--------------
    graphRDF/branches/old2sparul/old2sparul.py

Modified: graphRDF/branches/old2sparul/old2sparul.py
===================================================================
--- graphRDF/branches/old2sparul/old2sparul.py	2009-02-04 15:18:08 UTC (rev 323)
+++ graphRDF/branches/old2sparul/old2sparul.py	2009-02-04 17:20:12 UTC (rev 324)
@@ -26,11 +26,11 @@
 failedList = []
 badQueryList = []

-defaultGraph = "http://dbtune.org/myspace-fj-2008p"
+defaultGraph = "http://dbtune.org/myspace-fj-set-2008"
 sparqlEndPoint = "http://dbtune.org/cmn/sparql"
 myspaceBase = "http://dbtune.org/myspace/uid"
 myspaceOnt = "http://purl.org/ontology/myspace"
-prefixes = """PREFIX owl: <http://www.w3.org/2002/07/owl#> \nPREFIX foaf: <http://xmlns.com/foaf/0.1/> \nPREFIX dc: <http://purl.org/dc/elements/1.1/> \nPREFIX mo: <http://purl.org/ontology/mo/>\nPREFIX myspace: <http://purl.org/ontology/myspace#>\nPREFIX xsd: <http://www.w3.org/2001/XMLSchema#>"""
+prefixes = """PREFIX owl: <http://www.w3.org/2002/07/owl#> \nPREFIX foaf: <http://xmlns.com/foaf/0.1/> \nPREFIX dc: <http://purl.org/dc/elements/1.1/> \nPREFIX mo: <http://purl.org/ontology/mo/>\nPREFIX myspace: <http://purl.org/ontology/myspace#>\nPREFIX xsd: <http://www.w3.org/2001/XMLSchema#>\nPREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>"""

 insert = """ \ninsert into graph <"""+defaultGraph+"""> {"""

@@ -52,6 +52,8 @@
             suid = person.URI.split(base)[1]
             subject = "<"+myspaceBase+"/"+suid+">"
             name = person.name.pop()
+            sparqlU = sparqlU + '\n'+subject+' rdf:type mo:MusicArtist .'
+            sparqlU = sparqlU + '\n'+subject+' myspace:myspaceID "'+filename.rstrip('.rdf')+'"^^xsd:int .'
             sparqlU = sparqlU + """\n"""+subject+' foaf:name "' + urllib2.quote(name)+'"@en . '

             # get all the top friends
@@ -61,6 +63,7 @@
                     ouid = p.URI.split(base)[1]
                     obj = "<"+myspaceBase+"/"+ouid+">"
                     sparqlU=sparqlU+ "\n"+subject+" foaf:knows "+ obj+ ' . ' "\n"+subject+" myspace:topFriend "+obj+ ' . '
+                    sparqlU = sparqlU + '\n'+obj+' rdf:type mo:MusicArtist .'
                 except:
                     break

From: <ku...@us...> - 2009-02-04 15:18:11
|
Revision: 323 http://mypyspace.svn.sourceforge.net/mypyspace/?rev=323&view=rev Author: kurtjx Date: 2009-02-04 15:18:08 +0000 (Wed, 04 Feb 2009) Log Message: ----------- splits big queries down now Modified Paths: -------------- graphRDF/branches/old2sparul/old2sparul.py Modified: graphRDF/branches/old2sparul/old2sparul.py =================================================================== --- graphRDF/branches/old2sparul/old2sparul.py 2009-02-03 20:55:00 UTC (rev 322) +++ graphRDF/branches/old2sparul/old2sparul.py 2009-02-04 15:18:08 UTC (rev 323) @@ -22,22 +22,27 @@ take old myrdfspace files and add to the sparql endpoint... -b --base <uri base from myrdfspace> ''' + failedList = [] badQueryList = [] -defaultGraph = "http://dbtune.org/myspace-test" +defaultGraph = "http://dbtune.org/myspace-fj-2008p" sparqlEndPoint = "http://dbtune.org/cmn/sparql" myspaceBase = "http://dbtune.org/myspace/uid" myspaceOnt = "http://purl.org/ontology/myspace" prefixes = """PREFIX owl: <http://www.w3.org/2002/07/owl#> \nPREFIX foaf: <http://xmlns.com/foaf/0.1/> \nPREFIX dc: <http://purl.org/dc/elements/1.1/> \nPREFIX mo: <http://purl.org/ontology/mo/>\nPREFIX myspace: <http://purl.org/ontology/myspace#>\nPREFIX xsd: <http://www.w3.org/2001/XMLSchema#>""" +insert = """ \ninsert into graph <"""+defaultGraph+"""> {""" + +apacheLimit = 2000 + class Usage(Exception): def __init__(self, msg): self.msg = msg def parseRDF(filename, base): '''parse the rdf and return a sparql update query''' - sparqlU = prefixes+""" \ninsert into graph <"""+defaultGraph+"""> {""" + sparqlU='' mi = mopy.importRDFFile(base+filename) keys = mi.PersonIdx.keys() for key in keys: @@ -99,8 +104,7 @@ try: debug('attempting sparql update, try #' + str(attempt)) sparql.setReturnFormat(SPARQLWrapper.TURTLE) - ret = sparql.query() - print ret.convert() + ret = sparql.query().convert() except urllib2.HTTPError: debug('caught an http error, retrying...') if attempt<5: @@ -113,17 +117,36 @@ except 
SPARQLWrapper.sparqlexceptions.QueryBadFormed: error("query failed for "+ str(f)) debug('$$$$$$$$$$$$$$$$BADLY FORMED QUERY$$$$$$$$$$$$$$$$$$$') + print sparql.queryString badQueryList.append(f) failedList.append(f) except: error("query failed for "+ str(f)) debug('************UPDATE FAILED***********') failedList.append(f) - error("Unexpected error:", sys.exc_info()[0]) + print "Unexpected error:", sys.exc_info()[0] + print sparql.queryString + else: + print ret + return ret + return None def splitQuery(query): '''sometime the query is too long and should be broke in two pieces''' - pass + lines = query.splitlines(1) + splits = [] + split = "" + count = 0 + for line in lines: + if count < apacheLimit: + split = split+line + count+=len(line) + else: + splits.append(insert+split+'}') + split= line + count = 0 + splits.append(insert+split) + return splits def main(argv=None): if argv is None: @@ -145,6 +168,10 @@ output = value if option in ("-b", "--base"): base = value + '''if option in ("-g", '--graph'): + defaultGraph = value + insert = """ \ninsert into graph <"""+defaultGraph+"""> {"""''' + setLogger() if base == None: @@ -153,50 +180,27 @@ # parse base uri folder = base.split("http://myrdfspace.com/")[1] debug('getting list of files') - #fileList = getFileListing(folder) + fileList = getFileListing(folder) debug('got list of files') - fileList = ['238729309.rdf', '13280592.rdf', '26412401.rdf', '8557307.rdf', '176635064.rdf', '12656647.rdf'] + #fileList = ['238729309.rdf', '13280592.rdf', '26412401.rdf', '8557307.rdf', '176635064.rdf', '12656647.rdf'] for f in fileList: debug('parsing on file: '+str(f)) #parse each file and do a sparql update to the repository sparul = parseRDF(f, base) sparql = SPARQLWrapper.SPARQLWrapper(sparqlEndPoint) sparql.addDefaultGraph(defaultGraph) - sparql.setQuery(sparul) - trySparql(sparql, 0, f) - '''try: - debug('attempting sparql update') - sparql.setReturnFormat(SPARQLWrapper.TURTLE) - ret = sparql.query() - print 
ret.convert() - except urllib2.HTTPError: - debug('caught an http error, retrying...') - try: - ret = sparql.query() - print ret.convert() - except urllib2.HTTPError: - debug('second http error...') - try: - ret = sparql.query() - print ret.convert() - except: - print "query failed for "+ str(f) - debug('************UPDATE FAILED***********') - failedList.append(f) - print "FINAL error:", sys.exc_info()[0] - except: - print "query failed for "+ str(f) - debug('************UPDATE FAILED***********') - failedList.append(f) - print "Unexpected error:", sys.exc_info()[0] - except SPARQLWrapper.sparqlexceptions.QueryBadFormed: - debug('$$$$$$$$$$$$$$$$BADLY FORMED QUERY$$$$$$$$$$$$$$$$$$$') - badQueryList.append(f) - except: - print "query failed for "+ str(f) - debug('************UPDATE FAILED***********') - failedList.append(f) - print "Unexpected error:", sys.exc_info()[0]''' + + # we have to deal w/ queries that are too long + if len(sparul) > apacheLimit: + debug('query too long, splitting...') + splitSparul = splitQuery(sparul) + for split in splitSparul: + sparql.setQuery(prefixes+split) + trySparql(sparql, 0, f) + else: + sparql.setQuery(prefixes+insert+sparul) + trySparql(sparql, 0, f) + debug("Complete!!!") This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
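Note on r323: the commit splits an oversized update body so that each request stays under `apacheLimit`. As written, `splitQuery` resets its counter to 0 after carrying a line into the next chunk, so that line is not counted. The sketch below shows one way to do the grouping in Python 3. `split_update`, `INSERT_HEADER` and the example graph URI are hypothetical, and the input is assumed to be newline-separated triples without the closing brace.

```python
INSERT_HEADER = "insert into graph <http://example.org/graph> {"  # hypothetical graph URI
LIMIT = 2000  # same value as apacheLimit in the diff


def split_update(triples, header=INSERT_HEADER, limit=LIMIT):
    """Group triple lines into complete insert statements.

    Each chunk is header + lines + '}'. The lines in a chunk total at most
    `limit` characters unless a single line is itself longer than `limit`.
    """
    chunks, current, size = [], [], 0
    for line in triples.splitlines(keepends=True):
        # Flush before adding a line that would push the chunk over the limit.
        if current and size + len(line) > limit:
            chunks.append(header + "".join(current) + "}")
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append(header + "".join(current) + "}")
    return chunks
```

Each chunk would then be sent as `prefixes + chunk`, in the same way the loop in `main` sends the output of `splitQuery`.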
From: <ku...@us...> - 2009-02-03 20:55:05
|
Revision: 322 http://mypyspace.svn.sourceforge.net/mypyspace/?rev=322&view=rev Author: kurtjx Date: 2009-02-03 20:55:00 +0000 (Tue, 03 Feb 2009) Log Message: ----------- lil python script for importing old myrdfspace data into 3 store, need to add a function to break long queries in two Added Paths: ----------- graphRDF/branches/old2sparul/old2sparul.py Added: graphRDF/branches/old2sparul/old2sparul.py =================================================================== --- graphRDF/branches/old2sparul/old2sparul.py (rev 0) +++ graphRDF/branches/old2sparul/old2sparul.py 2009-02-03 20:55:00 UTC (rev 322) @@ -0,0 +1,216 @@ +#!/usr/bin/env python +# encoding: utf-8 +""" +old2sparul.py + +Created by Kurtis Random on 2009-02-03. +Copyright (c) 2009 __MyCompanyName__. All rights reserved. +""" + +import sys +import getopt +from logging import log, error, warning, info, debug +import logging +import ftplib +#from SPARQLWrapper import SPARQLWrapper +import SPARQLWrapper +import mopy +import urllib2 +from time import sleep + +help_message = ''' +take old myrdfspace files and add to the sparql endpoint... 
+ -b --base <uri base from myrdfspace> +''' +failedList = [] +badQueryList = [] + +defaultGraph = "http://dbtune.org/myspace-test" +sparqlEndPoint = "http://dbtune.org/cmn/sparql" +myspaceBase = "http://dbtune.org/myspace/uid" +myspaceOnt = "http://purl.org/ontology/myspace" +prefixes = """PREFIX owl: <http://www.w3.org/2002/07/owl#> \nPREFIX foaf: <http://xmlns.com/foaf/0.1/> \nPREFIX dc: <http://purl.org/dc/elements/1.1/> \nPREFIX mo: <http://purl.org/ontology/mo/>\nPREFIX myspace: <http://purl.org/ontology/myspace#>\nPREFIX xsd: <http://www.w3.org/2001/XMLSchema#>""" + +class Usage(Exception): + def __init__(self, msg): + self.msg = msg + +def parseRDF(filename, base): + '''parse the rdf and return a sparql update query''' + sparqlU = prefixes+""" \ninsert into graph <"""+defaultGraph+"""> {""" + mi = mopy.importRDFFile(base+filename) + keys = mi.PersonIdx.keys() + for key in keys: + person = mi.PersonIdx[key] + if person.name: + # if we find the name, this is the main subject + suid = person.URI.split(base)[1] + subject = "<"+myspaceBase+"/"+suid+">" + name = person.name.pop() + sparqlU = sparqlU + """\n"""+subject+' foaf:name "' + urllib2.quote(name)+'"@en . ' + + # get all the top friends + while(1): + try: + p = person.knows.pop() + ouid = p.URI.split(base)[1] + obj = "<"+myspaceBase+"/"+ouid+">" + sparqlU=sparqlU+ "\n"+subject+" foaf:knows "+ obj+ ' . ' "\n"+subject+" myspace:topFriend "+obj+ ' . ' + except: + break + + while(1): + try: + thm = person.theme.pop() + genre = "<"+myspaceOnt + "#"+urllib2.quote(thm.URI.split(base)[1])+">" + sparqlU=sparqlU+ "\n"+subject+ " myspace:genreTag "+ genre+ ' . ' + except: + break + + try: + playcount = person.tipjar.pop().URI.split(base)[1] + sparqlU=sparqlU+ "\n"+subject+ ' myspace:totalPlays "'+ playcount+'"^^xsd:int . 
' + except: + pass + + sparqlU=sparqlU+'}' + return sparqlU + +def setLogger(): + '''just set the logger''' + loggingConfig = {"format":'%(asctime)s %(levelname)-8s %(message)s', + "datefmt":'%d.%m.%y %H:%M:%S', + "level": logging.DEBUG, + #"filename":logPath + "musicGrabber.log", + "filemode":"w"} + logging.basicConfig(**loggingConfig) + +def getFileListing(rdfFolder): + '''return a list of all the rdf files found w/ given base''' + rdfFolder = rdfFolder.rstrip('/') + rdfFolder = rdfFolder+'/' + ftp = ftplib.FTP("myrdfspace.com") + ftp.login("myrdf", "my1stRDF") + ftp.cwd("myrdfspace.com/"+rdfFolder) + vList = ftp.nlst() + return vList + +def trySparql(sparql, attempt, f): + try: + debug('attempting sparql update, try #' + str(attempt)) + sparql.setReturnFormat(SPARQLWrapper.TURTLE) + ret = sparql.query() + print ret.convert() + except urllib2.HTTPError: + debug('caught an http error, retrying...') + if attempt<5: + attempt+=1 + sleep(2) + trySparql(sparql, attempt, f) + else: + error("more that 5 http errors, giving up") + failedList.append(f) + except SPARQLWrapper.sparqlexceptions.QueryBadFormed: + error("query failed for "+ str(f)) + debug('$$$$$$$$$$$$$$$$BADLY FORMED QUERY$$$$$$$$$$$$$$$$$$$') + badQueryList.append(f) + failedList.append(f) + except: + error("query failed for "+ str(f)) + debug('************UPDATE FAILED***********') + failedList.append(f) + error("Unexpected error:", sys.exc_info()[0]) + +def splitQuery(query): + '''sometime the query is too long and should be broke in two pieces''' + pass + +def main(argv=None): + if argv is None: + argv = sys.argv + try: + try: + opts, args = getopt.getopt(argv[1:], "ho:b:v", ["help", "output=","base="]) + except getopt.error, msg: + raise Usage(msg) + + # option processing + base = None + for option, value in opts: + if option == "-v": + verbose = True + if option in ("-h", "--help"): + raise Usage(help_message) + if option in ("-o", "--output"): + output = value + if option in ("-b", "--base"): + base = 
value + + setLogger() + if base == None: + raise Usage(help_message) + return 2 + # parse base uri + folder = base.split("http://myrdfspace.com/")[1] + debug('getting list of files') + #fileList = getFileListing(folder) + debug('got list of files') + fileList = ['238729309.rdf', '13280592.rdf', '26412401.rdf', '8557307.rdf', '176635064.rdf', '12656647.rdf'] + for f in fileList: + debug('parsing on file: '+str(f)) + #parse each file and do a sparql update to the repository + sparul = parseRDF(f, base) + sparql = SPARQLWrapper.SPARQLWrapper(sparqlEndPoint) + sparql.addDefaultGraph(defaultGraph) + sparql.setQuery(sparul) + trySparql(sparql, 0, f) + '''try: + debug('attempting sparql update') + sparql.setReturnFormat(SPARQLWrapper.TURTLE) + ret = sparql.query() + print ret.convert() + except urllib2.HTTPError: + debug('caught an http error, retrying...') + try: + ret = sparql.query() + print ret.convert() + except urllib2.HTTPError: + debug('second http error...') + try: + ret = sparql.query() + print ret.convert() + except: + print "query failed for "+ str(f) + debug('************UPDATE FAILED***********') + failedList.append(f) + print "FINAL error:", sys.exc_info()[0] + except: + print "query failed for "+ str(f) + debug('************UPDATE FAILED***********') + failedList.append(f) + print "Unexpected error:", sys.exc_info()[0] + except SPARQLWrapper.sparqlexceptions.QueryBadFormed: + debug('$$$$$$$$$$$$$$$$BADLY FORMED QUERY$$$$$$$$$$$$$$$$$$$') + badQueryList.append(f) + except: + print "query failed for "+ str(f) + debug('************UPDATE FAILED***********') + failedList.append(f) + print "Unexpected error:", sys.exc_info()[0]''' + + + debug("Complete!!!") + print "\n\nREPORT:\n\tfailures: "+str(len(failedList)) + print "\nfails: " + print failedList + print "\n\nbad queries: " + print badQueryList + + except Usage, err: + print >> sys.stderr, sys.argv[0].split("/")[-1] + ": " + str(err.msg) + print >> sys.stderr, "\t for help use --help" + return 2 + + +if 
__name__ == "__main__": + sys.exit(main()) This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
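Editor's sketch, not part of the commit: the recursive retry in `trySparql` above can also be expressed as a bounded loop, which keeps the attempt counter in one place. This is a minimal Python 3 illustration under stated assumptions; `run_with_retries`, `retry_on` and the injectable `sleep` argument are hypothetical names, and the generic `action` callable stands in for the `sparql.query()` call in the diff.

```python
import time


def run_with_retries(action, max_attempts=5, delay=2.0,
                     retry_on=(OSError,), sleep=time.sleep):
    """Call action() until it succeeds or max_attempts is reached.

    Returns (True, result) on success and (False, None) once every
    attempt has raised one of the exception types listed in retry_on.
    Any other exception propagates to the caller unchanged.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return True, action()
        except retry_on as err:
            print("attempt %d failed: %s" % (attempt, err))
            # Only wait if another attempt is still to come.
            if attempt < max_attempts:
                sleep(delay)
    return False, None


# Demo: a hypothetical action that fails twice, then succeeds.
_calls = []


def _flaky_update():
    _calls.append(1)
    if len(_calls) < 3:
        raise OSError("simulated HTTP error")
    return "updated"


print(run_with_retries(_flaky_update, sleep=lambda seconds: None))
```

A caller would append the file name to its own failure list when the first element of the returned tuple is `False`, much as the diff does with `failedList`.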
From: <ku...@us...> - 2009-02-03 20:53:30
|
Revision: 321
http://mypyspace.svn.sourceforge.net/mypyspace/?rev=321&view=rev
Author: kurtjx
Date: 2009-02-03 20:53:25 +0000 (Tue, 03 Feb 2009)

Log Message:
-----------
new directory

Added Paths:
-----------
graphRDF/branches/old2sparul/
|
From: <ku...@us...> - 2009-02-03 15:15:46
|
Revision: 320 http://mypyspace.svn.sourceforge.net/mypyspace/?rev=320&view=rev Author: kurtjx Date: 2009-02-03 15:15:36 +0000 (Tue, 03 Feb 2009) Log Message: ----------- fixed a bug in the genre scraping Modified Paths: -------------- musicGrabber/branches/webserv-branch/myspace2rdf.py musicGrabber/branches/webserv-branch/myspaceuris.py Modified: musicGrabber/branches/webserv-branch/myspace2rdf.py =================================================================== --- musicGrabber/branches/webserv-branch/myspace2rdf.py 2009-01-27 13:55:06 UTC (rev 319) +++ musicGrabber/branches/webserv-branch/myspace2rdf.py 2009-02-03 15:15:36 UTC (rev 320) @@ -263,7 +263,7 @@ def scrapeGenre(self): - genreraw = scrapePage(self.page, [genreTag[0]], genreTag[1]) + '''genreraw = scrapePage(self.page, [genreTag[0]], genreTag[1]) if genreraw == None: return genreraw genreraw = str(genreraw).lstrip() @@ -278,8 +278,22 @@ self.mi.add(g) self.subject.genreTag.add(g) genresfixed.append(genre) - return genresfixed + return genresfixed''' + localGenres = scrapePage(self.page, [genreTag[0]], genreTag[1]) + if localGenres == None: + return None + genreNums = re.findall(''':"(.|..|...)"''', localGenres) # should return only 2 or 3 char string between + genres = [] + for gnum in genreNums: + genre = mopy.mo.Genre(myspaceOntology+urllib.quote(genreDict[int(gnum)])) + genre.name.set(genreDict[int(gnum)]) + self.mi.add(genre) + self.subject.genreTag.add(genre) + genres.append(genre) + + return genres + class mpsSong: """a class that wraps around the downloading, feature extracting and modeling of a piece of media attached to a mpsUser mpsSong object instances have the following public variables: Modified: musicGrabber/branches/webserv-branch/myspaceuris.py =================================================================== --- musicGrabber/branches/webserv-branch/myspaceuris.py 2009-01-27 13:55:06 UTC (rev 319) +++ musicGrabber/branches/webserv-branch/myspaceuris.py 2009-02-03 15:15:36 UTC (rev 
320) @@ -26,7 +26,8 @@ # ### tag terminated by a ';' ### nameTag = """<span class="nametext">""", '''<''' # ### tag term by '<' -genreTag = '''<font color="#033330" size="1" face="Arial, Helvetica, sans-serif"><strong>\r\n\t\t\t\t\t''', ''' \r''' +genreTag = '''MySpace.Ads.BandType = {''', '''}''' +#genreTag = '''<font color="#033330" size="1" face="Arial, Helvetica, sans-serif"><strong>\r\n\t\t\t\t\t''', ''' \r''' #'''<font color="#033330" size="1" face="Arial, Helvetica, sans-serif"><strong>''', '''<''' # ### tag terminated by '<' niceURLTag = '''<td><div align="left"> <span><a href="''', '''">''' @@ -70,6 +71,9 @@ #adding this back in to lessen the broken... dbtuneMyspace = 'http://dbtune.org/myspace/' +#new dict for genres valid as of 2009 feb 3 +genreDict = {0:"", 61:"2-step", 59:"A'cappella", 125:"Acousmatic / Tape music", 1:"Acoustic", 73:"Afro-beat", 2:"Alternative", 3:"Ambient", 93:"Americana", 98:"Anime Song", 65:"Big Beat", 51:"Black Metal", 4:"Bluegrass", 5:"Blues", 105:"Bossa Nova", 60:"Breakbeat", 129:"Breakcore", 118:"Celtic", 109:"Children", 134:"Chinese pop", 135:"Chinese traditional", 6:"Christian", 7:"Christian Rap", 8:"Classic Rock", 77:"Classical", 110:"Classical - Opera and Vocal", 9:"Club", 10:"Comedy", 126:"Concrete", 11:"Country", 12:"Death Metal", 63:"Disco House", 70:"Down-tempo", 50:"Drum & Bass", 68:"Dub", 123:"Dutch pop", 67:"Electro", 127:"Electroacoustic", 13:"Electronica", 14:"Emo", 133:"Emotronic", 15:"Experimental", 107:"Flamenco", 16:"Folk", 17:"Folk Rock", 119:"French pop", 18:"Funk", 124:"Fusion", 56:"Garage", 120:"German pop", 79:"Glam", 112:"Gospel", 46:"Gothic", 95:"Grime", 47:"Grindcore", 19:"Grunge", 71:"Happy Hardcore", 57:"Hard House", 20:"Hardcore", 104:"Healing & EasyListening", 21:"Hip Hop", 22:"House", 69:"IDM", 97:"Idol", 23:"Indie", 45:"Industrial", 121:"Italian pop", 24:"Jam Band", 103:"Japanese Classic Music", 100:"Japanese Pop", 25:"Jazz", 58:"Jungle", 101:"Korean Pop", 49:"Latin", 128:"Live Electronics", 
75:"Lounge", 113:"Lyrical", 102:"Melodramatic Popular Song", 26:"Metal", 131:"Minimalist", 76:"New Wave", 66:"Nu-Jazz", 27:"Other", 28:"Pop", 29:"Pop Punk", 130:"Post punk", 31:"Powerpop", 32:"Progressive", 62:"Progrsv House", 33:"Psychedelic", 43:"Psychobilly", 34:"Punk", 35:"R&B", 36:"Rap", 37:"Reggae", 111:"Religious", 38:"Rock", 44:"Rockabilly", 94:"Roots Music", 115:"Salsa", 116:"Samba", 39:"Screamo", 78:"Shoegaze", 96:"Showtunes", 40:"Ska", 41:"Soul", 106:"Soundtracks / Film music", 42:"Southern Rock", 122:"Spanish pop", 48:"Surf", 114:"Swing", 108:"Tango", 53:"Techno", 54:"Thrash", 52:"Trance", 132:"Trance", 55:"Trip Hop", 92:"Tropical", 99:"Visual", 117:"Zouk"} + def setRDFStoreURL(url): '''set the rdf uri path''' rdfStoreURL = url This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
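Editor's sketch, not part of the commit: the number-based genre lookup introduced in revision 320 could look roughly like this in Python 3. Only the `MySpace.Ads.BandType = {` / `}` delimiters, the ontology prefix and the idea of mapping quoted numbers through a dict come from the diff. The sample page fragment (including its key names), the four-entry `GENRE_DICT` and the helper name `genres_from_band_type` are assumptions for illustration. The sketch also swaps the diff's `(.|..|...)` pattern for a digits-only pattern and skips unknown or empty genres instead of raising `KeyError`.

```python
import re
from urllib.parse import quote

MYSPACE_ONTOLOGY = 'http://grasstunes.net/ontology/myspace.owl#'
GENRE_TAG = ('MySpace.Ads.BandType = {', '}')

# Tiny excerpt of the genre table; the full table is in the diff above.
GENRE_DICT = {0: "", 2: "Alternative", 23: "Indie", 50: "Drum & Bass"}


def genres_from_band_type(page):
    """Return [(name, uri), ...] for the genre numbers found in page.

    Returns None when the BandType block is missing altogether.
    """
    start = page.find(GENRE_TAG[0])
    if start == -1:
        return None
    start += len(GENRE_TAG[0])
    end = page.find(GENRE_TAG[1], start)
    if end == -1:
        return None
    block = page[start:end]

    genres = []
    # Genre ids appear as quoted one- to three-digit numbers after a colon.
    for number in re.findall(r':"(\d{1,3})"', block):
        name = GENRE_DICT.get(int(number))
        if not name:
            continue  # unknown id, or the empty genre 0
        genres.append((name, MYSPACE_ONTOLOGY + quote(name)))
    return genres


# Hypothetical page fragment; the real key names may differ.
sample = ('var x; MySpace.Ads.BandType = '
          '{"primary":"50","secondary":"23","tertiary":"0"}; var y;')
for name, uri in genres_from_band_type(sample):
    print(name, uri)
```

Quoting the genre name matters for entries such as "Drum & Bass", where the space and the ampersand would otherwise end up unescaped in the URI.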
From: <gea...@us...> - 2009-01-27 13:55:12
|
Revision: 319 http://mypyspace.svn.sourceforge.net/mypyspace/?rev=319&view=rev Author: gearmonkey Date: 2009-01-27 13:55:06 +0000 (Tue, 27 Jan 2009) Log Message: ----------- fixed the image url problem, now actually point to song images not myspace default filler images. added a bit more inline documentation to the api examples code. Modified Paths: -------------- myspaceCrawler/trunk/examples.py myspaceCrawler/trunk/mpsUser.py Modified: myspaceCrawler/trunk/examples.py =================================================================== --- myspaceCrawler/trunk/examples.py 2009-01-27 12:43:42 UTC (rev 318) +++ myspaceCrawler/trunk/examples.py 2009-01-27 13:55:06 UTC (rev 319) @@ -6,7 +6,7 @@ some simple functions demonstrating mpsUser and mpsSong functionality Created by Benjamin Fields on 2008-11-09. -Copyright (c) 2008 __MyCompanyName__. All rights reserved. +Copyright (c) 2008 Goldsmiths. All rights reserved. """ import sys @@ -36,7 +36,7 @@ return 0 def socialCharts(initArtist, radius, chartLength=1): - '''breadth first crawl of width radius to find most chartLength popular songs from the center initArtist.''' + '''breadth first crawl of width radius to find at most chartLength popular songs from the center initArtist.''' songQueue = [] visitedArtists = [] artistsInThisLevel = [initArtist] Modified: myspaceCrawler/trunk/mpsUser.py =================================================================== --- myspaceCrawler/trunk/mpsUser.py 2009-01-27 12:43:42 UTC (rev 318) +++ myspaceCrawler/trunk/mpsUser.py 2009-01-27 13:55:06 UTC (rev 319) @@ -46,11 +46,13 @@ isArtist -- Boolean, True means instance describes a MySpace artist with media rdfprefix -- prefix for all rdf UIRs page -- locally loaded copy of html pointed to by source + The following are only set if user is found to be an artist mediaXML -- locally loaded (via miniDom) copy of xml describing playlist of media assciated - with myspace Artist (not set in non artists) + with myspace Artist totalPlays -- sum 
of playcounts of all songs associated with myspace Artist - (not set in non Artist) - artist -- self declared name of artist (not set in non Artist) + artist -- self declared name of artist + artistID -- unique ID possessed by artists only, needed to retrieve media and media related meta data + playlistID -- unique ID used to retrieve playlist found on page ''' @@ -342,7 +344,7 @@ else: self.extractionprefix = extractionprefix self.title = self.exhaustiveXML.getElementsByTagName('title')[0].firstChild.nodeValue - self.image = self.exhaustiveXML.getElementsByTagName('small')[0].firstChild.nodeValue + self.getimage() self.playcount = xmlNode.getElementsByTagName('stats')[0].getAttribute('plays') self.comments = "" #this is a blank string hold for the comments fields. Might be used later. self.trackNum, self.totalTracks = None, None @@ -357,10 +359,31 @@ try: self.uri = self.exhaustiveXML.getElementsByTagName('link')[0].firstChild.nodeValue except AttributeError, err: - logging.info("mpsUser::getUri ran into a problem finding the download link for a song by artist with uid: " + + logging.info("mpsUser::mpsSong::getUri ran into a problem finding the download link for a song by artist with uid: " + str(self.parent().uid) + " link will be left blank.\n\tError msg: " + str(err)) self.uri = '' - + def getimage(self): + '''find an image associated with the song, getting the largest resolution available''' + try: + self.image = self.exhaustiveXML.getElementsByTagName('track')[0].getElementsByTagName('large')[0].firstChild.nodeValue + except AttributeError: + try: + self.image = self.exhaustiveXML.getElementsByTagName('track')[0].getElementsByTagName('medium')[0].firstChild.nodeValue + except AttributeError: + try: + self.image = self.exhaustiveXML.getElementsByTagName('track')[0].getElementsByTagName('small')[0].firstChild.nodeValue + except Exception, err: + logging.info("mpsUser::mpsSong::getimage ran into a problem finding an image for a song by artist with uid: " + + 
str(self.parent().uid) + " image will be left blank.\n\tError msg: " + str(err)) + self.image = '' + except Exception, err: + logging.info("mpsUser::mpsSong::getimage ran into a problem finding an image for a song by artist with uid: " + + str(self.parent().uid) + " image will be left blank.\n\tError msg: " + str(err)) + self.image = '' + except Exception, err: + logging.info("mpsUser::mpsSong::getimage ran into a problem finding an image for a song by artist with uid: " + + str(self.parent().uid) + " image will be left blank.\n\tError msg: " + str(err)) + self.image = '' def setTrackNum(self, trackNumber, totalTracks): '''set the track number for this song and the number of tracks in the album it is in.''' self.trackNum = trackNumber This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
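Editor's sketch, not part of the commit: the large, medium, small fallback in `getimage` can be written as a loop over the preferred sizes. In the nested version above, a missing element raises `IndexError` from the `[0]` lookup rather than `AttributeError`, so it falls through to the outer handler and the smaller sizes are never tried; only an empty element (whose `firstChild` is `None`) triggers the fallback. The Python 3 sketch below treats both cases the same way. The XML layout of the sample and the name `pick_image` are assumptions for illustration.

```python
from xml.dom import minidom


def pick_image(xml_text, sizes=('large', 'medium', 'small')):
    """Return the URL of the largest available image, or '' if none."""
    doc = minidom.parseString(xml_text)
    tracks = doc.getElementsByTagName('track')
    if not tracks:
        return ''
    for size in sizes:
        # An absent tag yields an empty node list, so the loop body is
        # skipped; an empty tag has firstChild None and is skipped too.
        for node in tracks[0].getElementsByTagName(size):
            child = node.firstChild
            if child is not None and child.nodeValue:
                return child.nodeValue.strip()
    return ''


# Hypothetical response: <large/> is present but empty.
sample = ('<rsp><track>'
          '<small>http://example.com/s.jpg</small>'
          '<medium>http://example.com/m.jpg</medium>'
          '<large/>'
          '</track></rsp>')
print(pick_image(sample))
```

With one return path for "nothing found", the three identical logging blocks in the diff would collapse into a single call at the call site.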
From: <gea...@us...> - 2009-01-27 13:19:12
|
Revision: 318
http://mypyspace.svn.sourceforge.net/mypyspace/?rev=318&view=rev
Author: gearmonkey
Date: 2009-01-27 12:43:42 +0000 (Tue, 27 Jan 2009)

Log Message:
-----------
I think things are finally back in working order after the transition to MySpace Music 2.0. It turns out the new artistID and playlistID identifiers are terminated with ampersands, not commas. I don't know where I got the comma termination from, but it's fixed now. These changes need to get merged into the web-serv branch; I'll sort that out later today. Also, while testing these I noticed that the song pictures are all getting filled in with the MySpace no-photo icon. Not sure where that's coming from, but I'll try to sort that out in the near term as well.

Modified Paths:
--------------
myspaceCrawler/trunk/myspaceuris.py

Modified: myspaceCrawler/trunk/myspaceuris.py
===================================================================
--- myspaceCrawler/trunk/myspaceuris.py 2009-01-26 10:28:38 UTC (rev 317)
+++ myspaceCrawler/trunk/myspaceuris.py 2009-01-27 12:43:42 UTC (rev 318)
@@ -15,7 +15,7 @@
 # new tag updated 13/1/2009
 #""" <a href="http://profile.myspace.com/index.cfm?fuseaction=user.viewprofile&friendid=""", '''"'''
 # ### tag will be terminated by a '"' ###
-friendNameTag = '''_friendLink">''', '''<'''
+friendNameTag = '''_friendLink">''', '''</a>'''
 ### tag terminated by '<' ###
 userIDTag = '''"DisplayFriendId":''', ''',"IsLoggedIn"'''
 # 13/1/2009
@@ -41,9 +41,9 @@
 ###
 #these two tag scraps are provisional for grabbing the ArtistID and playlist number, which are now nessecary to grab audio
-#both of these should be terminated by a comma
-playlistIDtag = """plid=""", ''','''
-artistIDtag = """artid=""",''','''
+#both of these should be terminated by an ampersand
+playlistIDtag = """plid=""", '''&'''
+artistIDtag = """artid=""",'''&'''
 #########################################################################################################
|
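Editor's sketch, not part of the commit: why the terminator matters for the `plid=` and `artid=` tags. The project's own `scrapePage` helper is not shown in this archive, so `scrape_between` below is a hypothetical Python 3 stand-in that returns the text between a start tag and the next terminator. The sample URL fragment is also an assumption.

```python
def scrape_between(page, start_tag, terminator):
    """Return the text between start_tag and the next terminator.

    Returns None if either delimiter cannot be found.
    """
    start = page.find(start_tag)
    if start == -1:
        return None
    start += len(start_tag)
    end = page.find(terminator, start)
    if end == -1:
        return None
    return page[start:end]


# Tag pairs as (start, terminator), mirroring myspaceuris.py.
PLAYLIST_ID_TAG = ('plid=', '&')
ARTIST_ID_TAG = ('artid=', '&')

# Hypothetical query string of the kind embedded in a profile page.
sample = 'player.swf?artid=12345&plid=678&profid=42'

print(scrape_between(sample, *ARTIST_ID_TAG))    # 12345
print(scrape_between(sample, *PLAYLIST_ID_TAG))  # 678
# With the old comma terminator nothing is found in this sample:
print(scrape_between(sample, 'artid=', ','))     # None
```

On a real page a comma terminator could instead match a comma much further along and return a long run of unrelated markup, which may be harder to notice than a plain `None`.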
From: <gea...@us...> - 2009-01-26 10:28:43
|
Revision: 317 http://mypyspace.svn.sourceforge.net/mypyspace/?rev=317&view=rev Author: gearmonkey Date: 2009-01-26 10:28:38 +0000 (Mon, 26 Jan 2009) Log Message: ----------- fixed lots of spelling errors in the newly added bits of myspace2rdf. it's scrape not scrap or crap and scraping not scrapping good. Modified Paths: -------------- musicGrabber/branches/webserv-branch/myspace2rdf.py Modified: musicGrabber/branches/webserv-branch/myspace2rdf.py =================================================================== --- musicGrabber/branches/webserv-branch/myspace2rdf.py 2009-01-25 17:14:03 UTC (rev 316) +++ musicGrabber/branches/webserv-branch/myspace2rdf.py 2009-01-26 10:28:38 UTC (rev 317) @@ -131,10 +131,10 @@ def createArtistRDF(self): '''write RDF for an artist page''' - if self.scrapArtistID() and self.scrapPlaylistNumber(): + if self.scrapeArtistID() and self.scrapePlaylistNumber(): pass else: - print 'crap failed' + print 'scrape failed' # get the image imageURL = scrapePage(self.page, [picTag[0]+str(self.uid)+'''"><img src="'''], picTag[1]) @@ -190,22 +190,22 @@ p = f.read() print p - def scrapArtistID(self): - '''attempt to find via scrap of page the internal artist number.''' + def scrapeArtistID(self): + '''attempt to find via scrape of page the internal artist number.''' try: self.artistID = scrapePage(self.page, [artistIDtag[0]], artistIDtag[1]) return True except Exception, err: - print "Ran into trouble trying to scrap the ArtistID for page from " + self.source + "\nError::" + str(err) + print "Ran into trouble trying to scrape the ArtistID for page from " + self.source + "\nError::" + str(err) return False - def scrapPlaylistNumber(self): - """attempts to find via scrap of the internal identifier of an artist's playlist of songs""" + def scrapePlaylistNumber(self): + """attempts to find via scrape of the internal identifier of an artist's playlist of songs""" try: self.playlistID = scrapePage(self.page, [playlistIDtag[0]], playlistIDtag[1]) return 
True except Exception, err: - print "Ran into trouble trying to scrap the playlistID for page from " + self.source + "\nError::" + str(err) + print "Ran into trouble trying to scrape the playlistID for page from " + self.source + "\nError::" + str(err) return False This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <ku...@us...> - 2009-01-25 17:14:06
|
Revision: 316 http://mypyspace.svn.sourceforge.net/mypyspace/?rev=316&view=rev Author: kurtjx Date: 2009-01-25 17:14:03 +0000 (Sun, 25 Jan 2009) Log Message: ----------- added tracks to webserv Modified Paths: -------------- musicGrabber/branches/webserv-branch/myspace2rdf.py musicGrabber/branches/webserv-branch/myspaceuris.py Modified: musicGrabber/branches/webserv-branch/myspace2rdf.py =================================================================== --- musicGrabber/branches/webserv-branch/myspace2rdf.py 2009-01-23 17:20:28 UTC (rev 315) +++ musicGrabber/branches/webserv-branch/myspace2rdf.py 2009-01-25 17:14:03 UTC (rev 316) @@ -98,13 +98,12 @@ def isArtist(self): '''is current page an artist??? - - currently check for the flash player - - should switch to check for genre tags instead???''' + - previously checked for the flash player + - new check for genre tags instead''' if self.page: - - player = scrapePage(self.page, [playerTag[0]], playerTag[1]) - if player: + genrePresent = scrapePage(self.page, [genreTag[0]], genreTag[1]) + if genrePresent: self.subject = mopy.mo.MusicArtist(dbtuneMyspace+'uid/'+str(self.uid)) #self.subjecttwo = mopy.foaf.Person('http://dbtune.org/myspace/uid/'+str(self.uid)) #self.subject = mopy.mo.MusicArtist('http://dbtune.org/myspace/uid/'+str(self.uid)) @@ -132,6 +131,11 @@ def createArtistRDF(self): '''write RDF for an artist page''' + if self.scrapArtistID() and self.scrapPlaylistNumber(): + pass + else: + print 'crap failed' + # get the image imageURL = scrapePage(self.page, [picTag[0]+str(self.uid)+'''"><img src="'''], picTag[1]) img = mopy.foaf.Image(imageURL) @@ -150,19 +154,25 @@ # self.mi.add(thing2) idx=0 - xmlPage = try_open(mediaBase + str(self.uid)) - xmlStruct = dom.parseString(''.join(xmlPage.readlines())) - songs = xmlStruct.getElementsByTagName('song') - for song in songs: - try: + + xmlPage = try_open(mediaBase[0] + str(self.artistID) + mediaBase[1] + str(self.playlistID) + mediaBase[2] + str(self.uid) + 
mediaBase[3]) + #print mediaBase[0] + str(self.artistID) + mediaBase[1] + str(self.playlistID) + mediaBase[2] + str(self.uid) + mediaBase[3] + self.xmlStruct = dom.parseString(''.join(xmlPage.readlines())) + + songList = self.xmlStruct.getElementsByTagName('song') + for song in songList: + '''try: songTitle = unicodedata.normalize('NFKC',song.getAttribute('title')).encode('ascii','ignore') except AttributeError, err: songTitle = str(None) except IndexError, err: songTitle = str(None) - availableAs = song.getAttribute('durl') + #availableAs = song.getAttribute('durl')''' + thisSong = mpsSong(self, song, 'downloadprefix') + thisSong.getUri() + availableAs = thisSong.uri track = mopy.mo.Track() - track.title.set(songTitle) + track.title.set(thisSong.title) avas = mopy.mo.MusicalItem(availableAs) track.available_as.set(avas) @@ -180,7 +190,25 @@ p = f.read() print p - + def scrapArtistID(self): + '''attempt to find via scrap of page the internal artist number.''' + try: + self.artistID = scrapePage(self.page, [artistIDtag[0]], artistIDtag[1]) + return True + except Exception, err: + print "Ran into trouble trying to scrap the ArtistID for page from " + self.source + "\nError::" + str(err) + return False + + def scrapPlaylistNumber(self): + """attempts to find via scrap of the internal identifier of an artist's playlist of songs""" + try: + self.playlistID = scrapePage(self.page, [playlistIDtag[0]], playlistIDtag[1]) + return True + except Exception, err: + print "Ran into trouble trying to scrap the playlistID for page from " + self.source + "\nError::" + str(err) + return False + + def createRDF(self): '''write the info to RDF for non-artist page''' match = re.findall('viewAlbums&friendID='+str(self.uid)+'">\s*<img border="\d*" alt="[^"]*" src="([^"]*?)"', str(self.page)) @@ -245,14 +273,147 @@ for genre in genres: genre = genre.rstrip() genre = genre.lstrip() - g = mopy.mo.Genre('http://grasstunes.net/ontology/myspace.owl#'+urllib.quote(str(genre))) + g = 
mopy.mo.Genre(myspaceOntology+urllib.quote(str(genre))) g.name.set(genre) self.mi.add(g) self.subject.genreTag.add(g) genresfixed.append(genre) return genresfixed + +class mpsSong: + """a class that wraps around the downloading, feature extracting and modeling of a piece of media attached to a mpsUser + mpsSong object instances have the following public variables: + parent -- a weakref to the mpsUser that generated the mpsSong instance + uri -- lo res cached download link + betterUri -- hi res cached download link (not always available) + downloadprefix -- local prefix to stick the file when downloaded + extractionprefix -- local prefix to stick the feature files when extracted + title -- title of song + image -- url to get image associated with song + playcount -- number of times song has been played via myspace player + trackNum -- track number based on order presented on myspace + totalTracks -- number of songs available for parent + filename -- name used for local lofi file, when downloaded + HIFIfilename -- name used for local hifi file, when downloaded + beats -- local name of beat segmentaton file, used to do variable segment length feature extraction + """ + def __init__(self, parent, xmlNode, downloadprefix = '', extractionprefix = ''): + """initializes the mpsSong class. Parent is a pointer to the calling mpsUser, xmlNode should be a DOM object with the songs info. downloadprefix is the local directory prefix where the media will be put, default is an empty string. If no extractionprefix is given, extracted features will be places in the dir pointed to by downloadprefix""" + #self.parent = weakref.ref(parent) + self.xmlNode = xmlNode + self.getUri() + #the nicer file download is currently broken... 
+ #self.betterURI = xmlNode.getAttribute('downloadable') + self.downloadprefix = downloadprefix + if extractionprefix == '': + self.extractionprefix = downloadprefix + else: + self.extractionprefix = extractionprefix + self.title = self.exhaustiveXML.getElementsByTagName('title')[0].firstChild.nodeValue + self.image = self.exhaustiveXML.getElementsByTagName('small')[0].firstChild.nodeValue + self.playcount = xmlNode.getElementsByTagName('stats')[0].getAttribute('plays') + self.comments = "" #this is a blank string hold for the comments fields. Might be used later. + self.trackNum, self.totalTracks = None, None + self.filename, self.HIFIfilename = None, None + self.beats = None + def getUri(self): + self.songID = self.xmlNode.getAttribute('songId') + xmlPage = try_open(songBase[0] + str(self.songID) + songBase[1]) + self.exhaustiveXML = dom.parseString(''.join(xmlPage.readlines())) + xmlPage.close() + try: + self.uri = self.exhaustiveXML.getElementsByTagName('link')[0].firstChild.nodeValue + except AttributeError, err: + logging.info("mpsUser::getUri ran into a problem finding the download link for a song by artist with uid: " + + str(self.parent().uid) + " link will be left blank.\n\tError msg: " + str(err)) + self.uri = '' + + def setTrackNum(self, trackNumber, totalTracks): + '''set the track number for this song and the number of tracks in the album it is in.''' + self.trackNum = trackNumber + self.totalTracks = totalTracks + + def download(self): + '''download the track. + Upon success set self.filename to the local location of the downloaded song and return true. 
+ On FAIL return false.''' + logging.debug("downloading " + self.title + " by " + self.parent().artist + " to " + self.downloadprefix) + if self.trackNum != None: + filename = unicode(str(self.trackNum), 'utf8') + u'_' + self.title + u'.mp3' + else: + filename = self.title + u'.mp3' + if try_get(self.uri, os.path.join(self.downloadprefix, filename)) != None: + logging.debug("success on " + self.title + " by " + self.parent().artist + " to " + os.path.join(self.downloadprefix,filename)) + self.filename = filename + return True + else: + logging.debug("FAIL on " + self.title + " by " + self.parent().artist + " to " + os.path.join(self.downloadprefix,filename)) + return False + + def downloadHIFI(self): + '''if it exists, download the hi fidelity version of the track. + Upon success set self.HIFIfilename to the local location of the downloaded song and return true. + On FAIL return false.''' + if not self.betterURI: + logging.info("NO hi-fi version of " + self.title + " by " + self.parent().artist + " but we did look for it.") + return False + logging.debug("downloading hifi copy of " + self.title + "by" + self.parent().artist + " to " + self.downloadprefix) + if self.trackNum != None: + filename = unicode(str(self.trackNum), 'utf8') + u'_' + self.title + u'_hifi.mp3' + else: + filename = self.title + u'_hifi.mp3' + if (try_get(self.betteruri, os.path.join(self.downloadprefix,filename)) != None): + logging.debug("success on hi-fi version of " + self.title + " by " + self.parent().artist + " to " + os.path.join(self.downloadprefix,filename)) + self.HIFIfilename = filename + return True + else: + logging.debug("FAIL on hi-fi version of " + self.title + " by " + self.parent().artist + " to " + os.path.join(self.downloadprefix,filename)) + return False + + + def tag(self, hifi = False): + '''create or modify the id3 tag for downloaded song associated with self. 
set optional hifi arg to tag the hifi download''' + if hifi: + fileToTag = os.path.join(self.downloadprefix,self.HIFIfilename) + else: + fileToTag = os.path.join(self.downloadprefix,self.filename) + if fileToTag == None: + logging.info("asked to tag a file associated with uid: " + str(self.parent().uid) + " but the song does not exist locally") + logging.debug("adding tags to " + fileToTag) + try: id3 = mutagen.id3.ID3(fileToTag) + except mutagen.id3.ID3NoHeaderError: + logging.info("No ID3 header found for " + fileToTag + "; creating tag from scratch") + id3 = mutagen.id3.ID3() + except Exception, err: + logging.error(str(err)) + return + id3.add(mutagen.id3.TIT2(encoding=3,text=self.title)) + id3.add(mutagen.id3.TPE1(encoding=3,text=self.parent().artist)) + id3.add(mutagen.id3.COMM(encoding=3,text=self.comments, lang="eng", desc="")) + #id3.add(mutagen.id3.COMM(encoding=3,text=relationshipLink, lang="eng", desc="MusicGrabberSig")) + id3.add(mutagen.id3.TALB(encoding=3,text=self.parent().album)) + if self.trackNum != None: + id3.add(mutagen.id3.TRCK(encoding=3,text=str(self.trackNum) + '/' + str(self.totalTracks))) + id3.add(mutagen.id3.POPM(encoding=3,email=str(self.parent().uid)+"@myspace", rating = 128, count=self.playcount)) + if self.image == None: + logging.error("No image present for " + self.title + ", " + self.parent().artist) + try: + logging.debug("trying to get image from " + self.image) + localImgPath, imgHeader = try_get(self.image, os.path.join("/tmp",os.path.basename(self.image))) + imgHandle = open(localImgPath) + id3.add(mutagen.id3.APIC(encoding=3, mime=imgHeader.type, data=imgHandle.read(), type=17, desc="Song pic from myspace.com")) + except: + logging.error("Unable to retieve image for " + self.title + ", " + self.parent().artist) + try: + id3.save(fileToTag) + except Exception, err: + logging.error(str(err) + ";couldn\'t save the tag for " + self.title + " by " + self.parent().artist) + + + + def main(argv=None): if argv is None: argv = 
sys.argv Modified: musicGrabber/branches/webserv-branch/myspaceuris.py =================================================================== --- musicGrabber/branches/webserv-branch/myspaceuris.py 2009-01-23 17:20:28 UTC (rev 315) +++ musicGrabber/branches/webserv-branch/myspaceuris.py 2009-01-25 17:14:03 UTC (rev 316) @@ -4,6 +4,8 @@ # ### append user id to this ### rdfStoreURL = "http://myrdfspace.com/alpha/" + +myspaceOntology = 'http://grasstunes.net/ontology/myspace.owl#' ######################################################################################################### ######################################################################################################### @@ -37,8 +39,8 @@ ### #these two tag scraps are provisional for grabbing the ArtistID and playlist number, which are now nessecary to grab audio #both of these should be terminated by a comma -playlistIDtag = """plid=""", ''',''' -artistIDtag = """artid=""",''',''' +playlistIDtag = """plid=""", '''&''' +artistIDtag = """artid=""",'''&''' ######################################################################################################### This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <gea...@us...> - 2009-01-23 17:20:32
|
Revision: 315
http://mypyspace.svn.sourceforge.net/mypyspace/?rev=315&view=rev
Author: gearmonkey
Date: 2009-01-23 17:20:28 +0000 (Fri, 23 Jan 2009)

Log Message:
-----------
oops, used the wrong name for RDFtrans internal html representation

Modified Paths:
--------------
myspaceCrawler/trunk/RDFtrans.py

Modified: myspaceCrawler/trunk/RDFtrans.py
===================================================================
--- myspaceCrawler/trunk/RDFtrans.py 2009-01-23 16:57:40 UTC (rev 314)
+++ myspaceCrawler/trunk/RDFtrans.py 2009-01-23 17:20:28 UTC (rev 315)
@@ -61,7 +61,7 @@
     def isArtist(self):
         '''is current page an artist???'''
         if self.HTML:
-            if genreTag[0] in self.page:
+            if genreTag[0] in self.HTML:
                 artist = True
             else:
                 artist = False
|
From: <gea...@us...> - 2009-01-23 16:57:46
|
Revision: 314 http://mypyspace.svn.sourceforge.net/mypyspace/?rev=314&view=rev Author: gearmonkey Date: 2009-01-23 16:57:40 +0000 (Fri, 23 Jan 2009) Log Message: ----------- removed the playerTag as it's broken and unreliable. Replaced the artist check functionality by checking for genre formatting tags. The robustness of this method is to be determined, however it seems to work in most cases. Also, slightly altered some of the parameters used in url grabbing. Modified Paths: -------------- myspaceCrawler/trunk/RDFtrans.py myspaceCrawler/trunk/mpsUser.py myspaceCrawler/trunk/myspaceuris.py myspaceCrawler/trunk/tryurl.py Modified: myspaceCrawler/trunk/RDFtrans.py =================================================================== --- myspaceCrawler/trunk/RDFtrans.py 2009-01-16 17:59:29 UTC (rev 313) +++ myspaceCrawler/trunk/RDFtrans.py 2009-01-23 16:57:40 UTC (rev 314) @@ -61,12 +61,15 @@ def isArtist(self): '''is current page an artist???''' if self.HTML: - player = scrapePage(self.HTML, playerTag[0], playerTag[1]) + if genreTag[0] in self.page: + artist = True + else: + artist = False if not scrapePage(self.HTML, nameTag[0], nameTag[1]) == None: self.name = scrapePage(self.HTML, nameTag[0], nameTag[1]) else: self.name = str(None) - if player: + if artist: # make the mopy subject a myspace:MusicArtist self.subject = mopy.myspace.MusicArtist(self.NSprefix+str(self.uid)) # set the subject name Modified: myspaceCrawler/trunk/mpsUser.py =================================================================== --- myspaceCrawler/trunk/mpsUser.py 2009-01-16 17:59:29 UTC (rev 313) +++ myspaceCrawler/trunk/mpsUser.py 2009-01-23 16:57:40 UTC (rev 314) @@ -146,8 +146,8 @@ return xmlStruct def artistCheck(self): - '''for a given mpsUser with read source, check to see if it is an artist profile''' - if playerTag[0] in self.page: + '''for a given mpsUser with read source, check to see if it is an artist profile. 
This is done by examining the html source for the presence of genre labels. Note that even an artist without genre tags, will have these bits of markup, they will simply be blank.''' + if genreTag[0] in self.page: return True else: return False Modified: myspaceCrawler/trunk/myspaceuris.py =================================================================== --- myspaceCrawler/trunk/myspaceuris.py 2009-01-16 17:59:29 UTC (rev 313) +++ myspaceCrawler/trunk/myspaceuris.py 2009-01-23 16:57:40 UTC (rev 314) @@ -8,7 +8,8 @@ ######################################################################################################### # useful tags -playerTag = """SWFObject("http://musicservices.myspace.com/Modules/MusicServices/Services/Embed.ashx/ptype=4""", ''';''' +#the player tag is broken, so we're going to use the genre tag as an artist check +#playerTag = """SWFObject("http://musicservices.myspace.com/Modules/MusicServices/Services/Embed.ashx/ptype=4""", ''';''' # ### this tag will be terminated by a '.' ### friendTag = ''' <a href="http://profile.myspace.com/index.cfm?fuseaction=user.viewProfile&friendID=''', '''"''' # new tag updated 13/1/2009 Modified: myspaceCrawler/trunk/tryurl.py =================================================================== --- myspaceCrawler/trunk/tryurl.py 2009-01-16 17:59:29 UTC (rev 313) +++ myspaceCrawler/trunk/tryurl.py 2009-01-23 16:57:40 UTC (rev 314) @@ -5,8 +5,8 @@ #keepalive comes from the urlgrabber project, licensed under GPL and available here: http://linux.duke.edu/projects/urlgrabber/ import logging #changing to urllib2 and using a recently added timeout feature, so that the socket will timeout after TIMEOUT seconds -TIMEOUT = 12 -SLEEPTIME = .25 +TIMEOUT = 15 +SLEEPTIME = 5 #use the following three lines and import keepalive to use the keep alive urlopener This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <gea...@us...> - 2009-01-16 17:59:35
|
Revision: 313
          http://mypyspace.svn.sourceforge.net/mypyspace/?rev=313&view=rev
Author:   gearmonkey
Date:     2009-01-16 17:59:29 +0000 (Fri, 16 Jan 2009)

Log Message:
-----------
corrected the default base uri for rdf generated via mpsUser.

Modified Paths:
--------------
    musicGrabber/branches/webserv-branch/myspace2rdf.py
    myspaceCrawler/tags/0.8.1b_release/myspaceCrawler.py
    myspaceCrawler/trunk/mpsUser.py
    myspaceCrawler/trunk/myspaceuris.py

Modified: musicGrabber/branches/webserv-branch/myspace2rdf.py
===================================================================
--- musicGrabber/branches/webserv-branch/myspace2rdf.py 2009-01-16 17:24:24 UTC (rev 312)
+++ musicGrabber/branches/webserv-branch/myspace2rdf.py 2009-01-16 17:59:29 UTC (rev 313)
@@ -102,10 +102,8 @@
         - should switch to check for genre tags instead???'''
 
         if self.page:
-            #############################################
-            # kludge set playr to always flase for now ##
-            #############################################
-            player = False #= scrapePage(self.page, [playerTag[0]], playerTag[1])
+
+            player = scrapePage(self.page, [playerTag[0]], playerTag[1])
             if player:
                 self.subject = mopy.mo.MusicArtist(dbtuneMyspace+'uid/'+str(self.uid))
                 #self.subjecttwo = mopy.foaf.Person('http://dbtune.org/myspace/uid/'+str(self.uid))

Modified: myspaceCrawler/tags/0.8.1b_release/myspaceCrawler.py
===================================================================
--- myspaceCrawler/tags/0.8.1b_release/myspaceCrawler.py 2009-01-16 17:24:24 UTC (rev 312)
+++ myspaceCrawler/tags/0.8.1b_release/myspaceCrawler.py 2009-01-16 17:59:29 UTC (rev 313)
@@ -32,7 +32,7 @@
 
 
 
-THREAD_CAP = 30 #maximum number of threads allowed to be firing at once
+THREAD_CAP = 10000 #maximum number of threads allowed to be firing at once
 THREAD_STALL_TIME = 30 #length of time in seconds to wait until the thread count is checked again
 LOG_FILENAME = "musicCrawler.log" #name of logger file (path set at commandline)
 

Modified: myspaceCrawler/trunk/mpsUser.py
===================================================================
--- myspaceCrawler/trunk/mpsUser.py 2009-01-16 17:24:24 UTC (rev 312)
+++ myspaceCrawler/trunk/mpsUser.py 2009-01-16 17:59:29 UTC (rev 313)
@@ -55,7 +55,7 @@
 
     '''
 
-    def __init__(self, url, rdfprefix = dbtuneMyspace):
+    def __init__(self, url, rdfprefix = dbtuneMyspace + 'uid/'):
         """Initialization will set the source url, attempt to create a socket connection with the url and determine if this mpsUser is an artist. If the user given is an artist, the initialization will also scrape the top Friends. rdfprefix is the uri base prepended to the uids of other myspace resources, by default it is set to the dbtune live service"""
         self.source = url
         self.uid = -1

Modified: myspaceCrawler/trunk/myspaceuris.py
===================================================================
--- myspaceCrawler/trunk/myspaceuris.py 2009-01-16 17:24:24 UTC (rev 312)
+++ myspaceCrawler/trunk/myspaceuris.py 2009-01-16 17:59:29 UTC (rev 313)
@@ -46,7 +46,6 @@
 
 
 #########################################################################################################
-
 # myspace uri for downloads ----this has gotten a bit more complicated in the roll out of myspace's new media player
 # this xml file gives the songIDs, the songsIDs must be used individually to request another xml file that then contains the uri to the cached media
 #
@@ -75,7 +74,6 @@
 myspaceOwlURI = 'http://grasstunes.net/ontology/myspace.owl'
 dbtuneMyspace = 'http://dbtune.org/myspace/'
 
-
 countries = ['Afghanistan', 'Albania', 'Algeria', 'American Samoa','Andorra',
              'Angola','Anguilla','Antarctica','Antigua and Barbuda','Argentina',
              'Armenia','Aruba','Australia','Austria','Azerbaijan','Bahamas',

This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
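To make the effect of the r313 default concrete, here is a small sketch of how a resource URI is put together from the base URI and a uid. The base value is the dbtuneMyspace string from myspaceuris.py; the function name resource_uri is hypothetical and only used for illustration.

```python
# Base URI as defined in myspaceuris.py (dbtuneMyspace).
DBTUNE_MYSPACE = 'http://dbtune.org/myspace/'


def resource_uri(uid, rdfprefix=DBTUNE_MYSPACE + 'uid/'):
    """Build the URI of a myspace resource by appending its uid to the prefix.

    The default prefix already ends in 'uid/', which is what r313 changed;
    with the bare base URI the uid would be glued directly onto '/myspace/'.
    """
    return rdfprefix + str(uid)


if __name__ == "__main__":
    print(resource_uri(12345))                  # new default prefix
    print(resource_uri(12345, DBTUNE_MYSPACE))  # the pre-r313 default
```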
From: <gea...@us...> - 2009-01-16 17:24:34
|
Revision: 312
          http://mypyspace.svn.sourceforge.net/mypyspace/?rev=312&view=rev
Author:   gearmonkey
Date:     2009-01-16 17:24:24 +0000 (Fri, 16 Jan 2009)

Log Message:
-----------
The most notable change in this rev is the change to the genre tag. It was picking up loads of garbage for artists with no genres listed. This was fixed by removing all the whitespace from the scrape tag, replacing the closing tag (it used to be a single carriage return) and cleaning up the whitespace stripping mechanism for genre in RDFtrans. This seems to result in correct answers for artists with no listed genre (no genre entry in the rdf file) instead of gibberish. I think the rdf generated by RDFtrans inside the myspaceCrawler project is actually bordering on sensible now (it's been valid since r309, but now it actually makes sense). The most notable exception is that there are still some oddities in the myspace ontology namespace that need to be dealt with (the namespace is showing as default5 instead of myspace).

Modified Paths:
--------------
    myspaceCrawler/trunk/RDFtrans.py
    myspaceCrawler/trunk/myspaceuris.py
    myspaceCrawler/trunk/scraping.py

Modified: myspaceCrawler/trunk/RDFtrans.py
===================================================================
--- myspaceCrawler/trunk/RDFtrans.py 2009-01-15 20:06:13 UTC (rev 311)
+++ myspaceCrawler/trunk/RDFtrans.py 2009-01-16 17:24:24 UTC (rev 312)
@@ -86,18 +86,26 @@
         friendUIDs = scrapePageWhile(self.HTML, friendTag[0], friendTag[1])
         friendNames = scrapePageWhile(self.HTML, friendNameTag[0], friendNameTag[1])
         friendPics = scrapePageWhile(self.HTML, friendPicTag[0], friendPicTag[1])
-
-        for i in range(len(friendUIDs)):
-            friend = mopy.myspace.Agent(self.NSprefix + str(friendUIDs[i]))
+
+        if len(friendUIDs) != len(friendNames):
+            logging.info("Ther seems to be a different number of friend names (" + str(len(friendNames)) +
+                         ") than friend IDs (" + str(len(friendUIDs)) + ") scraped off uid #" + str(self.uid) +".\nverify rdf.")
+        if len(friendUIDs) != len(friendPics):
+            logging.info("Ther seems to be a different number of friend pictures (" + str(len(friendPics)) +
+                         ") than friend IDs (" + str(len(friendUIDs)) + ") scraped off uid #" + str(self.uid) +".\nverify rdf.")
+
+        for idx, friendUID in enumerate(friendUIDs):
+            friend = mopy.myspace.Agent(self.NSprefix + str(friendUID))
             try:
-                friend.name.set(friendNames[i])
+                friend.name.set(friendNames[idx])
+                logging.debug("adding friend with uid " + str(friendUID) + " whose name is " + str(friendNames[idx]))
             except Exception, err:
                 logging.error("A friend name mismatch occurred in the rdf translation.\nRDFtrans::getFriends::" + str(err))
             # refer to dbtune incase this friend isnt in crawl
-            thing = mopy.owl.Thing(dbtuneMyspace + 'uid/' + str(friendUIDs[i]))
+            thing = mopy.owl.Thing(dbtuneMyspace + 'uid/' + str(friendUID))
             friend.sameAs.set(thing)
             try:
-                img = mopy.foaf.Image(friendPics[i])
+                img = mopy.foaf.Image(friendPics[idx])
                 friend.depiction.add(img)
                 self.mi.add(img)
             except:
@@ -203,13 +211,13 @@
         genreraw = scrapePage(self.HTML, genreTag[0], genreTag[1])
         if genreraw == None:
             return genreraw
-        genreraw = str(genreraw).lstrip()
-        genreraw = genreraw.rstrip()
+        genreraw = str(genreraw).strip()
+        if genreraw == '':
+            return None
         genres = genreraw.split('/')
         genresfixed = []
         for genre in genres:
-            genre = genre.rstrip()
-            genre = genre.lstrip()
+            genre = genre.strip()
             g = mopy.mo.Genre(myspaceOwlURI+'#'+urllib.quote(str(genre)))
             g.name.set(genre)
             self.mi.add(g)

Modified: myspaceCrawler/trunk/myspaceuris.py
===================================================================
--- myspaceCrawler/trunk/myspaceuris.py 2009-01-15 20:06:13 UTC (rev 311)
+++ myspaceCrawler/trunk/myspaceuris.py 2009-01-16 17:24:24 UTC (rev 312)
@@ -10,11 +10,11 @@
 # useful tags
 playerTag = """SWFObject("http://musicservices.myspace.com/Modules/MusicServices/Services/Embed.ashx/ptype=4""", ''';'''
 # ### this tag will be terminated by a '.' ###
-friendTag = '''<td bgcolor="FFFFFF" align="center" valign="top" width="107" style="word-wrap:break-word">\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t <a href="http://profile.myspace.com/index.cfm?fuseaction=user.viewProfile&friendID=''', '''"'''
+friendTag = ''' <a href="http://profile.myspace.com/index.cfm?fuseaction=user.viewProfile&friendID=''', '''"'''
 # new tag updated 13/1/2009
 #""" <a href="http://profile.myspace.com/index.cfm?fuseaction=user.viewprofile&friendid=""", '''"'''
 # ### tag will be terminated by a '"' ###
-friendNameTag = """_friendLink">""", '''<'''
+friendNameTag = '''_friendLink">''', '''<'''
 ### tag terminated by '<' ###
 userIDTag = '''"DisplayFriendId":''', ''',"IsLoggedIn"'''
 # 13/1/2009
@@ -24,7 +24,8 @@
 # ### tag terminated by a ';' ###
 nameTag = """<span class="nametext">""", '''<'''
 # ### tag term by '<'
-genreTag = '''<font color="#033330" size="1" face="Arial, Helvetica, sans-serif"><strong>\r\n\t\t\t\t\t''', ''' \r'''
+#the returned identifier will inevitably be surrouned by whitespace that will need to be stripped
+genreTag = '''<font color="#033330" size="1" face="Arial, Helvetica, sans-serif"><strong>''', '''</strong>'''
 #'''<font color="#033330" size="1" face="Arial, Helvetica, sans-serif"><strong>''', '''<'''
 # ### tag terminated by '<'
 niceURLTag = '''<td><div align="left"> <span><a href="''', '''">'''

Modified: myspaceCrawler/trunk/scraping.py
===================================================================
--- myspaceCrawler/trunk/scraping.py 2009-01-15 20:06:13 UTC (rev 311)
+++ myspaceCrawler/trunk/scraping.py 2009-01-16 17:24:24 UTC (rev 312)
@@ -35,7 +35,7 @@
     logging.debug("Found identifier : "+identifier)
     return identifier;
 
-def scrapePageWhile(page, patterns, termChar):
+def scrapePageWhile(page, pattern, termChar):
     """Scrape the page given for each pattern and return a list with each
     identifier occurring after the last pattern (which is assumed to be
     terminated by termChar)"""
@@ -44,8 +44,8 @@
     idx_end = len(page)
     identifiers = []
     itsFound = 1
+    logging.debug("pattern : "+ pattern)
     while itsFound:
-        pattern = patterns
         idx = page.find(pattern, idx)
         #logging.debug("idx = "+str(idx))
         if (idx > idx_end): # Couldn't find this pattern before re-occurrence of last pattern
@@ -59,7 +59,7 @@
         #logging.debug("idx_end = "+str(idx_end))
 
         if idx != -1:
-            idx += len(patterns)
+            idx += len(pattern)
             # idx should now point to the start of the identifier we want
             id_end = page.find(termChar, idx)
             identifier = unicode(page[idx:id_end], 'utf8')

This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
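The genre clean-up described in the r312 log message can be sketched as a standalone function. This is a rough Python 3 illustration under a few assumptions: scrape_between and parse_genres are hypothetical names (the project uses scrapePage and a method on the RDFtrans class), urllib.parse.quote stands in for Python 2's urllib.quote, and plain (name, uri) tuples stand in for the mopy.mo.Genre objects.

```python
from urllib.parse import quote

# Values copied from myspaceuris.py as of r312.
MYSPACE_OWL_URI = 'http://grasstunes.net/ontology/myspace.owl'
GENRE_TAG = (
    '<font color="#033330" size="1" face="Arial, Helvetica, sans-serif"><strong>',
    '</strong>',
)


def scrape_between(page, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    idx = page.find(start)
    if idx == -1:
        return None
    idx += len(start)  # idx now points at the first character after the tag
    stop = page.find(end, idx)
    if stop == -1:
        return None
    return page[idx:stop]


def parse_genres(page):
    """Return a list of (genre name, genre URI) pairs, or None if no genre is listed."""
    raw = scrape_between(page, GENRE_TAG[0], GENRE_TAG[1])
    if raw is None:
        return None
    raw = raw.strip()  # the scraped value is surrounded by whitespace
    if raw == '':
        # Artist markup is present but no genre was filled in.
        return None
    genres = []
    for genre in raw.split('/'):
        genre = genre.strip()
        genres.append((genre, MYSPACE_OWL_URI + '#' + quote(genre)))
    return genres


if __name__ == "__main__":
    page = GENRE_TAG[0] + '\r\n\t\t\t\t\tRock / Hip Hop \r\n' + GENRE_TAG[1]
    print(parse_genres(page))
```

Returning None for an empty genre field is what keeps artists with no listed genre from producing a gibberish genre entry in the generated rdf.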
From: Benjamin F. <b.f...@go...> - 2009-01-16 00:07:32
|
Sorry if anyone else is actually here; this is just a test.

Benjamin Fields
PhD Candidate
Dept. of Computing
Goldsmiths College, University of London
b.f...@go...
mobile: +44 (0)796 106 1568

"Which is more musical: a truck passing by a factory or a truck passing by a music school?" --John Cage |
From: Kurt J <ku...@gm...> - 2007-09-21 12:12:54
|
ok, i committed the trunk/adding-rdf-branch merge in adding-rdf-branch. seems to be working... |
From: Kurt J <ku...@gm...> - 2007-09-21 10:39:33
|
Hey dude,

The merge has been a bit rocky. SVN seemed to just leave some things out. Not sure, my svn skills kinda suck, so I'm going through by hand and checking trunk vs. mine.

One question - do you hate logging? Cuz I think it's pretty cool. It does break on py 2.3, but on 2.4+ it's the bee's knees.

Also, I've done some splitting of things into more files, so this might make more merging woes in the future... but it seemed like the thing to do.

-kurt j |
From: Benjamin F. <ma...@go...> - 2007-09-11 14:46:11
|
So I have created a branch for the implementation of id3 tag writing for the files that are downloaded. It can be found at the following svn path: /mypyspace/branches/adding-ID3-branch Also, is anyone besides me on this list? Ben Fields PhD Student Dept. of Computing Goldsmiths College, University of London e: ma...@go... p: +44 (0) 20 7078 5170 |