mypyspace-developer Mailing List for MyPySpace (Page 2)

Status: Pre-Alpha

Brought to you by: gearmonkey, kurtjx

mypyspace-developer: MyPy developer-related discussion


[Mypyspace-developer] SF.net SVN: mypyspace:[331] musicGrabber/branches/webserv-branch/myspace2rdf.py

From: <ku...@us...> - 2009-02-26 10:49:32

Revision: 331
          http://mypyspace.svn.sourceforge.net/mypyspace/?rev=331&view=rev
Author:   kurtjx
Date:     2009-02-26 10:49:27 +0000 (Thu, 26 Feb 2009)

Log Message:
-----------
fixed another bug - this one in mpsSong where the except argument causes some problems in line 350 or so - also checking if avas is None in line 180ish

Modified Paths:
--------------
    musicGrabber/branches/webserv-branch/myspace2rdf.py

Modified: musicGrabber/branches/webserv-branch/myspace2rdf.py
===================================================================
--- musicGrabber/branches/webserv-branch/myspace2rdf.py	2009-02-25 10:15:29 UTC (rev 330)
+++ musicGrabber/branches/webserv-branch/myspace2rdf.py	2009-02-26 10:49:27 UTC (rev 331)
@@ -177,16 +177,18 @@
 				#availableAs = song.getAttribute('durl')'''
 				thisSong = mpsSong(self, song, 'downloadprefix')
 				thisSong.getUri()
-				availableAs = thisSong.uri
+				
 				track = mopy.mo.Track()
 				track.title.set(thisSong.title)
-			
-				avas = mopy.mo.MusicalItem(availableAs)
-				track.available_as.set(avas)
+				availableAs = thisSong.uri
+				if availableAs:
+					avas = mopy.mo.MusicalItem(availableAs)
+					track.available_as.set(avas)
+					self.mi.add(avas)
 				#track.available_as.set(mopy.rdfs.Resource(availableAs))
 				self.subject.made.add(track)
 				self.mi.add(track)
-				self.mi.add(avas)
+				
 		
 		self.createCommonRDF()
 		self.scrapeGenre()
@@ -346,8 +348,9 @@
 		try:
 			self.uri = self.exhaustiveXML.getElementsByTagName('link')[0].firstChild.nodeValue
 		except AttributeError, err:
-			logging.info("mpsUser::getUri ran into a problem finding the download link for a song by artist with uid: " + 
-				str(self.parent().uid) + " link will be left blank.\n\tError msg: " + str(err))
+			#logging.info("mpsUser::getUri ran into a problem finding the download link for a song by artist with uid: " + 
+			#	str(self.parent().uid) + " link will be left blank.\n\tError msg: " + str(err))
+			pass
 			self.uri = ''
 
 	def setTrackNum(self, trackNumber, totalTracks):
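
The guard added in r331 (only attach `available_as` when a download URI was actually found) can be illustrated in isolation. What follows is a minimal Python 3 sketch, not the project's code, which targets Python 2 and mopy. The names `first_link` and `describe_track` are hypothetical, and a plain dict stands in for the mopy track object. The sketch also covers a page with no `<link>` element at all, which would raise `IndexError` rather than the `AttributeError` that `getUri` catches.

```python
from xml.dom import minidom


def first_link(xml_text):
    """Return the text of the first <link> element, or '' when there is none."""
    doc = minidom.parseString(xml_text)
    links = doc.getElementsByTagName("link")
    # No <link> at all, or an empty <link/>: leave the URI blank.
    if len(links) == 0 or links[0].firstChild is None:
        return ""
    return links[0].firstChild.nodeValue


def describe_track(title, xml_text):
    """Build a plain dict for a track; 'available_as' is set only for a real URI."""
    track = {"title": title}
    uri = first_link(xml_text)
    if uri:  # same idea as `if availableAs:` in the diff
        track["available_as"] = uri
    return track
```

Returning an empty string instead of raising keeps the caller's check to a single truthiness test, which is the shape the commit settles on.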


This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.

[Mypyspace-developer] SF.net SVN: mypyspace:[330] musicGrabber/branches/webserv-branch/myspaceuris.py

From: <ku...@us...> - 2009-02-25 10:15:32

Revision: 330
          http://mypyspace.svn.sourceforge.net/mypyspace/?rev=330&view=rev
Author:   kurtjx
Date:     2009-02-25 10:15:29 +0000 (Wed, 25 Feb 2009)

Log Message:
-----------
fixed myspace ontology prefix in myspaceuris.py

Modified Paths:
--------------
    musicGrabber/branches/webserv-branch/myspaceuris.py

Modified: musicGrabber/branches/webserv-branch/myspaceuris.py
===================================================================
--- musicGrabber/branches/webserv-branch/myspaceuris.py	2009-02-25 10:13:56 UTC (rev 329)
+++ musicGrabber/branches/webserv-branch/myspaceuris.py	2009-02-25 10:15:29 UTC (rev 330)
@@ -5,7 +5,7 @@
 #						### append user id to this ###
 rdfStoreURL = "http://myrdfspace.com/alpha/"
 
-myspaceOntology = 'http://purl.org/ontology/myspace.owl#'
+myspaceOntology = 'http://purl.org/ontology/myspace#'
 #########################################################################################################
 
 #########################################################################################################



[Mypyspace-developer] SF.net SVN: mypyspace:[329] musicGrabber/branches/webserv-branch/myspaceuris.py

From: <ku...@us...> - 2009-02-25 10:14:00

Revision: 329
          http://mypyspace.svn.sourceforge.net/mypyspace/?rev=329&view=rev
Author:   kurtjx
Date:     2009-02-25 10:13:56 +0000 (Wed, 25 Feb 2009)

Log Message:
-----------
fixed myspace ontology prefix in myspaceuris.py

Modified Paths:
--------------
    musicGrabber/branches/webserv-branch/myspaceuris.py

Modified: musicGrabber/branches/webserv-branch/myspaceuris.py
===================================================================
--- musicGrabber/branches/webserv-branch/myspaceuris.py	2009-02-25 10:06:09 UTC (rev 328)
+++ musicGrabber/branches/webserv-branch/myspaceuris.py	2009-02-25 10:13:56 UTC (rev 329)
@@ -5,7 +5,7 @@
 #						### append user id to this ###
 rdfStoreURL = "http://myrdfspace.com/alpha/"
 
-myspaceOntology = 'http://grasstunes.net/ontology/myspace.owl#'
+myspaceOntology = 'http://purl.org/ontology/myspace.owl#'
 #########################################################################################################
 
 #########################################################################################################



[Mypyspace-developer] SF.net SVN: mypyspace:[328] musicGrabber/branches/webserv-branch/myspace2rdf.py

From: <ku...@us...> - 2009-02-25 10:06:19

Revision: 328
          http://mypyspace.svn.sourceforge.net/mypyspace/?rev=328&view=rev
Author:   kurtjx
Date:     2009-02-25 10:06:09 +0000 (Wed, 25 Feb 2009)

Log Message:
-----------
fixed REM bug where an empty xmlPage made a crash at line 168 - now we just skip song info if xmlPage is None

Modified Paths:
--------------
    musicGrabber/branches/webserv-branch/myspace2rdf.py

Modified: musicGrabber/branches/webserv-branch/myspace2rdf.py
===================================================================
--- musicGrabber/branches/webserv-branch/myspace2rdf.py	2009-02-18 12:56:50 UTC (rev 327)
+++ musicGrabber/branches/webserv-branch/myspace2rdf.py	2009-02-25 10:06:09 UTC (rev 328)
@@ -160,35 +160,34 @@
 			# self.subject.sameAs.set(thing2)
 			# self.mi.add(thing2)
 			
-		idx=0
 		
 		xmlPage = try_open(mediaBase[0] + str(self.artistID) + mediaBase[1] + str(self.playlistID) + mediaBase[2] + str(self.uid) + mediaBase[3])
 		#print mediaBase[0] + str(self.artistID) + mediaBase[1] + str(self.playlistID) + mediaBase[2] + str(self.uid) + mediaBase[3]
-		self.xmlStruct = dom.parseString(''.join(xmlPage.readlines()))
-
-		songList = self.xmlStruct.getElementsByTagName('song')
-		for song in songList:
-			'''try:
-				songTitle = unicodedata.normalize('NFKC',song.getAttribute('title')).encode('ascii','ignore')
-			except AttributeError, err:
-				songTitle = str(None)
-			except IndexError, err:
-				songTitle = str(None)
-			#availableAs = song.getAttribute('durl')'''
-			thisSong = mpsSong(self, song, 'downloadprefix')
-			thisSong.getUri()
-			availableAs = thisSong.uri
-			track = mopy.mo.Track()
-			track.title.set(thisSong.title)
+		
+		if xmlPage:
+			self.xmlStruct = dom.parseString(''.join(xmlPage.readlines()))
+			songList = self.xmlStruct.getElementsByTagName('song')
+			for song in songList:
+				'''try:
+					songTitle = unicodedata.normalize('NFKC',song.getAttribute('title')).encode('ascii','ignore')
+				except AttributeError, err:
+					songTitle = str(None)
+				except IndexError, err:
+					songTitle = str(None)
+				#availableAs = song.getAttribute('durl')'''
+				thisSong = mpsSong(self, song, 'downloadprefix')
+				thisSong.getUri()
+				availableAs = thisSong.uri
+				track = mopy.mo.Track()
+				track.title.set(thisSong.title)
 			
-			avas = mopy.mo.MusicalItem(availableAs)
-			track.available_as.set(avas)
-			#track.available_as.set(mopy.rdfs.Resource(availableAs))
-			self.subject.made.add(track)
-			self.mi.add(track)
-			self.mi.add(avas)
+				avas = mopy.mo.MusicalItem(availableAs)
+				track.available_as.set(avas)
+				#track.available_as.set(mopy.rdfs.Resource(availableAs))
+				self.subject.made.add(track)
+				self.mi.add(track)
+				self.mi.add(avas)
 		
-			idx+=1
 		self.createCommonRDF()
 		self.scrapeGenre()
 		self.mi.add(self.subject)
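
The r328 fix skips song parsing when the playlist page could not be opened. A minimal Python 3 sketch of that pattern follows, under the assumption that the opener returns either a file-like object or `None`, as `try_open` appears to do in the diff. `song_titles` is a hypothetical name, and only the `title` attribute read in the original's commented-out code is used.

```python
import io
from xml.dom import minidom


def song_titles(xml_page):
    """Return the title attribute of every <song>, or [] when no page was fetched."""
    if xml_page is None:
        # Nothing was downloaded, so there is no song information to add.
        return []
    doc = minidom.parseString(xml_page.read())
    return [song.getAttribute("title") for song in doc.getElementsByTagName("song")]


# Usage: a fetched page versus a failed fetch.
page = io.StringIO('<playlist><song title="One"/><song title="Two"/></playlist>')
titles = song_titles(page)
no_titles = song_titles(None)
```

The rest of the profile can still be built when the list is empty, which matches the commit's intent of skipping only the song information.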



[Mypyspace-developer] SF.net SVN: mypyspace:[327] graphRDF/branches/old2sparul/old2sparul.py

From: <ku...@us...> - 2009-02-18 12:56:54

Revision: 327
          http://mypyspace.svn.sourceforge.net/mypyspace/?rev=327&view=rev
Author:   kurtjx
Date:     2009-02-18 12:56:50 +0000 (Wed, 18 Feb 2009)

Log Message:
-----------
added some regex stuff to get rid of bad genre tags; sometimes 1324123.rdf was set as a theme, which was a bug in the old code, I guess

Modified Paths:
--------------
    graphRDF/branches/old2sparul/old2sparul.py

Modified: graphRDF/branches/old2sparul/old2sparul.py
===================================================================
--- graphRDF/branches/old2sparul/old2sparul.py	2009-02-18 12:51:29 UTC (rev 326)
+++ graphRDF/branches/old2sparul/old2sparul.py	2009-02-18 12:56:50 UTC (rev 327)
@@ -3,8 +3,10 @@
 """
 old2sparul.py
 
+This is an ad hoc script for taking data from myrdfspace.com, cleaning it, and putting in sparql endpoint
+
 Created by Kurtis Random on 2009-02-03.
-Copyright (c) 2009 __MyCompanyName__. All rights reserved.
+Copyright (c) 2009 C4DM QMUL. All rights reserved.
 """
 
 import sys
@@ -12,21 +14,22 @@
 from logging import log, error, warning, info, debug
 import logging
 import ftplib
-#from SPARQLWrapper import SPARQLWrapper
 import SPARQLWrapper
 import mopy
 import urllib2
+import re
 from time import sleep
 
 help_message = '''
 take old myrdfspace files and add to the sparql endpoint...
 	-b --base <uri base from myrdfspace>
+	-s --start <uid to start from> useful after a crash ;-)
 '''
 
 failedList = []
 badQueryList = []
 
-defaultGraph = "http://dbtune.org/myspace-fj-set-2008"
+defaultGraph = "http://dbtune.org/myspace-fj-2008"
 sparqlEndPoint = "http://dbtune.org/cmn/sparql"
 myspaceBase = "http://dbtune.org/myspace/uid"
 myspaceOnt = "http://purl.org/ontology/myspace"
@@ -50,7 +53,8 @@
 			sleep(1.0)
 			attempt+=1
 			tryImportRDF(filename, attempt)
-		return mi
+		else:
+			return mi
 	debug("import failed after tries: " + str(attempt))
 	return None
 
@@ -58,45 +62,55 @@
 	'''parse the rdf and return a sparql update query'''
 	sparqlU=''
 	mi = tryImportRDF(base+filename, 0)
-	keys = mi.PersonIdx.keys()
-	for key in keys:
-		person = mi.PersonIdx[key]
-		if person.name:
-			# if we find the name, this is the main subject
-			suid = person.URI.split(base)[1]
-			subject = "<"+myspaceBase+"/"+suid+">"
-			name = person.name.pop()
-			sparqlU = sparqlU + '\n'+subject+' rdf:type mo:MusicArtist .'
-			sparqlU = sparqlU + '\n'+subject+' myspace:myspaceID "'+filename.rstrip('.rdf')+'"^^xsd:int .'
-			sparqlU = sparqlU + """\n"""+subject+' foaf:name "' + urllib2.quote(name)+'"@en . '
+	if mi:
+		keys = mi.PersonIdx.keys()
+		for key in keys:
+			person = mi.PersonIdx[key]
+			if person.name:
+				# if we find the name, this is the main subject
+				suid = person.URI.split(base)[1]
+				subject = "<"+myspaceBase+"/"+suid+">"
+				name = person.name.pop()
+				sparqlU = sparqlU + '\n'+subject+' rdf:type mo:MusicArtist .'
+				sparqlU = sparqlU + '\n'+subject+' myspace:myspaceID "'+filename.rstrip('.rdf')+'"^^xsd:int .'
+				sparqlU = sparqlU + """\n"""+subject+' foaf:name "' + urllib2.quote(name)+'"@en . '
 			
-			# get all the top friends
-			while(1):
-				try:
-					p = person.knows.pop()
-					ouid = p.URI.split(base)[1]
-					obj = "<"+myspaceBase+"/"+ouid+">"	
-					sparqlU=sparqlU+ "\n"+subject+" foaf:knows "+ obj+ ' . ' "\n"+subject+" myspace:topFriend "+obj+ ' . '
-					sparqlU = sparqlU + '\n'+obj+' rdf:type mo:MusicArtist .'
-				except:
-					break
+				# get all the top friends
+				while(1):
+					try:
+						p = person.knows.pop()
+					except:
+						break
+					else:
+						ouid = p.URI.split(base)[1]
+						obj = "<"+myspaceBase+"/"+ouid+">"	
+						sparqlU=sparqlU+ "\n"+subject+" foaf:knows "+ obj+ ' . ' "\n"+subject+" myspace:topFriend "+obj+ ' . '
+						sparqlU = sparqlU + '\n'+obj+' rdf:type mo:MusicArtist .'
 					
-			while(1):
+				while(1):
+					try:
+						thm = person.theme.pop()
+					except:
+						debug("breaking from genre pops")
+						break
+					else:
+						thm = thm.URI.split(base)[1]
+						# do some cleaning, bad genres in there like 35123543.rdf instead of hip hop
+						if not re.match(".*\.rdf",thm):
+							debug("adding genre: "+thm)
+							genre = "<"+myspaceOnt + "#"+urllib2.quote(thm)+">"
+							sparqlU=sparqlU+ "\n"+subject+ " myspace:genreTag "+ genre+ ' . '
+					
 				try:
-					thm = person.theme.pop()
-					genre = "<"+myspaceOnt + "#"+urllib2.quote(thm.URI.split(base)[1])+">"
-					sparqlU=sparqlU+ "\n"+subject+ " myspace:genreTag "+ genre+ ' . '
+					playcount = person.tipjar.pop().URI.split(base)[1]
+					sparqlU=sparqlU+ "\n"+subject+ ' myspace:totalPlays "'+ playcount+'"^^xsd:int . '
 				except:
-					break
-					
-			try:
-				playcount = person.tipjar.pop().URI.split(base)[1]
-				sparqlU=sparqlU+ "\n"+subject+ ' myspace:totalPlays "'+ playcount+'"^^xsd:int . '
-			except:
-				pass
+					pass
 				
-	sparqlU=sparqlU+'}'				
-	return sparqlU	
+		sparqlU=sparqlU+'}'				
+		return sparqlU
+	else:
+		return None
 
 def setLogger():
     '''just set the logger'''
@@ -216,17 +230,20 @@
 			sparul = parseRDF(f, base)
 			sparql = SPARQLWrapper.SPARQLWrapper(sparqlEndPoint)
 			sparql.addDefaultGraph(defaultGraph)
-			
-			# we have to deal w/ queries that are too long
-			if len(sparul) > apacheLimit:
-				debug('query too long, splitting...')
-				splitSparul = splitQuery(sparul)
-				for split in splitSparul:
-					sparql.setQuery(prefixes+split)
+			if sparul:
+				# we have to deal w/ queries that are too long
+				if len(sparul) > apacheLimit:
+					debug('query too long, splitting...')
+					splitSparul = splitQuery(sparul)
+					for split in splitSparul:
+						sparql.setQuery(prefixes+split)
+						trySparql(sparql, 0, f)
+				else:
+					sparql.setQuery(prefixes+insert+sparul)
 					trySparql(sparql, 0, f)
 			else:
-				sparql.setQuery(prefixes+insert+sparul)
-				trySparql(sparql, 0, f)
+				debug('failure on '+str(f))
+				failedList.append(f)
 
 				
 		



[Mypyspace-developer] SF.net SVN: mypyspace:[326] musicGrabber/branches/webserv-branch/myspace2rdf.py

From: <ku...@us...> - 2009-02-18 12:51:33

Revision: 326
          http://mypyspace.svn.sourceforge.net/mypyspace/?rev=326&view=rev
Author:   kurtjx
Date:     2009-02-18 12:51:29 +0000 (Wed, 18 Feb 2009)

Log Message:
-----------
added foaf:primaryTopic to the RDF, mostly because Kingsley said to

Modified Paths:
--------------
    musicGrabber/branches/webserv-branch/myspace2rdf.py

Modified: musicGrabber/branches/webserv-branch/myspace2rdf.py
===================================================================
--- musicGrabber/branches/webserv-branch/myspace2rdf.py	2009-02-05 15:06:38 UTC (rev 325)
+++ musicGrabber/branches/webserv-branch/myspace2rdf.py	2009-02-18 12:51:29 UTC (rev 326)
@@ -105,8 +105,10 @@
 			genrePresent = scrapePage(self.page, [genreTag[0]], genreTag[1])
 			if genrePresent:
 				self.subject = mopy.mo.MusicArtist(dbtuneMyspace+'uid/'+str(self.uid))
-				#self.subjecttwo = mopy.foaf.Person('http://dbtune.org/myspace/uid/'+str(self.uid))
-				#self.subject = mopy.mo.MusicArtist('http://dbtune.org/myspace/uid/'+str(self.uid))
+				# add foaf:primaryTopic
+				ppd = mopy.foaf.PersonalProfileDocument("")
+				ppd.primaryTopic.set(self.subject)
+				self.mi.add(ppd)
 				self.name = scrapePage(self.page, [nameTag[0]], nameTag[1])
 				if self.name:
 					self.subject.name.set(self.name)
@@ -117,6 +119,11 @@
 			else:
 				#self.subject = mopy.mo.Agent('http://dbtune.org/myspace/uid/'+str(self.uid))
 				self.subject = mopy.foaf.Person(dbtuneMyspace+'uid/'+str(self.uid))
+				self.subject = mopy.mo.MusicArtist(dbtuneMyspace+'uid/'+str(self.uid))
+				# add foaf:primaryTopic
+				ppd = mopy.foaf.PersonalProfileDocument("")
+				ppd.primaryTopic.set(self.subject)
+				self.mi.add(ppd)
 				self.name = scrapePage(self.page, [nameTag[0]], nameTag[1])
 				#print self.name
 				if self.name:



[Mypyspace-developer] SF.net SVN: mypyspace:[325] graphRDF/branches/old2sparul/old2sparul.py

From: <ku...@us...> - 2009-02-05 15:06:42

Revision: 325
          http://mypyspace.svn.sourceforge.net/mypyspace/?rev=325&view=rev
Author:   kurtjx
Date:     2009-02-05 15:06:38 +0000 (Thu, 05 Feb 2009)

Log Message:
-----------
some additional error handling for failures in importRDFFile and an argument to restart mid-directory

Modified Paths:
--------------
    graphRDF/branches/old2sparul/old2sparul.py

Modified: graphRDF/branches/old2sparul/old2sparul.py
===================================================================
--- graphRDF/branches/old2sparul/old2sparul.py	2009-02-04 17:20:12 UTC (rev 324)
+++ graphRDF/branches/old2sparul/old2sparul.py	2009-02-05 15:06:38 UTC (rev 325)
@@ -40,10 +40,24 @@
 	def __init__(self, msg):
 		self.msg = msg
 
+def tryImportRDF(filename, attempt):
+	if attempt < 5:
+		debug("importing rdf")
+		try:
+			mi = mopy.importRDFFile(filename)
+		except urllib2.URLError:
+			debug("URLError importing RDF, retrying")
+			sleep(1.0)
+			attempt+=1
+			tryImportRDF(filename, attempt)
+		return mi
+	debug("import failed after tries: " + str(attempt))
+	return None
+
 def parseRDF(filename, base):
 	'''parse the rdf and return a sparql update query'''
 	sparqlU=''
-	mi = mopy.importRDFFile(base+filename)
+	mi = tryImportRDF(base+filename, 0)
 	keys = mi.PersonIdx.keys()
 	for key in keys:
 		person = mi.PersonIdx[key]
@@ -156,12 +170,13 @@
 		argv = sys.argv
 	try:
 		try:
-			opts, args = getopt.getopt(argv[1:], "ho:b:v", ["help", "output=","base="])
+			opts, args = getopt.getopt(argv[1:], "ho:b:s:v", ["help", "output=","base=", "start="])
 		except getopt.error, msg:
 			raise Usage(msg)
 	
 		# option processing
 		base = None
+		start = None
 		for option, value in opts:
 			if option == "-v":
 				verbose = True
@@ -171,6 +186,8 @@
 				output = value
 			if option in ("-b", "--base"):
 				base = value
+			if option in ("-s", "--start"):
+				start = value
 			'''if option in ("-g", '--graph'):
 				defaultGraph = value
 				insert = """ \ninsert into graph <"""+defaultGraph+"""> {"""'''
@@ -186,7 +203,14 @@
 		fileList = getFileListing(folder)
 		debug('got list of files')
 		#fileList = ['238729309.rdf', '13280592.rdf', '26412401.rdf', '8557307.rdf', '176635064.rdf', '12656647.rdf']
-		for f in fileList:
+		startIndex=0
+		if start:
+			try:
+				startIndex=fileList.index(start)
+			except:
+				debug("not a valid start file, not in list")
+				
+		for f in fileList[startIndex:]:
 			debug('parsing on file: '+str(f))
 			#parse each file and do a sparql update to the repository
 			sparul = parseRDF(f, base)
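
Two ideas from r325 can be sketched in isolation: retrying a flaky import, and resuming a directory walk from a given file. In `tryImportRDF` as committed, the result of the recursive retry is not returned and `mi` is unbound after a failed first attempt, so a loop is used here instead. This is a minimal Python 3 sketch, not the project's code. `import_with_retry` and `files_from` are hypothetical names, `fetch` stands in for `mopy.importRDFFile`, and `OSError` is used because `urllib.error.URLError` is a subclass of it in Python 3.

```python
import time


def import_with_retry(fetch, filename, attempts=5, delay=1.0, retry_on=(OSError,)):
    """Call fetch(filename) up to `attempts` times; return None if every try fails."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(filename)
        except retry_on:
            # Pause before the next try, but not after the last one.
            if attempt < attempts:
                time.sleep(delay)
    return None


def files_from(file_list, start=None):
    """Return file_list from `start` onward, or the whole list if start is unknown."""
    if start is None:
        return list(file_list)
    try:
        return file_list[file_list.index(start):]
    except ValueError:
        # Not a valid start file: fall back to processing everything.
        return list(file_list)
```

Returning `None` after the final failure lets the caller record the file in a failed list, as the later revisions of the script do.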



[Mypyspace-developer] SF.net SVN: mypyspace:[324] graphRDF/branches/old2sparul/old2sparul.py

From: <ku...@us...> - 2009-02-04 17:20:16

Revision: 324
          http://mypyspace.svn.sourceforge.net/mypyspace/?rev=324&view=rev
Author:   kurtjx
Date:     2009-02-04 17:20:12 +0000 (Wed, 04 Feb 2009)

Log Message:
-----------
old2sparul working properly :-)

Modified Paths:
--------------
    graphRDF/branches/old2sparul/old2sparul.py

Modified: graphRDF/branches/old2sparul/old2sparul.py
===================================================================
--- graphRDF/branches/old2sparul/old2sparul.py	2009-02-04 15:18:08 UTC (rev 323)
+++ graphRDF/branches/old2sparul/old2sparul.py	2009-02-04 17:20:12 UTC (rev 324)
@@ -26,11 +26,11 @@
 failedList = []
 badQueryList = []
 
-defaultGraph = "http://dbtune.org/myspace-fj-2008p"
+defaultGraph = "http://dbtune.org/myspace-fj-set-2008"
 sparqlEndPoint = "http://dbtune.org/cmn/sparql"
 myspaceBase = "http://dbtune.org/myspace/uid"
 myspaceOnt = "http://purl.org/ontology/myspace"
-prefixes = """PREFIX owl: <http://www.w3.org/2002/07/owl#> \nPREFIX foaf: <http://xmlns.com/foaf/0.1/> \nPREFIX dc: <http://purl.org/dc/elements/1.1/> \nPREFIX mo: <http://purl.org/ontology/mo/>\nPREFIX myspace: <http://purl.org/ontology/myspace#>\nPREFIX xsd: <http://www.w3.org/2001/XMLSchema#>"""
+prefixes = """PREFIX owl: <http://www.w3.org/2002/07/owl#> \nPREFIX foaf: <http://xmlns.com/foaf/0.1/> \nPREFIX dc: <http://purl.org/dc/elements/1.1/> \nPREFIX mo: <http://purl.org/ontology/mo/>\nPREFIX myspace: <http://purl.org/ontology/myspace#>\nPREFIX xsd: <http://www.w3.org/2001/XMLSchema#>\nPREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>"""
 
 insert = """ \ninsert into graph <"""+defaultGraph+"""> {"""
 
@@ -52,6 +52,8 @@
 			suid = person.URI.split(base)[1]
 			subject = "<"+myspaceBase+"/"+suid+">"
 			name = person.name.pop()
+			sparqlU = sparqlU + '\n'+subject+' rdf:type mo:MusicArtist .'
+			sparqlU = sparqlU + '\n'+subject+' myspace:myspaceID "'+filename.rstrip('.rdf')+'"^^xsd:int .'
 			sparqlU = sparqlU + """\n"""+subject+' foaf:name "' + urllib2.quote(name)+'"@en . '
 			
 			# get all the top friends
@@ -61,6 +63,7 @@
 					ouid = p.URI.split(base)[1]
 					obj = "<"+myspaceBase+"/"+ouid+">"	
 					sparqlU=sparqlU+ "\n"+subject+" foaf:knows "+ obj+ ' . ' "\n"+subject+" myspace:topFriend "+obj+ ' . '
+					sparqlU = sparqlU + '\n'+obj+' rdf:type mo:MusicArtist .'
 				except:
 					break
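
One detail in r324 deserves a caution: `filename.rstrip('.rdf')` strips any trailing run of the characters `.`, `r`, `d` and `f`, not the literal suffix. It gives the intended result for the all-numeric file names used here, but would not for a name ending in one of those letters. A minimal Python 3 sketch of an explicit suffix removal follows; `myspace_id` is a hypothetical helper name.

```python
def myspace_id(filename, suffix=".rdf"):
    """Return filename without a trailing suffix; leave other names unchanged."""
    if filename.endswith(suffix):
        return filename[: -len(suffix)]
    return filename


# Numeric ids behave the same either way; letters before the suffix do not.
numeric = myspace_id("238729309.rdf")
stripped_too_far = "fred.rdf".rstrip(".rdf")
kept_whole = myspace_id("fred.rdf")
```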
 					



[Mypyspace-developer] SF.net SVN: mypyspace:[323] graphRDF/branches/old2sparul/old2sparul.py

From: <ku...@us...> - 2009-02-04 15:18:11

Revision: 323
          http://mypyspace.svn.sourceforge.net/mypyspace/?rev=323&view=rev
Author:   kurtjx
Date:     2009-02-04 15:18:08 +0000 (Wed, 04 Feb 2009)

Log Message:
-----------
splits big queries down now

Modified Paths:
--------------
    graphRDF/branches/old2sparul/old2sparul.py

Modified: graphRDF/branches/old2sparul/old2sparul.py
===================================================================
--- graphRDF/branches/old2sparul/old2sparul.py	2009-02-03 20:55:00 UTC (rev 322)
+++ graphRDF/branches/old2sparul/old2sparul.py	2009-02-04 15:18:08 UTC (rev 323)
@@ -22,22 +22,27 @@
 take old myrdfspace files and add to the sparql endpoint...
 	-b --base <uri base from myrdfspace>
 '''
+
 failedList = []
 badQueryList = []
 
-defaultGraph = "http://dbtune.org/myspace-test"
+defaultGraph = "http://dbtune.org/myspace-fj-2008p"
 sparqlEndPoint = "http://dbtune.org/cmn/sparql"
 myspaceBase = "http://dbtune.org/myspace/uid"
 myspaceOnt = "http://purl.org/ontology/myspace"
 prefixes = """PREFIX owl: <http://www.w3.org/2002/07/owl#> \nPREFIX foaf: <http://xmlns.com/foaf/0.1/> \nPREFIX dc: <http://purl.org/dc/elements/1.1/> \nPREFIX mo: <http://purl.org/ontology/mo/>\nPREFIX myspace: <http://purl.org/ontology/myspace#>\nPREFIX xsd: <http://www.w3.org/2001/XMLSchema#>"""
 
+insert = """ \ninsert into graph <"""+defaultGraph+"""> {"""
+
+apacheLimit = 2000
+
 class Usage(Exception):
 	def __init__(self, msg):
 		self.msg = msg
 
 def parseRDF(filename, base):
 	'''parse the rdf and return a sparql update query'''
-	sparqlU = prefixes+""" \ninsert into graph <"""+defaultGraph+"""> {"""
+	sparqlU=''
 	mi = mopy.importRDFFile(base+filename)
 	keys = mi.PersonIdx.keys()
 	for key in keys:
@@ -99,8 +104,7 @@
 	try:
 		debug('attempting sparql update, try #' + str(attempt))
 		sparql.setReturnFormat(SPARQLWrapper.TURTLE)
-		ret = sparql.query()
-		print ret.convert()
+		ret = sparql.query().convert()
 	except urllib2.HTTPError:
 		debug('caught an http error, retrying...')
 		if attempt<5:
@@ -113,17 +117,36 @@
 	except SPARQLWrapper.sparqlexceptions.QueryBadFormed:
 		error("query failed for "+ str(f))
 		debug('$$$$$$$$$$$$$$$$BADLY FORMED QUERY$$$$$$$$$$$$$$$$$$$')
+		print sparql.queryString
 		badQueryList.append(f)
 		failedList.append(f)
 	except:
 		error("query failed for "+ str(f))
 		debug('************UPDATE FAILED***********')
 		failedList.append(f)
-		error("Unexpected error:", sys.exc_info()[0])
+		print "Unexpected error:", sys.exc_info()[0]
+		print sparql.queryString
+	else:
+		print ret
+		return ret
+	return None
 		
 def splitQuery(query):
 	'''sometime the query is too long and should be broke in two pieces'''
-	pass
+	lines = query.splitlines(1)
+	splits = []
+	split = ""
+	count = 0
+	for line in lines:
+		if count < apacheLimit:
+			split = split+line
+			count+=len(line)
+		else:
+			splits.append(insert+split+'}')
+			split= line
+			count = 0
+	splits.append(insert+split)
+	return splits
 
 def main(argv=None):
 	if argv is None:
@@ -145,6 +168,10 @@
 				output = value
 			if option in ("-b", "--base"):
 				base = value
+			'''if option in ("-g", '--graph'):
+				defaultGraph = value
+				insert = """ \ninsert into graph <"""+defaultGraph+"""> {"""'''
+				
 		
 		setLogger()
 		if base == None:
@@ -153,50 +180,27 @@
 		# parse base uri
 		folder = base.split("http://myrdfspace.com/")[1]
 		debug('getting list of files')
-		#fileList = getFileListing(folder)
+		fileList = getFileListing(folder)
 		debug('got list of files')
-		fileList = ['238729309.rdf', '13280592.rdf', '26412401.rdf', '8557307.rdf', '176635064.rdf', '12656647.rdf']
+		#fileList = ['238729309.rdf', '13280592.rdf', '26412401.rdf', '8557307.rdf', '176635064.rdf', '12656647.rdf']
 		for f in fileList:
 			debug('parsing on file: '+str(f))
 			#parse each file and do a sparql update to the repository
 			sparul = parseRDF(f, base)
 			sparql = SPARQLWrapper.SPARQLWrapper(sparqlEndPoint)
 			sparql.addDefaultGraph(defaultGraph)
-			sparql.setQuery(sparul)
-			trySparql(sparql, 0, f)
-			'''try:
-				debug('attempting sparql update')
-				sparql.setReturnFormat(SPARQLWrapper.TURTLE)
-				ret = sparql.query()
-				print ret.convert()
-			except urllib2.HTTPError:
-				debug('caught an http error, retrying...')
-				try:
-					ret = sparql.query()
-					print ret.convert()
-				except urllib2.HTTPError:
-					debug('second http error...')
-					try:
-						ret = sparql.query()
-						print ret.convert()
-					except:
-						print "query failed for "+ str(f)
-						debug('************UPDATE FAILED***********')
-						failedList.append(f)
-						print "FINAL error:", sys.exc_info()[0]
-				except:
-					print "query failed for "+ str(f)
-					debug('************UPDATE FAILED***********')
-					failedList.append(f)
-					print "Unexpected error:", sys.exc_info()[0]
-			except SPARQLWrapper.sparqlexceptions.QueryBadFormed:
-				debug('$$$$$$$$$$$$$$$$BADLY FORMED QUERY$$$$$$$$$$$$$$$$$$$')
-				badQueryList.append(f)
-			except:
-				print "query failed for "+ str(f)
-				debug('************UPDATE FAILED***********')
-				failedList.append(f)
-				print "Unexpected error:", sys.exc_info()[0]'''
+			
+			# we have to deal w/ queries that are too long
+			if len(sparul) > apacheLimit:
+				debug('query too long, splitting...')
+				splitSparul = splitQuery(sparul)
+				for split in splitSparul:
+					sparql.setQuery(prefixes+split)
+					trySparql(sparql, 0, f)
+			else:
+				sparql.setQuery(prefixes+insert+sparul)
+				trySparql(sparql, 0, f)
+
 				
 		
 		debug("Complete!!!")
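
The query splitting introduced in r323 can be sketched as line-based chunking under a size limit. As far as the diff shows, `splitQuery` checks the running count before adding a line and resets it to 0 rather than to the length of the carried-over line, so a chunk can exceed `apacheLimit`. The Python 3 sketch below checks before appending instead. It is an illustration, not the project's code: `split_statements` is a hypothetical name, the body is assumed to hold one statement per line without the closing brace, and the limit counts only the body, not the header.

```python
def split_statements(body, limit, header, footer="}"):
    """Group statement lines into chunks whose bodies stay within `limit` characters."""
    chunks, current, size = [], [], 0
    for line in body.splitlines():
        if not line.strip():
            continue  # skip blank lines such as a leading newline
        # Flush first if adding this line (plus its newline) would pass the limit.
        if current and size + len(line) + 1 > limit:
            chunks.append(header + "\n".join(current) + "\n" + footer)
            current, size = [], 0
        current.append(line)
        size += len(line) + 1
    if current:
        chunks.append(header + "\n".join(current) + "\n" + footer)
    return chunks


# Usage: three short statements with room for two per chunk.
header = "insert into graph <http://example.org/g> {\n"
chunks = split_statements("\na b c .\nd e f .\ng h i .", 16, header)
```

A single line longer than the limit still becomes its own chunk, since a statement cannot be split without breaking it.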



[Mypyspace-developer] SF.net SVN: mypyspace:[322] graphRDF/branches/old2sparul/old2sparul.py

From: <ku...@us...> - 2009-02-03 20:55:05

Revision: 322
          http://mypyspace.svn.sourceforge.net/mypyspace/?rev=322&view=rev
Author:   kurtjx
Date:     2009-02-03 20:55:00 +0000 (Tue, 03 Feb 2009)

Log Message:
-----------
small Python script for importing old myrdfspace data into 3 store; still need to add a function to break long queries in two

Added Paths:
-----------
    graphRDF/branches/old2sparul/old2sparul.py

Added: graphRDF/branches/old2sparul/old2sparul.py
===================================================================
--- graphRDF/branches/old2sparul/old2sparul.py	                        (rev 0)
+++ graphRDF/branches/old2sparul/old2sparul.py	2009-02-03 20:55:00 UTC (rev 322)
@@ -0,0 +1,216 @@
+#!/usr/bin/env python
+# encoding: utf-8
+"""
+old2sparul.py
+
+Created by Kurtis Random on 2009-02-03.
+Copyright (c) 2009 __MyCompanyName__. All rights reserved.
+"""
+
+import sys
+import getopt
+from logging import log, error, warning, info, debug
+import logging
+import ftplib
+#from SPARQLWrapper import SPARQLWrapper
+import SPARQLWrapper
+import mopy
+import urllib2
+from time import sleep
+
+help_message = '''
+take old myrdfspace files and add to the sparql endpoint...
+	-b --base <uri base from myrdfspace>
+'''
+failedList = []
+badQueryList = []
+
+defaultGraph = "http://dbtune.org/myspace-test"
+sparqlEndPoint = "http://dbtune.org/cmn/sparql"
+myspaceBase = "http://dbtune.org/myspace/uid"
+myspaceOnt = "http://purl.org/ontology/myspace"
+prefixes = """PREFIX owl: <http://www.w3.org/2002/07/owl#> \nPREFIX foaf: <http://xmlns.com/foaf/0.1/> \nPREFIX dc: <http://purl.org/dc/elements/1.1/> \nPREFIX mo: <http://purl.org/ontology/mo/>\nPREFIX myspace: <http://purl.org/ontology/myspace#>\nPREFIX xsd: <http://www.w3.org/2001/XMLSchema#>"""
+
+class Usage(Exception):
+	def __init__(self, msg):
+		self.msg = msg
+
+def parseRDF(filename, base):
+	'''parse the rdf and return a sparql update query'''
+	sparqlU = prefixes+""" \ninsert into graph <"""+defaultGraph+"""> {"""
+	mi = mopy.importRDFFile(base+filename)
+	keys = mi.PersonIdx.keys()
+	for key in keys:
+		person = mi.PersonIdx[key]
+		if person.name:
+			# if we find the name, this is the main subject
+			suid = person.URI.split(base)[1]
+			subject = "<"+myspaceBase+"/"+suid+">"
+			name = person.name.pop()
+			sparqlU = sparqlU + """\n"""+subject+' foaf:name "' + urllib2.quote(name)+'"@en . '
+			
+			# get all the top friends
+			while(1):
+				try:
+					p = person.knows.pop()
+					ouid = p.URI.split(base)[1]
+					obj = "<"+myspaceBase+"/"+ouid+">"	
+					sparqlU=sparqlU+ "\n"+subject+" foaf:knows "+ obj+ ' . ' "\n"+subject+" myspace:topFriend "+obj+ ' . '
+				except:
+					break
+					
+			while(1):
+				try:
+					thm = person.theme.pop()
+					genre = "<"+myspaceOnt + "#"+urllib2.quote(thm.URI.split(base)[1])+">"
+					sparqlU=sparqlU+ "\n"+subject+ " myspace:genreTag "+ genre+ ' . '
+				except:
+					break
+					
+			try:
+				playcount = person.tipjar.pop().URI.split(base)[1]
+				sparqlU=sparqlU+ "\n"+subject+ ' myspace:totalPlays "'+ playcount+'"^^xsd:int . '
+			except:
+				pass
+				
+	sparqlU=sparqlU+'}'				
+	return sparqlU	
+
+def setLogger():
+    '''just set the logger'''
+    loggingConfig = {"format":'%(asctime)s %(levelname)-8s %(message)s',
+                               "datefmt":'%d.%m.%y %H:%M:%S',
+                                "level": logging.DEBUG,
+                                #"filename":logPath + "musicGrabber.log",
+                                "filemode":"w"}
+    logging.basicConfig(**loggingConfig)
+
+def getFileListing(rdfFolder):
+	'''return a list of all the rdf files found w/ given base'''
+	rdfFolder = rdfFolder.rstrip('/')
+	rdfFolder = rdfFolder+'/'
+	ftp = ftplib.FTP("myrdfspace.com")
+	ftp.login("myrdf", "my1stRDF")
+	ftp.cwd("myrdfspace.com/"+rdfFolder)
+	vList = ftp.nlst()
+	return vList
+
+def trySparql(sparql, attempt, f):
+	try:
+		debug('attempting sparql update, try #' + str(attempt))
+		sparql.setReturnFormat(SPARQLWrapper.TURTLE)
+		ret = sparql.query()
+		print ret.convert()
+	except urllib2.HTTPError:
+		debug('caught an http error, retrying...')
+		if attempt<5:
+			attempt+=1
+			sleep(2)
+			trySparql(sparql, attempt, f)
+		else:
+			error("more than 5 http errors, giving up")
+			failedList.append(f)
+	except SPARQLWrapper.sparqlexceptions.QueryBadFormed:
+		error("query failed for "+ str(f))
+		debug('$$$$$$$$$$$$$$$$BADLY FORMED QUERY$$$$$$$$$$$$$$$$$$$')
+		badQueryList.append(f)
+		failedList.append(f)
+	except:
+		error("query failed for "+ str(f))
+		debug('************UPDATE FAILED***********')
+		failedList.append(f)
+		error("Unexpected error: %s", sys.exc_info()[0])
+		
+def splitQuery(query):
+	'''sometimes the query is too long and should be broken into two pieces'''
+	pass
+
+def main(argv=None):
+	if argv is None:
+		argv = sys.argv
+	try:
+		try:
+			opts, args = getopt.getopt(argv[1:], "ho:b:v", ["help", "output=","base="])
+		except getopt.error, msg:
+			raise Usage(msg)
+	
+		# option processing
+		base = None
+		for option, value in opts:
+			if option == "-v":
+				verbose = True
+			if option in ("-h", "--help"):
+				raise Usage(help_message)
+			if option in ("-o", "--output"):
+				output = value
+			if option in ("-b", "--base"):
+				base = value
+		
+		setLogger()
+		if base == None:
+			raise Usage(help_message)
+			return 2
+		# parse base uri
+		folder = base.split("http://myrdfspace.com/")[1]
+		debug('getting list of files')
+		#fileList = getFileListing(folder)
+		debug('got list of files')
+		fileList = ['238729309.rdf', '13280592.rdf', '26412401.rdf', '8557307.rdf', '176635064.rdf', '12656647.rdf']
+		for f in fileList:
+			debug('parsing on file: '+str(f))
+			#parse each file and do a sparql update to the repository
+			sparul = parseRDF(f, base)
+			sparql = SPARQLWrapper.SPARQLWrapper(sparqlEndPoint)
+			sparql.addDefaultGraph(defaultGraph)
+			sparql.setQuery(sparul)
+			trySparql(sparql, 0, f)
+			'''try:
+				debug('attempting sparql update')
+				sparql.setReturnFormat(SPARQLWrapper.TURTLE)
+				ret = sparql.query()
+				print ret.convert()
+			except urllib2.HTTPError:
+				debug('caught an http error, retrying...')
+				try:
+					ret = sparql.query()
+					print ret.convert()
+				except urllib2.HTTPError:
+					debug('second http error...')
+					try:
+						ret = sparql.query()
+						print ret.convert()
+					except:
+						print "query failed for "+ str(f)
+						debug('************UPDATE FAILED***********')
+						failedList.append(f)
+						print "FINAL error:", sys.exc_info()[0]
+				except:
+					print "query failed for "+ str(f)
+					debug('************UPDATE FAILED***********')
+					failedList.append(f)
+					print "Unexpected error:", sys.exc_info()[0]
+			except SPARQLWrapper.sparqlexceptions.QueryBadFormed:
+				debug('$$$$$$$$$$$$$$$$BADLY FORMED QUERY$$$$$$$$$$$$$$$$$$$')
+				badQueryList.append(f)
+			except:
+				print "query failed for "+ str(f)
+				debug('************UPDATE FAILED***********')
+				failedList.append(f)
+				print "Unexpected error:", sys.exc_info()[0]'''
+				
+		
+		debug("Complete!!!")
+		print "\n\nREPORT:\n\tfailures: "+str(len(failedList))
+		print "\nfails: "
+		print failedList 
+		print "\n\nbad queries: "
+		print badQueryList
+		
+	except Usage, err:
+		print >> sys.stderr, sys.argv[0].split("/")[-1] + ": " + str(err.msg)
+		print >> sys.stderr, "\t for help use --help"
+		return 2
+
+
+if __name__ == "__main__":
+	sys.exit(main())
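
The retry logic in trySparql above can also be written as a loop instead of a recursive call. The following is a minimal sketch in Python 3 syntax, not the code from this commit: run_with_retries and TransientError are hypothetical names, with TransientError standing in for urllib2.HTTPError, and the sleep parameter exists only so the delay can be swapped out in tests.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for a retryable failure such as an HTTP error."""


def run_with_retries(action, max_attempts=5, delay=2.0, sleep=time.sleep):
    """Call action() until it succeeds or max_attempts is reached.

    Returns (True, result) on success and (False, None) once the
    attempts are used up. Only TransientError is retried; any other
    exception propagates to the caller.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return True, action()
        except TransientError:
            # Wait before the next try, but not after the last one.
            if attempt < max_attempts:
                sleep(delay)
    return False, None
```

In main(), a False first element would be the point at which f is appended to failedList. Note that trySparql starts counting at 0, so as committed it makes up to six tries before giving up.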


This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.

[Mypyspace-developer] SF.net SVN: mypyspace:[321] graphRDF/branches/old2sparul/

From: <ku...@us...> - 2009-02-03 20:53:30

Revision: 321
          http://mypyspace.svn.sourceforge.net/mypyspace/?rev=321&view=rev
Author:   kurtjx
Date:     2009-02-03 20:53:25 +0000 (Tue, 03 Feb 2009)

Log Message:
-----------
new directory

Added Paths:
-----------
    graphRDF/branches/old2sparul/



[Mypyspace-developer] SF.net SVN: mypyspace:[320] musicGrabber/branches/webserv-branch

From: <ku...@us...> - 2009-02-03 15:15:46

Revision: 320
          http://mypyspace.svn.sourceforge.net/mypyspace/?rev=320&view=rev
Author:   kurtjx
Date:     2009-02-03 15:15:36 +0000 (Tue, 03 Feb 2009)

Log Message:
-----------
fixed a bug in the genre scraping

Modified Paths:
--------------
    musicGrabber/branches/webserv-branch/myspace2rdf.py
    musicGrabber/branches/webserv-branch/myspaceuris.py

Modified: musicGrabber/branches/webserv-branch/myspace2rdf.py
===================================================================
--- musicGrabber/branches/webserv-branch/myspace2rdf.py	2009-01-27 13:55:06 UTC (rev 319)
+++ musicGrabber/branches/webserv-branch/myspace2rdf.py	2009-02-03 15:15:36 UTC (rev 320)
@@ -263,7 +263,7 @@
 
 	
 	def scrapeGenre(self):
-		genreraw = scrapePage(self.page, [genreTag[0]], genreTag[1])
+		'''genreraw = scrapePage(self.page, [genreTag[0]], genreTag[1])
 		if genreraw == None:
 			return genreraw
 		genreraw = str(genreraw).lstrip()
@@ -278,8 +278,22 @@
 			self.mi.add(g)
 			self.subject.genreTag.add(g)
 			genresfixed.append(genre)
-		return genresfixed
+		return genresfixed'''
+		localGenres = scrapePage(self.page, [genreTag[0]], genreTag[1])
+		if localGenres == None:
+			return None
+		genreNums = re.findall(''':"(.|..|...)"''', localGenres) # captures the 1 to 3 characters found between the quotes
+		genres = []
+		for gnum in genreNums:
+			genre = mopy.mo.Genre(myspaceOntology+urllib.quote(genreDict[int(gnum)]))
+			genre.name.set(genreDict[int(gnum)])
+			self.mi.add(genre)
+			self.subject.genreTag.add(genre)
+			genres.append(genre)
+
+		return genres
 		
+		
 class mpsSong:
 	"""a class that wraps around the downloading, feature extracting and modeling of a piece of media attached to a mpsUser
 	mpsSong object instances have the following public variables:

Modified: musicGrabber/branches/webserv-branch/myspaceuris.py
===================================================================
--- musicGrabber/branches/webserv-branch/myspaceuris.py	2009-01-27 13:55:06 UTC (rev 319)
+++ musicGrabber/branches/webserv-branch/myspaceuris.py	2009-02-03 15:15:36 UTC (rev 320)
@@ -26,7 +26,8 @@
 #						### tag terminated by a ';' ###
 nameTag = """<span class="nametext">""", '''<'''
 #						### tag term by '<'
-genreTag = '''<font color="#033330" size="1" face="Arial, Helvetica, sans-serif"><strong>\r\n\t\t\t\t\t''', ''' \r'''
+genreTag = '''MySpace.Ads.BandType = {''', '''}'''
+#genreTag = '''<font color="#033330" size="1" face="Arial, Helvetica, sans-serif"><strong>\r\n\t\t\t\t\t''', ''' \r'''
 #'''<font color="#033330" size="1" face="Arial, Helvetica, sans-serif"><strong>''', '''<'''
 #						### tag terminated by '<'
 niceURLTag = '''<td><div align="left">&nbsp;&nbsp;<span><a href="''', '''">'''
@@ -70,6 +71,9 @@
 #adding this back in to lessen the broken...
 dbtuneMyspace = 'http://dbtune.org/myspace/'
 
+#new dict for genres valid as of 2009 feb 3
+genreDict = {0:"", 61:"2-step", 59:"A'cappella", 125:"Acousmatic / Tape music", 1:"Acoustic", 73:"Afro-beat", 2:"Alternative", 3:"Ambient", 93:"Americana", 98:"Anime Song", 65:"Big Beat", 51:"Black Metal", 4:"Bluegrass", 5:"Blues", 105:"Bossa Nova", 60:"Breakbeat", 129:"Breakcore", 118:"Celtic", 109:"Children", 134:"Chinese pop", 135:"Chinese traditional", 6:"Christian", 7:"Christian Rap", 8:"Classic Rock", 77:"Classical", 110:"Classical - Opera and Vocal", 9:"Club", 10:"Comedy", 126:"Concrete", 11:"Country", 12:"Death Metal", 63:"Disco House", 70:"Down-tempo", 50:"Drum & Bass", 68:"Dub", 123:"Dutch pop", 67:"Electro", 127:"Electroacoustic", 13:"Electronica", 14:"Emo", 133:"Emotronic", 15:"Experimental", 107:"Flamenco", 16:"Folk", 17:"Folk Rock", 119:"French pop", 18:"Funk", 124:"Fusion", 56:"Garage", 120:"German pop", 79:"Glam", 112:"Gospel", 46:"Gothic", 95:"Grime", 47:"Grindcore", 19:"Grunge", 71:"Happy Hardcore", 57:"Hard House", 20:"Hardcore", 104:"Healing & EasyListening", 21:"Hip Hop", 22:"House", 69:"IDM", 97:"Idol", 23:"Indie", 45:"Industrial", 121:"Italian pop", 24:"Jam Band", 103:"Japanese Classic Music", 100:"Japanese Pop", 25:"Jazz", 58:"Jungle", 101:"Korean Pop", 49:"Latin", 128:"Live Electronics", 75:"Lounge", 113:"Lyrical", 102:"Melodramatic Popular Song", 26:"Metal", 131:"Minimalist", 76:"New Wave", 66:"Nu-Jazz", 27:"Other", 28:"Pop", 29:"Pop Punk", 130:"Post punk", 31:"Powerpop", 32:"Progressive", 62:"Progrsv House", 33:"Psychedelic", 43:"Psychobilly", 34:"Punk", 35:"R&B", 36:"Rap", 37:"Reggae", 111:"Religious", 38:"Rock", 44:"Rockabilly", 94:"Roots Music", 115:"Salsa", 116:"Samba", 39:"Screamo", 78:"Shoegaze", 96:"Showtunes", 40:"Ska", 41:"Soul", 106:"Soundtracks / Film music", 42:"Southern Rock", 122:"Spanish pop", 48:"Surf", 114:"Swing", 108:"Tango", 53:"Techno", 54:"Thrash", 52:"Trance", 132:"Trance", 55:"Trip Hop", 92:"Tropical", 99:"Visual", 117:"Zouk"}
+
 def setRDFStoreURL(url):
 	'''set the rdf uri path'''
 	rdfStoreURL = url
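
The genre lookup added in this revision can be illustrated on its own. This is a minimal sketch in Python 3 syntax (urllib.parse.quote replaces the Python 2 urllib.quote), not the committed code: genre_uris and the sample input format are assumptions, GENRE_NAMES is a small excerpt of genreDict, and the pattern is narrowed to digits so that int() cannot fail on a non-numeric match.

```python
import re
from urllib.parse import quote

# Excerpt of genreDict and the ontology prefix from myspaceuris.py.
GENRE_NAMES = {0: "", 13: "Electronica", 15: "Experimental", 50: "Drum & Bass"}
ONTOLOGY = "http://grasstunes.net/ontology/myspace.owl#"


def genre_uris(band_type_text):
    """Return (name, uri) pairs for each known genre number in the text."""
    found = []
    # Only accept one to three digits between the quotes.
    for number in re.findall(r':"(\d{1,3})"', band_type_text):
        name = GENRE_NAMES.get(int(number))
        if name:  # skip unknown numbers and the empty name used for 0
            found.append((name, ONTOLOGY + quote(name)))
    return found
```

Using .get() here means an unknown genre number is skipped, whereas a direct genreDict[int(gnum)] lookup raises KeyError for a number missing from the table.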



[Mypyspace-developer] SF.net SVN: mypyspace:[319] myspaceCrawler/trunk

From: <gea...@us...> - 2009-01-27 13:55:12

Revision: 319
          http://mypyspace.svn.sourceforge.net/mypyspace/?rev=319&view=rev
Author:   gearmonkey
Date:     2009-01-27 13:55:06 +0000 (Tue, 27 Jan 2009)

Log Message:
-----------
fixed the image url problem, images now actually point to song images, not myspace default filler images.

added a bit more inline documentation to the api examples code.

Modified Paths:
--------------
    myspaceCrawler/trunk/examples.py
    myspaceCrawler/trunk/mpsUser.py

Modified: myspaceCrawler/trunk/examples.py
===================================================================
--- myspaceCrawler/trunk/examples.py	2009-01-27 12:43:42 UTC (rev 318)
+++ myspaceCrawler/trunk/examples.py	2009-01-27 13:55:06 UTC (rev 319)
@@ -6,7 +6,7 @@
 some simple functions demonstrating mpsUser and mpsSong functionality
 
 Created by Benjamin Fields on 2008-11-09.
-Copyright (c) 2008 __MyCompanyName__. All rights reserved.
+Copyright (c) 2008 Goldsmiths. All rights reserved.
 """
 
 import sys
@@ -36,7 +36,7 @@
 		return 0
 	
 def socialCharts(initArtist, radius, chartLength=1):
-	'''breadth first crawl of width radius to find most chartLength popular songs from the center initArtist.'''
+	'''breadth first crawl of width radius to find at most chartLength popular songs from the center initArtist.'''
 	songQueue = []
 	visitedArtists = []
 	artistsInThisLevel = [initArtist]

Modified: myspaceCrawler/trunk/mpsUser.py
===================================================================
--- myspaceCrawler/trunk/mpsUser.py	2009-01-27 12:43:42 UTC (rev 318)
+++ myspaceCrawler/trunk/mpsUser.py	2009-01-27 13:55:06 UTC (rev 319)
@@ -46,11 +46,13 @@
         isArtist         --     Boolean, True means instance describes a MySpace artist with media
         rdfprefix        --     prefix for all rdf URIs
         page             --     locally loaded copy of html pointed to by source
+			The following are only set if user is found to be an artist
         mediaXML         --     locally loaded (via miniDom) copy of xml describing playlist of media associated
-                                    with myspace Artist (not set in non artists)
+                                    with myspace Artist 
         totalPlays       --     sum of playcounts of all songs associated with myspace Artist 
-                                    (not set in non Artist)
-        artist           --     self declared name of artist (not set in non Artist)
+        artist           --     self declared name of artist
+		artistID         --     unique ID possessed by artists only, needed to retrieve media and  media related meta data
+		playlistID       --     unique ID used to retrieve playlist found on page
 
     
 	'''
@@ -342,7 +344,7 @@
 		else:
 			self.extractionprefix = extractionprefix
 		self.title = self.exhaustiveXML.getElementsByTagName('title')[0].firstChild.nodeValue
-		self.image = self.exhaustiveXML.getElementsByTagName('small')[0].firstChild.nodeValue
+		self.getimage()
 		self.playcount = xmlNode.getElementsByTagName('stats')[0].getAttribute('plays')
 		self.comments = "" #this is a blank string hold for the comments fields.  Might be used later.
 		self.trackNum, self.totalTracks = None, None
@@ -357,10 +359,31 @@
 		try:
 			self.uri = self.exhaustiveXML.getElementsByTagName('link')[0].firstChild.nodeValue
 		except AttributeError, err:
-			logging.info("mpsUser::getUri ran into a problem finding the download link for a song by artist with uid: " + 
+			logging.info("mpsUser::mpsSong::getUri ran into a problem finding the download link for a song by artist with uid: " + 
 				str(self.parent().uid) + " link will be left blank.\n\tError msg: " + str(err))
 			self.uri = ''
-
+	def getimage(self):
+		'''find an image associated with the song, getting the largest resolution available'''
+		try:
+			self.image = self.exhaustiveXML.getElementsByTagName('track')[0].getElementsByTagName('large')[0].firstChild.nodeValue
+		except AttributeError:
+			try:
+				self.image = self.exhaustiveXML.getElementsByTagName('track')[0].getElementsByTagName('medium')[0].firstChild.nodeValue
+			except AttributeError:
+				try:
+					self.image = self.exhaustiveXML.getElementsByTagName('track')[0].getElementsByTagName('small')[0].firstChild.nodeValue
+				except Exception, err:
+					logging.info("mpsUser::mpsSong::getimage ran into a problem finding an image for a song by artist with uid: " + 
+						str(self.parent().uid) + " image will be left blank.\n\tError msg: " + str(err))
+					self.image = ''
+			except Exception, err:
+				logging.info("mpsUser::mpsSong::getimage ran into a problem finding an image for a song by artist with uid: " + 
+					str(self.parent().uid) + " image will be left blank.\n\tError msg: " + str(err))
+				self.image = ''
+		except Exception, err:
+			logging.info("mpsUser::mpsSong::getimage ran into a problem finding an image for a song by artist with uid: " + 
+				str(self.parent().uid) + " image will be left blank.\n\tError msg: " + str(err))
+			self.image = ''
 	def setTrackNum(self, trackNumber, totalTracks):
 		'''set the track number for this song and the number of tracks in the album it is in.'''
 		self.trackNum = trackNumber
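
The large, then medium, then small fallback in getimage can be expressed as a loop. This is a minimal sketch in Python 3 syntax, not the committed code: first_image_url is a hypothetical helper, and the XML layout in the usage note is assumed. One difference worth noting: in getimage a missing element raises IndexError from the [0] lookup, which lands in the outer except Exception branch rather than trying the next size, while the loop below moves on to the next size in both the missing and the empty case.

```python
from xml.dom import minidom


def first_image_url(track, sizes=("large", "medium", "small")):
    """Return the text of the first non-empty size element, or ''."""
    for size in sizes:
        for node in track.getElementsByTagName(size):
            child = node.firstChild
            # An element such as <large/> has no child node at all.
            if child is not None and child.nodeValue and child.nodeValue.strip():
                return child.nodeValue.strip()
    return ""
```

For a document parsed with minidom.parseString from a track whose large element is empty and whose medium element holds a URL, the helper returns the medium URL.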



[Mypyspace-developer] SF.net SVN: mypyspace:[318] myspaceCrawler/trunk/myspaceuris.py

From: <gea...@us...> - 2009-01-27 13:19:12

Revision: 318
          http://mypyspace.svn.sourceforge.net/mypyspace/?rev=318&view=rev
Author:   gearmonkey
Date:     2009-01-27 12:43:42 +0000 (Tue, 27 Jan 2009)

Log Message:
-----------
I think things are finally back in working order after the transition to myspace music 2.0
Turns out the new artistID and playlistID identifiers are terminated with ampersands not commas.  I don't know where I got the comma termination from, but it's fixed now.
These changes need to get merged with the web-serv branch, I'll sort that out later today.
Also, when I was testing these I noticed that the song pictures are all getting filled in with the myspace no photo icon.  Not sure where that's coming from, but I'll try to sort that out in the near term as well.

Modified Paths:
--------------
    myspaceCrawler/trunk/myspaceuris.py

Modified: myspaceCrawler/trunk/myspaceuris.py
===================================================================
--- myspaceCrawler/trunk/myspaceuris.py	2009-01-26 10:28:38 UTC (rev 317)
+++ myspaceCrawler/trunk/myspaceuris.py	2009-01-27 12:43:42 UTC (rev 318)
@@ -15,7 +15,7 @@
 # new tag updated 13/1/2009
 #"""&nbsp;<a href="http://profile.myspace.com/index.cfm?fuseaction=user.viewprofile&friendid=""", '''"'''
 #						### tag will be terminated by a '"' ###
-friendNameTag = '''_friendLink">''', '''<'''
+friendNameTag = '''_friendLink">''', '''</a>'''
 						### tag terminated by '<' ###
 userIDTag = '''"DisplayFriendId":''', ''',"IsLoggedIn"'''
 # 13/1/2009
@@ -41,9 +41,9 @@
 
 ###
 #these two tag scraps are provisional for grabbing the ArtistID and playlist number, which are now nessecary to grab audio
-#both of these should be terminated by a comma
-playlistIDtag = """plid=""", ''','''
-artistIDtag = """artid=""",''','''
+#both of these should be terminated by an ampersand
+playlistIDtag = """plid=""", '''&'''
+artistIDtag = """artid=""",'''&'''
 
 
 #########################################################################################################
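
Why the terminator matters can be seen with a small start/end extraction. This is a minimal sketch, not the project's scrapePage function (whose implementation is not shown in this message): extract_between is a hypothetical helper and the sample page fragment in the usage note is made up.

```python
def extract_between(page, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    begin = page.find(start)
    if begin == -1:
        return None
    begin += len(start)
    stop = page.find(end, begin)
    if stop == -1:
        # The terminator never appears after the start tag.
        return None
    return page[begin:stop]
```

On a fragment such as "artid=12345&plid=678&profid=9", the pair ("plid=", "&") yields "678", while the old pair ("plid=", ",") finds no terminator and yields None.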



[Mypyspace-developer] SF.net SVN: mypyspace:[317] musicGrabber/branches/webserv-branch/ myspace2rdf.py

From: <gea...@us...> - 2009-01-26 10:28:43

Revision: 317
          http://mypyspace.svn.sourceforge.net/mypyspace/?rev=317&view=rev
Author:   gearmonkey
Date:     2009-01-26 10:28:38 +0000 (Mon, 26 Jan 2009)

Log Message:
-----------
fixed lots of spelling errors in the newly added bits of myspace2rdf.

it's scrape not scrap or crap 
and
scraping not scrapping 

good.

Modified Paths:
--------------
    musicGrabber/branches/webserv-branch/myspace2rdf.py

Modified: musicGrabber/branches/webserv-branch/myspace2rdf.py
===================================================================
--- musicGrabber/branches/webserv-branch/myspace2rdf.py	2009-01-25 17:14:03 UTC (rev 316)
+++ musicGrabber/branches/webserv-branch/myspace2rdf.py	2009-01-26 10:28:38 UTC (rev 317)
@@ -131,10 +131,10 @@
 		
 	def createArtistRDF(self):
 		'''write RDF for an artist page'''		
-		if self.scrapArtistID() and self.scrapPlaylistNumber():
+		if self.scrapeArtistID() and self.scrapePlaylistNumber():
 			pass
 		else:
-			print 'crap failed'
+			print 'scrape failed'
 		
 		# get the image
 		imageURL = scrapePage(self.page, [picTag[0]+str(self.uid)+'''"><img src="'''], picTag[1])
@@ -190,22 +190,22 @@
 		p = f.read()
 		print p
 
-	def scrapArtistID(self):
-		'''attempt to find via scrap of page the internal artist number.'''
+	def scrapeArtistID(self):
+		'''attempt to find via scrape of page the internal artist number.'''
 		try:
 			self.artistID = scrapePage(self.page, [artistIDtag[0]], artistIDtag[1])
 			return True
 		except Exception, err:
-			print "Ran into trouble trying to scrap the ArtistID for page from " + self.source  + "\nError::" + str(err)
+			print "Ran into trouble trying to scrape the ArtistID for page from " + self.source  + "\nError::" + str(err)
 			return False
 			
-	def scrapPlaylistNumber(self):
-		"""attempts to find via scrap of the internal identifier of an artist's playlist of songs"""
+	def scrapePlaylistNumber(self):
+		"""attempts to find via scrape of the internal identifier of an artist's playlist of songs"""
 		try:
 			self.playlistID = scrapePage(self.page, [playlistIDtag[0]], playlistIDtag[1])
 			return True
 		except Exception, err:
-			print "Ran into trouble trying to scrap the playlistID for page from " + self.source  + "\nError::" + str(err)
+			print "Ran into trouble trying to scrape the playlistID for page from " + self.source  + "\nError::" + str(err)
 			return False
 	
 	



[Mypyspace-developer] SF.net SVN: mypyspace:[316] musicGrabber/branches/webserv-branch

From: <ku...@us...> - 2009-01-25 17:14:06

Revision: 316
          http://mypyspace.svn.sourceforge.net/mypyspace/?rev=316&view=rev
Author:   kurtjx
Date:     2009-01-25 17:14:03 +0000 (Sun, 25 Jan 2009)

Log Message:
-----------
added tracks to webserv

Modified Paths:
--------------
    musicGrabber/branches/webserv-branch/myspace2rdf.py
    musicGrabber/branches/webserv-branch/myspaceuris.py

Modified: musicGrabber/branches/webserv-branch/myspace2rdf.py
===================================================================
--- musicGrabber/branches/webserv-branch/myspace2rdf.py	2009-01-23 17:20:28 UTC (rev 315)
+++ musicGrabber/branches/webserv-branch/myspace2rdf.py	2009-01-25 17:14:03 UTC (rev 316)
@@ -98,13 +98,12 @@
 			
 	def isArtist(self):
 		'''is current page an artist???
-				- currently check for the flash player
-				- should switch to check for genre tags instead???'''
+				- previously checked for the flash player
+				- new check for genre tags instead'''
 		
 		if self.page:
-		
-			player = scrapePage(self.page, [playerTag[0]], playerTag[1])
-			if player:
+			genrePresent = scrapePage(self.page, [genreTag[0]], genreTag[1])
+			if genrePresent:
 				self.subject = mopy.mo.MusicArtist(dbtuneMyspace+'uid/'+str(self.uid))
 				#self.subjecttwo = mopy.foaf.Person('http://dbtune.org/myspace/uid/'+str(self.uid))
 				#self.subject = mopy.mo.MusicArtist('http://dbtune.org/myspace/uid/'+str(self.uid))
@@ -132,6 +131,11 @@
 		
 	def createArtistRDF(self):
 		'''write RDF for an artist page'''		
+		if self.scrapArtistID() and self.scrapPlaylistNumber():
+			pass
+		else:
+			print 'crap failed'
+		
 		# get the image
 		imageURL = scrapePage(self.page, [picTag[0]+str(self.uid)+'''"><img src="'''], picTag[1])
 		img = mopy.foaf.Image(imageURL)
@@ -150,19 +154,25 @@
 			# self.mi.add(thing2)
 			
 		idx=0
-		xmlPage = try_open(mediaBase + str(self.uid))
-		xmlStruct = dom.parseString(''.join(xmlPage.readlines()))
-		songs = xmlStruct.getElementsByTagName('song')
-		for song in songs:
-			try:
+		
+		xmlPage = try_open(mediaBase[0] + str(self.artistID) + mediaBase[1] + str(self.playlistID) + mediaBase[2] + str(self.uid) + mediaBase[3])
+		#print mediaBase[0] + str(self.artistID) + mediaBase[1] + str(self.playlistID) + mediaBase[2] + str(self.uid) + mediaBase[3]
+		self.xmlStruct = dom.parseString(''.join(xmlPage.readlines()))
+
+		songList = self.xmlStruct.getElementsByTagName('song')
+		for song in songList:
+			'''try:
 				songTitle = unicodedata.normalize('NFKC',song.getAttribute('title')).encode('ascii','ignore')
 			except AttributeError, err:
 				songTitle = str(None)
 			except IndexError, err:
 				songTitle = str(None)
-			availableAs = song.getAttribute('durl')
+			#availableAs = song.getAttribute('durl')'''
+			thisSong = mpsSong(self, song, 'downloadprefix')
+			thisSong.getUri()
+			availableAs = thisSong.uri
 			track = mopy.mo.Track()
-			track.title.set(songTitle)
+			track.title.set(thisSong.title)
 			
 			avas = mopy.mo.MusicalItem(availableAs)
 			track.available_as.set(avas)
@@ -180,7 +190,25 @@
 		p = f.read()
 		print p
 
-		
+	def scrapArtistID(self):
+		'''attempt to find via scrap of page the internal artist number.'''
+		try:
+			self.artistID = scrapePage(self.page, [artistIDtag[0]], artistIDtag[1])
+			return True
+		except Exception, err:
+			print "Ran into trouble trying to scrap the ArtistID for page from " + self.source  + "\nError::" + str(err)
+			return False
+			
+	def scrapPlaylistNumber(self):
+		"""attempts to find via scrap of the internal identifier of an artist's playlist of songs"""
+		try:
+			self.playlistID = scrapePage(self.page, [playlistIDtag[0]], playlistIDtag[1])
+			return True
+		except Exception, err:
+			print "Ran into trouble trying to scrap the playlistID for page from " + self.source  + "\nError::" + str(err)
+			return False
+	
+	
 	def createRDF(self):
 		'''write the info to RDF for non-artist page'''
 		match = re.findall('viewAlbums&amp;friendID='+str(self.uid)+'">\s*<img border="\d*" alt="[^"]*" src="([^"]*?)"', str(self.page))
@@ -245,14 +273,147 @@
 		for genre in genres:
 			genre = genre.rstrip()
 			genre = genre.lstrip()
-			g = mopy.mo.Genre('http://grasstunes.net/ontology/myspace.owl#'+urllib.quote(str(genre)))
+			g = mopy.mo.Genre(myspaceOntology+urllib.quote(str(genre)))
 			g.name.set(genre)
 			self.mi.add(g)
 			self.subject.genreTag.add(g)
 			genresfixed.append(genre)
 		return genresfixed
+		
+class mpsSong:
+	"""a class that wraps around the downloading, feature extracting and modeling of a piece of media attached to a mpsUser
+	mpsSong object instances have the following public variables:
+		parent           --     a weakref to the mpsUser that generated the mpsSong instance
+        uri              --     lo res cached download link
+        betterUri        --     hi res cached download link (not always available)
+        downloadprefix   --     local prefix to stick the file when downloaded
+        extractionprefix --     local prefix to stick the feature files when extracted
+        title            --     title of song
+        image            --     url to get image associated with song
+        playcount        --     number of times song has been played via myspace player
+        trackNum         --     track number based on order presented on myspace
+        totalTracks      --     number of songs available for parent
+        filename         --     name used for local lofi file, when downloaded
+        HIFIfilename     --     name used for local hifi file, when downloaded
+        beats            --     local name of beat segmentation file, used to do variable segment length feature extraction
 
+	"""
+	def __init__(self, parent, xmlNode, downloadprefix = '', extractionprefix = ''):
+		"""initializes the mpsSong class.  Parent is a pointer to the calling mpsUser, xmlNode should be a DOM object with the songs info.  downloadprefix is the local directory prefix where the media will be put, default is an empty string.  If no extractionprefix is given, extracted features will be places in the dir pointed to by downloadprefix"""
+		#self.parent = weakref.ref(parent)
+		self.xmlNode = xmlNode
+		self.getUri()
+		#the nicer file download is currently broken...
+		#self.betterURI = xmlNode.getAttribute('downloadable')
+		self.downloadprefix = downloadprefix
+		if extractionprefix == '':
+			self.extractionprefix = downloadprefix
+		else:
+			self.extractionprefix = extractionprefix
+		self.title = self.exhaustiveXML.getElementsByTagName('title')[0].firstChild.nodeValue
+		self.image = self.exhaustiveXML.getElementsByTagName('small')[0].firstChild.nodeValue
+		self.playcount = xmlNode.getElementsByTagName('stats')[0].getAttribute('plays')
+		self.comments = "" #this is a blank string hold for the comments fields.  Might be used later.
+		self.trackNum, self.totalTracks = None, None
+		self.filename, self.HIFIfilename = None, None
+		self.beats = None
 
+	def getUri(self):
+		self.songID = self.xmlNode.getAttribute('songId')
+		xmlPage = try_open(songBase[0] + str(self.songID) + songBase[1])
+		self.exhaustiveXML = dom.parseString(''.join(xmlPage.readlines()))
+		xmlPage.close()
+		try:
+			self.uri = self.exhaustiveXML.getElementsByTagName('link')[0].firstChild.nodeValue
+		except AttributeError, err:
+			logging.info("mpsUser::getUri ran into a problem finding the download link for a song by artist with uid: " + 
+				str(self.parent().uid) + " link will be left blank.\n\tError msg: " + str(err))
+			self.uri = ''
+
+	def setTrackNum(self, trackNumber, totalTracks):
+		'''set the track number for this song and the number of tracks in the album it is in.'''
+		self.trackNum = trackNumber
+		self.totalTracks = totalTracks
+
+	def download(self):
+		'''download the track.  
+		Upon success set self.filename to the local location of the downloaded song and return true.  
+		On FAIL return false.'''
+		logging.debug("downloading " + self.title + " by " + self.parent().artist +  " to " + self.downloadprefix)
+		if self.trackNum != None:
+			filename = unicode(str(self.trackNum), 'utf8')  + u'_' +  self.title + u'.mp3'
+		else:
+			filename =   self.title + u'.mp3'
+		if try_get(self.uri, os.path.join(self.downloadprefix, filename)) != None:
+			logging.debug("success on " +  self.title + " by " + self.parent().artist +  " to " + os.path.join(self.downloadprefix,filename))
+			self.filename = filename
+			return True
+		else:
+			logging.debug("FAIL on " + self.title + " by " + self.parent().artist +  " to " + os.path.join(self.downloadprefix,filename))
+			return False
+
+	def downloadHIFI(self):
+		'''if it exists, download the hi fidelity version of the track.  
+		Upon success set self.HIFIfilename to the local location of the downloaded song and return true.  
+		On FAIL return false.'''
+		if not self.betterURI:
+			logging.info("NO hi-fi version of " + self.title + " by " + self.parent().artist + " but we did look for it.")
+			return False
+		logging.debug("downloading hifi copy of " + self.title + " by " + self.parent().artist +  " to " + self.downloadprefix)
+		if self.trackNum != None:
+			filename = unicode(str(self.trackNum), 'utf8') + u'_' + self.title + u'_hifi.mp3'
+		else:
+			filename =  self.title + u'_hifi.mp3'
+		if (try_get(self.betterURI, os.path.join(self.downloadprefix,filename)) != None):
+			logging.debug("success on hi-fi version of " + self.title + " by " + self.parent().artist +  " to " + os.path.join(self.downloadprefix,filename))
+			self.HIFIfilename = filename
+			return True
+		else:
+			logging.debug("FAIL on hi-fi version of " + self.title + " by " + self.parent().artist +  " to " + os.path.join(self.downloadprefix,filename))
+			return False
+
+
+	def tag(self, hifi = False):
+		'''create or modify the id3 tag for downloaded song associated with self. set optional hifi arg to tag the hifi download'''
+		if hifi:
+			fileToTag = os.path.join(self.downloadprefix,self.HIFIfilename)
+		else:
+			fileToTag = os.path.join(self.downloadprefix,self.filename)
+		if fileToTag == None:
+			logging.info("asked to tag a file associated with uid: " + str(self.parent().uid) + " but the song does not exist locally")			
+		logging.debug("adding tags to " + fileToTag)
+		try: id3 = mutagen.id3.ID3(fileToTag)
+		except mutagen.id3.ID3NoHeaderError:
+			logging.info("No ID3 header found for " + fileToTag + "; creating tag from scratch")
+			id3 = mutagen.id3.ID3()
+		except Exception, err:
+			logging.error(str(err))
+			return
+		id3.add(mutagen.id3.TIT2(encoding=3,text=self.title))
+		id3.add(mutagen.id3.TPE1(encoding=3,text=self.parent().artist))
+		id3.add(mutagen.id3.COMM(encoding=3,text=self.comments, lang="eng", desc=""))
+		#id3.add(mutagen.id3.COMM(encoding=3,text=relationshipLink, lang="eng", desc="MusicGrabberSig"))	
+		id3.add(mutagen.id3.TALB(encoding=3,text=self.parent().album))
+		if self.trackNum != None:
+			id3.add(mutagen.id3.TRCK(encoding=3,text=str(self.trackNum) + '/' + str(self.totalTracks)))
+		id3.add(mutagen.id3.POPM(encoding=3,email=str(self.parent().uid)+"@myspace", rating = 128, count=self.playcount))
+		if self.image == None:
+			logging.error("No image present for " + self.title + ", " + self.parent().artist)
+		try:
+			logging.debug("trying to get image from " + self.image)
+			localImgPath, imgHeader = try_get(self.image, os.path.join("/tmp",os.path.basename(self.image)))
+			imgHandle = open(localImgPath)
+			id3.add(mutagen.id3.APIC(encoding=3, mime=imgHeader.type, data=imgHandle.read(), type=17, desc="Song pic from myspace.com"))
+		except:
+			logging.error("Unable to retrieve image for " + self.title + ", " + self.parent().artist)
+		try:
+			id3.save(fileToTag)
+		except Exception, err:
+			logging.error(str(err) + ";couldn\'t save the tag for " + self.title + " by " + self.parent().artist)
+
+
+
+
 def main(argv=None):
 	if argv is None:
 		argv = sys.argv

Modified: musicGrabber/branches/webserv-branch/myspaceuris.py
===================================================================
--- musicGrabber/branches/webserv-branch/myspaceuris.py	2009-01-23 17:20:28 UTC (rev 315)
+++ musicGrabber/branches/webserv-branch/myspaceuris.py	2009-01-25 17:14:03 UTC (rev 316)
@@ -4,6 +4,8 @@
 
 #						### append user id to this ###
 rdfStoreURL = "http://myrdfspace.com/alpha/"
+
+myspaceOntology = 'http://grasstunes.net/ontology/myspace.owl#'
 #########################################################################################################
 
 #########################################################################################################
@@ -37,8 +39,8 @@
 ###
 #these two tag scraps are provisional for grabbing the ArtistID and playlist number, which are now nessecary to grab audio
 #both of these should be terminated by a comma
-playlistIDtag = """plid=""", ''','''
-artistIDtag = """artid=""",''','''
+playlistIDtag = """plid=""", '''&'''
+artistIDtag = """artid=""",'''&'''
 
 
 #########################################################################################################


This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.

[Mypyspace-developer] SF.net SVN: mypyspace:[315] myspaceCrawler/trunk/RDFtrans.py

From: <gea...@us...> - 2009-01-23 17:20:32

Revision: 315
          http://mypyspace.svn.sourceforge.net/mypyspace/?rev=315&view=rev
Author:   gearmonkey
Date:     2009-01-23 17:20:28 +0000 (Fri, 23 Jan 2009)

Log Message:
-----------
Oops, used the wrong name for RDFtrans's internal HTML representation (self.page instead of self.HTML).

Modified Paths:
--------------
    myspaceCrawler/trunk/RDFtrans.py

Modified: myspaceCrawler/trunk/RDFtrans.py
===================================================================
--- myspaceCrawler/trunk/RDFtrans.py	2009-01-23 16:57:40 UTC (rev 314)
+++ myspaceCrawler/trunk/RDFtrans.py	2009-01-23 17:20:28 UTC (rev 315)
@@ -61,7 +61,7 @@
 	def isArtist(self):
 		'''is current page an artist???'''
 		if self.HTML:
-			if genreTag[0] in self.page:
+			if genreTag[0] in self.HTML:
 				artist = True
 			else:
 				artist = False



[Mypyspace-developer] SF.net SVN: mypyspace:[314] myspaceCrawler/trunk

From: <gea...@us...> - 2009-01-23 16:57:46

Revision: 314
          http://mypyspace.svn.sourceforge.net/mypyspace/?rev=314&view=rev
Author:   gearmonkey
Date:     2009-01-23 16:57:40 +0000 (Fri, 23 Jan 2009)

Log Message:
-----------
Removed the playerTag, as it's broken and unreliable.  Replaced the artist check with a check for genre formatting tags.  The robustness of this method is yet to be determined, but it seems to work in most cases.

Also, slightly altered some of the parameters used in url grabbing.

Modified Paths:
--------------
    myspaceCrawler/trunk/RDFtrans.py
    myspaceCrawler/trunk/mpsUser.py
    myspaceCrawler/trunk/myspaceuris.py
    myspaceCrawler/trunk/tryurl.py

Modified: myspaceCrawler/trunk/RDFtrans.py
===================================================================
--- myspaceCrawler/trunk/RDFtrans.py	2009-01-16 17:59:29 UTC (rev 313)
+++ myspaceCrawler/trunk/RDFtrans.py	2009-01-23 16:57:40 UTC (rev 314)
@@ -61,12 +61,15 @@
 	def isArtist(self):
 		'''is current page an artist???'''
 		if self.HTML:
-			player = scrapePage(self.HTML, playerTag[0], playerTag[1])
+			if genreTag[0] in self.page:
+				artist = True
+			else:
+				artist = False
 			if not scrapePage(self.HTML, nameTag[0], nameTag[1]) == None:
 				self.name = scrapePage(self.HTML, nameTag[0], nameTag[1])
 			else:
 				self.name = str(None)
-			if player:
+			if artist:
 				# make the mopy subject a myspace:MusicArtist
 				self.subject = mopy.myspace.MusicArtist(self.NSprefix+str(self.uid))
 				# set the subject name

Modified: myspaceCrawler/trunk/mpsUser.py
===================================================================
--- myspaceCrawler/trunk/mpsUser.py	2009-01-16 17:59:29 UTC (rev 313)
+++ myspaceCrawler/trunk/mpsUser.py	2009-01-23 16:57:40 UTC (rev 314)
@@ -146,8 +146,8 @@
 		return xmlStruct
 		
 	def artistCheck(self):
-		'''for a given mpsUser with read source, check to see if it is an artist profile'''
-		if playerTag[0] in self.page:
+		'''for a given mpsUser with read source, check to see if it is an artist profile.  This is done by examining the html source for the presence of genre labels.  Note that even an artist without genre tags, will have these bits of markup, they will simply be blank.'''
+		if genreTag[0] in self.page:
 			return True
 		else:
 			return False
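
The artist check described in the docstring above amounts to a substring test on the page source. A minimal sketch in modern Python 3 (the commits themselves are Python 2): the helper name `looks_like_artist_page` and the sample pages are hypothetical, and the marker string is copied from the opening pattern of genreTag as of r312.

```python
# Opening markup of the genre label, copied from genreTag[0] as of r312.
GENRE_OPEN = '<font color="#033330" size="1" face="Arial, Helvetica, sans-serif"><strong>'


def looks_like_artist_page(html):
    """Return True if the page source contains the genre label markup.

    Hypothetical helper mirroring the `genreTag[0] in self.page` test.
    """
    return bool(html) and GENRE_OPEN in html


# Made-up page fragments for illustration only.
artist_page = "<td>" + GENRE_OPEN + " Rock / Indie </strong></td>"
blank_genre_page = "<td>" + GENRE_OPEN + "</strong></td>"
fan_page = '<td><span class="nametext">someone</span></td>'
```

As the docstring notes, an artist with no genres listed still carries the (empty) markup, so `blank_genre_page` is expected to count as an artist page here.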

Modified: myspaceCrawler/trunk/myspaceuris.py
===================================================================
--- myspaceCrawler/trunk/myspaceuris.py	2009-01-16 17:59:29 UTC (rev 313)
+++ myspaceCrawler/trunk/myspaceuris.py	2009-01-23 16:57:40 UTC (rev 314)
@@ -8,7 +8,8 @@
 
 #########################################################################################################
 # useful tags
-playerTag = """SWFObject("http://musicservices.myspace.com/Modules/MusicServices/Services/Embed.ashx/ptype=4""", ''';'''
+#the player tag is broken, so we're going to use the genre tag as an artist check
+#playerTag = """SWFObject("http://musicservices.myspace.com/Modules/MusicServices/Services/Embed.ashx/ptype=4""", ''';'''
 #						###	this tag will be terminated by a '.' ###
 friendTag = '''&nbsp;<a href="http://profile.myspace.com/index.cfm?fuseaction=user.viewProfile&friendID=''', '''"'''
 # new tag updated 13/1/2009

Modified: myspaceCrawler/trunk/tryurl.py
===================================================================
--- myspaceCrawler/trunk/tryurl.py	2009-01-16 17:59:29 UTC (rev 313)
+++ myspaceCrawler/trunk/tryurl.py	2009-01-23 16:57:40 UTC (rev 314)
@@ -5,8 +5,8 @@
 #keepalive comes from the urlgrabber project, licensed under GPL and available here: http://linux.duke.edu/projects/urlgrabber/
 import logging
 #changing to urllib2 and using a recently added timeout feature, so that the socket will timeout after TIMEOUT seconds
-TIMEOUT = 12
-SLEEPTIME = .25 
+TIMEOUT = 15
+SLEEPTIME = 5 
 
 
 #use the following three lines and import keepalive to use the keep alive urlopener
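
The diff only shows the two constants changing, not how tryurl.py consumes them. The following Python 3 sketch is therefore an assumption about one plausible use: pass TIMEOUT to each request and pause SLEEPTIME seconds between failed attempts. The names `fetch_with_retries`, `fetch`, `attempts`, and `sleep` are hypothetical and do not come from the project.

```python
import time

TIMEOUT = 15   # seconds before a request gives up (value from r314)
SLEEPTIME = 5  # seconds to pause between attempts (value from r314)


def fetch_with_retries(fetch, url, attempts=3, sleep=time.sleep):
    """Call fetch(url, timeout=TIMEOUT), sleeping SLEEPTIME after each failure.

    Returns the fetched value, or None if every attempt raised OSError.
    """
    for attempt in range(attempts):
        try:
            return fetch(url, timeout=TIMEOUT)
        except OSError:
            if attempt < attempts - 1:
                sleep(SLEEPTIME)
    return None


# Usage with a stand-in fetcher that fails twice, then succeeds.
calls, pauses = [], []


def flaky_fetch(url, timeout):
    calls.append(timeout)
    if len(calls) < 3:
        raise OSError("simulated timeout")
    return "page body"


result = fetch_with_retries(flaky_fetch, "http://example.invalid/", sleep=pauses.append)
```

Passing the fetcher and the sleep function in as parameters keeps the sketch testable without touching the network.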



[Mypyspace-developer] SF.net SVN: mypyspace:[313] myspaceCrawler/trunk

From: <gea...@us...> - 2009-01-16 17:59:35

Revision: 313
          http://mypyspace.svn.sourceforge.net/mypyspace/?rev=313&view=rev
Author:   gearmonkey
Date:     2009-01-16 17:59:29 +0000 (Fri, 16 Jan 2009)

Log Message:
-----------
corrected the default base uri for rdf generated via mpsUser.

Modified Paths:
--------------
    musicGrabber/branches/webserv-branch/myspace2rdf.py
    myspaceCrawler/tags/0.8.1b_release/myspaceCrawler.py
    myspaceCrawler/trunk/mpsUser.py
    myspaceCrawler/trunk/myspaceuris.py

Modified: musicGrabber/branches/webserv-branch/myspace2rdf.py
===================================================================
--- musicGrabber/branches/webserv-branch/myspace2rdf.py	2009-01-16 17:24:24 UTC (rev 312)
+++ musicGrabber/branches/webserv-branch/myspace2rdf.py	2009-01-16 17:59:29 UTC (rev 313)
@@ -102,10 +102,8 @@
 				- should switch to check for genre tags instead???'''
 		
 		if self.page:
-			#############################################
-			# kludge set playr to always flase for now ##
-			#############################################
-			player = False #= scrapePage(self.page, [playerTag[0]], playerTag[1])
+		
+			player = scrapePage(self.page, [playerTag[0]], playerTag[1])
 			if player:
 				self.subject = mopy.mo.MusicArtist(dbtuneMyspace+'uid/'+str(self.uid))
 				#self.subjecttwo = mopy.foaf.Person('http://dbtune.org/myspace/uid/'+str(self.uid))

Modified: myspaceCrawler/tags/0.8.1b_release/myspaceCrawler.py
===================================================================
--- myspaceCrawler/tags/0.8.1b_release/myspaceCrawler.py	2009-01-16 17:24:24 UTC (rev 312)
+++ myspaceCrawler/tags/0.8.1b_release/myspaceCrawler.py	2009-01-16 17:59:29 UTC (rev 313)
@@ -32,7 +32,7 @@
 
 
 
-THREAD_CAP = 30 #maximum number of threads allowed to be firing at once
+THREAD_CAP = 10000 #maximum number of threads allowed to be firing at once
 THREAD_STALL_TIME = 30 #length of time in seconds to wait until the thread count is checked again
 LOG_FILENAME = "musicCrawler.log" #name of logger file (path set at commandline)
 

Modified: myspaceCrawler/trunk/mpsUser.py
===================================================================
--- myspaceCrawler/trunk/mpsUser.py	2009-01-16 17:24:24 UTC (rev 312)
+++ myspaceCrawler/trunk/mpsUser.py	2009-01-16 17:59:29 UTC (rev 313)
@@ -55,7 +55,7 @@
     
 	'''
 
-	def __init__(self, url, rdfprefix = dbtuneMyspace):
+	def __init__(self, url, rdfprefix = dbtuneMyspace + 'uid/'):
 		"""Initialization will set the source url, attempt to create a socket connection with the url and determine if this mpsUser is an artist.  If the user given is an artist, the initialization will also scrape the top Friends. rdfprefix is the uri base prepended to the uids of other myspace resources, by default it is set to the dbtune live service"""
 		self.source = url
 		self.uid = -1

Modified: myspaceCrawler/trunk/myspaceuris.py
===================================================================
--- myspaceCrawler/trunk/myspaceuris.py	2009-01-16 17:24:24 UTC (rev 312)
+++ myspaceCrawler/trunk/myspaceuris.py	2009-01-16 17:59:29 UTC (rev 313)
@@ -46,7 +46,6 @@
 
 
 #########################################################################################################
-
 # myspace uri for downloads  ----this has gotten a bit more complicated in the roll out of myspace's new media player
 # this xml file gives the songIDs, the songsIDs must be used individually to request another xml file that then contains the uri to the cached media
 #
@@ -75,7 +74,6 @@
 myspaceOwlURI = 'http://grasstunes.net/ontology/myspace.owl'
 dbtuneMyspace = 'http://dbtune.org/myspace/'
 
-
 countries = ['Afghanistan', 'Albania', 'Algeria', 'American Samoa','Andorra',
 			'Angola','Anguilla','Antarctica','Antigua and Barbuda','Argentina',
 			'Armenia','Aruba','Australia','Austria','Azerbaijan','Bahamas',



[Mypyspace-developer] SF.net SVN: mypyspace:[312] myspaceCrawler/trunk

From: <gea...@us...> - 2009-01-16 17:24:34

Revision: 312
          http://mypyspace.svn.sourceforge.net/mypyspace/?rev=312&view=rev
Author:   gearmonkey
Date:     2009-01-16 17:24:24 +0000 (Fri, 16 Jan 2009)

Log Message:
-----------
The most notable change in this rev is the change to the genre tag.  It was picking up loads of garbage for artists with no genres listed.  This was fixed by removing all the whitespace from the scrape tag, replacing the closing tag (it used to be a single carriage return), and cleaning up the whitespace-stripping mechanism for genre in RDFtrans.  This seems to result in correct answers for artists with no listed genre (no genre entry in the rdf file) instead of gibberish.

I think the rdf generated by RDFtrans inside the myspaceCrawler project is actually bordering on sensible now (it's been valid since r309, but now it actually makes sense).  The most notable exception is that there are still some oddities in the myspace ontology namespace that need to be dealt with (the namespace is showing as default5 instead of myspace).

Modified Paths:
--------------
    myspaceCrawler/trunk/RDFtrans.py
    myspaceCrawler/trunk/myspaceuris.py
    myspaceCrawler/trunk/scraping.py

Modified: myspaceCrawler/trunk/RDFtrans.py
===================================================================
--- myspaceCrawler/trunk/RDFtrans.py	2009-01-15 20:06:13 UTC (rev 311)
+++ myspaceCrawler/trunk/RDFtrans.py	2009-01-16 17:24:24 UTC (rev 312)
@@ -86,18 +86,26 @@
 			friendUIDs = scrapePageWhile(self.HTML, friendTag[0], friendTag[1])
 			friendNames = scrapePageWhile(self.HTML, friendNameTag[0], friendNameTag[1])
 			friendPics = scrapePageWhile(self.HTML, friendPicTag[0], friendPicTag[1])
-
-			for i in range(len(friendUIDs)):
-				friend = mopy.myspace.Agent(self.NSprefix + str(friendUIDs[i]))
+			
+			if len(friendUIDs) != len(friendNames):
+				logging.info("Ther seems to be a different number of friend names (" + str(len(friendNames)) + 
+					") than friend IDs (" + str(len(friendUIDs)) + ") scraped off uid #" + str(self.uid) +".\nverify rdf.")
+			if len(friendUIDs) != len(friendPics):
+				logging.info("Ther seems to be a different number of friend pictures (" + str(len(friendPics)) + 
+					") than friend IDs (" + str(len(friendUIDs)) + ") scraped off uid #" + str(self.uid) +".\nverify rdf.")
+				
+			for idx, friendUID in enumerate(friendUIDs):
+				friend = mopy.myspace.Agent(self.NSprefix + str(friendUID))
 				try:
-					friend.name.set(friendNames[i])
+					friend.name.set(friendNames[idx])
+					logging.debug("adding friend with uid " + str(friendUID) + " whose name is " + str(friendNames[idx]))
 				except Exception, err:
 					logging.error("A friend name mismatch occurred in the rdf translation.\nRDFtrans::getFriends::" + str(err))
 				# refer to dbtune incase this friend isnt in crawl
-				thing = mopy.owl.Thing(dbtuneMyspace + 'uid/' + str(friendUIDs[i]))
+				thing = mopy.owl.Thing(dbtuneMyspace + 'uid/' + str(friendUID))
 				friend.sameAs.set(thing)
 				try:
-					img = mopy.foaf.Image(friendPics[i])
+					img = mopy.foaf.Image(friendPics[idx])
 					friend.depiction.add(img)
 					self.mi.add(img)
 				except:
@@ -203,13 +211,13 @@
 		genreraw = scrapePage(self.HTML, genreTag[0], genreTag[1])
 		if genreraw == None:
 			return genreraw
-		genreraw = str(genreraw).lstrip()
-		genreraw = genreraw.rstrip()
+		genreraw = str(genreraw).strip()
+		if genreraw == '':
+			return None
 		genres = genreraw.split('/')
 		genresfixed = []
 		for genre in genres:
-			genre = genre.rstrip()
-			genre = genre.lstrip()
+			genre = genre.strip()
 			g = mopy.mo.Genre(myspaceOwlURI+'#'+urllib.quote(str(genre)))
 			g.name.set(genre)
 			self.mi.add(g)
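
The genre cleanup in the hunk above can be sketched on its own in Python 3 (the commit is Python 2, where `urllib.quote` plays the role of `urllib.parse.quote`). The helper names `split_genres` and `genre_uri` are hypothetical. The base URI is the myspaceOwlURI value from myspaceuris.py, and the mopy objects are left out.

```python
from urllib.parse import quote

# Base URI taken from myspaceOwlURI in myspaceuris.py.
MYSPACE_OWL_URI = 'http://grasstunes.net/ontology/myspace.owl'


def split_genres(genre_raw):
    """Return cleaned genre names, or None when no genre is listed."""
    if genre_raw is None:
        return None
    genre_raw = genre_raw.strip()
    if genre_raw == '':
        # An artist page with blank genre markup yields only whitespace.
        return None
    return [genre.strip() for genre in genre_raw.split('/')]


def genre_uri(genre):
    """Build a genre URI by percent-encoding the name (hypothetical helper)."""
    return MYSPACE_OWL_URI + '#' + quote(genre)
```

For example, `split_genres('\r\n\t Rock / Hip Hop \r')` gives `['Rock', 'Hip Hop']`, while a whitespace-only scrape gives `None`, so no genre entry is written.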

Modified: myspaceCrawler/trunk/myspaceuris.py
===================================================================
--- myspaceCrawler/trunk/myspaceuris.py	2009-01-15 20:06:13 UTC (rev 311)
+++ myspaceCrawler/trunk/myspaceuris.py	2009-01-16 17:24:24 UTC (rev 312)
@@ -10,11 +10,11 @@
 # useful tags
 playerTag = """SWFObject("http://musicservices.myspace.com/Modules/MusicServices/Services/Embed.ashx/ptype=4""", ''';'''
 #						###	this tag will be terminated by a '.' ###
-friendTag = '''<td bgcolor="FFFFFF" align="center" valign="top" width="107" style="word-wrap:break-word">\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t&nbsp;<a href="http://profile.myspace.com/index.cfm?fuseaction=user.viewProfile&friendID=''', '''"'''
+friendTag = '''&nbsp;<a href="http://profile.myspace.com/index.cfm?fuseaction=user.viewProfile&friendID=''', '''"'''
 # new tag updated 13/1/2009
 #"""&nbsp;<a href="http://profile.myspace.com/index.cfm?fuseaction=user.viewprofile&friendid=""", '''"'''
 #						### tag will be terminated by a '"' ###
-friendNameTag = """_friendLink">""", '''<'''
+friendNameTag = '''_friendLink">''', '''<'''
 						### tag terminated by '<' ###
 userIDTag = '''"DisplayFriendId":''', ''',"IsLoggedIn"'''
 # 13/1/2009
@@ -24,7 +24,8 @@
 #						### tag terminated by a ';' ###
 nameTag = """<span class="nametext">""", '''<'''
 #						### tag term by '<'
-genreTag = '''<font color="#033330" size="1" face="Arial, Helvetica, sans-serif"><strong>\r\n\t\t\t\t\t''', ''' \r'''
+#the returned identifier will inevitably be surrouned by whitespace that will need to be stripped
+genreTag = '''<font color="#033330" size="1" face="Arial, Helvetica, sans-serif"><strong>''', '''</strong>'''
 #'''<font color="#033330" size="1" face="Arial, Helvetica, sans-serif"><strong>''', '''<'''
 #						### tag terminated by '<'
 niceURLTag = '''<td><div align="left">&nbsp;&nbsp;<span><a href="''', '''">'''

Modified: myspaceCrawler/trunk/scraping.py
===================================================================
--- myspaceCrawler/trunk/scraping.py	2009-01-15 20:06:13 UTC (rev 311)
+++ myspaceCrawler/trunk/scraping.py	2009-01-16 17:24:24 UTC (rev 312)
@@ -35,7 +35,7 @@
 	logging.debug("Found identifier : "+identifier)
 	return identifier;
 	
-def scrapePageWhile(page, patterns, termChar):
+def scrapePageWhile(page, pattern, termChar):
 	"""Scrape the page given for each pattern and return a list with each identifier occurring after the
 	  last pattern (which is assumed to be terminated by termChar)"""
 	
@@ -44,8 +44,8 @@
 	idx_end = len(page)
 	identifiers = []
 	itsFound = 1
+	logging.debug("pattern : "+ pattern)
 	while itsFound:
-		pattern = patterns
 		idx = page.find(pattern, idx)
 		#logging.debug("idx = "+str(idx))
 		if (idx > idx_end): # Couldn't find this pattern before re-occurrence of last pattern
@@ -59,7 +59,7 @@
 		#logging.debug("idx_end = "+str(idx_end))
 	
 		if idx != -1:
-			idx += len(patterns)
+			idx += len(pattern)
 		# idx should now point to the start of the identifier we want
 		id_end = page.find(termChar, idx)
 		identifier = unicode(page[idx:id_end], 'utf8')
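
For readers following the scraping.py change, the idea behind scrapePageWhile (collect every identifier that follows a pattern and ends at a terminator) can be sketched in Python 3 as below. This is a simplified stand-in, not the project's function: the name `scrape_all` is hypothetical, it omits the original's idx_end bookkeeping and unicode decoding, and the sample string is made up.

```python
def scrape_all(page, pattern, terminator):
    """Collect every substring that follows `pattern` up to `terminator`."""
    found = []
    idx = page.find(pattern)
    while idx != -1:
        start = idx + len(pattern)   # first character of the identifier
        end = page.find(terminator, start)
        if end == -1:
            break                    # unterminated match: stop rather than guess
        found.append(page[start:end])
        idx = page.find(pattern, end)
    return found


# Made-up fragment imitating friend links on a profile page.
sample = 'friendID=101" x friendID=202" y friendID=303"'
friend_ids = scrape_all(sample, 'friendID=', '"')
```

Taking a single `pattern` string, as the renamed parameter in r312 does, avoids the earlier confusion where a plural `patterns` argument was treated as one string.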



[Mypyspace-developer] test

From: Benjamin F. <b.f...@go...> - 2009-01-16 00:07:32

Sorry if anyone else is actually here; this is just a test.



Benjamin Fields
PhD Candidate
Dept. of Computing
Goldsmiths College, University of London
b.f...@go...
mobile: +44 (0)796 106 1568
"Which is more musical: a truck passing by a factory or a truck  
passing by a music school?" --John Cage

[Mypyspace-developer] branch merged

From: Kurt J <ku...@gm...> - 2007-09-21 12:12:54

ok i committed the trunk/adding-rdf-branch merge in adding-rdf-branch.  seems
to be working...

[Mypyspace-developer] mergin woes

From: Kurt J <ku...@gm...> - 2007-09-21 10:39:33

Hey dude,

The merge has been a bit rocky.  SVN seemed to just leave some things out.
not sure, my svn skills kinda suck.  so i'm going by hand and checking trunk
v mine.

one question - do you hate logging?  cuz i think it's pretty cool.  but it
does fuck up on py 2.3, but on 2.4+ it's the bee's knees

also, i've done some splitting of things into more files.  so this might
make more merging woes in the future...  but it seemed like the thing to do.

-kurt j

[Mypyspace-developer] id3 tag branch

From: Benjamin F. <ma...@go...> - 2007-09-11 14:46:11

So I have created a branch for the implementation of id3 tag writing  
for the files that are downloaded.  It can be found at the following  
svn path:

/mypyspace/branches/adding-ID3-branch

Also, is anyone besides me on this list?


Ben Fields
PhD Student
Dept. of Computing
Goldsmiths College, University of London
e: ma...@go...
p: +44 (0) 20 7078 5170
