From: Humufr <hu...@ya...> - 2005-02-11 16:45:54
Attachments:
fig_bench.png
Hi,

I made some changes (again) to the load function to improve its speed when you load a big data file but only want to use some of its columns. I ran all my tests with a file of 9722 lines and 16 columns. The benchmark script follows below.

I think the benchmark results are interesting. If you want to use 2 of the 16 columns, the results are:

load matplotlib                            0.58
load with columns choice                   0.27
normal load inside the new load version    0.58

We gain a factor of two. I know that this depends entirely on the number of columns, and that the change is not worthwhile, and even decreases efficiency, if you want to use all the data in your file. But since the columns argument is optional, I don't think this point is crucial. I have attached a figure showing the effect as you go from one column to all the columns.

The load function follows after the benchmark script.

Regards,
Nicolas

-----------------------------------------------
#!/usr/bin/env python
# -*- coding: utf-8 -*-

from time import clock

t3 = clock()
import load_2
Y=load_2.load('data')
x=Y[:,0]
y=Y[:,1]
t4 = clock()
#print t4-t3
#print x,y

col = [0,6]
t1 = clock()
import load_matplotlib
X=load_matplotlib.load('data')
#X = [X[:,i] for i in col]
x=X[:,0]
y=X[:,1]
t2 = clock()
print 'load matplotlib', t2-t1
#print X

t3 = clock()
import load_2
X=load_2.load('data',columns=range(14))
x=Y[:,0]
y=Y[:,1]
t4 = clock()
print 'load with columns choice', t4-t3

t3 = clock()
import load_2
Y=load_2.load('data')
x=Y[:,0]
y=Y[:,1]
t4 = clock()
normal = t4-t3
print 'normal load ', normal

time = []
for i in range(16):
    t3 = clock()
    import load_2
    X=load_2.load('data',columns=range(i))
    x=Y[:,0]
    y=Y[:,1]
    t4 = clock()
    #print 'load with columns choice', t4-t3
    time.append(t4-t3)

from pylab import *
time = array(time)/normal
plot(range(16),time)
xlabel('N columns (total = 16)')
ylabel('time columns /normal time')
show()

------------------------------------------------------------------
def load(fname,comments='%',columns=None):
    """
    Load ASCII data from fname into an array and return the array.

    The data must be regular, same number of values in every row

    fname can be a filename or a file handle.

    A character to delimit comments can be given (optional);
    the default is the matlab character '%'.

    A second optional argument can be added to tell which columns you
    want to use in the file. This argument is a list containing the
    column numbers, beginning at 0 (python style).

    matfile data is not currently supported, but see
    Nigel Wade's matfile ftp://ion.le.ac.uk/matfile/matfile.tar.gz

    Example usage:

    X = load('test.dat')  # data in two columns
    t = X[:,0]
    y = X[:,1]

    Alternatively, you can do

    t,y = transpose(load('test.dat')) # for two column data

    X = load('test.dat',columns=[0,2]) # data in two columns (columns 1 and 3 of the file are used)

    X = load('test.dat')  # a matrix of data

    X = load('test.dat',columns=[2,3])  # a matrix of data, only columns 3 and 4 will be used

    x = load('test.dat')  # a single column of data

    x = load('test.dat','#')  # the character used as comment delimiter is '#'
    """

    # from numarray import array
    fh = file(fname)

    X = []
    numCols = None
    if columns is None:
        for line in fh:
            line = line[:line.find(comments)].strip()
            if not len(line): continue
            row = [float(val) for val in line.split()]
            thisLen = len(row)
            if numCols is not None and thisLen != numCols:
                raise ValueError('All rows must have the same number of columns')
            X.append(row)
    else:
        for line in fh:
            line = line[:line.find(comments)].strip()
            if not len(line): continue
            row = [val for val in line.split()]
            row = [float(row[i]) for i in columns]
            thisLen = len(row)
            if numCols is not None and thisLen != numCols:
                raise ValueError('All rows must have the same number of columns')
            X.append(row)

    X = array(X)
    r,c = X.shape
    if r==1 or c==1:
        X.shape = max([r,c]),
    return X
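For readers on a current Python, the column-selection idea described above can be sketched as follows. This is only an illustration under stated assumptions: load_columns is a hypothetical name, not matplotlib's load(); it takes an iterable of text lines and returns plain lists rather than an array. Unlike the posted function, it splits on the comment character instead of slicing with find(), and it records the first row's length so the consistency check can actually trigger.

```python
import io


def load_columns(lines, comments='%', columns=None):
    """Parse whitespace-separated numeric rows from an iterable of lines.

    Hypothetical helper for illustration; it is not matplotlib's load().
    """
    rows = []
    num_cols = None
    for line in lines:
        # Drop everything after the comment character, then trim whitespace.
        line = line.split(comments, 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if columns is None:
            # Default behaviour: convert every field to float.
            row = [float(val) for val in fields]
        else:
            # Convert only the requested columns to float.
            row = [float(fields[i]) for i in columns]
        if num_cols is None:
            num_cols = len(row)
        elif len(row) != num_cols:
            raise ValueError('All rows must have the same number of columns')
        rows.append(row)
    return rows


sample = io.StringIO("% header\n1 2 3\n4 5 6 % trailing comment\n\n7 8 9\n")
print(load_columns(sample, columns=[0, 2]))
```

With columns=[0, 2] only the first and third field of each row are converted, which is where any saving would come from.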
From: John H. <jdh...@ac...> - 2005-02-13 02:01:23
>>>>> "Humufr" == Humufr <hu...@ya...> writes:

    Humufr> Hi, I made some changes (again) to the load function
    Humufr> to improve its speed when you load a big data file but
    Humufr> only want to use some of its columns. I ran all my
    Humufr> tests with a file of 9722 lines and 16 columns. The
    Humufr> benchmark script follows below. I think the benchmark
    Humufr> results are interesting:

    Humufr> If you want to use 2 of the 16 columns, the results are:

    Humufr> load matplotlib 0.58
    Humufr> load with columns choice 0.27
    Humufr> normal load inside the new load version 0.58

    Humufr> We gain a factor of two. I know that this depends
    Humufr> entirely on the number of columns, and that the change
    Humufr> is not worthwhile, and even decreases efficiency, if
    Humufr> you want to use all the data in your file. But since
    Humufr> the columns argument is optional, I don't think this
    Humufr> point is crucial. I have attached a figure showing the
    Humufr> effect as you go from one column to all the columns.

    Humufr> The load function follows after the benchmark script.

Either there was an error in your cut and paste, or the reason your new load function is faster is that it does nothing. Note the indentation in the excerpt below.

The second time you do "for line in fh" you clearly intend to be handling the columns case, but it is inside the "if columns is None" block. It looks like the reason the columns version of load is faster is because it's not doing anything...
JDH

    if columns is None:
        for line in fh:
            line = line[:line.find(comments)].strip()
            if not len(line): continue
            row = [float(val) for val in line.split()]
            thisLen = len(row)
            if numCols is not None and thisLen != numCols:
                raise ValueError('All rows must have the same number of columns')
            X.append(row)
        else:
            for line in fh:
                line = line[:line.find(comments)].strip()
                if not len(line): continue
                row = [val for val in line.split()]
                row = [float(row[i]) for i in columns]
                thisLen = len(row)
                if numCols is not None and thisLen != numCols:
                    raise ValueError('All rows must have the same number
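The indentation problem described in this reply can be illustrated with a small sketch. It assumes the else ended up aligned with the for loop, which makes it a for/else clause nested in the "if columns is None" block. Both function names are hypothetical, and the parsing is reduced to a bare split() to keep the example short.

```python
import io


def parse_misindented(fh, columns=None):
    rows = []
    if columns is None:
        for line in fh:
            rows.append(line.split())
        else:
            # This else pairs with the for above (a for/else clause), so it
            # is only reached when columns is None, after fh is exhausted.
            for line in fh:
                rows.append([line.split()[i] for i in columns])
    return rows


def parse_intended(fh, columns=None):
    rows = []
    if columns is None:
        for line in fh:
            rows.append(line.split())
    else:
        # Here the else pairs with the if, so this branch handles columns.
        for line in fh:
            fields = line.split()
            rows.append([fields[i] for i in columns])
    return rows


text = "1 2 3\n4 5 6\n"
print(parse_misindented(io.StringIO(text), columns=[0, 2]))  # nothing is parsed
print(parse_intended(io.StringIO(text), columns=[0, 2]))
```

In the mis-indented variant a call with columns set skips the whole block and returns an empty list, so it would look fast only because no parsing happens.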
From: Humufr <hu...@ya...> - 2005-02-14 01:09:29
>It looks like the reason the columns version of load is faster is
>because it's not doing anything...

That's not exactly true. I agree that the change is not big, but the difference comes from these two lines:

#row = [val for val in line.split()]  # no conversion to float for all values
row = line.split()  # the loop is not needed, so forget the previous line
row = [float(row[i]) for i in columns]  # convert only the chosen columns to float

and there is in fact an if condition: the first branch keeps exactly the same behavior as your function. The second branch does not convert every element to float, only the chosen columns, and this change explains the difference...

Regards,
Nicolas
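The claim in this reply, that converting only the chosen fields is cheaper than converting all of them, can be checked on a single line with timeit. The sketch below is an assumption-laden micro-benchmark, not the benchmark from the thread: the 16-column row is synthetic, the helper names are hypothetical, and the timings it prints will vary from machine to machine.

```python
import timeit

# One synthetic 16-column row, standing in for a line of the data file.
line = " ".join(str(i * 0.5) for i in range(16))
columns = [0, 6]


def convert_all(line, columns):
    # Convert every field to float, then pick the wanted columns.
    row = [float(val) for val in line.split()]
    return [row[i] for i in columns]


def convert_selected(line, columns):
    # Split once, then convert only the wanted columns.
    fields = line.split()
    return [float(fields[i]) for i in columns]


# Both approaches must give the same values.
assert convert_all(line, columns) == convert_selected(line, columns)

t_all = timeit.timeit(lambda: convert_all(line, columns), number=20000)
t_sel = timeit.timeit(lambda: convert_selected(line, columns), number=20000)
print('convert all columns   ', t_all)
print('convert chosen columns', t_sel)
```

The split() call is paid in both cases, so any difference comes only from the number of float() conversions.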