Isidro Aguillo from the Cybermetrics lab was kind enough to reply to the Normalization question.
The basic principle is that all measurements for a given metric are normalized against the maximum value of that metric across all repositories.

In the simple example of the size metric, normalization would happen as follows:

According to the July ranking, CERN's repository (http://cdsweb.cern.ch) is the largest in size, with 2,590,000 pages indexed.
With 253,000 indexed pages for K.U. Leuven's Lirias, the normalized figure (253,000 / 2,590,000) is 0.0977, or 9.77%.
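
In code, the size normalization is simply this (a minimal sketch of the arithmetic above; the variable names are my own, not Webometrics'):

# Size normalization as described above (my own naming, not an official Webometrics formula)
max_size = 2_590_000   # CERN's repository, the largest in the July ranking
lirias_size = 253_000  # indexed pages for Lirias
print(round(lirias_size / max_size, 4))  # 0.0977, roughly 9.8%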

For the Google Scholar metric it's a little more complicated, because the average of two normalized totals is taken. To elaborate on the example:

site:lirias.kuleuven.be query in Google Scholar (all results): 21,400
site:lirias.kuleuven.be query in Google Scholar (only results from 2001-2008): 658

Imagine digital.csic.es has the maximum among all the world's repositories, with 42,800 (all results); Lirias is then 0.5, or 50%.
The maximum for recent results (2001-2008) is repository.usp.br with 6,580; Lirias is then 0.1, or 10%.

The final Scholar value for Lirias would then be (50% + 10%) / 2 = 30%, or 0.3 (ranking 145th, for example).
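
Written out as a minimal sketch (using the figures above; the maxima are the imagined values from the example, and the naming is my own):

# Scholar value as described above: the mean of two normalized counts
all_results = 21_400    # site:lirias.kuleuven.be, all results
recent_results = 658    # site:lirias.kuleuven.be, 2001-2008 only
max_all = 42_800        # imagined maximum for all results (digital.csic.es)
max_recent = 6_580      # imagined maximum for recent results (repository.usp.br)
scholar = (all_results / max_all + recent_results / max_recent) / 2
print(scholar)  # (0.5 + 0.1) / 2 = 0.3, i.e. 30%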

with kindest regards,

Bram Luyten

@mire - http://www.atmire.com

On Mon, Dec 13, 2010 at 11:12 AM, Bram Luyten <bram@mire.be> wrote:
Hi David,

JIRA does not allow anonymous interaction, so I'm afraid you'll have to take a minute to register an account. After you're logged in, it's really easy: a "Comment" button appears on the top left:

Small demo:
http://screencast.com/t/vygNWXdT

About the methodology & the indicated points:

Different results based on the search engine localization


I didn't realize this, but even for something like the Size index, different localized Google pages do give different results:
site:hub.hku.hk on Google.com -> 726,000
site:hub.hku.hk on Google.es -> 729,000
site:hub.hku.hk on Google.hk -> 725,000

So this must indicate that each localized Google page uses a different index. And since Baidu is the largest search engine in Asia, the fact that Baidu coverage is not included might disadvantage Asian institutions in the ranking.

Normalization

I only know about normalization in the case of the Scholar metric, as described on the methodology page:

Scholar (Sc). Using Google Scholar database we calculate the mean of the normalised total number of papers and those (recent papers) published between 2001 and 2008.

I'm unsure as well what "normalised" means in this context. It would be great if anyone could enlighten us.


best regards,

Bram

@mire - http://www.atmire.com

Technologielaan 9 - 3001 Heverlee - Belgium
533 2nd Street - Encinitas, CA 92024 - USA

http://www.togather.eu - Before getting together, get Tog@ther


On Mon, Dec 13, 2010 at 8:13 AM, David Palmer <dtpalmer@hku.hk> wrote:

Thanks Bram,

 

Yes, I would support harvestable usage stats.  I did not see how to add my support on the page you gave?

 

Webometrics.  I see I must be more specific.  I have followed the papers written in the Webometrics project, for both universities and repositories.  I tried to reproduce the results on a few sites.  I could not.  The methodology is not specific enough in some cases.  In others, I wonder whether the search engines give different results in Spain as opposed to Hong Kong.  In some cases, I know this is true.  Also, I remember that part of the methodology was that certain results in certain cases were “normalized,” but nothing was written to explain which specific results were normalized.

 

Well, you might just conclude, like others have done, that I am dumb.  Hmm, that is a possibility.  Better vitamins?  On the other hand, The Journal of Irreproducible Results comes to mind:

        http://www.jir.com/

 

Serious types could stop reading here, but apropos of nothing, my favourite irreproducible result is “the buttered cat paradox”, which goes like this: buttered toast will always fall face down on the ground.  Cats will always land on their feet.  So if you strap a piece of buttered toast to the back of a cat and hoist it out the window, you should see antigravity appear.

        http://www.butteredcat.com/index.php?module=pagemaster&PAGE_user_op=view_page&PAGE_id=2&MMN_position=30:30

 

david

 

 

From: bluyten@gmail.com [mailto:bluyten@gmail.com] On Behalf Of Bram Luyten
Sent: Saturday, December 11, 2010 9:09 PM
To: David Palmer
Cc: dspace-general@lists.sourceforge.net
Subject: Re: [Dspace-general] webometrics

 

Without a full answer to your question (apologies in advance), here's one consideration:
the repository ranking only measures exposure through search engines. The data is gathered by launching certain queries in Google, Yahoo, ...

The reason they chose such a generic approach is that it works independently of the platform. It doesn't matter which platform you run: as long as you have a URL (or subdomain), your repository (or website, for that matter) can be measured. (And they do; similar metrics are used to measure the exposure of university websites: http://www.webometrics.info/ .)

In my opinion, USAGE of repositories would be a much more valuable metric. Sure, it's good to have thousands of pages indexed, but are people actively downloading the files that are hosted there?

With the SOLR statistics work in 1.6, and now that institutions have been using it for a considerable amount of time, we would have the "common ground" needed to compare usage statistics.

I have proposed an automated OAI interface to enable harvesting of your usage data, based on an internationally supported standard:

https://jira.duraspace.org/browse/DS-626 (if you think this is important, please voice your support in this request ;)

If this could make it into DSpace, I see no reason why usage data couldn't be included in the ranking (at least for DSpace repositories).
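
To make the idea concrete, here is a purely hypothetical sketch of what harvesting usage data over OAI-PMH could look like. DS-626 is only a proposal, so the endpoint URL and the "usage_stats" metadataPrefix below are made-up placeholders, not an existing DSpace interface:

import requests
import xml.etree.ElementTree as ET

# Hypothetical usage-statistics OAI endpoint (placeholder, does not exist)
BASE_URL = "https://repository.example.org/oai/usage"

# Standard OAI-PMH ListRecords request; "usage_stats" is an imagined prefix
resp = requests.get(BASE_URL, params={"verb": "ListRecords",
                                      "metadataPrefix": "usage_stats"})
resp.raise_for_status()

# OAI-PMH responses are XML; count the harvested <record> elements
ns = {"oai": "http://www.openarchives.org/OAI/2.0/"}
records = ET.fromstring(resp.content).findall(".//oai:record", ns)
print(f"Harvested {len(records)} usage records")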

Somewhat related: Annual repository cost per file vs cost per download


From a financial management perspective, you could calculate the annual cost of a repository as a cost per file ... let's say you have 1,000 files, and your internal staff time & some consultancy cost you $5,000 per year (just example figures, not a real case); that would be a rather high cost of $5 per file. However, if you knew that the number of downloads is 50,000 (so 50 downloads per file on average), you could do cost accounting per download. That would be $0.10 per download.
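
A quick back-of-the-envelope sketch with those example figures (illustrative only, no real data):

# Toy cost-accounting example using the figures above
annual_cost = 5000.0   # staff time + consultancy per year (example figure)
num_files = 1000
num_downloads = 50_000

print(annual_cost / num_files)      # 5.0  -> $5 per file
print(annual_cost / num_downloads)  # 0.1  -> $0.10 per download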

best regards,

Bram

@mire - http://www.atmire.com

Technologielaan 9 - 3001 Heverlee - Belgium
533 2nd Street - Encinitas, CA 92024 - USA

http://www.togather.eu - Before getting together, get Tog@ther

On Fri, Dec 10, 2010 at 5:03 PM, David Palmer <dtpalmer@hku.hk> wrote:


I remain intrigued by the idea of metrics for IRs.  I have read the papers
on webometrics, and found questions.  I have asked and have not been
answered.

Will we as a community accept this ranking without any input into its
formulation?  Or even without proper understanding of the methodology?

David Palmer
Scholarly Communications Team Leader
The University of Hong Kong Libraries
Pokfulam Road
Hong Kong
tel. +852 2859 7004
http://hub.hku.hk





