Thanks for what is also a sensible recommendation.

On Aug 29, 2008, at 10:01 AM, Thomas A McGee wrote:

I missed the chat the other day, so some of this may have been covered and dismissed already.

Tomcat has the capacity to output Apache-style "combined" log files for all requests, including bitstreams. There's a whole host of commercial, shareware and freeware products out there designed to slice-and-dice these Apache log files and pull out all the kinds of reports people seem to be talking about here.

The programs range from the very simple, like Analog, to the extremely complex and expensive, like WebTrends Enterprise. They can be configured to download the log files automatically and run reports on a schedule, so that they're there when you come in in the morning. They can incorporate various filters, resolve user IP addresses, analyze request URL paths (which can be translated into collection and community names), referers, logged-in users, user agents, etc. etc.
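
As a rough illustration of what these tools start from, here is a minimal sketch of pulling the fields out of one "combined" format line. The sample line, the regular expression and the names `COMBINED`, `parse_line` and `SAMPLE` are assumptions made up for this example; they are not taken from any of the products mentioned.

```python
import re

# Apache "combined" format:
# host ident user [time] "request" status bytes "referer" "user-agent"
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one combined-format line, or None if it does not match."""
    match = COMBINED.match(line)
    return match.groupdict() if match else None


# Hypothetical sample line, invented for illustration only.
SAMPLE = ('192.0.2.10 - - [29/Aug/2008:10:01:00 -0400] '
          '"GET /1721.1/41920 HTTP/1.1" 200 5120 '
          '"http://example.org/search" "Mozilla/5.0"')

record = parse_line(SAMPLE)
print(record["path"], record["status"], record["referer"])
```

Everything a log analyzer reports is derived from fields like these, so the request path is the only handle it has on which resource was asked for.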

Rather than reinvent the wheel (and this is an extremely complex wheel), I think for most users it would pay to look at this approach unless there is something really esoteric about your traffic that you are trying to get at.

It's an inherent issue in the "address space" of DSpace resources made available in the web application. For instance, I may have the following Community, Collection and Item:

Computer Science and Artificial Intelligence Lab (CSAIL)

CSAIL Technical Reports (July 1, 2003 - present)

Adaptive Envelope MDPs for Relational Equivalence-based Planning

From the perspective of the Apache/Tomcat logs, requests are made to these resources, but based on those logs alone it's quite difficult to ascertain that there is a hierarchy here:

/1721.1/5458 <-- Community
      /1721.1/29807 <-- Collection
              /1721.1/41920 <-- Item

The challenge is that most logging packages, because the above structure is absent from the path of the resource, cannot roll up the statistics to represent the aggregations at the collection and item level that managers want to see for a DSpace Community/Collection.
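
To make the roll-up concrete, here is a minimal sketch, assuming the application can supply the parent links that the flat paths do not carry. The handles are the ones from the example above; the `PARENT` table, the `roll_up` function and the hit counts are hypothetical, made up for illustration.

```python
from collections import Counter

# Parent links as the application knows them; the access log alone
# does not contain this information.
PARENT = {
    "1721.1/41920": "1721.1/29807",  # Item -> Collection
    "1721.1/29807": "1721.1/5458",   # Collection -> Community
    "1721.1/5458": None,             # top-level Community
}


def roll_up(hits, parent):
    """Credit each hit to the resource itself and to every ancestor above it."""
    totals = Counter()
    for handle in hits:
        node = handle
        while node is not None:
            totals[node] += 1
            node = parent.get(node)
    return totals


# Hypothetical traffic: three item views and one collection page view.
hits = ["1721.1/41920", "1721.1/41920", "1721.1/41920", "1721.1/29807"]
totals = roll_up(hits, PARENT)
for handle, count in totals.items():
    print(handle, count)
```

An external log analyzer would need that same table exported and kept in sync before it could produce the collection and community totals.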

Likewise, we are in a situation where we are trying to maintain

1.) Not introducing a rigid expectation that the "paths" by which resources are represented cannot change over time as DSpace evolves.
2.) That we may have more than one path by which a resource is accessed, and may want to treat those accesses either as "the same" or as "uniquely different" statistically.
3.) That we want to allow hooks so that these stats can be collected off the "logical event" in DSpace rather than the "physical event" in the application server.
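
The three points above can be sketched roughly as follows. This is only an illustration under assumptions: `UsageEvent`, `UsageEventBus` and the second URL shape (`/item/41920`) are hypothetical names invented here, not an existing DSpace API.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageEvent:
    """A logical event: what was done to which resource, independent of the URL used."""
    action: str  # e.g. "view"
    handle: str  # identifier of the resource
    path: str    # physical request path, kept only as context


class UsageEventBus:
    """Hypothetical hook point: the application fires events, listeners record them."""

    def __init__(self):
        self._listeners = []

    def subscribe(self, listener):
        self._listeners.append(listener)

    def fire(self, event):
        for listener in self._listeners:
            listener(event)


by_resource = Counter()  # treats all paths to a resource as "the same"
by_path = Counter()      # treats each path as "uniquely different"

bus = UsageEventBus()
bus.subscribe(lambda e: by_resource.update([(e.action, e.handle)]))
bus.subscribe(lambda e: by_path.update([(e.action, e.handle, e.path)]))

# Two hypothetical URL shapes that reach the same item.
bus.fire(UsageEvent("view", "1721.1/41920", "/1721.1/41920"))
bus.fire(UsageEvent("view", "1721.1/41920", "/item/41920"))

print(by_resource)
print(by_path)
```

Because the listeners key on the resource rather than the URL, a change in the path layout alters only the context field and leaves the counts comparable over time.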

By configuring a stats solution like Analog/AWStats/WebTrends, we are restricted to gathering statistics only about the physical event of requesting that address in the web service. Likewise, if the address representing that resource changes in the UI (via either development decisions or administrative decisions), then the configuration of that external software will be out of sync and will need to be adjusted. By having the application report "logical events" we can step away from this issue. By internalizing the statistics gathering and generation, we have an opportunity to create a solution that allows DSpace to evolve freely and that meets the requirements requested by the community (or more explicitly, exhibited by the Minho addon).


Mark R. Diggory - DSpace Developer and Systems Manager
MIT Libraries, Systems and Technology Services
Massachusetts Institute of Technology