Document and content ranking in docsum

In past releases, docsum built a summary entirely from SQL queries: select all content, locate the phrase given by the user in it, and present the matching content alongside the other matches. The problem with this selection is that the database will likely return matching content in the same order each time. Because of page length limitations, content that is selected later than earlier matches is left out of the summary, even though that later content might be more relevant and more valuable to the user requesting it.

Fixing this problem is complicated because the content passed into the repository is unstructured and has no relational structure. So it is up to me, the developer, to determine which piece of content should be ranked ahead of another in a limited summary.

The first steps of this solution will be provided in the next release of docsum. The importance of content will be determined by how frequently it also appears in other content. This rests on the basic assumption that related content will share various topics and phrases. The more content a document shares with other documents, the higher its importance ranking. The hope is that the most relevant content for the user query will then be presented based on inter-document commonality, and not just on a generalized match in the database select statement.
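To make the idea concrete, here is a rough sketch of ranking by shared content. It is not the code in ContentExaminer.py; the names tokenize and rank_by_commonality are made up for illustration, and it assumes that "shared content" can be approximated by counting the distinct words two documents have in common. It also assumes the documents passed in have already been matched against the user query.

```python
import re
from collections import Counter


def tokenize(text):
    """Return the set of distinct lowercase words in a piece of content."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_by_commonality(documents):
    """Order document ids from most to least vocabulary shared with the others.

    `documents` maps a document id to its text. The score of a document is
    the number of distinct words it shares with each other document, summed
    over all other documents (an assumed stand-in for "importance").
    """
    terms = {doc_id: tokenize(text) for doc_id, text in documents.items()}
    scores = Counter()
    for doc_id, doc_terms in terms.items():
        for other_id, other_terms in terms.items():
            if other_id != doc_id:
                scores[doc_id] += len(doc_terms & other_terms)
    # Highest score first; ties are broken by document id so that the
    # order is stable from one request to the next.
    return sorted(documents, key=lambda doc_id: (-scores[doc_id], doc_id))
```

With a ranking like this, the summary would take matches from the front of the returned list until the page limit is reached, instead of taking them in whatever order the database returns them.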

The current progress of this work can be seen in the SVN repository of the code as ContentExaminer.py.

Posted by Terrence Pietrondi 2007-04-09