Re: [VuFind-Tech] Optimization for alphabetic browse

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 454-5900

Thanks, Tod, that makes sense to me.

As far as maintaining backward compatibility goes, if we want to get this into VuFind 4.1, it would be nice if we could implement in a way that doesn’t break the interface. If we’re okay with waiting until 5.0 next year, I’d feel more comfortable with breaking changes.

One thought is that it might make sense to code this in such a way that the deep code that’s processing everything only expects one extras list… then higher-level code could process the retrieveBibId setting and use that to add the “id” field to the extras list as needed. This would then give us a place to put a deprecation warning, and later to simply remove obsolete functionality, without having to go deeper into the logic. Of course, I’m making this suggestion without looking at the code, so it may be completely inappropriate to the reality of the situation. ;-)

- Demian

From: Tod Olson [mailto:to...@uc...]
Sent: Wednesday, July 26, 2017 4:48 PM
To: Demian Katz
Cc: Tod Olson; Mark Triggs; vufind-tech
Subject: Re: Optimization for alphabetic browse

I like that two setting idea, and have worked that into my code changes. I hope to have a pull request later today or tomorrow to show how I'm approaching it. And have others chime in on where to be more defensive in the programming.

The idea that id just become another field is also good. That would reduce backward compatibility, but clean up some code. To implement that, I would change some property names, and create a new method to replace matchingIDs, just to make the break with back-comparability more clear. (of course, if we want to toggle between old and new behavior that likely would be another property.)

-Tod

On Jul 26, 2017, at 7:38 AM, Demian Katz <dem...@vi...<mailto:dem...@vi...>> wrote:

Sometimes, if I double-tap enter too quickly, it seems to send my emails prematurely. Stupid Outlook. Anyway, it appears that the logic for linking to headings assumes that IDs will be present when count <= 5, so simply omitting them from the response will not work without code changes to VuFind.

As far as how configuration goes, if we’re interested in backward compatibility, perhaps we should actually add two settings:

retrieveBibId = true/false (default = true)
maxBibListSize = as you described (default = -1)

Essentially, retrieveBibId controls whether or not to fetch bib IDs while retrieving the extras list, and maxBibListSize controls how many records to examine when we need IDs and/or extras. We might eventually consider deprecating retrieveBibId and making “id” just another element that can be included in the extras list, but it seems simpler to start here.

As far as versioning goes, I certainly wouldn’t object to doing a better job of that. Right now the closest thing we have to versioning is the fact that I add a tag to the GitHub repo every time there is a VuFind release showing which version of the browse handler is included in each VuFind release… but having an internal, separate browse handler versioning mechanism wouldn’t be a bad thing, and would enable us to use semantic versioning for backward breaks, etc.

Hopefully that covers everything; if I missed anything I should have addressed, feel free to ask again. ☺

- Demian

From: Demian Katz
Sent: Wednesday, July 26, 2017 8:31 AM
To: 'Tod Olson'
Cc: Mark Triggs; vufind-tech
Subject: RE: Optimization for alphabetic browse

Tod,

To answer your first question, the logic for linking to AlphaBrowse results is found here:

https://github.com/vufind-org/vufind/blob/master/module/VuFind/src/VuFind/View/Helper/Root/AlphaBrowse.php#L67<https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fvufind-org%2Fvufind%2Fblob%2Fmaster%2Fmodule%2FVuFind%2Fsrc%2FVuFind%2FView%2FHelper%2FRoot%2FAlphaBrowse.php%23L67&data=02%7C01%7Cdemian.katz%40villanova.edu%7Ca99b183e71484e3aab6608d4d467a88d%7C765a8de5cf9444f09cafae5bf8cfa366%7C0%7C0%7C636366989050761061&sdata=8lj9eGEo7xZ%2BOnwU8NuX5fGxN9PvDg%2BvxhbWXSHm8UM%3D&reserved=0>

From: Tod Olson [mailto:to...@uc...]
Sent: Tuesday, July 25, 2017 4:30 PM
To: Demian Katz
Cc: Tod Olson; Mark Triggs; vufind-tech
Subject: Re: Optimization for alphabetic browse

If we want to preserve the search-by-exact-ID thing, there must be a way to set a threshold. The easiest way I can think of is to put it in solrconfig.xml. We'll come back to this in a bit. I want to start with the UI and work back to the browse-handler.

There are two things we want from the BibDB::matchingIDs() method:

(1) bib Ids for exact-match searching from the alphabetical browse results list, and
(2) extra field display info for the alphabetical browse results list.

These are two very different functions, but must be intertwined for fairly practical reasons, and that creates some complication.

For (1), the exact-match searching, the UI has some threshold on the number of exact bib Ids it will use in a browse list entry. The browse-handler response for each list item includes a list of bib ids and a count of matching records. Somehow the UI has a threshold for how many ids to use in constructing the search link from the list. Let's say it's 5: if there are 5 ids or fewer, the search by bib id, otherwise search by heading. Where is that threshold set? And what is the behavior if the ids list is empty, but with a non-zero count? Is it safe to return an empty list, or do we always need a minimal number for the UI to work correctly?

For (2), the extra fields for list entries, if there is a request for any extra fields in a call to BibDB::matchingIDs(), then setting a threshold may risk not returning some values, unless we ignore the threshold when extra fields are requested. But I think that sort of exception is confusing.

So I think we want a mechanism to set the maximum number of records/bib Ids to retrieve, regardless of hit count on the entry or whether we are asking for extra fields.

How to implement such a threshold?

The existing requestHanders stanza for "/browse" already contain per-index configuration: DBpath, field (for exact-match search), and normalizer. We could add another setting, maxBibIds or maxBibListSize, something like that. Semantics could be:

-1: get all of the bibIds (maybe good for title browse)
0: get no bibIds and don't bother checking matching Ids or extras at all (probably good for subject browse)
N: List the first N bibIds that are returned (for sites or indexes wanting to hedge their bets)

BrowseRequestHandler then stores the threshold for each browse index in a new field in the BrowseSource object for each browse index.
The the threshold is somehow passed in to Browse::populateItem, but I've not gotten that far yet.

So that's the outline in my head. Sound reasonable so far, or too complicated?

Side question: do we need to add versioning to the browse-handler? BrowseRequestHandler::getVersion returns a fixed string, maybe we should do something about that, but I don't know what.

-Tod

On Jul 13, 2017, at 8:54 AM, Tod Olson <to...@uc...<mailto:to...@uc...>> wrote:

It is also the case that BibDB::matchingIDs allows an "extras" parameter which names extra Solr fields to return for use in the browse listing. We use this in our Title Browse to return author, format, and year of publication. When there is a title with multiple formats or whatever, you see that. See a title browse for "hobbit"<https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcatalog.lib.uchicago.edu%2Fvufind%2Falphabrowse%2Fhome%3Ffrom%3Dhobbit%26source%3Dtitle&data=02%7C01%7Cdemian.katz%40villanova.edu%7C7de2733cabc1430454c608d4d39bf003%7C765a8de5cf9444f09cafae5bf8cfa366%7C0%7C0%7C636366114081132004&sdata=JH3VXNndo2crSOMORLuyRJq3aenlBHM24bcnCZW%2FylI%3D&reserved=0> in our VuFind installation. Useful in Title browse and call number browse, but we do not use the extras in the authority browses. So must continue to support that.

So yes, probably best to devise a general solution. *think, think, think*

-Tod

On Jul 13, 2017, at 7:22 AM, Demian Katz <dem...@vi...<mailto:dem...@vi...>> wrote:

It’s possible that the search-by-exact-ID thing is a hold-over from earlier days of VuFind when search query construction was less consistent and reliable, or was simply a sign of lack of confidence in Solr. It may also help address edge cases where a record has changed in Solr but the browse index has not yet been updated – i.e. if a heading is removed or changed on a record but the record is not deleted.

In general, though, I’m inclined to agree with Mark that the whole thing may no longer be necessary… and indeed, if there are problems, having two different ways of resolving browse headings just creates more cases that need to be tested, potentially complicating troubleshooting and introducing weird behaviors. So it may make the most sense to switch to always doing things one consistent way.

I’d still be inclined to make this a configurable behavior (if it can be done at little cost) simply because there might be some use cases where it would be helpful for somebody to use the browse handler to fetch IDs – but that’s probably very conservative of me, and might not really be worth the effort.

- Demian

From: Tod Olson [mailto:to...@uc...]
Sent: Thursday, July 13, 2017 7:43 AM
To: Mark Triggs
Cc: Tod Olson; Demian Katz; vufind-tech
Subject: Re: Optimization for alphabetic browse

Thanks, Mark!

You were probably thinking about the majority of cases (especially titles) where there are only one or very few matches. I'd be really interested to find cases where the heading is less reliable, I rather doubt you were making that up.

In any case, I may try to think a little more flexibly about how to address the issue.

Best,

-Tod

On Jul 13, 2017, at 6:03 AM, Mark Triggs <ms...@di...<mailto:ms...@di...>> wrote:

This may well disappoint, but I'm scratching my head as to why I ever bothered pulling out the list of IDs like that. Firing the search and getting the count should be cheap enough, but collecting all the IDs does seem like a waste of time.
I found this comment in the AlphaBrowse.php that I think I wrote:

 // Linking using bib ids is generally more reliable than doing

 // searches for headings, but headings give shorter queries and

 // don't look as strange.
But current me doesn't find that very compelling. Seems like it would be better to forget about collecting the IDs and just build the link based on the search heading. Like you say, should be much faster!
Cheers,
Mark
Tod Olson <to...@uc...<mailto:to...@uc...>> writes:
On implementing an optimization, I'm also thinking in BibDB::matchingIDs, to check if the extras parameter is empty and just immediately check BibDB:: recordCount, if the result is greater than some threshold don't even bother with pulling IDs. Not a perfect solution, but maybe okay.
And will be interested to see what Mark has to say!
Thanks,
-Tod

--

Mark Triggs

<ms...@di...<mailto:ms...@di...>>