Rhaptos Software Development

Browse System Alpha-chunking

Submitted by jccooper. on 2007-08-15 17:10. Development
The browse system's alphabetical selection for authors, etc., works okay for us, since we have enough data, but it works poorly for bare Rhaptos installs, and is already starting to be a problem for some of the more popular letters.

Currently we break all the data in alphabet-chunked second panes (like under title, author, institution, courses) by letter. Some of these only use the alphabet to load the third pane (title, courses); the others use it to reload the second pane (author, institution). Regardless, the alphabet is rendered the same, simply as an array of hardcoded values. Most of these are letters, but we also have a category for non-ASCII characters, and in one case for content that doesn't provide data in that category (a null category). This scheme, of course, is fast to render. It's also simple to use and fairly obvious, and at most font sizes/resolutions, it fits on a single line. These are good characteristics.

Problems

There are two main classes of problem here.

Under: when there's very little data, it's pretty frustrating to browse. (At least, in an aimless way. Targeted browsing is still okay.) Most of the letters are empty, and one starts to wonder if there's anything there. And it takes a lot of clicking to find anything at all in a situation with small numbers of authors and/or content. This is not a problem for cnx.org, since we have enough content that there are only a handful of empty categories (mostly at the end of the alphabet in institutions). However, for new Rhaptos sites, it's a big problem. And it will become a problem even with cnx.org during lens application (in the future.)

Over: when a letter has a lot of data, it can take up several pages in the second pane (which has no batching) and leads to long batch navigation in third panes. Parts of "Author", for instance, are getting there. Again, this isn't too bad on cnx.org, but might be in the future with content growth.

When to calculate

One of the obvious stumbling blocks to a more calculated alphabetical switcher is that we basically have to work with the whole repository to figure out the chunking, or at least do the query ahead of time for each letter. This is prohibitively expensive for something that happens constantly in the browse interface. In fact, it is probably slow enough that it's not a good candidate for ordinary caching, since every cache miss would be so slow.

Clearly we could calculate this and hard-code it for the current state of the repository, but that's not a particularly good solution, especially in the face of multiple repositories. The most promising avenue is a tool (or utility post-2.5) that can do these calculations and store them in a simple list, able to respond to the browse pane quite quickly. Updates to this facility could be manual (whenever a balance issue was noticed), or we could set a timer on it. Given how slowly the repository is likely to change, this might be on a weekly basis or longer; slow updates would probably improve usability, as the interface wouldn't keep subtly shifting. Another option would be to trigger a rearrangement once X pieces of content have been published/changed (though this would probably still be checked on a periodic timer, so that the operation isn't synchronous with publish).
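As a rough illustration of the "calculate offline, answer quickly" idea, here is a minimal sketch in modern Python. The class name AlphabetCache, its methods, and the default thresholds are all hypothetical, not part of Rhaptos; the expensive calculation is passed in as a plain function, and the clock is injectable only to make the sketch easy to exercise.

```python
import time


class AlphabetCache:
    """Holds a precomputed alphabet and recomputes it only when stale.

    Hypothetical helper: names and default thresholds are assumptions.
    """

    def __init__(self, compute, max_age_seconds=7 * 24 * 3600,
                 max_changes=100, clock=time.time):
        self._compute = compute        # expensive function returning labels
        self._max_age = max_age_seconds
        self._max_changes = max_changes
        self._clock = clock
        self._changes = 0
        self._computed_at = None
        self._labels = []

    def note_change(self):
        """Cheap call at publish time: only bumps a counter."""
        self._changes += 1

    def is_stale(self):
        if self._computed_at is None:
            return True
        too_old = self._clock() - self._computed_at >= self._max_age
        return too_old or self._changes >= self._max_changes

    def refresh_if_stale(self):
        """Meant for a periodic job, never for the browse request itself."""
        if not self.is_stale():
            return False
        self._labels = list(self._compute())
        self._computed_at = self._clock()
        self._changes = 0
        return True

    def labels(self):
        """Fast path for the browse pane: no repository access."""
        return list(self._labels)
```

The browse pane would only ever call labels(); publishing would call note_change(); a timer would call refresh_if_stale(). That keeps the expensive work off both the browse and the publish paths.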


Solution Possibilities

One of the more obvious approaches to fixing the under-population problem is to remove/disable empty letters. This leaves the UI pretty stable, and doesn't require changing anything but the alphabet switcher. It's also a straightforward calculation. It does not do anything about an over-population problem, though it would create infrastructure we could use for a future solution. One difficulty in this approach is that the publication of content in a currently-empty letter would make it unavailable in the browse interface until some calculations were made. The alphabet system could be engineered to have a cheap publish hook to deal with this, though.
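A small sketch of the empty-letter approach, including the cheap publish hook, might look like the following. The function names are made up for illustration, and it assumes names are bucketed by their first ASCII letter only (anything else is simply ignored here).

```python
import string


def first_letter(name):
    """Return the uppercase ASCII first letter, or None if it is not A-Z."""
    head = name.strip()[:1].upper()
    if len(head) == 1 and head in string.ascii_uppercase:
        return head
    return None


def letter_states(names):
    """Map each letter A-Z to True (has content) or False (render disabled)."""
    present = {first_letter(n) for n in names}
    return {letter: letter in present for letter in string.ascii_uppercase}


def on_publish(states, name):
    """Cheap publish hook: switch on the letter of newly published content."""
    letter = first_letter(name)
    if letter is not None:
        states[letter] = True
```

The hook only ever enables letters. Disabling a letter after content is removed would be left to the next full calculation.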


The other major way to deal with both data density problems is to create buckets of approximately the same size, much like we do for tag cloud sizes, but on the order of 20-30 (if there's enough data; otherwise you have fewer). This might look like:

    Repository Size   Displayed Alphabet
    1                 A-Z
    4                 A-E F-Z
    40                A B C-Dr Ds-Dz E F-H I-K L M-P Q-S T-Z
    5000              A-Ap Aq-Bd Be-Bq Br-Cm Cn-Cz D-Dr Ds-Dt Du-Ft Fu-H I J-Lu Lv-N O-Q R-Rs Rt-Ss T-Vz W-Zz
    5005              A-Ap Aq-Bd Be-Bn Bo-Cm Cn-Cz D-Dr Ds-Dt Du-Ft Fu-H I J-Lu Lv-N O-Q R-Rp Rq-Ss T-Vz W-Zz


The bucket specification might vary, but this version lends itself to a nice tight display, in the way that A-Apple Aquilae-Azimuth doesn't. Exactly how it breaks is also important; if it is very strict on bucket size (which, actually, is probably easier) then the list will have fewer whole letters under real conditions. We could try to prefer whole letters.

In this scheme, each bucket would have pretty close to the same number of entries, depending on how much variation is allowed for grouping rationalization. While this is nice, my feeling is that this is a little too geeky.
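To make the strict equal-size variant concrete, here is one possible sketch. It is only an illustration: chunk_labels and shared_prefix_len are hypothetical names, names are assumed to start with A-Z, and each label uses just enough leading characters to tell a bucket from its neighbour, so the labels name boundaries rather than covering every possible prefix the way the table above does.

```python
def shared_prefix_len(a, b):
    """Number of leading characters a and b have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


def chunk_labels(names, bucket_size):
    """Split sorted names into equal buckets and return short range labels."""
    keys = sorted(n.strip().upper() for n in names if n.strip())
    buckets = [keys[i:i + bucket_size]
               for i in range(0, len(keys), bucket_size)]
    labels = []
    for i, bucket in enumerate(buckets):
        if i == 0:
            start = "A"  # assumption: the first label always opens the alphabet
        else:
            # one character more than is shared with the previous bucket's end
            depth = shared_prefix_len(buckets[i - 1][-1], bucket[0]) + 1
            start = bucket[0][:depth].capitalize()
        if i == len(buckets) - 1:
            end = "Z"    # assumption: the last label always closes the alphabet
        else:
            depth = shared_prefix_len(bucket[-1], buckets[i + 1][0]) + 1
            end = bucket[-1][:depth].capitalize()
        labels.append(start if start == end else f"{start}-{end}")
    return labels
```

With six names and a bucket size of three, for example, Adams, Allen, Baker, Brown, Clark, Davis comes out as "A-Ba" and "Br-Z". Something published between two labels (say, "Bennett") has no obvious home until the next calculation, which is one reason a real version would want to store the boundary keys and not just the labels.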

A variation on this is encyclopedic ordering, possibly combined with letter dropping as above. In this case we would still try to balance different buckets, but subject to some rules designed to maintain readability of the result.

  1. Splits may happen only within whole letter ranges.
  2. Combinations may be of only whole letters.

With these rules for calculation, we might get something like:

    A-Am An-Az B C D-Dm Do-Dz E F G H I-K L M N O P-Pb Pc-Pz Q R S T U-V W-Z

Combinations might even be labeled like X-Y-Z, though this gets cumbersome with larger combinations.

A low-density repository might read:

    A-Z
or
    A-M N-Z

These buckets might not end up quite as balanced, but would be a significant improvement. Just these rules would handle both over- and under-population problems quite nicely, and with no particular publish hook, since new content will always have somewhere to land, even if it might cause a reshuffling in the next calculation. We might also remove letters with no content, but I think this would lead to worse categorization, and would certainly complicate things.
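The two encyclopedic rules can be sketched roughly as below. This is one possible reading, not a settled design: the function names and the single target parameter are assumptions, an over-full letter is split on its second character only, and names whose second character is not a letter are lumped in with "a".

```python
import string
from collections import Counter


def split_letter(letter, group, target):
    """Rule 1: split one over-full letter on its second character."""
    counts = Counter(
        k[1] if len(k) > 1 and k[1] in string.ascii_uppercase else "A"
        for k in group)
    pieces, start, filled = [], "A", 0
    for second in string.ascii_uppercase:
        filled += counts[second]
        if filled >= target and second != "Z":
            pieces.append((start, second))
            start, filled = chr(ord(second) + 1), 0
    pieces.append((start, "Z"))
    labels = []
    for lo, hi in pieces:
        left = letter if lo == "A" else letter + lo.lower()
        right = letter + hi.lower()
        labels.append(left if left == right else f"{left}-{right}")
    return labels


def run_label(run):
    return run[0] if len(run) == 1 else f"{run[0]}-{run[-1]}"


def encyclopedic_labels(names, target):
    """Return labels whose buckets each hold roughly `target` names."""
    keys = [n.strip().upper() for n in names]
    keys = [k for k in keys if k[:1] and k[0] in string.ascii_uppercase]
    labels, run, run_count = [], [], 0
    for letter in string.ascii_uppercase:
        group = [k for k in keys if k[0] == letter]
        if len(group) > target:
            if run:
                labels.append(run_label(run))
            run, run_count = [], 0
            labels.extend(split_letter(letter, group, target))
            continue
        # Rule 2: combine whole letters until the next one would overflow
        if run and run_count + len(group) > target:
            labels.append(run_label(run))
            run, run_count = [], 0
        run.append(letter)
        run_count += len(group)
    if run:
        labels.append(run_label(run))
    return labels
```

Because every letter and every second character ends up inside some label, an empty repository yields a single "A-Z" and newly published content always lands somewhere, which matches the no-publish-hook property described above.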

There are a few wrinkles:

What to do about "Other": some titles start with oddball characters, like the quotation mark, some with recognizable but non-ASCII characters, and others with characters from non-Roman systems. Currently we claim to be unable to place them in our lexical ordering, and so toss them in an "Other" bucket. A more clever sorting system (well, search system) could ignore quotes and similar junk. Accented characters, and some other characters, could be normalized to a plain Roman letter (and there are indeed ways to do this already). Other letters (eth, thorn) have a sorting order (several, really), and we could treat them as letters of their own, or place them inside some other letter (commonly T). And how about Chinese-style characters? Tossing these all in a big "Other" pile would seem to create a very large pile over time. It may be that we will have to do this anyway, as we don't have the time or expertise to build alphabetical sorting for every language around.
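One way the quote-stripping and accent-folding could work is sketched below using Python's standard unicodedata module. The function names are hypothetical, and the placement of thorn under "Th" and eth under "D" is just one assumed convention among the several mentioned above.

```python
import unicodedata

# Assumed placements for letters that have no Unicode decomposition.
SPECIAL = {"Þ": "Th", "þ": "th", "Ð": "D", "ð": "d"}


def sort_key(title):
    """Fold a title to an uppercase key: no accents, no leading punctuation."""
    text = "".join(SPECIAL.get(ch, ch) for ch in title)
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop combining marks, so "É" (E + acute accent) becomes plain "E"
    stripped = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch))
    # Skip leading quotes, punctuation and whitespace
    start = 0
    while start < len(stripped) and not stripped[start].isalnum():
        start += 1
    return stripped[start:].upper()


def browse_letter(title):
    """Return the A-Z letter a title belongs under, or "Other"."""
    key = sort_key(title)
    if key and "A" <= key[0] <= "Z":
        return key[0]
    return "Other"
```

Titles in non-Roman scripts (and ones starting with digits) still fall through to "Other" here; this only shrinks the pile, it doesn't remove it.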

What to do about "Unknown": this is only for Institution, and also could generate a very large pile. It's already two pages. My inclination is that for categories where content doesn't provide any data, we shouldn't include that content.

Search system interaction

The browse system uses a wide variety of ways to fetch data, but most of the alphabetical ones rely on methods that specifically extract by first letter. Employing various ranges would require changing these, or adapting the caller, to deal with sub ranges and multi-letter ranges. Encyclopedic chunking may be a little easier, but the search demands for "Bg*-Bk*" are more complex than those for "B*", and likely to be more expensive. Some systems might not be able to build this into the query nicely, and we'd have to do result filtering.
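For systems that can't express a range like "Bg*-Bk*" in the query itself, the result-filtering fallback might be as simple as this sketch. The function names are invented for illustration, and it assumes plain case-insensitive comparison on the leading characters.

```python
def in_prefix_range(name, low, high):
    """True if name falls in the inclusive prefix range low*..high*."""
    key = name.strip().upper()
    low, high = low.upper(), high.upper()
    # At or after the low prefix, and its leading characters
    # do not sort past the high prefix.
    return low <= key and key[:len(high)] <= high


def filter_results(names, low, high):
    """Filter an already fetched result list down to one alphabet chunk."""
    return [n for n in names if in_prefix_range(n, low, high)]
```

The same check covers a multi-letter range such as "D"-"F" and a sub-range such as "Bg"-"Bk", so a caller could fetch by first letter(s) as it does today and then narrow the results.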

The same system that calculates and stores the ranges might also be able to tell us what its contents should be, having looked them up, perhaps expensively, during its off-line run. This would certainly be fast, and could probably even add things quickly on publication (but is yet another synchronous event on publication.) I'm wary of yet another method of cataloging data, however, and this would add some complexity.

Re: Browse System Alpha-chunking

Posted by kef at 2007-08-21 16:10
Two questions:

1. What does it require to add a count of content to the current letter system? The language and subject categories provide counts of content and it is very helpful.

2. Can we have a display ALL that chunks? That would be very useful for small repositories (and even for ours with collections and lenses).
