Browse System Alpha-chunking
Currently we break all the data in alphabet-chunked second panes (like
under title, author, institution, courses) by letter. Some of these only use
the alphabet to load the third pane (title, courses); the others use it to
reload the second pane (author, institution). Regardless, the alphabet is
rendered the same, simply as an array of hardcoded values. Most of these are
letters, but we also have a category for non-ASCII characters, and in one
case for content that doesn't provide data in that category (a null
category). This scheme, of course, is fast to render. It's also simple to use
and fairly obvious, and at most font sizes/resolutions, it fits on a single
line. These are some good characteristics.
Problems
There are two main classes of problem here.
Under: when there's very little data, it's pretty frustrating to browse. (At
least, in an aimless way. Targeted browsing is still okay.) Most of the
letters are empty, and one starts to wonder if there's anything there. And it
takes a lot of clicking to find anything at all in a situation with small
numbers of authors and/or content. This is not a problem for cnx.org, since
we have enough content that there are only a handful of empty categories
(mostly at the end of the alphabet in institutions). However, for new Rhaptos
sites, it's a big problem. And it will become a problem even with cnx.org
once browsing is applied within lenses (in the future).
Over: when a letter has a lot of data, it can take up several pages in the
second pane (which has no batching) and leads to long batch navigation in
third panes. Parts of "Author", for instance, are getting there. Again, this
isn't too bad on cnx.org, but might be in the future with content
growth.
When to calculate
One of the obvious stumbling blocks to a more calculated alphabetical switcher is that we basically have to work with the whole repository to figure out the chunking, or at least do the query ahead of time for each letter. This is prohibitively expensive for something that happens constantly in the browse interface. In fact, it probably has enough latency that it's not a good candidate for ordinary caching, since cache misses would be so slow.
Clearly we could calculate this and hard-code it for the current state of the repository, but that's not a particularly good solution, especially in the face of multiple repositories. The most promising avenue is a tool (or utility post-2.5) that can do these calculations and store them in a simple list, able to respond to the browse pane quite quickly. Updates to this facility would be manual (whenever a balance issue was noticed), but we could set a timer on it. As slowly as the repository is likely to change, this might be on a weekly basis or longer; slow updates would probably improve usability, as the interface wouldn't keep subtly shifting. Another option would be to trigger a rearrangement once X pieces of content have been published or changed (though this would probably still run on a periodic timer, so that the operation isn't synchronous with publishing).
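The stored list and its two refresh triggers (age and number of changes) could be sketched roughly as below. This is only an illustration: the class name `AlphabetCache`, its fields, and the default thresholds are all assumptions, not part of any existing Rhaptos tool, and the `compute` callable stands in for whatever expensive calculation is eventually chosen.

```python
from dataclasses import dataclass, field


@dataclass
class AlphabetCache:
    """Hypothetical holder for a precomputed alphabet (all names are assumptions)."""
    labels: list = field(default_factory=list)
    computed_at: float = 0.0          # timestamp of the last calculation
    changes_since: int = 0            # publishes/changes since then
    max_age: float = 7 * 24 * 3600    # assumed default: one week, in seconds
    max_changes: int = 100            # assumed default for "X" in the text

    def note_publish(self):
        # Cheap bookkeeping only; nothing expensive happens at publish time.
        self.changes_since += 1

    def is_stale(self, now):
        # Stale when too old or when enough content has changed.
        return (now - self.computed_at > self.max_age
                or self.changes_since >= self.max_changes)

    def refresh(self, compute, now):
        # `compute` is the expensive off-line calculation returning the labels.
        self.labels = compute()
        self.computed_at = now
        self.changes_since = 0
```

A periodic job would call `is_stale(now)` and, only if it returns true, `refresh(...)`; the browse pane would read `labels` directly and never trigger the calculation itself.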
Solution Possibilities
One of the more obvious approaches to fixing the under-population problem
is to remove/disable empty letters. This leaves the UI pretty stable, and
doesn't require changing anything but the alphabet switcher. It's also a
straightforward calculation. It does not do anything about an over-population
problem, though it would create infrastructure we could use for a future
solution. One difficulty in this approach is that the publication of content
in a currently-empty letter would make it unavailable in the browse interface
until some calculations were made. The alphabet system could be engineered to
have a cheap publish hook to deal with this, though.
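As a minimal sketch of the empty-letter calculation and of the cheap publish hook mentioned above (the function names and the plain list of enabled letters are assumptions made for illustration, not existing code):

```python
import string


def nonempty_letters(titles):
    """Return the letters A-Z that at least one title starts with."""
    present = {t.strip()[:1].upper() for t in titles if t.strip()}
    return [letter for letter in string.ascii_uppercase if letter in present]


def on_publish(enabled, title):
    """Hypothetical publish hook: enable the new title's letter immediately."""
    first = title.strip()[:1].upper()
    # Guard against empty titles; "" would otherwise count as "in" any string.
    if first and first in string.ascii_uppercase and first not in enabled:
        enabled.append(first)
        enabled.sort()
    return enabled
```

The hook touches only the stored list of enabled letters, so it stays cheap; removing a letter again when its last item is retracted could be left to the next full calculation.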
The other major way to deal with both data density problems is to create
buckets of approximately the same size, much like we do for tag cloud sizes,
but on the order of 20-30 (if there's enough data, otherwise you have fewer.)
This might look like
| Repository Size | Displayed Alphabet |
|---|---|
| 1 | A-Z |
| 4 | A-E F-Z |
| 40 | A B C-Dr Ds-Dz E F-H I-K L M-P Q-S T-Z |
| 5000 | A-Ap Aq-Bd Be-Bq Br-Cm Cn-Cz D-Dr Ds-Dt Du-Ft Fu-H I J-Lu Lv-N O-Q R-Rs Rt-Ss T-Vz W-Zz |
| 5005 | A-Ap Aq-Bd Be-Bn Bo-Cm Cn-Cz D-Dr Ds-Dt Du-Ft Fu-H I J-Lu Lv-N O-Q R-Rp Rq-Ss T-Vz W-Zz |
The bucket specification might vary, but this version lends itself to a
nice tight display, in the way that A-Apple Aquilae-Azimuth doesn't.
Exactly how it breaks is also important; if it is very strict on bucket size
(which, actually, is probably easier) then the list will have fewer whole
letters under real conditions. We could try to prefer whole
letters.
In this scheme, each would have pretty close to the same number of
entries, depending on how much variation is allowed for grouping
rationalization. While this is nice, my feeling is that this is a little too
geeky.
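A rough sketch of the strict equal-size approach follows. The parameters `max_buckets` and `min_size` and the fixed two-character labels are assumptions; the labels in the table above are shortened more cleverly (a bare "A", or "Fu-H") than this sketch attempts.

```python
def equal_buckets(names, max_buckets=25, min_size=2):
    """Split sorted names into consecutive buckets of nearly equal size.

    max_buckets caps the bucket count (the text suggests 20-30);
    min_size keeps tiny repositories from getting one bucket per item.
    Both are assumed tuning knobs.
    """
    ordered = sorted(names, key=str.lower)
    if not ordered:
        return []
    count = max(1, min(max_buckets, len(ordered) // min_size))
    base, extra = divmod(len(ordered), count)
    buckets, start = [], 0
    for i in range(count):
        size = base + (1 if i < extra else 0)  # spread the remainder evenly
        buckets.append(ordered[start:start + size])
        start += size
    return buckets


def label(bucket, width=2):
    """Label a bucket by the first `width` characters of its end points."""
    lo = bucket[0][:width].capitalize()
    hi = bucket[-1][:width].capitalize()
    return lo if lo == hi else f"{lo}-{hi}"
```

For example, `[label(b) for b in equal_buckets(names)]` yields the switcher text. Preferring whole letters would mean letting bucket boundaries drift to the nearest letter change, at the cost of less even sizes.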
A variation on this is encyclopedic ordering, possibly combined with letter dropping as above. In this case we would still try to balance different buckets, but subject to some rules designed to maintain readability of the result.
- Splits may happen only within whole letter ranges.
- Combinations may be of only whole letters.
With these rules for calculation, we might get something like:
A-Am An-Az B C D-Dm Do-Dz E F G H I-K L M N O P-Pb
Pc-Pz Q R S T U-V W-Z
Combinations might even be labeled like X-Y-Z, though this gets cumbersome
with larger combinations.
A low-density repository might read:
A-Z
or
A-M N-Z
These buckets might not end up quite as balanced, but would be a
significant improvement. Just these rules would handle both over- and
under-population problems quite nicely, and with no particular publish hook,
since new content will always have somewhere to land, even if it might cause
a reshuffling in the next calculation. We might also remove letters with no
content, but I think this would lead to worse categorization, and it
would certainly complicate things.
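The two rules can be sketched with a simple greedy pass: an over-full letter is split on second-letter boundaries, and under-full letters are merged whole with their neighbours. The function names, the `target` size, and the greedy strategy are all assumptions; a real implementation might balance the buckets better.

```python
import string
from collections import defaultdict


def _range_label(letter, lo, hi):
    """Label a second-letter range inside one letter, e.g. A-Am or An-Az."""
    L = letter.upper()
    if (lo, hi) == ("a", "z"):
        return L
    if lo == hi:
        return L + lo
    return f"{L if lo == 'a' else L + lo}-{L}{hi}"


def split_letter(letter, entries, target):
    """Rule 1: split one over-full letter on second-letter boundaries."""
    counts = defaultdict(int)
    for key in entries:
        ok = len(key) > 1 and key[1] in string.ascii_lowercase
        counts[key[1] if ok else "a"] += 1   # odd second characters sort first
    labels, start, running = [], "a", 0
    for second in string.ascii_lowercase:
        if running and running + counts[second] > target:
            labels.append(_range_label(letter, start, chr(ord(second) - 1)))
            start, running = second, 0
        running += counts[second]
    labels.append(_range_label(letter, start, "z"))
    return labels


def encyclopedic_buckets(names, target=25):
    """Return bucket labels; every letter A-Z is covered by some label."""
    by_letter = defaultdict(list)
    for name in names:
        key = name.strip().lower()
        if key and key[0] in string.ascii_lowercase:
            by_letter[key[0]].append(key)
    labels, pending, pending_count = [], [], 0

    def flush():
        # Rule 2: emit the whole letters gathered so far as one combination.
        nonlocal pending, pending_count
        if pending:
            first, last = pending[0].upper(), pending[-1].upper()
            labels.append(first if first == last else f"{first}-{last}")
        pending, pending_count = [], 0

    for letter in string.ascii_lowercase:
        entries = by_letter.get(letter, [])
        if len(entries) > target:
            flush()
            labels.extend(split_letter(letter, entries, target))
            continue
        if pending and pending_count + len(entries) > target:
            flush()
        pending.append(letter)
        pending_count += len(entries)
    flush()
    return labels
```

Because empty letters are folded into a neighbouring combination rather than dropped, a newly published item always falls inside an existing label, which is why no publish hook is needed in this scheme.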
There are a few wrinkles:
What to do about "Other": some titles start with oddball characters, like
the quotation mark, some with recognizable but not ASCII characters, and
others with characters from non-Roman systems. Currently we claim to be
unable to place them in our lexical ordering, and so toss them in an "Other"
bucket. A more clever sorting system (well, search system) could ignore
quotes and similar junk. Accented characters, and some other characters,
could be normalized to a plain Roman letter (and there are indeed ways to do
this already). Other letters (eth, thorn) have a sorting order (several,
really), and we could treat them as letters of their own, or place them
inside some other letter (commonly T). And how about Chinese-style
characters? Tossing these all in a big "Other" pile would seem to create a
very large pile over time. It may be that we will have to do this anyway, as
we don't have the time or expertise to build alphabetical sorting in every
language around.
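One way to shrink the "Other" pile is to skip leading punctuation and strip accents before choosing a letter. The sketch below uses Python's standard `unicodedata` module for this; the function name `browse_letter` and the `FOLD` table (which places thorn under T, as suggested above) are illustrative assumptions, and anything that still isn't a Roman letter falls back to "Other".

```python
import unicodedata

OTHER = "Other"
# Illustrative only: letters with no accent decomposition that we choose to
# file under an existing letter. Thorn under T follows the suggestion above.
FOLD = {"þ": "t", "Þ": "t"}


def browse_letter(title):
    """Pick the alphabet category for a title."""
    for ch in title:
        if not ch.isalnum():
            continue  # ignore leading quotes, punctuation and spaces
        ch = FOLD.get(ch, ch)
        # NFKD splits "É" into "E" plus a combining accent; keep the base.
        base = unicodedata.normalize("NFKD", ch)[0]
        if base.isascii() and base.isalpha():
            return base.upper()
        return OTHER  # digits, CJK, and other scripts
    return OTHER
```

With this, `browse_letter('"Quantum" Mechanics')` lands under Q and `browse_letter("Élan")` under E, while titles in non-Roman scripts still go to "Other".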
What to do about "Unknown": this is only for Institution, and also could
generate a very large pile. It's already two pages. My inclination is that
for categories where content doesn't provide any data, we shouldn't include
that content.
Search system interaction
The browse system uses a wide variety of ways to fetch data, but most of the alphabetical ones rely on methods that specifically extract by first letter. Employing various ranges would require changing these, or adapting the caller, to deal with sub ranges and multi-letter ranges. Encyclopedic chunking may be a little easier, but the search demands for "Bg*-Bk*" are more complex than those for "B*", and likely to be more expensive. Some systems might not be able to build this into the query nicely, and we'd have to do result filtering.
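Where a search backend cannot express a range like "Bg*-Bk*" directly, the same effect can be had from an already-sorted list of lowercase keys with two binary searches. This is only a sketch of that idea using the standard `bisect` module; `prefix_range` is a hypothetical helper, not an existing browse method, and it assumes keys are lowercased and sorted in plain code-point order.

```python
from bisect import bisect_left


def prefix_range(sorted_keys, lo_prefix, hi_prefix):
    """Return keys from the first "lo*" key through the last "hi*" key.

    sorted_keys must be lowercase and sorted; hi_prefix must be non-empty.
    """
    lo = lo_prefix.lower()
    hi = hi_prefix.lower()
    # Bump the last character to get the smallest string above every "hi*" key,
    # e.g. "bk" -> "bl".
    upper = hi[:-1] + chr(ord(hi[-1]) + 1)
    return sorted_keys[bisect_left(sorted_keys, lo):bisect_left(sorted_keys, upper)]
```

A whole-letter query is the special case `prefix_range(keys, "B", "B")`, so the existing first-letter behaviour and the new sub-ranges could share one code path.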
The same system that calculates and stores the ranges might also be able
to tell us what its contents should be, having looked them up, perhaps
expensively, during its off-line run. This would certainly be fast, and could
probably even add things quickly on publication (but is yet another
synchronous event on publication.) I'm wary of yet another method of
cataloging data, however, and this would add some complexity.
Re: Browse System Alpha-chunking
1. What does it require to add a count of content to the current letter system? The language and subject categories provide counts of content and it is very helpful.
2. Can we have a display ALL that chunks? That would be very useful for small repositories (and even for ours, with collections and lenses).
