Goodbye, Columbus; Farewell, COTA

LITA Camp is long over now (it ended on May 8th), but I’m just finally getting around to adding a post about it.  Though it would’ve been ideal if the “un”conference hadn’t had such a hefty registration fee attached to it, it was still a nice couple of days of networking with a bunch of interesting librarians whom I had never met before.

Also, though Dublin seems to be a really nice town, transportation to and from the airport was less than ideal (especially for me, since I opted to stay in Ohio until Saturday in order to see a bit more of Columbus while I was there).  Unfortunately, I wasn’t prepared for the horrible website that is COTA (the Central Ohio Transit Authority).

Perhaps I am a little slow, but just one small tweak that could make those PDF bus routes a lot easier to use on one’s own would be to include some “ante” and “post” meridian notation!  Or, heck, even using military time would make things a lot less ambiguous, and slightly more user-friendly.  Here’s an example of a problem that I encountered, putting me on the wrong side of the meridian and stuck downtown without a bus back to Dublin:


Fun to twist your head and read, right?  Well, to make a long story short, I could’ve avoided getting stuck had this particular excerpt included “am” next to each of those times.  Though I was able to travel down to Columbus early that evening, that same bus did not run back uptown at night (and I had unwisely assumed that I could hitch a ride back at the 7:42 stop).

Aside from that misfortune, I did get to see a few other interesting sights and signs while in Dublin/Columbus, a few of which I’ll post below.  First, you’ve got OCLC’s headquarters sign, captured pretty poorly with my camera phone:


And then you’ve got what must be the most active pedestrians in Dublin, Ohio:


As for the funniest piece of graffiti that I encountered, I’ll bestow that honor on this alley-side image:


And finally, the outright winner for what was clearly the best sign that I saw during my trip:


A sophisticated treat, indeed; but, sadly, it was one that I opted not to partake in during my time there.  Next time, though, Columbus!


Hello, Columbus

Not sure what to expect yet, but I’ve just recently arrived in Columbus, Ohio. Here’s the reason why:

LITA Camp 2009

I plan to do a presentation regarding my attempts/plans to integrate our EAD records with our Digital Repository. I was hoping to have a functioning beta by now (aside from just a few web page examples and a PPT presentation), but a lot of other work has come up that hasn’t permitted that to happen. Nevertheless, I still hope to launch everything in July.

And, after this weekend, I’ll go ahead and post my presentation and a detailed conference report. I’m looking forward to it…

Subjective Access

“Subjective access may not guarantee that I’m right about the character of the state I’m conscious of myself as being in, but on this view it does ensure that I’m the one who’s in the state if anybody is.”

— David M. Rosenthal, from Consciousness and Mind (OCLC: 61200643, page 355)

In EAD, we file subjects under a tag known as <controlaccess>, which is short for “controlled access headings.” And, since we’re using these for access (just as they were used in the card catalogs of old, and occasionally still are), they should certainly be hyperlinks, right?
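Pulling those headings out programmatically is straightforward. Here’s a minimal sketch in Python; note that the EAD fragment and the headings in it are invented for illustration, and that real EAD files often carry a namespace, which this sketch ignores:

```python
# Extract the controlled access headings from an EAD fragment.
# The XML below is an invented example, not our actual markup.
import xml.etree.ElementTree as ET

fragment = """
<archdesc>
  <controlaccess>
    <subject>Dogs--North Carolina--Greenville</subject>
    <persname>Sinbad (Dog)</persname>
    <geogname>Greenville (N.C.)</geogname>
  </controlaccess>
</archdesc>
"""

root = ET.fromstring(fragment)
# Every child of <controlaccess> is a heading of some kind
# (subject, persname, geogname, etc.).
headings = [el.text for el in root.find("controlaccess")]
print(headings)
```

Each of those strings is exactly what we’d want to turn into a hyperlink.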

So, what state are the subjects of our finding aids actually in, and what state should they be in?

In our case, at ECU, we’re in the process of updating all of our old subjects into Library of Congress Subject Headings, like this:

Sinbad (dog) + more


This isn’t an easy process, but it does mean that, once done, all of our finding aids will “play” nicely with all of the objects in our Library Catalog as well as all of the objects in our Digital Repository. But, for the time being, everything that’s listed in our “controlled access headings” is listed as plain text (without even, at this time, the ability to restrict a keyword search to those fields).

After the update, however, not only will we offer more options on an advanced search page, but we’ll also turn all of our subjects into hyperlinks. This raises the question:

to what should we link?

At first, this was “obvious” to me, but after looking around at other institutions, it seems that there are a few different solutions, which I’ll list below:

(1) Link nowhere (The option that’s most often employed, and the one that we’ll be moving away from)

(2) Link to the rest of the finding aid database (subject to subject)

(3) Link to the rest of the finding aid database (subject to keywords)

(4) Link to the library catalog, which would include the rest of the finding aid database (subject to subject)

Right now, I’m leaning toward a fifth option, at least for the time being (which is just a combination of options 2 and 4):

(5) Link to finding aid database (subject to subject), and then on the page of search results, also include a link that will extend that same subject search to our library catalog and also to our digital repository.
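As a rough sketch of what option 5 might look like behind the scenes: each subject gets a primary subject-to-subject link into the finding aid database, plus “extend this search” links for the catalog and the repository. All of the URLs here are hypothetical placeholders, not our actual endpoints:

```python
from urllib.parse import quote_plus

# Hypothetical base URLs -- substitute your own systems' search endpoints.
FINDING_AIDS = "https://example.edu/findingaids/search?subject="
CATALOG = "https://example.edu/catalog/search?subject="
REPOSITORY = "https://example.edu/repository/search?subject="

def subject_links(subject):
    """Option-5 style links: a primary subject-to-subject search of the
    finding aid database, plus links that extend that same search."""
    q = quote_plus(subject)
    return {
        "finding_aids": FINDING_AIDS + q,
        "extend_to_catalog": CATALOG + q,
        "extend_to_repository": REPOSITORY + q,
    }

links = subject_links("Sinbad (Dog)")
print(links["finding_aids"])
```

The two “extend” links would then be rendered on the search-results page itself, rather than inline in the finding aid.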

I’d love to hear other options and what other people think may be the best route (though, in my opinion, that may largely come down to the size of your collections and also to the extent that they’ve been cataloged in some normalized fashion).

My final thought about all of this is how to extend it beyond our own collections, into larger EAD databases like ArchiveGrid. Wouldn’t it be useful to a researcher if a subject in a local finding aid could be extended to repositories worldwide? In this case, though, it would definitely be easier to follow Columbia’s example so as not to get involved with messy crosswalks and the like.

Gangling Container Lists

Linotype operator

— or, on faking a neologism

What’s a “gangling container list”*, you might reasonably wonder?  Well, I’m using the term “GCL” to refer to a “container list” (or inventory) in a finding aid that is particularly hard to encode/potentially confusing to the user/online viewer.  The main GCL at ECU belongs to the Manuscript Collection numbered 741.  Let me explain, in a less cryptic fashion:

Right now, the only collection that we have that’s both heavily described and digitized is our Daily Reflector Negative Collection.

Though the encoding for this collection isn’t divided into thematic series (it’s arranged chronologically instead), it is arranged/subdivided by:

  1. Box
  2. Folder
  3. Sleeve
  4. Item (when digitized).
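As a rough illustration of how those four levels nest as EAD components (c01 through c04), here’s an invented, simplified fragment, checked with a bit of Python; it is not our actual markup:

```python
# A simplified, invented sketch of the box/folder/sleeve/item
# hierarchy as nested EAD components (c01 through c04).
import xml.etree.ElementTree as ET

fragment = """
<dsc>
  <c01 level="series"><did><container type="box">1</container></did>
    <c02 level="file"><did><container type="folder">a</container></did>
      <c03 level="file"><did><unittitle>Sleeve 1</unittitle></did>
        <c04 level="item"><did><unittitle>Negative 1 (digitized)</unittitle></did></c04>
      </c03>
    </c02>
  </c01>
</dsc>
"""

root = ET.fromstring(fragment)
# Document order mirrors the physical arrangement: box, folder, sleeve, item.
components = [el.tag for el in root.iter() if el.tag.startswith("c0")]
print(components)
```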

Here’s an example of our EAD encoding for that, where the component level in the EAD corresponds to the ordered-list numbers above:

Snippet of the EAD container list for the Daily Reflector Negative Collection


If you’re familiar with EAD, you might look at this and have a lot of questions/criticisms.  However, I don’t want to focus on how this finding aid is encoded (as it’s not typical for our collections, and it isn’t ideal yet), but instead what I want to focus on is its physical arrangement, its display, and how we’re going to connect it to the portions that are digitized.

Until now, we’ve only been linking digitized objects in our finding aids at the item level (in this case, that’d be the <c04> tag).  However, we have received an LSTA grant for this collection that will shortly result in the digitization and description of over 7000 images.  And, in preparation for this grant, the container list (or, GCL) has grown from a relatively short list, which contained information about its 45 boxes, to an exceptionally long list, which now contains information for over 13000 described sleeves.

Presently, the online finding aid has every box, folder, and sleeve listed on just one page of output.  It also includes just over 100 images that were digitized prior to the grant for testing purposes.  But, if the finding aid were to include all of the images, this would result in over 20000 lines being added just to the container list!

So, we have two dilemmas:

  1. How to deal with this “one page display”
  2. How to deal with so many items (which will only increase after the grant).

As for problem number 1, we’re going to continue with our one page display option for the time being (though we may eventually employ other types of interfaces) in order to keep our search processes as simple as possible.  This could/should be an entire blog post on its own, however, so I’ll save that for another time.

That leaves problem number two. One potential solution, though not yet employed, will adhere to the following principles:

  • Encode everything (all 7000+ items, and add new items as they’re requested for digitization)
  • Do not provide item level links in the finding aid (at least in the initial display) if the collection has too many items (rather than setting an arbitrary item number limit, however, this decision will be made at the collection level and might only include this particular collection, due to the next reason)
  • When possible, only scan and catalog at the lowest level of granularity already described in the finding aid (this means that when future items are requested by a patron for digitization, we might scan all of the other items in that folder at the same time, and only describe the “digital object” at the same level as is described in the finding aid).  See this object for a pilot example (but note that the display is not finished and that it hasn’t yet been cataloged).
  • Create a new stylesheet that can differentiate between providing links at the box, folder, sleeve, and item levels when necessary.
  • Create a new template that helps to address issue number 1 until that issue can be more thoroughly examined.
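Since the link-level decision above is made per collection rather than by an arbitrary item-count cutoff, it could be as simple as a lookup table in the stylesheet logic. This is only a sketch; the collection IDs, level names, and default are illustrative:

```python
# A sketch of the per-collection "which levels get links" decision.
# Collection numbers and level names here are illustrative.
LINK_LEVELS = {
    # Manuscript Collection 741 (the Daily Reflector GCL): link only
    # at the folder and sleeve levels, never per item.
    "741": {"folder", "sleeve"},
}
DEFAULT_LEVELS = {"item"}  # ordinary collections keep item-level links

def should_link(collection_id, level):
    """True if components at this level should be output as hyperlinks."""
    return level in LINK_LEVELS.get(collection_id, DEFAULT_LEVELS)

print(should_link("741", "item"))  # prints False
```

Adding another oversized collection later would then just mean adding one more entry to the table.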

For this finding aid, then, the stylesheet will only output links at the “folder” and “sleeve” levels.  The individual items will only be accessible from these two levels (of folder and/or sleeve).  In some cases, then, the sleeve link will take you to a display with only one item and in others it will take you to a display with multiple items (it just depends on how many of the negatives were selected from that particular sleeve).  Each of these “sleeves” has a description that includes the total number of physical negatives included, though, so it should hopefully be somewhat clear to the user whether the sleeve is partially or fully digitized.

Check back next week for a mock-up of a newly improved Daily Reflector finding aid (this ambitious deadline, I’m hoping, will give me some incentive to finally write that stylesheet).   The mock-up won’t look like the final format, however, as there will still be some work that needs to be done to more fully integrate our digital repository with our finding aid database, but it should present a pretty clear idea.

In the meantime, please leave comments, suggestions, or even examples of your own GCLs.  I’ve certainly seen some instances of innovative displays for extremely large collections, but what I’m more interested in seeing is a display method for such a collection that also fits in with the overall delivery and “search” of the rest (that is, not just a finding aid that’s like an online exhibit, but a mutable sort of finding aid that integrates well with every EAD at that institution).

*Though the phrase is abbreviated as GCL, the recommended pronunciation is actually “Gackl”**, which is utilized instead of “G.C.L.” in order to better emphasize the electronic awkwardness of its referent.
**That’s not an e-typographical error. The preferred spelling is “Gackl” rather than something like “Gackle” for two important reasons:

  1. Obviously, to mock all things Web 2.0
  2. So as to not confuse the term with (nor raise awareness of) Gackle, North Dakota.

Measuring our digital archives, part 2

Well, I’ve seen a few attempts at visualizing archival holdings in the past, but this one by Mitchell Whitelaw at the University of Canberra is somewhat similar to what I was discussing in my previous post:

Click on the image to see the blog post “Packing Them In”

This is working with a very LARGE dataset, which includes some 57k series from the National Archives of Australia [see Mitchell Whitelaw’s post for more information].  The visualization, in this case, highlights the ratio between the size of a series (in linear meters) and the number of registered items that belong to that series (the emptier the square, the fewer the registered items).

So, my idea is certainly not all that original (and I certainly didn’t think that it would be).  But this find still encourages me to pursue my particular path, since the one thing unique about my idea is the use of the length of the EAD itself for comparative purposes (though this may not prove ideal, I think it’s still worth checking into).  It would also be nice to see how this and other similar processes compare when used on the same collections…  but, before that can happen, the proper toolsets will need to be developed.

Measuring our digital archives

Wouldn’t it be great if we had a standard unit of measurement for archival finding aids?  Surely there’s one already, right?  Well, before I answer that, let me back up a little bit…

A recent post by Michele Combs on the EAD Listserv has me thinking again about the large collection of EAD records that I work with on a daily basis.

Michele’s question was a seemingly simple one:

what percentage of your collections (that have “finding aids”) are encoded in EAD?

This, then, raised the question of how exactly we define a finding aid, and also implied questions about whether all instances of “finding aids” should be encoded in EAD (my answer would be YES, if only for the format).  But that’s not the part that interested me during the discussion.

What interested me was when someone else in the list mentioned that though they had a certain percentage of their finding aids in EAD, they also had some finding aids that were extremely long (up to 1000 pages!), and that almost none of the collections that went over 100 pages were in EAD format.  This makes some sense, as it would take a lot of time to type that information into digital format (if it only exists on paper), and the OCR process/clean-up might take even longer.  That said, eventually these collections will have to be converted to EAD:  certainly their current length already suggests the importance of the collection!

But the introduction of “page count” is what really interested me, and gave me some good ideas.  Here’s what I mean:

“Page counts” are not a very good unit of measurement, since the format, font type, font size, margins, spacing, etc., can all affect the length you’ll end up with.  However, any finding aid that’s in a digital format (be that EAD or even MS Word) can easily be measured by character count (excluding the EAD tags themselves, in the case of XML).   This way, archives/archivists can do a quick and accurate count of the size of ALL of their finding aids.

What’s more, this measurement would then be accurate when compared to collections at other institutions (which would certainly not be the case if it were just based on page counts).
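A first pass at such a measurement could be just a few lines of Python. This is only a sketch (the sample document is invented, and a real implementation would need to settle questions like how to treat attribute values and whitespace):

```python
# A rough sketch of the proposed measure: the character count of a
# finding aid's text with the EAD tags themselves stripped out.
import xml.etree.ElementTree as ET

def descriptive_size(ead_xml):
    """Count the characters of element text in an EAD document,
    ignoring markup and collapsing runs of whitespace."""
    root = ET.fromstring(ead_xml)
    text = " ".join("".join(root.itertext()).split())
    return len(text)

sample = "<ead><did><unittitle>Sinbad (Dog) Papers</unittitle></did></ead>"
print(descriptive_size(sample))
```

Run over a whole directory of finding aids, this would give a comparable “descriptive size” for each collection.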

Of course, it’s important to note that I’m only talking about the “size” of the finding aid, and not the physical size of the collection.  However, once you have the “descriptive” size of the finding aid, you could then compare that information with the physical extent of the collection.

But why would you want to do that?  Well, for one, it could be a useful tool to visualize not only the size of our collections, but the lengths that we go toward describing them (and, in a lot of cases, the lengths that we still need to go, in regards to collections that may be physically large but nearly bereft of descriptive attention).

So, I’m thinking about starting to develop a simple toolset to do just that on our local collection (assuming I ever have the time) in hopes that it could then be extended to other archival institutions that are also using EAD.  Hopefully such a large-scale assessment would have some unintended effects as well, but at the very least I think that it could be an interesting way to pinpoint collections — or even areas of collections — that are in need of more processing to increase their visibility (and this, I’m thinking, could be an ideal step to take after the wave of  “more product, less processing” approaches in order to help archivists prioritize their time).

Normalized Dates in EAD

A recent post to the EAD listserv has me thinking once again about how dates are used (or not used) in EAD records.  As far as I can tell, RLG’s ArchiveGrid doesn’t permit searching by date (I could be wrong on this, as I don’t have full access to it; it does use Lucene to index its records, though I suppose that most of those are just MARC records?), and ProQuest’s Archive Finder does permit searching by date, but it doesn’t really allow you to do very much (i.e. there’s no way to rank your results by “relevancy”).

This leads me to a question:  what sort of back-end systems are archives using for their EAD records? (Are there any surveys out there that have this information, or should we start one???)

At ECU, we’re using an XML database only, but we aren’t doing any advanced searching by date (primarily because, at this time, if you did search for something like “1912”, it’s not going to limit your results very much; and then, really, you’re just back at the whole “browse by collection name” situation).  However, you can do a keyword search for “1912”, and the results that are returned to you will be ordered by the number of hits in each document, which, in my mind, is only a small difference in functionality, but perhaps more useful (in most occasions) than simply limiting your results to any and all collection date ranges that contain the year “1912”.
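That hit-count ordering is easy to illustrate with a toy sketch (the documents below are invented):

```python
# A toy sketch of the current behavior described above: keyword
# results ordered by the number of hits per document.
docs = {
    "collection-a": "Minutes 1912; correspondence 1912-1915 ... 1912",
    "collection-b": "Photographs, 1905-1950, including some from 1912",
}

def rank_by_hits(docs, term):
    """Return document names sorted by how often the term appears."""
    hits = {name: text.count(term) for name, text in docs.items()}
    return sorted(hits, key=hits.get, reverse=True)

print(rank_by_hits(docs, "1912"))
```

A collection whose container list mentions 1912 over and over floats to the top, which is arguably closer to what the researcher wants than a simple date-range filter.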

This leads me to another set of questions:  is anyone out there using the “bulk” attribute as part of your information retrieval process?…  is anyone using dates beyond the collection range (those dates associated with a series, folder, even an item) in the information retrieval process?…  has anyone attempted to test their corpus of EAD records with their current search operations vs. indexing and searching those records by means of different models of IR, such as Nutch, INDRI, Solr, or even just Google Custom Search???
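To make the “bulk” question a bit more concrete, here’s a hedged sketch of how normalized unitdate attributes could feed retrieval; the sample EAD and the function are invented for illustration:

```python
# A sketch of using normalized unitdate attributes (including "bulk")
# for retrieval. The EAD sample below is invented.
import xml.etree.ElementTree as ET

ead = """
<ead>
  <unitdate normal="1905/1950" type="inclusive">1905-1950</unitdate>
  <unitdate normal="1910/1915" type="bulk">bulk 1910-1915</unitdate>
</ead>
"""

def covers_year(ead_xml, year, date_type="inclusive"):
    """True if any unitdate of the given type spans the year."""
    root = ET.fromstring(ead_xml)
    for ud in root.iter("unitdate"):
        if ud.get("type") != date_type:
            continue
        start, _, end = ud.get("normal", "").partition("/")
        if start and int(start) <= year <= int(end or start):
            return True
    return False

print(covers_year(ead, 1940))          # inside the inclusive range
print(covers_year(ead, 1940, "bulk"))  # but outside the bulk range
```

The same `iter("unitdate")` walk would also pick up dates attached to series, folders, or items, not just the collection-level range.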

I think it’s great that we’re encoding our documents so well, but I keep wondering if we’re harnessing that information in the best possible ways yet (and perhaps the best solutions won’t be tied to our encoding practices at all).