I wrote this when I was the Digital Library Applications Lead at the University of Notre Dame. It was archived from http://www3.nd.edu/~dbrubak1/ which no longer exists.

A Digital Repository We Can Live With

Thursday January 3, 2013 by Dan Brubaker Horst in Archived, API, architecture, archives, digital humanities, ND, repository

Setting Expectations

The primary deliverable of a digital repository should be a persistent URL that resolves to a rich HTML description of the item. A digital object should be able to be:

Linked to
Embedded (e.g. via oEmbed)
Indexed by Search Engines (e.g. Google, Google Scholar)
Presented in our Library Catalog
Delivered in a wide variety of derivatives

All digital objects need:

Access controls
Copyright auditing
Life cycle management
Description and metadata enrichment

We need to provide the services needed to meet these expectations when the digital objects are ingested.

When we store digital objects the original files will be treated specially. In most cases the original file will be a preservation ready item. We will want to create a presentation-ready derivative of the original at a reasonable fidelity and in an appropriate format. When preserving these files the archival and the presentation masters of the file should be versioned and periodically bit-checked for integrity. On-the-fly derivatives, such as a crop from an image or a plain text export of a TEI document, do not need integrity checking. Derivatives should be cached and allowed to expire via an appropriate mechanism.

Don’t Dig The Hole Deeper

While the Hydra stack has been helpful it is not without problems. Thus far our efforts have produced applications that are complicated, difficult to maintain, and fail at delivering an exceptional experience to patrons and curators. Penn State’s ScholarSphere is a well-implemented “Hydra head” but it is still constrained by the same design decisions present in all “Hydra heads” Applications that use the “Hydra head” design philosophy have awkwardly coupled concerns. Persistence, serialization, and presentation are blended together and the resulting application is entwined with Fedora, Solr, and a host of gems.

The promise of the Hydra stack is that if you install a bunch of gems into your application it will be easy to make a Fedora-backed digital repository. This is fine if there is only one application that serves as a digital repository for an institution. With each added application it gets more and more complex to maintain the ecosystem because the necessary gems are rapidly changing. We already have half a dozen applications and will build at least that many this year. I think this will create a maintenance nightmare.

Our Current Repository Architecture The APIs for Fedora and Solr are the integration layer between applications.

An API-Driven Architecture

I believe that we can forestall our growing pains if we reexamine our architecture. The collection of gems that makes up a “Hydra head” which as also been colloquially referred to as the “Hydra neck” isn’t a good integration point across multiple applications because it ties every application too closely to Fedora and Solr.

Right now when we say “repository” we mean Fedora Commons. When we say “repository” we should mean a common API for storing and retrieving digital objects. If we wrap Fedora and the canonical Solr index with an application that provides a Hypermedia API to the underlying content we can keep a lot of the complexity centralized. This will effectively move the point of integration between our applications from the Fedora API to the new API that we have written to meet our needs. This should lower the barrier of implementation for client applications after an upfront investment in time.

The Proposed Repository Architecture The persistence mechanism of digital objects is obfuscated by a common API. Complexity is centralized allowing for simpler client applications.

One of the topics that perennially pops up at Hydra meetings is: “What would Hydra look like without Fedora?” The proposed API-driven architecture offers an answer. Because it removes the Fedora API as the common interface between applications the entire persistence layer can be changed to anything that implements the API. This also provides a benefit for developers because it would allow alternate implementations of the API, e.g. one backed by SQLite, to be used while developing repository-backed applications.

Patron-Facing API Clients

While the centralized repository will have a canonical representation of each object most of the time an object belongs to a more specific context. Groups of digital objects should be able to be embedded or re-cast in other interfaces as members of a collection, exhibit, or other specialized interfaces.

Exhibit

An exhibit is a selection of items with particular scholarly significance. They should stand alone or serve as digital companions to exhibits presented in the Rare Books exhibit area. Exhibits may be as simple as static pages with embedded repository content. There is room for a tool like Atrium to facilitate this process but it’s present incarnation is more focused on “collections” than “exhibits”

Collection

A collection interface is an alternative to a conventional finding aid. Collections will typically be based on Blacklight and have their own collection-specific Solr indexes.

Portal

A portal is a collection with additional features and visualizations. We can provide value to our patrons with tools that enable them to sift, sort, and juxtapose our holdings. The collection interface would typically be Blacklight-driven while the other components may be either shared visualization tools or one-off pieces as needed.

Selector-Facing API Clients

One of the selling points of Hydra is that it can unite the patron-facing discovery interface with the curator-facing administrative interface within the same application. While this is nice in some cases, such as in a self-submit style application, the needs of a patron and a curator can diverge dramatically in some cases. In the Seaside project we split the administrative interface and the discovery interface into two applications. This allows us to tailor the experience for each audience. We can extend this idea further and create a variety of tools for adding and editing content into the repository to meet the needs of curators in different contexts.

Single-item Editor

This can be accomplished in a traditional Hydra-style form view in an application. To enable editing items from the front-end view we could create a simple browser extension, or perhaps even a bookmarklet, that embeds an editing interface onto the presentation page. This would work best when there are only a few curators.

Batch Ingest

There are a number of areas that use Filemaker Pro or other databases to manage their collections. Thick clients are sometimes very helpful and we shouldn’t force them to abandon their established workflow to accommodate our repository. Hopefully writing an adapter from FMP using a client to our repository API will be simpler than our current solution of writing rake tasks that call ActiveFedora directly.

Depository

In a depository privileged patrons can curate their own submissions. In this case a conventional “Hydra head” like Libra and ScholarSphere would be a fine solution.

Next Steps

There is a lot of work involved in this proposal. Here are some of the deliverables:

Determine if adding another layer of abstraction in our infrastructure will result in a net loss of work. I believe it has the potential to do so but I have no empirical data.
Design the one repository API to rule them all. This will not be easy. There are a lot of things to get wrong and arm waving about HATEOS doesn’t make the technical challenges disappear.
Implement the API. If we do the design work up front this will easier but it is still a major project.
Implement the repository API with a different persistence mechanism. It would greatly simplify the development stack if the API could be implemented on top of something like Redis or SQLite.
Write clients for the API. A Ruby client and a JavaScript client would be a good start.
Determine how to make repository content portable. oEmbed is one option; are there others?