Does my digital archives need a digital repository system?

I have had this discussion with colleagues several times over the past couple of years. Somebody is getting ready to prototype a digital archives at their archival institution and the first question they ask is, "Which open-source repository system should I use? Dspace? Fedora? Greenstone? Eprints?"

However, as a system analyst, I believe "what are my requirements?" is the more appropriate question to ask before selecting technology and tools.

In many instances these requirements might be much simpler, in a first-iteration prototype, than what is offered by repository systems. Or worse, the repository system doesn't really support the necessary requirements and instead leads the implementors down a frustrating, time-consuming path of installing, configuring, tweaking, shoe-horning, customizing, re-designing, and re-integrating to meet the design constraints and quirks of the given repository system. All of which, in the end, is to meet some simple archival storage and/or metadata management requirements.

First generation

The first-generation open-source repository platforms listed above were designed as 'institutional repository' or digital library systems. Their workflow and metadata functionality was hard-coded for e-text, e-thesis and e-library collections. This makes it complicated to adapt them to the workflow and metadata requirements for digital collections of archival materials, i.e. records and other documentary materials that are managed using archival processes and principles (e.g. appraisal, accessioning, multi-level archival description).

Dspace from MIT has probably been getting the most attention from those who are pilot testing digital archives. I think this is in large part because it documents some of its architecture using OAIS terminology, which speaks to archivists concerned about digital preservation (although comprehensive digital preservation support is something that is deferred to a future Dspace 2.0 release).

The other attraction is, of course, that it is a fully open-source system. In my experience, most archivists responsible for public records in the public trust prefer to work with open-source technology to keep their digital archive platforms as technically accessible and flexible as possible (even if that means passing on some more mature proprietary technologies).

However, as one MIT archivist admitted in a Dspace session at the 2006 SAA conference, the current version of Dspace really isn't suited for use in an archival setting. The MIT Archives is instead planning its own customization of Dspace, called "ASpace", to adapt it for use in an archival setting. I know of a couple of projects that started with Dspace as a prototype digital archives repository and abandoned it after deciding that working around the workflow and data management issues was going to be too much trouble.

That said, each of these systems, including Dspace, is now trying to become more flexible and allow for a service-based architecture. At the moment, however, I believe only Fedora is capable of opening its repository functionality as web services.

Java Content Repository (JCR) API

It would be great to see these institutional repository systems go one step further and provide support for the emerging industry standard JCR. JCR is a generic API for content repositories. It comes from the Java community but can be ported to any language. The current version is being worked on and supported by the who's who of the ECM and enterprise computing world (see list of names under Expert Group at the JCR project homepage).

JCR API compliance would certainly open up the potential for open-source digital repositories in the enterprise (where we still have to fight an anti-open-source bias from most IT departments). By the way, Apache Jackrabbit is an open-source content repository that does support JCR.

Don't get me wrong...

Anyway, this is not meant as a rant against the current batch of open-source repository systems. I hope they will improve to meet the needs of a wider user community (including JCR support). I expect to be using and supporting them well into the future as integral parts of mature digital archives architectures.

This post is meant more as a call for simplicity when starting a first iteration, simple digital archives prototype. It was inspired by a couple of articles I read recently.

Fedora and the preservation of university records

Firstly, the October 2006 RLG Diginews article that reported on the findings of the 'Fedora and the Preservation of University Records Project'. This was a project that started very much along the usual lines, i.e. let's take Fedora and let's see what it will take to get it to work as our archival records repository. The authors note:

"We changed our focus when we realized that we were asking the wrong question. In serving as the repository core of a preservation system, a Fedora instance (or instances) would only be one part of an overall preservation environment. Large portions of ingest and access activities in addition to preservation planning decisions would occur outside of the Fedora instance... The question we should have asked was: 'Can a Fedora repository, surrounded by the proper preservation policies, tools, and Fedora services, serve as the basis of a trustworthy preservation system?'"

Their research project then shifted focus to produce an excellent set of Ingest and Maintain (storage) guidelines. In other words, it moved from the concept of a digital archives as a software application to the OAIS concept of an archival information system as “an organization of people and systems that has accepted the responsibility to preserve information and make it available for a designated community.”

The Internet Archive

Some archival professionals have been slightly annoyed that some of the digital archives community's thunder was stolen by the Internet Archive project, which doesn't operate like a traditional archival institution but does publicize the word 'archive'. Anyway, turf wars about the term 'archive' are so 1990s; some of us don't even flinch anymore when techies use the word 'archival' as a noun instead of an adjective (e.g. 'email archival').

The point is that the Internet Archive is a great project that has managed to capture a legacy of digital content that would otherwise have been lost. As a result, the Internet Archive is now running the largest digital archives in the world. A recent eWeek article gave a behind-the-scenes look at the Internet Archive's technical architecture. What struck me was the simplicity of their setup:

"Despite the massive amounts of data that the Internet Archive is storing, managing and preserving for posterity, [Internet Archive founder Brewster] Kahle said the secret to the organization's success is keeping it simple. 'We don't do anything that isn't immediately obvious to college students with Linux on their dorm-room desktop,' Kahle said. 'We are allergic to secret sauce. Everything we do is standardized and simple.'"

The Internet Archive uses a JBOD (just a bunch of disks) approach. Off-the-shelf Ubuntu Linux boxes with four hard drives each, all networked together. Some basic network monitoring software, OAI metadata harvesting and off-site replication over HTTP and FTP. Simple. Scalable. Obvious to college students with Linux on their dorm-room desktop, or archivists converted into techies to deal with the changing (digital) nature of our collections.

A simple digital archives architecture

If that's good enough for the Internet Archive, which is handling petabytes of content and significant traffic volumes, why not for the first iteration of a simple digital archives prototype? Why shouldn't the 'archival storage' component of an OAIS simply use a Linux or Windows OS to manage file storage, rather than implement a digital repository system with all the technical architecture buy-in that requires?

As the Tufts-Yale Fedora Project concluded, it is really the supporting processes and tools around the archival storage that are critical anyway. These can be provided by procedural guidelines and simple utilities that can be mixed and matched, as best-of-breed solutions, to provide necessary functionality such as backup and replication, file normalization, checksum integrity monitoring, etc.
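
To give a feel for how small such a utility can be, here is a minimal sketch of checksum integrity monitoring in Python. It is only an illustration under my own assumptions, not something taken from the Fedora project or the Internet Archive: the function names, the choice of SHA-256 and the shape of the manifest (relative path mapped to digest) are all hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map every file under root (relative POSIX path) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root: Path, manifest: dict[str, str]) -> dict[str, str]:
    """Compare the files under root against a previously stored manifest.

    Returns relative path -> "missing", "changed" or "unlisted".
    An empty dict means every file still matches its recorded checksum.
    """
    current = build_manifest(root)
    problems: dict[str, str] = {}
    for rel, expected in manifest.items():
        if rel not in current:
            problems[rel] = "missing"
        elif current[rel] != expected:
            problems[rel] = "changed"
    for rel in current:
        if rel not in manifest:
            problems[rel] = "unlisted"
    return problems
```

In practice the manifest would be written somewhere outside the storage directory at ingest (for example as JSON) and `verify_manifest` would be run on a schedule, with any non-empty result treated as a prompt to restore from a replica.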

This would allow for the metadata and workflow components (e.g. OAIS' 'data management' and 'administration' components) to be managed in a simple database system that is tailored specifically for processing archival collections (e.g. the open-source ICA-AtoM application, a custom database or a commercial archival description package). This database could enforce a simple namespace and use it to assign unique identifiers to the digital objects in archival storage, linking to a file or files on the network storage device(s). The database should, of course, be able to support both archival description standards and structural and technical metadata standards (e.g. PREMIS, METS). It would manage the physical and logical relationships between the digital objects, schedule and log preservation tasks, and provide search and browse access. It goes without saying, of course, that maintaining the links between the database records and the digital objects would be a critical requirement.
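
As a rough sketch of the identifier and linking part of that idea, the following Python example uses SQLite. Everything in it is an assumption made for illustration: the two-table schema, the `example-archives` namespace prefix and the function names are hypothetical, and it is not how ICA-AtoM or any other package actually works. It leaves out PREMIS/METS metadata, task scheduling and search entirely.

```python
import sqlite3
from pathlib import Path

NAMESPACE = "example-archives"  # hypothetical namespace prefix

SCHEMA = """
CREATE TABLE IF NOT EXISTS description (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES description(id),  -- multi-level hierarchy
    level     TEXT NOT NULL,                       -- e.g. fonds, series, item
    title     TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS digital_object (
    id             INTEGER PRIMARY KEY,
    identifier     TEXT UNIQUE,
    description_id INTEGER NOT NULL REFERENCES description(id),
    relative_path  TEXT NOT NULL UNIQUE            -- path within archival storage
);
"""


def open_catalogue(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the catalogue database and ensure the schema exists."""
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def add_description(conn, level, title, parent_id=None) -> int:
    """Add one level of archival description and return its row id."""
    cur = conn.execute(
        "INSERT INTO description (parent_id, level, title) VALUES (?, ?, ?)",
        (parent_id, level, title),
    )
    conn.commit()
    return cur.lastrowid


def register_object(conn, description_id, relative_path) -> str:
    """Link a stored file to a description and assign it a unique identifier."""
    cur = conn.execute(
        "INSERT INTO digital_object (description_id, relative_path) VALUES (?, ?)",
        (description_id, relative_path),
    )
    identifier = f"{NAMESPACE}:{cur.lastrowid:08d}"
    conn.execute(
        "UPDATE digital_object SET identifier = ? WHERE id = ?",
        (identifier, cur.lastrowid),
    )
    conn.commit()
    return identifier


def find_broken_links(conn, storage_root: Path) -> list[str]:
    """Return identifiers whose file no longer exists under storage_root."""
    rows = conn.execute(
        "SELECT identifier, relative_path FROM digital_object ORDER BY id"
    )
    return [ident for ident, rel in rows if not (storage_root / rel).is_file()]
```

The point of `find_broken_links` is the last requirement above: because the files live on ordinary file storage rather than inside a repository system, the database has to be able to check for itself that every record still points at something.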

Anyway, I can appreciate that such a database application would also get fairly complex. But it can be built from the ground up (like ICA-AtoM) as an archival description application that supports multi-level archival description, archival authority files, and archival processing workflows, therefore making it easier to integrate into an existing archival institution setting.

I can also appreciate that the archivist responsible for this prototype would have to acquire new technical skills to implement and manage it but this would still be a requirement even if an existing repository system was being used. This way, however, the archivist gets a thorough understanding of how all the pieces fit together, allowing them to retool the components as necessary, rather than working with what may otherwise appear to be a mysterious black box.

Of course, as the functional and technical requirements of the digital archives get more complex it may, eventually, become time to upgrade the archival storage and/or (meta)data management component to a comprehensive digital repository system.

However, this decision should be driven by requirements, not by digital repository peer pressure ;-)

Originally published on November 27, 2006 at