The anatomy of a digital information object

From a theoretical point of view there is no difference between a digital and an analogue information object. As I explained in previous posts about the concept of information and information objects, both are entities that contain the content of a message and have the required structure and context to allow that message to be decoded and understood.

However, in practice, it is much easier for an analogue information object to carry the content and structure of a message forward in space and time because these are intrinsically linked to the information’s medium. For example, if information is inscribed on a stone tablet or a sheet of paper then the information’s content and structure will likely survive as long as the actual tablet or sheet of paper survives. [1]

Digital information objects, by contrast, are much more complicated to preserve and keep accessible over time because their relationship to their storage medium is much more ephemeral. The content and structure of a digital information object are not easily contained within a single physical object like a sheet of paper. Instead, the binary inscriptions of a digital information object are dependent on a complex chain of encodings and electrical components for rendering.

Taking some time to consider the anatomy of digital information objects is important for my research into archives access systems because archival materials in digital form are very popular with the users of these systems, due mainly to the ease with which they are retrieved, used and shared. Let’s use the early drafts of my thesis paper as an example. I am preparing it using the XP version of the Microsoft Word software application, which creates files in the .doc file format. Each time I open the archives_access_systems.doc file for further additions and revisions it has to travel and transform itself through a complex maze of hardware, operating system software and application software before it is made legible to my eyes against the light of my computer screen. A host of errors could occur en route which would leave the document illegible and would reduce the information to a meaningless string of electrical charges.

The perilous journey of a digital information object

At its root, the archives_access_systems.doc file is decomposed into an array of binary digits (1s and 0s called bits) which are represented on my laptop computer’s magnetic hard disk as either negative or positive polarity charges. The hard disk consists of several metal or glass platters, each of which is divided into thousands of separate clusters. Different portions of the archives_access_systems.doc file are strewn over hundreds of these clusters. Each bit is read by arms on the hard disk controller which move a polarity-detecting read/write head across the platters. The hard disk controller relies on the contents of a file allocation table to register the cluster locations of each file and to reassemble it into a linear stream of bits that corresponds to the archives_access_systems.doc file. Hard disk driver software is then used to move the bitstream as a string of electrical charges from the platters, through the motherboard circuits, to the input/output messaging subsystem and file system driver software. Read errors occur commonly at this stage because the head is extremely sensitive to dust, displacement and magnetic charges. Errors also occur when the file allocation table is corrupted or when there is no compatible driver software for the hard disk or the file system.
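The reassembly step can be pictured with a toy model. The sketch below is not how any real file system is implemented. It assumes a made-up allocation table in which each entry points to the next cluster of the same file, and all cluster numbers, contents and the read_file helper are hypothetical.

```python
# Toy model of a file allocation table: each entry maps a cluster number
# to the next cluster of the same file, or to END for the last cluster.
# All names and values are hypothetical and chosen only for illustration.
END = -1

# The "disk": cluster number -> raw bytes stored in that cluster.
disk = {
    7: b"arch",
    2: b"ives",
    9: b"_acc",
    4: b"ess\x00",  # last cluster is padded beyond the end of the file
}

# The allocation table: the file's clusters are scattered, not contiguous.
fat = {7: 2, 2: 9, 9: 4, 4: END}


def read_file(first_cluster, length, fat, disk):
    """Follow the cluster chain and return the file as one linear bytestream."""
    data = bytearray()
    cluster = first_cluster
    seen = set()
    while cluster != END:
        # A loop or a pointer to a missing cluster means the table is corrupted.
        if cluster in seen or cluster not in fat or cluster not in disk:
            raise IOError(f"corrupted allocation table at cluster {cluster}")
        seen.add(cluster)
        data += disk[cluster]
        cluster = fat[cluster]
    # Drop the padding in the final cluster.
    return bytes(data[:length])


bitstream = read_file(first_cluster=7, length=15, fat=fat, disk=disk)
print(bitstream)  # b'archives_access'
```

If a single table entry is changed so that it points to a cluster that does not exist, read_file raises an error instead of returning the bitstream, which is the toy equivalent of the corrupted allocation table described above.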

If the bitstream is successfully retrieved from the hard disk, the operating system sends it to the application software which is loaded and running in the random access memory circuits. The application software must recognize the header information that should be present at the beginning of the bitstream so that it can decode and render the string of binary digits using the proper layout and form. It must also be able to detect the character encoding that is used to represent the text content (e.g. ASCII or Unicode UTF-8) and convert it into legible symbols. Read errors occur commonly at this stage when the application software does not recognize the header information or character encoding because the file was created by another version or type of software and the application has not been programmed to recognize that particular file type or encoding scheme.
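Both failure modes are easy to reproduce in a few lines. The following is a minimal sketch, not a description of how Microsoft Word actually parses its files. It assumes only the well-known eight-byte signature of the OLE2 compound file container used by the legacy binary .doc format, and the identify_format and decode_text helpers are hypothetical names.

```python
# Signature found at the start of OLE2 compound files, the container
# used by the legacy binary .doc format.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def identify_format(bitstream: bytes) -> str:
    """Return a rough format guess based on the leading bytes only."""
    if bitstream.startswith(OLE2_MAGIC):
        return "ole2-compound-file"
    if bitstream.startswith(b"%PDF-"):
        return "pdf"
    return "unknown"


def decode_text(raw: bytes, encoding: str) -> str:
    """Turn raw bytes into characters under the stated encoding."""
    return raw.decode(encoding)


# Hypothetical file start: the signature plus padding, not a valid document.
sample = OLE2_MAGIC + b"\x00" * 8
print(identify_format(sample))       # ole2-compound-file
print(identify_format(b"garbage"))   # unknown

# The same bytes read under two different encodings.
raw = "café".encode("utf-8")
print(decode_text(raw, "utf-8"))     # café
print(decode_text(raw, "latin-1"))   # cafÃ©
```

The last line shows the quieter kind of error: the bytes are intact, but decoding them with the wrong encoding turns one character into two unrelated ones without any warning.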

Finally, the application software sends display commands through the RAM circuits to the operating system software which passes them to display driver software. This driver software sends commands through the motherboard to the graphic display circuit card which then sends electrical signals to the computer monitor that turn specific cells in a grid (pixels) on and off to display text, graphics or other parts of the document. Read errors occur at this stage when the display commands sent by the application software and the display driver software are not compatible with the monitor hardware. [2]

If a digital information object falls in the forest…

In truth, the archives_access_systems.doc file that eventually reaches my eyes only exists as a Microsoft Word document at the logical level. It is not possible to say with full confidence, for example, that the digital information object exists physically on the hard disk clusters, in the RAM circuits that operate the application software, or in the electrical charges of the computer monitor. Each is integral in giving the information object its structure but none of them is its structure. This anomaly has led Ken Thibodeau to conclude that it is not possible to preserve a digital information object; it is only possible to preserve the ability to reproduce it. [3] This raises important issues about verifying the authenticity of digital information objects when they are reproduced at some later place and time on some other computing platform. There are, in fact, a myriad of other critical digital preservation issues that affect access to archival materials in digital formats. For example, the contextual information and background knowledge that are necessary to add meaning to the digital information objects are also stored and linked using digital formats and tools. Therefore, the relationship of a digital information object to its context is just as fragile as its relationship to its content and structure.
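One narrow part of the verification problem, checking that a reproduced bitstream is bit-for-bit identical to the one originally stored, is commonly handled with a checksum. The sketch below is an illustration of that general technique, not something drawn from Thibodeau's report. The fixity helper and the sample content are hypothetical, and a matching checksum says nothing about whether the object is rendered faithfully or is authentic in the archival sense.

```python
import hashlib


def fixity(bitstream: bytes) -> str:
    """Return the SHA-256 digest of a bitstream as a hex string."""
    return hashlib.sha256(bitstream).hexdigest()


original = b"draft of the thesis paper"   # hypothetical content
recorded = fixity(original)               # digest recorded when the object is stored

copy_ok = bytes(original)                 # bit-identical copy on another platform
copy_bad = original.replace(b"thesis", b"theses")  # copy with a small alteration

print(fixity(copy_ok) == recorded)        # True
print(fixity(copy_bad) == recorded)       # False
```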

Digital preservation is not my problem, man

However, the problems of digital preservation and their possible solutions are outside the scope of my PhD research on archives access systems. [4] For those projects that are addressing digital preservation issues [which I do in my day job as a digital archives consultant], the ISO Open Archival Information System (OAIS) standard is typically used as the de facto best practice to guide the implementation of strategies and systems. [5] The OAIS divides an archival information system into functional components that address ingest (capture), administration, archival storage, data management, preservation planning, and access. Archives access systems can be characterized as the access sub-systems of OAIS-based archival information systems or as stand-alone systems that interface with the archival storage and data management components of OAIS-based systems. [6] Therefore, one of the key assumptions in this research, and in the design of archives access systems in general, is that archival materials in digital format have been preserved and are available for retrieval.

I have seen the future and it is… digital

Of course, archives access systems are not only concerned with archival materials in digital formats. They must be capable of providing access to archival collections in both analogue and digital formats. In fact, a very significant majority of archival materials currently preserved in collections around the world are in paper and other analogue formats. However, over the coming decades that ratio will change as more analogue materials are digitized for online access and, in particular, because new information will increasingly be created in digital format. For example, a study conducted at the University of California concluded that 5 exabytes of new information was produced in 2002. [7] The authors estimate that this is equal to all the words ever spoken by human beings up until 2002 or 37,000 times the contents of the Library of Congress’ holdings in 2002 (estimated to contain 136 terabytes of information). [8] Of particular note was that 92% of the 5 exabytes of information produced in 2002 was in digital format and the total amount of new information grew about 30% per year between 1999 and 2002.
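The Library of Congress comparison can be checked with simple arithmetic. The sketch below uses only the two figures quoted above (5 exabytes and 136 terabytes). Which unit convention the study's authors used is an assumption on my part, so the ratio is computed both with decimal prefixes and with the 1024-based prefixes given in footnote [8].

```python
# Unit sizes in bytes under the two common conventions.
TB_DECIMAL = 10**12
EB_DECIMAL = 10**18
TB_BINARY = 2**40   # 1024-based, as in footnote [8]
EB_BINARY = 2**60

new_info_eb = 5     # new information produced in 2002, in exabytes
loc_tb = 136        # estimated Library of Congress holdings, in terabytes

ratio_decimal = new_info_eb * EB_DECIMAL / (loc_tb * TB_DECIMAL)
ratio_binary = new_info_eb * EB_BINARY / (loc_tb * TB_BINARY)

print(round(ratio_decimal))  # 36765
print(round(ratio_binary))   # 38551
```

The quoted figure of 37,000 matches the decimal result after rounding, and the 1024-based result is within about five percent of it, so the comparison holds under either convention.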

There is no doubt, therefore, that the digital information age has arrived and that it will affect all areas of human endeavour, including how we record and access the information that we use as memory aids and proxies for past events (i.e. archival materials). Without minimizing the difficulty of preserving digital information objects over the long term, digital technologies offer great promise for improving access to archival materials and their ability to communicate information, knowledge, experience and memory. In less than half a decade, a billion individuals worldwide have learned to become online users, consumers and producers. [9] On a daily basis they are using the World Wide Web to work, learn and play. They are reading the news, banking, seeking medical information, purchasing consumer goods, meeting new people, playing games, and pursuing a host of other activities including historical research. [10]

In fact, there is an incredible demand for and interest in online access to archival materials, particularly those that are digital information objects (whether born-digital or digitized). Therefore, I am looking forward to the point in my PhD research when I can get past the literature reviews and writing definitions so that I can begin prototyping some of the new technologies and online practices that can enhance archives access systems…

————

[1] Of course, sufficient contextual information and background knowledge are still required to decode and understand the message. For example, ancient Egyptian hieroglyphics inscribed on stone tablets survived for centuries but hieroglyphic encoding was not legible until the discovery and translation of the Rosetta Stone in 1799 and 1822 respectively.

[2] Additionally, all of the components and processes that are used to retrieve and display a digital information object are dependent on a steady supply of electricity, at a specific voltage, for the information to be retrieved and communicated. This is a fundamental requirement that cannot always be taken for granted. The majority of the world’s population, for example, does not have access to reliable electricity, let alone computing equipment or the knowledge to operate it.

[3] Thibodeau, Ken. “Preservation Task Force: Final Report” The Long-term Preservation of Authentic Electronic Records: Findings of the InterPARES Project (InterPARES Project, 2002), p.5. [last accessed on January 31, 2007] Interestingly, this principle might have some implications about how we think about memories stored in the brain. It turns out that ‘neural information objects’ have a similar problem with their storage media because “nearly all of the brain’s molecules, including those that form the neural connections, are replaced every week or two.” [Furlow, Bryant. “You Must Remember This” New Scientist (2308: 15 September 2001)]. Yet our memories, at least those we can access, continue to exist on a logical and conceptual level when we recall them.

[4] On the whole, the ability to preserve and provide access to digital information objects over the long term is threatened by fragile storage media and rapid technological change that leads to incompatible hardware, software, and file formats as well as the lack or loss of contextual metadata. Also, when considering the large volumes of digital information that are generated by modern organizations on a daily basis, the other critical factor is the lack of assigned responsibilities and resources to strategically implement the organization-wide processes and systems that are necessary to protect and maintain digital information objects over the long term.

[5] International Organization for Standardization. ISO 14721 — Open Archival Information System – Reference Model (2003).

[6] In fact, archives access systems, as I characterize them in my research, are a combination of what the OAIS refers to as ‘access’, ‘access aids’ and ‘access software.’ Access is defined as “the OAIS entity that contains the services and functions which make the archival information holdings and related services visible to Consumers [e.g. users]”. Access aid is defined as “a software program or document that allow Consumers to locate, analyze, and order Archival Information Packages of interest.” Access software is defined as “a type of software that presents part of or all of the information content of an Information Object in forms understandable to humans or systems.” [International Organization for Standardization. ISO 14721 — Open Archival Information System – Reference Model (2003), p.1-7.]

[7] Lyman, Peter and Varian, Hal. How Much Information? 2003 (University of California, 2003) [last accessed on January 31, 2007]

[8] One Exabyte is 1024 Petabytes. One Petabyte is 1024 Terabytes. One Terabyte is 1024 Gigabytes. One Gigabyte is 1024 Megabytes. One Megabyte is 1024 Kilobytes. One Kilobyte is 1024 Bytes. One Byte is eight binary digits (bits).

[9] Internet World Stats. Internet Usage Statistics – The Big Picture (Miniwatts Marketing Group, 2007) [last accessed on January 31, 2007].

[10] Of course, it is important to remember that although there are now an estimated one billion Internet users, that is only 17% of the total population of 6.5 billion people worldwide. Nevertheless, there are signs that the digital divide is closing. The World Bank reports that between 2000 and 2005 the number of Internet users in developing countries grew from 15 to 67 per 1000 people and the number of mobile phone users grew most dramatically from 46 to 258 per 1000 people. Note that these figures are per 1000 people. Given the sheer size of the population in developing countries, this means that 41% of the 1 billion Internet users are in fact from developing countries. Global Information and Communication Technologies Department, The World Bank Group. Information and Communications for Development 2006: Global Trends and Policies. (The World Bank Group, 2006), p.5 [last accessed on January 31, 2007]

Originally published on February 12, 2007 at archivemati.ca