Writing in the Sand: The Need for Ultra-Robust Digital Archiving

Steven Dutch, Natural and Applied Sciences, University of Wisconsin - Green Bay


I was just reading an article that suggested that, if we could store a bit of information on a single atom, a few hundred kilograms of carbon would suffice to store all the information generated by the human race for a century. We would "never" have to delete anything again. The problem would be searching and retrieving information.

Well. Geologists like me have a distinctly different take on "never" than most people. A million years from now the earth will probably look basically the same from space as it does today, apart maybe from glacial advances and retreats. Will anyone still remember our civilization? In that million years, will our hypothetical data repository never be hit by a lightning strike, a large electrical surge, a fire, a flood, or a cultural upheaval? I have my doubts. And even if you back the data up, the chances of all the backups being destroyed in a million years are pretty large. (If the archive has a 99.9 per cent chance of surviving 100 years, its chances of surviving a million years are 0.999^10,000, or about 0.000045.) And then there's the question of whether anyone will remember the technology needed to access the data. Given enough time, even highly unlikely events become virtually certain.
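The survival arithmetic is easy to check. Here is a minimal Python sketch, assuming every century is an independent trial with the same odds and that backup copies fail independently of one another (both are simplifications for illustration, not claims about real archives). The function names are made up for this example.

```python
def survival_probability(p_per_period: float, periods: int) -> float:
    """Chance of surviving `periods` consecutive, independent periods."""
    return p_per_period ** periods


def any_copy_survives(p_single: float, copies: int) -> float:
    """Chance that at least one of `copies` independent archives survives."""
    return 1.0 - (1.0 - p_single) ** copies


# 99.9 per cent per century; a million years is 10,000 centuries.
one_archive = survival_probability(0.999, 10_000)
print(f"one archive, a million years: {one_archive:.6f}")

# Even ten independent backups leave the odds well under one in a thousand.
ten_archives = any_copy_survives(one_archive, 10)
print(f"ten independent archives:     {ten_archives:.6f}")
```

Under these assumptions, piling on backups helps only linearly while time hurts exponentially, which is the point of the paragraph above.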

The internet was originally developed as a communications system that could survive a nuclear war. It has become a vast data repository for the human race, where valuable data is archived and preserved. Paradoxically, the spread of digital storage has also made civilization increasingly vulnerable to catastrophic data loss.

As an illustration of the issues, consider the urban legend that all the blueprints for the Saturn V rocket used in the Apollo Program have been destroyed or lost. (This, of course, is to prevent the public finding out that the rocket was really a mailing tube full of Fourth of July sparklers.) Well, for openers, the story is false; the paper blueprints may be mostly gone but microfilms of them still exist. Second, there are extant Saturn V's at Huntsville, Alabama and Cape Canaveral, Florida. Finally, we wouldn't go to the moon on a Saturn V today, anyway, because technology has advanced beyond that level. On the other hand, we are in imminent danger of losing large masses of old data because the magnetic tapes the data are stored on have deteriorated. There are no existing copies of many early television programs, and even a number of films have been lost or soon will be as their film disintegrates. And we've only been recording data photographically for a bit over 150 years, and electronically for less than 100.

I have about 400 Gb of external storage and I can fit it all in a small box. This is most cool. Give me a terabyte and I'll probably be fixed for life. But I'm paranoid. I'd like to be sure that the useful things I publish will continue to be available after I retire, after I die, and as long as anyone considers them useful. I'd like all the images from our planetary missions to be preserved even if we cease space exploration, even if there's a catastrophic global economic and social collapse, even if we revert to medieval technology and have to rediscover science all over again. Because on a time scale of thousands of years, those things are nearly certain.

The medium must have an expected lifetime of centuries

The media now used for most digital storage are extremely ephemeral. Magnetic media have expected lifetimes of years. Recordable CD's and DVD's depend on a laser burning marks into light-sensitive dyes, which are not chemically stable over more than a few decades. Careful storage can extend those lifetimes. Although floppy disks have a nominal lifetime of a few years, I have disks over twenty years old that I can still read. Nevertheless, recording on any such media is writing in sand. Corrosion-resistant metal, ceramics, and glass are among the materials known to have lifespans in the century or millennium range. Plastics probably do not have the desired chemical and physical stability, however indestructible they may seem to an environmentalist.

I have rocks in my lab that have not changed in three billion years. That's my idea of stability. (But anything will be stable if it's encased deep inside something, as those rocks were for most of their history. If you engraved microdata on a polished surface of the rock and left it exposed to the weather in the open, it could have a lifetime of centuries to millennia, a lot less if lichens or freezing and thawing got at it.)

The medium must be extremely resistant to physical destruction

Plastics of any kind are excluded by this requirement. Corrosion resistant hard metal, glass, and ceramics are among the materials that meet this criterion. One of the media known to have a lifetime of centuries is paper, if made with acid-free methods. However, paper is combustible, not dimensionally stable, and susceptible to chemical and biological attack (it rots).

The nastiest environment for long term storage I can think of is a wet basement. Imagine we place our data archive in a vault, but thanks to funding cuts, maintenance is curtailed and finally ended. Tree roots invade and crack the ceiling and water starts dripping in. We need that archive to be not only immune to damage by fire and electrical surges, but resistant to corrosion in a damp setting. It not only has to be water resistant, it has to be resistant to whatever is likely to be dissolved in the water.

The medium must be immune to electronic disruption

Electromagnetic pulse damage is a potentially disastrous threat to existing computers and data storage. Powerful magnetic fields can erase magnetic media. Robust digital archives must be immune to any such attacks. Not resistant, immune. Any particle or electromagnetic attack, short of something powerful enough to destroy the storage medium physically, must leave the archival media unaffected.

The archive must keep functioning even in hostile environments

Consider our leaky vault above. Not only must our archive medium survive in that environment, it has to keep working. Wet or dry, soaked by an electrically conductive ionic solution (because water will have ions dissolved in it), covered with mold, invaded by tree roots. And the connections have to remain intact.

The archive must be indefinitely expandable

The purpose of a digital archive is the storage of large amounts of data, so it must be possible to add storage without significant effort or cost. If new methods are developed for data archiving, they must be backward compatible with older methods.

The archive itself must consume no energy

Obviously, if the archive consumes energy, someone will have to pay the bills. It may become financially attractive to discard part or all of the archive, or at least shut it down if it is not accessed for a long time. That means that access will only be possible if someone reactivates the archive, which in turn will mean that the prospective user may have to locate the manager in charge of the archive. Access to the archive must be completely passive. It must be possible for an outside reader to read the data without any energy expenditure or any other action on the part of the archival site. Whatever energy is involved in reading the archive is supplied by the reader. Think of an older land line telephone (no displays or lights). It consumes no energy just sitting there, yet an outside signal can locate it.

The archive must cost the host nothing except space

Even physical space is expensive to maintain, an important reason why libraries are increasingly taking advantage of digital storage. Compact digital archives can reduce the space requirements dramatically. For long term digital archiving, it is essential that the cost to the host be as nearly zero as possible. If the medium is sufficiently durable and resistant to damage, it should require no maintenance, and should also not require a special environment for protection. Furthermore, if the medium requires no energy of its own, it will cost the host nothing to keep connected. The only needs for such an archive would be a reasonably protected environment and periodic inspection. (And, as noted above, if the system breaks down, the archive survives and keeps functioning.)

The archive must be readable without the use of moving parts

Moving parts are the weak spot of all machines. Floppy disks wear out because contact with the reader and the protective sleeve wears out the magnetic medium. Hard drives wear out because the disk shaft wears to the point where alignment of the reader with data tracks becomes unreliable. Data archived on hard drives now must be physically transferred to new drives periodically as the old ones age or become obsolete, a continuing expense for the archival site, and one the archive may eventually choose not to bear.

Robust digital archiving must require no moving parts. Once installed and connected to a network, it should be possible to access it indefinitely without intervention by the archivist. The archive should never need maintenance, never need replacement.

The archive must be both physically and electronically readable

In the event that the archive becomes electronically unreadable, because electronic communications break down or the reading technology is lost, it must be possible to recover the data by other means. Optical and mechanical reading are the most likely backup methods. Since these are not expected to be primary access modes, they may involve the use of moving parts to scan the archive.

Below, I make a guesstimate that any viable archive material would be rather low density compared with today's digital storage, say 100 Mb in a 10 x 10 cm area. That's a megabyte per square centimeter, about the information content of a digital picture. So why not a coding system that digitally encodes the data in a format that is also readable by eye? You would not only record the image digitally, you could also look at the medium and see it in thumbnail form. Or why not a system that digitally encodes text in a form readable on a microfiche reader?

The archive must make use of error correction code

Error correction codes vastly extend the lifetime of digital information. They add redundant bits to the data so that the original can be reconstructed if a bit is corrupted, and stronger codes can repair several damaged bits at once.
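To make the idea concrete, here is a small Python sketch of the classic Hamming(7,4) code, which stores four data bits in seven and can repair any single flipped bit. It is an illustration of the principle, not a proposal for what a real archive would use; a real archive would presumably want a much stronger code.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits as 7 bits laid out p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Repair at most one flipped bit, then return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The three checks, read as a binary number, give the 1-based
    # position of the damaged bit (0 means no damage was detected).
    position = s1 + 2 * s2 + 4 * s3
    if position:
        c[position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


original = [1, 0, 1, 1]
stored = hamming74_encode(original)

damaged = list(stored)
damaged[4] ^= 1  # simulate one corrupted bit on the medium

print("stored   :", stored)
print("damaged  :", damaged)
print("recovered:", hamming74_decode(damaged))
```

The price is storage: seven bits on the medium for every four bits of data, which fits the argument later on that security supersedes compactness.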

In the event of damage, all data not physically destroyed must be readable and interpretable without the use of special recovery methods

Anyone who has ever had to resort to elaborate and expensive file recovery software or services to recover data can appreciate the need for this criterion. Probably the worst single design flaw in contemporary digital storage is the reliance on File Allocation Tables, which render data difficult or impossible to access if corrupted. Robust archival storage may use file allocation tables for rapid access, but damage to the table should in no way restrict access to the data.
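One way to picture storage that does not depend on a table is to make every record describe itself. The Python sketch below is purely hypothetical: the marker bytes, the record layout and the function names are all invented for illustration. Each record carries its own name, length and checksum, so a reader can find everything that survives by scanning the medium from end to end, and a damaged record costs only itself.

```python
import struct
import zlib

MAGIC = b"\xAA\x55REC"  # hypothetical sync marker that starts every record


def pack_record(name: str, payload: bytes) -> bytes:
    """Lay out one record: marker, name length, name, size, payload, CRC."""
    name_bytes = name.encode("utf-8")
    body = (struct.pack(">B", len(name_bytes)) + name_bytes
            + struct.pack(">I", len(payload)) + payload)
    return MAGIC + body + struct.pack(">I", zlib.crc32(body))


def _try_parse(blob: bytes, start: int):
    """Parse one record body at `start`; return None if it is damaged."""
    try:
        (name_len,) = struct.unpack_from(">B", blob, start)
        (size,) = struct.unpack_from(">I", blob, start + 1 + name_len)
        payload_start = start + 1 + name_len + 4
        payload_end = payload_start + size
        (crc,) = struct.unpack_from(">I", blob, payload_end)
    except struct.error:  # record runs off the end of the medium
        return None
    if zlib.crc32(blob[start:payload_end]) != crc:
        return None
    name = blob[start + 1:start + 1 + name_len].decode("utf-8", errors="replace")
    return name, blob[payload_start:payload_end], payload_end + 4


def scan_records(blob: bytes):
    """Recover every intact record by scanning for markers; no index needed."""
    found = []
    pos = blob.find(MAGIC)
    while pos != -1:
        parsed = _try_parse(blob, pos + len(MAGIC))
        if parsed is None:
            pos = blob.find(MAGIC, pos + 1)  # skip the damaged record
        else:
            name, payload, end = parsed
            found.append((name, payload))
            pos = blob.find(MAGIC, end)
    return found


archive = (pack_record("a.txt", b"first file")
           + pack_record("b.txt", b"second file")
           + pack_record("c.txt", b"third file"))

# Damage one byte in the middle record; its neighbors remain readable.
hurt = bytearray(archive)
hurt[archive.index(b"second file")] ^= 0xFF
print(scan_records(bytes(hurt)))
```

A table could still sit alongside records like these for speed, but losing it would cost only time, not data.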

The archive must be ultra-resistant to data corruption

A single bad bit in a .jpg image can render the image inaccessible. Compressed data is far more vulnerable to corruption than non-compressed data because any data corruption can potentially affect a large amount of encoded data. In my own experience, compressing data is hardly worth the bother unless the compression is by an order of magnitude or more. If we're serious about archiving for the ages, it's worth not compressing at all.
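The difference is easy to demonstrate. The Python sketch below flips one bit in a plain copy of some text and one bit in a zlib-compressed copy of the same text. The sample sentences and helper names are just for illustration, and zlib stands in for compression schemes in general. I have deliberately not hard-coded what happens to the compressed copy, since that depends on which bit is hit, but a flip in the middle of the stream will normally make the whole thing fail to decompress.

```python
import zlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of `data` with a single bit inverted."""
    damaged = bytearray(data)
    damaged[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(damaged)


def try_decompress(blob: bytes):
    """Return the decompressed bytes, or None if zlib rejects the stream."""
    try:
        return zlib.decompress(blob)
    except zlib.error:
        return None


def count_differing_bytes(a: bytes, b: bytes) -> int:
    """Number of byte positions that differ, plus any length mismatch."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


text = (
    "The media now used for most digital storage are extremely ephemeral. "
    "Magnetic media have expected lifetimes of years. "
    "Careful storage can extend those lifetimes. "
    "Nevertheless, recording on any such media is writing in sand."
).encode("ascii")
packed = zlib.compress(text)

raw_damaged = flip_bit(text, 8 * (len(text) // 2))
packed_damaged = flip_bit(packed, 8 * (len(packed) // 2))

print("uncompressed copy: bytes changed =",
      count_differing_bytes(text, raw_damaged))

result = try_decompress(packed_damaged)
if result is None:
    print("compressed copy: unreadable")
else:
    print("compressed copy: bytes changed =",
          count_differing_bytes(text, result))
```

The plain copy loses one character. The compressed copy typically loses everything unless someone brings special recovery tools to bear.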

Resistance to corruption also means resistance to attack. For the most part the media should be read-only, with no physical possibility of altering contents from without. Ideally, the data recording process should be a physically different process from the reading process, so that it is physically impossible for an access device, however maliciously programmed, to alter the data. For example, you record data on thin foil by punching holes with a laser. You read the data by scanning it optically, but the optical scanner lacks the power to affect the foil, so you can hack the scanning software forever without corrupting the data.

As an example, consider a book. Created by printing, accessed optically. You can deface a book physically, but no amount of reading it will change its contents. Put it in a museum case under guard, and it's absolutely immune to alteration by the reader.

Security supersedes compactness

In order to achieve extreme robustness plus the ability to scan the medium by other means if electronic reading fails, the physical size of bits in the storage medium must be large by the standards of some present and prospective future storage media. The data density probably cannot exceed that of a present day optical disk and is plausibly less, say 100 Mb in a sheet 10 cm square. If the medium is 1 mm thick, ten sheets give us 1 Gb in a volume of 1 x 10 x 10 cm, and one Tb will require 10 m of shelf space. In volume terms, one Gb is 100 cubic centimeters, and 10 Tb is a cubic meter.
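For anyone who wants to try other assumptions, the shelf-space estimate can be written as a few lines of Python. The constants are just the guesses from the paragraph above (100 megabytes on a sheet 10 cm square and 1 mm thick, decimal units throughout); the names are invented for this sketch.

```python
MB_PER_SHEET = 100       # guess: 100 megabytes per sheet
SHEET_SIDE_CM = 10       # each sheet is 10 cm x 10 cm
SHEET_THICKNESS_MM = 1   # and 1 mm thick

GB = 1_000               # sizes are expressed in megabytes
TB = 1_000 * GB
PB = 1_000 * TB


def sheets_needed(megabytes: float) -> float:
    """Number of sheets required to hold the given amount of data."""
    return megabytes / MB_PER_SHEET


def shelf_metres(megabytes: float) -> float:
    """Length of shelf taken up if the sheets are stacked like books."""
    return sheets_needed(megabytes) * SHEET_THICKNESS_MM / 1000


def volume_cubic_metres(megabytes: float) -> float:
    """Solid volume of the stack of sheets."""
    sheet_volume = (SHEET_THICKNESS_MM / 1000) * (SHEET_SIDE_CM / 100) ** 2
    return sheets_needed(megabytes) * sheet_volume


for label, size in [("1 Gb", GB), ("1 Tb", TB), ("10 Tb", 10 * TB), ("1 Pb", PB)]:
    print(f"{label:>6}: {shelf_metres(size):10.2f} m of shelf, "
          f"{volume_cubic_metres(size):10.4f} cubic meters")
```

Changing MB_PER_SHEET or the thickness shows how sensitive the shelf space is to the density guess.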

The need for infallible data recovery further limits compactness. Ideally, data should be uncompressed for maximum survival, but if compressed formats are used, there need to be redundant copies plus error-correction mechanisms, and the means of decoding the data must be included with the data itself.

We can't store everything

It turns out there's a fair amount of interest in knowing how much information there is in the world. According to How Much Information? (School of Information Management and Systems at the University of California at Berkeley), the total print holdings at the Library of Congress amount to 10 Tb, the database of NOAA is 400 Tb and 3 years of EOS satellite data is a petabyte (a million Gb). The total information capacity of all telephone calls in a year is about 17 exabytes, although the actual information is a lot smaller. It has been estimated that the total information in every word ever spoken by human beings is about 5 exabytes.

Using the volume assumption above, a petabyte would require 10 km of shelf space (100 cubic meters) and an exabyte 10,000 km. But the vast majority of the information in the world is trivial and ephemeral. Physicist John Baez calculated that the total information needed to completely specify the quantum state of all the atoms in a single raindrop is 500 exabytes. But nobody ever needs to do that, even if it were physically possible. The mass of the raindrop, its velocity and impact point would probably be all anyone would ever need.

Nature doesn't care about archiving in the least. The history of the earth is preserved in pathetic fragments. Every time it rains, billions of exabytes of information are lost as raindrops splatter on the ground.

Text information would be relatively trivial to archive. The real capacity hog is visual information. We'd have to restrict the most robust archiving to things we would want to endure for centuries. Still, that's a lot. We'd want to archive all our books, periodicals, radio and TV programs and movies, plus academic archives. But we're not going to store everyone's e-mail and telephone conversations.

Security supersedes cost

Ultra-robust storage media will be more expensive than conventional media, although economies of scale will certainly reduce the disparity. The higher initial cost will be offset by the extra security, the reduced need to back up data, and the reduced need for maintenance and power.

Security supersedes property rights

Some of the criteria developed here will be opposed by owners of copyrights because they would make it difficult or impossible to copy protect or encrypt data. Cry me a river. The data that would primarily be archived would be public domain anyway. Copyrighted material could be archived on robust media and then made accessible once the copyrights expired.

A simple solution would be to amend copyright laws so that copyrights entail archiving. If you don't want it archived, keep it in a closet.

Security Supersedes Ideology

We archive the Bible, the Koran, the Book of Mormon, the writings of Luther, St. Augustine, Maimonides, Mary Baker Eddy and Joseph Smith. And we archive the Satanic Bible, the writings of Robert Ingersoll, Richard Dawkins and Madalyn Murray O'Hair.

We archive Das Kapital and the Communist Manifesto. And we archive Mein Kampf and The Protocols of the Elders of Zion. We archive the Blue Book of the John Birch Society and the leftist forgery of Rigoberta Menchu.

We archive the works of Einstein, Pascal, Newton and Carl Sagan. And we archive the writings of Immanuel Velikovsky, Charles Berlitz and Ignatius Donnelly.

Art. All of it. Rembrandt and dogs playing poker. Music. All of it. Beethoven and the Sex Pistols. Film and TV. All of it.

I'm not advocating a "teach the controversy" approach here. We can certainly archive enough commentary material to enable future researchers to sort out conflicting claims. And it is not our job to protect the future from itself. The Star Trek scenario of a civilization discovering a book about Chicago gangs of the 1920's and modeling itself along those lines was pretty far-fetched.

Sexually explicit materials? Why not? Presumably that's a biological drive that won't evolve out of style. If it does, archiving for the future will be pointless anyway. Meanwhile, future researchers might be very interested to see how standards of attractiveness change over time. Picture Rubens' women and compare them with modern supermodels. They might also be interested in our pathologies.

Classified material? Absolutely. We can restrict access for the standard blackout periods easily enough. Imagine the history that was lost when all the records of World War II cryptographic computing were destroyed.

Access to the Archive should never become an entitlement. Some material needs to be restricted because of property rights, age appropriateness or national security. It's perfectly legitimate to require users to register and have a secure log on. There's plenty of open internet for those who don't want to register.

Bottom line: Nobody gets to censor the Archive. Period. For any reason. If the future decides to censor the archive or destroy certain things, that will be their crime against their future. Considering all the ways we behave criminally irresponsibly toward the future, it will be nice to point to one thing and say that, here at least, we got it right.

We Don't Need Public Internet Service

We need public servers. Preferably redundant. That way useful material can be archived. We don't have to worry about the Wayback Machine or Project Gutenberg shutting down, we don't have to worry about researchers retiring and having their sites taken down, or professors having their course notes lost. (Don't you think historians in a couple of hundred years will be interested in how Early Digital colleges taught their courses?) And best of all, with public servers, there will be no need for advertising to pay the bills. No popups or adware.

Nothing wrong with public internet service, but public servers are a more basic need.

So what can we conclude?

The archiving medium has to be ultra-tough mechanically and chemically. It seems hard to imagine anything other than some super-resistant metal for this. The stuff we use in airline flight recorders is a candidate.

The only physical mechanism robust enough for accessing the archive is electrical. Metals never lose their conductivity. If we envision an optical scanner, the light sources will eventually die and crud will eventually make the archive unreadable. Mechanical methods are even worse; the moving parts will wear out quickly.

Obviously, whatever wires and connections we use will have to be utterly impervious to corrosion. The insulation has to be utterly impervious to breakdown. I'm talking about surviving a Dark Age impervious.

Since the data creation mechanism will be separate both physically and in kind from the reading process, all the archive will be ROM, and we mean read-only. The protection is not just a software setting; it is physically impossible for anyone accessing the media to change its contents. The originator writes the data to be archived, sends it to the archive, the archivist plugs it into a slot, and forgets it. Or the data is sent to the archive, and when they have enough to justify writing to the archive, they generate the archival media, plug it in, and forget it.

Since the archival media will be completely passive, it seems hard to picture anything other than a card-like material. Say, some ultra-tough metal foil with holes punched by laser. We probe it with weak electrical currents on all sides and use the output currents to deduce the location of the holes. Or we make a sandwich with some dielectric material (has to be ultra-stable) and sense variations in capacitance?

I frankly can't see any value in data compression here. Data compression risks the loss of large amounts of data if the data is corrupted or the knowledge of the compression scheme is lost. Uncompressed data can be at least partially recovered. And the physical saving of space just isn't enough to offset the problems.


How Would We Code The Data?

I'm assuming here that the data are both electronically and visually readable. That means if electronic access fails, we can scan the data with a microscope and read it. We want to be able to read the data even if (especially if) high technology fails completely. After all, we had microscopes 400 years ago, so we can expect knowledge of microscopes to survive much longer than electric power.

Typical resolution of a good microscope is about 0.2 microns or 5000 dots per millimeter. So bits can be as small as 0.2 microns, but it's more complex than just counting bits. Bits 0.2 microns in size mean 25 million bits per square millimeter, or about 3.1 megabytes.

ASCII text files record characters as numbers stored one per byte (eight bits, so values from 0 to 255 are possible, though standard ASCII itself uses only 0 to 127). But we don't want some post-apocalyptic researcher to have to decode binary bits. We'd like to have readable characters. Now way back in the early days of PC's (when you actually had to get up and walk across the room to change channels! It was barbaric), the only way to get text into a drawing was to code it as a series of numbers and POKE (a BASIC command) the numbers into the memory locations corresponding to where you wanted the text. Each bit of the binary number corresponded to a dot in a character. It was possible to generate pretty presentable text using eight bytes, corresponding to an 8 x 8 array of dots. I created codes for Cyrillic, Greek and even Arabic. So as an order of magnitude approximation, we can guess that readable text can be coded with eight bytes or 64 bits, corresponding to 1.6 microns square. A square millimeter could contain a bit less than 400,000 characters (note that this is in dot-matrix form, eight bytes per character rather than the one in ASCII code). The Bible, with around 3.5 million letters, would take 28 million bytes, or about 3 millimeters square. Not quite the head of a pin but respectable.
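Here is a rough Python sketch of that dot-matrix idea, along with the capacity arithmetic. The eight bytes for the letter "A" are a glyph I made up for illustration (any 8 x 8 font would do), and the density figures are simply the 0.2 micron estimate from above.

```python
# Hypothetical 8 x 8 glyph for "A": one byte per row, one bit per dot.
GLYPH_A = [0x18, 0x24, 0x42, 0x42, 0x7E, 0x42, 0x42, 0x00]


def render_glyph(rows):
    """Turn eight row bytes into eight strings; the high bit is the left dot."""
    return ["".join("#" if byte & (0x80 >> col) else "." for col in range(8))
            for byte in rows]


for line in render_glyph(GLYPH_A):
    print(line)

# Capacity, using the microscope-limited dot size of 0.2 microns.
DOTS_PER_MM = 5000                    # 1 mm / 0.2 microns
BITS_PER_MM2 = DOTS_PER_MM ** 2       # 25 million dots per square millimeter
BITS_PER_CHAR = 8 * 8                 # one 8 x 8 glyph

chars_per_mm2 = BITS_PER_MM2 // BITS_PER_CHAR
bible_letters = 3_500_000
bible_mm2 = bible_letters * BITS_PER_CHAR / BITS_PER_MM2

print(f"characters per square millimeter: {chars_per_mm2:,}")
print(f"area for the Bible: {bible_mm2:.2f} square millimeters "
      f"({bible_mm2 ** 0.5:.2f} mm on a side)")
```

What lands on the medium is the pattern of dots itself, so a person with a microscope and a machine reading bits are both looking at the same thing.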

Images, to be readable without a computer, have to be coded as bitmaps. Every pixel in the image has to be individually represented. Monochrome line images could simply use single bits for black or white. A most respectable drawing 5000 by 5000 pixels could be shrunken to a millimeter square. (That same space would hold 400,000 letters in dot-matrix form. A picture really is worth a thousand words.)

But complex images, like color drawings or photographs, require much more storage space. We commonly code colors in terms of 8 bits for each color channel: red, green, and blue. That means we need 24 bits per pixel. However, to be visually readable, we can't use simple binary coding. Intensity levels 1, 2, 4, 8, 16, 32, 64 and 128 for each color would all look the same: a single bit, differing only in position. Instead we'd want a coding where few bits meant one extreme of the intensity scale and more bits meant the other. Color codes might run 00000000, 00000001, 00000010, ... 10000000, 00000011, ... 11111110, 11111111. Viewed this way the image would at least be recognizable as a monochrome image. We'd need one byte per color. A typical digital photo 3200 by 2400 pixels would be 7.7 million pixels times 3 or 23 million bytes. Since a square millimeter corresponds to 3.1 megabytes, we have about 7.4 square millimeters or about 3.1 by 2.4 millimeters, or a bit less than one percent of the area of a 35 mm slide. We're talking roughly the information density of microfiche, but electronically as well as visually readable. These numbers make sense; after all, the readability of actual microfiche is limited by optical resolution.
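One way to get a code that runs in that order is to sort all 256 byte patterns by how many bits are set, breaking ties by ordinary numeric value. The Python sketch below does that and also checks the photo arithmetic. The ordering rule is my reading of the sequence listed above, not a worked-out standard, and the names are invented for the example.

```python
def build_visual_codes():
    """All 256 byte patterns, ordered by number of set bits, then by value."""
    return sorted(range(256),
                  key=lambda pattern: (bin(pattern).count("1"), pattern))


VISUAL_CODES = build_visual_codes()  # index = intensity, value = dot pattern
INTENSITY_OF = {pattern: level for level, pattern in enumerate(VISUAL_CODES)}


def encode(intensity: int) -> int:
    """Dot pattern to write on the medium for an intensity from 0 to 255."""
    return VISUAL_CODES[intensity]


def decode(pattern: int) -> int:
    """Intensity that a scanned dot pattern stands for."""
    return INTENSITY_OF[pattern]


for level in (0, 1, 8, 9, 128, 254, 255):
    print(f"intensity {level:3d} -> {encode(level):08b}")

# Area of a 3200 x 2400 photo at one byte per color channel,
# with 25 million bits (0.2 micron dots) per square millimeter.
BITS_PER_MM2 = 5000 ** 2
photo_bits = 3200 * 2400 * 3 * 8
photo_mm2 = photo_bits / BITS_PER_MM2
print(f"photo area: {photo_mm2:.1f} square millimeters")
```

Brighter pixels always carry at least as many dots as dimmer ones, so the picture shows up under a microscope as a crude halftone while a machine can still recover the exact intensity value.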

How we'd actually do this would require some human testing. It might well be that we'd interleave the colors or dither them for the best results. No problem as long as the image is visually readable and the machine decoding algorithm is specified (attaching a text file to every data card would be a good plan).

It is conceivable that a nanoscale writing technology might be developed that would show images in color using, for example, interference like the microstructures in butterfly wings. However, for maximum insurance, the bits themselves must be optically resolvable.

Microfiche has a projected lifetime of centuries. So why not just use microfiche? Simple. It's not immune to water or fire damage, and it's not directly electronically readable. Put microfiche on an ultra-durable substrate and we're in business.

More complex data like video or audio cannot be coded as simply as text and images. In any case, they won't be retrievable at all in the absence of some kind of playback device. But the lowly phonograph suggests a way to code audio data in a way that is accessible in the absence of computers. And frankly my dear, Gone With the Wind is worth preserving frame by frame.

A Few Caveats



Created 27 February, 2006;  Last Update 30 August, 2011

Not an official UW Green Bay site