The Internet Archive and Universal Access to Information

I’m a big fan of the Internet Archive. I like browsing around and checking out things I’ve never seen before. I came across it a while ago but didn’t use it until there was a thread on Digg that included a discussion on the supposed rise of advertising in movie theaters prior to the start of a film. I doubted that this was a new phenomenon and posted a link to examples from the 30s, 40s and 50s.

So I was pretty excited when I heard that Brewster Kahle was going to be speaking at the MIT Center for Collective Intelligence. Kahle came in and described his vision of universal access to all knowledge. He believes that universal access is within our grasp – from both a cost and a technical perspective; but he wondered: how can we make this content useful, and how can we provide that access well?

He started with a discussion of several media types.

Books – The US Library of Congress has about 26 million books. To store them all would require about 28 terabytes. So, for about $60,000 in storage you could have all of the words in the Library of Congress – indexed, searchable, online. Digital books have all sorts of new and interesting possibilities. He passed around a One Laptop per Child $100 laptop to show how digital books could be made available. The quality was impressive.

But sometimes a physical book is still really nice. (I ought to say here that I am a pretty active reader and collector of books myself, so I’m still a big fan of the existing form factor.) Digitizing books doesn’t have to remove them from the physical realm. Kahle went on to show several examples of books-on-demand, including a van that allows people to print a 100-page book for about $1. He passed around several examples of on-demand books and they looked and felt terrific. According to Kahle, the cost for a university to shelve a book is about $3 and the cost to build a library works out to about $30 per book.

To get the books online in the first place, the Internet Archive first tried sending them to other parts of the world for manual scanning. They found it made more sense to bring the scanners to the books rather than the other way around. So they looked into a book scanning robot. At the end of the day though it worked at roughly the same pace as a person and needed to have someone there to monitor its performance. Scratch that one. The Internet Archive has created its own scanning system that works out to about ten cents per page. It scans, digitizes the text and creates a PDF of the page. These systems are in a number of libraries around the country – including the BPL here in Boston.

To scan a typical book with this system costs about $30, so scanning the entire Library of Congress would cost almost $800M. Not a small cost, but at least it is a one-time cost.
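The back-of-envelope arithmetic behind these figures holds up. A quick sketch – the constants are the approximate numbers from the talk, and the decimal terabyte conversion is my assumption:

```python
# Back-of-envelope check of Kahle's Library of Congress figures.
# Assumption (not from the talk): 1 TB = 10**12 bytes.

NUM_BOOKS = 26_000_000          # approximate LoC holdings
TOTAL_TEXT_TB = 28              # terabytes to store all the text
SCAN_COST_PER_BOOK = 30         # dollars, Internet Archive scanning
SCAN_COST_PER_PAGE = 0.10       # dollars per page

bytes_per_book = TOTAL_TEXT_TB * 10**12 / NUM_BOOKS
total_scan_cost = NUM_BOOKS * SCAN_COST_PER_BOOK
implied_pages_per_book = SCAN_COST_PER_BOOK / SCAN_COST_PER_PAGE

print(f"~{bytes_per_book / 1e6:.1f} MB of text per book")     # ~1.1 MB
print(f"total scan cost: ${total_scan_cost / 1e6:.0f}M")      # $780M
print(f"implied book length: {implied_pages_per_book:.0f} pages")  # 300 pages
```

So the $30-per-book figure implies roughly a 300-page book at ten cents a page, and the total comes out just under the $800M cited.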

Audio – There have been approximately 2-3 million commercial recordings made (from wax cylinders to CDs), and this is an area that is heavily litigated. Where could the Internet Archive start? With the areas that are not part of the commercial music industry – folk and indigenous music, for example. The offer the Internet Archive is making is free hosting and bandwidth, forever, for anything that ought to be in a library.

One group of artists that came on board are those that allow their fans to tape and share concerts. There are now more than 2,000 bands represented and the collection includes every concert played by the Grateful Dead.

Overall, the audio archive has about 100,000 items in 1000 collections.

Video – So far, there have been between 150,000 and 200,000 feature films made. A few thousand are up now in the Archive. Besides feature films, there are many other things that are part of the collection, including news, sports, ephemera and the Prelinger Archives. All of these films have value and needed a centralized home – Kahle believes that the Internet Archive should function as the shelves of the Internet.

This collection also features a TV archive. They are recording 20 channels from around the world around the clock. Most of this content is not available – only the 9/11 collection is at this point.

Overall, the video collection includes about 50,000 videos and approximately one million hours of television.

Software – There have been about 50,000 pieces of packaged software made. Archiving this is a challenge due to the Digital Millennium Copyright Act. The Internet Archive has a three year window to collect as much software as possible before they face the restrictions of the DMCA. Kahle pointed out that the gaming community is doing a much better job of preserving and developing emulators for many titles than anyone else.

Web – The Internet Archive started collecting the Web in 1996. Not just the homepage but all public content, every two months. (This is the Way Back Machine).

People use this all the time – the collection receives 300-500 hits per second.

After describing the various collections and the Archive’s capabilities, Kahle turned to the reasoning and philosophy behind the Archive. It’s not just about preserving content, but also about creating services that make use of that content – and that’s where collective intelligence can play a role.

One of the important lessons for the Internet Archive is the one taught by the Library of Alexandria – don’t have only one copy of anything and don’t keep everything in one place. Their solution is to work with international libraries that share the commitment to universal access and to sharing their collections with each other. The goal is to house large, petabyte-scale collections in facilities around the world.

The Internet Archive’s collections are stored in Alexandria 2 – a massive set of open source storage systems.

Sister sites, like the European Archive in Amsterdam, allow the Archive to avoid faults – both physical and political. It is interesting to see that their collection of audio recordings “cannot be displayed in your jurisdiction.”

This led to the question – should content be public or private? Should these collections be created through open or proprietary methods? Some content has already gone proprietary – the law, for example. Even though the law is public information, the digital collections (through Lexis, for example) are proprietary. There had, famously, been an attempt to create a proprietary map of the human genome; but in this case the public sphere stepped in and created an open version of the map.

According to Kahle, Google is trying to do the same thing with a number of its projects. Their goal is to capture all of the knowledge and to put it under perpetual restriction. Despite the potential limitations Google may place on use, many libraries are participating in what is essentially a private and proprietary program.

As digitized content comes under new forms of control – whether through Google or Corbis – what role will libraries perform and what services will they no longer be able to offer? These are questions that people in the public sector need to consider and answer together. If the content of libraries falls under private control, libraries as they are understood today risk perishing, Kahle suggested.

This makes it critical that open and public collections be created – to preserve open and free access to information.

Kahle finished by describing a couple of projects where he thought collective intelligence could be used to help achieve the goals of the Archive. These included human powered “universal OCR” and “universal translation” applications.

The event was good – very interesting for me personally – and it raised questions for me about the changing nature of freedom and control that all of us face, whether we realize it or not.

[tags]MIT, CCI, Center for Collective Intelligence, Collective Intelligence, Internet Archive, Brewster Kahle, universal access, open source, libraries, books, audio, video, software, Web, content, Google, Corbis, European Archive, OCR, television, Grateful Dead, music, Wayback Machine, Boston Public Library, OLPC[/tags]


3 thoughts on “The Internet Archive and Universal Access to Information”

  1. Google Base is a big disappointment. I wrote recently about Freebase, which is yet another similar-but-different uber-archive. Freebase should take the God database to the next level. The Wikipediadb site is good too. Someone should interview the Internet Archive tape-retrieval robots, the stories they could tell.
