- By John Breeden II
- Mar 05, 2014
Since 1974, archivists in Michigan have been looking for ways to preserve the state’s electronic records in such a way that they could both last for hundreds of years and always be easily searchable by government officials and the public at large. But until very recently, there simply wasn’t an available system that would allow that vision to be realized.
Archivist Mark Harvey said that over the years the state figured out piecemeal ways to preserve data, such as storing files on network drives, CDs, DVDs and other portable media. “Saving to portable media worked,” Harvey said, “but how useful it was depended on the type of data. Without access to some of the legacy programs that created the data, it really didn’t do much.”
In 1996 the state began to tackle its electronic records management in earnest. It needed a solution that would not only preserve the records, but ensure that they would be searchable and accessible, even as file and media formats evolved.
As part of that effort, Michigan hired Caryn Wojcik as a government records archivist and she soon encountered obstacles that would define the solution’s scope. First, there was no computer system or data center that could handle the required storage and no budget for one. Second, the state didn’t want to hire programmers to build out an expensive custom solution.
Armed with that knowledge, Wojcik participated in several grant-funded research projects to develop software that would suit the state’s needs. Eventually, Wojcik found a commercial-off-the-shelf program called Preservica that was already being used by governments in Europe, library systems around the world and smaller government organizations whose requirements were similar to Michigan’s.
“Because we put so much effort into defining our needs and goals, we were confident that Preservica was the program we needed once we saw it,” Wojcik said.
Preservica was developed by Tessella, a digital preservation technology, consulting and research company. It is a Web-based application that ingests and curates content of all types and stores it in the Amazon Web Services cloud, where it is regularly checked for data corruption and redundancy. Users can read, update, delete and preserve each piece of content or metadata, as determined by their roles.
One of Preservica’s key features is “active preservation,” the ability to move files to new formats to avoid obsolescence. When files are uploaded, Preservica identifies their formats and determines if they are at risk from obsolescence. If such files are found, Preservica offers a variety of migration tools to create a “manifestation of the file, which is accessible to current technologies,” the company said.
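The identify-then-flag step described above can be illustrated with a minimal sketch. It is not Preservica's implementation: the registry contents, the `assess` function and the choice to identify files by extension are all assumptions made for illustration. Production tools typically inspect file signatures rather than trusting extensions.

```python
from pathlib import PurePath

# Hypothetical registry: extension -> (format name, at-risk flag, suggested target).
# The entries are illustrative assumptions, not a real preservation policy.
FORMAT_REGISTRY = {
    ".doc": ("Microsoft Word binary document", True, ".docx"),
    ".wpd": ("WordPerfect document", True, ".pdf"),
    ".docx": ("Office Open XML document", False, None),
    ".pdf": ("PDF", False, None),
}


def assess(filename):
    """Identify a file's format and report whether it should be migrated."""
    ext = PurePath(filename).suffix.lower()
    entry = FORMAT_REGISTRY.get(ext)
    if entry is None:
        # Unknown formats are flagged so that a person can review them.
        return {"file": filename, "format": "unknown",
                "at_risk": True, "migrate_to": None}
    name, at_risk, target = entry
    return {"file": filename, "format": name,
            "at_risk": at_risk, "migrate_to": target}


if __name__ == "__main__":
    for upload in ["minutes_1996.DOC", "report.pdf", "scan.xyz"]:
        print(assess(upload))
```

A file flagged as at risk would then be handed to a migration tool, with the original kept alongside the new manifestation.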
And because Preservica is a software-as-a-service solution, the state didn’t need to purchase special hardware, hire developers to create a special interface or install software locally.
Tessella’s director of archives Jon Tilbury explained that the system preserves records the same way professional archivists are used to conducting business. When archivists put records into Preservica, they prepare a submission package for upload, he said.
“They can embed more data if needed, but much of the existing data is already captured by the software already. For example, when archiving email, all of the tags as well as full-text archiving is already present. With a photo or a video, the name of the file and any information about it is automatically used with the option of adding more descriptive terms if needed.”
That methodology fell in perfectly with Michigan’s existing archiving workflow. Wojcik said that state agencies first identify and schedule documents that should be preserved. Those documents can be anything from the minutes of a public meeting to the findings of the state Supreme Court to reports about student achievement levels to entries for a recent art contest for the Michigan state quarter. Each department puts those records on a disk and couriers them to the Archives of Michigan or transmits them via FTP. Once the Archives of Michigan has the data, the archivists prepare an upload package and send the files into Preservica where they are backed up and protected in the cloud.
But data sitting in the Preservica cloud isn’t dead. The system can handle 800 different file formats, and the company keeps track of programs and version numbers, updating the archived files as needed while preserving the original format in case it’s needed for technical or legal reasons.
“For example, if a file is sitting around in Word 2.0 then it’s going to be obsolete,” Tilbury said. “So we will migrate that file over to Word 2012, but when we do, we will compare the two documents to make sure each and every character is the same. And we will keep the master file in place in case we ever have to prove that the data in the file hasn’t been modified in any way.”
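The two checks Tilbury describes, a character-by-character comparison after migration and proof that the master file is untouched, can be sketched as follows. This is a sketch under stated assumptions, not the vendor's code: the function names are hypothetical, the text is assumed to have already been extracted from both documents, and SHA-256 is an assumed choice of checksum.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a file's bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def migration_preserved_text(original_text: str, migrated_text: str) -> bool:
    """True only if every character survived the migration unchanged."""
    return original_text == migrated_text


def master_unchanged(master_bytes: bytes, recorded_digest: str) -> bool:
    """True if the master still matches the digest recorded at ingest."""
    return sha256_hex(master_bytes) == recorded_digest


if __name__ == "__main__":
    master = b"Minutes of the public meeting."
    digest_at_ingest = sha256_hex(master)  # stored when the file is first archived

    migrated_text = master.decode("utf-8")  # stand-in for text read from the new format
    print(migration_preserved_text(master.decode("utf-8"), migrated_text))
    print(master_unchanged(master, digest_at_ingest))
```

Recording the digest at ingest is what allows a later check to show that the master has not been modified.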
Wojcik says that the system is working well, but she is anxious to improve it even more, adding features like full public access. Right now the public can search to find what records have been archived. But to actually view the documents, people have to get a state archivist to retrieve them, she said. “They are all available, but not in as accessible of a format as we want. Very soon Preservica will be adding a public viewing component.”
Tilbury said Preservica’s new public viewing capability will be quickly deployed to Michigan, where viewers will be restricted to read-only access. Tabs can also control whether only a limited set of documents, or only parts of a document, are viewable. So if a record has confidential information like a Social Security number, it can be hidden from public view. Tabs can likewise be matched to different roles and security levels once the new component comes online.
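The idea of hiding part of a record from public view while keeping it available to other roles can be pictured with a small sketch. The article does not say how the component works, so everything here is an assumption: the role levels, the record layout and the `visible_fields` function are hypothetical.

```python
# Hypothetical access levels; higher numbers can see more.
PUBLIC = 0
ARCHIVIST = 1


def visible_fields(record, viewer_level):
    """Return only the fields whose required level the viewer meets.

    `record` maps a field name to a (value, required_level) pair.
    """
    return {
        name: value
        for name, (value, required_level) in record.items()
        if viewer_level >= required_level
    }


if __name__ == "__main__":
    record = {
        "title": ("Public meeting minutes", PUBLIC),
        "ssn": ("000-00-0000", ARCHIVIST),  # placeholder value, hidden from the public
    }
    print(visible_fields(record, PUBLIC))
    print(visible_fields(record, ARCHIVIST))
```

Filtering on read, rather than storing a separate public copy, keeps a single archived record as the source for every role.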
The Preservica system is scalable based on usage, with Tessella’s largest customer at the moment archiving over 8 petabytes of information. At the low end, organizations can store their records with the system for $1,000 per month, or $12,000 per year. Michigan uses a bit more storage space than the average user, so it pays $14,000 per year, a price that will remain flat until the state needs to put more data in place.
Harvey notes that the cost of solving one of the Archives’ biggest and longest-running problems is low compared with the alternatives. “For that amount of money, we probably could have only bought about 100 hours of a developer’s time,” Harvey said. “And that would not have gotten us anywhere close to having a production system. Instead, we are already archiving records with Preservica and can concentrate on improving a system that already works well.”
About the Author
John Breeden II is a freelance technology writer for GCN.