Running Fixity Checks in the Cloud
January 24, 2024
This post was originally written as a set of notes for a presentation to the National Digital Stewardship Alliance in 2021. This presentation came about after we had heard that a number of universities, libraries, museums and memory organizations storing digital materials in the cloud were looking for cost-effective ways to run fixity checks on large numbers of stored objects.
Some were advised by solutions architects to use lambdas, which are not a good choice for this type of workload for two reasons. First, running checks on very large files (500+ GB) will take a long time. Second, and more importantly, lambdas will cost substantially more in the long run if you’re running regularly scheduled checks on large volumes of data.
Below are the notes for the 2021 NDSA meeting, followed by a brief update describing where things stand with APTrust fixity checking in 2024. The video of the original presentation is available at https://www.youtube.com/watch?v=k8FEYmUg6Dk.
Reasons to Check Fixity
- Check for bit rot. You had to do this in the old days to understand when your storage media were beginning to fail. This is no longer necessary, since cloud providers have warning systems to alert them to failing storage. They also keep multiple copies, overwriting the bad with the good when necessary.
- Check for malicious file corruption. This is probably rare in most environments, but it’s good to proactively check for these cases, and to know when they occur.
- Check for accidental file corruption. This is usually caused by the misbehavior or misuse of internal systems. Two examples:
- The Library of Congress mentioned at the 2019 preservation storage summit that it discovered some checksum mismatches in an internal collection. The cause was a system that was legitimately updating files without recording the updated fixity values in a central registry. This is an example of a misbehaving system. It’s essential to identify and fix these, but you may never know they exist unless you actively check your files.
- APTrust ingested duplicates of some files on its first major ingest back in 2015. When cleaning up the duplicates, we accidentally removed the current version of some files, leaving only an outdated version. This is an example of system misuse. Scheduled fixity checks revealed the problem a few weeks later. After this incident, APTrust implemented policy changes, restricting internal administrators’ ability to delete files, and requiring depositors’ review and approval for all deletions.
Built-in Cloud Fixity #1: eTags & “Self Healing” Storage
S3-compliant storage services return an eTag when you send a new object into storage. eTags do not provide strong fixity assurance for two reasons:
- For files smaller than 5GB, they use simple md5 values, and md5 is cryptographically broken. A clever attacker can craft two different files that share the same md5 digest; if one of them is the file you stored, they can swap in the other, and if you’re only checking digests, you won’t notice.
- For files larger than 5GB, S3-compliant services calculate multipart eTags, which consist of what looks like an md5 checksum plus a suffix indicating the number of parts in the upload. That eTag can change every time you re-upload the file, even if the file’s underlying bits are identical, because multipart eTags depend on the number of parts in the upload and the number of bytes in each part. They’re not true fixity values: they describe a combination of the bits and the upload process, not the underlying bits alone.
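To see why multipart eTags make poor fixity values, here’s a minimal Go sketch of the multipart eTag algorithm (the md5 of the concatenated per-part md5 digests, suffixed with the part count). The tiny part sizes here are purely illustrative; real S3 multipart uploads use parts of at least 5 MB.

```go
package main

import (
	"crypto/md5"
	"fmt"
)

// multipartETag computes an S3-style multipart eTag: the md5 of the
// concatenated per-part md5 digests, plus a "-N" part-count suffix.
func multipartETag(data []byte, partSize int) string {
	var partDigests []byte
	parts := 0
	for start := 0; start < len(data); start += partSize {
		end := start + partSize
		if end > len(data) {
			end = len(data)
		}
		sum := md5.Sum(data[start:end])
		partDigests = append(partDigests, sum[:]...)
		parts++
	}
	final := md5.Sum(partDigests)
	return fmt.Sprintf("%x-%d", final[:], parts)
}

func main() {
	data := []byte("the same underlying bits")
	// Identical bytes, different part sizes: the eTags differ.
	fmt.Println(multipartETag(data, 8))
	fmt.Println(multipartETag(data, 16))
}
```

Because the part boundaries feed into the digest, two uploads of a bit-identical file can legitimately yield different eTags.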
Amazon and other cloud providers typically maintain multiple copies of each object. They do periodic integrity checks (details of which they don’t describe) to identify corrupt objects. If they find a corrupt object, they overwrite it with an intact copy of the object.
Cloud providers don’t disclose how they identify corrupt or intact copies, but they are likely not using computationally expensive, cryptographically secure algorithms like sha256 and sha512. They may be using less secure but “good enough” algorithms like CRC, and they may not even be performing fixity checks on your files or objects at all.
Often, their CRC integrity checks look for bad blocks on a storage device. If parts of one or more files are stored in a bad block, cloud providers can assume those files are corrupt and need to be re-copied from an intact version stored elsewhere. Their systems do this automatically.
From a depositor’s perspective, this means the cloud provider cannot answer the question “Does my file still match digest XYZ?” They can only assure you that your file is stored on a series of intact and uncorrupted blocks, which means that when you retrieve it, it will match byte-for-byte whatever you last gave them.
This is great for ensuring eleven nines (99.999999999%) of durability, but not great when your stakeholder asks, “Can you show me the results of a recent fixity check to prove my file is still intact?”
Built-in Cloud Fixity #2: Amazon and Google Microservices
Both AWS and Google Cloud Storage offer “serverless” fixity checking. Serverless means you do not have to provision or pay for a server to run your fixity checks. The fixity services simply run on a schedule on hardware of the provider’s choice, and they bill you according to your usage.
The more checking you do, the more you pay. This can lead to unpredictable costs, which makes budgeting hard. The factors that contribute to cost are:
- How many bytes are you checking? Some providers, such as AWS, charge data egress fees when you pull data from S3. Those fees can get quite high if you’re checking several terabytes of data per month. You can avoid them by running fixity checks on servers in the same region (and preferably the same availability zone) as the S3 servers that store your data.
- How long does your fixity check run? AWS and Google charge by the millisecond for your serverless functions. A fixity check on a small file may take only a second, resulting in a negligible charge. Fixity checks on larger files take longer and cost more. Most of the time required for cloud fixity checks lies in streaming the data from storage to your fixity function. You have no guarantee of the bandwidth of the anonymous machines on which serverless functions run. If bandwidth is low and streaming is slow, your function will run for a long time, and you’ll be paying for each millisecond. When you have terabytes of data to check, those costs add up quickly. (And for reference, we’ve seen 1 TB files take over 24 hours to stream on some AWS instance types.)
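As a back-of-the-envelope illustration of the duration problem, here’s a quick calculation. The 500 GB file size and 100 MB/s streaming rate are hypothetical numbers chosen for the example, but the 15-minute cap on a single AWS Lambda invocation is real.

```go
package main

import "fmt"

func main() {
	// Hypothetical workload: a 500 GB file streamed at 100 MB/s.
	const fileBytes = 500.0 * 1e9
	const bytesPerSecond = 100.0 * 1e6

	seconds := fileBytes / bytesPerSecond
	fmt.Printf("streaming time: %.0f seconds (%.1f minutes)\n", seconds, seconds/60)

	// AWS Lambda's maximum execution time is 15 minutes (900 seconds),
	// so a single invocation cannot even finish streaming this file.
	fmt.Println("exceeds Lambda's 15-minute limit:", seconds > 900)
}
```

At these assumed rates, the check needs over 83 minutes of streaming, and you’re billed for every millisecond of it on every scheduled run.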
AWS Serverless Fixity For Digital Preservation Compliance contains a description of how Amazon’s service works, along with an implementation guide for setting it up. It can calculate fixity values for specified buckets on a set schedule and write the results to an S3 bucket of your choosing. You’ll have to fetch the results from there on your own. This service can also send SMS and email alerts.
Note that the AWS service calculates only md5 and sha1 checksums. It does not support other digest algorithms.
Fixity Metadata for Google Cloud Storage (with setup instructions) provides a similar service for Google Cloud Storage (GCS), with some additional limitations, which may be pluses or minuses, depending on your workflow:
- The GCS service expects your files/objects to be stored in BagIt format.
- It calculates only md5 checksums.
The GCS service writes its output to Google’s Bigtable, and it’s your responsibility to retrieve the data from there.
DIY Cloud Fixity
APTrust runs its own fixity service, pulling files from S3 every 90 days to calculate both md5 and sha256 checksums. As of February 2021, we check about 8 million files comprising 154 TB of data every 90 days.
The fixity checker runs as a background process on an EC2 instance* that runs round-the-clock anyway to process ingests. Because ingests come in batches, the server has substantial idle time to do other work, like fixity checks.
To avoid S3 data egress charges, the fixity checker runs on a server in the same region as the S3 storage servers. Because the data never leaves AWS’s internal network, there’s no egress charge.
We stream the data from S3 through the fixity function and then discard it (by writing it to a throwaway stream, or to /dev/null). It never touches the disk. This substantially speeds up the process, since writing multi-gigabyte streams to attached storage (such as EBS volumes) can actually be slower than streaming them from a nearby S3 bucket. You can test this for yourself: an S3 read alone is substantially faster than an S3 read plus a write to disk. In the latter case, the CPU spends a lot of time idly waiting for disk I/O to complete, and you’re paying for that time.
Not writing file streams to disk also saves on cost. Our fixity checking service doesn’t need to maintain any local storage.
These three factors (proximity to the S3 data, the absence of disk writes, and multiple digest calculations in a single pass) make fixity checking very fast. We often check over 100k files in a single day while simultaneously processing ingests and restorations.
The benefits of DIY cloud fixity are:
- Your monthly costs are known and budgetable.
- You can use any digest algorithms you want (md5, sha256, sha512, etc.)
- You have full control of what happens when the fixity check completes (store the result in a database, send an alert, etc.)
The downside is that you may need to hire a developer to implement your custom solution.
* Update: January 2024
In late 2022, we re-launched our services to run in scalable Docker containers instead of EC2 instances. We’re now running regular 90-day fixity checks on over 9 million files comprising almost 200 TB of data in S3 and Wasabi. (As described in our storage fact sheet, we don’t check fixity on the other 130 or so terabytes stored only in Glacier.)
The fixity service runs in an always-on ECS micro spot instance (0.25 CPU), scaling up to three containers when busy. We’ve noticed that AWS imposes network throttling on micro containers that sustain consistently high network I/O. While this has affected some of our other services, it has not affected our fixity checker, which tends to work on batches of files, taking a short rest between batches. (The idle rest period seems to prevent AWS throttling from kicking in.)
Under our current ECS setup, the cost of running checksums on 200 TB of data every 90 days comes to less than $30 per month. That’s a cost we can live with.
Fixity Tagging and Source of Truth
S3-compliant services (and possibly GCS as well) allow you to store metadata tags with your objects. APTrust uses this feature to store, among other things, the md5 and sha256 checksums for each object. When we retrieve the object itself from storage, we get the checksums with it.
This is generally good practice, since the checksums give you an idea of what you should be getting in your download. But it does not protect against the two key scenarios of malicious and accidental file corruption. If an attacker has managed to overwrite one of your files, they’ve probably overwritten the metadata as well. They can replace your good checksum with their bad one, and you won’t know you’re getting an altered file.
The same goes for accidental corruption. A legitimate internal system or an authorized user may accidentally overwrite both a file’s contents and its metadata.
Unless you keep a central registry of what each file’s checksum should be, you have no way of knowing whether your data is still valid. The central registry is your source of truth, and only secure, trusted systems should be able to write to it.
APTrust keeps its registry in a Postgres database that has both Web and REST API front ends. The registry, which is entirely independent of S3, Glacier and Wasabi storage, is our ultimate source of truth for fixity values.
To see an example of our fixity checking code, take a look at the processFile() function that we use to calculate file checksums on ingest. This function calculates md5, sha1, sha256 and sha512 digests from an S3 stream in a single pass. (Note that this sample does save certain special files to disk, including manifests and tag files, for additional parsing later. Most files, however, are not saved to disk.)
In Go, we do this by creating a writer for each hashing algorithm, stuffing all those writers into a MultiWriter, then running the stream through the MultiWriter.