Preservation and Storage
Goals
- Securely store digital files deposited with APTrust and maintain strict management of file life cycles and actions.
- Provide a means to store multiple copies of digital files with diversity in the storage layer to mitigate failures of a particular storage technology.
- Provide a means to store multiple copies of digital files with geographic diversity to mitigate the loss of content due to large regional disasters.
- Confirm the continued integrity of digital files by generating and reporting the results of regular fixity checks on preserved files.
- Record actions and outcomes of significant events related to files managed and preserved by APTrust.
Storage
Content preserved by APTrust is stored in a combination of S3, S3 Glacier, and S3 Glacier Deep Archive for redundancy, mitigation against failure in a particular storage layer, and geographic redundancy. The APTrust High Assurance storage class, the default and oldest offering, combines S3 and S3 Glacier.
Storage Redundancy in High Assurance Storage Class
The primary preservation copy is maintained in the S3 layer, with the secondary preservation copy, kept in S3 Glacier, to be used in case of corruption of the primary copy or failure of the S3 storage layer. While the High Assurance storage class provides this default primary and secondary pairing, members may choose other storage layers for their primary and secondary copies. Each storage layer enforces local redundancy policies within its region, as documented in the AWS FAQ.
Geographic Diversity
The primary S3 storage is part of the US-Standard region, and for the High Assurance storage class copies are routed across Northern Virginia and the Pacific Northwest. In addition, APTrust provides Glacier offerings in Ohio, Virginia, and Oregon as secondary coverage options through our Basic Archive storage class.
Primary Preservation Storage (S3)
The primary S3 storage bucket uses the US-Standard region (see above). Permissions to this storage bucket are granted ONLY to the APTrust administrative account for proper lifecycle management and are never given to an external service or application. Access to items in the preservation bucket can be granted only by copying the content to an appropriate staging bucket after properly authorizing the request.
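As an illustration of this access pattern, here is a minimal sketch, assuming the AWS SDK for Python (boto3); the bucket names, key, and `stage_for_restore` function are hypothetical, not APTrust's actual configuration.

```python
# Hypothetical sketch: fulfilling an authorized access request by copying a preserved
# object into a staging bucket instead of exposing the preservation bucket directly.
import boto3

s3 = boto3.client("s3")  # credentials for the administrative account only

def stage_for_restore(preservation_bucket: str, staging_bucket: str, key: str) -> None:
    """Copy a preserved object into a staging bucket after the request is authorized."""
    s3.copy_object(
        Bucket=staging_bucket,
        Key=key,
        CopySource={"Bucket": preservation_bucket, "Key": key},
    )

# Example (illustrative names only):
# stage_for_restore("example-preservation", "example-restore-staging", "bag/data/file.tif")
```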
Second-Tier Preservation Storage (Glacier and Wasabi)
In addition to S3 standard storage, APTrust provides lower-cost but less readily accessible storage under the S3 Glacier tier through the Basic Archive and Deep Archive storage classes. While very inexpensive, such storage has a much longer Recovery Time Objective (RTO) or can cost a great deal to restore at a more rapid pace. APTrust offers S3 Glacier Flexible Retrieval (Basic Archive) and S3 Glacier Deep Archive (Deep Archive). These offerings are spread across multiple regions, with the goal of regional and storage-type diversity; both Glacier Flexible Retrieval and Glacier Deep Archive are available in Oregon, Ohio, and Virginia.
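To illustrate the retrieval trade-off, the sketch below shows how an object in a Glacier storage class might be temporarily restored before it can be read, assuming boto3; the bucket, key, and default tier are assumptions rather than APTrust settings.

```python
# Hedged sketch: requesting a temporary restore of a Glacier-class object.
# The Tier parameter trades restore cost against the Recovery Time Objective.
import boto3

s3 = boto3.client("s3")

def request_glacier_restore(bucket: str, key: str, days: int = 5, tier: str = "Bulk") -> None:
    """Ask S3 to make a Glacier or Deep Archive object temporarily readable."""
    s3.restore_object(
        Bucket=bucket,
        Key=key,
        RestoreRequest={
            "Days": days,  # how long the temporary copy remains available
            # "Bulk" is cheapest and slowest; "Standard" is faster; "Expedited"
            # is fastest but costs more and is not available for Deep Archive.
            "GlacierJobParameters": {"Tier": tier},
        },
    )
```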
APTrust has also integrated with Wasabi object storage to provide non-AWS storage diversity for the membership; this offering is called the Premium Archive storage class. Wasabi uses the S3 API format and operates in the same manner as AWS S3 standard storage (the primary preservation storage). It is offered in Oregon, Texas, and Virginia. Dedicated (non-AWS) APTrust system users with keys have been created for APTrust to manage and interact with Wasabi storage. Fixity checks are also performed on files stored in Wasabi. Wasabi's costs are substantially lower than AWS S3 standard storage; however, moving the data incurs egress costs.
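Because Wasabi speaks the S3 API, the same tooling can target it by pointing the client at a Wasabi endpoint. The sketch below is a minimal illustration using boto3; the endpoint URL, credentials, bucket, and key are placeholders, not APTrust's configuration.

```python
# Hedged sketch: creating an S3-compatible client for Wasabi. The endpoint and
# credentials shown are placeholders.
import boto3

wasabi = boto3.client(
    "s3",
    endpoint_url="https://s3.us-east-1.wasabisys.com",  # example Wasabi region endpoint
    aws_access_key_id="WASABI_ACCESS_KEY_ID",
    aws_secret_access_key="WASABI_SECRET_ACCESS_KEY",
)

# Objects are then written, read, and fixity-checked with the same S3 calls used for AWS, e.g.:
# wasabi.put_object(Bucket="example-premium-archive", Key="bag/data/file.tif", Body=data)
# wasabi.get_object(Bucket="example-premium-archive", Key="bag/data/file.tif")
```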
Preservation Storage Logging
Activity on the primary preservation bucket is logged using standard AWS logging to a dedicated bucket for deeper auditing and security purposes. Wasabi logs are copied into this same bucket. This is in addition to any logging already provided by locally coded content services. Logs for activity in members’ receiving and restoration buckets are also stored in a centralized logging bucket.
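For illustration, here is a minimal sketch of enabling standard S3 server access logging on a bucket so that requests are written to a centralized logging bucket; the bucket names and prefix are hypothetical.

```python
# Hedged sketch: directing S3 server access logs for a preservation bucket to a
# central logging bucket. Names are illustrative.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_logging(
    Bucket="example-preservation",
    BucketLoggingStatus={
        "LoggingEnabled": {
            "TargetBucket": "example-central-logs",  # centralized audit bucket
            "TargetPrefix": "preservation/",         # groups log objects by source bucket
        }
    },
)
```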
APTrust records preservation events as PREMIS metadata, including creation within the APTrust repository, identifier assignment, ingestion, access assignment, fixity, restoration, and deletion.
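As a rough illustration of what such an event record might contain, the sketch below assembles a PREMIS-style fixity event as a Python dictionary; the field names follow PREMIS concepts but are assumptions, not APTrust's actual event schema.

```python
# Hypothetical sketch of a PREMIS-style preservation event record.
from datetime import datetime, timezone
import uuid

def make_fixity_event(object_identifier: str, outcome: str, detail: str) -> dict:
    """Build a minimal fixity-check event; other event types include ingestion,
    identifier assignment, access assignment, restoration, and deletion."""
    return {
        "event_identifier": str(uuid.uuid4()),
        "event_type": "fixity check",
        "date_time": datetime.now(timezone.utc).isoformat(),
        "outcome": outcome,            # e.g. "success" or "failure"
        "outcome_detail": detail,      # e.g. the checksum value that was verified
        "object_identifier": object_identifier,
    }
```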
APTrust performs recurring fixity checks of hot copies every 180 days to ensure data integrity. Internal alerts notify APTrust staff when a fixity check fails. When that happens, the cold copy is retrieved and verified before it replaces the hot copy, and advisory representatives are notified by email.
Checksums on Primary Preservation Storage (S3)
Although S3 has its own methods of auditing and recovery, a manual process for confirming files provides more flexibility and assurance. A locally implemented service regularly copies files out of S3, confirms their fixity using MD5 and SHA256 values, and reports the outcomes in the administrative interface. Objects failing a fixity test are retried up to five times to confirm that the failure is an actual fixity error rather than a copy error.
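A minimal sketch of this kind of check, assuming boto3 and illustrative bucket, key, and checksum values: stream the object out of S3, compute MD5 and SHA256, compare them to the recorded values, and retry so that a transient copy error is not mistaken for corruption.

```python
# Hedged sketch: manual fixity confirmation with retries. Not APTrust's actual service.
import hashlib
import boto3

s3 = boto3.client("s3")

def fixity_check(bucket: str, key: str, expected_md5: str, expected_sha256: str,
                 attempts: int = 5) -> bool:
    """Return True if any attempt confirms the recorded checksums."""
    for _ in range(attempts):
        md5, sha256 = hashlib.md5(), hashlib.sha256()
        body = s3.get_object(Bucket=bucket, Key=key)["Body"]
        for chunk in body.iter_chunks(chunk_size=1024 * 1024):  # stream to avoid loading whole file
            md5.update(chunk)
            sha256.update(chunk)
        if md5.hexdigest() == expected_md5 and sha256.hexdigest() == expected_sha256:
            return True   # fixity confirmed
    return False          # failed on every attempt: treat as a genuine fixity error
```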
S3 eTag Values
S3 eTag values from AWS are reported but are NOT used to confirm fixity. Instead, the eTag is used as an internal identifier for items stored in S3 and as a convenient way to determine whether processed files are duplicates or updates of files already preserved in the system.
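For illustration only, the sketch below reads an object's eTag with a HEAD request and compares it with a previously recorded value as a duplicate/update hint; the function names are hypothetical, and the eTag is deliberately never compared against fixity checksums.

```python
# Hedged sketch: using the S3 eTag only as a duplicate/update hint, never for fixity.
import boto3

s3 = boto3.client("s3")

def etag_of(bucket: str, key: str) -> str:
    # Note: for multipart uploads the eTag is a composite value
    # ("<md5-of-part-md5s>-<part count>"), so it is not a plain MD5 of the object.
    return s3.head_object(Bucket=bucket, Key=key)["ETag"].strip('"')

def looks_like_duplicate(bucket: str, key: str, previously_recorded_etag: str) -> bool:
    return etag_of(bucket, key) == previously_recorded_etag
```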
Checksums on Glacier Storage
APTrust relies on Glacier’s internal SHA256 checksum reporting for regular short-term fixity checking as a base-level enhancement to the S3 fixity checks. If needed, a service may be developed to manually confirm fixity over a longer timescale (e.g., 24 months). This more extended manual fixity confirmation is throttled to use Glacier’s slower, free I/O allotment to recover files from Glacier, confirm fixity by performing manual MD5 and SHA256 checksums, and register the outcome with the administrative interface.
MD5 Checksums
MD5 checksums are generated by depositors as part of their submission preparation process and provided as part of the original Submission Information Package deposited with APTrust. They allow original depositors to confirm, at bit-level granularity, that the object preserved by APTrust continues to match exactly the file they deposited.
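As an illustration of how a depositor might produce these values, the sketch below writes a BagIt-style manifest-md5.txt for the payload files in a bag; the directory layout is assumed, and real submissions would typically rely on an established BagIt tool.

```python
# Hypothetical sketch: generating MD5 manifest entries for a bag's payload directory.
import hashlib
from pathlib import Path

def write_md5_manifest(bag_dir: Path) -> None:
    """Write manifest-md5.txt with one '<md5>  <relative path>' line per payload file."""
    lines = []
    for path in sorted((bag_dir / "data").rglob("*")):
        if path.is_file():
            # read_bytes() keeps the sketch short; large files would be streamed in chunks
            digest = hashlib.md5(path.read_bytes()).hexdigest()
            lines.append(f"{digest}  {path.relative_to(bag_dir).as_posix()}")
    (bag_dir / "manifest-md5.txt").write_text("\n".join(lines) + "\n")
```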
SHA256 Checksums
SHA256 checksums are generated as part of content ingestion into APTrust and are used to mitigate malicious tampering with files, including attempts to obscure that tampering. The algorithm mitigates this threat because it is currently cryptographically secure and resistant to tampering. If the algorithm is compromised in the future, APTrust should switch to a new secure algorithm. SHA256 checksums may also be generated by depositors as part of their submission preparation process and provided as part of the original Submission Information Package.
SHA1, SHA512
As of November 2022, we calculate SHA1 and SHA512 checksums on ingest for all files. We calculate SHA1 to comply with the Beyond the Repository (BTR) BagIt profile. We calculate SHA512 because of its strength and because the Library of Congress and others recommend it as a good practice.
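A minimal sketch of computing all four ingest checksums (MD5, SHA1, SHA256, SHA512) in a single pass over a file; this is a generic illustration, not APTrust's ingest code.

```python
# Hedged sketch: one read of the file feeds all four digest algorithms.
import hashlib

def ingest_checksums(path: str) -> dict:
    digests = {name: hashlib.new(name) for name in ("md5", "sha1", "sha256", "sha512")}
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):  # 1 MiB chunks
            for digest in digests.values():
                digest.update(chunk)
    return {name: digest.hexdigest() for name, digest in digests.items()}
```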
S3 bucket lifecycle policies are set on all member buckets. These serve two purposes (see the sketch after this list):
- Remove incomplete multipart uploads and “hidden” data remnants in restore/ingest buckets that would otherwise take up, and incur costs for, storage space
- Delete objects on a schedule:
  - Receiving buckets: objects after 60 days, incomplete multipart uploads after 7 days
  - Restoration buckets: objects after 14 days, incomplete multipart uploads after 7 days
  - Test receiving buckets: objects after 30 days, incomplete multipart uploads after 7 days
  - Test restoration buckets: objects after 7 days, incomplete multipart uploads after 7 days
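As referenced above, here is a hedged sketch of what such a lifecycle configuration might look like for a receiving bucket (60-day object expiration, 7-day incomplete multipart cleanup), assuming boto3; the bucket name and rule ID are illustrative.

```python
# Hedged sketch: a lifecycle rule matching the receiving-bucket schedule listed above.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-member-receiving",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-received-bags",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},    # apply to every object in the bucket
                "Expiration": {"Days": 60},  # deposited bags are removed after 60 days
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    },
)
```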