Preservation and Storage

Goals

  • Store digital files deposited with APTrust in a secure location and maintain strict management of file life-cycles and actions.
  • Provide a means to store multiple copies of digital files with diversity in the storage layer to mitigate against failures of a particular storage technology.
  • Provide a means to store multiple copies of digital files with geographic diversity to mitigate against the loss of content due to large regional disasters.
  • Confirm the continued integrity of digital files by generating and reporting the results of regular fixity checks on preserved files.
  • Record actions and outcomes of major events related to files managed and preserved by APTrust.

Storage

Content preserved by APTrust is stored in a combination of S3, S3 Glacier, and S3 Glacier Deep Archive for redundancy, mitigation against failure in a particular storage layer, and geographic redundancy. APTrust Core, the default and oldest offering, is a combination of S3 and S3 Glacier.

Storage Redundancy

The primary preservation copy is considered to be the one maintained in the S3 layer, with the secondary preservation copy to be used in case of corruption of the primary copy or failure of the S3 storage layer. While the core preservation service provides a primary and secondary layer, members may choose which storage layers serve as their primary and secondary. Each storage layer enforces its own local redundancy policies within its region and storage class, as documented in the AWS FAQ.

Geographic Diversity

The primary S3 storage is part of the US-Standard region and routes copies across Northern Virginia and the Pacific Northwest for the core service. However, APTrust provides Glacier offerings in Ohio, Virginia, and Oregon as options for secondary coverage.

Primary Preservation Storage (S3)

The primary S3 storage bucket uses the US-Standard region (see above).  Permissions to this storage bucket are granted to the APTrust administrative account ONLY for proper lifecycle management and never granted to an external service or application.  Access to items from the preservation bucket can only be granted by copying the content out to an appropriate staging bucket after properly authorizing any request.
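For illustration, here is a minimal sketch of the staging step described above, written with the AWS SDK for Python (boto3). The bucket names and object key are hypothetical placeholders, not APTrust's actual values, and the real restoration workflow involves additional authorization checks.

```python
# Hypothetical sketch: copy a preserved file from the primary preservation
# bucket to a staging bucket after a restoration request has been authorized.
# Bucket names and the key are illustrative, not APTrust's actual values.
import boto3

s3 = boto3.client("s3")  # runs under the APTrust administrative account


def stage_for_restoration(file_key: str) -> None:
    """Copy one preserved file into a staging bucket for delivery."""
    s3.copy_object(
        Bucket="aptrust.staging.example",  # hypothetical staging bucket
        Key=file_key,
        CopySource={"Bucket": "aptrust.preservation.example", "Key": file_key},
    )


stage_for_restoration("institution.edu/bag-name/data/file-001.tif")
```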

Second Tier Preservation Storage (Glacier and Wasabi)

In addition to S3 Standard storage, APTrust provides lower-cost but less readily available storage under the Glacier tier. While very inexpensive, such storage has a much longer Recovery Time Objective (RTO), or can cost a great deal to restore at a more rapid pace. APTrust offers both S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive. These offerings are spread across multiple regions, with the goal of regional and storage-type diversity. Currently, Glacier Flexible Retrieval and Glacier Deep Archive are available in Oregon and Ohio, with Glacier Deep Archive ONLY in Virginia.

To provide non-AWS storage diversity for the membership, APTrust has integrated Wasabi object storage. This storage integrates using the S3 API and operates in the same manner as AWS S3 Standard storage (primary preservation storage). Currently it is offered in both Oregon (US-West) and Virginia (US-East). APTrust system users (non-AWS) have been created with keys for APTrust to manage and interact with Wasabi storage. Fixity checks can be and are performed on Wasabi. The costs for Wasabi are substantially lower than AWS S3 Standard; however, moving data out incurs egress costs.
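Because Wasabi speaks the S3 API, interacting with it looks the same as interacting with AWS S3; only the endpoint and credentials differ. Below is a minimal sketch assuming an illustrative endpoint URL, bucket name, and APTrust-managed (non-AWS) keys.

```python
# Minimal sketch of talking to Wasabi through its S3-compatible API.
# The endpoint URL, bucket name, and credentials are illustrative assumptions.
import boto3

wasabi = boto3.client(
    "s3",
    endpoint_url="https://s3.us-west-1.wasabisys.com",  # assumed US-West (Oregon) endpoint
    aws_access_key_id="WASABI_ACCESS_KEY",               # APTrust-managed (non-AWS) key
    aws_secret_access_key="WASABI_SECRET_KEY",
)

# Standard S3 calls work unchanged, e.g. listing preserved objects:
resp = wasabi.list_objects_v2(Bucket="aptrust.wasabi.example", MaxKeys=10)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["ETag"])
```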

Preservation Storage Logging

Activity on the primary preservation storage bucket is logged using AWS standard logging to a bucket named aptrust.preservation.logs for deeper auditing and security purposes. This is in addition to any logging already provided by locally coded content services. Buckets belonging to members for preservation activities also log to a central bucket named aptrust.S3.logs.
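One way such logging can be enabled is with an S3 bucket-logging configuration; the sketch below assumes a hypothetical preservation bucket name and log prefix, while the target log bucket matches the one named above.

```python
# Hedged sketch: enabling S3 server access logging on the primary
# preservation bucket, directed to aptrust.preservation.logs.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_logging(
    Bucket="aptrust.preservation.example",  # hypothetical preservation bucket name
    BucketLoggingStatus={
        "LoggingEnabled": {
            "TargetBucket": "aptrust.preservation.logs",
            "TargetPrefix": "preservation/",  # illustrative log prefix
        }
    },
)
```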

The following describes the common preservation activities performed on files.

Checksum Requirements

Checksums on Primary Preservation Storage (S3)

Although S3 has a method of auditing and recovery, a manual process for confirming files provides more flexibility and a greater level of assurance. Files will be regularly copied out of S3 by a locally implemented service to confirm fixity using both MD5 and SHA256 values, and the outcomes will be reported in the administrative interface. Objects failing fixity tests will be retried up to 5 times to ensure the failure is a true fixity error and not a copy error.
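A hedged sketch of such a fixity check follows, assuming hypothetical bucket/key names and expected digests supplied by the registry; the actual APTrust service is implemented differently.

```python
# Sketch of the manual fixity check described above. Bucket names, keys, and
# the reporting step are illustrative assumptions, not APTrust's service code.
import hashlib
import boto3

s3 = boto3.client("s3")
MAX_ATTEMPTS = 5


def check_fixity(bucket: str, key: str, expected_md5: str, expected_sha256: str) -> bool:
    """Download the object and compare MD5/SHA256 against registered values."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        body = s3.get_object(Bucket=bucket, Key=key)["Body"]
        md5, sha256 = hashlib.md5(), hashlib.sha256()
        for chunk in iter(lambda: body.read(8 * 1024 * 1024), b""):
            md5.update(chunk)
            sha256.update(chunk)
        if md5.hexdigest() == expected_md5 and sha256.hexdigest() == expected_sha256:
            return True  # outcome would then be reported to the administrative interface
        # Retry to rule out a transient copy error before declaring a fixity failure.
    return False
```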

S3 eTag Values

S3 eTag values from AWS will be reported but are NOT used to confirm fixity. Instead, this value is used as an internal identifier for items stored in S3 and as a convenient way to determine whether files being processed are duplicates or updates of files already preserved in the system.
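A small illustrative example of using the reported ETag as an identifier for duplicate detection; note that for multipart uploads the ETag is not a plain MD5 digest, which is one reason it is not used for fixity.

```python
# Illustrative check that uses the S3 ETag as an internal identifier to
# detect whether an incoming file duplicates one already preserved.
# The known_etag value would come from existing registry records (assumption).
import boto3

s3 = boto3.client("s3")


def is_duplicate(bucket: str, key: str, known_etag: str) -> bool:
    """Compare a stored ETag against the ETag S3 currently reports for the object."""
    etag = s3.head_object(Bucket=bucket, Key=key)["ETag"].strip('"')
    return etag == known_etag.strip('"')
```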

Checksums on Glacier Storage

For regular short-term fixity checking we will rely on Glacier's internal SHA256 checksum reporting as a base-level enhancement to the S3 fixity checks. Additionally, a service may be developed, if needed, to manually confirm fixity on a longer timescale (~24 months). This longer manual fixity confirmation is throttled to use Glacier's slower free I/O allotment to recover files from Glacier, confirm fixity by computing manual MD5 and SHA256 checksums, and register the outcome with the Administrative Interface.
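A hedged sketch of requesting a slow, low-cost retrieval of a Glacier-class object so its checksums can be recomputed manually; the bucket and key names are illustrative, and the retrieval tier shown is an assumption about how the throttled recovery might be requested.

```python
# Hedged sketch: request a slow, low-cost retrieval of a Glacier-class object
# so its fixity can be recomputed. Names and the tier choice are illustrative.
import boto3

s3 = boto3.client("s3")
s3.restore_object(
    Bucket="aptrust.glacier.example",  # hypothetical Glacier-backed bucket
    Key="institution.edu/bag-name/data/file-001.tif",
    RestoreRequest={
        "Days": 3,  # keep the temporary restored copy only briefly
        "GlacierJobParameters": {"Tier": "Bulk"},  # slowest, cheapest retrieval tier
    },
)
# Once restored, the file can be downloaded and checked with MD5/SHA256 as in
# the fixity sketch above, and the outcome registered.
```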

MD5 Checksums

MD5 checksums are generated by depositors as part of their submission preparation process and provided as part of the original submission package to APTrust.

These checksums provide a means for the original depositors to confirm that the objects preserved by APTrust continue to match the files they deposited exactly, with bit-level granularity.

SHA256 Checksums

SHA256 checksums are generated as part of content ingestion into APTrust and are used to mitigate against malicious tampering of files carried out with the intent of obscuring the tampering. SHA256 mitigates this threat because it is currently a cryptographically secure algorithm that is resistant to tampering. In the event that this algorithm is compromised in the future, we would switch to a new secure algorithm.

SHA1, SHA512

As of November 2022, we calculate SHA1 and SHA512 checksums on ingest for all files. We calculate SHA1 to comply with the Beyond the Repository (BTR) BagIt profile. We calculate SHA512 because of its strength and because the Library of Congress and others recommend it as a best practice.
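For example, a depositor (or an ingest process) could produce manifests for all four of the algorithms mentioned above with the Library of Congress bagit-python tool; the directory path here is a placeholder.

```python
# Illustrative use of bagit-python to build a bag whose manifests carry
# MD5, SHA1, SHA256, and SHA512 checksums. The path is a placeholder.
import bagit

bag = bagit.make_bag(
    "/path/to/submission",  # placeholder directory to be bagged in place
    checksums=["md5", "sha1", "sha256", "sha512"],
)
print(bag.is_valid())  # re-verifies payload files against the generated manifests
```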

S3 bucket lifecycle policies are set on all member buckets (a configuration sketch follows the list below). These serve two purposes:

  1. Remove incomplete multipart uploads and remove “hidden” data remnants in restore/ingest buckets that may take up and cost storage space.
  2. Deletion of objects in:
    1. receiving buckets every 60 days, incomplete multipart: 7 days
    2. restore buckets every 14 days, incomplete multipart: 7 days
    3. receiving.test buckets every 30 days, incomplete multipart: 7 days
    4. restore.test buckets every 7 days, incomplete multipart: 7 days
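A hedged sketch of one such rule, applied to a hypothetical receiving bucket with the 60-day object expiration and 7-day incomplete-multipart cleanup listed above; the bucket name and rule ID are illustrative.

```python
# Hedged sketch of a lifecycle policy for a receiving bucket: expire objects
# after 60 days and abort incomplete multipart uploads after 7 days.
# The bucket name and rule ID are illustrative assumptions.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="aptrust.receiving.example.edu",  # hypothetical member receiving bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-received-bags",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "Expiration": {"Days": 60},
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    },
)
```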