4.1.3 Recognition and parsing of SIPs
The repository shall have adequate specifications enabling recognition and parsing of the SIPs.
This is necessary in order to be sure that the repository is able to extract information from the SIPs.
Packaging Information for the SIPs; Representation Information for the SIP Content Data, including documented file format specifications; published data standards; documentation of valid object construction.
The repository must be able to determine what the contents of a SIP are with regard to the technical construction of its components. For example, the repository needs to be able to recognize a TIFF file and confirm that it is not simply a file with a filename ending in ‘TIFF’. Another example would be a website, for which the repository would need to be able to recognize and test the validity of the variety of file types (e.g., HTML, images, audio, video, CSS) that are part of the website. This is necessary in order to confirm: 1) the SIP is what the repository expected; 2) the Content Information is correctly identified; and 3) the properties of the Content Information to be preserved have been appropriately selected.
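The TIFF example above can be illustrated with a signature check on the file's leading bytes. This is a minimal sketch, not APTrust's implementation (APTrust relies on FITS for characterization); the function names are hypothetical, and only the classic TIFF signatures are covered, not BigTIFF. Matching the signature shows only that the file begins like a TIFF, not that it is a valid one.

```python
# Classic TIFF files start with a byte-order mark ("II" or "MM")
# followed by the number 42 encoded in that byte order.
TIFF_SIGNATURES = (b"II*\x00", b"MM\x00*")


def looks_like_tiff(header: bytes) -> bool:
    """Return True if the first four bytes match a classic TIFF signature."""
    return header[:4] in TIFF_SIGNATURES


def file_looks_like_tiff(path: str) -> bool:
    """Check a file on disk by its leading bytes, ignoring its filename."""
    with open(path, "rb") as handle:
        return looks_like_tiff(handle.read(4))
```

A file named image.tiff that contains plain text fails this check, while a correctly signed TIFF with any other filename passes it.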
APTrust creates an intellectual object for the SIP and assigns it a unique identifier. BagIt tag files submitted with the bag are parsed to index the PDI, and APTrust then extracts the contents of the tarred bag. A generic file is created for each data object and assigned its own unique identifier. Each generic file undergoes characterization via FITS to generate representation information, which is indexed. The generic files are then replicated three times to Amazon S3 (Virginia, United States region). All three instances in Amazon S3 share the same unique identifier and metadata.
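The tag-file parsing step can be sketched as follows. This is an illustrative reading of the BagIt tag-file layout (one "Label: value" entry per line, with indented lines continuing the previous value), not APTrust's actual ingest code; the function name and the sample labels used with it are assumptions.

```python
from typing import Dict, List, Optional


def parse_tag_file(text: str) -> Dict[str, List[str]]:
    """Parse BagIt-style 'Label: value' lines into a dict of value lists.

    Labels may repeat, so each label maps to a list of values.
    Lines that begin with whitespace continue the previous value.
    """
    tags: Dict[str, List[str]] = {}
    last_label: Optional[str] = None
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line[0] in " \t" and last_label is not None:
            # Continuation line: append to the most recent value.
            tags[last_label][-1] += " " + line.strip()
            continue
        label, separator, value = line.partition(":")
        if not separator:
            continue  # not a 'Label: value' line; skipped in this sketch
        last_label = label.strip()
        tags.setdefault(last_label, []).append(value.strip())
    return tags
```

Because the line is split on the first colon only, values that themselves contain colons, such as URLs, are kept intact.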
APTrust preserves the entire set of information properties associated with the content information as submitted in the SIP.
Content information comprises one or more data objects together with the representation information that describes the meaning of each data object. APTrust refers to content information collectively as intellectual objects and to data objects as generic files. Each data object is characterized using FITS (File Information Tool Set, from Harvard Library); the resulting FITS output forms the representation information.
The context for content information (intellectual object) is supplied by the member as part of the SIP. It consists of associated metadata, including title and description. Members may include additional descriptive information as it relates to the content information. APTrust stores a reference to the related intellectual object (content information) in the metadata for each generic file (data object). For more information about the structure and content of SIPs, see Definition of SIP. The general process of file format identification and validation is described in Ingest. Definition of AIP outlines the process of transforming SIPs into AIPs.