Checksum
A checksum is a small-sized block of data derived from another block of digital data for the purpose of detecting errors that may have been introduced during its transmission or storage. By themselves, checksums are often used to verify data integrity but are not relied upon to verify data authenticity. The procedure which generates this checksum is called a checksum function or checksum algorithm. (Wikipedia, accessed 2022-12-09)
Checksums are generated to create a "digital fingerprint" that identifies a file or group of files. If any part of a file is changed or corrupted, for example through data degradation (also known as "bit rot"), its checksum will also change. Scripts can recompute and compare checksums automatically, which makes it quick to identify files that have problems. When combined with polling and voting, checksums help ensure data integrity.
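As a minimal sketch of the automated comparison described above, the following Python uses only the standard library's `hashlib`. The function names `file_checksum` and `is_unchanged` are illustrative, not part of LOCKSS or any bagging tool, and SHA-256 is assumed as the default only for the sake of the example.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path, recorded_checksum, algorithm="sha256"):
    """Compare a freshly computed checksum against one recorded earlier."""
    return file_checksum(path, algorithm) == recorded_checksum.lower()
```

A script would typically record `file_checksum(path)` when a file is ingested and call `is_unchanged(path, recorded)` on a schedule. A `False` result means the file's bytes no longer match what was recorded.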
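MD5 used to be the go-to algorithm. Many practitioners then switched to an algorithm from the SHA family (SHA-1, SHA-256, SHA-512) after published research demonstrated collision attacks against MD5. Some institutions use two checksums, computed with different algorithms.
When selecting a checksum algorithm, think about your intention. The SHA-2 algorithms (SHA-256 and SHA-512) are designed to resist deliberate tampering, which makes them more secure, but that protection is not always needed. SHA-1, like MD5, has known collision attacks. MD5 is weaker over the long term (it can be spoofed: two different files can be crafted to share one MD5 checksum), but it still detects accidental corruption and is more efficient in terms of energy consumption and time to create.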
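For institutions that record two checksums, both can be computed in a single read of the file, so the second algorithm does not require a second pass over storage. This is a sketch under assumptions: the function name `multi_checksum` and the default pair of MD5 and SHA-256 are chosen for illustration and are not a recommendation from the source.

```python
import hashlib


def multi_checksum(path, algorithms=("md5", "sha256"), chunk_size=1024 * 1024):
    """Return {algorithm name: hex digest} for one file, reading the file once."""
    digests = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            # Feed the same chunk to every hash object before reading the next one.
            for digest in digests.values():
                digest.update(chunk)
    return {name: digest.hexdigest() for name, digest in digests.items()}
```

The algorithm names must be ones that `hashlib.new` accepts on the local Python build. `hashlib.algorithms_available` lists them.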
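Checksums are easily created using bagging software, that is, tools that package files according to the BagIt specification and write a manifest of checksums alongside them.
Checksums are not required for the LOCKSS software to work, but they will help confirm file fixity if you recover the AU (Archival Unit) in the future.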
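To illustrate what a checksum manifest contains, here is a simplified sketch that lists every file under a directory as one `<checksum>  <relative path>` line, which is the line style BagIt manifests use. It does not create a valid bag (there is no `data/` directory, `bagit.txt`, or tag files) and is not a substitute for bagging software. The function name `build_manifest` is hypothetical.

```python
import hashlib
from pathlib import Path


def build_manifest(root, algorithm="sha256"):
    """Return manifest text with one '<checksum>  <relative path>' line per file."""
    root = Path(root)
    lines = []
    # Sort the paths so repeated runs produce the same manifest.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        # read_bytes() loads the whole file into memory; fine for a sketch,
        # but large files would be better hashed in chunks.
        digest = hashlib.new(algorithm, path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {path.relative_to(root).as_posix()}")
    return "".join(line + "\n" for line in lines)
```

The returned text should be saved outside the directory being hashed, or the next run will list the manifest file itself.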
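A fixity check after recovery might look like the sketch below. It assumes the recovered files sit under one directory and that a manifest was saved earlier with lines of the form `<checksum>  <relative path>`. The function name `verify_manifest` and the three status strings are illustrative assumptions and not part of the LOCKSS software.

```python
import hashlib
from pathlib import Path


def verify_manifest(root, manifest_text, algorithm="sha256"):
    """Return {relative path: 'ok' | 'changed' | 'missing'} for each manifest line."""
    root = Path(root)
    results = {}
    for line in manifest_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split once on whitespace so paths containing spaces stay intact.
        expected, relative_path = line.split(maxsplit=1)
        target = root / relative_path
        if not target.is_file():
            results[relative_path] = "missing"
            continue
        actual = hashlib.new(algorithm, target.read_bytes()).hexdigest()
        results[relative_path] = "ok" if actual == expected.lower() else "changed"
    return results
```

Any entry reported as `changed` or `missing` points to a file whose recovered copy does not match what was originally recorded.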
For more information about this topic, see the iPres paper at https://phaidra.univie.ac.at/o:923643.
This answers the AirTable Question: "What manifest algorithms are recommended? What are the pros and cons of each option?"