Friday, July 07, 2006

Brief Note about MD5, Hash, Checksum and Digest

Recently, I became bit curious about checking the integrity of downloaded files from net and had to read up about MD5, Hash, Checksum et al.

I started off with wikipedia and google search; both of these gave me a bunch of info which I have summarized here for a quick intro.

What is "Hash"?
---------------
The term "hash" apparently comes by way of analogy with its standard meaning in
the physical world, to "chop and mix."

In the SHA-1 algorithm, for example, the domain is "flattened" and "chopped" into "words"
which are then "mixed" with one another using carefully chosen mathematical functions.


Hash Function
----------------
A hash function (or hash algorithm) is a way of creating a small digital "fingerprint"
from any kind of data.

The function chops and mixes the data to create the fingerprint, often called a hash
value.

The hash value is commonly represented as a short string of random-looking letters
and numbers (Binary data written in hexadecimal notation).

A good hash function is one that yields few hash collisions in expected input domains.
In hash tables and data processing, collisions inhibit the distinguishing of data,
making records more costly to find.


Error Detection - Use of Hash Function and Redundancy Check
-------------------------------------------------------------------------
Using a hash function to detect errors in transmission is straightforward.

The hash function is computed for the data at the sender, and the value of this hash is
sent with the data. The hash function is performed again at the receiving end, and if
the hash values do not match, an error has occurred at some point during the transmission.
This is called a redundancy check.

Checksum
---------------
This article is about checksums calculated using addition.

The term "checksum" is sometimes used in a more general sense to refer to any kind of
redundancy check.

A checksum is a form of redundancy check, a very simple measure for protecting the integrity
of data by detecting errors in data that is sent through space (telecommunications) or time (storage).

It works by adding up the basic components of a message, typically the asserted bits, and
storing the resulting value.

Later, anyone can perform the same operation on the data, compare the result to
the authentic checksum, and (assuming that the sums match) conclude that the
message was probably not corrupted.

The simplest form of checksum, which simply adds up the asserted bits in the data,
cannot detect a number of types of errors. In particular, such a checksum is not changed by:

* reordering of the bytes in the message
* inserting or deleting zero-valued bytes
* multiple errors which sum to zero

More sophisticated types of redundancy check, including Fletcher's checksum, Adler-32, and
cyclic redundancy checks (CRCs), are designed to address these weaknesses by considering not only the value of each byte but also its position.

The cost of the ability to detect more types of errors is the increased complexity of computing
the checksum.

These types of redundancy check are useful in detecting accidental modification such as corruption to stored data or errors in a communication channel. However, they provide no security against a malicious agent as their simple mathematical structure makes them trivial to circumvent.

To provide this level of integrity, the use of a cryptographic hash function, such as SHA-256,
is necessary. (Collisions have been found in the popular MD5 algorithm and finding collisions in SHA-1 seems possible, but there is no evidence as of 2005 that SHA-256 suffers similar weaknesses.)

On Unix, there is a tool called "cksum" that generates both a 32 bit CRC and a byte count for
any given input file.

CRC
----
A cyclic redundancy check (CRC) is a type of hash function used to produce a checksum
a small, fixed number of bits against a block of data, such as a packet of network traffic or
a block of a computer file. The checksum is used to detect errors after transmission or storage.

A CRC is computed and appended before transmission or storage, and verified afterwards by recipient to confirm that no changes occurred on transit. CRCs are popular because they are simple to implement in binary hardware, are easy to analyze mathematically, and are particularly good at detecting common errors caused by noise in transmission channels.

Hashes are "digests", not "encryption"
-------------------------------------
Encryption transforms data from a cleartext to ciphertext and back (given the right keys),
and the two texts should roughly correspond to each other in size: big cleartext yields
big ciphertext, and so on. "Encryption" is a two-way operation.

Hashes, on the other hand, compile a stream of data into a small digest
(a summarized form: think "Reader's Digest"), and it's strictly a one way operation. All hashes of the same type have the same size no matter how big the inputs are:

We'll note here that though hashes and digests are often informally called "checksums",
they really aren't. True checksums, such as a Cyclic Redundancy Check are designed to
catch data-transmission errors and not deliberate attempts at tampering with data.

Cipher
---------
A cipher is an algorithm for performing encryption (and the reverse, decryption)
A series of well-defined steps that can be followed as a procedure. An alternative term is encipherment

Excellent References:
-----------------------
1) An Illustrated Guide to Cryptographic Hashes


2) What is a Digital Signature

No comments: