4 Data deduplication
In this chapter:
• What is data deduplication? (page 19)
• Data deduplication and the HP StoreOnce Backup System (page 19)
• Tape rotation example with data deduplication (page 20)
What is data deduplication?
Data deduplication is a process that compares blocks of data being written to the backup device
with data blocks previously stored on the device. If duplicate data is found, a pointer to the
original data is established and the duplicate data is not stored again. This removes, or
“deduplicates,” the redundant blocks. The key point is that deduplication is done at the block
level, not at the file level, which significantly reduces the volume of data stored.
Figure 3 Data stored after deduplication
The importance of the Index files
As a backup stream arrives at the HP StoreOnce Backup System, the stream of data is “chunked”
into nominal 4K chunks. A hashing algorithm is run on each of these 4K chunks, producing
a unique digital fingerprint that is written to an index file.
This process is repeated in real time for every chunk of data in the first backup stream. When
subsequent backups run, it is highly likely that they will create identical hash codes. In that
case the hash count in the index is increased, and the data associated with the hash code is not
stored again because it already resides in the Deduplication Store. The data is stored only once
for any given hash code, hence the name StoreOnce.
The Index files contain the mapping for the hashed data chunks created by deduplication and are
the main point of reference accessed and updated by both replication and housekeeping. Without
them, data cannot be restored successfully.
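
The chunk, hash, and index steps described above can be illustrated with a minimal sketch. It is
not the StoreOnce implementation: the fixed 4096-byte chunk size, the choice of SHA-256, and the
names DedupStore, write, and restore are all assumptions made for illustration only.

```python
import hashlib

# Assumption: fixed-size chunks. The guide only says "nominal 4K".
CHUNK_SIZE = 4096


class DedupStore:
    """Hypothetical in-memory model of an index plus a deduplication store."""

    def __init__(self):
        self.index = {}   # fingerprint -> hash count (number of references)
        self.chunks = {}  # fingerprint -> chunk data, stored once

    def write(self, stream):
        """Chunk and hash a backup stream; return its list of fingerprints."""
        recipe = []
        for offset in range(0, len(stream), CHUNK_SIZE):
            chunk = stream[offset:offset + CHUNK_SIZE]
            # Assumption: SHA-256 stands in for the unspecified hashing algorithm.
            fingerprint = hashlib.sha256(chunk).hexdigest()
            if fingerprint in self.index:
                # Duplicate chunk: increase the hash count, store nothing new.
                self.index[fingerprint] += 1
            else:
                # First occurrence: record it in the index and store the data.
                self.index[fingerprint] = 1
                self.chunks[fingerprint] = chunk
            recipe.append(fingerprint)
        return recipe

    def restore(self, recipe):
        """Rebuild a stream from its fingerprints; fails if a mapping is missing."""
        return b"".join(self.chunks[fingerprint] for fingerprint in recipe)


store = DedupStore()
monday = b"A" * 8192 + b"B" * 4096
tuesday = b"A" * 8192 + b"C" * 4096   # two of three chunks are unchanged
monday_recipe = store.write(monday)
tuesday_recipe = store.write(tuesday)

print(len(store.chunks))                         # unique chunks actually stored
print(store.restore(tuesday_recipe) == tuesday)  # restore relies on the index mapping
```

In this sketch six chunks are written across the two backups but only three are stored, and a
restore is possible only while the fingerprint-to-chunk mapping is intact, which mirrors why the
Index files are essential.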
Data deduplication and the HP StoreOnce Backup System
Data deduplication is applied per library device or share. When you configure the library or share,
deduplication is enabled by default; this cannot be disabled.
A device is associated with a host server, and deduplication allows a greater amount of backup
history to be stored for that host. More full backups can be retained, which makes possible a
rotation strategy with a longer retention history. Deduplication does not increase the number of
host servers that may be connected. The deduplication factor that has been applied to a device is
calculated and displayed on the Web Interface. This figure is dynamic; it updates automatically
as more data is written to the device.
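
This guide does not state how the deduplication factor is calculated. A common definition,
assumed here, is the amount of data written by the host divided by the amount of data actually
stored. The function name dedup_factor and the backup sizes below are hypothetical and serve only
to show why the figure changes as more data is written.

```python
def dedup_factor(bytes_written_by_host, bytes_stored):
    """Assumed definition: data written by the host / data actually stored."""
    if bytes_stored <= 0:
        raise ValueError("nothing has been stored yet")
    return bytes_written_by_host / bytes_stored


# Hypothetical figures in GB: each full backup writes 500 GB,
# but only the changed chunks add to the stored total.
written = 0
stored = 0
for new_unique_gb in (500, 30, 30, 40):
    written += 500
    stored += new_unique_gb
    print(f"written={written} GB stored={stored} GB "
          f"factor={dedup_factor(written, stored):.2f}")
```

Under these assumed figures the factor rises with each additional full backup, because most of
the data in each new backup is already held in the store.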