It is a term used to describe an algorithm that eliminates duplicate copies of data from storage, so that only a single instance of replicated data is saved. When a client wishes to store a piece of data of which a copy has already been saved in the storage system, a link to this existing copy is provided instead of storing another copy of the data. This reduces the cost of stored data, and mainly for this economic reason data de-duplication is gaining in popularity.
From the data granularity point of view, the de-duplication methods can be divided into two main categories:
- file-level de-duplication, where duplicates are detected and eliminated at the level of whole files,
- block-level de-duplication, where files are split into blocks (chunks) and duplicates are eliminated at the level of individual blocks.
According to [1], data de-duplication techniques for cloud storage can be divided into:
- intra-user de-duplication, where duplicates are detected only among the files of a single user,
- inter-user de-duplication, where duplicates are detected across the files of all users of the system.
Another categorization of the strategies is by the location where de-duplication is performed:
- source-based (client-side) de-duplication, where duplicates are detected at the client before the data is sent to the server,
- target-based (server-side) de-duplication, where the data is always uploaded and duplicates are detected at the storage server.
Simple hashing
As stated in [1], a common technique to de-duplicate data is to hash the content of the data and use this (practically unique) hash value as the identifier of the data. In the source-based de-duplication scenario, the client sends this identifier to the storage server to determine whether such an identifier already exists in the system.
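A minimal sketch of this technique in Python (the has_identifier, add_reference, and upload calls are a hypothetical server API, not an interface from [1]):

    import hashlib

    def file_identifier(content: bytes) -> str:
        # Hash the file content; the digest serves as the data identifier.
        return hashlib.sha256(content).hexdigest()

    def put(api, content: bytes) -> None:
        # Source-based de-duplication: send the identifier first and
        # upload the content only if the server does not already have it.
        identifier = file_identifier(content)
        if api.has_identifier(identifier):      # hypothetical server call
            api.add_reference(identifier)       # de-duplicated: link only
        else:
            api.upload(identifier, content)     # first copy: full upload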
Suppose that Bob wants to cheat Alice. If no control is made by the cloud storage system, Bob can upload a file \(F_1\) under the identifier of a file \(F_2\). Then, if Alice wishes to upload \(F_2\) with its real identifier, the system will detect that the identifier of \(F_2\) already exists and will not store \(F_2\). Instead, the system will store a reference meaning that Alice also owns the file corresponding to the identifier of \(F_2\), which in fact is \(F_1\). Later, when Alice requests the file corresponding to the identifier of \(F_2\), the system will send her \(F_1\).
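The attack can be replayed against a toy in-memory store that blindly trusts client-supplied identifiers (a deliberately naive sketch, not the scheme of [1]):

    import hashlib

    def ident(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    store = {}  # identifier -> content; identifiers are not verified

    F1 = b"Bob's bogus file"
    F2 = b"Alice's real file"

    # Bob uploads F1 under the identifier of F2.
    store.setdefault(ident(F2), F1)

    # Alice uploads F2 with its real identifier; the identifier already
    # exists, so the system "de-duplicates" and keeps no copy of F2.
    store.setdefault(ident(F2), F2)

    # Later, Alice requests the file for ident(F2) and receives F1.
    assert store[ident(F2)] == F1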
According to [1], the most common types of attacks that appear in the literature (apart from the manipulation of data identifiers) are:
Monitoring the network traffic on the client side
This gives an attacker the capability to determine whether inter-user de-duplication has been applied to a given piece of data. The attacker can exploit this knowledge to learn whether data items have been stored in the cloud storage system by other users, or even to learn the content of these data items. For instance, to determine whether a file \(F\) is stored in the system, the attacker simply tries to upload \(F\) and observes the network traffic from his/her device toward the storage server. If \(F\) is actually uploaded, then \(F\) did not exist in the storage system before. This makes the storage system vulnerable to brute-force attacks on file contents. For instance, to determine which copies of a movie exist on a specific cloud storage system, an attacker can try to upload each format or version of the movie and conclude that those that are not actually uploaded already exist in the storage system.
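A sketch of this observation in Python (api and its bytes_sent traffic counter are hypothetical; the point is that the attacker only inspects his/her own outgoing traffic):

    import hashlib

    def exists_in_storage(api, candidate: bytes) -> bool:
        # Try to upload the candidate and watch our own outgoing traffic:
        # if (almost) only the short identifier leaves the machine, the
        # server de-duplicated the file, i.e. some user already stored it.
        identifier = hashlib.sha256(candidate).hexdigest()
        sent_before = api.bytes_sent()          # hypothetical traffic counter
        api.put(identifier, candidate)
        sent = api.bytes_sent() - sent_before
        return sent < len(candidate)            # no full upload => it existed

    # Brute force over the formats/versions of a movie:
    # for version in candidate_versions:
    #     if exists_in_storage(api, version):
    #         print("some user already stores this version")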
Backup time observation
Detecting inter-user de-duplication by observing the backup duration gives an attacker the ability to perform the same attacks as those based on network traffic observations. However, observing the duration of a backup operation is less accurate than monitoring the network traffic, as the duration also depends on the size of the file and the state of the network. For small files the observation may be inconclusive, while for larger files it becomes more reliable.
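The corresponding timing test might look as follows (a rough sketch; api.put and the calibration of full_upload_s are assumptions):

    import time

    def looks_deduplicated(api, candidate: bytes, full_upload_s: float) -> bool:
        # A backup that finishes much faster than a full upload of the file
        # suggests that only a reference was stored on the server side.
        start = time.monotonic()
        api.put(candidate)                      # hypothetical put call
        elapsed = time.monotonic() - start
        return elapsed < 0.5 * full_upload_s    # calibration is up to the attacker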
In the literature one can find many different schemes for data de-duplication. Here we describe one that appears to be secure against the manipulation of identifiers, network traffic monitoring, and backup time observation.
(from [1])
The scheme combines both intra- and inter-user de-duplication techniques by introducing de-duplication proxies (DPs) between the clients and the storage server (SS).
Intra-user de-duplication
Is performed by the client. Data confidentiality can be strengthened by letting clients encrypt their files before uploading them. The scope of the intra-user de-duplication is limited to the files previously stored by the client who is trying to store a file. This protects the storage system against both identifier-manipulation attacks and network traffic observations: each client uploads each of his/her files to his/her associated DP exactly once, even if the same file already exists in the storage system (due to a prior upload by another client). Consequently, no client can determine whether a file already exists in the storage system through network traffic observations.
Inter-user de-duplication
Is performed by the DPs. To make the upload of a file to the SS indistinguishable from the sending of a reference (when the referred file already exists in the storage system), the concerned DP may introduce an extra delay before sending the notification to the client. The goal of this delay is to make the duration of storing a reference similar to that of storing the entire file, which protects the system against attacks based on backup time observation.
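A sketch of the delay padding (the fixed target duration target_s is an assumption; [1] only requires that storing a reference and storing the whole file take a similar time from the client's point of view):

    import time

    def notify_after_padding(handle_request, notify_client, target_s: float):
        # Perform the real work (forwarding the file to the SS, or merely
        # asking the SS to store a reference), then sleep until roughly
        # target_s has passed, so that both cases look alike to the client.
        start = time.monotonic()
        result = handle_request()
        remaining = target_s - (time.monotonic() - start)
        if remaining > 0:
            time.sleep(remaining)
        notify_client(result)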
The put operation proceeds as follows (a compact simulation of the whole exchange is sketched after the steps).
Client:
- Computes the identifier of the file to be stored and sends a put request containing this identifier to its DP.
De-duplication proxy:
- Forwards the request to the SS.
Storage server:
- Checks whether the identifier already exists in the system and, if so, whether the requesting client is among the owners of the corresponding file.
- Sends one of the following three responses to the DP: Intra-user de-duplication (the client has already stored this file), Inter-user de-duplication (the file has been stored by another client), or File upload (the file does not exist in the system yet).
De-duplication proxy:
- If it gets Intra-user de-duplication as the response from the SS, it forwards the Intra-user de-duplication response to the client. Otherwise, it forwards File upload to the client, so that the client cannot distinguish an inter-user de-duplication from a genuine upload.
Client:
- In the case of Intra-user de-duplication, the intra-user de-duplication is done and the put operation is finished. In the case of File upload, the client:
- (optionally) encrypts the file and
- uploads it to the DP.
De-duplication proxy:
- If the response of the SS was File upload, it forwards the received file to the SS. If it was Inter-user de-duplication, it discards the received file and asks the SS to store only a reference for this client, possibly delaying the final notification so that both cases take a similar amount of time (see the delay padding above).
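Putting the steps together, the following compact Python simulation illustrates the whole exchange under the assumptions above (the in-memory data structures and names are illustrative, not the data model of [1]):

    import hashlib

    INTRA = "Intra-user de-duplication"
    INTER = "Inter-user de-duplication"
    UPLOAD = "File upload"

    class StorageServer:
        def __init__(self):
            self.blobs = {}     # identifier -> content (one physical copy)
            self.owners = {}    # identifier -> set of client ids

        def check(self, client_id, identifier):
            if identifier not in self.owners:
                return UPLOAD
            if client_id in self.owners[identifier]:
                return INTRA
            return INTER

        def store(self, client_id, identifier, content):
            self.blobs[identifier] = content
            self.owners[identifier] = {client_id}

        def add_reference(self, client_id, identifier):
            self.owners[identifier].add(client_id)

    class DeduplicationProxy:
        def __init__(self, ss):
            self.ss = ss
            self.pending = {}   # (client_id, identifier) -> real SS response

        def put_request(self, client_id, identifier):
            response = self.ss.check(client_id, identifier)
            if response == INTRA:
                return INTRA
            # Hide inter-user de-duplication: the client is told to upload
            # in both remaining cases and cannot distinguish them.
            self.pending[(client_id, identifier)] = response
            return UPLOAD

        def put_file(self, client_id, identifier, content):
            real = self.pending.pop((client_id, identifier))
            if real == UPLOAD:
                self.ss.store(client_id, identifier, content)
            else:
                # Inter-user de-duplication: discard the uploaded content
                # and only record a reference (a delay may be added here).
                self.ss.add_reference(client_id, identifier)

    class Client:
        def __init__(self, client_id, dp):
            self.client_id, self.dp = client_id, dp

        def put(self, content):
            identifier = hashlib.sha256(content).hexdigest()
            if self.dp.put_request(self.client_id, identifier) == INTRA:
                return                      # already stored by this client
            # Otherwise the client uploads the file exactly once.
            self.dp.put_file(self.client_id, identifier, content)

    ss = StorageServer()
    dp = DeduplicationProxy(ss)
    alice, bob = Client("alice", dp), Client("bob", dp)
    alice.put(b"movie")             # File upload
    bob.put(b"movie")               # inter-user dedup, invisible to Bob
    alice.put(b"movie")             # intra-user dedup, visible to Alice
    assert len(ss.blobs) == 1       # only one physical copy is stored

Note that in this sketch the client never learns whether the SS answered File upload or Inter-user de-duplication: it observes the same response and sends the same traffic in both cases.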
[1] Meye, P., Raipin, P., Tronel, F., Anceaume, E.: A secure two-phase data deduplication scheme. In: 6th International Symposium on Cyberspace Safety and Security (CSS 2014), Paris, France, August 2014.
[2] Ning, P., De Capitani di Vimercati, S., Syverson, P.F. (eds.): Proceedings of the 2007 ACM Conference on Computer and Communications Security, CCS 2007, Alexandria, Virginia, USA, October 28-31, 2007. ACM (2007)