5.5. Secure data de-duplication

Data de-duplication categorization 1/7

Data de-duplication

It is a term used to describe an algorithm that eliminates duplicate copies of data from storage, saving only a single instance of replicated data. When a client wishes to store a piece of data a copy of which has already been saved in the storage system, a link to the existing copy is provided instead of storing another copy of the data. This reduces the cost of stored data, and mainly for economic reasons data de-duplication is gaining popularity.

From the data granularity point of view the de-duplication methods can be divided into two main categories:

  1. File-level
  2. Block-level
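The difference between the two categories can be sketched in a few lines of Python. This is a toy sketch only: it uses SHA-256 digests as identifiers and fixed-size blocks, whereas real systems often use variable-size, content-defined chunking.

```python
import hashlib

def file_level_ids(files):
    """File-level de-duplication: one identifier per whole file."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in files.items()}

def block_level_ids(data, block_size=4):
    """Block-level de-duplication: the file is split into fixed-size blocks
    and each block gets its own identifier, so files that share only some
    content still de-duplicate partially."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    return [hashlib.sha256(b).hexdigest() for b in blocks]

a = block_level_ids(b"AAAABBBBCCCC")
b = block_level_ids(b"AAAAXXXXCCCC")
# At file level the two files differ entirely; at block level they share
# two of their three blocks, so only one new block needs to be stored.
shared = sum(1 for x, y in zip(a, b) if x == y)
```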

Strategies for data de-duplication 2/7

According to [1], data de-duplication techniques for cloud storage can be divided into:

  1. Intra-user de-duplication - performed by a client on his or her own data.
  2. Inter-user de-duplication - takes into account the data previously stored by all clients.

Another categorization of strategies is by the location where de-duplication is performed:

  1. Source-based (client-side) de-duplication is handled by the client before data is transferred to the server.
  2. Target-based (server-side) de-duplication is handled by the server -- in this case the client is unmodified and need not be aware of any de-duplication.

Simple hashing approach 3/7

Simple hashing

As stated in [1], a common technique to de-duplicate data is hashing the content of the data and using this (unique) hashed value as the identifier of the data. In the source-based de-duplication scenario the client sends this identifier to the storage server to determine whether such an identifier already exists in the system.
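A minimal sketch of this technique in Python, with SHA-256 as the hash function; the dictionary `storage` is a stand-in for the server's index:

```python
import hashlib

storage = {}  # identifier -> data; stands in for the storage server's index

def put(data: bytes) -> str:
    """Simple hashing: the content hash is the identifier. In source-based
    de-duplication the client would send only this identifier first, and
    transfer the data only if the server does not know it yet."""
    identifier = hashlib.sha256(data).hexdigest()
    if identifier not in storage:
        storage[identifier] = data  # identifier unknown: the data is uploaded
    return identifier
```

Storing the same content twice yields the same identifier, so the second upload is skipped.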

Problem - manipulation of data identifiers

Suppose that Bob wants to cheat Alice. If no check is made by the cloud storage system, Bob can upload a file \(F_1\) under the identifier of a file \(F_2\). Then, if Alice wishes to upload \(F_2\) with its real identifier, the system will detect that the identifier of \(F_2\) already exists and will not store \(F_2\). Instead, the system will store a reference meaning that Alice also owns the file corresponding to the identifier of \(F_2\) - which is in fact \(F_1\). Later, when Alice requests the file corresponding to the identifier of \(F_2\), the system will send \(F_1\) to Alice.
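The attack can be reproduced against a naive store that trusts client-supplied identifiers. A toy sketch; `put_unchecked`, `get` and `store` are hypothetical names:

```python
import hashlib

store = {}  # identifier -> data; the server trusts client-supplied identifiers

def put_unchecked(identifier, data):
    # Flawed server: it never verifies that identifier == Hash(data).
    store.setdefault(identifier, data)

def get(identifier):
    return store[identifier]

F1 = b"Bob's bogus file"
F2 = b"Alice's real file"
id_F2 = hashlib.sha256(F2).hexdigest()

put_unchecked(id_F2, F1)  # Bob uploads F1 under F2's identifier
put_unchecked(id_F2, F2)  # Alice's upload is "de-duplicated" away
# Alice later receives F1 instead of F2:
assert get(id_F2) == F1
```

The fix, used by the scheme described later, is for a trusted party to recompute the hash of the uploaded data and abort when it does not match the claimed identifier.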

Other threats 4/7

According to [1], the most common types of attacks that appear in the literature (apart from the manipulation of data identifiers) are:

Monitoring the network traffic on the client side

It gives an attacker the capability to determine whether inter-user de-duplication has been applied to a given piece of data. Such an attacker can exploit this knowledge to identify data items stored in a cloud storage system by other users, or even learn the content of these data items. For instance, to determine whether a file \(F\) is stored in a system, the attacker simply tries to upload \(F\) and observes the network traffic from his/her device toward the storage server. If \(F\) is actually transferred, it means that \(F\) did not yet exist in the storage system. This makes the storage system vulnerable to brute-force attacks on file contents. For instance, to determine which copies of a movie exist on a specific cloud storage system, an attacker can try to upload each format or version of the movie and conclude that those that are not actually transferred already exist in the storage system.

Backup time observation

Detecting an inter-user de-duplication by observing the backup duration gives an attacker the ability to perform the same attacks as the ones done with network traffic observations. However, observing the duration of a backup operation is less accurate than network traffic monitoring, as it depends on the size of the file and the state of the network. For small files the observation may not be accurate, while for larger ones it gains in accuracy.


In the literature one can find many different schemes for data de-duplication. Here we describe one which seems to be secure against manipulation of identifiers, network traffic monitoring and backup time observation.

Secure two-phase data de-duplication scheme 5/7

(from [1])

It combines both intra and inter-user de-duplication techniques by introducing de-duplication proxies (DPs) between the clients and the storage server (SS).

The intra-user de-duplication

It is performed by the client. Data confidentiality can be strengthened by letting clients encrypt their files before uploading them. The scope of the intra-user de-duplication is limited to the files previously stored by the client trying to store a file. This protects the storage system against both identifier-manipulation attacks and network traffic observations: each client uploads each of his/her files to his/her associated DP exactly once, even if the same file already exists in the storage system (due to a prior upload by another client). This accordingly prevents any client from determining, through network traffic observations, whether a file already exists in the storage system.

The inter-user de-duplication

It is performed by the DPs. To make the upload of a file to the SS indistinguishable from the sending of a reference - when the referred file already exists in the storage system - the concerned DP may introduce an extra delay before sending the notification to the client. The goal of this delay is to make the duration of storing a reference similar to that of storing the entire file, which protects the system against attacks based on backup time observation.

Secure two-phase data de-duplication scheme - put operation for storing file 6/7

Secure two-phase data de-duplication scheme [1] - put operation for storing file \(F\) by the client with identifier \(C_{id}\)

Client:

  • \(f_h := Hash(F)\) -- computes a hash (pre-image, second pre-image and collision resistant hash function)
  • \(e_f := E_{sym}(f_h, F)\) -- encrypts file \(F\) using symmetric encryption scheme \(E_{sym}\) with the key \(f_h\)
  • \(F_{id} := Hash(e_f)\) -- computes an identifier of file \(F\)
  • sends \(\mbox{checkForDedup}(C_{id}, F_{id})\) to DP
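The client-side computations above can be sketched as follows. A toy sketch: SHA-256 stands in for \(Hash\), and a SHA-256-based XOR keystream stands in for a real symmetric cipher \(E_{sym}\) such as AES.

```python
import hashlib

def keystream(key: bytes, n: int) -> bytes:
    """Toy keystream from SHA-256 in counter mode - a stand-in for a real
    symmetric cipher, used here only for illustration."""
    out, counter = b"", 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]

def e_sym(key: bytes, data: bytes) -> bytes:
    """XOR with the keystream; applying it twice decrypts."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

F = b"some file content"
f_h = hashlib.sha256(F).digest()        # f_h := Hash(F)
e_f = e_sym(f_h, F)                     # e_f := E_sym(f_h, F)
F_id = hashlib.sha256(e_f).hexdigest()  # F_id := Hash(e_f)
# The client now sends checkForDedup(C_id, F_id) to its DP.
```

Because the key is derived from the file itself (convergent encryption), two clients holding the same file produce the same \(e_f\) and hence the same \(F_{id}\), which is what makes inter-user de-duplication of encrypted data possible.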

De-duplication proxy:

Forwards the request to SS

Storage server:

Sends one of the following three responses to DP:

  1. \((C_{id}, F_{id}, \mbox{Intra-user de-duplication})\) if \(C_{id}\) has already stored file that corresponds to identifier \(F_{id}\)
  2. \((C_{id}, F_{id}, \mbox{File upload})\) if the file corresponding to \(F_{id}\) does not exist in the storage system
  3. \((C_{id}, F_{id}, \mbox{Inter-user de-duplication})\) if the file corresponding to \(F_{id}\) has already been stored in the storage system by a client different from \(C_{id}\)
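The storage server's three-way decision can be sketched as follows (hypothetical `check_for_dedup` helper; `ownership` maps file identifiers to the sets of owning clients):

```python
def check_for_dedup(ownership, client_id, file_id):
    """Sketch of the SS's answer to checkForDedup(C_id, F_id)."""
    owners = ownership.get(file_id)
    if owners is None:
        return "File upload"                    # file unknown to the system
    if client_id in owners:
        return "Intra-user de-duplication"      # same client stored it before
    return "Inter-user de-duplication"          # stored by a different client

ownership = {"id42": {"alice"}}
assert check_for_dedup(ownership, "alice", "id42") == "Intra-user de-duplication"
assert check_for_dedup(ownership, "bob", "id42") == "Inter-user de-duplication"
assert check_for_dedup(ownership, "bob", "id99") == "File upload"
```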

De-duplication proxy:

If it receives Intra-user de-duplication as the response from SS, it forwards the Intra-user de-duplication response to the client. Otherwise (for both File upload and Inter-user de-duplication) it forwards File upload, so the client cannot tell whether the file already exists in the system.

Client:

In the case of Intra-user de-duplication, the intra-user de-duplication is done and the put operation is finished. In the case of File upload the client:

  • computes \(e'_f := E_{asym}(pk_{C_{id}}, f_h)\) -- encrypts the key \(f_h\) using an asymmetric encryption scheme \(E_{asym}\) with the public key \(pk_{C_{id}}\) (it is assumed that only \(C_{id}\) has the corresponding private key \(sk_{C_{id}}\))
  • sends \((C_{id}, F_{id}, e_f, e'_f)\) to DP

De-duplication proxy:

  1. Checks whether \(F_{id} = Hash(e_f)\). If not, the operation is aborted and a notification is sent to the client.
  2. If the consistency check is successful and the file upload has been requested because of an inter-user de-duplication, then only a reference - \((C_{id}, F_{id}, e'_f)\) is uploaded to the SS to be stored.
  3. If the consistency check is successful and the file upload has been requested because file does not exist in the storage then the upload message is forwarded to SS.
  4. Finally, the UploadFileResponse\((C_{id}, F_{id}, \mbox{OK})\) message is sent to the client. In order to ensure that inter-user de-duplication is unnoticeable to the client in the case of step 2, the DP delays the notification before sending it, to make the duration of the whole put operation similar to a transmission of the file to the SS. The added delay is thus a function of the size of the file and the state of the network.
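Steps 1-4 above can be sketched as follows. A toy sketch: `send_to_ss` and `dp_upload` are hypothetical names standing in for the network call and the DP's handler, and the padding delay is modelled simply as file size divided by an assumed bandwidth.

```python
import hashlib
import time

def dp_upload(ss_decision, client_id, file_id, e_f, e_prime_f,
              send_to_ss, bytes_per_second=1_000_000):
    """Sketch of the DP's handling of uploadFile(C_id, F_id, e_f, e'_f).
    `ss_decision` is the earlier checkForDedup answer from the SS."""
    # 1. Consistency check: abort if the identifier was manipulated.
    if hashlib.sha256(e_f).hexdigest() != file_id:
        return (client_id, file_id, "ABORTED")
    if ss_decision == "Inter-user de-duplication":
        # 2. Store only a reference ...
        send_to_ss((client_id, file_id, e_prime_f))
        # ... and pad the delay so the client cannot distinguish a
        # reference from a full upload by timing the operation.
        time.sleep(len(e_f) / bytes_per_second)
    else:
        # 3. The file is new: forward the full upload to the SS.
        send_to_ss((client_id, file_id, e_f, e_prime_f))
    # 4. Notify the client identically in both cases.
    return (client_id, file_id, "OK")
```

Note that the consistency check at the DP is exactly what defeats the identifier-manipulation attack described earlier.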

Secure two-phase data de-duplication scheme - get operation for retrieving the file 7/7

Secure two-phase data de-duplication scheme [1] - get operation for retrieving the file \(F\) by the client with identifier \(C_{id}\)

  1. The client sends \(\mbox{get}(C_{id}, F_{id})\) to DP.
  2. DP forwards the request to SS.
  3. SS looks for \(F_{id}\) in its file ownerships index. If it is not found among the identifiers of files belonging to \(C_{id}\), the SS sends a getResponse message to the DP with an error notification to terminate the get request. Otherwise, the SS sends a getResponse containing the ciphertext \(e_f\) corresponding to the identifier \(F_{id}\) and the encrypted decryption key \(e'_f\) corresponding to the client \(C_{id}\).
  4. DP forwards the response to the client.
  5. The client decrypts \(e'_f\) using his secret key \(sk_{C_{id}}\) and gets \(f_h\). Using \(f_h\) he decrypts \(e_f\) and gets the file \(F\).
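The final decryption round trip can be sketched as follows. A toy sketch: a SHA-256-based XOR keystream stands in both for \(E_{sym}\) and, purely for illustration, for \(E_{asym}\), which would be a real public-key scheme in practice.

```python
import hashlib

def stream(key: bytes, n: int) -> bytes:
    """Toy SHA-256 counter-mode keystream, for illustration only."""
    out, i = b"", 0
    while len(out) < n:
        out += hashlib.sha256(key + i.to_bytes(8, "big")).digest()
        i += 1
    return out[:n]

def xor_crypt(key: bytes, data: bytes) -> bytes:
    """XOR with the keystream; the same call encrypts and decrypts."""
    return bytes(a ^ b for a, b in zip(data, stream(key, len(data))))

# What the SS returns in its getResponse: (e_f, e'_f).
sk = b"client-secret"            # stand-in for the client's asymmetric key pair
F = b"the original file"
f_h = hashlib.sha256(F).digest()
e_f = xor_crypt(f_h, F)
e_prime_f = xor_crypt(sk, f_h)   # toy stand-in for E_asym(pk_Cid, f_h)

# Step 5, on the client side:
recovered_key = xor_crypt(sk, e_prime_f)     # decrypt e'_f with sk -> f_h
recovered_F = xor_crypt(recovered_key, e_f)  # decrypt e_f with f_h -> F
assert recovered_F == F
```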

Bibliography 1/1

[1] Meye, P., Raipin, P., Tronel, F., Anceaume, E.: A secure two-phase data deduplication scheme. In: 6th International Symposium on Cyberspace Safety and Security (CSS), Paris, France, August 2014.

[2] Ning, P., di Vimercati, S.D.C., Syverson, P.F.: Proceedings of the 2007 ACM Conference on Computer and Communications Security, CCS 2007, Alexandria, Virginia, USA, October 28-31, 2007. ACM (2007), 35.




The project "Cloud Computing – nowe technologie w ofercie dydaktycznej Politechniki Wrocławskiej" (UDA.POKL.04.03.00-00-135/12) is carried out within the Human Capital Operational Programme, Priority IV. Higher education and science, Measure 4.3. Strengthening the didactic potential of universities in key areas in the context of the goals of the Europe 2020 Strategy, co-financed by the European Social Fund and the state budget.