Data De-duplication Using Plagiarism Detection



Cloud storage is a remote storage service where users can upload and download their data anytime and anywhere. Data duplication, data leakage, and storage space consumption are among the main issues concerning cloud storage. At upload time, data is converted into binary form. In the proposed method, a ciphertext is created using the AES algorithm and stored in the cloud. In addition, a hash value is created using the MD5 algorithm and stored in a hash table. In parallel, a plagiarism check performs content checking. Of the different methods of plagiarism detection, the syntactic-based method is used here; it does not consider the meaning of words, phrases, or sentences. Content checking is performed to eliminate duplicate ciphertexts. A threshold value is set, and if the content similarity is smaller than the threshold value, the user is asked whether to upload the data. Data compression is performed at the time of uploading data to the cloud in order to reduce the amount of storage space required. Together, the plagiarism check and compression avoid unnecessary memory usage. The proposed method offers better data security than the existing one, and users can utilize the data as needed.

A critical challenge for cloud storage is the management of the large volume of accumulating data. To ease data management, data compression and data deduplication techniques have been proposed and have attracted considerable attention. Since the amount of stored data is large, there may be many duplicate copies. To avoid this unwanted data and save storage space, a specialized data compression technique is used to remove redundant data, which reduces the number of bytes stored in the cloud. Only one copy of the duplicated data is kept, and the remaining copies are excluded.

Redundant data are replaced with pointers, so that only unique data need be stored and retrieved. Users holding the same file are given pointers to the single stored copy, so there is no need to upload the file again. Despite these advantages, security issues may arise both internally and externally. Hence, encryption techniques are applied and the data are stored as ciphertexts. Deduplication of encrypted data is made possible even though different users produce different ciphertexts.
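The pointer scheme can be sketched as a content-addressed store: each unique blob is kept once, and every file name is merely a pointer to its content hash. This is an illustrative Python sketch, not the paper's actual implementation; the class and method names are invented for the example.

```python
import hashlib

class DedupStore:
    """Toy content-addressed store: identical blobs share one stored copy."""

    def __init__(self):
        self.blobs = {}      # content hash -> data (the single stored copy)
        self.pointers = {}   # file name -> content hash (the "pointer")

    def put(self, name, data):
        digest = hashlib.md5(data).hexdigest()
        if digest not in self.blobs:     # store only the first copy
            self.blobs[digest] = data
        self.pointers[name] = digest     # duplicates just add a pointer

    def get(self, name):
        return self.blobs[self.pointers[name]]
```

Uploading the same bytes under two names consumes the space of one copy, since the second upload only records a pointer.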

Users download the encrypted file and then use AES keys to decrypt it. Authorization is provided to guide the user while uploading a file to the cloud. Users without proper authentication are not allowed to perform duplicate checks. These checks are carried out in the public cloud. After a file is transmitted, a check is made for any existing privileges that match the privileges of the newly uploaded data. For efficient storage of uploaded data, a Storage Service Provider is employed.


Cloud computing has become a hot topic and brings many advantages through various services. The complex hardware, databases, and operating systems can be handled by a cloud server. Users only need some simple devices that can connect to the cloud server. However, in this environment, the cloud server can obtain and control all the uploaded data, because all the data are stored or operated on in the cloud. Security and privacy issues are therefore very important in cloud computing. In order to protect privacy, users encrypt their data with encryption algorithms and upload the encrypted data to the cloud.

This project focuses on the cloud storage service. Users can store their data in cloud storage and download the stored data from anywhere. Even if users exhaust their own storage space, the cloud storage server can expand the space without destroying the stored data. However, the fast growth of storage requirements burdens the cloud storage, which is not infinite. The cloud storage server therefore typically applies the data de-duplication technique to reduce the consumption of storage space.

At the same time, the goal of encryption is to keep information secret and make it difficult to distinguish the encrypted data (i.e., ciphertexts) from random values. If an encryption scheme is secure, it is hard to obtain information from ciphertexts. Hence, encrypted data de-duplication becomes a challenge, because the first step of data de-duplication is to search for duplicate data.

Cloud computing increases speed and agility by providing access over the internet to the hardware and software of a data center. It is a term used to describe a class of network-based computing that takes place over the internet, comprising the provision of dynamically scalable and virtualized resources as a service. This technology allows more efficient computation by centralizing storage, memory, processing, and bandwidth.


Cloud storage is a remote storage service, where users can upload and download their data anytime and anywhere. It raises issues regarding privacy and data confidentiality because all the data are stored in the cloud storage. This is a subject of concern for users, and it affects their willingness to use cloud storage services. On the other hand, a cloud storage server typically performs a specialized data compression technique (data deduplication) to eliminate duplicate data because the storage space is not infinite.

Data deduplication, which makes it possible for data owners to share a copy of the same data, can be performed to reduce the consumption of storage space. The existing system provides an encrypted data deduplication mechanism that enables the cloud storage server to eliminate duplicate ciphertexts and improves privacy protection.

There are two methods of data de-duplication:
Source-Based Approach

With the source-based approach, de-duplication acts on the client side: before actually uploading, client users query the storage server to ask whether the data has already been uploaded. The data is most often de-duplicated by software agents, installed on the source servers, working with a central de-duplication appliance. Only unique data is sent across the network. In addition to the capacity benefits, there are significant advantages to be gained from the reduction in network traffic. This can be very beneficial for organizations with large campuses, or for organizations with remote offices that back up to a central location.

Examples of source-based de-duplication are EMC's Avamar and CommVault's Simpana 9. Avamar is based on a storage grid appliance, and priced on capacity of this appliance with unlimited, no additional cost, software agents for the source servers. Simpana 9 is a pure software solution, and will run on a wide range of hardware.

However, the performance capabilities of the equipment must be up to the task, and CommVault makes some recommendations in this regard. Software-based solutions provide for a great deal of flexibility but often at the cost of increased complexity, so it is important that your chosen supplier is experienced with not only the chosen product, but also its suitability to the specific application.

Target-Based Approach

In the target-based approach, the steps of data de-duplication are handled by the storage server, while client users simply upload and download their data. Target-based deduplication employs a disk storage device as the data repository, or target. The data is driven to the target using standard backup software. Once it reaches the device, deduplication is either performed as the data enters the target (in-line processing) or performed after the entire backup job has arrived in its raw state (post-process). There are pros and cons to each of these methods, and picking the correct technology for your specific environment is important.

A good example of in-line technology is EMC's Data Domain product line. These appliances have extremely fast and capable processing power and are specifically designed to handle and deduplicate data as fast as it can be supplied to them. In fact, network performance is often found to be the limiting factor on backup speed, rather than de-dupe processing. A key benefit of in-line systems such as these is the ability to replicate to disaster recovery sites immediately, because the data is deduplicated as soon as it is received.

Post-process products ingest data to local storage, and then process the stored data. In some products, the de-dupe process can start at the same time as the backup starts, but in most cases the process lags the incoming data and can take considerable time to complete. This method avoids the need for high-performance processing power, which reduces cost. However, there are trade-offs. First, you cannot replicate data until the whole backup is deduplicated. Second, the solution needs more disk storage capacity than an in-line method, as it must be able to store a complete backup session in un-deduplicated form.

Limitations: In both methods there is no content checking. Files with similar content may still be uploaded to the cloud, which leads to wasted space.


Document Plagiarism: As opposed to other types of plagiarism (such as music, graphs, etc.), document plagiarism falls into two categories: source code plagiarism and free text plagiarism. Given the constraints and keywords of programming languages, detecting the former is easier than detecting the latter, and hence source code plagiarism detection is not the focus of the current research. Plagiarism takes several forms. Maurer et al. state that the following are some of the practices considered free text plagiarism:

  • Copy-paste: or verbatim (word-for-word) plagiarism, in which the textual contents are copied from one or multiple sources. The copied contents might be modified slightly.
  • Paraphrasing: changing grammar, using synonyms of words, re-ordering sentences in original work, or restating same contents in different semantics.
  • No proper use of quotation marks: failing to identify exact parts of borrowed contents.

Plagiarism detection methods can be broadly classified into three main categories. The first category tries to capture the author style of writing and find any inconsistent change in this style. This is known as Stylometry analysis. The second category is more commonly used which is based on comparing multiple documents and identifying overlapping parts between these documents. The third category takes a document as input and then searches for plagiarism patterns over the Web either manually or in an automated manner.

Fig 1: Taxonomy of plagiarism detection methods

Syntactic-Based Detection: Unlike semantic-based methods, syntactic-based methods do not consider the meaning of words, phrases, or sentences. Thus the two words “exactly” and “equally” are considered different. This is, of course, a major limitation of these methods in detecting some kinds of plagiarism. Nevertheless, they can provide a significant speedup compared to semantic-based methods, especially for large data sets, since the comparison does not involve deeper analysis of the structure and/or the semantics of terms. To quantify the similarity between chunks, a similarity measure is usually used.
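One common syntactic similarity measure is the Jaccard index over word sets; the paper does not name its measure, so this is an illustrative sketch only. Note how it treats “exactly” and “equally” as entirely different tokens, exactly the limitation described above.

```python
def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Word-level Jaccard similarity: |A ∩ B| / |A ∪ B| over token sets."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def is_duplicate(text_a: str, text_b: str, threshold: float = 0.8) -> bool:
    """Flag a pair as duplicate when similarity reaches the threshold
    (0.8 is an assumed example value, not a figure from the paper)."""
    return jaccard_similarity(text_a, text_b) >= threshold
```

Because the comparison is purely on token sets, it is fast for large collections but blind to synonyms and paraphrase.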

Porter stemmer

The Porter stemmer algorithm is a process for removing common morphological and inflectional endings (suffixes) from English words. There are several types of stemming algorithms, which differ with respect to performance and accuracy and in how certain stemming obstacles are overcome.

Porter’s algorithm is important for two reasons. First, it provides a simple approach to conflation that seems to work well in practice and that is applicable to a range of languages. Second, it has spurred interest in stemming as a topic for research in its own right, rather than merely as a low-level component of an information retrieval system. The algorithm was first published in 1980; however, it and its descendants continue to be employed in a range of applications that stretch far beyond its original intended use.

Suffix-stripping algorithm: Suffix stripping algorithms do not rely on a lookup table that consists of inflected forms and root form relations. Instead, a typically smaller list of 'rules' is stored which provides a path for the algorithm, given an input word form, to find its root form. Some examples of the rules include:

  • if the word ends in 'ed', remove the 'ed'
  • if the word ends in 'ing', remove the 'ing'
  • if the word ends in 'ly', remove the 'ly'

Suffix stripping approaches have the benefit of being much simpler to maintain than brute force algorithms, assuming the maintainer is sufficiently knowledgeable in the challenges of linguistics and morphology and in encoding suffix stripping rules. Suffix stripping algorithms are sometimes regarded as crude, given their poor performance on exceptional relations (like 'ran' and 'run'). The solutions produced by suffix stripping algorithms are limited to words with well-known suffixes and few exceptions. This is a problem, as not all parts of speech have such a well-formulated set of rules; lemmatization attempts to improve on this. Prefix stripping may also be implemented. Of course, not all languages use prefixing or suffixing.
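The three rules listed above can be sketched directly in Python. This is only the toy rule list from the text, not the full Porter algorithm, and the length guard is an assumption added to avoid stripping very short words.

```python
def strip_suffix(word: str) -> str:
    """Apply the simple rules above in order: strip 'ed', 'ing', or 'ly'.
    The length guard keeps at least two characters of stem."""
    for suffix in ("ed", "ing", "ly"):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word
```

The crudeness noted above is visible immediately: `strip_suffix("running")` yields `"runn"`, and irregular forms like `"ran"` are left untouched rather than mapped to `"run"`.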

This module manages the downloading, uploading, and searching of files. In the uploading process, the file data is converted into binary form and encrypted using AES. Content checking is performed using the plagiarism check, and the file is then uploaded or discarded. In the downloading process, a request is made for particular data and a search is performed; if the data is available, it is downloaded.
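The upload decision described above (exact-duplicate check via the hash table, then the content check against a threshold) can be sketched as follows. The function names, the threshold value, and the simple word-set similarity are all illustrative assumptions, not the paper's code.

```python
import hashlib

SIMILARITY_THRESHOLD = 0.8  # assumed example value; the paper leaves it configurable

def _similarity(a: str, b: str) -> float:
    """Crude word-set similarity used as a stand-in content check."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0

def try_upload(name, text, hash_table, stored_texts):
    """Decide what the upload module should do with a new file."""
    digest = hashlib.md5(text.encode()).hexdigest()
    if digest in hash_table:                 # exact duplicate: keep one copy
        return "duplicate"
    for other in stored_texts.values():      # content check (plagiarism test)
        if _similarity(text, other) >= SIMILARITY_THRESHOLD:
            return "similar"                 # near-duplicate: ask the user
    hash_table[digest] = name                # distinct enough: accept upload
    stored_texts[name] = text
    return "uploaded"
```

The hash table catches byte-identical files cheaply; only files that pass that check pay the cost of the content comparison.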


AES was published by the National Institute of Standards and Technology in 2001. AES is a symmetric block cipher. The cipher takes a plaintext block of 128 bits, or 16 bytes. The key length can be 16, 24, or 32 bytes (128, 192, or 256 bits), and the algorithm is referred to as AES-128, AES-192, or AES-256 accordingly.

Each word is 4 bytes, and the total key schedule is 44 words for the 128-bit key. The ordering of bytes within the state matrix is by column. The cipher consists of N rounds, where the number of rounds depends on the key length: 10 rounds for a 16-byte key, 12 rounds for a 24-byte key, and 14 rounds for a 32-byte key.

There are four different stages used:

1. Substitute bytes: Uses an S-box to perform a byte-by-byte substitution of the block.

2. Shift Rows: A simple permutation.

3. Mix Columns: A substitution that makes use of arithmetic over GF(2^8).

4. AddRoundKey: A simple bitwise XOR of the current block with a portion of the expanded key.
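Two of these stages are simple enough to sketch in a few lines of Python. This is an illustrative fragment only, not a usable AES implementation (SubBytes and MixColumns are omitted, and a real system should use a vetted crypto library).

```python
def add_round_key(state: bytes, round_key: bytes) -> bytes:
    """Stage 4 (AddRoundKey): bitwise XOR of the 16-byte state with a round key."""
    return bytes(s ^ k for s, k in zip(state, round_key))

def shift_rows(state: bytes) -> bytes:
    """Stage 2 (ShiftRows): with the state stored column by column,
    row r of the 4x4 byte matrix is rotated left by r positions."""
    out = bytearray(16)
    for r in range(4):
        for c in range(4):
            out[r + 4 * c] = state[r + 4 * ((c + r) % 4)]
    return bytes(out)
```

Note that AddRoundKey is its own inverse (XORing twice with the same key restores the state), which is why the same stage appears in both encryption and decryption.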


The MD5 message-digest algorithm is a widely used cryptographic hash function producing a 128-bit (16-byte) hash value, typically expressed in text format as a 32-digit hexadecimal number. MD5 has been utilized in a wide variety of cryptographic applications and is also commonly used to verify data integrity, although it is no longer considered collision-resistant and should not be relied on where an attacker may craft colliding inputs.

MD5 processes a variable-length message into a fixed-length output of 128 bits. The input message is broken up into 512-bit blocks (sixteen 32-bit words); the message is padded so that its length is divisible by 512. The padding works as follows: first a single 1 bit is appended to the end of the message. This is followed by as many zeros as are required to bring the length of the message up to 64 bits fewer than a multiple of 512.

The remaining 64 bits are filled with the length of the original message, modulo 2^64. MD5 digests have been widely used in the software world to provide some assurance that a transferred file has arrived intact.
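In Python the whole digest pipeline (padding included) is available through the standard library's `hashlib`, which is how a per-file hash value for the hash table can be produced:

```python
import hashlib

def md5_digest(data: bytes) -> str:
    """128-bit MD5 digest rendered as a 32-digit hexadecimal string."""
    return hashlib.md5(data).hexdigest()

# Identical inputs always hash to the same value, while any change to the
# file produces a completely different digest - the basis of both the
# integrity check and the hash-table duplicate lookup.
```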


GZIP is both a file format and a software application used for file compression and decompression. GZIP is based on the DEFLATE algorithm, which is a combination of LZ77 and Huffman coding. The GZIP file format begins with a 10-byte header containing a magic number, a version number, and a timestamp.

Although its file format also allows multiple such streams to be concatenated, GZIP is normally used to compress just single files. Compressed archives are typically created by assembling collections of files into a single tar archive and then compressing that archive. The ZIP format also uses DEFLATE. The ZIP format can hold a collection of files without an external archiver, but it is less compact than compressed tarballs holding the same data, because it compresses files individually and cannot take advantage of redundancy between files.
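Python's standard library exposes GZIP directly, so the compression step of the upload pipeline can be sketched as:

```python
import gzip

def compress(data: bytes) -> bytes:
    """DEFLATE-compress data into the GZIP container format."""
    return gzip.compress(data)

def decompress(blob: bytes) -> bytes:
    """Recover the original bytes from a GZIP blob (lossless)."""
    return gzip.decompress(blob)
```

The first two bytes of the output are the GZIP magic number (`0x1f 0x8b`) mentioned above, and redundant data compresses substantially, which is what reduces storage consumption in the cloud.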


Implementation and Testing Environment: We implemented the proposed scheme and tested its performance. We applied a MySQL database to store data files and related information. In our tests, we did not take into account the time of data uploading and downloading; we focused on the performance of the deduplication procedure and the algorithms designed in our scheme.

1: Efficiency of data encryption and decryption. In this experiment, we tested the operation time of data encryption and decryption with AES, applying different AES key sizes (128, 192, and 256 bits) and different data sizes (from 10 megabytes to 600 megabytes). The testing environment was an Intel Core i5-3337U CPU at 1.80 GHz with 4.00 GB RAM, running Ubuntu 13.10 with 2.0 GB RAM, a dual-core processor, and a 25.0 GB hard disk. As shown in Fig. 1, we observed that even when the data is as large as 600 MB, the encryption/decryption time is less than 13 seconds when applying a 256-bit AES key.

Applying symmetric encryption for data protection is a reasonable and practical choice. The time spent on AES encryption and decryption increases with the size of the data.

This is inevitable in any encryption scheme. Since AES is very efficient at data encryption and decryption, we also tested 1024-bit proxy re-encryption (PRE) with different sizes of AES symmetric keys.

Fig. 2 shows the operation time, which is the average over 500 tests. We observe that our scheme is very efficient. The time spent on PRE key pair generation (KeyGen), re-encryption key generation (ReKeyGen), encryption (Enc), re-encryption (ReEnc), and decryption (Dec) is not related to the length of the input key. For the three tested AES key sizes, the encryption time is less than 5 milliseconds, and the decryption time is about 1 millisecond. The main aim of our project is to avoid duplication and reduce the consumption of storage space in the cloud. The Amazon S3 cloud is used in our project. It is an online file storage web service offered by Amazon Web Services. Amazon S3 provides storage through web service interfaces (SOAP).

Amazon Web Services (AWS) is a collection of cloud computing services that make up the on-demand computing platform offered by Amazon. These services operate from 12 geographical regions across the world. The most central and best-known of these services arguably include Amazon Elastic Compute Cloud, also known as 'EC2', and Amazon Simple Storage Service, also known as 'S3'. AWS now has more than 70 services spanning compute, storage, networking, database, analytics, application services, deployment, management, and mobile. Amazon markets AWS as a service that provides large computing capacity more quickly and cheaply than a client company building an actual physical server farm.

Amazon Simple Storage Service (Amazon S3) provides developers and IT teams with secure, durable, highly scalable cloud storage. Amazon S3 is easy-to-use object storage, with a simple web service interface to store and retrieve any amount of data from anywhere on the web. With Amazon S3, you pay only for the storage you actually use; there is no minimum fee and no setup cost. Amazon S3 offers a range of storage classes designed for different use cases, including Amazon S3 Standard for general-purpose storage of frequently accessed data, Amazon S3 Standard - Infrequent Access (Standard - IA) for long-lived but less frequently accessed data, and Amazon Glacier for long-term archive. Amazon S3 also offers configurable lifecycle policies for managing your data throughout its lifecycle. Once a policy is set, your data will automatically migrate to the most appropriate storage class without any changes to your applications.

There are two types of users: registered users and an admin. A user who wants to use the cloud must first register. After approval by the admin, the user can log in to the cloud and upload or download files. There are two types of registered users: normal and premium. A normal user can use the cloud for 1 year, and a premium user for 1 year and 6 months. After the validity period they are blocked and can no longer use the cloud.


We propose a feasible data deduplication mechanism in which all data are stored in cipher form in the cloud storage. The two main functional modules of our project are uploading and downloading. Uploading supports two options: making the file public or private. There is no special condition for downloading; it is simply downloading.
The cipher structure in the cloud storage is formed using the Advanced Encryption Standard (AES) algorithm. Encrypting the data provides a high level of security; here the sender and receiver use the same encryption/decryption key. In future work, the upload and download functions could be extended to PDF documents as well as video and audio, and semantic-type plagiarism detection could be employed.
The heart of our project is content checking, which is done using plagiarism detection. The Porter stemmer enhances text matching by reducing words to their root forms. This avoids duplication of text files.
For each file, an independent hash value is generated using the Message Digest algorithm (MD5). In order to reduce the storage space, we compress the data using the GZIP mechanism.
