Removal of Semi-Duplicated and Fully Duplicate Shards using Hadoop Techniques for Elastic Search
Subhani Shaik1, Nallamothu Naga Malleswara Rao2

1Subhani Shaik, Department of CSE, Acharya Nagarjuna University, Guntur (Andhra Pradesh), India.
2Nallamothu Naga Malleswara Rao, Department of IT, RVR & JC College of Engineering, Chowdavaram, Guntur (Andhra Pradesh), India.

Manuscript received on 18 February 2019 | Revised Manuscript received on 27 February 2019 | Manuscript published on 28 February 2019 | PP: 529-533 | Volume-8 Issue-3, February 2019 | Retrieval Number: C5972028319/19©BEIESP
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)

Abstract: Duplicate record identification is one of the most complex issues in a data warehouse. The problem arises when multiple databases are combined into a cluster, and identification must cover both semi-duplicated and fully duplicated records. Duplicate data identification is the task of finding all instances in which the same real-world values are represented multiple times, as required in customer relationship management, service administration and data mining; in data mining in particular, cleaning the input data is essential for producing useful results. In this manuscript an efficient algorithm is proposed for the effective removal of partially and fully copied data. The data in the database is divided into small parts called shards, within which duplicates can be identified easily and accurately. Duplicate detection is performed dynamically with the assistance of Hadoop and MapReduce techniques: duplicate shards are identified and completely erased from the dataset. An Enhanced De-Duplicate Remover (EDDR) algorithm is proposed to erase the redundant copied data and to process the remaining data effectively in the final stage. To identify repeated data, the method evaluates several parameters, and the detected redundant data is then erased according to the specified constraints. By removing duplicate shards, memory wastage is reduced.
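The EDDR algorithm itself is not detailed in this abstract. Purely as an illustrative sketch of the general MapReduce duplicate-detection pattern the abstract refers to (not the authors' actual implementation), a Hadoop Streaming mapper/reducer pair could hash each record and retain a single copy per hash; the file names mapper.py and reducer.py and the job invocation below are hypothetical.

# mapper.py -- illustrative Hadoop Streaming mapper: emit a content hash as the key
# so that identical (or normalised near-identical) records land in the same reduce group.
import sys
import hashlib

for line in sys.stdin:
    record = line.rstrip("\n")
    # Normalise whitespace and case so trivially differing copies collide on one key.
    normalised = " ".join(record.lower().split())
    key = hashlib.md5(normalised.encode("utf-8")).hexdigest()
    print(f"{key}\t{record}")

# reducer.py -- illustrative Hadoop Streaming reducer: keep one record per hash key,
# discarding the remaining copies (the duplicate-removal step).
import sys

current_key = None
for line in sys.stdin:
    key, _, record = line.rstrip("\n").partition("\t")
    if key != current_key:
        print(record)       # first occurrence of this hash is retained
        current_key = key   # later records with the same key are dropped

Such a pair would typically be run over the sharded input with Hadoop Streaming, e.g. hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py -input shards/ -output deduped/ (paths are placeholders); detecting semi-duplicated records would additionally require a similarity comparison within each reduce group rather than exact hash matching.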
Keywords: Duplicate Detection, Data Cleaning, MapReduce, Data Purification, Partial Duplication, Hadoop, Duplicate Data Removal Method

Scope of the Article: Data Analytics