An Efficient Approach for Automated Token Formation for Record De-duplication with special reference to Real-Time Data-Warehouse Environment
Vaishali C. Wangikar1, Sachin N. Deshmukh2, Sunil G. Bhirud3
1Vaishali C. Wangikar, Senior Assistant Professor at MIT Academy of Engineering, Pune (M.H), India.
2Sachin N. Deshmukh, B.E. Department of Computer Science and Engineering, Dr. Babasaheb Ambedkar Marathwada University, Aurangabad (M.H), India.
3Sunil G. Bhirud, Professor at Computer Engineering and Information Technology Department, Veermata Jijabai Technological Institute (VJTI), Mumbai (M.H), India.
Manuscript received on 18 April 2019 | Revised Manuscript received on 25 April 2019 | Manuscript published on 30 April 2019 | PP: 151-159 | Volume-8 Issue-4, April 2019 | Retrieval Number: D6316048419/19©BEIESP
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: Record de-duplication is an important part of the data cleaning process of a data-warehouse. De-duplication is the identification of multiple duplicate entries of a single entity in a data-warehouse. Considerable research has been carried out on various aspects of record de-duplication, such as the use of blocking and indexing techniques, the choice of blocking predicate, the quality of blocking, and optimization of the comparison space. The de-duplication process requires special attention in a real-time environment. This research addresses automatic token formation for the real-time data de-duplication process. The proposed approach requires no human intervention in the de-duplication process. The proposed Optimized Automated Token Formation (OATF) is a two-step approach: in the first step, token candidates are generated, and in the second step, the optimal candidates are selected so as to assure maximum true positive coverage. Experiments show that OATF outperforms manual token formation by 29% and 14% on the Cora and Restaurant data-sets, respectively. It also shows 40% better results than the existing FDY-SNI algorithm on the Cora data-set. A framework for real-time de-duplication is also proposed, in which dis-joint sorted indexes are used to accomplish real-time data updates. Unlike other existing methods, it works well without any parameter setting by human experts for real-time de-duplication.
Keywords: Automated Token Formation; Automated Blocking Key Formation; Record De-Duplication; Automated Record Linkage; Dis-Joint Sorted Index; Recursive Feature Elimination; Real-Time Record De-Duplication; Real-Time Record Linkage; Real-Time Data-Warehousing; Data Cleansing.
Scope of the Article: Real-Time Communication
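
The abstract describes OATF only at a high level: generate token candidates, then keep the candidates that cover the most true duplicate pairs. As a rough illustration of that generate-then-select idea, the following Python sketch may help. It is not the authors' algorithm. The candidate form (a field name plus a prefix length), the greedy selection rule, the helper names (make_token, generate_candidates, covered_pairs, select_candidates), and the toy restaurant-style records are all assumptions introduced here for illustration.

```python
from itertools import product


def make_token(record, field, length):
    """Build a token: lowercase the field, drop whitespace, keep a prefix."""
    value = "".join(str(record.get(field, "")).lower().split())
    return value[:length]


def generate_candidates(fields, prefix_lengths=(3, 5)):
    """Step 1 (assumed form): every (field, prefix length) is a candidate."""
    return list(product(fields, prefix_lengths))


def covered_pairs(records, candidate, pairs):
    """Return the labelled duplicate pairs whose two records share a token."""
    field, length = candidate
    covered = set()
    for a, b in pairs:
        token_a = make_token(records[a], field, length)
        token_b = make_token(records[b], field, length)
        if token_a and token_a == token_b:
            covered.add((a, b))
    return covered


def select_candidates(records, candidates, true_pairs, max_keys=2):
    """Step 2 (assumed rule): greedily add the candidate that covers the
    most still-uncovered true duplicate pairs."""
    remaining = set(true_pairs)
    chosen = []
    while remaining and len(chosen) < max_keys:
        best, best_gain = None, set()
        for candidate in candidates:
            if candidate in chosen:
                continue
            gain = covered_pairs(records, candidate, remaining)
            if len(gain) > len(best_gain):
                best, best_gain = candidate, gain
        if best is None:  # no candidate covers any remaining pair
            break
        chosen.append(best)
        remaining -= best_gain
    return chosen, remaining


# Toy, made-up records keyed by record id (not taken from the paper's data).
records = {
    1: {"name": "Arnie Morton's of Chicago", "city": "los angeles"},
    2: {"name": "Arnie Morton's", "city": "los angeles"},
    3: {"name": "Art's Delicatessen", "city": "studio city"},
    4: {"name": "Arts Deli", "city": "studio city"},
    5: {"name": "Cafe Bizou", "city": "sherman oaks"},
}
true_pairs = {(1, 2), (3, 4)}  # labelled duplicates, assumed to be known

candidates = generate_candidates(["name", "city"])
chosen, uncovered = select_candidates(records, candidates, true_pairs)
print(chosen, uncovered)
```

This sketch scores candidates by true positive coverage alone. A practical selection step would presumably also account for the size of the resulting comparison space, which the abstract mentions but does not detail.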