Volume 18, No. 4, 2021

Human-In-The-Loop Data Classification Framework Using Locality Sensitive Hashing


T. Jebeula, Dr. J. Jebamalar Tamilselvi

Abstract

The data generated by the Internet of Things (IoT) is often high-dimensional, arrives in large volumes at high velocity, and comes from heterogeneous sources, all of which aggravate the challenges of unifying and pre-processing the data before analysis can extract valuable information. A classical ETL framework with a rule-based data classification approach may not scale to such high-dimensional big data. Machine learning techniques work well for high-dimensional data unification tasks, but they are accurate only when a sufficient amount of labelled training data is available. Obtaining correctly labelled data in a big data environment is challenging, labor-intensive, and cost-prohibitive. This research aims to develop a scalable and efficient framework for classifying high-dimensional big data by combining a probabilistic hashing algorithm (SimHash) with a Human-in-the-Loop machine learning strategy, so that a classifier can be built with less labelled training data. Probabilistic hashing algorithms address the problems associated with the curse of dimensionality in data classification tasks. The human is brought back into the machine learning loop through active learning strategies, in which human and machine learning processes interact to select the most informative samples for labelling, so that a model of the desired accuracy can be built faster and with significantly less labelled data. An evaluation of the proposed framework on a real-time dataset shows that it is scalable, sustainable, and efficient in classifying high-dimensional data.
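
The two building blocks named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`simhash`, `hamming`, `least_confident`), the use of MD5 as the per-token hash, the 64-bit fingerprint width, and the least-confidence variant of uncertainty sampling are all assumptions made here for illustration.

```python
import hashlib

BITS = 64  # fingerprint width; an illustrative choice, not taken from the paper


def simhash(tokens):
    """Return a 64-bit SimHash fingerprint for an iterable of string tokens."""
    counts = [0] * BITS
    for tok in tokens:
        # Stable per-token hash: first 8 bytes of MD5 (assumed, for illustration).
        h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:8], "big")
        for i in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def least_confident(probabilities, k):
    """Indices of the k samples whose highest class probability is lowest.

    `probabilities` is a list of per-sample class-probability lists, as
    produced by any probabilistic classifier. These are the samples a
    human annotator would be asked to label next.
    """
    ranked = sorted(range(len(probabilities)), key=lambda i: max(probabilities[i]))
    return ranked[:k]
```

Records that share most of their tokens tend to receive fingerprints with a small Hamming distance, which is what lets a fixed-width hash stand in for a high-dimensional feature vector. In an active learning loop, `least_confident` would be called on the current model's predictions for the unlabelled pool, the returned samples labelled by a human, and the model retrained.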


Pages: 48-60

Keywords: ETL; Active Learning; SimHash; Clustering; Curse of Dimensionality; Uncertainty Sampling; Data Labeling.
