A Data-Centric Approach to Improve Machine Learning Model’s Performance in Production
Pritom Bhowmik1, Arabinda Saha Partha2

1Pritom Bhowmik*, B.Tech. Department of Computer Science & Engineering, Institute of Engineering & Management, Salt-Lake, Kolkata, India.
2Arabinda Saha Partha, B.Tech. Department of Computer Science & Engineering, Institute of Engineering & Management, Salt-Lake, Kolkata, India.
Manuscript received on October 09, 2021. | Revised Manuscript received on October 27, 2021. | Manuscript published on October 30, 2021. | PP: 240-243 | Retrieval Number: 100.1/ijeat.A32011011121 | DOI: 10.35940/ijeat.A3201.1011121
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)

Abstract: Machine learning teaches computers to think in a way similar to how humans do. An ML model works by exploring data and identifying patterns with minimal human intervention. A supervised ML model learns by mapping an input to an output based on labeled examples of input-output (X, y) pairs, whereas an unsupervised ML model works by discovering previously undetected patterns and information in unlabeled data. Because an ML project is an extensively iterative process, there is always a need to change the ML code/model and the datasets. Once an ML model reaches 70-75% accuracy, the code or algorithm is most probably working fine. Nevertheless, in many cases, e.g., medical or spam detection models, 75% accuracy is too low to deploy in production. A medical model used in sensitive tasks such as detecting certain diseases must reach an accuracy level of 98-99%, which is a big challenge to achieve. In that scenario, the model itself may already work well, so a model-centric approach may not help much in reaching the desired accuracy threshold. Improving the dataset, however, will improve the overall performance of the model, and doing so does not always require bringing more and more data into the dataset. Improving the quality of the data by establishing a reasonable baseline level of performance, ensuring labeler consistency, conducting error analysis, and auditing performance will thoroughly improve the model’s accuracy. This review paper focuses on the data-centric approach to improving the performance of a production machine learning model.
Keywords: Annotation, Augmentation, big-data, bias-error, baseline, consistent-labeling, Data-centric, model-centric, error-analysis, good-data, Model-accuracy, Human-level-performance, proxy.