Towards Optimization of Malware Detection using Chi-square Feature Selection on Ensemble Classifiers
Fadare Oluwaseun Gbenga1, Adetunmbi Adebayo Olusola2, Oyinloye Oghenerukevwe Eloho3, Mogaji Stephen Alaba4

1Fadare Oluwaseun Gbenga*, Dept. of Computer Science, Joseph Ayo Babalola University, Ikeji-Arakeji, Osun-State, Nigeria.
2Prof. Adetunmbi Adebayo Olusola, Dept. of Computer Science, Federal University of Technology, Akure. Ondo-State. Nigeria.
3Dr. (Mrs) Oyinloye Oghenerukevwe Eloho, Dept. of Computer Science, Ekiti State University, Ado-Ekiti. Ekiti-State. Nigeria.
4Dr. Mogaji Stephen Alaba, School of Computing, Federal University of Technology, Akure. Nigeria.
Manuscript received on March 25, 2021. | Revised Manuscript received on April 29, 2021. | Manuscript published on April 30, 2021. | PP: 254-262 | Volume-10 Issue-4, April 2021. | Retrieval Number: 100.1/ijeat.D23590410421 | DOI: 100.1/ijeat.D23590410421
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)

Abstract: The proliferation of malware variants is probably the greatest problem in computer security, and protecting information in the form of source code against unauthorized access is a central issue in the field. In recent times, machine learning has been extensively researched for malware detection, and ensemble techniques have been shown to be highly effective in terms of detection accuracy. This paper proposes a framework that combines Chi-square as the feature selection method with eight ensemble learning classifiers on five base learners: K-Nearest Neighbors, Naïve Bayes, Support Vector Machine, Decision Trees, and Logistic Regression. K-Nearest Neighbors returns the highest base-learner accuracy: 95.37% with Chi-square feature selection and 87.89% without feature selection. Among the ensembles, the Extreme Gradient Boosting Classifier achieves the highest accuracy: 97.407% with Chi-square feature selection and 91.72% without feature selection. Across the seven evaluation measures, the Extreme Gradient Boosting Classifier leads when Chi-square feature selection is applied, while Random Forest leads among the ensemble methods without feature selection. The results show that tree-based ensemble models are effective for malware classification.
Keywords: Chi-square, Extreme Gradient Boosting Classifier, K-Nearest Neighbors, Random Forest.
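
The Chi-square feature selection step named in the abstract can be illustrated with a minimal sketch. The abstract does not describe the authors' implementation, so everything here is an assumption made for illustration: the function names `chi2_scores` and `select_top_k` are hypothetical, the scoring follows the common formulation for non-negative features (observed per-class feature totals compared against totals expected under independence), and the toy matrix stands in for binary malware indicators rather than the paper's dataset.

```python
from collections import defaultdict


def chi2_scores(X, y):
    """Chi-square score of each non-negative feature against the class labels.

    Observed value: total of feature j within each class.
    Expected value: total of feature j overall, scaled by the class proportion.
    """
    n_samples = len(X)
    n_features = len(X[0])

    class_sizes = defaultdict(int)
    observed = defaultdict(lambda: [0.0] * n_features)
    for row, label in zip(X, y):
        class_sizes[label] += 1
        for j, value in enumerate(row):
            observed[label][j] += value

    feature_totals = [sum(row[j] for row in X) for j in range(n_features)]

    scores = []
    for j in range(n_features):
        score = 0.0
        for label, size in class_sizes.items():
            expected = feature_totals[j] * size / n_samples
            # Assumption: a feature that never occurs gets score 0.0
            # instead of an undefined value.
            if expected > 0:
                score += (observed[label][j] - expected) ** 2 / expected
        scores.append(score)
    return scores


def select_top_k(scores, k):
    """Return the indices of the k highest-scoring features, in column order."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Hypothetical toy data: rows are samples, columns are binary indicators
# (for example, whether a given API call appears); 1 = malware, 0 = benign.
X = [
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
]
y = [1, 1, 1, 0, 0, 0]

scores = chi2_scores(X, y)
selected = select_top_k(scores, k=1)
print(scores, selected)
```

In this toy data the first column occurs only in the malware class, so it receives the highest score and is the one retained. In a pipeline of the kind the abstract describes, the retained columns would then be passed to the base learners and ensemble classifiers.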