IMPLEMENTATION OF BALANCING DATA METHOD USING SMOTETOMEK IN DIABETES CLASSIFICATION USING XGBOOST

Authors

  • Fatwa Ratantja Kusumajati, UPN "Veteran" Jawa Timur, Indonesia
  • Basuki Rahmat, UPN "Veteran" Jawa Timur, Indonesia
  • Achmad Junaidi, UPN "Veteran" Jawa Timur, Indonesia

DOI:

https://doi.org/10.21107/kursor.v12i4.410

Keywords:

Balancing Data, Diabetes Classification, SMOTETomek, XGBOOST

Abstract

Diabetes is a significant public health concern in Indonesia. This study applies the XGBoost algorithm together with the SMOTETomek resampling method to improve the accuracy of diabetes classification. The dataset, sourced from Kaggle, contains 2,000 patient records with demographic and medical information: pregnancies, glucose level, blood pressure, skin thickness, insulin level, Body Mass Index (BMI), diabetes pedigree function, age, and a binary outcome label, where 0 indicates that the patient does not have diabetes and 1 indicates that the patient has diabetes. A major challenge was class imbalance: the dataset contains disproportionately more non-diabetic samples than diabetic samples. To address this, the researchers employed SMOTETomek, which combines SMOTE (Synthetic Minority Over-sampling Technique) with Tomek links: SMOTE oversamples the minority class, and Tomek links removal discards borderline samples, which balances the class distribution. With XGBoost, SMOTETomek achieved higher accuracy (95.01%) than either SMOTE alone or the original data (both 92.13%), highlighting the benefit of combining SMOTE with Tomek links. During testing, SMOTETomek slightly reduced minority-class accuracy (0.97 vs. 0.99 for SMOTE and the original data) but maintained a strong F1-score and precision, indicating effective handling of the imbalance despite a minor trade-off.
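The two resampling steps named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the toy (glucose, BMI) records, the function names `smote` and `remove_tomek_links`, the neighbour count `k=3`, and the choice to drop both members of each Tomek link are all assumptions made for illustration. In practice a library class such as `SMOTETomek` from `imblearn.combine` would typically be used before fitting a classifier.

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(i, points, candidates, k=1):
    """Indices of the k candidates closest to points[i], excluding i itself."""
    others = [j for j in candidates if j != i]
    return sorted(others, key=lambda j: dist(points[i], points[j]))[:k]

def smote(X, y, minority=1, k=3, seed=0):
    """Add synthetic minority samples until both classes have equal size.

    Each synthetic point lies on the segment between a minority sample
    and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    X, y = list(X), list(y)
    min_idx = [i for i, label in enumerate(y) if label == minority]
    n_needed = (len(y) - len(min_idx)) - len(min_idx)
    for _ in range(max(n_needed, 0)):
        i = rng.choice(min_idx)
        j = rng.choice(nearest(i, X, min_idx, k))
        gap = rng.random()  # interpolation factor in [0, 1)
        X.append(tuple(a + gap * (b - a) for a, b in zip(X[i], X[j])))
        y.append(minority)
    return X, y

def remove_tomek_links(X, y):
    """Drop both members of every Tomek link (assumed cleaning strategy).

    A Tomek link is a pair of samples from different classes that are
    each other's nearest neighbour.
    """
    all_idx = range(len(X))
    nn = [nearest(i, X, all_idx)[0] for i in all_idx]
    drop = {i for i in all_idx if nn[nn[i]] == i and y[i] != y[nn[i]]}
    keep = [i for i in all_idx if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]

# Hypothetical toy records: (glucose, BMI); label 1 = diabetic (minority).
X = [(85, 22.0), (90, 24.5), (95, 23.0), (100, 26.0), (105, 25.0),
     (88, 21.5), (110, 27.0), (98, 24.0), (152, 32.0),
     (150, 33.0), (160, 35.0), (155, 31.0)]
y = [0] * 9 + [1] * 3

X_bal, y_bal = smote(X, y)                        # step 1: oversample minority
X_clean, y_clean = remove_tomek_links(X_bal, y_bal)  # step 2: clean borderline pairs

print("after SMOTE:", y_bal.count(0), "vs", y_bal.count(1))
print("after Tomek link removal:", y_clean.count(0), "vs", y_clean.count(1))
```

The sketch performs a single cleaning pass and computes distances on unscaled features; real pipelines usually standardise features first and resample only the training split, so that synthetic points never leak into the test data.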

Published

2024-12-30

Section

Articles