CASE BASED REASONING (CBR) FOR OBESITY LEVEL ESTIMATION USING K-MEANS INDEXING METHOD

Abstract
As many as 600 million of the 1.9 billion adults who are overweight are obese. Obesity that is not treated immediately will be a risk factor for increasing cardiovascular, metabolic, and degenerative diseases, and even death at a young age. Case Based Reasoning (CBR) can be used to estimate a person's obesity level using previous cases. The old case with the highest similarity will be the solution for the new case. Indexing methods such as the K-Means Algorithm are needed so that the search for similar cases does not involve all cases in the case base, which shortens the computation time at the retrieve stage while still producing optimal solutions. Cosine similarity is used to find the relevant cluster of a new case, and Euclidean distance similarity is used to calculate the similarity between cases. The random subsampling method was used to validate the CBR system. The test results with K=2 indicate that CBR is better than CBR-K-Means in accuracy: they produce average accuracies of 88.365% and 88.270%, respectively, at a threshold of 0.8. CBR-K-Means produces an average computation time at the retrieve stage of 33.55 seconds, faster than the 35.5 seconds of CBR without indexing.


INTRODUCTION
Obesity is an excessive accumulation of fat due to a long-term imbalance between energy intake and energy expenditure [1]. Ng added that obesity is thought to occur due to lack of physical activity, genetic factors, stress, and others. As many as 62% of obese people in the world live in developing countries, and Indonesia is one of them [2]. Other factors such as drugs, environment, and hormones can also cause obesity [1].
The Ministry of Health of the Republic of Indonesia (Kemkes RI) reported that the number of obese people in the world has more than doubled since 1980. More than 1.9 billion adults aged 18 years and over are overweight, and more than 600 million of them are obese [3]. Based on the results of the Riset Kesehatan Dasar (Riskesdas), the prevalence of obesity in the Indonesian population aged over 18 years increased from 11.7% in 2010 to 15.4% in 2013 [1]. Based on the 2015-2019 RPJMN indicators, 15.4% of people in Indonesia are obese with a BMI ≥ 27. Among adults aged 18 years and over, 13.5% were overweight, while 28.7% were obese with a BMI ≥ 25. Among children aged 5-12 years, 18.8% were overweight and 10.8% were obese [3].
According to WHO, there are five levels of obesity based on the measurement of Body Mass Index (BMI): underweight, normal weight, overweight with risk, obesity I, and obesity II [4]. In addition to using BMI measurements, obesity levels can be determined from dietary habits and physical condition. In that case, obesity is divided into seven levels, namely underweight, normal weight, overweight level I, overweight level II, obesity level I, obesity level II, and obesity level III [5].
Obesity cannot be underestimated. Hubby et al. [2] explain that obesity has health impacts in adults. Obesity is a risk factor for the increased incidence of non-communicable diseases such as type 2 diabetes, cancer, and cardiovascular diseases, and even for death at a young age. Kemkes RI added that obesity that is not immediately treated will be a risk factor for metabolic diseases (triglycerides, decreased HDL cholesterol, and increased blood pressure), degenerative diseases, and other diseases (exacerbation of asthma, knee and hip osteoarthritis, gallstone formation, cessation of breathing during sleep, and low back pain) [1], [3]. Therefore, it is important to know a person's obesity level early so that appropriate treatment can be carried out by experts. A Case Based Reasoning (CBR) system can be used to estimate a person's obesity level based on eating habits and physical activity using previous cases.
CBR is a method used to study and solve a problem based on past experiences. This method uses a case-based approach [6]-[8]. The solution to a problem can be determined by finding the solution of a similar old case in the case base. If a new case is similar to an old case in the case base, the two can be treated as identical [9]. CBR has been widely applied in several domains, such as cases of caesarean section [8], diagnosis of hypertension [10], prediction of length of study for prospective students [11], diagnosis of heart disease [9], diagnosis of stroke [6], [12], and determination of deviant sexual behavior [13]. In this study, CBR is used to estimate the level of obesity based on dietary habits and physical condition.
Finding old cases that are similar to a new case is a known problem in CBR. The computation time grows when the system has to calculate the similarity of the new case to all the old cases in the case base. Indexing of old cases is needed to handle this [10]. With indexing, the similarity calculation does not involve all of the old cases in the case base, but only the old cases closest to the new case. Indexing the old cases is expected to shorten the computation time and still produce optimal solutions.
The indexing method proposed in this study uses a partition clustering algorithm, namely the K-Means Algorithm. With the K-Means Algorithm, data are grouped based on similarities and dissimilarities into groups called clusters. Thus, each cluster contains a set of data with a high level of similarity and a low level of dissimilarity. The implementation of the algorithm is expected to increase the efficiency of computing time, especially during the retrieval stage, and still produce an optimal solution.

Flowchart System
The flowchart of the CBR system is shown in Figure 1.

Research Data Collection
This study uses secondary data obtained from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition+) with the name "Estimation of Obesity Levels Based On Eating Habits and Physical Condition Data Set". The data consist of 2,111 records and 17 attributes. They describe the estimation of obesity levels in individuals from Mexico, Peru, and Colombia based on their dietary habits and physical condition. The data have been classified into seven levels of obesity [5].

Data Cleaning
Data cleaning is carried out on data that are still dirty. Data are said to be dirty if they contain empty values (missing values), outliers, or inconsistencies [14]. Records with missing values or outliers are discarded and not used in subsequent data processing. Checking the data showed no missing values and no outliers.

Label Encoding
Label Encoding is a method for converting data with categorical data types (nominal and ordinal) into numeric data types. This process is carried out because most of the attributes in the data are categorical. Its purpose is to allow each stored case to be indexed later using the K-Means Algorithm. The process is applied only to categorical attributes, excluding the attribute that is the target class of the data.
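The encoding step can be sketched as follows. This is a minimal illustration, not the encoder used in the study: the function names are hypothetical, the sample values are made up for the example, and the choice to code ordinal categories in a supplied order and nominal categories in sorted order is an assumption.

```python
def fit_label_encoder(values, order=None):
    """Map each category to an integer code.

    If `order` is given (ordinal attribute), codes follow that order;
    otherwise (nominal attribute) categories are coded in sorted order.
    """
    categories = list(order) if order is not None else sorted(set(values))
    return {category: code for code, category in enumerate(categories)}


def encode(values, mapping):
    """Replace every category in `values` by its integer code."""
    return [mapping[value] for value in values]


# Illustrative values only; the real categories come from the data set.
frequency = ["Sometimes", "no", "Always", "Frequently", "Sometimes"]
frequency_map = fit_label_encoder(
    frequency, order=["no", "Sometimes", "Frequently", "Always"]
)
encoded_frequency = encode(frequency, frequency_map)

gender = ["Female", "Male", "Male", "Female"]
gender_map = fit_label_encoder(gender)
encoded_gender = encode(gender, gender_map)
```

Keeping the mapping separate from the encoded values means a new case can be encoded later with the same codes as the case base.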

Data Normalization
Data normalization is a process that changes the attribute values in the data into the same range. Normalization is usually done into the range [0, 1] or [-1, 1], so that each attribute has the same weight and bias is avoided [15]. This study uses min-max normalization according to Equation (1). This strategy was chosen because it provides the highest average accuracy compared to other normalization methods such as z-score normalization and decimal scaling [16], [17], as well as sigmoid, softmax, and statistical column [16].
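Min-max normalization is commonly written as x' = (x - min) / (max - min) × (new_max - new_min) + new_min. The sketch below applies that standard form to one attribute column. The function name, the default target range [0, 1], and the handling of a constant column are assumptions made for illustration.

```python
def min_max_normalize(column, new_min=0.0, new_max=1.0):
    """Rescale a list of numbers linearly into [new_min, new_max]."""
    lowest, highest = min(column), max(column)
    if highest == lowest:
        # Constant column: every value maps to the lower bound (assumption).
        return [new_min for _ in column]
    span = highest - lowest
    return [(x - lowest) / span * (new_max - new_min) + new_min for x in column]


# Illustrative values for a single numeric attribute.
ages = [14, 21, 26, 61]
normalized_ages = min_max_normalize(ages)
```

The minimum and maximum should be taken from the case base, so that a new case is scaled with the same bounds as the stored cases.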

Determination of the Optimal Number of Clusters
The optimal number of clusters is determined before the case base is indexed using K-Means. The Elbow method is used at this stage. The Elbow method determines the optimal number of clusters by finding the largest decrease in the Sum of Square Error (SSE), which is the point where the SSE curve forms an angle. SSE is the sum of squared errors between all data in cluster Cj and the cluster center point (centroid) Pj. SSE is calculated using Equation (2).
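A minimal sketch of the SSE computation and the Elbow selection is given below. It assumes the usual definition of SSE as the sum of squared Euclidean distances from each point to its own centroid, and it assumes that the SSE difference attributed to k is SSE(k-1) - SSE(k). The function names and the toy data are hypothetical.

```python
def sse(points, labels, centroids):
    """Sum of squared distances between each point and its cluster centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        centroid = centroids[label]
        total += sum((x - c) ** 2 for x, c in zip(point, centroid))
    return total


def elbow_k(sse_by_k):
    """Return the k whose SSE drop from the previous k is the largest.

    `sse_by_k` maps each tested k to its SSE value.
    """
    ks = sorted(sse_by_k)
    drops = {k: sse_by_k[prev] - sse_by_k[k] for prev, k in zip(ks, ks[1:])}
    return max(drops, key=drops.get)


# Toy data: two well-separated groups of two points each.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
labels = [0, 0, 1, 1]
centroids = [(0, 1), (10, 1)]
sse_k2 = sse(points, labels, centroids)

# SSE of the same toy data for k = 1, 2, 3.
best_k = elbow_k({1: 104.0, 2: sse_k2, 3: 2.0})
```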

Case Representation
The case representation model used is a flat frame. The attributes and solutions of the cases are represented as in Table 1. There are 17 attributes and 7 solution spaces (target classes). A row named Cluster is added to the case representation to store the relevant cluster value of the case.

Base Case Indexing Using K-Means Algorithm
The K-Means Algorithm is used as the method for indexing old cases in the case base. The optimal number of groups k is obtained using the Elbow method. After the optimal k is obtained, all the old cases in the case base are indexed according to their cluster. The case base indexing flow is shown in Figure 2. The similarity of a new case is then calculated only against the old cases that are in the same cluster as the new case.
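The indexing step can be illustrated with a small K-Means implementation in plain Python. This is a sketch under stated assumptions rather than the implementation used in the study: the initial centroids are sampled at random from the cases with a fixed seed, the iteration stops when the assignments no longer change, and the names `k_means` and `build_index` are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iter=100, seed=0):
    """Cluster `points` into k groups; return (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def build_index(labels):
    """Map each cluster id to the list of case positions it contains."""
    index = {}
    for case_id, label in enumerate(labels):
        index.setdefault(label, []).append(case_id)
    return index


# Toy case base with two normalized attributes per case.
cases = [(0.0, 0.0), (0.1, 0.1), (0.9, 0.9), (1.0, 1.0)]
labels, centroids = k_means(cases, k=2)
index = build_index(labels)
```

At retrieval time, only the cases listed under the relevant cluster id in `index` need to be compared with the new case.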

New Case's Relevant Cluster
The CBR system with K-Means indexing determines the cluster closest to the new case in order to find a solution. The closest cluster is obtained by calculating the similarity of the new case to each cluster center point (centroid) generated in the previous stage. The new case and the centroid are represented as vectors, and the similarity is calculated using the cosine similarity shown in Equation (3). The two vectors (cases) are said to be similar if the cosine value is equal to one; if the value is zero, the two vectors are said to be dissimilar [10].
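$$\operatorname{cosine}(\bar{u}, \bar{v}) = \frac{\sum_{i=1}^{n} u_i v_i}{\sqrt{\sum_{i=1}^{n} u_i^2}\,\sqrt{\sum_{i=1}^{n} v_i^2}} \tag{3}$$
Descriptions:
$\operatorname{cosine}(\bar{u}, \bar{v})$: cosine value of new case $\bar{u}$ and centroid $\bar{v}$
$\bar{u}$: new case vector
$\bar{v}$: centroid vector
$n$: number of attributes in the vector
$u_i$: value of the i-th element in vector $\bar{u}$
$v_i$: value of the i-th element in vector $\bar{v}$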
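The cluster selection described above can be sketched as follows, using the standard cosine similarity between two vectors. The function names, the treatment of a zero vector, and the sample centroids are assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector has no direction (assumption)
    return dot / (norm_u * norm_v)


def relevant_cluster(new_case, centroids):
    """Return the index of the centroid most similar to the new case."""
    return max(
        range(len(centroids)),
        key=lambda j: cosine_similarity(new_case, centroids[j]),
    )


# Hypothetical centroids and new case with three normalized attributes.
centroids = [[0.9, 0.1, 0.2], [0.1, 0.8, 0.7]]
new_case = [0.2, 0.9, 0.6]
cluster_id = relevant_cluster(new_case, centroids)
```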

Case Similarity
The Euclidean distance similarity method is used to calculate similarity during the retrieval stage. This method measures the similarity of the new case to an old case by calculating how close the two cases are, based on the weighted similarity of each attribute [10], [11], [13]. Two similarity measurements are carried out, namely local similarity and global similarity. Local similarity measures similarity at the attribute level, while global similarity measures similarity at the level of the whole case [9], [10], [12], [18].

Local Similarity Measurement
Local similarity in this study is measured differently for numeric and ordinal attributes than for nominal and binary attributes. Local similarity for numeric and ordinal attributes is calculated using Equation (4), while for nominal and binary attributes it is calculated using Equation (5) [9], [10], [12].
Descriptions:
$f(S_i, T_i)$: local similarity of the i-th attribute between case S and case T
$S_i$: i-th attribute value of case S
$T_i$: i-th attribute value of case T
$\max_i$: maximum value of the i-th attribute in the case base
$\min_i$: minimum value of the i-th attribute in the case base
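Equations (4) and (5) are not reproduced in this text, so the sketch below uses forms that are common in CBR work and consistent with the quantities described above. For numeric and ordinal attributes it assumes one minus the absolute difference divided by the attribute range. For nominal and binary attributes it assumes an exact-match test. The function names are hypothetical.

```python
def local_similarity_numeric(s, t, attr_min, attr_max):
    """Assumed form for numeric/ordinal attributes: 1 - |s - t| / (max - min)."""
    if attr_max == attr_min:
        # All stored values are equal, so the attribute cannot discriminate.
        return 1.0
    return 1.0 - abs(s - t) / (attr_max - attr_min)


def local_similarity_nominal(s, t):
    """Assumed form for nominal/binary attributes: 1 if equal, else 0."""
    return 1.0 if s == t else 0.0


# Illustrative values: an age attribute ranging from 14 to 61 in the case base.
age_similarity = local_similarity_numeric(25, 30, attr_min=14, attr_max=61)
smoke_similarity = local_similarity_nominal("yes", "no")
```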

Global Similarity Measurement
Global similarity between the new case and the old cases in the case base is measured using Equation (6) [9], [10], which is based on the Euclidean distance. The weight of each attribute is its Pearson correlation to the target class, obtained through correlation analysis and taken without its positive or negative sign.
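Equation (6) is not reproduced in this text, so the following is only one plausible weighted Euclidean form, labeled here as an assumption: the local dissimilarities (1 - local similarity) are combined into a weighted, normalized Euclidean distance, which is then subtracted from one. The weights are absolute Pearson correlations, as described above. The function names and sample numbers are hypothetical, and the study itself computed the correlations with statistical software.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation (assumption)
    return cov / math.sqrt(var_x * var_y)


def global_similarity(local_sims, weights):
    """Assumed form: 1 - sqrt(sum(w * (1 - f)^2) / sum(w))."""
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0
    squared = sum(w * (1.0 - f) ** 2 for f, w in zip(local_sims, weights))
    return 1.0 - math.sqrt(squared / total_weight)


# The sign of the correlation is ignored when it is used as a weight.
weight = abs(pearson([1, 2, 3], [6, 4, 2]))

# Two attributes: one matches fully, the other not at all.
similarity = global_similarity([1.0, 0.0], [0.75, 0.25])
```

With this form the result is 1 when every attribute matches and 0 when none does, so it can be compared directly against a similarity threshold.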

Evaluation and Validation of Results
The output of the CBR system is evaluated using accuracy. The accuracy of the CBR system in estimating the level of obesity is the percentage of test cases estimated correctly, according to Equation (7). This accuracy value is used to compare the performance of the CBR system with and without the K-Means indexing method.
$$acc = \frac{\sum \text{number of correct cases}}{\sum \text{number of test cases}} \times 100\% \tag{7}$$
The holdout method is used to divide the research data into a case base and test cases. The data set D is divided randomly into two independent subsets, namely training data and test data [14], [15], [19]. The training data are used as the case base, while the test data are used as test cases to compare the performance of the CBR system with and without the K-Means indexing method. This study uses 80% of the data as training data and 20% as test data [20]. The holdout method used in this study is illustrated in Figure 3.
The results of the CBR system are validated using the random subsampling method. This method repeats the holdout method for k iterations. The average accuracy over the k iterations is used to compare the performance of the CBR system with and without the K-Means indexing method. In this study, the number of iterations was set to 20.
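The validation procedure can be sketched as repeated random 80/20 splits with the accuracy averaged over the runs. In this illustration, `estimate` is a hypothetical callable standing in for the CBR retrieval step, each case is assumed to be a `(features, label)` pair, and the fixed seed is only there to make the sketch reproducible.

```python
import random


def holdout_split(cases, train_ratio, rng):
    """Randomly split cases into a case base and a list of test cases."""
    shuffled = list(cases)
    rng.shuffle(shuffled)
    cut = int(round(train_ratio * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def accuracy(predicted, actual):
    """Percentage of test cases whose estimated label is correct."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)


def random_subsampling(cases, estimate, iterations=20, train_ratio=0.8, seed=0):
    """Average accuracy over repeated holdout splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(iterations):
        case_base, test_cases = holdout_split(cases, train_ratio, rng)
        predicted = [estimate(case_base, features) for features, _ in test_cases]
        actual = [label for _, label in test_cases]
        scores.append(accuracy(predicted, actual))
    return sum(scores) / len(scores)


# Toy data and a stand-in estimator that ignores the case base.
cases = [((i,), "even" if i % 2 == 0 else "odd") for i in range(10)]


def parity_estimate(case_base, features):
    return "even" if features[0] % 2 == 0 else "odd"


mean_accuracy = random_subsampling(cases, parity_estimate)
```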

Analysis of Data Processing Results
The data processing results are analyzed at the final stage of this research. The analysis compares the computation time and accuracy of the CBR system with and without the K-Means indexing method.

RESULT AND DISCUSSION
Two test scenarios were used to analyze the ability of the CBR system to estimate obesity levels. Each scenario is run for 20 iterations. The first scenario tests the CBR system without the K-Means indexing method, and the second scenario tests the CBR system with the K-Means indexing method. Cosine similarity is used to find the cluster that is relevant to the new case. Euclidean distance is used to determine the similarity of the new case to the cases in the case base.

System Implementation
Figure 4 shows the implementation of the CBR system, which was built as a website. The figure presents the comparison of CBR with and without the K-Means indexing method. Figure 5 shows the solution of the CBR system with and without the K-Means indexing method for each test case.

Testing the optimal number of clusters
The Elbow method is used to test the number of clusters k. The test was carried out 10 times. The k with the highest average SSE difference over the 10 tests is taken as the optimal value. The test uses values of k from 1 to 10 [19]. The test results for the optimal number of clusters are shown in Table 3 and Table 4.
Based on the results in Table 4, k equal to two gives the highest average SSE difference, which is 2631.27 with a standard deviation of 6.95. The value k equal to two is also the point that forms the angle in Figure 6. Therefore, the optimal k used in the comparison test of CBR with and without the K-Means indexing method is two.

Testing the Comparison of CBR and CBR-K-Means
This test was conducted to compare the performance of the resulting CBR systems. The accuracy (in percent) and the computation time at the retrieval stage (in seconds) are compared between the CBR system with and without the K-Means indexing method. The results of this test are shown in Table 5 and illustrated in Figure 7 for the accuracy and Figure 8 for the computation time at the retrieval stage.
Based on Table 5, the CBR system without the K-Means indexing method provides an average accuracy of 88.365%, which is better than the 88.270% of the CBR system with the K-Means indexing method. The average computation time at the retrieval stage of the CBR system with the K-Means indexing method is 33.55 seconds, faster than the 35.5 seconds of the CBR system without the K-Means indexing method.
In the CBR system with the K-Means indexing method, the search for relevant cases only covers the cases in the case base that belong to the same cluster as the new case. Therefore, the computation time at the retrieval stage becomes shorter. The accuracy of the CBR system does not reach 100% because the system cannot yet handle two or more old cases that have the same similarity value to the new case. When this happens, the CBR system takes the first case in serial order as the solution for the new case, which affects the resulting accuracy.
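The tie behavior described above can be made visible with a small sketch. It returns the solution of the first case with the highest similarity, as the text describes, and also reports every case that shares that highest value. The function name and the sample values are hypothetical, and the sketch does not propose a tie-breaking rule.

```python
def retrieve(similarities, solutions):
    """Return (chosen solution, solutions of all cases tied for the best score)."""
    best = max(similarities)
    tied = [i for i, score in enumerate(similarities) if score == best]
    first = tied[0]  # the case with the lowest serial number wins the tie
    return solutions[first], [solutions[i] for i in tied]


# Illustrative scores: the second and third old cases are equally similar.
similarities = [0.91, 0.97, 0.97]
solutions = ["normal weight", "obesity level I", "overweight level II"]
chosen, tied_solutions = retrieve(similarities, solutions)
```

When `tied_solutions` contains different obesity levels, the estimate depends only on the order of the cases in the case base, which is the source of error noted in the text.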

Fig 1. Flowchart of the case based reasoning system

Fig 4. Implementation of the website based CBR system

Fig 7. Comparison chart of the value of accuracy from CBR-K-Means and CBR system for 20 iterations

Table 1. Case representation model in flat frame

Table 2. The correlation analysis results of each attribute with the target
The correlation of each attribute with the target class was calculated using IBM SPSS Statistics 26 software. The calculation results are shown in Table 2. Att represents the list of attributes used. Pc represents the Pearson correlation value. Scp represents the sum of squares and cross-products. Cov represents the covariance value.

Table 3. The Sum of Squares Error (SSE) value in each i-th iteration for each cluster

Table 4. The Sum of Squares Error (SSE) difference value in each i-th iteration for each cluster

Fig 6. Graph of the average SSE value for every k for 10 times of testing