APPLICATION OF HYBRID GA-PSO TO IMPROVE THE PERFORMANCE OF DECISION TREE C5.0

Data mining is the process of extracting information from large, high-dimensional data in order to obtain knowledge that supports decision making. Problems in the data mining process often arise when processing high-dimensional data. This study addresses them by applying a hybrid genetic algorithm and particle swarm optimization (HGAPSO) method to improve the performance of the C5.0 decision tree classification model, so that decisions on classification data can be made quickly, precisely, and accurately. Three datasets from the University of California, Irvine (UCI) Machine Learning Repository were used: lymphography, vehicle, and wine. The HGAPSO algorithm combined with the C5.0 decision tree achieved the best accuracy on the higher-dimensional data: the lymphography and vehicle datasets reached accuracies of 83.78% and 71.54%, respectively. On the wine dataset, the accuracy was 0.56 percentage points lower than with the conventional method because its dimensionality is smaller than that of the lymphography and vehicle datasets.


INTRODUCTION
Data mining is the process of extracting information from high-dimensional data to obtain knowledge that supports decision making. Machine learning is the part of data mining that helps find patterns in data automatically. A method frequently used in data mining is the C5.0 decision tree [1].
The data used in the mining process sometimes has problems that can distort the results, including missing values, redundant data, outliers, and data formats that are incompatible with the system. These problems often occur in high-dimensional data. Two methods often used to address problems in high-dimensional data are the genetic algorithm (GA) and particle swarm optimization (PSO) [2].
Several studies have developed methods for problems in high-dimensional data. One study addressed feature selection for varying coefficient models with ultrahigh-dimensional covariates and reported a performance of 95% [3]. Another proposed model-free feature screening for ultrahigh-dimensional discriminant analysis and reported a performance of 94.30% [4]. A further study used a hybrid genetic algorithm and particle swarm optimization method to solve bi-level linear programming problems; it reduced the error rate to 0.0094 and showed that optimization using GA or PSO alone performs no better than the hybrid method [5]. Conventional optimization techniques are considered suboptimal for feature selection on high-dimensional data. It is therefore necessary to optimize feature selection for the C5.0 decision tree by making fuller use of the model in preprocessing [6].
This research addresses problems in high-dimensional data by using the hybrid genetic algorithm and particle swarm optimization method to improve the performance of the C5.0 decision tree classification model, so that decisions on classification data can be made quickly, precisely, and accurately.

Data Mining
Data mining is also called knowledge discovery in databases (KDD). Data mining has three main points, which are shown in Figure 1. As shown in Figure 1, statistics is the most important of these points. Statistics is used to identify systematic relationships between different variables when there is not enough information about those variables. Artificial intelligence (AI) contributes data processing techniques, based on models of human reasoning, to the development of data mining [2].
Closely related to AI, machine learning (ML) is very important in the development of data mining. ML uses techniques that allow computers to learn through 'training'. In this context, natural computing (NC) can also be considered a solid additional root of data mining. Database systems (DBS) provide the information that is then processed using data processing methods [2].

High Dimensional Data
High-dimensional data can help machine learning models learn more rules and generalize better to new data [4]. However, adding low-quality data and carelessly chosen input features may create too much noise and slow down the training algorithm. A number of dimensionality reduction techniques are available to estimate how informative each column is and, if necessary, to filter it out of the dataset [6].
Here are some data-dimensionality reduction techniques [7]

Selection of Genetic Algorithm Attributes
The genetic algorithm (GA) is an optimization and search technique based on the principles of genetics. Genetic algorithms mainly consist of three operators: selection, crossover, and mutation [8]. The flowchart of the GA attribute selection method is shown in Figure 2. In Figure 2, the fitness function in GA is a simple function that rates individual attributes on the basis of the correlation coefficient: the lower the correlation, the higher the fitness value of the attribute. The relative fitness function is as follows [9]:

$$P_i = \frac{f_i}{\sum f} \tag{1}$$

where f_i is the fitness value of chromosome i, ∑f is the total fitness value of all chromosomes, i is the index of the chromosome, and P_i is the relative probability.

The result of equation (1) is then used to find the cumulative fitness. The cumulative fitness is calculated once for each member of the population, so that each chromosome is assigned an interval of values whose width corresponds to its relative probability. Cumulative fitness is calculated as follows [9]:

$$C_i = \sum_{j=1}^{i} P_j \tag{2}$$

where C_i is the cumulative fitness, i is the index of the chromosome, and P_j is the relative probability.

A random number between 0 and 1 is then generated, and its position is checked against these intervals to select a chromosome.
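The selection step described here, in which each fitness value is divided by the total fitness, the resulting relative probabilities are accumulated, and a random number between 0 and 1 is located among the cumulative values, can be illustrated with a minimal Python sketch. The function names and the fitness values are hypothetical and are not taken from the study.

```python
import random
from itertools import accumulate


def relative_probabilities(fitness):
    """Relative probability of each chromosome: its fitness over the total fitness."""
    total = sum(fitness)
    return [f / total for f in fitness]


def cumulative_fitness(probabilities):
    """Running sum of the relative probabilities (one value per chromosome)."""
    return list(accumulate(probabilities))


def select_index(cumulative, r):
    """Return the index of the first chromosome whose cumulative value reaches r."""
    for i, c in enumerate(cumulative):
        if r <= c:
            return i
    return len(cumulative) - 1  # guard against floating-point round-off


fitness = [2.0, 3.0, 5.0]                  # hypothetical fitness values
P = relative_probabilities(fitness)        # relative probabilities
C = cumulative_fitness(P)                  # cumulative fitness
chosen = select_index(C, random.random())  # index of the selected chromosome
```

Chromosomes with a higher fitness occupy a wider interval of the cumulative scale, so they are selected more often.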

Particle Swarm Optimization Attribute Selection
PSO is modeled on the movement and behavior of schools of fish and flocks of birds, for example when searching for prey, and was first proposed by James Kennedy and Russell C. Eberhart in 1995. PSO consists of a group of particles searching for the best position, that is, the best solution to the optimization problem in the feature space [11]. The schematic of the particle swarm optimization attribute selection method is shown in Figure 3.
Based on Figure 3, the PSO algorithm is initialized by assigning each particle (solution) a random initial position; the algorithm then searches for the optimal value by updating the positions. In each iteration, each particle updates its position according to two best values: the best solution obtained so far by that particle (pbest) and the best solution in the population (gbest) [10]. The velocity and position of each particle are updated using equations (3) and (4).
The equations for the velocity and position of particle i in dimension d are as follows:

$$V_{id}^{k+1} = W\,V_{id}^{k} + c_1\,rand_1\,(P_{id} - X_{id}^{k}) + c_2\,rand_2\,(P_{gd} - X_{id}^{k}) \tag{3}$$

$$X_{id}^{k+1} = X_{id}^{k} + V_{id}^{k+1} \tag{4}$$

where P_id is Pbest (personal best) of particle i in d dimensions, P_gd is Gbest (global best) in d dimensions, V_id^k is the velocity of the particle in iteration k, and X_id^k is the solution (position) of the particle in iteration k. c1 and c2 are positive constants, and rand1 and rand2 are two random variables drawn from a uniform distribution between 0 and 1. W is the inertia weight, which controls the effect of the old velocity vector on the new one [11].
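As an illustration of the velocity and position update, the following Python sketch applies one PSO step to a single particle using the standard form of the update (inertia term plus attraction toward pbest and gbest). The parameter values for w, c1, c2, and v_max are assumptions chosen for the example, not values reported in this study, and the velocity clamp is included only as a common safeguard.

```python
import random


def pso_step(x, v, pbest, gbest, w=0.7, c1=2.0, c2=2.0, v_max=1.0, rng=random):
    """Return the new position and velocity of one particle.

    x, v   : current position and velocity (one value per dimension)
    pbest  : best position found so far by this particle
    gbest  : best position found so far by the whole population
    w      : inertia weight; c1, c2: positive constants (assumed values)
    v_max  : velocity limit per dimension (assumed value)
    """
    new_x, new_v = [], []
    for d in range(len(x)):
        r1, r2 = rng.random(), rng.random()  # uniform random numbers in [0, 1)
        vel = (w * v[d]
               + c1 * r1 * (pbest[d] - x[d])
               + c2 * r2 * (gbest[d] - x[d]))
        vel = max(-v_max, min(v_max, vel))   # keep the velocity within [-v_max, v_max]
        new_v.append(vel)
        new_x.append(x[d] + vel)             # position update
    return new_x, new_v


# Hypothetical two-dimensional particle
position, velocity = pso_step(
    x=[0.2, 0.8], v=[0.0, 0.0], pbest=[0.4, 0.6], gbest=[0.5, 0.5]
)
```

A particle that already sits on both its personal best and the global best with zero velocity does not move, which is a quick sanity check for the update.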

HGAPSO Attribute Selection
The hybrid genetic algorithm and particle swarm optimization method is carried out in two phases to produce a new population. The hybrid model starts from N randomly generated individuals. Each individual can be regarded as a chromosome in the case of GA or as a particle in the case of PSO. The N individuals are sorted by fitness, and the best N individuals are passed to the GA model to create N new individuals by crossover [12].
The crossover operator in GA is applied as a linear combination of two vectors, each representing an individual in the GA algorithm, with a crossover probability of 100%. The random mutation of the GA algorithm is replaced by the PSO method [6]. The procedure for adjusting the N particles in the PSO method involves selecting the global best particle, selecting particles from the best population, and then updating the velocity values. The global best particle is determined from the sorted fitness values [13].
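A crossover expressed as a linear combination of two parent vectors can be sketched in Python as follows. The source does not state how the combination weights are chosen, so the uniformly random weight alpha and the function name are assumptions made for illustration only.

```python
import random


def arithmetic_crossover(parent_a, parent_b, alpha=None, rng=random):
    """Combine two parents linearly into two children.

    alpha is the mixing weight in [0, 1]; if omitted it is drawn uniformly
    at random (an assumption, not specified by the source).
    """
    if alpha is None:
        alpha = rng.random()
    child_a = [alpha * a + (1 - alpha) * b for a, b in zip(parent_a, parent_b)]
    child_b = [(1 - alpha) * a + alpha * b for a, b in zip(parent_a, parent_b)]
    return child_a, child_b


# Hypothetical parents; with alpha = 0.5 both children are the midpoint
first, second = arithmetic_crossover([0.0, 2.0], [2.0, 4.0], alpha=0.5)
```

Because every pair of parents is recombined, this corresponds to a crossover probability of 100%.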
The following is the pseudocode of the hybrid genetic algorithm and particle swarm optimization [12]. Equation 6 shows that the new velocity of each particle is computed from its previous velocity (V_id), the best location found by the particle (P_id), and the global best location (P_gd). The particle velocity in each dimension is clamped to V_max, which is set to a certain fraction of the search space in that dimension. Equation 7 shows that the position of each particle (X_id) is updated during the search for the best solution [12].

Algorithm C5.0
The C5.0 algorithm is a refinement of the earlier C4.5 and ID3 algorithms. Compared with C4.5, the C5.0 algorithm is faster and more effective at generating decision trees. Information gain is a splitting criterion based on entropy measurements. To obtain the information gain of an attribute, the entropy of the whole set of classes, Entropy(S), is needed. Entropy(S) is the estimated number of bits needed to identify the class of a randomly drawn sample from the sample space S. Mathematically, entropy is formulated as follows [14]:

$$Entropy(S) = -\sum_{i=1}^{n} p_i \log_2 p_i \tag{8}$$

where n is the number of classes and p_i is the proportion of samples in S that belong to class i.

After the entropy value is obtained, the information gain is computed. Information gain measures how effective an attribute is at classifying the data. Equation (9) is used to calculate the information gain [15]:

$$Gain(A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|}\,Entropy(S_v) \tag{9}$$

where S_v is the subset of S for which attribute A has the value v.

Gain(A) is the expected reduction in entropy caused by knowing the value of attribute A. The algorithm calculates the information gain for each attribute. The attribute with the greatest gain value is chosen as the test attribute (root node). A node is created and labeled with the attribute, and a branch is created for each attribute value [14].
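The entropy and information gain calculations can be illustrated with a short Python sketch using the standard definitions: entropy is the negative sum of p log2 p over the class proportions, and the gain of an attribute is the entropy of the whole set minus the size-weighted entropy of the subsets produced by each attribute value. The function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Expected reduction in entropy from splitting on one attribute.

    values : the attribute value of each sample
    labels : the class label of each sample
    """
    n = len(labels)
    partitions = {}
    for v, y in zip(values, labels):
        partitions.setdefault(v, []).append(y)
    remainder = sum(len(s) / n * entropy(s) for s in partitions.values())
    return entropy(labels) - remainder


# Hypothetical toy data: the first attribute separates the classes perfectly,
# the second carries no information about the class.
labels = ["a", "a", "b", "b"]
gain_informative = information_gain(["x", "x", "y", "y"], labels)
gain_uninformative = information_gain(["x", "y", "x", "y"], labels)
```

The attribute with the larger gain would be chosen as the test attribute at the current node.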

Research Methods
This study uses Microsoft Excel 2013 to process the datasets and RapidMiner 5.3.0 to design the method and analyze its results. The research comprised the following phases: data collection, initial data processing, application of the HGAPSO method, application of the classification model, testing, and validation.
The data collection stage is the selection of datasets from the UCI Machine Learning Repository. Initial data processing is then carried out on the selected datasets. It includes data cleaning steps such as replacing missing values and removing duplicate values, noise, and outliers. After the data are normalized, the HGAPSO method is applied.
The next step is to apply the C5.0 decision tree classification model and test it. The classification results are then evaluated to validate the performance of the method. The data used in this study come from an open data source, namely the UCI Machine Learning Repository. The research data are shown in Table 1.

 
In Table 1, the lymphography dataset contains data obtained with x-ray technology, which is used to view the lymphatic circulation and lymph glands in the diagnosis of disease. The vehicle dataset describes types of vehicles based on images taken from different viewing angles. The wine dataset contains the results of a chemical analysis of wines grown in the same region. The datasets were selected for their different attribute dimensions in order to test the effectiveness of the HGAPSO algorithm.
Initial processing of the datasets is carried out to obtain quality data. The techniques used include data selection, min-max normalization, and attribute selection. The classification process is shown in Figure 5.
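Of the preprocessing techniques listed above, min-max normalization can be sketched in a few lines of Python. This is a generic illustration that rescales one attribute column to the range [0, 1]; the target range, the handling of constant columns, and the sample values are assumptions and do not reproduce the RapidMiner configuration used in the study.

```python
def min_max_normalize(column, new_min=0.0, new_max=1.0):
    """Rescale the values of one attribute linearly to [new_min, new_max]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map every value to new_min.
        return [new_min for _ in column]
    return [new_min + (x - lo) / (hi - lo) * (new_max - new_min) for x in column]


# Hypothetical attribute values
scaled = min_max_normalize([10.0, 20.0, 30.0])
```

Rescaling every attribute to the same range prevents attributes with large numeric ranges from dominating the fitness evaluation.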

HGAPSO Concept
The hybrid method starts with the GA method and then continues with the PSO method. In the hybrid method, each population generation is limited to 4N chromosomes; the best 2N individuals are used to produce 2N new individuals, and PSO is applied to the worst 2N individuals in the generation. The initial steps in the GA method are population initialization, fitness evaluation, selection, crossover, and mutation. The HGAPSO algorithm steps are as follows:
1. Initialization: Determine the initial population generation of 4N individuals.
2. Evaluation and ranking: Evaluate the fitness value of each of the 4N individuals and sort them by the best value using the equation:

Testing Method
The multiclass confusion matrix is a method used to measure the performance of a classification method [19]. In this study, the parameters used to measure classification performance are accuracy, precision, recall, and F-measure. Data are classified into several classes, as shown in Table 2.
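As a hedged illustration of how accuracy, precision, recall, and F-measure can be derived from a multiclass confusion matrix, the following Python sketch computes overall accuracy and the per-class values of the other three measures. The convention that rows are actual classes and columns are predicted classes, the function name, and the example counts are assumptions; they are not taken from the tables of this study.

```python
def per_class_metrics(matrix):
    """Compute accuracy and per-class (precision, recall, F-measure).

    matrix[i][j] is the number of samples of actual class i
    that were predicted as class j (assumed convention).
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    accuracy = sum(matrix[i][i] for i in range(n)) / total
    metrics = []
    for k in range(n):
        tp = matrix[k][k]                                # correctly predicted as class k
        predicted = sum(matrix[i][k] for i in range(n))  # everything predicted as class k
        actual = sum(matrix[k])                          # everything that is class k
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f_measure = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f_measure))
    return accuracy, metrics


# Hypothetical two-class confusion matrix
accuracy, metrics = per_class_metrics([[5, 1],
                                       [2, 4]])
```

Averaging the per-class values, for example by an unweighted mean, gives a single precision, recall, and F-measure for the whole classifier.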

RESULT AND DISCUSSION
Data processing begins by setting an initial population of 4N attributes. The best 2N attributes are processed using the GA algorithm to produce a new population of 2N offspring. The worst 2N attributes are processed using the PSO algorithm to produce optimal particles. The 2N offspring and the optimal particles are then combined to form the next generation of 4N individuals, producing a new optimal population. The new population is then tested using the DT C5.0 method. The accuracy results are shown in Table 3.
Table 3. Accuracy values of C5.0
In Table 3, classification results highlighted in green are the best percentages, red the worst, and yellow intermediate. Accuracy is used to measure the performance of the method in classifying each dataset. The accuracy of the unoptimized C5.0 algorithm is lower than that of the optimized C5.0. With PSO optimization, C5.0 reached accuracies of 83.11%, 71.28%, and 97.19% on the lymphography, vehicle, and wine datasets, respectively. With GA optimization, it reached 83.11%, 68.88%, and 96.63% on the same datasets, respectively.
With the hybrid GA-PSO method, the C5.0 decision tree achieved higher accuracy than the conventional methods on the lymphography and vehicle datasets, namely 83.78% and 71.54%. On the wine dataset, the accuracy was 96.63%, slightly lower than with PSO optimization. The results of the DT C5.0 + HGAPSO test are shown in Table 4. As shown in Table 4 and Figure 6, testing on the lymphography, vehicle, and wine datasets produced optimal results on the four test parameters: accuracy, recall, precision, and F-measure.

CONCLUSION
Based on the test results, applying the HGAPSO method to DT C5.0 increased the accuracy on the lymphography and vehicle datasets above that of the conventional methods, to 83.78% and 71.54%, respectively. On the wine dataset, the accuracy was 96.63%, slightly lower than with PSO optimization. This is because the dimensionality of the wine dataset is smaller than that of the lymphography and vehicle datasets. The GA-PSO hybrid method can be quite effective in improving classification performance on high-dimensional data.