APPLICATION OF MULTISTAGE CLUSTERING FOR MAPPING ECONOMIC POTENTIAL IN EAST JAVA PROVINCE

This study aims to map the economic potential of East Java Province based on GRDP by business field category. Multistage clustering is a method developed for data containing outliers and datasets with large variance. It combines Ordering Points to Identify the Clustering Structure (OPTICS) and K-Means. In the first stage, the data are grouped using OPTICS. The observations flagged as outliers in this stage then form the dataset for the second stage, which uses K-Means. The performance of this method is compared with several other methods, namely K-Means, DBSCAN - K-Means, Agglomerative, Fuzzy C-Means (FCM), Possibilistic C-Means (PCM), and Fuzzy Possibilistic C-Means (FPCM), based on the Silhouette score and the Davies-Bouldin score. Multistage clustering was chosen as the best method, with a Silhouette score of 0.442 and a Davies-Bouldin score of 0.388. Based on the Elbow method and the two metrics, the optimum number of clusters is 8. In the resulting mapping, the City of Surabaya forms a separate cluster with the highest economic potential in 15 business field categories. Gresik, Pasuruan, Sidoarjo, and Probolinggo follow with the second highest economic potential, ranking in the top 3 in 10 business field categories.


INTRODUCTION
There are 17 goals in the Sustainable Development Goals (SDGs), whose achievement requires cooperation from all stakeholders at both the central and regional levels. Indonesia's demographic composition is dominated by the productive age population, at 68.7% of the total population in 2019 (Bappenas, BPS, and UNFPA, 2018), which offers a large potential workforce that can accelerate economic growth [1].
Gross Regional Domestic Product (GRDP) is an important indicator of the economic condition of a region in a given period. GRDP calculation is a very important part of macroeconomics, especially for the economic analysis of a region [2] [3]. GRDP can also measure the rate of economic growth [4]. Based on data from the Central Statistics Agency (BPS), Indonesia's economic growth in the second quarter of 2022 reached 5.44% (y-o-y). This acceleration was supported by continued growth in domestic demand, especially household consumption, and by export performance that remained high. The improvement was driven by several business fields, such as processing, transportation and warehousing, and wholesale and retail trade [5].
In 2021, East Java Province contributed 14.57% to Indonesia's Gross Domestic Product (GDP). East Java has potential across a range of sectors, from agriculture and the processing industry to trade and services. The geographic and socio-cultural diversity of its regions drives this variety of potential [6]. East Java's economic growth continues to show good progress. In the second quarter of 2022, the business field with the most significant growth was transportation and warehousing at 22.21%, followed by other services at 13.07% and electricity and gas procurement at 9.58% [7].
The 2015 World Bank report shows that Indonesia's economic growth was enjoyed by only the 20% of the population in the highest income group, identified as the consumer class. In 2021, Indonesia's economic growth, which reached 7.07%, was likewise enjoyed only by the upper middle class [8].
The purpose of this research is to map the economic potential of East Java Province. The mapping is based on clustering Gross Regional Domestic Product (GRDP) at current prices by business field. Researchers have used various clustering methods for mapping. The Fuzzy C-Means clustering method was used in a mapping system for East Java Province to classify cities/regencies with potential for transmigration [9]. Fuzzy C-Means and K-Means algorithms have also been used to process agricultural data, forming clusters of agricultural land areas by commodity type based on supporting attributes. The results provide land information such as the number of clusters, land area, location, and level of productivity, which can serve as input for the conversion and arrangement of agricultural land [10].
The Location Quotient (LQ) is used to identify leading (base) and noncompetitive (non-base) sectors in support of economic growth. One study analyzed the Gross Regional Domestic Product (GRDP) at 2010 constant prices of Arfak Mountains Regency and West Papua Province for 2014-2019. It found 6 leading sectors that contribute significantly to the economy of Arfak Mountains Regency and recommended that the regional government manage and improve the quality of the noncompetitive (non-base) sectors [11]. Location Quotient, Shift Share, and Klassen Typology analyses were used to classify the growth of economic sectors and identify the base and leading sectors in Pamekasan Regency [12]. The Location Quotient (LQ) method, Shift-Share analysis, the Growth Ratio Model (MRP), and Overlay Analysis were also used to map the potential of Ngawi Regency based on GRDP for 2015-2019. The results show that the base sectors in Ngawi Regency are agriculture, transportation and communication, and services [13].
Hierarchical cluster analysis has been used to classify the welfare level of regencies/cities, as input for policy formulation and as a tool to view conditions and to monitor and evaluate the success of development in East Java in accordance with the SDGs [14]. Areas have also been grouped by economic potential through clustering data with mixed attributes (numerical and categorical). That clustering used the Fuzzy k-prototypes algorithm with a modified Eskin distance to measure distances between categorical attributes. Its results can guide the setting of village development targets for increasing the Village Building Index in Demak Regency [15].
Clustering is the most important unsupervised learning technique, as it deals with finding structure in collections of unlabeled data. Several clustering algorithms, such as K-Means, Fuzzy C-Means, hierarchical clustering, DBSCAN, OPTICS, STING, ROCK, and CACTUS, have been compared in order to select an appropriate clustering approach [16]. The DBSCAN method has also been used for grouping earthquake data [17].
OPTICS is an unsupervised, hierarchical density-based learning algorithm that is not sensitive to its parameters. It can handle clusters with nonuniform densities and can visualize the cluster structure of the dataset [18]. OPTICS uses the distances between neighboring points to construct reachability plots, which are used to distinguish clusters of varying densities from noise [19].
A two-stage clustering method was used to capture more representative temporal patterns of load shape and peak demand through a cluster merging approach [20]. A two-stage clustering technique was also used for image-based hyperspectral sensing with the Neighboring Union Histogram (NUH): the first stage performs a relatively coarse clustering with K-Means to classify the NUH of each group, and the second stage uses K-Means to refine the results of the first grouping [21]. Multistage clustering has also been used for summarizing unstructured text documents [22].
This research combines two clustering methods that are often used by other researchers, namely Ordering Points to Identify the Clustering Structure (OPTICS) and K-Means. Multistage clustering was developed to handle outliers and datasets with large variance. In the first stage, OPTICS is used to separate the clusters from noise or outliers. In the second stage, the observations flagged as outliers are regrouped using K-Means. The performance of multistage clustering is also compared with other clustering methods.

MATERIAL AND METHODS
The stages of this research are shown in Fig. 1: dataset collection, extraction and transformation (ETL), preliminary analysis, OPTICS clustering (Stage-1), taking the outliers as a new dataset, K-Means clustering (Stage-2), combining the cluster labels, modeling with several other methods, model evaluation, and implementation and analysis. In the early stages, descriptive analysis was carried out on the original data and on the dataset transformed using the natural logarithm, the square root, and a combination of both. In addition, data adequacy was tested using the Kaiser-Meyer-Olkin (KMO) test, and multivariate normality was tested using the Shapiro-Wilk normality test.
In Stage-1, grouping was carried out using OPTICS. Observations classified as outliers at this stage were then grouped using K-Means, and the labels from Stage-1 and Stage-2 were combined. Grouping was also carried out using several other methods, namely K-Means, multistage clustering (DBSCAN - K-Means), Agglomerative, FCM, PCM, and FPCM. At the model evaluation stage, the best model was selected based on the Silhouette score and the Davies-Bouldin score. The final stage applies the best method to map economic potential based on GRDP for each business field category across the regencies/cities of East Java Province.

Datasets
The dataset in this study comes from the publication Regency/City Gross Regional Domestic Product in East Java Province According to Business Sector, issued by the Central Bureau of Statistics (Badan Pusat Statistik, BPS) of East Java, for 2014-2021. The clustering attributes are listed in Table 1.

Multistage Clustering Algorithm
The main drawback of DBSCAN is its inability to find clusters with different densities [23]. For this reason, some of the DBSCAN authors developed the OPTICS method. Multistage clustering consists of two stages of clustering. The first stage uses OPTICS; the observations it identifies as outliers become the new dataset for the second stage, which uses K-Means. The steps of the OPTICS algorithm are given as pseudocode in [24], and the second stage uses the K-Means algorithm [25].
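The two stages described above can be sketched as follows. This is a minimal illustration, assuming scikit-learn's `OPTICS` and `KMeans` as stand-ins for the algorithms in [24] and [25]; the function name `multistage_clustering`, its parameters, and the synthetic data are hypothetical and are not taken from the study.

```python
import numpy as np
from sklearn.cluster import OPTICS, KMeans


def multistage_clustering(X, min_samples=5, n_outlier_clusters=2, random_state=0):
    """Stage 1: OPTICS; Stage 2: K-Means on the points OPTICS marks as noise."""
    # Stage 1: density-based clustering; noise points receive the label -1.
    labels = OPTICS(min_samples=min_samples).fit(X).labels_.copy()
    noise = labels == -1
    n_noise = int(noise.sum())
    if n_noise == 0:
        return labels  # nothing left to regroup

    # Stage 2: regroup only the noise points with K-Means.
    k = min(n_outlier_clusters, n_noise)
    km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X[noise])

    # Combine labels: shift the K-Means labels past the OPTICS cluster ids.
    offset = labels.max() + 1
    labels[noise] = km.labels_ + offset
    return labels


# Synthetic data (assumption): two dense groups plus a few scattered points.
rng = np.random.default_rng(0)
dense = np.vstack([rng.normal(loc=c, scale=0.3, size=(40, 2)) for c in ([0, 0], [5, 5])])
sparse = rng.uniform(low=15, high=40, size=(8, 2))
X = np.vstack([dense, sparse])

labels = multistage_clustering(X, min_samples=5, n_outlier_clusters=2)
```

After the second stage no observation keeps the noise label, so every regency/city in such a setup would be assigned to some cluster.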

Model Evaluation
The performance of multistage clustering is compared with other methods, namely K-Means, DBSCAN - K-Means, Agglomerative, FCM, PCM, and FPCM, based on the Silhouette score and the Davies-Bouldin score.
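For reference, the two evaluation scores are commonly defined as follows. The notation ($a(i)$, $b(i)$, $\sigma_k$, $c_k$, $K$) is introduced here for illustration and does not appear in the original text.

```latex
% Silhouette: a(i) is the mean distance from observation i to the other
% members of its own cluster; b(i) is the smallest mean distance from i
% to the members of any other cluster.
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad
S = \frac{1}{n} \sum_{i=1}^{n} s(i)

% Davies-Bouldin: \sigma_k is the mean distance of the members of cluster k
% to its centroid c_k, and d(c_k, c_l) is the distance between centroids.
\mathrm{DB} = \frac{1}{K} \sum_{k=1}^{K} \max_{l \neq k}
  \frac{\sigma_k + \sigma_l}{d(c_k, c_l)}
```

A Silhouette score closer to 1 and a Davies-Bouldin score closer to 0 both indicate more compact, better separated clusters.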

Implementation and Analysis
In the final stage, the best method is applied to classify economic potential in East Java Province based on GRDP for all business field categories. The grouping is carried out for each year from 2014 to 2021, and the shift of a regency/city from one cluster to another is analyzed at this stage.
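The year-to-year shift analysis could be sketched as below. This is only an illustration: the region names and cluster labels are hypothetical, the helper `cluster_shifts` is not from the study, and the sketch assumes that cluster labels are comparable across years.

```python
# Hypothetical cluster labels per year (not the study's results).
labels_by_year = {
    2019: {"Regency A": 2, "Regency B": 5, "City C": 8},
    2020: {"Regency A": 5, "Regency B": 5, "City C": 8},
    2021: {"Regency A": 5, "Regency B": 1, "City C": 8},
}


def cluster_shifts(labels_by_year):
    """Return (region, year_from, year_to, old_label, new_label) for each change."""
    shifts = []
    years = sorted(labels_by_year)
    for prev, curr in zip(years, years[1:]):
        for region, old in labels_by_year[prev].items():
            new = labels_by_year[curr].get(region)
            # Record a shift only when the region exists in both years
            # and its cluster label differs.
            if new is not None and new != old:
                shifts.append((region, prev, curr, old, new))
    return shifts


shifts = cluster_shifts(labels_by_year)
```

Regions whose label never changes, such as "City C" in this toy input, do not appear in the result.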

Descriptive Statistics
Based on Fig. 2

Preliminary Analysis
The distribution of GRDP in each business field, which is used as the clustering attributes, is shown in the histogram matrix in Fig. 4. None of the attributes are normally distributed, and all tend to have positive skewness. Based on Table 2, the data transformed by the natural logarithm, the square root, and the combination of the two are still not normally distributed and tend to be exponentially distributed, as evidenced by p-values below 5%. The number of outliers is the same in the original and the transformed data, namely 10 observations, and the KMO test value is more than 0.5 (sufficient sample). The distribution of the data after the natural logarithm transformation is shown in the histogram matrix in Fig. 5. Using the Kolmogorov-Smirnov test (α = 1%), 53% of the attributes have a normal distribution, namely categories C, E, F, G, J, K, L, O, and Q.
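The effect of such transformations on skewness can be illustrated with a small sketch. The values below are hypothetical and are not taken from the BPS data; the "combination" is assumed here to mean the square root of the natural logarithm, which the text does not specify; and only skewness is computed, not the Shapiro-Wilk or KMO tests.

```python
import math
import statistics


def skewness(values):
    """Population skewness: mean of the cubed standardized deviations."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Hypothetical right-skewed GRDP values in billion rupiah (illustrative only).
grdp = [120.0, 150.0, 180.0, 210.0, 260.0, 340.0, 500.0, 900.0, 2500.0, 12000.0]

transforms = {
    "original": lambda v: v,
    "ln": math.log,
    "sqrt": math.sqrt,
    # Assumed reading of "a combination of both": sqrt applied to ln.
    "sqrt(ln)": lambda v: math.sqrt(math.log(v)),
}

skew_by_transform = {
    name: skewness([f(v) for v in grdp]) for name, f in transforms.items()
}

for name, value in skew_by_transform.items():
    print(f"{name:>9}: skewness = {value:.3f}")
```

For positively skewed data such as these, the logarithm reduces skewness more than the square root, which is consistent with the log-transformed dataset being carried forward to the clustering stage.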

Model Evaluation
Based on the results of the preliminary analysis, the dataset transformed by the natural logarithm was used at the clustering stage. The next step is to determine the optimal number of clusters based on the Elbow method, the Silhouette score, and the Davies-Bouldin score (DBS). The FCM, PCM, and FPCM methods are left out at this stage. The elbow curve is shown in Fig. 7. Because the curve has no clear elbow, the number of clusters can be set anywhere between 4 and 9.
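A selection procedure of this kind might look like the sketch below, which assumes scikit-learn and uses synthetic data with three well separated groups rather than the GRDP dataset. The inertia values correspond to the quantity plotted in an elbow curve; the variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic data (assumption): three compact groups in two dimensions.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

results = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    results[k] = {
        "inertia": km.inertia_,  # within-cluster sum of squares (elbow curve)
        "silhouette": silhouette_score(X, km.labels_),  # higher is better
        "davies_bouldin": davies_bouldin_score(X, km.labels_),  # lower is better
    }

# Pick the number of clusters with the highest Silhouette score.
best_k = max(results, key=lambda k: results[k]["silhouette"])
```

When the elbow curve is ambiguous, as reported above, the Silhouette and Davies-Bouldin columns of `results` give a second basis for choosing the number of clusters.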

Implementation and Analysis
In the following analysis, cities are referred to with the word "City", while regencies are referred to by name only, without the word "Regency". The multistage clustering method (OPTICS - K-Means) was implemented to map the economic potential of East Java Province based on the GRDP of each business field category. Fig. 11 is a scatter plot of potential category A (Agriculture, Forestry and Fisheries) against potential category B (Mining and Excavation). Cluster-4, especially Bojonegoro, has the highest GRDP in this sector. The red dot has a high value on the category A axis, close to Rp. 10,000 billion, and a category B value of more than Rp. 40,000 billion.
Table 3 shows a shift in the economic potential cluster of Batu City from 2019 to 2020, namely from cluster-2 to cluster-5. From 2020 to 2021, Trenggalek shifted from cluster-5 to cluster-1. Probolinggo also shifted, from cluster-2 to cluster-6, joining Gresik, Pasuruan, and Sidoarjo in one cluster. The second highest in category E (Water Procurement, Waste Management, Waste and Recycling), the third highest in category G (Wholesale and Retail Trade, Car and Motorcycle Repair), the third highest in category I (Provision of Accommodation and Food and Drink), the second highest in category K (Financial Services and Insurance), the second highest in the MN (Company Services) category, the second highest in the P (Education Services) category, the second highest in the Q (Health Services and Social Activities) category, and the second highest in the RSTU (Other Services) category.
In [11][12][13], the mapping is based on the average location quotient, which classifies sectors as base or non-base. In [14], the Internal Cluster Dispersion Rate (ICDRate) is used to evaluate the clustering, whereas this research determines the optimal number of groups based on the Silhouette score and DBS.
There are similarities between this study and [17], which also used the Silhouette score at the clustering evaluation stage. The study in [17] used 4 attributes for clustering and obtained fairly high Silhouette values, between 0.979 and 0.811, at MinPts 3-5 with 3 clusters; in that study, DBSCAN performed better than OPTICS. This study, by contrast, uses 17 attributes with fairly high variance and outliers. The highest Silhouette score is 0.442, obtained by multistage clustering (OPTICS - K-Means) with 8 clusters. This method is better than DBSCAN - K-Means, K-Means, and Agglomerative.

CONCLUSION
In the early stages, several transformations were carried out, but the multivariate normality assumption could not be fulfilled because 10 outliers were detected in the data. However, the KMO test shows that the dataset is sufficient for clustering analysis. Multistage clustering (OPTICS - K-Means) is one solution for clustering numeric datasets with high variability and outliers. Although this method has a relatively low Silhouette score, it is the best method compared with several others, namely K-Means, DBSCAN - K-Means, Agglomerative, and fuzzy clustering. Multistage clustering can be applied and further developed for datasets with a relatively large number of observations.

Table 1. Testing data

RESULT AND DISCUSSION
This chapter discusses the results of the modified LMKNCN algorithm under several tests. For the evasion direction tests, the features used in the training data [17] are discussed first, together with accuracy testing. The algorithm is then tested in quadcopter flight plans in which the quadcopter must reach its target point with static and dynamic obstacles in the way.
The LMKNCN classification features used in this research are the dimensions of the obstacle relative to the quadcopter's position. The dimension feature data are processed into deviance distance data. The feature data consist of 4 parameters: upper span, left span, right span, and lower span. The deviance distance data consist of 4 parameters: left, right, up, and down deviances. Table 2 shows the feature data used in the cluster training data, Table 3 shows the obstacle training data, and Table 4 shows the testing data, all of which resolve as correct.
The simulation tests were run on a computer with a 1.70 GHz Intel Core i3 CPU and 4 GB of RAM.
The tests resulted in an accuracy of 97.5% (Table 4). The learning process between training and testing data required a computation time of 0.142341 seconds.

Case 1
In Case 1, the start point is at coordinate (0.5,4,2) and the target point is at coordinate (7.5,4,2). This case has 1 static obstacle at coordinate

Case 2
In Case 2, there is 1 dynamic obstacle moving up and down along the positive z-axis. This obstacle has an innate velocity of 0.005/. The quadcopter is positioned at the start point (0.5,4,2) and has the target point (7.5,4,2). The dynamic obstacle has an initial coordinate
The closest training cluster data (Table 5) in this case are in Cluster 8. The first closest data point in the cluster is the 12th data point. The centroids are located at the 15th and the 16th data points, shown in Table 6. The most efficient evasion direction is to the right, because the 3 nearest neighbors in the static obstacle training data in Cluster 8 (Table 7) belong to the right class.

Fig 4. Histogram matrix of original data attributes
Furthermore, several transformations were carried out to examine the distribution of the data, the number of outliers, the data adequacy test (KMO test), and the Shapiro-Wilk normality test.

Fig 6. Comparison of the Silhouette score of multistage clustering (OPTICS - K-Means) with other methods
Before implementing multistage clustering (OPTICS - K-Means), the method is compared with several other methods, namely K-Means, multistage clustering (DBSCAN - K-Means), Agglomerative, FCM, PCM, and FPCM. The comparison is based on the Silhouette score. For multistage clustering, the MinPts value varies from 2 to 6, while the other methods are replicated 5 times. The Silhouette scores are shown in Fig. 6. The Silhouette scores of K-Means and Agglomerative are nearly the same, at around 0.26 (the blue and green lines almost coincide). As for the FCM, PCM and FPCM methods (each with a gray,

Fig 7. The elbow method
Fig. 8 shows the Silhouette score for 3 to 9 clusters for the 4 methods. For K-Means, the Silhouette score decreases monotonically as the number of clusters increases. Agglomerative does not perform better than K-Means. For multistage clustering (OPTICS - K-Means), the Silhouette score exceeds 0.44 for 3, 5, 8, and 9 clusters. Whereas in multistage clustering (DBSCAN - K-Means), the Silhouette score for the number of clusters = 4, 5, 7, and 9. When the number of clusters = 8, the Silhouette score is only 0.238. The next analysis focuses on comparing the performance of OPTICS - K-Means with DBSCAN - K-Means.

Fig 11. Scatter plot Category A vs. Category B
Fig. 12 is a scatter plot of potential category A (Agriculture, Forestry and Fisheries) against potential category D (Procurement of Electricity and Gas). The cluster with relatively high GRDP compared with other clusters in these two categories is Cluster-6, namely Gresik, Pasuruan, Sidoarjo, and Probolinggo (brown dots). Surabaya City has the highest potential for the procurement of electricity and gas, but its potential for agriculture, forestry and fisheries is very low. Cluster-4 also tends to have high potential in the agriculture, forestry, and fisheries business fields.

Fig 12. Scatter plot Category A vs. Category D
Fig. 13 is a scatter plot of potential category C (Industry and Processing) against potential category F (Construction). Cluster-8 (Surabaya City) has a very high GRDP in both categories, with a very large gap to Cluster-6, Cluster-4, Cluster-7, and the other clusters.

Fig 13. Scatter plot Category C vs. Category F
In this study, multistage clustering (OPTICS - K-Means) was implemented to group economic potential for each year from 2014 to 2021 in order to observe cluster shifts from year to year.
(4,4.1,2), shown in top view in Fig 6 and side view in Fig 7.

Table 1. Attributes for clustering

Table 4 and Table 5 show the cluster centers (centroids) from the clustering of the 2021 dataset. In general, Cluster-8, namely the City of Surabaya, has the highest economic potential in 15

Table 4. Cluster Center of Category A - I

Table 5. Cluster Center of Category J - RSTU