R. Q. A. Fernandes, W. A. Pinheiro, G. B. Xexéo and J. M. de Souza
Cluster, grid, complexity, points, shapes.
|PUBLISHED DATE||December 2017|
|PUBLISHER||The Author(s) 2017. This article is published with open access at www.chitkara.edu.in/publications|
This work proposes an algorithm that uses paths based on tile segmentation to build complex clusters. After allocating data items (points) to geometric shapes in tile format, the complexity of our algorithm is related to the number of tiles instead of the number of points. The main novelty is the way our algorithm goes through the grids, saving time and providing good results. It does not demand any configuration parameters from users, making easier to use than other strategies. Besides, the algorithm does not create overlapping clusters, which simplifies the interpretation of results.
In spite of the fact that many clustering algorithms have been proposed in the last years , it is still a challenge to find irregular shaped clusters from large data sets in acceptable time not demanding too many computational resources. Methods that are able to identify arbitrary shapes impose restrictions of processed data size sets in order to maintain admissible levels of memory usage and processing time.
In the context of Big Data and Data Mining, it is quite relevant the ability of finding clusters without knowing features of the data sets to be clustered. Often, these areas have to deal with large data sets that may produce clusters with arbitrary shapes from different sources, such as: geographic systems, medical systems, and sensors, among others.
This way, clustering algorithms should present some features, such as: ability of assembling clusters of arbitrary shapes and high dimensionality data, capacity of dealing with large data sets having noise and outliers, independence of data order and low time complexity. These characteristics of clustering techniques allow discovering patterns that were not forecast previously. Current clustering algorithms have facing problems to execute these tasks.
This work presents a new clustering algorithm, named Path Clustering, which addresses these features, having better performance than some competitor algorithms existent in literature. Despite the existence of grid clustering algorithms not be something new, the main novelty of Path Clustering is the way it goes through the grids, saving time and getting good results in terms of formed clusters. It considers points allocated in a space divided in small square forms, named tiles (subdivisions of a space in squares, cubes, etc). Neighbors tiles containing one or more points are grouped, which may create a clustering path. This strategy allows building clusters of irregular shapes, independently of data order, and with low complexity. In this context, a relevant cluster in this paper follows the definition related to Contiguous Cluster (Nearest neighbor or Transitive) given at : “A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster”.
The experiments demonstrate our strategy is scalable and provide clusters of good quality when it is compared to other clustering algorithms. At the same time, it allows to find complex patterns from arbitrary data distributions, irregular or not.
The rest of our work is organized as follow: Section 2 presents the related work, Section 3 details the proposed algorithm, Section 4 shows the experiments and Section 5 presents our remarks and points out future works.
|ISSN||Print : 2321-3906, Online : 2321-7146|
This paper presents a grouping algorithm named Path Clustering. It is a clustering grid algorithm that divides data space in segments limited by square shapes called tiles. The way used to navigate from one tile to another provides a new faster strategy to group data. It uses a greedy strategy, navigating through local neighbors following a specific path. Tiles containing one or more points separated by tiles containing no points receive different colors. Different colors indicate different groups.
After allocating data items (points) in geometric shapes in tile format, the complexity of our algorithm is related to the number of tiles instead of the number of points. Thus, the number of tiles is fundamental to reduce the complexity of our algorithm. In our strategy, a tile is represented by a square shape and the square-side size is a number related to a concept named Resolution Scale. In order to keep a balance between low complexity and good results for different data distributions, we proposed that maximum Resolution Scale should provide a complexity of O(n), reducing space in memory and time of processing when compared to other algorithms.
Other interesting characteristic is that the algorithm presented in this paper does not require parameters from users beyond the data set itself, making easier to use than many other algorithms.
It was performed a comparative experiment considering Path Clustering and other five algorithms. The experiments demonstrated our strategy is scalable using two type of distributions (Gaussian and Radial) with different number of points. Path clustering showed the best performance in terms of time. Besides, their clusters get a good precision and recall in most part of tested distributions, considering the idea of a Contiguous Cluster (Nearest neighbor or Transitive). At the same time, it allows to find complex patterns from arbitrary data distributions, irregular or not.
Path Clustering implementation used in this work did not apply concurrent computing to execute the processes. Concurrent algorithms can execute processes in parallel, reducing processing time. Thus, we intend to implement this feature in future works, allowing to reduce even more the execution time for grouping data. Besides, we intend to compare the algorithm in a high-dimensional data scenario and calculate automatically the best number of grids according to the distribution and number of points.