J. Today’s Ideas - Tomorrow’s Technol.

Path Clustering: Grouping Complex Data Distributions in an Efficient Way

R. Q. A. Fernandes, W. A. Pinheiro, G. B. Xexéo and J. M. de Souza

KEYWORDS Cluster, grid, complexity, points, shapes.

PUBLISHED DATE December 2017
PUBLISHER The Author(s) 2017. This article is published with open access at www.chitkara.edu.in/publications

This work proposes an algorithm that uses paths based on tile segmentation to build complex clusters. After allocating data items (points) to geometric shapes in tile format, the complexity of our algorithm is related to the number of tiles instead of the number of points. The main novelty is the way our algorithm traverses the grid, saving time and providing good results. It does not demand any configuration parameters from users, making it easier to use than other strategies. Besides, the algorithm does not create overlapping clusters, which simplifies the interpretation of results.


Although many clustering algorithms have been proposed in recent years [1], it is still a challenge to find irregularly shaped clusters in large data sets in acceptable time without demanding too many computational resources. Methods that are able to identify arbitrary shapes impose restrictions on the size of the processed data sets in order to maintain admissible levels of memory usage and processing time.

In the context of Big Data and Data Mining, the ability to find clusters without knowing features of the data sets to be clustered is quite relevant. These areas often have to deal with large data sets that may produce clusters with arbitrary shapes from different sources, such as geographic systems, medical systems, and sensors, among others.

Thus, clustering algorithms should present some features, such as: the ability to assemble clusters of arbitrary shapes from high-dimensionality data, the capacity to deal with large data sets containing noise and outliers, independence of data order, and low time complexity. These characteristics allow clustering techniques to discover patterns that were not previously forecast. Current clustering algorithms face problems in executing these tasks.

This work presents a new clustering algorithm, named Path Clustering, which addresses these features and performs better than some competitor algorithms in the literature. Although grid clustering algorithms are not new, the main novelty of Path Clustering is the way it traverses the grid, saving time and producing good results in terms of formed clusters. It considers points allocated in a space divided into small square forms, named tiles (subdivisions of a space into squares, cubes, etc.). Neighboring tiles containing one or more points are grouped, which may create a clustering path. This strategy allows building clusters of irregular shapes, independently of data order, and with low complexity. In this context, a relevant cluster in this paper follows the definition of a Contiguous Cluster (Nearest neighbor or Transitive) given in [17]: “A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster”.
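The tile-and-path idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the `tile_side` parameter and the breadth-first walk over occupied neighbor tiles (8-connectivity) stand in for the paper's specific path strategy.

```python
from collections import defaultdict, deque

def path_cluster(points, tile_side):
    """Group 2-D points by linking neighboring non-empty tiles.

    Sketch of grid-based clustering: points are hashed into square
    tiles, then occupied tiles that touch (including diagonals) are
    merged into one cluster by a breadth-first traversal.
    """
    # O(n): allocate each point to the tile that contains it.
    tiles = defaultdict(list)
    for (x, y) in points:
        tiles[(int(x // tile_side), int(y // tile_side))].append((x, y))

    labels, current = {}, 0
    for start in tiles:                      # walk over tiles, not points
        if start in labels:
            continue
        current += 1                         # new cluster "color"
        queue = deque([start])
        labels[start] = current
        while queue:
            tx, ty = queue.popleft()
            for dx in (-1, 0, 1):            # visit the 8 neighbor tiles
                for dy in (-1, 0, 1):
                    nb = (tx + dx, ty + dy)
                    if nb in tiles and nb not in labels:
                        labels[nb] = current
                        queue.append(nb)

    # every point inherits the label of its tile
    return {p: labels[tile] for tile, pts in tiles.items() for p in pts}
```

Because the traversal visits tiles rather than points, chains of occupied neighbor tiles can form clusters of arbitrary shape, and the result does not depend on the order in which points arrive.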

The experiments demonstrate that our strategy is scalable and provides clusters of good quality when compared to other clustering algorithms. At the same time, it allows finding complex patterns from arbitrary data distributions, irregular or not.

The rest of our work is organized as follows: Section 2 presents the related work, Section 3 details the proposed algorithm, Section 4 shows the experiments, and Section 5 presents our remarks and points out future work.

Page(s) 141-155
URL http://dspace.chitkara.edu.in/jspui/bitstream/123456789/703/3/4-%20Path%20Clustering%20in%20a%20efficient%20way%20complex%20data%20-%20Fernandes.pdf
ISSN Print : 2321-3906, Online : 2321-7146
DOI https://doi.org/10.15415/jotitt.2017.52004

This paper presents a grouping algorithm named Path Clustering. It is a grid clustering algorithm that divides the data space into segments limited by square shapes called tiles. The way it navigates from one tile to another provides a new, faster strategy to group data. It uses a greedy strategy, navigating through local neighbors along a specific path. Tiles containing one or more points that are separated by tiles containing no points receive different colors. Different colors indicate different groups.

After allocating data items (points) to geometric shapes in tile format, the complexity of our algorithm is related to the number of tiles instead of the number of points. Thus, the number of tiles is fundamental to reducing the complexity of our algorithm. In our strategy, a tile is represented by a square shape, and the square-side size is a number related to a concept named Resolution Scale. In order to keep a balance between low complexity and good results for different data distributions, we proposed that the maximum Resolution Scale should provide a complexity of O(n), reducing memory usage and processing time when compared to other algorithms.
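To illustrate how a Resolution Scale can bound the grid size, a hypothetical helper (the paper's exact definition of Resolution Scale may differ) could derive the tile side length from the data's bounding box, so that the number of tiles depends on the chosen scale rather than on the number of points:

```python
def tile_side_for(points, resolution_scale):
    """Derive a square-tile side length from a resolution scale.

    Illustrative assumption: the bounding box of the data is split
    into roughly `resolution_scale` tiles per axis, so the total
    number of tiles is about resolution_scale**2 regardless of how
    many points there are. Allocating points to tiles then costs
    O(n), and the subsequent tile walk is bounded by the grid size.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    extent = max(max(xs) - min(xs), max(ys) - min(ys))
    return extent / resolution_scale if extent > 0 else 1.0
```

For example, choosing a resolution scale on the order of the square root of n keeps the tile count near n, which matches the O(n) target stated above.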

Another interesting characteristic is that the algorithm presented in this paper does not require any parameters from users beyond the data set itself, making it easier to use than many other algorithms.

A comparative experiment was performed considering Path Clustering and five other algorithms. The experiments demonstrated that our strategy is scalable using two types of distributions (Gaussian and Radial) with different numbers of points. Path Clustering showed the best performance in terms of time. Besides, its clusters achieve good precision and recall in most of the tested distributions, considering the idea of a Contiguous Cluster (Nearest neighbor or Transitive). At the same time, it allows finding complex patterns from arbitrary data distributions, irregular or not.

The Path Clustering implementation used in this work did not apply concurrent computing to execute the processes. Concurrent algorithms can execute processes in parallel, reducing processing time. Thus, we intend to implement this feature in future work, further reducing the execution time for grouping data. Besides, we intend to evaluate the algorithm in a high-dimensional data scenario and to calculate automatically the best number of grids according to the distribution and number of points.

  • Fan Yang, Xuan Li, Qianmu Li, Tao Li. (2014). Exploring the diversity in cluster ensemble generation: Random sampling and random projection, Expert Systems with Applications, Volume 41, Issue 10, Pages 4844-4866, ISSN 0957-4174, http://dx.doi.org/10.1016/j.eswa.2014.01.028.
  • Khan, Latifur; Luo, Feng. (2005). Hierarchical clustering for complex data, in press. Int. J. Artif. Intelligence Tools. Vol. 14 No. 5. World Scientific.
  • Aggarwal, Charu C. and Reddy, Chandan K. (2014). Data Clustering: Algorithms and Applications. Chapman and Hall/CRC. ISBN-13: 978-1466558212.
  • Sasirekha, K., Baby, P. (2013). Agglomerative Hierarchical Clustering Algorithm- A Review. International Journal of Scientific and Research Publications, Volume 3, Issue 3. ISSN 2250-3153.
  • Berkhin, Pavel (2002). Survey of clustering data mining techniques. Technical report, Accrue Software, San Jose, CA, 2002.
  • Telgarsky, M., & Vattani, A. (2010). Hartigan’s method: k-means clustering without voronoi. In International Conference on Artificial Intelligence and Statistics (pp. 820-827).
  • Ester, M., Kriegel, H., Sander, J. and Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. Proc. 2nd Int. Conf. Knowledge Discovery and Data Mining (KDD'96), pp. 226-231.
  • Schikuta, E., Erhart, M. (1997). The BANG-clustering system: grid-based data analysis. In Proceeding of Advances in Intelligent Data Analysis, Reasoning about Data, 2nd International Symposium, 513-524, London, UK.
  • Schikuta, E. (1996). Grid-clustering: a fast hierarchical clustering method for very large data sets. In Proceedings 13th International Conference on Pattern Recognition, 2, 101- 105.
  • Wang, W., Yang, J., and Muntz, R. (1997). STING: a statistical information grid approach to spatial data mining. In Proceedings of the 23rd Conference on VLDB, 186- 195, Athens, Greece.
  • Procopiuc, C.M., Jones, M., Agarwal, P.K., Murali, T.M. (2002). A Monte Carlo algorithm for fast projective clustering Proceedings of the 2002 ACM SIGMOD international conference on Management of data, ACM New York, pp. 418-427.
  • Yiu, M. L. and Mamoulis, N. (2003). Frequent-pattern based iterative projected clustering. In Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM), Melbourne, FL, pages 689–692.
  • Assent, I., Krieger, R., Muller, E., and Seidl, T. (2008). InSCY: Indexing subspace clusters with in process-removal of redundancy. In Eighth IEEE International Conference on Data Mining, 2008. ICDM’08, pages 719–724. IEEE.
  • Baeza-Yates, R., Ribeiro-Neto, B. (2011). Modern Information Retrieval: The Concepts and Technology Behind Search. 2011. ISBN-13: 978-0321416919.
  • Scikit. (2017). scikit-learn user guide. Release 0.19.1. Extracted from: http://scikit-learn.org/stable/_downloads/scikit-learn-docs.pdf. 2017.
  • Müller, E., Günnemann, S., Assent, I., Seidl, T. (2009). Evaluating Clustering in Subspace Projections of High Dimensional Data. Home-page: http://dme.rwth-aachen.de/OpenSubspace/. In Proc. 35th International Conference on Very Large Data Bases (VLDB), Lyon, France.
  • Tan, P., Steinbach, M., Karpatne, A. and Kumar, V. (2013). Introduction to Data Mining, (Second Edition). Ed. Pearson. 2013. ISBN-13: 978-0133128901.
  • Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters, Volume 31, Issue 8. 2010. ISSN 0167-8655.
  • Ultsch, A., Lötsch, J. (2017). Machine-learned cluster identification in high-dimensional data. Journal of biomedical informatics, ISSN: 1532-0480.
  • Hancer, E., Karaboga, D. (2017). A comprehensive survey of traditional, merge-split and evolutionary approaches proposed for determination of cluster number. Swarm and Evolutionary Computation, ISSN: 2210-6502.