Semisupervised clustering algorithm combining SUBCLU and constrained clustering for detecting groups in high dimensional datasets


doi:10.18845/tm.v31i3.3904


Articles • Tecnología en Marcha 31 (3) • Jul-Sep 2018 • https://doi.org/10.18845/tm.v31i3.3904



Abstract

High-dimensional data poses a challenge to traditional clustering algorithms: similarity measures lose their meaning as the number of dimensions grows, which degrades the quality of the resulting groups. Subspace clustering algorithms have been proposed as an alternative; they aim to find all groups in all subspaces of the dataset (¹).
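The loss of meaning in similarity measures can be illustrated with a small experiment. The sketch below is not taken from the article; it assumes points drawn uniformly from the unit hypercube and uses a hypothetical helper, `relative_contrast`, that compares the farthest and nearest neighbour of a random query point. As the dimension grows, the two distances tend to become nearly indistinguishable.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for the Euclidean distances from one
    random query point to n_points random points in the unit hypercube."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)

# The contrast shrinks as the number of dimensions increases.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

A small contrast means the nearest and farthest points are almost equally far away, which is why distance-based clustering over all attributes at once becomes unreliable.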

Because groups are detected in lower-dimensional spaces, each group may belong to a different subspace of the original dataset (²). Attributes that the user considers of interest may therefore be excluded from some or all groups, which reduces the value of the result for data analysts.

This project proposes a new algorithm that combines SUBCLU (³) with constrained clustering algorithms (⁴). It lets users mark variables as attributes of interest based on prior domain knowledge, directs group detection toward subspaces that include those attributes, and thereby generates more meaningful groups.
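The abstract does not spell out how the constraint is applied, so the following is only a minimal sketch of one way such a restriction could be expressed. It assumes the bottom-up, Apriori-style candidate generation that SUBCLU is known for, and adds a hypothetical filter, `with_interest`, that keeps only subspaces containing the user's attributes of interest. The function names and the attribute names (`age`, `income`, `height`) are illustrative assumptions, not the authors' implementation.

```python
from itertools import combinations

def candidate_subspaces(subspaces_k):
    """Apriori-style join: merge k-dimensional subspaces that differ in one
    attribute into (k+1)-dimensional candidates, keeping a candidate only if
    every one of its k-dimensional subsets is present (monotonicity pruning).
    All input subspaces are assumed to have the same dimensionality k."""
    known = set(subspaces_k)
    candidates = set()
    for a, b in combinations(known, 2):
        merged = a | b
        if len(merged) != len(a) + 1:
            continue  # the pair differs in more than one attribute
        if all(frozenset(s) in known for s in combinations(merged, len(a))):
            candidates.add(merged)
    return candidates

def with_interest(subspaces, attributes_of_interest):
    """Hypothetical constraint: keep subspaces containing every attribute of interest."""
    required = frozenset(attributes_of_interest)
    return {s for s in subspaces if required <= s}

# One-dimensional subspaces assumed to already contain clusters.
one_dim = [frozenset([name]) for name in ("age", "income", "height")]
two_dim = candidate_subspaces(one_dim)
constrained = with_interest(two_dim, ["income"])
print(sorted(sorted(s) for s in constrained))
```

In this toy run the filter discards the candidate made of `age` and `height` only, so any later clustering step would work solely on subspaces that involve `income`.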

Keywords: Data mining; subspaces; SUBCLU; clustering; clustering by constraint


Cartago, Costa Rica, 159-7050, 25502336, 25525354 - E-mail: alramirez@itcr.ac.cr


Measure	SUBCLU	SUBCLU-R
Total groups generated	1130	599
Groups that include the attribute of interest	595	590
Unique subspaces	1023	521
Unique subspaces that include the attribute of interest	413	513
Groups that do not include the attribute of interest	535	9
Groups in common	549	549
Groups in common with the attribute of interest	539	539
Groups in common without the attribute of interest	9	9
Groups detected by one algorithm and not by the other	581	50
Groups with the attribute of interest detected by one algorithm and not by the other	56	51
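Several counts in the table above are linked by plain set arithmetic: for example, 1130 - 549 = 581 and 599 - 549 = 50 for the groups found by only one algorithm. The sketch below shows how such a comparison could be computed. It assumes each group is represented as a pair of frozensets (subspace, members) and that two groups count as the same only when both parts match exactly; the helper `compare_groups` and the toy data are hypothetical and not the procedure used in the article.

```python
def compare_groups(groups_a, groups_b, attribute):
    """Summarise the overlap between two collections of groups.

    Each group is a (subspace, members) pair of frozensets; two groups are
    treated as identical only if both parts match exactly (an assumption).
    """
    a, b = set(groups_a), set(groups_b)
    common = a & b

    def count_with(groups):
        # Number of groups whose subspace contains the attribute.
        return sum(1 for subspace, _ in groups if attribute in subspace)

    return {
        "total": (len(a), len(b)),
        "with_attribute": (count_with(a), count_with(b)),
        "common": len(common),
        "common_with_attribute": count_with(common),
        "only_in_one": (len(a - b), len(b - a)),
    }

# Hypothetical toy groups: (subspace, member ids).
g1 = (frozenset({"x", "y"}), frozenset({1, 2, 3}))
g2 = (frozenset({"y", "z"}), frozenset({4, 5}))
g3 = (frozenset({"z"}), frozenset({6, 7}))
g4 = (frozenset({"x", "z"}), frozenset({1, 2}))

summary = compare_groups([g1, g2, g3], [g1, g4], attribute="x")
print(summary)
```

The `only_in_one` entry is each total minus the number of common groups, the same relation that holds between the corresponding rows of the table.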

Parameter	Distance function	SUBCLU	SUBCLU-R
Cohesion	Euclidean	4.113	3.8832
Cohesion	Manhattan	2.9312	2.8813
Average subtotal		3.5221	3.38225
Separation	Euclidean	17.7541	17.721
Separation	Manhattan	17.5425	17.5101
Average subtotal		17.6483	17.6155
Silhouette	Euclidean	1.0065	1.0061
Silhouette	Manhattan	1.0055	1.0058
Average subtotal		1.006	1.006
Execution time (hours)	Euclidean	31:05	36:56
Execution time (hours)	Manhattan	29:45	36:11
Average subtotal		30:41	36:34