Robust and diverse multi-view clustering based on self-paced learning

Tang Yongqiang1,2, Zhang Wensheng1,2 (1. Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China; 2. University of Chinese Academy of Sciences, Beijing 100190, China)

Abstract
Objective Multi-view clustering in the big-data environment is a valuable and highly challenging problem. Existing methods suited to large-scale multi-view data clustering can, to some extent, overcome the local minima caused by the non-convexity of the objective function, but they lack robustness to outliers and ignore view diversity during sample selection. To address these problems, we propose a robust and diverse multi-view clustering model based on self-paced learning (RD-MSPL). Method 1) Outliers are modeled by introducing the structured-sparsity L2,1-norm into the objective function; 2) the diversity of the samples selected across multiple views is increased by imposing an anti-structure sparse constraint on the sample weight matrix in the self-paced regularization term. Result Experiments on the public Extended Yale B, Notting-Hill, COIL-20, and Scene15 datasets show that: 1) on all four datasets, the proposed RD-MSPL outperforms the two most relevant existing multi-view clustering methods; compared with robust multi-view K-means clustering (RMKMC), clustering accuracy improves by 4.9%, 4.8%, 3.3%, and 1.3%, respectively, and compared with MSPL, accuracy improves by 7.9%, 4.2%, 7.1%, and 6.5%, respectively; 2) self-comparison experiments confirm the effectiveness of modeling robustness and sample diversity in the proposed model; 3) comparisons with single views and with a simple concatenation of multiple views show that RD-MSPL explores the relationships among views more effectively. Conclusion This paper proposes a robust and diverse multi-view clustering model based on self-paced learning and designs an efficient algorithm to solve it. The proposed method effectively overcomes the influence of outliers on clustering performance and gradually adds diverse samples from different views during clustering, thereby avoiding bad local minima while better exploiting the complementary information of different views. Experimental results show that the proposed method outperforms existing related methods.
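The full RD-MSPL objective is not reproduced on this page. As a rough illustration of how the two ingredients above could fit together, the following is a minimal sketch, assuming a weighted K-means/matrix-factorization loss per view; the symbols x_i^(v), F^(v), G, W, λ, and γ are our own notation, not the paper's, and the exact formulation may differ.

```latex
% Illustrative form only, reconstructed from the abstract; not the paper's exact objective.
\min_{\{F^{(v)}\},\,G,\,W}\;
\sum_{v=1}^{V}\sum_{i=1}^{n} w_{iv}\,
\bigl\lVert x_i^{(v)} - F^{(v)} g_i \bigr\rVert_2
\;-\;\lambda \sum_{v=1}^{V}\sum_{i=1}^{n} w_{iv}
\;-\;\gamma\,\lVert W \rVert_{2,1},
\qquad w_{iv}\in[0,1].
```

In this sketch, the first term is a weighted L2,1-style clustering loss for each view (L2 over features, L1 over samples), so each outlier contributes only linearly; the -λ term is the usual self-paced "easiness" regularizer that gradually admits harder samples as λ grows; and the negative L2,1-norm on the sample weight matrix W (with its grouping chosen as in the paper, e.g., by view) is the anti-structure sparse constraint that discourages concentrating the selected samples in a few groups.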
Keywords
Robust and diverse multi-view clustering based on self-paced learning

Tang Yongqiang1,2, Zhang Wensheng1,2(1.Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China;2.University of Chinese Academy of Sciences, Beijing 100190, China)

Abstract
Objective In real-world applications, datasets naturally comprise multiple views. For instance, in computer vision, images can be described by different features, such as color, edge, and texture; a web page can be described by the words appearing on the page itself and by the hyperlinks pointing to it; and a person can be recognized by their face, fingerprint, iris, and signature. Clustering aims to explore meaningful patterns in an unsupervised manner. In the era of big data, with the rapid increase of multi-view data, obtaining better clustering performance than any single view by using complementary information from different views is a valuable and challenging task. Popular multi-view clustering methods can be roughly divided into two categories: spectral clustering based and nonnegative matrix factorization (NMF) based. Multi-view spectral clustering methods can perform well on nonlinearly separable data. However, the high computational cost of the eigendecomposition of the Laplacian matrix limits their application to large-scale data clustering. Conversely, the classical K-means clustering method, which has been proven to be equivalent to NMF, is often used in the big-data environment because of its low computational complexity and convenient parallelization. Several studies have extended single-view K-means to the multi-view setting. To a certain extent, multi-view self-paced learning (MSPL) can overcome bad local minima caused by non-convex objective functions. However, two drawbacks need to be addressed. First, MSPL lacks robustness to data outliers. Second, MSPL considers only the criterion that samples should be added to the clustering process in an easy-to-complex order, while ignoring diversity in the sample-selection process. To solve these two problems, we propose a robust and diverse multi-view clustering model based on self-paced learning (RD-MSPL). Method A robust K-means clustering method is needed to achieve more stable clustering performance under a fixed initialization. To address this problem, we introduce a structured sparsity norm (the L2,1-norm) into the objective function to replace the L2-norm. The L2,1-norm-based clustering objective enforces the L1-norm along the data-point direction of the data matrix and the L2-norm along the feature direction; thus, the L1-norm reduces the effect of outlier data points on clustering. In addition, ideal self-paced learning should utilize not only easy but also diverse examples that are sufficiently dissimilar from what has already been learned. To achieve this goal, we apply a negative L2,1-norm constraint to the sample weight matrix in the self-paced regularization. As discussed above, the L2,1-norm leads to group-wise sparse representations (i.e., nonzero entries tend to be concentrated in a small number of groups). By contrast, the negative L2,1-norm has the opposite effect (i.e., nonzero entries tend to be scattered across a large number of groups). This anti-structure sparse constraint is expected to realize the diversity of the samples selected from multiple views. The difficulty of solving the proposed objective comes from the non-smoothness of the L2,1-norm. In this study, we propose an effective algorithm to handle this problem.
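Neither the optimization algorithm nor the exact selection rule is spelled out on this page. The snippet below is a hypothetical Python sketch of the two ideas described in the Method section: an L2,1-style loss that sums per-sample L2 residuals (so outliers are penalized linearly), and a rank-dependent acceptance threshold within each group that discourages selecting many samples from a single group. The grouping, the threshold schedule, and all names (l21_residuals, select_diverse, lambda_easy, gamma_div) are assumptions for illustration, not the paper's algorithm.

```python
# Hypothetical sketch (not the authors' code) of an L2,1-style loss and
# a diversity-biased self-paced sample selection.
import numpy as np

def l21_residuals(X, centroids, labels):
    """Per-sample L2 residuals to the assigned centroid.
    Summing them gives an L2,1-style loss (L2 over features, L1 over samples),
    so an outlier contributes linearly rather than quadratically."""
    return np.linalg.norm(X - centroids[labels], axis=1)

def select_diverse(losses_per_group, lambda_easy, gamma_div):
    """Self-paced selection with a diversity bias: within each group
    (e.g., one view), samples are ranked from easy to hard, and the
    acceptance threshold shrinks with the rank, so no single group
    dominates the selected set (a counter-effect to group-wise sparsity)."""
    selected = {}
    for g, losses in enumerate(losses_per_group):
        keep = []
        for rank, i in enumerate(np.argsort(losses)):   # easiest first
            threshold = lambda_easy + gamma_div / (np.sqrt(rank + 1) + np.sqrt(rank))
            if losses[i] < threshold:
                keep.append(i)
        selected[g] = np.asarray(keep, dtype=int)
    return selected

# Tiny usage example with random data (5 clusters, one view).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
centroids = rng.normal(size=(5, 20))
labels = rng.integers(0, 5, size=100)
losses = l21_residuals(X, centroids, labels)
print("L2,1 loss:", losses.sum())
print("selected:", select_diverse([losses], lambda_easy=6.0, gamma_div=1.0)[0])
```

The rank-dependent threshold mirrors the closed-form selection rule commonly used for self-paced learning with a diversity regularizer; whether RD-MSPL adopts exactly this rule is not stated in the abstract.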
Result We perform experiments on four public datasets, namely, Extended Yale B, Notting-Hill, COIL-20, and Scene15. Clustering performance is measured using six popular metrics: normalized mutual information (NMI), accuracy (ACC), adjusted Rand index (AR), F-score, precision, and recall. Higher values indicate better performance, and the metrics favor different properties of a clustering, so together they provide a comprehensive evaluation. On all datasets, the final results for these metrics are reported as the mean and standard deviation over 20 runs, with the best values highlighted in bold in each table. First, we compare our proposal with robust multi-view K-means clustering (RMKMC) and MSPL, the most relevant multi-view clustering methods. The experimental results indicate that the proposed RD-MSPL is superior to both methods in almost all metrics, except for recall on the Notting-Hill dataset. Then, we experimentally verify the importance of the two key components of the proposed model (i.e., model robustness and sample diversity). Finally, we compare the proposed RD-MSPL with single views and with concatenated multiple views; its superior performance confirms that RD-MSPL better captures complementary information and explores the relationships among multiple views. In the proposed model, two self-paced learning parameters influence clustering performance. These parameters control the pace at which the model learns new examples and diverse examples, respectively, and they are usually increased iteratively during optimization. In this study, we conduct a further parameter-sensitivity analysis to better understand the characteristics of the RD-MSPL model. The results show that although these two parameters play an important role in performance, most results remain better than the single-view baseline. Conclusion In this paper, a new model called RD-MSPL is proposed for large-scale multi-view data clustering. The proposed model effectively mitigates the effect of outliers. As diverse samples from different views are gradually added during clustering, the proposed method better exploits complementary information from different views while avoiding bad local minima. We conduct a series of comparative analyses against several existing methods on multiple datasets, and the experimental results show that the proposed model is superior to the existing related multi-view clustering methods. Future research will focus on 1) extending the method to a wider range of data with the kernel trick, because the proposed method assumes that all features lie on linear manifolds, and 2) adaptive learning of the self-paced learning parameters in such an unsupervised setting.
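As a companion to the evaluation protocol described above (mean and standard deviation of the metrics over 20 runs), the following sketch shows one standard way to compute ACC, NMI, and AR with scikit-learn and SciPy, with clustering accuracy obtained via Hungarian matching between predicted and true cluster labels. The helper names and the assumption of integer labels starting from 0 are ours, not the paper's; this is not the paper's evaluation script.

```python
# Sketch of a standard evaluation routine for the metrics reported above.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

def clustering_accuracy(y_true, y_pred):
    """ACC: fraction of samples correctly labeled under the best one-to-one
    mapping between predicted and true clusters (Hungarian algorithm)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = int(max(y_true.max(), y_pred.max())) + 1
    contingency = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        contingency[t, p] += 1
    rows, cols = linear_sum_assignment(-contingency)   # maximize matched pairs
    return contingency[rows, cols].sum() / len(y_true)

def report(y_true, predictions_per_run):
    """Mean and standard deviation over repeated runs (the paper uses 20)."""
    scores = {
        "ACC": [clustering_accuracy(y_true, p) for p in predictions_per_run],
        "NMI": [normalized_mutual_info_score(y_true, p) for p in predictions_per_run],
        "AR":  [adjusted_rand_score(y_true, p) for p in predictions_per_run],
    }
    for name, vals in scores.items():
        print(f"{name}: {np.mean(vals):.3f} +/- {np.std(vals):.3f}")
```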
Keywords
