发布时间: 2022-07-16
摘要点击次数:
全文下载次数:
DOI: 10.11834/jig.200467
2022 | Volume 27 | Number 7

图像分析和识别

混合高斯变分自编码器的聚类网络

陈华华, 陈哲, 郭春生, 应娜, 叶学义

杭州电子科技大学通信工程学院, 杭州 310018

收稿日期: 2020-09-08; 修回日期: 2020-11-02; 预印本日期: 2020-11-09

基金项目: 浙江省“领雁”研发攻关计划项目(2022C03065)

作者简介: 陈华华, 1975年生, 男, 副教授, 主要研究方向为数字图像处理、计算机视觉和模式识别。E-mail: iseealv@hdu.edu.cn
陈哲, 男, 硕士研究生, 主要研究方向为数字图像处理。E-mail: 631950238@qq.com
郭春生, 男, 教授, 主要研究方向为视频分析与模式识别。E-mail: gcs@hdu.edu.cn
应娜, 女, 副教授, 主要研究方向为信号处理、语音信号处理及其应用。E-mail: yingna@hdu.edu.cn
叶学义, 男, 副教授, 主要研究方向为图像处理、模式识别、信息隐藏。E-mail: xueyiye@hdu.edu.cn
*通信作者: 陈华华 iseealv@hdu.edu.cn

中图法分类号: TN911

文献标识码: A

文章编号: 1006-8961(2022)07-2148-09

摘要

目的经典的聚类算法在处理高维数据时存在维数灾难等问题，使得计算成本大幅增加并且效果不佳。以自编码或变分自编码网络构建的聚类网络改善了聚类效果，但是自编码器提取的特征往往比较差，变分自编码器存在后验崩塌等问题，影响了聚类的结果。为此，本文提出了一种基于混合高斯变分自编码器的聚类网络。方法使用混合高斯分布作为隐变量的先验分布构建变分自编码器，并以重建误差和隐变量先验与后验分布之间的KL散度(Kullback-Leibler divergence)构造自编码器的目标函数训练自编码网络；以训练获得的编码器对输入数据进行特征提取，结合聚类层构建聚类网络，以编码器隐层特征的软分配分布与软分配概率辅助目标分布之间的KL散度构建目标函数并训练聚类网络；变分自编码器采用卷积神经网络实现。结果为了验证本文算法的有效性，在基准数据集MNIST (Modified National Institute of Standards and Technology Database)和Fashion-MNIST上评估了该网络的性能，聚类精度(accuracy，ACC)和标准互信息(normalized mutual information，NMI)指标在MNIST数据集上分别为95.86%和91%，在Fashion-MNIST数据集上分别为61.34%和62.5%，与现有方法相比性能有了不同程度的提升。结论实验结果表明，本文网络取得了较好的聚类效果，且优于当前流行的多种聚类方法。

关键词

聚类; 混合高斯分布; 变分自编码器(VAE); 软分配; KL散度

A Gaussian mixture variational autoencoder based clustering network

Chen Huahua, Chen Zhe, Guo Chunsheng, Ying Na, Ye Xueyi

School of Communication Engineering, Hangzhou Dianzi University, Hangzhou 310018, China

Supported by: the "Leading Goose" Technology Research and Development Program of Zhejiang Province (2022C03065)

Abstract

Objective Effective automatic grouping of data into clusters, especially clustering high-dimensional datasets, is one of the key issues in machine learning and data analysis. It is related to many aspects of signal processing applications, including computer vision, pattern recognition, speech and audio recognition, wireless communication and text classification. Current clustering algorithms are constrained of high computational complexity and poor performance in processing high-dimensional data due to the dimension disaster. Deep neural networks based clustering methods have its potential for real data clustering derived of their high representational ability. Autoencoder (AE) or variational autoencoder (VAE) clustering networks improve clustering effectiveness. But, their clustering performance is easy to be distorted intensively because of poor features extraction in distinguishing clear and unclear data or posterior collapse to clarify determining its posterior parameters of the latent variable of VAE, and they are insufficient to segment multiple classes, especially share very similar mean and variance in the context of clustering a multiclass dataset or two different classes. We demonstrate a clustering network based on VAE with the prior of Gaussian mixture (GM) distribution in terms of the deficiency of AE and VAE. Method The VAE, a maximum likelihood generative model, maximizes evidence lower bound (ELBO) via minimizing model reconstruction errors. Its difference of potential cost is through Kullback-Leibler (KL) divergence between the posterior distribution and the hypothesized prior, and then establishes maximum marginal log-likelihood (LL) of the data observed. Due to the approximate posterior distribution used VAE as a benched Gaussian distribution, it is challenged to match the ground truth posterior and have its priority of the KL term in ELBO, and the latent variable space may be arbitrarily complicated or even multimodal. To further improve the description of latent variables, a VAE is facilitated based on a latent variable prior of GM distribution. Its GM distribution prior linked data representation is approximated using the posterior distribution of the latent variable composed of a GM model, and the reconstruction error and the KL divergence based cost function between posterior and prior distribution is adopted to train the GM model based VAE. Due to the KL divergence between two GM distribution functions without a closed form solution, we use the approximate variational lower bound solution of the cost function with the aid of the fact that the KL divergence between the two single Gaussians has a closed-form solution, and implement the VAE using GM distribution priors optimization to resolve the KL divergence. A VAE based clustering network is constructed through a clustering layer combination behind the VAE. To improve the clustering performance, the STUDENT's t-distribution is used as a kernel to compute the soft assignment of the latent features of the VAE between the embedded point and the cluster centroid. Furthermore, a KL divergence cost is constructed between the soft assignment and its auxiliary target distribution. The commonly used VAE utilizes fully-connected neural networks to compute the latent variable, which generates more over fitted data parameters. Thus, the clustering network is carried out by convolutional neural networks (CNNs), which consist of three convolutional layers and two fully-connected layers, without fully-connected neural networks, and no pooling layers used in the network because it will result in loss of useful information of the data. The network is trained by optimizing the KL divergence cost using stochastic gradient descent (SGD) method with the initial network parameters from the VAE. Our clustering network was obtained by the two-step training mentioned above like acquired VAE, as the initial value to train the following clustering layer. Result To test the effectiveness of the proposed algorithm, our network is evaluated on the multiclass benchmark datasets MNIST(Modified National Institute of Standards and Technology Database) which contains images of 10 categories of handwritten digits, and Fashion-MNIST which consists of grayscale images associated to a 10 segmented label. Our algorithm achieves 95.86% accuracy (ACC) and 91% normalized mutual information (NMI) on MNIST, ACC 61.34% and 62.5% NMI on Fashion-NMIST. Our network demonstration has the similar performance to ClusterGAN with fewer parameters and less memory space. The experimental results illustrate that our network achieves feasible clustering performance. Conclusion We construct a VAE based clustering network with the prior of GM distribution. A novel framed VAE is established to improve the representation ability of the latent variable based on a latent variable prior of GM distribution. The KL divergence between posterior and prior GM distribution is optimized to achieve latent variable features of VAE and reconstruct its input well. To improve the clustering performance, the clustering network is trained by optimizing the KL divergence between the soft distribution of the latent features of the VAE and the auxiliary target distribution of the soft assignment. We focus on the issue of where the number of Gaussian components in prior and posterior is different and the ability of the representation of the model on complex texture features further.

Key words

clustering; Gaussian mixture distribution; variational autoencoder(VAE); soft assignment; Kullback-Leibler(KL) divergence

0 引言

随着科学技术的发展，数据的规模呈指数型增长，这些海量数据往往蕴含着许多极具价值的潜在信息，如何捕获并挖掘这些隐藏的潜在信息是一个亟需解决的问题，具有重要的现实意义。

聚类分析是数据挖掘领域的一种研究方法，也是目前用于发现数据间潜在信息的主要方法之一。聚类分析旨在发现数据间潜在的关系，并根据数据的特征等将数据聚为不同的类别，也称为簇，使得簇内的数据具有较小的差异性，簇间的数据具有较大的差异性。聚类分析已广泛应用于用户画像(张海涛等，2018)、协同过滤推荐系统(吴湖等，2010)、基因分析(岳峰等，2008)、异常检测(成宝芝等，2017)和文本聚类(路荣等，2012)等领域，吸引了越来越多的学者加入到聚类分析的研究队伍中。

1 相关工作

MacQueen(1967)提出的K-means聚类算法和Ester等人(1996)提出的DBSCAN(density-based spatial clustering of applications with noise)聚类算法是两个经典的聚类方法。K-means算法的基本思想是将数据分为不同的类别，每一个类别具有一个聚类中心，根据数据到各聚类中心的距离更新优化聚类划分，反复迭代得到一个最佳的聚类结果。DBSCAN是一种基于密度的聚类算法，根据邻域内的点数判断该区域是否属于密集区域并形成临时聚簇，将相连的临时聚簇进行合并得到更合理的聚簇分配。

然而，随着数据维度的提高，经典的聚类算法在应对高维数据时往往存在维数灾难等问题，使得计算成本大幅增加并且效果不佳。近年来，深度神经网络得到了飞速的发展。因此，越来越多的研究人员将目光转向利用深度学习进行聚类分析。相比于传统方法，深度学习能够更好地得到数据特征的低维表示，提高聚类效果。Yang等人(2017)提出了DCN(deep clustering network)模型，该模型训练一个自编码器AE(autoencoder), 得到数据特征的低维表示，然后结合K-means聚类算法，将重构损失和K-means聚类损失进行联合优化，该模型聚类效果高于同时期的传统算法。Xie等人(2016)提出了DEC(deep embedded clustering)模型，模型训练一个堆叠的自编码器得到数据特征的低维表示，然后在此基础上构建聚类网络。Opochinsky等人(2020)和Chazan等人(2019)用多个自编码器建立深度聚类网络，每个自编码器表示一个簇，取得了更好的聚类效果。Duan等人(2019)采用自编码生成一个深度嵌入网用于数据降维，学习一个softmax自编码器用于估计簇的数目，获得了较好的实验结果。但是自编码器的目的主要是为了降维，网络训练目标是使解码器输出和输入尽可能逼近，当训练样本与预测样本不符合相同分布时，提取的特征往往比较差。与自编码器相关的另一种编码器是变分自编码器(variational autoencoder，VAE)，Jiang等人(2017)和Lim等人(2020)提出了变分自编码器用于聚类，先用VAE生成隐层特征，然后用混合高斯分布拟合隐层特征，聚类效果优于经典的聚类方法和一些生成式聚类方法。变分自编码器采用标准正态分布作为先验，容易导致后验塌陷(Guo等，2020)，对不同类别数据的分布不能较好地逼近，影响编码和解码结果。为此，本文引入混合高斯分布作为先验，构建混合高斯自编码器生成隐层特征，学习数据的特征分布，并以此编码器与聚类层结合形成聚类网络，通过优化编码器隐层特征的软分配分布与软分配概率辅助目标分布之间的KL散度(Kullback-Leibler divergence)对聚类网络进行训练。在基准数据集MNIST(Modified National Institute of Standards and Technology Database)和Fashion-MNIST上的实验结果表明，本文网络取得了较好的聚类效果，且优于当前多种流行的聚类方法。

2 聚类模型

2.1 混合高斯变分自编码器

变分自编码器(Kingma和Welling，2014)是深度学习领域一种近似推理的有向模型，它利用变分推断与深度学习相结合，能够学习到数据结构化的特征表示或分布，是生成式建模中的一种重要方法。标准变分自编码器的目标函数可表示为

$\begin{gathered} L(\theta, \phi ; \boldsymbol{x})=-D_{\mathrm{KL}}\left[q_{\phi}(\boldsymbol{z} \mid \boldsymbol{x}) \| p_{\theta}(\boldsymbol{z})\right]+ \\ E_{q_{\phi}(z \mid x)}\left[\log p_{\theta}(\boldsymbol{x} \mid \boldsymbol{z})\right] \end{gathered} $

(1)

式中，当近似后验分布与假设先验分布之间的KL散度$ D_{\mathrm{KL}}\left[q_{\phi}(\boldsymbol{z} \mid \boldsymbol{x}) \| p_{\theta}(\boldsymbol{z})\right]$最小时，目标函数$L(\theta$, $\phi; \boldsymbol{x})$下限最大，从而使模型达到最优。其中，$E$表示求期望，$\theta$和$\phi$分别是先验分布$p$和后验分布$q$的参数，$\boldsymbol{x}$和$\boldsymbol{z}$分别是变分自编码器的输入和隐层特征。但是，标准变分自编码器中采用标准正态分布作为先验，可能引起后验塌陷(Guo等，2020)，并且容易忽略一些潜在的变量约束，导致对不同类别数据的分布不能较好地逼近，影响自编码网络的编码和解码结果。在这里，本文引入混合高斯分布作为先验，构建变分自编码器。混合高斯先验可表示为

$ p_{\theta}(\boldsymbol{z})=\sum\limits_{i=1}^{M} \pi_{i} N\left(\boldsymbol{z}^{(i)} ; \tilde{\boldsymbol{\mu}}^{(i)}, \tilde{\boldsymbol{\sigma}}^{(i)^{2}} \boldsymbol{I}\right) $

(2)

$\sum\limits_{i=1}^{M} \pi_{i}=1 $

(3)

式中，$\pi_{i}$是第$i$个单高斯分布的系数，$\tilde{\boldsymbol{\mu}}^{(i)}$和$\tilde{\boldsymbol{\sigma}}^{(i)^{2}}$代表第$i$个单高斯分布的均值和方差，$\boldsymbol{z}^{(i)}$表示第$i$个单高斯生成的隐层特征，$\boldsymbol{I}$是单位矩阵，$M$表示高斯分量的个数，$N(\cdot; \cdot)$表示高斯分布。相应地，近似后验$q_{\phi}(\boldsymbol{z} \mid \boldsymbol{x})$可表示为

$ \begin{aligned} q_{\phi}(\boldsymbol{z} \mid \boldsymbol{x})= \sum\limits_{i=1}^{M} \omega_{i} N\left(\boldsymbol{z}^{(i)} ; \boldsymbol{\mu}^{(i)}, \boldsymbol{\sigma}^{(i)^{2}} \boldsymbol{I}\right) \end{aligned} $

(4)

$ \sum\limits_{i=1}^{M} \omega_{i}=1 $

(5)

式中，$\omega_{i}$是第$i$个单高斯分布的系数，$\boldsymbol{\mu}^{(i)}$和$\boldsymbol{\sigma}^{(i)^{2}}$代表第$i$个单高斯分布的均值和方差。式(1)中等号右侧的第1项是混合高斯分布近似后验与先验之间的KL散度，目前尚无高效的算法可以获得它的解析解。为此，Hershey和Olsen(2007)提出了一种近似解法，采用变分推断获得KL散度的上界。于是最小化KL散度转换为最小化其近似上界，该算法简述如下：

设数据$\boldsymbol{x}$的混合高斯分布$f(\boldsymbol{x})$和$g(\boldsymbol{x})$分别可表示为

$ f(\boldsymbol{x})=\sum\limits_{i=1}^{M} \alpha_{i} N\left(\boldsymbol{x} ; \overline{\boldsymbol{\mu}}^{(i)}, \overline{\boldsymbol{\sigma}}^{(i)^{2}} \boldsymbol{I} \right) $

(6)

$ g(\boldsymbol{x})=\sum\limits_{i=1}^{M} \beta_{i} N\left(\boldsymbol{x} ; \boldsymbol{\mu}^{\prime(i)}, \boldsymbol{\sigma}^{(i)^{2}} \boldsymbol{I}\right) $

(7)

式中，$\alpha_{i}$和$\beta_{i}$分别是$f(\boldsymbol{x})$和$g(\boldsymbol{x})$的第$i$个单高斯分布的系数，$\sum\limits_{i=1}^{M} \alpha_{i}=1, \sum\limits_{i=1}^{M} \beta_{i}=1, \bar{\mu}^{(i)}$和$\bar{\sigma}^{(i)^{2}}$分别是$f(\boldsymbol{x})$ 中第$i$个单高斯分布的均值和方差，$\boldsymbol{\mu}^{\prime(i)}$和$\boldsymbol{\sigma}^{\prime(i)^{2}}$分别是$g(\boldsymbol{x})$中第$i$个单高斯分布的均值和方差。于是，$f(\boldsymbol{x})$和$g(\boldsymbol{x})$之间的KL散度上界可以表示为

$ \begin{gathered} D_{\mathrm{KL}}(f(\boldsymbol{x}) \| g(\boldsymbol{x})) \leqslant \sum\limits_{i=1}^{M} \alpha_{i}\left(\log \left(\alpha_{i} / \beta_{i}\right)+\right. \\ \left.D_{\mathrm{KL}}\left(f_{i}(\boldsymbol{x}) \| g_{i}(\boldsymbol{x})\right)\right) \end{gathered} $

(8)

式中，$f_{i}(\boldsymbol{x})$和$g_{i}(\boldsymbol{x})$分别代表$f(\boldsymbol{x})$和$g(\boldsymbol{x})$中等号右侧第$i$个单高斯分布。

因此，式(1)中等号右侧第1项可以表示为

$ \begin{gathered} -D_{\mathrm{KL}}\left[\sum\limits_{i=1}^{M} \omega_{i} N\left(\boldsymbol{z}^{(i)} ; \boldsymbol{\mu}^{(i)}, \boldsymbol{\sigma}^{(i)^{2}} \boldsymbol{I}\right)\right. \\ \left.\| \sum\limits_{i=1}^{M} \pi_{i} N\left(\boldsymbol{z}^{(i)} ; \tilde{\boldsymbol{\mu}}^{(i)}, \tilde{\boldsymbol{\sigma}}^{(i)^{2}} \boldsymbol{I}\right)\right] \geqslant \\ -\sum\limits_{i=1}^{M} \omega_{i}\left(\log \frac{\omega_{i}}{\pi_{i}}+D_{\mathrm{KL}}\left(N\left(\boldsymbol{z}^{(i)} ; \boldsymbol{\mu}^{(i)}, \boldsymbol{\sigma}^{(i)^{2}} \boldsymbol{I}\right) \|\right.\right. \\ \left.\left.N\left(\boldsymbol{z}^{(i)} ; \tilde{\boldsymbol{\mu}}^{(i)}, \tilde{\boldsymbol{\sigma}}^{(i)^{2}} \boldsymbol{I}\right)\right)\right)=-\sum\limits_{i=1}^{M} \omega_{i}\left(\log \frac{\omega_{i}}{\pi_{i}}+\right. \\ \left.\sum\limits_{j=1}^{J} \frac{1}{2}\left(\log \frac{\tilde{\boldsymbol{\sigma}}_{j}^{(i)^{2}}}{\sigma_{j}^{(i)^{2}}}+\frac{\left(\mu_{j}^{(i)}-\tilde{\mu}_{j}^{(i)}\right)^{2}+\sigma_{j}^{(i)^{2}}}{\tilde{\sigma}_{j}^{(i)^{2}}}-1\right)\right) \end{gathered} $

(9)

式中，$\tilde{\mu}_{j}^{(i)}$表示$\tilde{\boldsymbol{\mu}}^{(i)}$的第$j$个元素，$\tilde{\boldsymbol{\sigma}}_{j}^{(i)^{2}}$表示$\tilde{\boldsymbol{\sigma}}^{(i)^{2}}$的第$j$个元素，$\tilde{\boldsymbol{\mu}}^{(i)}$和$\tilde{\boldsymbol{\sigma}}^{(i)^{2}}$ 的维数。

式(1)中等号右侧的第2项是重构项，它的计算方式与标准变分自编码器类似，具体为

$ E_{q_{\phi}(\boldsymbol{z} \mid \boldsymbol{x})}\left[\log p_{\theta}(\boldsymbol{x} \mid \boldsymbol{z})\right] \approx \frac{1}{L} \sum\limits_{l=1}^{L} \log p_{\theta}\left(\boldsymbol{x} \mid \boldsymbol{z}_{l}\right) $

(10)

式中，$L$是采样的数量，$\boldsymbol{z}_l$的下标$l$表示第$l$次采样。由式(9)和式(10)可得到混合高斯分布后验与先验的变分下界，即

$ \begin{gathered} L(\theta, \phi ; \boldsymbol{x}) \geqslant-\sum\limits_{i=1}^{M} \omega_{i}\left(\log \frac{\omega_{i}}{\pi_{i}}+\sum\limits_{j=1}^{J} \frac{1}{2} \times\right. \\ \left.\left(\log \frac{\widetilde{\boldsymbol{\sigma}}_{j}^{(i)^{2}}}{\sigma_{j}^{(i)^{2}}}+\frac{\left(\mu_{j}^{(i)}-\tilde{\mu}_{j}^{(i)}\right)^{2}+\sigma_{j}^{(i)^{2}}}{\tilde{\sigma}_{j}^{(i)^{2}}}-1\right)\right)+ \\ \frac{1}{L} \sum\limits_{l=1}^{L} \log p_{\theta}\left(\boldsymbol{x} \mid \boldsymbol{z}_{l}\right) \end{gathered} $

(11)

式(11)即为混合高斯分布自编码器的目标函数。

2.2 聚类网络

本文在混合高斯变分自编码器的基础上，使用编码器部分作为数据空间和特征空间之间的初始映射，将编码器和聚类层组合成聚类网络，如图 1所示。图 1中，以学习获得的混合高斯分布自编码器部分作为聚类网络的编码器部分，结合聚类层通过最小化辅助目标分布和软分配分布之间的KL散度学习聚类网络。具体过程如下：

图 1 聚类网络结构图

Fig. 1 Structure of the clustering network

假设存在数据集$\left\{\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \cdots, \boldsymbol{x}^{(N)}\right\}, N$为数据样本的数量，输入数据$\boldsymbol{x}^{(i)}$到混合高斯变分自编码器中得到隐层特征$\boldsymbol{z}^{(i)}$，使用欧氏距离计算隐层特征$\boldsymbol{z}^{(i)}$到聚类中心$\boldsymbol{c}^{(t)}$的距离，$\boldsymbol{c}^{(t)}$表示第$t$个聚类中心，并使用$t$分布(van der Maaten和Hinton，2008)衡量隐层特征$\boldsymbol{z}^{(i)}$到聚类中心$\boldsymbol{c}^{(t)}$之间的相似度$s_{i t}$，也即特征$\boldsymbol{z}^{(i)}$分配到聚类类别$t$的软分配概率(Dempster等，1977)，$s_{i t}$计算为

$s_{i t}=\frac{\left(1+\left\|\boldsymbol{z}^{(i)}-\boldsymbol{c}^{(t)}\right\|^{2} / \chi\right)^{-\frac{\chi+1}{2}}}{\sum\limits_{t=1}^{T}\left(1+\left\|\boldsymbol{z}^{(i)}-\boldsymbol{c}^{(t)}\right\|^{2} / \chi\right)^{-\frac{\chi+1}{2}}} $

(12)

式中，$\chi$代表$t$分布的自由度，由于聚类属于无监督学习，无法交叉验证$\chi$的取值，因此本文取$\chi$=1。将数据输入到训练获得的混合高斯变分自编码器得到隐层特征，然后在特征空间采用混合高斯模型进行聚类，得到$T$个聚类中心$\left\{\boldsymbol{c}^{(1)}, \boldsymbol{c}^{(2)}, \cdots, \boldsymbol{c}^{(T)}\right\}$作为初始化的聚类中心。

本文采用最小化辅助目标分布和软分配之间的KL散度实现对模型的优化，因此辅助目标分布的选择对聚类效果至关重要。Xie等人(2016)提出辅助目标分布应该更加注重高置信度的数据点，以提高聚类的准确性。同时，需要归一化代价函数对每个聚类中心的贡献，防止出现过大的聚簇导致隐层的特征空间扭曲。因此，将软分配概率的辅助目标分布$p_{i t}$定义为

$ p_{i t}=\frac{s_{i t}^{2} / f_{t}}{\sum\limits_{t=1}^{T} s_{i t}^{2} / f_{t}} $

(13)

式中，$f_{t}=\sum\limits_{i} s_{i t}$代表软聚类中类别$t$的频率。

由此，得到聚类层的损失$L_{\text {cluster }}$，定义为软分配分布$s$与辅助目标分布$p$之间的KL散度，采用随机梯度下降法优化损失函数。具体为

$ L_{\text {cluster }}=D_{\mathrm{KL}}(p \| s)=\sum\limits_{i}^{M} \sum\limits_{t}^{T} p_{i t} \log \frac{p_{i t}}{s_{i t}} $

(14)

2.3 网络结构

本文中的变分自编码器采用卷积神经网络实现，实现的网络结构如图 1所示。网络的编码器部分，首先是核大小为3×3、步长为2、通道数为64的卷积层，激活函数采用ReLU(rectified linear unit)函数。考虑到池化层在实现下采样时存在丢失有用信息的不足(Sabour等，2017)，本文采用核大小为3×3、步长为2的卷积层实现下采样以保留重要信息。然后级联大小为3 × 3、步长为2、通道数为128的卷积层，再级联大小为3 × 3、步长为2、通道数为256的卷积层，然后连接两个全连接层，维数分别是2 304和10。

解码器部分在结构上与编码器部分是对称的，首先是两个级联的全连接层，它们的维数分别是10和2 304，然后级联一个维数变形层(reshape层)，将数据维度从2 304转换为3 × 3 × 256，然后级联3个卷积核大小为3 × 3的反卷积层，它们分别具有128通道、64通道、1通道。与编码器类似，解码器采用步长为2的反卷积层实现上采样。解码器在最后一层的反卷积层使用sigmoid函数作为激活函数，其余卷积层、反卷积层的激活函数都采用ReLU函数。

聚类网络中，自编码器的结构直接采用混合高斯分布自编码器的编码部分，并以学习获得的编码器参数作为初始值，进一步按式(14)所示目标函数优化学习聚类网络。

3 实验结果分析

3.1 实验配置

实验采用的计算机配置如下：Intel(R)Core(TM)i5-7300HQ@2.50 GHz CPU，Windows 10，编译器Python3.6，内存为8 GB，编程环境为TensorFlow和Keras，编程语言为Python。网络训练参数值采用正态分布随机初始化，batch size为100，优化方法为Adam优化器，学习率为0.000 1。

为评估本文方法的有效性，采用了聚类分析中常用的MNIST(LeCun等，1998)数据集和Fashion-MNIST(Xiao等，2017)数据集分别进行实验评测。MNIST数据集是LeCun等人(1998)在美国国家标准与技术研究院提供的手写数据集的基础上筛选，并进行了尺寸标准化及数字中心化等处理的标准数据集，由60 000个训练样本和10 000个测试样本组成，每个样本都是28 × 28像素的灰度图像；Fashion-MNIST是一个替代MNIST手写数字集的图像数据集，由德国科技公司Zalando旗下的研究部门提供，涵盖了10种类别共70 000个不同时尚商品的正面灰度图像，图像大小为28 × 28像素，包括T恤、裤子、套衫、裙子、外套、凉鞋、衬衫、运动鞋、包和靴子等10类商品。为简化问题求解，实验中使用的混合高斯中高斯分量的个数$M$ = 10，各分量的混合系数取经验值为1/10，初始化聚类中心的个数$T$ = 10。

3.2 评价指标

采用聚类精度(accuracy，ACC)和标准互信息(normalized mutual information，NMI)作为评估指标。

聚类精度ACC用于衡量算法得到的聚类标签准确性，计算为

$A C C=\max\limits _{m} \frac{1}{N} \sum\limits_{i=1}^{N} 1\left(\boldsymbol{l}^{(i)}=m\left(\hat{\boldsymbol{l}}^{(i)}\right)\right) $

(15)

式中，$\boldsymbol{l}^{(i)}$为给定真实数据类别标签，$\hat{\boldsymbol{l}}^{(i)}$为算法得到的聚类标签，$m(\cdot)$是聚类标签和真实数据类别标签之间所有可能的一对一的映射，目的是在聚类标签与真实数据类别标签之间找到最佳映射，这里采用匈牙利算法(Kuhn，2005)得到最佳映射。1(·)表示指示函数，当括号内的表示为真时值为1，否则为0。

标准化互信息NMI是衡量两个随机事件之间相关性的重要指标，也是常用的聚类评估指标之一，这里用做衡量聚类标签与真实数据类别标签的契合程度。标准化互信息计算为

$ N M I(\boldsymbol{l}, \hat{\boldsymbol{l}})=\frac{M I(\boldsymbol{l}, \hat{\boldsymbol{l}})}{\sqrt{H(\boldsymbol{l}) H(\hat{\boldsymbol{l}})}} $

(16)

式中，$M I(\boldsymbol{l}, \hat{\boldsymbol{l}})$是真实数据类别标签$\boldsymbol{l}$和网络得到的聚类标签$\hat{\boldsymbol{l}}$的互信息，$H(\boldsymbol{l})$和$H(\hat{\boldsymbol{l}})$分别是它们的信息熵。

3.3 实验结果

本文模型分别在MNIST和Fashion-MNIST数据集上进行了聚类实验。为了进一步验证本文方法的有效性，与高斯混合模型(Gaussian mixture model，GMM)(Fraley和Raftery，1998)、VAE+K-means(Kingma和Welling，2014)、DEC(deep embedded clustering)(Xie等，2016)、IDEC(improved DEC)(Guo等，2017)、GMVAE(Dilokthanakul等，2017)、KADC (K-autoencoders deep clustering)(Opochinsky等，2020)、VaDE(variational deep embedding)(Jiang等，2017)、DCVA(deep clustering with VAE)(Lim等，2020)和ClusterGAN(clustering in generative adversarial networks)(Mukherjee等，2019)等算法进行对比，对比结果如表 1和表 2所示。

表 1 不同算法在MNIST和Fashion-MNIST数据集上的ACC比较
Table 1 Comparison of ACC among different algorithms on MNIST and Fashion-MNIST datasets

下载CSV

/%
算法	MNIST	Fashion-MNIST
GMM	52.73	49.37
VAE+K-means	70.92	45.57
DEC	84.30	58.6
IDEC	88.06	58.9
GMVAE	88.54	56.35
KADC	88	60
VaDE	94.46	-
DCVA	88.75	-
ClusterGAN	95	63
本文	95.86	61.34
注：加粗字体表示各列最优结果；“-”表示原文献未做评价。

表 2 不同算法在MNIST和Fashion-MNIST数据集上的NMI比较
Table 2 Comparison of NMI among different algorithms on MNIST and Fashion-MNIST datasets

下载CSV

/%
算法	MNIST	Fashion-MNIST
GMM	49.26	48.69
VAE+K-means	70.34	46.74
DEC	83.64	57.5
IDEC	86.7	55.7
GMVAE	87.03	53.59
KADC	86	65
VaDE	-	-
DCVA	81.9	-
ClusterGAN	89	64
本文	91	62.5
注：加粗字体表示各列最优结果；“-”表示原文献未做评价。

从表 1和表 2可以看出，本文模型与GMM、VAE + K-means、DEC、IDEC、GMVAE、VaDE和DCVA相比，在MNIST和Fashion-MNIST数据集上的聚类精度ACC和标准互信息NMI均有较大提升。与KADC相比，本文方法在Fashion-MNIST数据集上的标准互信息NMI低于KADC。与ClusterGAN算法相比，本文算法在MNST数据集上的聚类ACC和标准互信息NMI优于ClusterGAN，但在Fashion-MNIST数据集上略逊于ClusterGAN算法。除了GMM和ClusterGAN方法，其他方法都是从AE或VAE基础上发展起来的，其中性能最好的VaDE方法对聚簇采用单个质心表示嵌入空间中的向量，而本文方法采用一个混合高斯自动编码器网络表示嵌入空间中的向量，这样对每个聚簇可以实现更丰富的表示，同时辅助目标分布的引入提高了聚类的准确性，防止了隐层的特征空间扭曲，使得算法具有良好的聚类表现；ClusterGAN和本文方法虽然都属于生成式聚类方法，但是两者无论是从网络结构还是实现思路上属于两个不同的方法类别，ClusterGAN是目前聚类性能最好的生成对抗网络，本文方法与该方法的性能相当。总体而言，本文算法具有较好的聚类结果。

此外，由表 1和表 2还可知，尽管MNIST和Fashion-MNIST都是28 × 28像素的高维图像数据，具有相同的数据维度，但是无论是ACC指标还是NMI指标，各方法除了GMM方法在改变数据集时指标下降比较小，其他方法都出现了大幅度下降，这是因为MNIST数据集由灰度变化范围小的图像组成，纹理特征信息比较单一，主要是字符的边界信息，在描述数据特征时对各模型的特征表述能力要求相对较低，而Fashion-MNIST数据集由灰度变化范围大的图像组成，纹理特征信息比较丰富，如衣服、鞋子和包等丰富的内部纹理和边缘，在描述数据特征时对各模型的特征表述能力要求较高，而目前的方法对Fashion-MNIST数据集的特征信息表达均不是很强，导致聚类性能明显劣于MNIST数据集。

图 2列出了本文模型在MNIST和Fashion-MNIST数据集中每个类的10幅图像，其中每一行对应一个聚簇。由图 2(a)可知，本文模型对MNIST数据集的聚类结果较为准确，但出现了若干“4”和“9”混淆的情况，这一结果与“4”和“9”的外观特征相似有关。由图 2(b)可知，Fashion-MNIST数据集下的每个聚类依次为凉鞋、外套、靴子、套衫、裤子、运动鞋、T恤、衬衫、包和裙子。聚类结果中的凉鞋、靴子、裤子、运动鞋、T恤衫、包和裙子等类别的聚类较为准确，在外套、套衫和衬衫这3类中出现了若干次混淆的情况，这一结果的产生与这3类物体的外观较为相似有关，区分这3类物品更依赖于内部纹理特征的差异。

图 2 MNIST和Fashion-MNIST数据集的聚类结果

Fig. 2 Clustering results on MNIST and Fashion-MNIST datasets

((a) MNIST; (b) Fashion-MNIST)

图 3列出了本文模型与ClusterGAN算法在Fashion-MNIST数据集下的重建结果。由图 3可知，ClusterGAN算法重建得到图像的纹理特征较本文模型清晰，其在纹理特征的提取与重建上优于本文方法。因此，ClusterGAN的聚类效果在Fashion-MNIST数据集上略优于本文算法。

图 3 本文模型与ClusterGAN算法的重建结果

Fig. 3 Reconstruction of the proposed network and ClusterGAN

((a) ours; (b) ClusterGAN)

3.4 结构复杂度分析

本文模型在Fashion-MNIST数据集上的聚类指标略低于ClusterGAN算法。本文模型和ClusterGAN算法都属于生成式聚类方法，不同的是ClusterGAN采用生成对抗网络用于聚类，而本文采用混合高斯分布的变分自编码器。对这两种算法的模型参数量进行了比较分析，如表 3所示。

表 3 本文模型与ClusterGAN算法的模型参数量比较
Table 3 Comparison of the number of parameters in the proposed network and ClusterGAN

下载CSV

算法	模型参数量
ClusterGAN	13 199 785
本文	1 225 417
注：加粗字体表示最优结果。

由表 3可知，本文模型参数量不及ClusterGAN算法的1/10，远小于ClusterGAN算法，是一个更轻量级的网络模型，这使得本文模型占用更小的存储空间，降低了对内存的需求，同时能够实现更快的运行速度，在Fashion-MNIST数据集上的性能差异小于2%。

4 结论

本文提出了一种基于混合高斯变分自编码器的聚类网络模型，以混合高斯分布为先验建立变分自编码器，学习数据的特征分布，然后将编码器与聚类层结合构建聚类网络，采用编码器隐层特征的软分配分布与软分配概率辅助目标分布之间的KL散度作为目标函数，对网络进行训练和优化。在基准数据集MNIST和Fashion-MNIST上进行了评价和比较，对比实验结果表明，采用混合高斯自动编码器网络对每个聚簇可以实现更丰富的表示，辅助目标分布的引入使得算法具有良好的聚类表现，使得本文方法在聚类ACC和标准互信息NMI指标都优于当前的一些聚类算法，取得了较好的聚类效果。

但是本文算法也存在两个不足：1)在模型建立上，先验和后验中的高斯分量个数设为相同，虽然实验结果优于当前的一些聚类算法，但是处理方法只是对实际情况的简化处理，与实际情况存在一些差别，实际更一般的情况中，先验和后验中的高斯分量个数并不相同，如何优化求解这个问题是个难题；同时，本文中各高斯分量的混合系数不是通过数学优化求得的最佳混合系数，而是根据经验设定为等概率混合的，缺乏完善的理论支持。2)在模型对信息的表达能力上，当处理复杂纹理特征时，纹理特征的重建效果有待提高。上述问题将是下一步研究的重点。

参考文献

Chazan S E, Gannot S and Goldberger J. 2019. Deep clustering based on a mixture of autoencoders//Proceedings of the 29th IEEE International Workshop on Machine Learning for Signal Processing. Pittsburgh, USA: IEEE: 1-6[DOI: 10.1109/MLSP.2019.8918720]

Cheng B Z, Zhao C H, Zhang L L, Zhang J P. 2017. Joint spatial preprocessing and spectral clustering based collaborative sparsity anomaly detection for hyperspectral images. Acta Optica Sinica, 37(4): 296-306 (成宝芝, 赵春晖, 张丽丽, 张健沛. 2017. 联合空间预处理与谱聚类的协同稀疏高光谱异常检测. 光学学报, 37(4): 296-306) [DOI:10.3788/AOS201737.0428001]

Dempster A P, Laird N M, Rubin D B. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1): 1-22 [DOI:10.2307/2984875]

Dilokthanakul N, Mediano P A M, Garnelo M, Lee M C H, Salimbeni H, Arulkumaran K and Shanahan M. 2017. Deep unsupervised clustering with Gaussian mixture variational autoencoders[EB/OL]. [2020-05-06]. https://arxiv.org/pdf/1611.02648.pdf

Duan L, Aggarwal C, Ma S and Sathe S. 2019. Improving spectral clustering with deep embedding and cluster estimation//Proceedings of 2019 IEEE International Conference on Data Mining. Beijing, China: IEEE: 170-179[DOI: 10.1109/ICDM.2019.00027]

Ester M, Kriegel H P, Sander J and Xu X W. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise[EB/OL]. [2020-05-06]. https://max.book118.com/html/2017/0725/124226331.shtm

Fraley C, Raftery A E. 1998. How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer Journal, 41(8): 578-588 [DOI:10.1093/comjnl/41.8.578]

Guo C S, ZHOU J L, Chen H H, Ying N, Zhang J W, Zhou D. 2020. Variational autoencoder with optimizing Gaussian mixture model priors. IEEE Access, 8: 43992-44005 [DOI:10.1109/ACCESS.2020.2977671]

Guo X F, Gao L, Liu X W and Yin J P. 2017. Improved deep embedded clustering with local structure preservation//Proceedings of the 26th International Joint Conference on Artificial Intelligence. Melbourne, Australia: AAAI: 1753-1759[DOI: 10.24963/ijcai.2017/243]

Hershey J R and Olsen P A. 2007. Approximating the Kullback Leibler divergence between Gaussian mixture models//Proceedings of 2007 IEEE International Conference on Acoustics, Speech and Signal Processing. Honolulu, USA: IEEE: IV-317-IV-320[DOI: 10.1109/ICASSP.2007.366913]

Jiang Z X, Zheng Y, Tan H C, Tang B S and Zhou H N. 2017. Variational deep embedding: an unsupervised and generative approach to clustering//Proceedings of the 26th International Joint Conference on Artificial Intelligence. Melbourne, Australia: AAAI: 1965-1972[DOI: 10.24963/ijcai.2017/273]

Kingma D P and Welling M. 2014. Auto-encoding variational Bayes[EB/OL]. [2020-05-06]. https://arxiv.org/pdf/1312.6114.pdf

Kuhn H W. 2005. The Hungarian method for the assignment problem. Naval Research Logistics, 52(1): 7-21 [DOI:10.1002/nav.20053]

LeCun Y, Bottou L, Bengio Y, Haffner P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11): 2278-2324 [DOI:10.1109/5.726791]

Lim K L, Jiang X D, Yi C Y. 2020. Deep clustering with variational autoencoder. IEEE Signal Processing Letters, 27: 231-235 [DOI:10.1109/LSP.2020.2965328]

Lu R, Xiang L, Liu M R, Yang Q. 2012. Discovering news topics from microblogs based on hidden topics analysis and text clustering. Pattern Recognition and Artificial Intelligence, 25(3): 382-387 (路荣, 项亮, 刘明荣, 杨青. 2012. 基于隐主题分析和文本聚类的微博客中新闻话题的发现. 模式识别与人工智能, 25(3): 382-387) [DOI:10.3969/j.issn.1003-6059.2012.03.004]

MacQueen J. 1967. Some methods for classification and analysis of multivariate observations[EB/OL]. [2020-05-06]. http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=4172CEB4912D3E21EF68579C8888BA56?doi=10.1.1.308.8619&rep=rep1&type=pdf

Mukherjee S, Asnani H, Lin E and Kannan S. 2019. ClusterGAN: latent space clustering in generative adversarial networks[EB/OL]. [2020-05-06]. https://arxiv.org/pdf/1809.03627v1.pdf

Opochinsky Y, Chazan S E, Gannot S and Goldberger J. 2020. K-autoencoders deep clustering//Proceedings of 2020 IEEE International Conference on Acoustics, Speech and Signal Processing. Barcelona, Spain: IEEE: 4037-4041[DOI: 10.1109/ICASSP40776.2020.9053109]

Sabour S, Frosst N and Hinton G E. 2017. Dynamic routing between capsules[EB/OL]. [2020-05-06]. https://arxiv.org/pdf/1710.09829.pdf

van der Maaten L, Hinton G. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9: 2579-2605

Wu H, Wang Y J, Wang Z, Wang X L, Du S Z. 2010. Two-phase collaborative filtering algorithm based on co-clustering. Journal of Software, 21(5): 1042-1054 (吴湖, 王永吉, 王哲, 王秀利, 杜拴柱. 2010. 两阶段联合聚类协同过滤算法. 软件学报, 21(5): 1042-1054) [DOI:10.3724/SP.J.1001.2010.03758]

Xiao H, Rasul K and Vollgraf R. 2017. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms[EB/OL].[2020-05-06]. https://arxiv.org/pdf/1708.07747.pdf

Xie J Y, Girshick R and Farhadi A. 2016. Unsupervised deep embedding for clustering analysis[EB/OL].[2020-05-06]. https://arxiv.org/pdf/1511.06335.pdf

Yang B, Fu X, Sidiropoulos N D and Hong M Y. 2017. Towards K-means-friendly spaces: simultaneous deep learning and clustering[EB/OL].[2020-05-06]. https://arxiv.org/pdf/1610.04794.pdf

Yue F, Sun L, Wang K Q, Wang Y J, Zuo W M. 2008. State-of-the-art of cluster analysisof gene expression data. Acta Automatica Sinica, 34(2): 113-120 (岳峰, 孙亮, 王宽全, 王永吉, 左旺孟. 2008. 基因表达数据的聚类分析研究进展. 自动化学报, 34(2): 113-120) [DOI:10.3724/SP.J.1004.2008.00113]

Zhang H T, Cui Y, Wang D, Song T. 2018. Study of online healthy community user profile based on concept lattice. Journal of the China Society for Scientific and Technical Information, 37(9): 912-922 (张海涛, 崔阳, 王丹, 宋拓. 2018. 基于概念格的在线健康社区用户画像研究. 情报学报, 37(9): 912-922) [DOI:10.3772/j.issn.1000-0135.2018.09.006]