A review of disentangled representation learning for visual data processing and analysis

Li Yating1, Xiao Jing1, Liao Liang2, Wang Zheng1, Chen Wenyi1, Wang Mi3 (1. National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, Wuhan 430072, China; 2. Digital Content and Media Sciences Research Division, National Institute of Informatics, Tokyo 101-8430, Japan; 3. State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China)

Abstract
Representation learning is one of the core problems in machine learning research. The input representations of machine learning algorithms have shifted from the formerly dominant hand-crafted features to latent representations learned from multimedia data, bringing substantial gains in algorithm performance. However, representations of visual data are usually highly entangled: all information components of the input are encoded into the same feature space, where they interact and are hard to distinguish, leaving the representation poorly interpretable. Disentangled representation learning aims to learn a low-dimensional, interpretable abstract representation that identifies and separates the distinct latent factors of variation hidden in high-dimensional observations. With a disentangled representation, the information of a single factor of variation can be captured and controlled through the corresponding latent subspace, which makes the representation more interpretable. Disentangled representations can improve sample efficiency and tolerance to nuisance factors, and they provide a representation that is robust to complex variations in the data; the semantic information they extract matters for downstream artificial intelligence tasks such as recognition, classification, and domain adaptation. This paper first introduces and analyzes the research status of disentangled representation and its causal mechanism, and summarizes three important properties of disentangled representations. It then divides disentangled representation learning algorithms into four categories and compares them in terms of mathematical formulation, characteristics, and scope of application. Next, it categorizes and summarizes the loss functions, datasets, and objective evaluation metrics commonly used in existing disentanglement work. Finally, it reviews applications of disentangled representation learning to practical problems and discusses its future development.
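The objective evaluation metrics mentioned above typically score a representation against the three properties. As a minimal sketch (not a method from the surveyed papers), the DCI-style scores of Eastwood and Williams can be computed from a latent-by-factor importance matrix; the function name `dci_scores` and the toy matrix below are illustrative assumptions.

```python
# Hedged sketch: DCI-style disentanglement/completeness scores from an
# importance matrix R, where R[i, j] measures how much latent dimension i
# matters for predicting generative factor j. "Disentanglement" corresponds
# to modularity, "completeness" to compactness. Toy data only.
import numpy as np

def dci_scores(R):
    """R: (num_latents, num_factors) nonnegative importance matrix."""
    D, K = R.shape
    # Row-normalize: how latent i spreads its importance across factors.
    P = R / (R.sum(axis=1, keepdims=True) + 1e-12)
    # Column-normalize: how factor j spreads across latents.
    Q = R / (R.sum(axis=0, keepdims=True) + 1e-12)
    # Masked log so zero entries contribute 0 (log 1 = 0).
    logP = np.log(np.where(P > 0, P, 1.0))
    logQ = np.log(np.where(Q > 0, Q, 1.0))
    disent = 1.0 - (-np.sum(P * logP, axis=1) / np.log(K))    # modularity, per latent
    complete = 1.0 - (-np.sum(Q * logQ, axis=0) / np.log(D))  # compactness, per factor
    return disent, complete

# Toy importance matrix: each latent predicts exactly one factor,
# so both scores reach their maximum of 1 everywhere.
R = np.array([[1.0, 0.0],
              [0.0, 1.0]])
d, c = dci_scores(R)
print(d, c)  # [1. 1.] [1. 1.]
```

An entangled matrix such as `[[0.5, 0.5], [0.5, 0.5]]` would instead drive both scores to 0.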
Keywords

Abstract
Representation learning is essential to modern machine learning. The shift of input representations from hand-crafted features to learned representations of multimedia data has brought large gains in algorithm performance. However, the representations of visual data are often highly entangled: because all information components are encoded into the same feature space, such representations are difficult to interpret. Disentangled representation learning (DRL) aims to learn a low-dimensional, interpretable abstract representation that sorts out the multiple factors of variation in high-dimensional observations. With a disentangled representation, the information of a single factor of variation can be captured and manipulated through the corresponding latent subspace, which makes the representation more interpretable. DRL can improve sample efficiency and tolerance to nuisance variables and offers a robust representation of complex variations. The semantic information it extracts benefits artificial intelligence (AI) downstream tasks such as recognition, classification, and domain adaptation. This review gives a brief introduction to the definition, research development, and applications of DRL. Some studies on nonlinear independent component analysis (nonlinear ICA) are covered as well, since DRL is closely related to the identifiability issue of nonlinear ICA. The causal mechanism of DRL assumes that high-dimensional observed data are generated by a set of unobserved changing factors (generative factors); DRL models these factors of variation as latent representations and thereby recovers the observed data generation process. We summarize the qualities of a well-defined disentangled representation in three aspects: 1) modularity, 2) compactness, and 3) explicitness.
Explicitness further consists of two sub-requirements, completeness and informativeness. Current DRL methods are categorized into four types, 1) dimension-wise disentanglement, 2) semantic-based disentanglement, 3) hierarchical disentanglement, and 4) nonlinear ICA, in terms of their formulation, characteristics, and scope of application. Dimension-wise disentanglement assumes that the generative factors are mutually independent and that each factor can be separated and mapped to a single dimension of the latent vector; it suits learning disentangled representations of simple synthetic visual data. Semantic-based disentanglement hypothesizes that certain groups of semantic information are likewise independent: the generative factors are disentangled by group according to specific semantics and mapped to different latent subspaces, which suits complex real-world data. Hierarchical disentanglement is based on the assumption that generative factors at different levels of abstraction are correlated; the factors are disentangled group by group from the bottom up and mapped to latent spaces at different semantic abstraction levels, forming a hierarchical disentangled representation. Nonlinear ICA provides an identifiable way to disentangle the unknown generative factors mixed into the observed data through a nonlinear invertible generator.
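The dimension-wise assumption is the one exploited by beta-VAE-style models, where a weight beta > 1 on the KL term pressures each latent dimension toward a single factor of variation. A minimal sketch follows; the function names and toy inputs are illustrative, not a specific model from the survey.

```python
# Hedged sketch of a beta-VAE-style objective (Higgins et al.) for
# dimension-wise disentanglement. Names like beta_vae_loss are
# illustrative assumptions, not from the surveyed papers.
import numpy as np

def gaussian_kl(mu, logvar):
    """KL( N(mu, exp(logvar)) || N(0, I) ), computed per latent dimension."""
    return 0.5 * (np.exp(logvar) + mu**2 - 1.0 - logvar)

def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    """Reconstruction error plus beta-weighted KL, summed over dimensions."""
    recon = np.sum((x - x_recon) ** 2)    # explicitness: keep factor information
    kl = np.sum(gaussian_kl(mu, logvar))  # beta > 1 pressures per-dimension factors
    return recon + beta * kl

# Toy check: perfect reconstruction with a standard-normal posterior gives 0.
x = np.ones(8)
loss = beta_vae_loss(x, x, mu=np.zeros(3), logvar=np.zeros(3), beta=4.0)
print(loss)  # 0.0
```

Raising `beta` trades reconstruction quality for stronger per-dimension disentanglement pressure.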
Regarding loss functions, those commonly used in disentangled representation learning are grouped into three categories: 1) modularity constraints, which restrict a single latent variable (or a single group of latent variables) in the disentangled representation to capturing only a single factor (or group of factors) of variation, and thereby promote the mutual separation of factors of variation; 2) explicitness constraints, which encourage each latent variable to effectively encode the ground truth of its corresponding generative factor, so that the entire latent representation contains complete information about all generative factors; and 3) multi-purpose constraints, whose loss terms optimize several properties of the disentangled representation, including modularity, compactness, and explicitness, at the same time. A model can combine multiple loss constraint terms into a final hybrid objective function. We compare the scope of application and limitations of each type of loss function and further summarize classical disentanglement works that use hybrid objective functions.
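Such a hybrid objective can be sketched as a weighted sum of one term per constraint category, in the spirit of beta-TCVAE-style decompositions; the weights, term names, and toy values below are illustrative assumptions, not a specific model from the survey.

```python
# Hedged sketch of a hybrid objective combining the three constraint
# categories. The total-correlation term stands in for a modularity
# constraint, as in beta-TCVAE; all numbers here are toy values.
import numpy as np

def hybrid_objective(recon_err, total_corr, dim_kl, alpha=1.0, beta=6.0, gamma=1.0):
    """Weighted sum of constraint terms:
       - recon_err   (explicitness: the latent code must keep factor information)
       - total_corr  (modularity: penalize statistical dependence across latents)
       - dim_kl      (compactness: per-dimension prior regularization)
    """
    return alpha * recon_err + beta * total_corr + gamma * np.sum(dim_kl)

loss = hybrid_objective(recon_err=2.0, total_corr=0.5, dim_kl=np.array([0.1, 0.2]))
print(loss)  # ≈ 5.3 (= 2.0 + 6 * 0.5 + 0.3)
```

In practice each term is estimated from model outputs (e.g., the total correlation via a density-ratio or minibatch estimator) rather than given directly.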
Keywords
