Researches on multimedia technology 2014: deep learning and multimedia computing

Wu Fei; Zhu Wenwu; Yu Junqing

doi:10.11834/jig.20151101

Review

2015, Vol. 20, No. 11, pp. 1423-1433
Published online: 2015-11-02
Print publication: 2015

Wu Fei, Zhu Wenwu, Yu Junqing. Researches on multimedia technology 2014: deep learning and multimedia computing[J]. Journal of Image and Graphics, 2015, 20(11): 1423-1433. DOI: 10.11834/jig.20151101.

Abstract

The rapid growth of large-scale data poses a profound challenge to multimedia computing. Unlike traditional multimedia computing, which relies heavily on hand-crafted features, data-driven deep learning (feature learning) has become the mainstream approach. This paper reviews the recent progress and open challenges of deep learning in three areas of multimedia computing: retrieval ranking and annotation, multi-modal retrieval and semantic understanding, and video analysis and understanding. It also discusses future trends. In retrieval ranking and annotation, deep learning-based methods such as "neural codes" have proved effective. In multi-modal retrieval and semantic understanding, deep learning is used to bridge both the "heterogeneity gap" between different modalities and the "semantic gap" between low-level features and high-level semantics, and deep learning-based compositional semantic learning has become a research focus. In video analysis and understanding, deep neural networks are used to learn effective video representations and to recognize actions, with good results. However, deep learning is a data-driven method: it is easily affected by noise in the data and is not yet mature for online incremental learning. How to combine deep learning with crowdsourcing computing is an open problem and a possible future research direction. Building on an analysis of existing methods, this review offers new ideas for overcoming the heterogeneity gap and the semantic gap within the deep learning framework.