The increasing scale of data poses a great challenge to multimedia computing. Unlike traditional multimedia computing, which relies heavily on hand-crafted features, deep learning (feature learning) has recently achieved notable advances in the field. This paper presents the details of deep learning for multimedia retrieval and annotation, multi-modal semantic understanding, and video analysis and understanding, all of which aim to overcome the heterogeneity gap and the semantic gap of multimedia computing within a deep learning framework. For multimedia retrieval and annotation, deep learning-based "neural codes" have been proposed and proven effective. In addition, deep learning is used in multi-modal semantic understanding to bridge the heterogeneity gap between different modalities and the semantic gap between low-level features and high-level semantics, and deep learning-based compositional semantic learning is attracting increasing attention. Moreover, deep learning has proven effective for video action recognition and for learning good video representations. However, data-driven deep learning is easily affected by noise in the data and is not yet mature enough for online incremental learning. How to combine deep learning with crowdsourcing computing remains a challenge and may be a future research direction. We analyze existing deep learning methods and provide a new way to overcome the heterogeneity gap and the semantic gap within a deep learning framework.