Parallel cross deep convolution neural networks model
2016, Vol. 21, No. 3, pp. 339-347
Online publication: 2016-03-07
Print publication: 2016
DOI: 10.11834/jig.20160308

Image classification and recognition is a classic problem in computer vision and underlies techniques such as image retrieval, object recognition, and video analysis and understanding. Models based on deep convolution neural networks (CNN) have achieved major breakthroughs in this field, far surpassing traditional models based on hand-crafted features. However, many deep models have huge numbers of neurons and parameters and are difficult to train. Drawing on deep CNN models and the principles of human vision, this paper proposes and designs a deep parallel cross CNN model (the PCCNN model). Built on Alex-Net, the model extracts two groups of deep CNN features through two deep CNN data-transform flows; at the top of the model, two rounds of mixing and crossing yield a 1024-dimensional image feature vector, and Softmax regression is finally used to classify and recognize images. Compared with similar models, the features extracted by this model are more discriminative and give better classification and recognition performance: on Caltech101, the top-1 recognition accuracy reaches about 63%, nearly 5% higher than VGG16 and nearly 10% higher than GoogLeNet; on Caltech256, the top-1 recognition accuracy exceeds 46%, nearly 5% higher than VGG16 and 2.6% higher than GoogLeNet. The PCCNN model is effective for image classification and recognition and outperforms comparable models on medium-scale datasets, although its performance on large-scale datasets remains to be verified. The model also offers a new idea for designing other deep CNN models: while controlling depth, extract more feature information to improve model performance.
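The fusion-and-classification step described above can be illustrated with a toy sketch. This is not the paper's implementation: the two "streams" are stand-in random linear layers, and all dimensions and weights are made-up assumptions; only the overall shape of the computation (two parallel feature groups, concatenated into a 1024-D vector, then Softmax regression) follows the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    # Numerically stable softmax over the last axis.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def stream_features(x, W):
    # Placeholder for one deep CNN data-transform flow: ReLU(x @ W).
    return np.maximum(0.0, x @ W)

# Toy dimensions (assumptions, not from the paper, except the 1024-D fusion
# and the 101 classes of Caltech101).
D_in, D_feat, n_classes = 256, 512, 101

W1 = rng.standard_normal((D_in, D_feat)) * 0.01          # stream 1
W2 = rng.standard_normal((D_in, D_feat)) * 0.01          # stream 2
Wc = rng.standard_normal((2 * D_feat, n_classes)) * 0.01  # Softmax regression weights

x = rng.standard_normal(D_in)       # a flattened toy "image"

f1 = stream_features(x, W1)         # first group of CNN features
f2 = stream_features(x, W2)         # second group of CNN features

# "Cross": fuse the two feature groups into one 1024-D vector,
# then classify with Softmax regression.
fused = np.concatenate([f1, f2])    # shape (1024,)
probs = softmax(fused @ Wc)         # class probabilities, shape (101,)
```

The design choice the sketch mirrors is that the classifier sees both feature groups at once, so information missed by one stream can be recovered from the other.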
The classification and recognition of images play an important role in a number of applications, such as image retrieval, object detection, and video content analysis. Nowadays, a major breakthrough has been achieved with deep convolution neural network (CNN) models, which have surpassed state-of-the-art methods for image classification and recognition because the features extracted by CNN models are more discriminative and contain more semantic information than those of traditional approaches. However, some CNN models, such as Alex-Net and ZFCNN-Net, are extremely simple and incapable of extracting enough information to represent images, while other models, such as VGG16/VGG19 and GoogLeNet, have huge numbers of neurons and parameters. In this work, a novel model named deep parallel cross CNN (PCCNN) is proposed, which can extract more effective information from images while using fewer neurons and parameters than other models. Inspired by the mechanism of human vision, which has two visual pathways and an optic chiasma, the proposed PCCNN is designed on the basis of Alex-Net and extracts two groups of CNN features in parallel through a pair of deep CNN data-transform flows. After the first fully connected layer in each stream, the information of the two streams is fused; the fused information is forwarded to the next two fully connected layers, and the output information is fused again to obtain more powerful representative features. Finally, for image classification, Softmax regression is applied to the 1024D image feature vector obtained by fusing the two feature groups. Note that Alex-Net is used as the base model because of its simple architecture and relatively small number of neurons. In the PCCNN model, the first stream is the original Alex-Net, and in the second stream, a stride of 6 instead of 4 is used in the first convolutional layer. A larger stride in the convolutional layer gives worse performance if only a single stream is used, because more information is lost; however, when the two streams are combined, the proposed model outperforms all the other models. In addition, because a larger stride is used in the second stream, the feature maps are smaller, and the number of neurons and parameters does not increase greatly. Several popular public datasets, namely Caltech101, Caltech256, and Scene15, were selected to evaluate the performance of our model, and several state-of-the-art models were implemented with the same settings for comparison. Experimental results demonstrate that the proposed PCCNN model achieves better image classification performance than these models, indicating that the features extracted with the PCCNN model are more discriminative and have stronger representation ability. On the Caltech101 dataset, the top-1 accuracy of the PCCNN model reaches approximately 63%, exceeding that of VGG16 by about 5% and that of GoogLeNet by about 10%. On the Caltech256 dataset, our model also performs better than the other models, with a top-1 accuracy of 46.4%, surpassing VGG16 and GoogLeNet by 5% and 2.6%, respectively. However, our model performs worse than GoogLeNet on the Scene15 dataset, although it is still more accurate than a single Alex-Net. Overall, the proposed PCCNN model outperforms several state-of-the-art CNN models in image classification and recognition, particularly on medium-scale datasets, but does not exhibit better performance on the small-scale dataset. Hence, the model should be further tested on large-scale vision tasks, such as the ImageNet or SUN datasets, which is the next work the authors plan to do. In fact, the PCCNN model is not only applicable to image classification and recognition but also provides a novel way of thinking about deep CNN model design. In a deep CNN model, the deeper the architecture is, the more neurons and parameters exist, and the complexity also increases significantly. Thus, the width of the model can instead be increased to extract more features and obtain better performance. Although this approach also increases the number of neurons and parameters, the rate of increase is slower than when more layers are added to a single model; furthermore, the model is more in line with the human visual physiological mechanism. Finally, the PCCNN model has great extensibility.
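As a back-of-the-envelope illustration of why the stride-6 second stream adds relatively few neurons, the standard valid-convolution output-size formula can be applied to Alex-Net-style first-layer numbers (the 227x227 input and 11x11 kernel are assumptions about the base Alex-Net configuration, not figures taken from this abstract):

```python
def conv_out(size, kernel, stride):
    # Spatial output size of a valid convolution: (in - kernel) // stride + 1.
    return (size - kernel) // stride + 1

s4 = conv_out(227, 11, 4)   # stride 4, as in the original Alex-Net stream -> 55
s6 = conv_out(227, 11, 6)   # stride 6, as in the second PCCNN stream -> 37

# The stride-6 feature maps are 37x37 instead of 55x55, i.e. roughly 45% as
# many spatial positions per map, so the extra stream is comparatively cheap.
ratio = (s6 * s6) / (s4 * s4)
print(s4, s6, round(ratio, 2))
```

This is the sense in which widening the model with a coarser parallel stream grows the neuron and parameter count more slowly than stacking additional layers onto a single stream.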