Dual-view 3D ConvNets based industrial packing action recognition
2022, Vol. 27, No. 8, pp. 2368-2379
Received: 2021-02-18; Revised: 2021-04-19; Accepted: 2021-04-26; Published in print: 2022-08-16
DOI: 10.11834/jig.210064
Objective
Action recognition plays an increasingly important role in automated and intelligent modern manufacturing, but the complexity of real production environments makes it a challenging task. At present, methods that combine 3D convolutional networks with optical flow show good performance in action recognition, yet they still cannot cope well with occlusion of the human body, and the high computational cost of optical flow rules them out for real-time applications. To address the occlusion problem and the optical-flow cost problem in real industrial packing scenes, this paper proposes a packing action recognition method based on a dual-view 3D convolutional network.
Method
First, stacked residual frames (RF, i.e., difference images) are used as the model input to extract motion features better, replacing optical flow, which cannot be used in real-time scenes. The original RGB images and the residual frames are fed into two parallel 3D ResNeXt101 networks. Second, a dual-view structure is adopted to handle occlusion of the human body: 3D ResNeXt101 is extended into a dual-view model, and a view-pooling layer with learnable weights fuses the features of views captured from different angles; this dual-view 3D ResNeXt101 model then performs action recognition. Finally, to further improve the true negative rate (TNR) of the detection results, a denoising autoencoder and a two-class support vector machine (SVM) are added to the model.
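As an illustration of the residual-frame input (a minimal sketch; the clip length and the absence of any normalization are illustrative assumptions rather than the paper's exact settings), adjacent frames of a clip can simply be differenced along the temporal axis:

import torch

def stacked_residual_frames(clip: torch.Tensor) -> torch.Tensor:
    # clip: (C, T, H, W) tensor holding T consecutive RGB frames.
    # Each output slice is the difference of two adjacent frames, which
    # highlights moving regions and suppresses the static background.
    return clip[:, 1:] - clip[:, :-1]   # shape (C, T-1, H, W)

# Example: a 16-frame 112 x 112 RGB clip yields 15 stacked residual frames.
clip = torch.rand(3, 16, 112, 112)
print(stacked_residual_frames(clip).shape)   # torch.Size([3, 15, 112, 112])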
Result
Experiments were conducted on a packing scene in a real production environment and evaluated with two metrics, accuracy and true negative rate. The proposed method achieves a packing action recognition accuracy of 94.2% and a true negative rate of 98.9%. It was also evaluated on the public UCF101 (University of Central Florida) dataset with accuracy as the metric, where it reaches 97.9%. These results further verify the effectiveness and accuracy of the proposed method.
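For reference, the two evaluation metrics follow their standard confusion-matrix definitions, where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives:

\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{TNR} = \frac{TN}{TN + FP}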
Conclusion
The proposed human action recognition method effectively exploits action information from multiple views and combines traditional models with deep learning models, significantly improving both recognition accuracy and the true negative rate.
Objective
Action recognition is widely used in computer vision applications such as intelligent video surveillance, human-computer interaction, virtual reality, and medical image analysis. It plays an important role in automated and intelligent modern manufacturing, but the complexity of real manufacturing environments keeps it a challenging task. Much of the recent progress is driven by deep neural networks, especially 3D convolutional networks, which use 3D convolution to capture temporal information; with the added temporal dimension, they extract spatio-temporal video features better than 2D convolutional networks. Methods that fuse optical flow into 3D convolutional networks currently show good performance in action recognition, but they still cannot solve the problem of the human body being occluded, and the computational cost of optical flow is too high for real-time scenes. When action recognition is applied in production scenes, the product qualification rate must be guaranteed: as many unqualified products as possible should be screened out while keeping both the accuracy and the true negative rate (TNR) of the detection results high, yet existing action recognition methods rarely optimize the true negative rate. We therefore propose a packing action recognition method based on a dual-view 3D convolutional network.
Method
First, we extract motion features from stacked residual frames used as the model input, replacing optical flow, which is not available in real-time scenes. The original RGB images and the residual frames are fed into two parallel 3D ResNeXt101 networks, and a concatenation layer fuses the features extracted by the last convolution layer of the two networks.
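As a sketch of this two-stream fusion (the 3D ResNeXt101 trunks are not reproduced here, so rgb_backbone and rf_backbone are placeholders for modules returning last-convolution feature maps, and the pooling/classifier head is an illustrative assumption):

import torch
import torch.nn as nn

class TwoStreamFusion(nn.Module):
    # rgb_backbone / rf_backbone stand in for the two parallel 3D ResNeXt101
    # trunks; each maps a clip to a feature map of shape (N, C, T', H', W').
    def __init__(self, rgb_backbone: nn.Module, rf_backbone: nn.Module,
                 feat_channels: int, num_classes: int):
        super().__init__()
        self.rgb_backbone = rgb_backbone
        self.rf_backbone = rf_backbone
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.fc = nn.Linear(2 * feat_channels, num_classes)

    def forward(self, rgb_clip, rf_clip):
        f_rgb = self.rgb_backbone(rgb_clip)       # RGB-stream features
        f_rf = self.rf_backbone(rf_clip)          # residual-frame-stream features
        fused = torch.cat([f_rgb, f_rf], dim=1)   # channel-wise concatenation
        return self.fc(self.pool(fused).flatten(1))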
Next, we adopt a dual-view structure to deal with occlusion of the human body: 3D ResNeXt101 is extended into a dual-view model, a view-pooling layer with learnable weights fuses the features of the two views, and this dual-view 3D ResNeXt101 model is then used for action recognition. Finally, a denoising autoencoder and a two-class support vector machine (SVM) are added to the model to further improve the true negative rate (TNR) of the detection results: the features produced by the dual-view pooling are fed to the trained denoising autoencoder, which optimizes and reduces their dimensionality, and the two-class SVM then performs a secondary recognition.
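One plausible realization of a view-pooling layer with learnable weights (an assumption on our part; the paper's exact fusion formula is not reproduced here) is a convex combination of the per-view feature vectors whose mixing weights are trained together with the rest of the network:

import torch
import torch.nn as nn

class LearnableViewPooling(nn.Module):
    # Fuse features from several camera views with trainable weights.
    def __init__(self, num_views: int = 2):
        super().__init__()
        # One raw score per view; softmax keeps the fusion a convex mix.
        self.view_scores = nn.Parameter(torch.zeros(num_views))

    def forward(self, view_feats: torch.Tensor) -> torch.Tensor:
        # view_feats: (num_views, N, D), one feature vector per view and sample.
        w = torch.softmax(self.view_scores, dim=0)
        return (w.view(-1, 1, 1) * view_feats).sum(dim=0)   # (N, D)

During training the view weights can adapt to how informative each camera angle is, so a frequently occluded view contributes less to the fused feature.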
Result
We conducted experiments on a packing scenario and evaluated the method with two metrics, accuracy and true negative rate. The accuracy of our packing action recognition model is 94.2% and its true negative rate is 98.9%, improving on existing action recognition methods. The dual-view structure raises the accuracy from 91.1% to 95.8%, and the residual-frame module raises it from 88.2% to 95.8%. If the residual-frame module is replaced by an optical-flow module, the accuracy is 96.2%, essentially equivalent to the model that uses residual frames. With only the two-class SVM added to the model and no denoising autoencoder, the accuracy is just 91.5%. Thanks to the optimization and dimensionality reduction of the feature vectors by the denoising autoencoder, combining the denoising autoencoder with the two-class SVM yields an accuracy of 94.2% and the highest true negative rate of 98.9%: after they are added, the true negative rate increases from 95.7% to 98.9% while the accuracy decreases by 1.6%. We also evaluated the method on the public UCF101 (University of Central Florida) dataset, where our single-view model obtains an accuracy of 97.1%, the second highest among all compared methods, behind only the 98.0% of 3D ResNeXt101.
Conclusion
We use a dual-view 3D ResNeXt101 model for effective packing action recognition. To obtain richer features from the RGB images and the difference images, two parallel 3D ResNeXt101 networks learn spatio-temporal features, and dual-view feature fusion is performed with a learnable view-pooling layer. In addition, a stacked denoising autoencoder is trained to optimize and reduce the dimensionality of the features extracted by the dual-view 3D ResNeXt101 model, and a two-class SVM performs a secondary detection to improve the true negative rate. The proposed method recognizes the packing actions of workers accurately and achieves a high true negative rate (TNR) for the recognition results.
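A minimal sketch of this secondary-detection stage, assuming the fused dual-view features have already been extracted (the hidden width, noise level, and SVM kernel below are illustrative choices rather than the paper's settings):

import numpy as np
import torch
import torch.nn as nn
from sklearn.svm import SVC

class DenoisingAutoencoder(nn.Module):
    def __init__(self, in_dim: int, hid_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU())
        self.decoder = nn.Linear(hid_dim, in_dim)

    def forward(self, x):
        noisy = x + 0.1 * torch.randn_like(x)   # corrupt the input during training
        code = self.encoder(noisy)
        return self.decoder(code), code

# After the autoencoder is trained with a reconstruction loss (e.g. MSE against
# the clean features), its low-dimensional codes feed a two-class SVM that
# separates the target packing action from everything else.
def fit_secondary_classifier(dae: DenoisingAutoencoder,
                             feats: torch.Tensor,
                             labels: np.ndarray) -> SVC:
    with torch.no_grad():
        codes = dae.encoder(feats).cpu().numpy()   # optimized, reduced features
    svm = SVC(kernel="rbf")
    svm.fit(codes, labels)
    return svm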