Research on multimedia technology in 2017: memory-augmented media learning and creativity
2018, Vol. 23, No. 11: 1617-1634
Received: 2018-09-10; Revised: 2018-09-14; Published in print: 2018-10-30
DOI: 10.11834/jig.180558
Objective
Developing artificial intelligence by drawing on the working mechanisms of the brain is one of the important directions of current AI research. Attention and memory play important roles in human cognition and understanding. Since "end-to-end" deep learning has shown excellent performance in tasks such as recognition and classification, how to introduce attention mechanisms and external memory structures into deep learning models, so as to mine the information of interest in data and make effective use of auxiliary information, is a hot topic in current AI research.
Method
Focusing on mechanisms such as memory and attention, this paper introduces three representative works in this area: the neural Turing machine, memory networks, and the differentiable neural computer. On this basis, it further introduces three lines of research built on memory networks, namely memory-driven question answering, memory-driven movie question answering, and memory-driven creativity (text-to-image generation), and compares research progress on memory networks in China and abroad.
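All three architectures share one core operation: a differentiable, content-based read from an external memory matrix, which is what allows the whole model to be trained with gradient descent. A minimal numpy sketch of that read (toy memory and key; function names are illustrative, not taken from any of the papers):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def content_read(memory, key, beta=5.0):
    """Content-based addressing: cosine similarity between the query key
    and each memory row, sharpened by beta and normalized into read
    weights, then used for a soft (fully differentiable) weighted read."""
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    w = softmax(beta * sims)   # attention weights over memory slots
    return w @ memory          # soft read vector

M = np.array([[1.0, 0.0],      # slot 0
              [0.0, 1.0],      # slot 1
              [1.0, 1.0]])     # slot 2
r = content_read(M, np.array([1.0, 0.1]))  # key closest to slot 0
```

Because every step is smooth, the read weights (and, in the full models, the write weights) receive gradients, which is what lets these architectures learn where to look in memory end to end.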
Result
The survey shows that 1) introducing attention mechanisms and external memory structures into deep learning models is a current hotspot of AI research; 2) research on memory networks is flourishing in China and abroad, with the number of papers published at top machine learning and AI conferences climbing year by year; 3) this growth shows no sign of slowing: the yearly increase was 9 papers in 2015, 4 in 2016, 9 in 2017, and 14 in 2018; 4) memory-driven methods are highly general, and memory networks have been successfully applied to question answering, visual question answering, object detection, reinforcement learning, and text-to-image generation.
Conclusion
Data-driven machine learning has been successfully applied to natural language, multimedia, computer vision, and speech. Combining data-driven learning with knowledge guidance will be one of the future trends of artificial intelligence.
Objective
The human brain, which has evolved over millions of years, is perhaps the most complex and sophisticated machine in the world, carrying out all the intelligent activities of human beings, such as attention, learning, memory, intuition, insight, and decision making. The core of the human brain consists of billions of neurons and synapses. Each neuron receives information from other neurons through synapses, processes it, and passes the result on to further neurons through its own synapses. In this way, external sensory information (i.e., visual, auditory, olfactory, gustatory, and tactile) is analyzed and processed in the brain in a complex way to form perception and cognition. Attention and memory play an important role in the cognitive process of human understanding. Developing artificial intelligence based on the memory mechanisms of the brain is an advanced line of research. Given that "end-to-end" deep learning achieves excellent performance in tasks such as recognition and classification, introducing an attention mechanism and external memory into deep learning models, to mine information of interest in data and effectively use auxiliary information, is a popular research area in artificial intelligence.
Method
This report focuses on the external memory and attention mechanisms of the brain. First, three representative works, namely, the neural Turing machine, memory networks, and the differentiable neural computer, are introduced. The neural Turing machine is analogous to a Turing machine or the Von Neumann architecture but is differentiable end to end, allowing it to be trained efficiently with gradient descent. Memory networks reason with inference components combined with a long-term memory component and learn how to use these components jointly. The differentiable neural computer consists of a neural network that can read from and write to an external memory matrix, analogous to the random-access memory in a conventional computer. Second, several specific applications, such as a knowledge memory network for question answering, memory-driven movie question answering, and memory-driven creativity (text-to-image generation), are presented. For answering factoid questions, this report presents the temporality-enhanced knowledge memory network (TE-KMN), which encodes not only the content of questions and answers but also the temporal cues in a sequence of ordered sentences that gradually reveal the answer. Moreover, TE-KMN collaboratively uses external knowledge for a better understanding of a given question. For answering questions about movies, the layered memory network (LMN), which represents frame-level and clip-level movie content with a static word memory module and a dynamic subtitle memory module, respectively, is introduced. To generate images from their corresponding narrative sentences, this report presents the visual-memory creative adversarial network (vmCAN), which leverages an external visual knowledge memory in both multimodal fusion and image synthesis. Finally, research progress on memory networks in China and abroad is compared.
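The question-answering applications above all reduce to attending over memory slots with the question as the query. A single "hop" of an end-to-end memory network can be sketched with hypothetical bag-of-words facts (illustrative toy data only; the real TE-KMN and LMN add temporal cues and layered visual memories on top of this pattern):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy vocabulary and bag-of-words sentence embeddings (illustrative only).
vocab = {"mary": 0, "kitchen": 1, "garden": 2, "went": 3,
         "where": 4, "is": 5, "john": 6}

def embed(sentence):
    v = np.zeros(len(vocab))
    for word in sentence.split():
        v[vocab[word]] += 1.0
    return v

facts = ["mary went kitchen", "john went garden"]
memory = np.stack([embed(f) for f in facts])  # one memory slot per fact
question = embed("where is mary")

# Single hop: match the question against memory, then read out the
# attention-weighted sum of the matching facts.
p = softmax(memory @ question)   # attention over facts
o = p @ memory                   # soft read vector
features = o + question          # would feed an answer-prediction layer
```

Stacking several such hops lets the model chain facts together; the works surveyed here differ mainly in what the memory holds (knowledge entries, subtitles, visual features) and how it is written.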
Result
Research results show that 1) introducing an attention mechanism and an external memory structure into deep learning models is a current hotspot in artificial intelligence research. 2) Research on memory networks in China and abroad has intensified, and related papers at top machine learning and artificial intelligence conferences have been increasing annually. 3) Research on memory networks is gaining popularity: the number of papers published has grown every year without slowing, with yearly increases of 9, 4, 9, and 14 articles from 2015 to 2018, respectively. 4) Memory-driven methods are general; memory networks have been successfully used in areas such as question answering, visual question answering, object detection, reinforcement learning, and text-to-image generation.
Conclusion
This report points toward future work on media learning and creativity. The next generation of artificial intelligence should learn continually from data and experience and reason automatically. In the future, artificial intelligence should be integrated organically with human knowledge through methods such as attention mechanisms, memory networks, transfer learning, and reinforcement learning, so as to move from shallow computing to deep reasoning, from purely data-driven learning to data-driven learning combined with logic rules, and from vertical-domain intelligence to more general artificial intelligence.