MAP-Vis: a visualization scheme for spatio-temporal point big data based on the MAP model

谢冲, 关雪峰, 周炜轩, 吴华意(武汉大学测绘遥感信息工程国家重点实验室, 武汉 430079)

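Abstract
Objective Visual analysis is an important tool for big data mining, helping researchers quickly and intuitively understand the valuable information contained in big data. However, because big data are massive, spatio-temporal, and high-dimensional, their visualization suffers from high memory consumption, high rendering latency, and poor visual effects. To address these problems, this study takes massive spatio-temporal point data as an example, adopts a preprocessing-based visualization scheme, and designs and implements a highly scalable distributed visual analysis framework. Method Drawing on the tile pyramid model, we propose a multi-dimensional aggregation pyramid (MAP) model that extends the 2D spatial hierarchical aggregation of the tile pyramid to the time, space, and attribute dimensions, supporting hierarchical aggregation across all three simultaneously. A Spark cluster then serves as the parallel preprocessing tool and the HBase distributed database persistently stores the MAP model data, yielding an open-source distributed visualization framework (MAP-Vis). Result Using the New York taxi dataset as an example, experiments show that the framework supports interactive visualization linked across multiple scales and dimensions of time, space, and attributes, with highly scalable preprocessing and storage. Conclusion Supported by distributed processing, the system achieves sub-second query responses and good interactive visualization, demonstrating that MAP-Vis is an effective scheme for interactive big data visualization.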
Keywords
MAP-Vis: a spatio-temporal big data visualization method based on multi-dimensional aggregation

Xie Chong, Guan Xuefeng, Zhou Weixuan, Wu Huayi(State Key Laboratory of Information Engineering in Surveying Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China)

Abstract
Objective As data collection methods mature and diversify, data sources such as personal smart devices, floating-car GPS, the internet of things, and social media are becoming increasingly abundant, and the volume of data is growing explosively. Big data carry spatio-temporal information and high-dimensional features. Spatio-temporal features are attribute fields that record spatial position and time stamps; high-dimensional features mean that the data often contain other valuable attributes as well. Visual analysis is an important method for big data research because it helps researchers quickly and intuitively analyze and understand the value contained in the data. However, because of their massive volume, spatio-temporal correlation, and high dimensionality, big data pose many challenges to current visualization implementations, including large memory consumption, high rendering delay, and poor visual effects. Method In this study, we propose a generic multi-dimensional aggregation pyramid (MAP) model on the basis of the well-known 2D tile pyramid model. The MAP model supports the hierarchical aggregation of time, space, and attributes simultaneously and transforms the aggregated results into discrete key-value pairs for scalable storage and efficient retrieval. We then use a Spark cluster as the parallel preprocessing platform and the distributed HBase database as final storage for the generated MAP data. Finally, on top of the generated MAP datasets, we design and implement an open-source distributed visualization framework (MAP-Vis). Result The experiments use the open New York taxi data, which cover 30 months from January 2014 to June 2016. Each record contains trip-related information, including the location and time of the trip origin and destination, the trip duration, and the distance. The visualization interface, built on the MAP-Vis framework, is implemented with HTML, CSS, and JavaScript.
Leaflet and OpenStreetMap are used for road network display; the timeline and attribute histogram use the d3 library to support user interaction. Efficiency metrics are collected in three experiments to evaluate the MAP model and the MAP-Vis system: model validation, storage scalability, and system scalability. In the model validation experiment, as the size of the raw data increases, the response time curve remains flat rather than growing linearly; the values fluctuate slightly between 0.7 s and 1 s. This result indicates that the MAP model scales well with the size of spatio-temporal data sets, guarantees a sub-second response, and delivers a smooth interactive visualization experience. In the storage scalability experiment, as the number of storage nodes in the cluster increases, the overall response time decreases dramatically from 3.2 s to 0.9 s, and the parallel efficiency is improved by approximately 2.4 times. This improvement can be attributed to distributed storage: with more storage nodes, requests are less likely to concentrate on a single region, and the queueing time for access is reduced. Therefore, by increasing the number of HBase storage regions, the proposed framework enhances query efficiency, exploits the parallelism of the distributed cluster, and significantly improves the interactive visualization experience. In the system scalability experiment, the number of worker nodes in the Spark cluster is varied to measure how the pre-processing time changes (excluding the time taken to import the results into HBase). Increasing the number of nodes reduces the pre-processing time from 360 min to 160 min, improving efficiency by approximately 1.3 times. Thus, with more computation nodes, the Spark cluster distributes the pre-processing tasks across additional worker nodes and executor processes, significantly improving pre-processing efficiency.
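The abstract does not state how the "times" figures are defined. As a rough check, the sketch below computes two common measures from the reported timings: the plain speedup ratio and the relative improvement. The function names are illustrative and are not taken from MAP-Vis. Under the relative-improvement reading, the pre-processing result (360 min to 160 min) gives 1.25, close to the reported 1.3; the query result (3.2 s to 0.9 s) gives about 2.56 rather than the reported 2.4, so the authors may have used unrounded timings or a different definition.

```python
def speedup(t_before: float, t_after: float) -> float:
    """Plain speedup ratio: how many times faster the new run is."""
    return t_before / t_after


def relative_improvement(t_before: float, t_after: float) -> float:
    """Time saved, expressed as a multiple of the new running time."""
    return (t_before - t_after) / t_after


# Timings reported in the abstract (rounded values).
query_before_s, query_after_s = 3.2, 0.9   # storage scalability experiment
prep_before_min, prep_after_min = 360, 160  # system scalability experiment

query_speedup = speedup(query_before_s, query_after_s)
query_gain = relative_improvement(query_before_s, query_after_s)
prep_speedup = speedup(prep_before_min, prep_after_min)
prep_gain = relative_improvement(prep_before_min, prep_after_min)

print(f"query: speedup {query_speedup:.2f}, relative improvement {query_gain:.2f}")
print(f"pre-processing: speedup {prep_speedup:.2f}, relative improvement {prep_gain:.2f}")
```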
Conclusion Because of their large volume, spatio-temporal properties, high dimensionality, and other characteristics, spatio-temporal big data are difficult to visualize, with challenges such as large memory consumption, high rendering delay, and poor visual effects. To address these problems, we first propose a spatio-temporal big data organization model, the MAP, which integrates the tile pyramid model with key-value pairs. The MAP model takes the time dimension, the space dimension, and attribute information into account and aggregates all three hierarchically, making it suitable for rapid and efficient visualization of spatio-temporal big data. On the basis of the MAP model, an open-source visualization framework, MAP-Vis, is implemented on a Linux cluster. The MAP-Vis system uses Spark as the pre-processing tool and HBase as the distributed storage platform. Experiments validate the efficiency of the proposed MAP model, and the underlying distributed platforms provide high scalability for visualization and processing. With the cluster, MAP-Vis achieves sub-second data queries and good interactive visualization. Future work can proceed in the following directions. 1) The framework supports point data well, but it should be extended to other data types where possible, including lines, polygons, and images. 2) Visual display alone cannot fully reveal the patterns and value of big data; data analysis modules could therefore be added to make the MAP-Vis framework more complete.
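The idea of aggregating points hierarchically over space, time, and attributes into key-value pairs can be illustrated with a minimal single-machine sketch. The abstract does not specify the actual MAP key layout, the time levels, or the attribute binning used in MAP-Vis, so everything below is an assumption: the key tuple `(zoom, x, y, time_level, time_bin, attr_bin)`, the helpers `tile_of`, `time_bin`, and `aggregate`, and the use of standard Web Mercator tile indexing for the spatial pyramid. The real system runs this kind of aggregation on Spark and stores the results in HBase; this sketch uses an in-memory counter instead.

```python
import math
from collections import Counter
from datetime import datetime


def tile_of(lon: float, lat: float, zoom: int) -> tuple:
    """Standard Web Mercator (slippy-map) tile index of a point at a zoom level."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    # Clamp so points on the eastern or southern edge stay inside the grid.
    return min(x, n - 1), min(y, n - 1)


def time_bin(ts: datetime, level: str) -> str:
    """Truncate a timestamp to a coarser level (hypothetical level names)."""
    formats = {"month": "%Y-%m", "day": "%Y-%m-%d", "hour": "%Y-%m-%dT%H"}
    return ts.strftime(formats[level])


def aggregate(records, zooms, time_levels, attr_width):
    """Count points per (zoom, x, y, time_level, time_bin, attr_bin) key.

    Each record is (lon, lat, timestamp, attribute_value). One record
    contributes to one key per combination of zoom level and time level.
    """
    counts = Counter()
    for lon, lat, ts, attr in records:
        attr_bin = int(attr // attr_width)  # fixed-width attribute histogram bin
        for zoom in zooms:
            x, y = tile_of(lon, lat, zoom)
            for level in time_levels:
                counts[(zoom, x, y, level, time_bin(ts, level), attr_bin)] += 1
    return counts


# Made-up sample records: (lon, lat, pick-up time, trip distance).
records = [
    (-73.98, 40.75, datetime(2015, 1, 1, 8, 5), 2.5),
    (-73.98, 40.75, datetime(2015, 1, 1, 8, 40), 2.9),
    (-73.98, 40.75, datetime(2015, 1, 2, 8, 0), 7.1),
]
counts = aggregate(records, zooms=[0], time_levels=["day"], attr_width=1.0)
for key, value in sorted(counts.items()):
    print(key, value)
```

Because every aggregate is addressed by a discrete key, a viewport query reduces to looking up the keys for the visible tiles, the selected time range, and the selected attribute bins, which is the property that lets a key-value store answer such queries without scanning raw points.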
Keywords
