An improved K-Means-Boosting-BP model for unbalanced police intelligence data

Li Weihong1, Tong Haoxin2 (1. South China Normal University, Guangzhou 510631, China; 2. Guangdong Finest Planning Information Technology Co., Ltd., Guangzhou 510665, China)

Abstract
Objective Identifying the spatiotemporal distribution patterns of police incidents, building spatiotemporal prediction models with machine learning algorithms, and formulating scientific policing prevention and control schemes that effectively suppress crime are key tasks of crime geography research. Existing research shows that police incidents are mostly concentrated in central urban areas or densely populated districts, so the data are unbalanced in space and time; this imbalance usually turns models trained on such data into weak learners with low prediction accuracy. To solve this unbalanced-data regression problem, a boosting algorithm based on K-means clustering is proposed. Method Building on the boosting ensemble learning algorithm, the method applies a GA-BP neural network to generate base classifiers and integrates them with the K-means clustering algorithm, thereby promoting weak learners into a strong learner. Result Compared with the Synthetic Minority Oversampling Technique Boosting (SMOTEBoosting) algorithm commonly used for unbalanced-data regression, the proposed algorithm has two advantages: 1) it reduces the mean squared error of the minority class while also reducing the overall mean squared error; SMOTEBoosting's overall mean squared error is 2.14E-04, whereas that of the K-Means-Boosting algorithm reaches 9.85E-05; 2) it better balances the precision and recall of minority-class recognition; the recall of K-Means-Boosting is approximately 52% and that of SMOTEBoosting approximately 91%, but the precision of K-Means-Boosting is 85%, far higher than SMOTEBoosting's 19%. Conclusion The K-Means-Boosting algorithm significantly reduces the overall mean squared error of unbalanced data and improves the precision and recall of minority-class recognition; it is an effective algorithm for unbalanced-data regression and classification problems and can be extended to other fields that must handle unbalanced data.
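The minority-class precision and recall figures quoted in the abstract follow the standard confusion-matrix definitions. The sketch below uses toy data (not the paper's experiment) purely to illustrate the calculation:

```python
# Illustrative sketch (toy data, not the paper's experiment): how the
# minority-class precision and recall quoted in the abstract are defined.
# Label 1 = minority class (e.g., a hot-spot grid cell), 0 = majority class.

def minority_precision_recall(y_true, y_pred, minority=1):
    """Return (precision, recall) for the minority class."""
    tp = sum(t == minority and p == minority for t, p in zip(y_true, y_pred))
    fp = sum(t != minority and p == minority for t, p in zip(y_true, y_pred))
    fn = sum(t == minority and p != minority for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Eight grid cells, three of them true hot spots; two found, one false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
p, r = minority_precision_recall(y_true, y_pred)
# Here p == r == 2/3; the paper's algorithm trades some recall for much
# higher precision relative to SMOTEBoosting.
```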
Keywords
Improved K-Means-Boosting BP model that targets unbalanced police intelligence data

Li Weihong1, Tong Haoxin2(1.South China Normal University, Guangzhou 510631, China;2.Guangdong Finest Planning Information Technology Co., Ltd., Guangzhou 510665, China)

Abstract
Objective Crime geography research focuses on identifying the spatiotemporal distribution patterns of crime, establishing spatiotemporal crime-forecasting models through machine learning, formulating efficient police prevention and control protocols, and thereby effectively preventing the occurrence of crimes. Existing research has shown that crime data are significantly unbalanced in space and time: most crimes are concentrated in central urban areas or densely inhabited districts. Unbalanced data typically lead to a bias toward the majority class, so a classifier displays a poor recognition rate for the minority class, even though minority-class areas are usually the hot spots where crimes frequently occur. Accordingly, models trained on crime data, such as a genetic algorithm (GA)-back propagation (BP) neural network, become weak learners, which makes the desired prediction accuracy difficult to achieve.

Method This study presents a novel algorithm based on a boosting ensemble learning algorithm to solve the aforementioned problem. The boosting ensemble learning algorithm utilizes more than one predictor for decision making and thus provides several advantages: 1) the design of a classifier ensemble aims to create a set of complementary, diverse classifiers and to apply an appropriate fusion method to merge their decisions; 2) the ensemble may exhibit improved performance compared with a standard single-classifier approach because it can apply the unique strengths of each individual classifier in the pool; and 3) ensembles may be robust and less prone to overfitting because they adopt mutually complementary models with different strengths. At the same time, a number of issues have to be considered when using an ensemble learning algorithm: 1) how to select a pool of diverse and mutually complementary individual classifiers; 2) how to design the interconnections among the classifiers in the ensemble, i.e., how to determine the ensemble topology; and 3) how to conduct the fusion step that controls the degree of influence of each classifier on the final decision. In consideration of these issues, the new algorithm utilizes the GA-BP neural network to build base classifiers and K-means clustering to integrate them, thereby realizing the objective of converting weak learners into strong learners. The fusion step through K-means clustering rests on the observation that the performance of a base classifier at one data point is frequently related to its performance at the surrounding data points, which allows the spatial relations among all the base classifiers to be fully considered. The proposed algorithm has two key steps: 1) training data are resampled using a boosting-by-reweighting method that trains a designated number of base classifiers with the GA-BP neural network algorithm and then stores all classifiers in a base classifier pool; 2) sample data are partitioned into several clusters using the K-means algorithm, and for each cluster the base classifier with the highest forecasting accuracy is dynamically selected to predict all data points that belong to that cluster.

Result Experimental results demonstrate that the new algorithm has two advantages over the Synthetic Minority Oversampling Technique Boosting (SMOTEBoosting) algorithm, which is widely used to solve the regression and classification of unbalanced data: 1) the proposed algorithm reduces both the mean squared error of the minority class and the overall mean squared error; the overall mean squared error of the K-means-boosting algorithm is 9.85E-05, which outperforms the SMOTEBoosting algorithm's 2.14E-04; 2) the K-means-boosting algorithm maintains the balance between minority precision and recall better than the SMOTEBoosting algorithm; the minority recall of the K-means-boosting algorithm is approximately 52%, whereas that of the SMOTEBoosting algorithm is approximately 91%, but the minority precision of the K-means-boosting algorithm reaches 85%, far higher than that of the SMOTEBoosting algorithm (19%). The K-means-boosting algorithm also outperforms the AdaBoost algorithm, a classical ensemble learning algorithm, for two reasons: 1) it reduces both the mean squared error of the minority class in unbalanced data and the overall mean squared error compared with AdaBoost; 2) it improves minority recall and precision. The experiments further indicate that the number of clusters plays an important role in the algorithm: the overall mean squared error and the mean squared error of the minority class decrease as the number of clusters increases, but the rate of decline diminishes gradually and eventually approaches a limit, whereas the computational cost of classifier integration continues to grow. No consensus rule currently exists for determining the number of clusters, and this variable can only be set manually based on the data.

Conclusion This study proposes a boosting ensemble learning algorithm that integrates base classifiers using the K-means clustering algorithm to address the problem of unbalanced-data regression. The performance of the algorithm on the spatiotemporal prediction of police intelligence data shows that the method can handle the prediction of grid-based spatiotemporal data. Compared with traditional boosting algorithms that integrate base classifiers by weighted averaging, the proposed method significantly reduces the overall mean squared error and the mean squared error of the minority classes. Similarly, compared with the SMOTEBoosting algorithm, which is commonly used for unbalanced-data regression, the proposed method not only reduces the overall mean squared error of the sample data while reducing the mean squared error of the minority classes, but also maintains the balance between minority precision and recall. The algorithm can be extended to other areas where unbalanced-data regression or classification problems must be addressed.
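As a rough illustration of the two-step scheme described in the abstract (reweighted training of a base-classifier pool, then per-cluster selection of the best classifier), here is a minimal self-contained sketch. The GA-BP base learners are replaced by weighted linear models and K-means is hand-rolled, so all functions, data, and parameters are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy unbalanced regression data: most targets are small, a minority are large
# (standing in for hot-spot grid cells with high incident counts).
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(X[:, 0] > 0.6, 5.0, 0.1) + rng.normal(0, 0.05, 200)

# Step 1: boosting-by-reweighting builds a pool of base learners.
# The paper uses GA-BP neural networks; weighted linear models stand in here.
def fit_weighted_linear(X, y, w):
    A = np.hstack([X, np.ones((len(X), 1))])       # add a bias column
    AtW = A.T * w                                  # weighted normal equations
    coef = np.linalg.solve(AtW @ A + 1e-6 * np.eye(3), AtW @ y)
    return lambda Xq: np.hstack([Xq, np.ones((len(Xq), 1))]) @ coef

weights = np.full(len(X), 1.0 / len(X))
pool = []
for _ in range(5):
    model = fit_weighted_linear(X, y, weights)
    err = np.abs(model(X) - y)
    weights *= np.exp(err / (err.max() + 1e-12))   # up-weight badly fit points
    weights /= weights.sum()
    pool.append(model)

# Step 2: cluster the samples with K-means, then dynamically pick the
# lowest-MSE base learner for each cluster.
def kmeans(X, k, iters=50):
    centers = X[rng.choice(len(X), k, replace=False)].copy()
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
    return labels, centers

k = 4
labels, centers = kmeans(X, k)
best = {}
for j in range(k):
    idx = labels == j
    if idx.any():
        best[j] = min(pool, key=lambda m: np.mean((m(X[idx]) - y[idx]) ** 2))

# Prediction routes each query point to the best learner of its cluster.
def predict(Xq):
    lab = np.argmin(((Xq[:, None, :] - centers) ** 2).sum(-1), axis=1)
    return np.array([best[l](q[None, :])[0] for l, q in zip(lab, Xq)])

mse = np.mean((predict(X) - y) ** 2)
```

The per-cluster selection is the point of the design: a learner that is weak globally may still be the most accurate one inside a given spatial cluster, which is how the ensemble of weak learners acts as a strong learner.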
Keywords
