Title: Efficient Snapshot KNN Join Processing for Large Data Using MapReduce
Authors: Hu, Yupeng; Yang, Chong; Ji, Cun; Xu, Yang; Li, Xueqing
Affiliations: [Hu, Yupeng; Yang, Chong; Ji, Cun; Xu, Yang; Li, Xueqing] Shandong Univ, Sch Comp Sci & Technol, Jinan, Peoples R China.; [Hu, Yupeng] Nanjing Univ, …
Conference: 22nd IEEE International Conference on Parallel and Distributed Systems (ICPADS)
Conference Dates: DEC 13-16, 2016
Source: 2016 IEEE 22ND INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED SYSTEMS (ICPADS)
Keywords: KNN Join; Data Streams; MapReduce; Large Data
Abstract: The kNN join problem, denoted R ⋉_kNN S, is to find the k nearest neighbors in a given dataset S for each point in a query set R. It is an operation required by many big data applications. As large volumes of data are continuously generated in more and more real-life cases, we address the problem of monitoring kNN join results over data streams. Specifically, we are concerned with answering the kNN join periodically at each snapshot, which we call the snapshot kNN join. Existing kNN join solutions mainly target static datasets or a single centralized machine, and are difficult to scale to large data on streams. In this paper, we propose to incrementally compute the kNN join results at time t_i from the results at snapshot t_{i-1}. For data continuously generated on the stream, the valid datasets of adjacent snapshots satisfy S_i = S_{i-1} + ΔS_i, where ΔS_i denotes the points updated between t_{i-1} and t_i. Our basic idea is to first find the queries in R whose kNN results can be affected by the updated points in ΔS_i, and then update the kNN results of only this small subset of queries. In this way, we avoid recomputing the kNN join over the whole dataset S_i at time t_i. We propose a MapReduce implementation of the search for affected query points to scale to large volumes of data: the mappers partition the datasets into groups, and the reducers search for affected queries separately within each group. Furthermore, we present enhanced data partitioning and grouping strategies to reduce the shuffling and computational costs. Extensive experiments on real-world datasets demonstrate that our proposed methods are efficient, robust, and scalable.
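The incremental idea sketched in the abstract can be illustrated on a single machine: a query r is affected by ΔS_i only if some new point lies closer to r than its current k-th nearest neighbor, so only those queries need their results merged with the new points. The sketch below is my own single-machine illustration under that reading (function names and the brute-force search are assumptions, not the paper's MapReduce implementation):

```python
# Single-machine sketch of incremental snapshot kNN join.
# Names (snapshot_join, incremental_update) are illustrative, not from the paper.
import heapq
import math

def dist(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn(r, S, k):
    """Brute-force kNN of query point r over dataset S, as (distance, point) pairs."""
    return heapq.nsmallest(k, ((dist(r, s), s) for s in S))

def snapshot_join(R, S, k):
    """Full kNN join at the first snapshot: kNN over S for every query in R."""
    return {i: knn(r, S, k) for i, r in enumerate(R)}

def incremental_update(R, results, delta_S, k):
    """Derive the t_i results from the t_{i-1} results using only ΔS_i.

    Only queries whose current k-th NN distance exceeds the distance to
    some new point are affected; their old kNN lists are merged with the
    closer new points, and all other queries are left untouched.
    """
    for i, r in enumerate(R):
        kth_dist = results[i][-1][0]  # distance to the current k-th NN
        closer = [(dist(r, s), s) for s in delta_S if dist(r, s) < kth_dist]
        if closer:  # affected query: merge and keep the k best
            results[i] = heapq.nsmallest(k, results[i] + closer)
    return results
```

For example, with R = [(0,0), (5,5)], S holding points along the x-axis plus (6,5), and ΔS_i = [(0.5, 0)], only the query at the origin is affected; the query at (5,5) keeps its previous result without any recomputation over S_i.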