Itemset Data Mining Parallelization and Distribution Using the MapReduce Approach
Keywords:
Frequent itemsets, frequent items ultrametric tree (FIU-tree), Hadoop cluster, load balance, MapReduce
Abstract
Existing parallel algorithms for mining frequent itemsets lack a mechanism that provides
automatic parallelization, load balancing, data distribution, and fault tolerance on large clusters. To address this
problem, we develop a parallel frequent-itemset mining algorithm called FiDoop using the MapReduce programming
model. To achieve compressed storage and to avoid building conditional pattern bases, FiDoop incorporates the
frequent items ultrametric tree (FIU-tree) rather than conventional FP-trees. In FiDoop, three MapReduce jobs are
implemented to complete the mining task. In the crucial third MapReduce job, the mappers independently
decompose itemsets, the reducers perform combination operations by constructing small ultrametric trees, and
these trees are then mined separately. We implement FiDoop on our in-house Hadoop cluster. We show that FiDoop
on the cluster is sensitive to data distribution and dimensions, because itemsets of different lengths have
different decomposition and construction costs. To improve FiDoop's performance, we develop a workload-balance
metric to measure load balance across the cluster's computing nodes. We also develop FiDoop-HD, an extension of
FiDoop, to speed up mining performance for high-dimensional data analysis.
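The decompose-then-combine pattern of the third MapReduce job described above can be sketched in a few lines. This is a simplified, single-process illustration of the mapper and reducer roles only, not the paper's Hadoop implementation; the function names (`mapper`, `reducer`, `run_job`) and the toy data are our own, and the grouping loop stands in for Hadoop's shuffle phase.

```python
from itertools import combinations
from collections import defaultdict

def mapper(itemset):
    """Decompose a k-itemset into its (k-1)-item subsets,
    emitting (subset, source-itemset) pairs."""
    k = len(itemset)
    for sub in combinations(sorted(itemset), k - 1):
        yield sub, tuple(sorted(itemset))

def reducer(key, values):
    """Combine all itemsets that share the same (k-1)-subset;
    in FiDoop such groups seed the small ultrametric trees
    that are then mined locally on each node."""
    return key, sorted(set(values))

def run_job(frequent_itemsets):
    # Stand-in for Hadoop's shuffle: group mapper output by key.
    grouped = defaultdict(list)
    for itemset in frequent_itemsets:
        for key, value in mapper(itemset):
            grouped[key].append(value)
    return dict(reducer(k, v) for k, v in grouped.items())

# Two frequent 3-itemsets sharing the subset ('a', 'b'):
result = run_job([("a", "b", "c"), ("a", "b", "d")])
```

Because decomposition cost grows with itemset length, longer itemsets generate more mapper output; this is the length-sensitivity that motivates the workload-balance metric mentioned in the abstract.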