Tuesday, April 11, 2017

Get Top, Bottom N Elements by Key

Given a set of (key-as-string, value-as-integer) pairs, say we want to create a
top N (where N > 0) list. Top N is a design pattern. For example, if key-as-string is a URL
and value-as-integer is the number of times that URL is visited, then you might
ask: what are the top 10 URLs for last week? This kind of question is common for
these types of key-value pairs. Finding a top 10 list is categorized as a filtering pattern
(i.e., you filter out data and find the top 10 list).

Approach 1 -
val inputRDD = sc.parallelize(Array(("A", 5), ("A",10), ("A", 6), ("B", 67), ("B", 78), ("B", 7)))
val resultsRDD = inputRDD.groupByKey.map{case (key, numbers) => key -> numbers.toList.sortBy(x => x).take(2)}
resultsRDD .collect()

Approach 2 -
val inputRDD  = sc.parallelize(Array(("A", 5), ("A",10), ("A", 6), ("B", 67), ("B", 78), ("B", 7)))
inputRDD .groupByKey.mapValues(number => number.toList.sortBy(x => x).take(2)).collect()

We can add negation at x => -x for bottom N elements.

Apache Spark – Catalyst Optimizer

Optimizer is the one that automatically finds out the most efficient plan to execute data operations specified in the user’s program. In...