Given a set of (key-as-string, value-as-integer) pairs, suppose we want to create a
top N list (where N > 0). Top N is a common design pattern. For example, if the key is a URL
and the value is the number of times that URL was visited, you might
ask: what were the top 10 URLs last week? Questions like this come up often with
this kind of key-value data. Finding a top N list is classified as a filtering pattern
(i.e., you filter out most of the data and keep only the top N entries).
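As a small illustration of the URL question, here is a minimal sketch in plain Scala collections (no Spark required). The sample URLs, the counts, and the helper name topN are made up for the example, and it assumes each URL appears only once in the input.

```scala
// Hypothetical sample data: (URL, visit count) pairs, one pair per URL.
val visits: Seq[(String, Int)] = Seq(
  ("example.com/a", 120),
  ("example.com/b", 45),
  ("example.com/c", 300),
  ("example.com/d", 7)
)

// Hypothetical helper: return the n pairs with the largest counts, largest first.
def topN(pairs: Seq[(String, Int)], n: Int): Seq[(String, Int)] = {
  require(n > 0, "n must be positive")
  // Sort by count in descending order, then keep the first n entries.
  pairs.sortBy(_._2)(Ordering[Int].reverse).take(n)
}

val top2 = topN(visits, 2)
```

On an RDD of such pairs, the built-in counterpart would be something like rdd.top(10)(Ordering.by[(String, Int), Int](_._2)), which returns the 10 pairs with the largest values without sorting the whole dataset on the driver.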
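Approach 1 -
// Sample data: several integer values per key.
val inputRDD = sc.parallelize(Array(("A", 5), ("A", 10), ("A", 6), ("B", 67), ("B", 78), ("B", 7)))
// Group the values by key, sort each group in descending order, and keep the first 2.
val resultsRDD = inputRDD.groupByKey().map { case (key, numbers) => key -> numbers.toList.sortBy(x => -x).take(2) }
resultsRDD.collect()  // A -> List(10, 6), B -> List(78, 67); key order may vary
Approach 2 -
val inputRDD = sc.parallelize(Array(("A", 5), ("A", 10), ("A", 6), ("B", 67), ("B", 78), ("B", 7)))
// Same idea with mapValues, which transforms only the values and leaves the keys untouched.
inputRDD.groupByKey().mapValues(numbers => numbers.toList.sortBy(x => -x).take(2)).collect()
Both approaches return the top 2 values per key. The negation in x => -x makes the sort descending; use x => x instead to get the bottom N elements.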
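Both approaches rely on groupByKey, which gathers every value for a key before sorting. When a key has many values, one possible alternative is to keep only a running top N list per key. The sketch below shows that idea on plain Scala collections; the helper names addValue and mergeTops are hypothetical, and the local groupBy/foldLeft stands in for what Spark would do across partitions.

```scala
// Hypothetical helper: add one value to a running top-n list (kept in descending order).
def addValue(n: Int): (List[Int], Int) => List[Int] =
  (acc, v) => (v :: acc).sorted(Ordering[Int].reverse).take(n)

// Hypothetical helper: merge two partial top-n lists into a single top-n list.
def mergeTops(n: Int): (List[Int], List[Int]) => List[Int] =
  (a, b) => (a ++ b).sorted(Ordering[Int].reverse).take(n)

val data = Seq(("A", 5), ("A", 10), ("A", 6), ("B", 67), ("B", 78), ("B", 7))

// Local stand-in for a per-key aggregation: fold each key's values through addValue,
// so the accumulator never holds more than n values at a time.
val top2PerKey: Map[String, List[Int]] =
  data.groupBy(_._1).map { case (key, pairs) =>
    key -> pairs.map(_._2).foldLeft(List.empty[Int])(addValue(2))
  }
```

In Spark, the same two functions should be usable as the seqOp and combOp of aggregateByKey, for example inputRDD.aggregateByKey(List.empty[Int])(addValue(2), mergeTops(2)), so that each partition ships at most N values per key instead of the full group.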