Apache Spark - Word Count Example (using RDDs)

This is a simple PySpark script that counts how many times each word occurs in a given text file. The script is not polished or production-ready. For example, a word like "don't" is treated as two words, "don" and "t". For a more professional version of this code we might want to use a specialized Python NLP library such as NLTK, which should be able to extract the words from the text file in a better way. Still, this script is quite useful for illustration purposes.
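The "don't" limitation is easy to see in plain Python, without Spark. The sketch below applies the same `\W+` split the script uses (dropping empty strings for readability) and compares it with one possible alternative. The alternative pattern and both helper names are assumptions made for this illustration, not part of the original script, and they are no substitute for a real tokenizer such as the ones NLTK provides.

```python
import re

def split_on_non_word(text_line):
    # Same rule as the script: break the line on runs of non-word characters.
    # Empty strings produced by leading/trailing punctuation are dropped here.
    return [w for w in re.split(r'\W+', text_line.lower()) if w]

# Hypothetical alternative: match words directly and allow apostrophes
# inside a word, so that "don't" stays in one piece.
WORD_WITH_APOSTROPHE = re.compile(r"[a-z0-9]+(?:'[a-z0-9]+)*")

def find_words_keeping_apostrophes(text_line):
    return WORD_WITH_APOSTROPHE.findall(text_line.lower())

sample = "Don't panic, it's only a test."
print(split_on_non_word(sample))
# ['don', 't', 'panic', 'it', 's', 'only', 'a', 'test']
print(find_words_keeping_apostrophes(sample))
# ["don't", 'panic', "it's", 'only', 'a', 'test']
```

The second pattern only handles ASCII letters and digits, so it is a narrower rule than `\W+` in other respects. It is meant to show the idea, not to replace the function used in the script.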


import re

from pyspark import SparkConf, SparkContext



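def get_all_words_in_lowercase(textLine):
    # >>> The input value is a line of text from the book
    # >>> The returned value is a list of all the words from that line in lowercase
    # Splitting on non-word characters produces empty strings for blank lines
    # and for lines that start or end with punctuation, so we drop them.
    return [word for word in re.split(r'\W+', textLine.lower()) if word]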



conf = SparkConf().setMaster("local").setAppName("WordCount")
sc = SparkContext(conf=conf)
# We run Spark locally (on this machine) under the application name "WordCount".



lines = sc.textFile("file:///D:/PlayGround/book.txt")
# Now in lines we have all lines from the text file. Each element of the RDD is a line of text.
# (The name lines is used instead of input so that Python's built-in input() is not shadowed.)

rdd1 = lines.flatMap(get_all_words_in_lowercase)
# flatMap maps each line to the list of words in that line and flattens the results.
# OK, now in rdd1 we have all words from the text file, one word per element.



rdd2 = rdd1.map(lambda word: (word, 1))
# We map each word to the pair (word, 1).
# This way we produce rdd2.



rdd3 = rdd2.reduceByKey(lambda x, y: x + y)
# We now count how many times each word occurs in rdd2
# by adding up the 1s that share the same key (word).
# This way we produce rdd3, which contains
# pairs of the form (word, word_count).



rdd4 = rdd3.map(lambda couple: (couple[1], couple[0]))
# We now swap the roles of key and value.
# This way from rdd3 we get rdd4, which contains
# pairs of the form (word_count, word).



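rdd5 = rdd4.sortByKey(ascending=True)
# We sort by word_count in ascending order, so the most frequent words come last.

lst5 = rdd5.collect()
# collect() brings all the (word_count, word) pairs back to the driver as a Python list.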


print(f"Number of unique words ---> {len(lst5)}")

for count, word in lst5:
    print(f"{word} ---> {count} occurrences")
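To make the individual steps easier to follow, here is a rough plain-Python sketch of the same pipeline, run on a few in-memory lines instead of an RDD. The sample lines and variable names are invented for this illustration. It only mirrors the logic of flatMap, map, reduceByKey and sortByKey; it says nothing about how Spark distributes the work.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for the lines of the text file.
lines = ["the cat sat", "the cat ran", "a dog"]

# flatMap: turn every line into a list of words and flatten into one list.
words = [w for line in lines for w in re.split(r'\W+', line.lower()) if w]

# map: turn every word into the pair (word, 1).
pairs = [(w, 1) for w in words]

# reduceByKey: add up the values of all pairs that share the same key.
counts = defaultdict(int)
for word, one in pairs:
    counts[word] += one

# map + sortByKey: swap to (word_count, word) and sort ascending by count.
by_count = sorted(((c, w) for w, c in counts.items()), key=lambda p: p[0])

print("Number of unique words ---> " + str(len(by_count)))
for count, word in by_count:
    print(word + " ---> " + str(count) + " occurrences")
```

Here "the" and "cat" end up with a count of 2 and the other four words with a count of 1. In Spark the relative order of words that have the same count is not something to rely on.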



