Distributed computing frameworks

Distributed computing

1. Hadoop

  • It is a set of tools that supports distributed map- and reduce-style programming through Hadoop MapReduce.
  • It is for batch processing of big datasets.
  • It uses the distributed file system for storing big data and MapReduce to process it.
  • It excels at storing and processing huge volumes of data in arbitrary formats, whether structured, semi-structured, or unstructured.

mrjob library

  • It is used to write Hadoop jobs in Python; a minimal example follows this list.
  • Prerequisites:
    • pip install mrjob
    • brew install hadoop
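
A minimal word-count sketch, assuming the mrjob package is installed; the class name, file name, and input file below are illustrative:

from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, _, line):
        # Emit (word, 1) for every word in the input line.
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        # Sum the counts emitted for the same word.
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()

Run it locally with python word_count.py input.txt, or submit it to a Hadoop cluster with the -r hadoop runner.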

2. Spark

  • It is an analytics engine designed to modernize Hadoop-style processing, largely by keeping intermediate data in memory instead of writing it to disk between stages.
  • We will apply Spark in analytics and machine learning use cases; a minimal example follows this list.
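
A minimal PySpark word-count sketch, assuming pyspark is installed; the application name and input path are illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word_count").getOrCreate()

# Read a text file into an RDD of lines, then count word frequencies.
lines = spark.sparkContext.textFile("lines.txt")
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

print(counts.collect())
spark.stop()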

MapReduce

  • Map transforms the data into key/value pairs.
  • Reduce takes the output of the map phase as its input and aggregates the values for each key.
  • Hadoop MapReduce itself is written in Java, but jobs can also be written in Python, C++, and other languages (for example, via Hadoop Streaming).
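
The plain-Python program below mimics the two phases locally: map_frequency plays the mapper role by producing per-line word counts, and functools.reduce with merge_dictionaries plays the reducer role by combining them.
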
import functools
from typing import Dict

def map_frequency(text: str) -> Dict[str, int]:
    """Map phase: count how often each word appears in a single line."""
    frequencies: Dict[str, int] = {}
    for word in text.split(' '):
        if word in frequencies:
            frequencies[word] += 1
        else:
            frequencies[word] = 1
    return frequencies

def merge_dictionaries(first: Dict[str, int], second: Dict[str, int]) -> Dict[str, int]:
    """Reduce phase: combine two partial word counts into one."""
    merged = dict(first)  # copy so the input dictionaries are not mutated
    for key in second:
        if key in merged:
            merged[key] += second[key]
        else:
            merged[key] = second[key]
    return merged

if __name__ == "__main__":
    lines = ["I know what I know",
             "I know that I know",
             "I don't know much",
             "They don't know much"]

    # Map: one partial frequency dictionary per line.
    mapped_results = [map_frequency(line) for line in lines]

    for result in mapped_results:
        print(result)

    # Reduce: merge all partial dictionaries into a single word count.
    print(functools.reduce(merge_dictionaries, mapped_results))
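
Running this prints one partial dictionary per line, followed by the fully merged count: {'I': 5, 'know': 6, 'what': 1, 'that': 1, "don't": 2, 'much': 2, 'They': 1}.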