Distributed computing frameworks

Distributed computing

1. Hadoop

  • It is a set of tools that supports distributed map- and reduce-style programming through Hadoop MapReduce.
  • It is for batch processing of big datasets.
  • It uses the distributed file system for storing big data and MapReduce to process it.
  • It excels at storing and processing huge volumes of data in arbitrary formats, whether structured, semi-structured, or unstructured.

mrjob library

  • It is used to write Hadoop jobs in Python; a minimal example follows this list.
  • Prerequisites:
    • pip install mrjob
    • brew install hadoop
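
A minimal word-count sketch, assuming the mrjob package is installed; the class name, file name, and input file below are illustrative:

from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, _, line):
        # Emit (word, 1) for every word in the input line.
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        # Sum the counts emitted for the same word.
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()

Run it locally with python word_count.py input.txt, or submit it to a Hadoop cluster with the -r hadoop runner.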

2. Spark

  • It is an analytics engine designed to modernize Hadoop-style processing, largely by keeping intermediate data in memory instead of writing it to disk between stages.
  • We will apply Spark in analytics and machine learning use cases; a minimal example follows this list.
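
A minimal PySpark word-count sketch, assuming pyspark is installed; the application name and input path are illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word_count").getOrCreate()

# Read a text file into an RDD of lines, then count word frequencies.
lines = spark.sparkContext.textFile("lines.txt")
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

print(counts.collect())
spark.stop()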

MapReduce

  • Map transforms the data into key/value pairs.
  • Reduce takes the output of the map phase as its input and aggregates the values for each key.
  • Hadoop MapReduce itself is written in Java, but jobs can also be written in Python, C++, and other languages (for example, via Hadoop Streaming).
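
The plain-Python program below mimics the two phases locally: map_frequency plays the mapper role by producing per-line word counts, and functools.reduce with merge_dictionaries plays the reducer role by combining them.
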
import functools
from typing import Dict

def map_frequency(text: str) -> Dict[str, int]:
    """Map phase: count how often each word appears in a single line."""
    frequencies: Dict[str, int] = {}
    for word in text.split(' '):
        if word in frequencies:
            frequencies[word] += 1
        else:
            frequencies[word] = 1
    return frequencies

def merge_dictionaries(first: Dict[str, int], second: Dict[str, int]) -> Dict[str, int]:
    """Reduce phase: combine two partial word counts into one."""
    merged = dict(first)  # copy so the input dictionaries are not mutated
    for key in second:
        if key in merged:
            merged[key] += second[key]
        else:
            merged[key] = second[key]
    return merged

if __name__ == "__main__":
    lines = ["I know what I know",
             "I know that I know",
             "I don't know much",
             "They don't know much"]

    # Map: one partial frequency dictionary per line.
    mapped_results = [map_frequency(line) for line in lines]

    for result in mapped_results:
        print(result)

    # Reduce: merge all partial dictionaries into a single word count.
    print(functools.reduce(merge_dictionaries, mapped_results))
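
Running this prints one partial dictionary per line, followed by the fully merged count: {'I': 5, 'know': 6, 'what': 1, 'that': 1, "don't": 2, 'much': 2, 'They': 1}.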