Introduction
Hadoop MapReduce is a software framework for writing applications that process large amounts of data in parallel across distributed clusters of nodes (i.e., computers). MapReduce is closely tied to Hadoop’s Distributed File System (HDFS). A MapReduce job splits the input data set into independent subsets, which are processed in parallel by map tasks; the outputs of the maps then become the input to the reduce tasks. Both the input and the output of a job are stored in a file system. Typically the compute nodes and the storage nodes are the same, so the MapReduce framework and HDFS run on the same set of nodes. MapReduce differs from a framework like Spark in that MapReduce uses disk space to partition data and perform computation, whereas Spark computes primarily in RAM and spills to disk only when needed.
The concept of distributed file systems is important to understand when dealing with big data because, traditionally, data was small enough to be handled on a personal machine in single files. Now, data can be so large that it has to be housed on a server and computed across distributed systems, with disk space (MapReduce) or RAM (Spark) allocated as computations are performed. Below I demonstrate how to get started with MapReduce in Hadoop on Ubuntu. MapReduce is a fundamental concept to master in data science because many later distributed computing and in-memory frameworks build on its ideas.
Tutorial
1. Open a command terminal.
2. Head over to https://drive.google.com/drive/folders/1dXBCs_VUi9z3bhcDrtpvxxQLLg9miIac?usp=sharing and download the files needed for the example. They include compiled JAR files containing the MapReduce Java code for Hadoop, along with the sample data.
3. Create the following folders in your home directory: Hadoop_MapReduceExamples, with the subfolders MapReduceTutorial and MapReduceData.
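Assuming the folders live under your home directory (the tutorial does not spell out full paths), step 3 can be done in one go from the terminal:

```shell
# Create the tutorial folder and its two subfolders under the home directory
# (paths are an assumption; adjust if you keep your files elsewhere)
mkdir -p ~/Hadoop_MapReduceExamples/MapReduceTutorial
mkdir -p ~/Hadoop_MapReduceExamples/MapReduceData
```

The -p flag creates missing parent folders and does nothing if the folders already exist, so the commands are safe to re-run.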
4. Inside the MapReduceData folder, place the following files: SaleCountry.jar, SalesCountry2.jar, and SalesJan2009.csv.
5. Change your working directory to the MapReduceData folder, similar to what is shown below:
6. Get a list of the files in your newly created MapReduceData folder, similar to what is shown below:
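Steps 5 and 6 might look like the following (the path under the home directory is an assumption):

```shell
# Move into the data folder created in step 3; -p keeps this safe on a fresh shell
mkdir -p ~/Hadoop_MapReduceExamples/MapReduceData
cd ~/Hadoop_MapReduceExamples/MapReduceData
# List the files placed there in step 4
ls -l
```

If step 4 was completed, the listing should show SaleCountry.jar, SalesCountry2.jar, and SalesJan2009.csv.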
7. Start Hadoop by running the following command in the terminal (adjust the path to point at your local Hadoop installation):
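A minimal sketch, assuming your Hadoop installation path is exported as HADOOP_HOME (your install location may differ):

```shell
# Start the HDFS daemons (NameNode, DataNode, SecondaryNameNode)
# HADOOP_HOME is assumed to point at your local Hadoop installation
$HADOOP_HOME/sbin/start-dfs.sh
# Confirm the daemons are running
jps
```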
8. Make the following directory inside the MapReduceData folder:
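Assuming the job reads its input from a local folder (the tutorial does not say whether the input lives locally or in HDFS), step 8 could be:

```shell
# Create the input folder inside MapReduceData (folder name taken from step 9)
mkdir -p ~/Hadoop_MapReduceExamples/MapReduceData/inputSalesExample
# If your job instead reads its input from HDFS, the equivalent would be:
#   hdfs dfs -mkdir /inputSalesExample
```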
9. Place the following file inside the newly created inputSalesExample folder: SalesJan2009.csv
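A sketch of step 9, assuming SalesJan2009.csv sits in MapReduceData as placed in step 4:

```shell
# Copy the sample data into the input folder
cp ~/Hadoop_MapReduceExamples/MapReduceData/SalesJan2009.csv \
   ~/Hadoop_MapReduceExamples/MapReduceData/inputSalesExample/
# HDFS variant, if the input folder was created in HDFS instead:
#   hdfs dfs -put SalesJan2009.csv /inputSalesExample
```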
10. Start YARN by running the following command:
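Assuming the same HADOOP_HOME as in step 7:

```shell
# Start the YARN daemons (ResourceManager and NodeManagers)
$HADOOP_HOME/sbin/start-yarn.sh
# jps should now also list ResourceManager and NodeManager
jps
```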
11. Create a new folder on your desktop called mapreduce_output_sales and change into it:
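Step 11 as shell commands (~/Desktop is an assumption; on some Ubuntu locales the desktop folder has a different name):

```shell
# Create the output folder on the desktop and move into it
mkdir -p ~/Desktop/mapreduce_output_sales
cd ~/Desktop/mapreduce_output_sales
```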
12. Next, run MapReduce using the code inside the SalesCountry2.jar file located in the MapReduceData folder. The output will be sent to the mapreduce_output_sales folder you created.
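A hedged sketch of the job submission. The driver class inside SalesCountry2.jar is not stated in the tutorial, so SalesCountry.SalesCountryDriver below is an assumed name; the input and output paths likewise depend on whether your setup uses the local file system or HDFS:

```shell
# Submit the MapReduce job packaged in SalesCountry2.jar
# SalesCountry.SalesCountryDriver is an assumed driver class name
hadoop jar ~/Hadoop_MapReduceExamples/MapReduceData/SalesCountry2.jar \
  SalesCountry.SalesCountryDriver \
  ~/Hadoop_MapReduceExamples/MapReduceData/inputSalesExample \
  ~/Desktop/mapreduce_output_sales
```

Note that Hadoop jobs typically refuse to run if the output directory already exists; if the job complains, point it at a fresh path.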
13. You should see output similar to this:
14. Let's review the results inside the mapreduce_output_sales folder.
15. The output should look similar to what is shown below:
16. The results from MapReduce are located in the part-00000 file inside the mapreduce_output_sales folder.
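Steps 14 through 16 amount to printing the reducer output (local path assumed; use hdfs dfs -cat if your output folder lives in HDFS):

```shell
# Each line of part-00000 is a key/value pair emitted by the reducer
cat ~/Desktop/mapreduce_output_sales/part-00000
# HDFS variant:
#   hdfs dfs -cat /mapreduce_output_sales/part-00000
```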
Breya Walker
Data Scientist - HeySoftware!