Introduction
Hadoop MapReduce is a software framework for writing applications that process large amounts of data in parallel across distributed clusters of nodes (i.e., computers). MapReduce is closely tied to Hadoop’s Distributed File System (HDFS). A MapReduce job splits the input data set into independent subsets, which are processed in parallel by map tasks; the outputs of the maps then become the input to the reduce tasks. Both the input and the output of a job are stored in a file system. Typically the compute nodes and the storage nodes are the same, so the MapReduce framework and HDFS run on the same set of nodes. MapReduce differs from a framework like Spark in that MapReduce uses disk space to partition data and perform computation, whereas Spark computes primarily in RAM and spills to disk only when needed.
The concept of distributed file systems is important to understand when dealing with big data because, traditionally, data was small enough to be handled on a personal machine in single files. Now, data can be so large that it has to be housed on a server and computed across distributed systems, with disk space (MapReduce) or RAM (Spark) allocated as computations are performed. Below I demonstrate how to get started with MapReduce in Hadoop on Ubuntu. MapReduce is a fundamental concept to master in data science because many later distributed computing and in-memory frameworks build on its ideas.
Tutorial
1. Open a command terminal.
2. Head over to https://drive.google.com/drive/folders/1dXBCs_VUi9z3bhcDrtpvxxQLLg9miIac?usp=sharing and download the files needed for the example. They include compiled JAR files containing the MapReduce Java code for Hadoop, along with the sample data.
3. Create the following folders in your home directory: Hadoop_MapReduceExamples, with the subfolders MapReduceTutorial and MapReduceData.
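Assuming the folders live under your home directory (the tutorial does not spell out full paths), step 3 can be done in one go from the terminal:

```shell
# Create the tutorial folder and its two subfolders under the home directory
# (paths are an assumption; adjust if you keep your files elsewhere)
mkdir -p ~/Hadoop_MapReduceExamples/MapReduceTutorial
mkdir -p ~/Hadoop_MapReduceExamples/MapReduceData
```

The -p flag creates missing parent folders and does nothing if the folders already exist, so the commands are safe to re-run.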
4. Inside the MapReduceData folder, place the following files: SaleCountry.jar, SalesCountry2.jar, and SalesJan2009.csv.
5. Change your working directory to the MapReduceData folder, similar to what is shown below:
6. Get a list of the files in your newly created MapReduceData folder, similar to what is shown below:
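Steps 5 and 6 might look like the following (the path under the home directory is an assumption):

```shell
# Move into the data folder created in step 3; -p keeps this safe on a fresh shell
mkdir -p ~/Hadoop_MapReduceExamples/MapReduceData
cd ~/Hadoop_MapReduceExamples/MapReduceData
# List the files placed there in step 4
ls -l
```

If step 4 was completed, the listing should show SaleCountry.jar, SalesCountry2.jar, and SalesJan2009.csv.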
7. Start Hadoop by running the following command in the terminal (adjust the path to point at your local Hadoop installation):
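A minimal sketch, assuming your Hadoop installation path is exported as HADOOP_HOME (your install location may differ):

```shell
# Start the HDFS daemons (NameNode, DataNode, SecondaryNameNode)
# HADOOP_HOME is assumed to point at your local Hadoop installation
$HADOOP_HOME/sbin/start-dfs.sh
# Confirm the daemons are running
jps
```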
8. Make the following directory inside the MapReduceData folder:
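Assuming the job reads its input from a local folder (the tutorial does not say whether the input lives locally or in HDFS), step 8 could be:

```shell
# Create the input folder inside MapReduceData (folder name taken from step 9)
mkdir -p ~/Hadoop_MapReduceExamples/MapReduceData/inputSalesExample
# If your job instead reads its input from HDFS, the equivalent would be:
#   hdfs dfs -mkdir /inputSalesExample
```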
9. Place the following file inside the newly created inputSalesExample folder: SalesJan2009.csv
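A sketch of step 9, assuming SalesJan2009.csv sits in MapReduceData as placed in step 4:

```shell
# Copy the sample data into the input folder
cp ~/Hadoop_MapReduceExamples/MapReduceData/SalesJan2009.csv \
   ~/Hadoop_MapReduceExamples/MapReduceData/inputSalesExample/
# HDFS variant, if the input folder was created in HDFS instead:
#   hdfs dfs -put SalesJan2009.csv /inputSalesExample
```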
10. Start YARN by running the following command:
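Assuming the same HADOOP_HOME as in step 7:

```shell
# Start the YARN daemons (ResourceManager and NodeManagers)
$HADOOP_HOME/sbin/start-yarn.sh
# jps should now also list ResourceManager and NodeManager
jps
```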
11. Create a new folder on your desktop called mapreduce_output_sales and change into it:
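Step 11 as shell commands (~/Desktop is an assumption; on some Ubuntu locales the desktop folder has a different name):

```shell
# Create the output folder on the desktop and move into it
mkdir -p ~/Desktop/mapreduce_output_sales
cd ~/Desktop/mapreduce_output_sales
```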
12. Next, run MapReduce using the code inside the SalesCountry2.jar file located in the MapReduceData folder. The output will be sent to the mapreduce_output_sales folder you created.
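A hedged sketch of the job submission. The driver class inside SalesCountry2.jar is not stated in the tutorial, so SalesCountry.SalesCountryDriver below is an assumed name; the input and output paths likewise depend on whether your setup uses the local file system or HDFS:

```shell
# Submit the MapReduce job packaged in SalesCountry2.jar
# SalesCountry.SalesCountryDriver is an assumed driver class name
hadoop jar ~/Hadoop_MapReduceExamples/MapReduceData/SalesCountry2.jar \
  SalesCountry.SalesCountryDriver \
  ~/Hadoop_MapReduceExamples/MapReduceData/inputSalesExample \
  ~/Desktop/mapreduce_output_sales
```

Note that Hadoop jobs typically refuse to run if the output directory already exists; if the job complains, point it at a fresh path.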
13. You should see output similar to this:
14. Let's review the results inside the mapreduce_output_sales folder.
15. The output should look similar to what is shown below:
16. The results from MapReduce are located in the part-00000 file inside the mapreduce_output_sales folder.
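Steps 14 through 16 amount to printing the reducer output (local path assumed; use hdfs dfs -cat if your output folder lives in HDFS):

```shell
# Each line of part-00000 is a key/value pair emitted by the reducer
cat ~/Desktop/mapreduce_output_sales/part-00000
# HDFS variant:
#   hdfs dfs -cat /mapreduce_output_sales/part-00000
```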
Breya Walker
Data Scientist - HeySoftware!