How to "Hello World" your first EMR application

This tutorial will explain how to create your first AWS EMR application in 4 simple steps.

So far we have written and tested a simple word-count PySpark application in a local environment. Now let us run it across multiple servers using AWS EMR.

Step 1. Modify the word-count application for EMR

Our existing word-count application will run totally fine in an EMR cluster but it simply outputs its result to the standard output, which will make it harder for us to find the result (when there are multiple servers involved, there will be multiple standard outs). Let us modify the script just a little bit to see the result in a single file in a S3 directory we want. Save the following as word_count_emr.py in a local directory of your choice. Note that you should change the S3 bucket name to that of yours.

from pyspark import SparkContext  
from operator import add  
   
sc = SparkContext()  
data = sc.parallelize(list("Hello World"))  
counts = data.map(lambda x: (x, 1)).reduceByKey(add).sortBy(lambda x: x[1], ascending=False).coalesce(1).saveAsTextFile('s3://hello-world-spark-emr/result')  
sc.stop()

Step 2. Upload the script to S3

Now let us upload the script to a S3 bucket so that our EMR cluster can run it.

In AWS Console Home, click the S3 button to move to S3 Console Home.

In S3 Console Home, click the "Create bucket" button. Enter a bucket name and click the "Create" button.

In S3 Console Home, move to the bucket we just created and click the "Upload" button. Upload our word_count_emr.py.

Step 3. Launch an EMR cluster

Now let us run our PySpark application using AWS EMR.

In AWS Console Home, click the EMR button to move to EMR Console Home.

In EMR Console Home, click the "Create cluster" button to create a cluster.

In the "Create Cluster - Quick Options" page, choose "Step execution" for Launch mode. Select "Spark application" for Step type. Then click the "Configure" button.

Click the folder icon next to Application location. Choose the word_count_emr.py we just have uploaded.

Click the folder icon next to Application location. Choose the word_count_emr.py we just have uploaded. Click the "Add" button in the bottom. In the "Create Cluster - Quick Options" page, adjust the number of instances and click the "Create cluster" button.

After EC2 instances are provisioned and bootstrapped, our step will start running.

After some minutes, our step has completed.

Step 4. Check the result

We can check the result in the S3 location that we specified in our word_count_emr.py.

We can see that a new directory result is created in our S3 bucket.

Inside the folder, we see two files _SUCCESS and part-00000. Download the part-00000 file to check our result.

Yeah, our PySpark application correctly worked in an EMR environment!

For those who want to optimize EMR applications further, the following two blog posts will be definitely useful:

Those who have not run PySpark applications in a local environment yet will find the below two posts helpful:

All source code is available in our Github. Happy using EMR!

Tags

Comments