How to "Hello World" your first Spark application

Big Data is getting more accessible thanks to technologies such as Apache Spark. Here we present five simple steps, numbered 0 through 4, to write your first Spark application.

Knowing how to write and run Spark applications in a local environment is essential because it lets you develop and test your applications in a cost-effective way. This tutorial shows how to do that, using a simple PySpark script in an Ubuntu environment as an example.

Step 0. Install Java

Check whether Java is installed on your system.

~$ java -version

If it says "The program 'java' can be found in the following packages", then Java is not installed. Install it by running the following commands:

~$ sudo add-apt-repository ppa:webupd8team/java
~$ sudo apt-get update
~$ sudo apt-get install oracle-java8-installer
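
Note that the WebUpd8 Oracle Java PPA may no longer be available; if that is the case, installing OpenJDK 8 from Ubuntu's standard repositories should work just as well for Spark (this is an alternative, not part of the original instructions):

~$ sudo apt-get install openjdk-8-jdk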

Type java -version again to verify that Java is installed properly. You should see a message like the one below:

java version "1.8.0_144"  
Java(TM) SE Runtime Environment (build 1.8.0_144-b01)  
Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)


Step 1. Install PySpark

We will install PySpark in a virtual environment so that it does not become a dependency of other projects that do not need it.

~$ mkdir hello_world_spark  
~$ cd hello_world_spark  
~/hello_world_spark$ virtualenv env  
~/hello_world_spark$ source env/bin/activate  
~/hello_world_spark$ pip install pyspark
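
To confirm that PySpark was installed into the virtual environment (an optional check, not in the original steps), you can print its version from Python:

~/hello_world_spark$ python -c "import pyspark; print(pyspark.__version__)"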


Step 2. Configure your virtual environment

We have to let your virtual environment know the location of spark-submit, an executable that was installed on your system along with PySpark. To do so, open ~/hello_world_spark/env/bin/activate in a text editor and edit the following line.

  • From: PATH="$VIRTUAL_ENV/bin:$PATH"
  • To: PATH="$VIRTUAL_ENV/bin:$VIRTUAL_ENV/lib/python2.7/site-packages/pyspark/bin:$PATH"
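
The path above assumes the virtual environment uses Python 2.7; if yours uses a different Python version, the site-packages path will differ. One way to locate the correct pyspark/bin directory while the environment is still active (a suggestion, not from the original post) is:

~/hello_world_spark$ python -c "import pyspark, os; print(os.path.join(os.path.dirname(pyspark.__file__), 'bin'))"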

Save the change and check that spark-submit is now on your PATH.

~/hello_world_spark$ deactivate  
~/hello_world_spark$ source env/bin/activate  
~/hello_world_spark$ spark-submit  
   
# You will see messages like below:  
Usage: spark-submit [options] <app jar | python file> [app arguments]  
Usage: spark-submit --kill [submission ID] --master [spark://...]  
Usage: spark-submit --status [submission ID] --master [spark://...]  
Usage: spark-submit run-example [options] example-class [example args]  
  Options:  
--master MASTER_URL         spark://host:port, mesos://host:port, yarn, or local.  
(truncated by author)

Step 3. Write your PySpark application

Here, we are going to write a simple PySpark application that counts how many times each character appears in the sentence "Hello World". It is simple yet illustrative. Open a text editor and save the following content in a file named word_count.py in your project directory ~/hello_world_spark.

from pyspark import SparkContext
from operator import add

sc = SparkContext()
# Distribute the characters of "Hello World" as an RDD.
data = sc.parallelize(list("Hello World"))
# Pair each character with 1, sum the counts per character,
# and sort the results by count in descending order.
counts = data.map(lambda x: (x, 1)).reduceByKey(add).sortBy(lambda x: x[1], ascending=False).collect()
for (char, count) in counts:
    print("{}: {}".format(char, count))
sc.stop()
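
If you are new to RDD operations, the pipeline above may be easier to follow next to a plain-Python equivalent: map emits a (character, 1) pair for each element, reduceByKey sums the pairs per character, and sortBy orders the results by count. The sketch below (not part of the original post) computes the same thing without Spark:

# Plain-Python equivalent of the RDD pipeline above, for comparison only.
from collections import Counter

counts = Counter("Hello World")           # character -> count, like map + reduceByKey(add)
for char, count in counts.most_common():  # ordered by count, like sortBy(..., ascending=False)
    print("{}: {}".format(char, count))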


Step 4. Run your PySpark application

In your project directory, run the following command:

~/hello_world_spark$ spark-submit word_count.py

You will see a long list of log messages, among which you will happily find the results:

17/09/17 16:44:07 INFO DAGScheduler: Job 0 finished: collect at /home/ubuntu/hello_world_pyspark/word_count.py:6, took 2.899118 s  
l: 3  
o: 2  
 : 1  
e: 1  
d: 1  
H: 1  
r: 1  
W: 1  
17/09/17 16:44:07 INFO SparkUI: Stopped Spark web UI at http://192.168.148.128:4040
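
Spark ran the job locally here because no cluster master was configured. If you prefer to make that explicit (an optional variation, not in the original post), you can pass a local master URL on the command line, where local[*] means "use all available cores":

~/hello_world_spark$ spark-submit --master local[*] word_count.py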

Welcome to the world of Big Data! Before we deploy this simple PySpark application to a cluster of servers using AWS EMR, see our blog post "How to unittest PySpark applications" to familiarize yourself with testing PySpark code. All the source code is available on our GitHub as well.
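
As a preview, here is a minimal sketch of what such a test could look like, assuming the test file sits next to word_count.py; the class and test names are illustrative, not taken from that post.

import unittest
from operator import add

from pyspark import SparkContext


class CharacterCountTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # One local SparkContext shared by every test in this class.
        cls.sc = SparkContext(master="local[2]", appName="character-count-test")

    @classmethod
    def tearDownClass(cls):
        cls.sc.stop()

    def test_counts_each_character(self):
        data = self.sc.parallelize(list("Hello World"))
        counts = dict(data.map(lambda x: (x, 1)).reduceByKey(add).collect())
        self.assertEqual(counts["l"], 3)
        self.assertEqual(counts["o"], 2)
        self.assertEqual(counts["H"], 1)


if __name__ == "__main__":
    unittest.main()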
