A great reason to jump into Spark on Mesos on Google Cloud Platform is that you can quickly spin up a development environment to work with Spark, Mesos, Google Cloud, and Marathon together. A good way to set this up is to follow the steps in Paco Nathan’s (@pacoid) blog post Spark atop Mesos on Google Cloud Platform.
But what’s missing from this configuration is the ability to connect to Google Cloud Storage (GCS) so you can run your Spark queries against persistent, elastic storage. As noted in the diagram below, you first install Spark onto the development Mesos cluster, which contains a master node and three slave nodes. Once the GCS connector is installed, Spark can communicate with GCS. Below are the steps to connect Spark to Google Cloud Storage:
You will first need to install and configure the Google Cloud Storage (GCS) Connector:
- You need a service account, a key, and configuration to allow connections between Google Compute VMs and Storage.
- Spark uses parts of the Hadoop infrastructure, which connects through the GCS connector to your Google Cloud Storage.
- The Mesosphere installation via mesosphere.google.io automatically pre-installs Hadoop 2.4, which lives in a different location than the Spark bits you installed as noted in Paco’s blog post.
Note, for my installation, I used the Spark 1.1 build for Hadoop 2.4 (Paco’s blog post used Spark 1.0.1 with Hadoop 2.4). Therefore the Spark Home folder is:
You can download the latest version of the Apache Spark binaries; to make it easier, grab the build that is pre-built for Hadoop 2.4. To assist with the configuration and installation steps below, in addition to setting the MESOS_MASTER environment variable, it may be helpful to create a SPARK_HOME variable:
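As a minimal sketch of the two variables mentioned above: both values here are assumptions for illustration (a documentation-range IP address and a guessed unpack location), not the paths from my cluster, so substitute your own.

```shell
# Hypothetical values: replace both with the ones from your own cluster.
export MESOS_MASTER="203.0.113.10"                     # assumed external IP of the Mesos master
export SPARK_HOME="$HOME/spark-1.1.0-bin-hadoop2.4"    # assumed location where Spark was unpacked

# Echo them back so a typo is easy to spot before continuing.
echo "Spark home:   $SPARK_HOME"
echo "Mesos master: mesos://$MESOS_MASTER:5050"
```

Adding these two export lines to your shell profile on the master saves retyping them in each new session.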
Installing the GCS Connector
For Spark (and Hadoop) to connect to Google Cloud Storage on a cluster spun up through Mesosphere on Google Cloud, you will need to install the GCS connector. Note that if you created your own Google Cloud Hadoop install, you would typically use “bdutil” [https://cloud.google.com/hadoop/setting-up-a-hadoop-cluster], which automatically includes the installation of this connector. To install the connector:
cd $SPARK_HOME/lib
wget https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar
For more information, you can also reference the Google Cloud Storage Connector documentation.
Create a Google Service Account
In order for Spark or Hadoop to connect to Google Cloud Storage from the Google Compute VM, they will utilize OAuth 2.0 Authentication. To properly configure this, you may want to create a service account and an associated private key. The instructions to complete this step can be found at Generating a Private Key.
If you already have a service account, you can obtain the necessary credentials information in the Google Developers’ Console via Project > APIs & Auth > Credentials. An example of one of my service accounts is provided below.
Ensure you generate and save a P12 key, as we will be using this key to connect the compute VMs to storage.
Push the Private Key up to the Google Compute VMs
As previously noted, the key is part of what allows you to connect compute to storage. SCP the .p12 key file up to a location that all of your masters and slaves can access. In the code example below, I pushed the file into the /home/jclouds/.credentials folder on every one of the master and slave nodes.
ssh jclouds@$MESOS_MASTER$ 'mkdir -p ~/.credentials'
scp $MY_KEY$.p12 jclouds@$mesosnode$:~/.credentials/
In my development Mesosphere environment there are four nodes, hence the need to push this up four times. You can find the external IP addresses that make up the $mesosnode$ by going to your Mesosphere Cluster > Cluster > Topology section.
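Since the same two commands have to be repeated for each of the four nodes, a small loop can help. This is only a sketch: the IP addresses in NODES and the KEY_FILE name are hypothetical placeholders, and the script defaults to a dry run that prints the commands instead of executing them.

```shell
#!/usr/bin/env bash
# Sketch: create ~/.credentials and copy the key onto every node.
# NODES and KEY_FILE are hypothetical placeholders; use your own IPs and key file.
NODES="203.0.113.10 203.0.113.11 203.0.113.12 203.0.113.13"
KEY_FILE="${KEY_FILE:-my-service-account.p12}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = actually run them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

push_key() {
  for node in $NODES; do
    run ssh "jclouds@$node" 'mkdir -p ~/.credentials'
    run scp "$KEY_FILE" "jclouds@$node:~/.credentials/"
  done
}

push_key
```

Once the printed commands look right, rerun with DRY_RUN=0 to perform the copy.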
Test GCS Connection with Hadoop
A quick way to validate that the GCS connection works is to first test it within Hadoop. But to do this, you will need to update HADOOP_CLASSPATH in the hadoop-env.sh, as well as modify core-site.xml.
The Hadoop shell will use the hadoop-env.sh bash script to set the environment variables to be used. If hadoop-env.sh is not within the /etc/hadoop/conf folder, you can copy it from the example-confs folder, e.g.
cp /usr/lib/hadoop-0.20-mapreduce/example-confs/conf.secure/hadoop-env.sh /etc/hadoop/conf/hadoop-env.sh
Then update the hadoop-env.sh file you just copied to include the following HADOOP_CLASSPATH statement.
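A HADOOP_CLASSPATH statement along these lines should work, assuming the connector jar was downloaded into $SPARK_HOME/lib as in the install step above. The SPARK_HOME fallback path is a hypothetical example; adjust it to wherever your jar actually lives.

```shell
# Assumption: the GCS connector jar sits in $SPARK_HOME/lib (see the install step).
# The fallback SPARK_HOME below is a hypothetical path used only if the variable is unset.
SPARK_HOME="${SPARK_HOME:-$HOME/spark-1.1.0-bin-hadoop2.4}"
GCS_JAR="$SPARK_HOME/lib/gcs-connector-latest-hadoop2.jar"

# Append the jar to any existing classpath instead of overwriting it.
export HADOOP_CLASSPATH="${HADOOP_CLASSPATH:+$HADOOP_CLASSPATH:}$GCS_JAR"
```

Appending rather than assigning keeps any entries that other tools have already placed on HADOOP_CLASSPATH.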
Update the core-site.xml
The core-site.xml defines the various Hadoop configurations including the connection parameters to Google Cloud Storage.
<description>The FileSystem for gs: (GCS) uris.</description>
<description>The AbstractFileSystem for gs: (GCS) uris. Use with Hadoop 2.</description>
The fs.gs.impl and fs.AbstractFileSystem.gs.impl properties tell Hadoop to use Google Cloud Storage for the gs: URI scheme. The fs.gs.project.id corresponds to the project id for your Mesosphere cluster. The fs.gs.auth.service.account.email and fs.gs.auth.service.account.keyfile properties correspond to the Google service account you created earlier in this post.
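To make the five properties concrete, the sketch below writes a candidate core-site.xml to a scratch file that you can review and merge into /etc/hadoop/conf/core-site.xml. The project id, service account email, and key file name are hypothetical placeholders; the two class names are the ones shipped in the GCS connector jar.

```shell
# Sketch: generate an example core-site.xml in a scratch location for review.
# PROJECT_ID, SA_EMAIL, and KEYFILE are hypothetical placeholders; use your own values.
OUT="${OUT:-${TMPDIR:-/tmp}/core-site.gcs-example.xml}"
PROJECT_ID="${PROJECT_ID:-my-project-id}"
SA_EMAIL="${SA_EMAIL:-your-service-account@developer.gserviceaccount.com}"
KEYFILE="${KEYFILE:-/home/jclouds/.credentials/my-service-account.p12}"

cat > "$OUT" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
    <description>The FileSystem for gs: (GCS) uris.</description>
  </property>
  <property>
    <name>fs.AbstractFileSystem.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
    <description>The AbstractFileSystem for gs: (GCS) uris. Use with Hadoop 2.</description>
  </property>
  <property>
    <name>fs.gs.project.id</name>
    <value>$PROJECT_ID</value>
  </property>
  <property>
    <name>fs.gs.auth.service.account.email</name>
    <value>$SA_EMAIL</value>
  </property>
  <property>
    <name>fs.gs.auth.service.account.keyfile</name>
    <value>$KEYFILE</value>
  </property>
</configuration>
EOF

echo "Wrote example configuration to $OUT"
```

Writing to a scratch file first avoids clobbering any settings already present in the cluster's core-site.xml.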
Test Hadoop / GCS Connectivity
To test whether Hadoop can connect to Google Cloud Storage, run the following command:
hadoop fs -ls gs://$mybucket
For example, I have a sparkdemo bucket within my project; the command and results are below.
Troubleshooting Connectivity Issue
If you are not able to connect, you may need to configure your GCE VM to access GCS via the service account using the following command:
gcloud auth login
Connecting Spark to GCS
We can get Spark to connect to GCS through Hadoop. To do this, copy the core-site.xml into the Spark conf folder, i.e.:
cp /etc/hadoop/conf/core-site.xml $SPARK_HOME/conf
Finally, you can start your spark-shell using --jars to reference the GCS connector jar, so that Spark can use it and distribute it to all of the other nodes:
export CLASSPATH=$SPARK_HOME/lib/gcs-connector-latest-hadoop2.jar
./bin/spark-shell --master mesos://$MESOS_MASTER:5050 --jars $CLASSPATH
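If the shell fails to resolve gs: paths, it may be worth confirming that the two files this walkthrough placed under the Spark home are really there. The helper below is a hypothetical convenience function, not part of Spark or Hadoop, and the fallback SPARK_HOME path is an assumption.

```shell
# preflight: hypothetical helper that reports whether the connector jar and the
# copied core-site.xml exist under the given Spark home.
# Returns the number of missing files (0 means everything was found).
preflight() {
  spark_home="$1"
  missing=0
  for f in "$spark_home/lib/gcs-connector-latest-hadoop2.jar" \
           "$spark_home/conf/core-site.xml"; do
    if [ -f "$f" ]; then
      echo "found:   $f"
    else
      echo "missing: $f"
      missing=$((missing + 1))
    fi
  done
  return "$missing"
}

# The fallback path is an assumed unpack location; SPARK_HOME takes precedence if set.
preflight "${SPARK_HOME:-$HOME/spark-1.1.0-bin-hadoop2.4}" \
  || echo "Fix the missing files before launching spark-shell."
```

The check only looks at the local node; the --jars option is what ships the connector to the other nodes at launch time.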
Within the Spark shell, you can now connect to GCS. The script below connects to a file within my sparkdemo bucket and performs a row count.
val mobiletxt = sc.textFile("gs://sparkdemo/mobile/HiveSampleData.txt")
mobiletxt.count()
With the shell results below.
14/10/05 23:12:56 INFO SparkContext: Job finished: count at <console>:15, took 2.636260399 s
res1: Long = 59793
And there you have it! Now that it's configured you can have your Spark Mesosphere cluster execute against GCS. Enjoy!