Big Data is all the rage. It’s being discussed by industry experts, business analysts, and executives. However, it’s not enough to have access to your Big Data- to showcase the true value it needs to be presented visually. In fact, it’s been proven that 90% of information is transmitted to the brain is visual.
As a data scientist and engineer, I appreciate that Apache Spark has many components to make it easy to analyze, gain insight, and to generate recommendations from my data.However, as noted within my previous presentation, one of the things missing is an easy way for analysts to visualize their data.
The good news is there is an easy way to gain visuals of your data by connecting Tableau to SparkSQL!
Before you get started ensure you have these prerequisites. My own configuration includes:
- a MacBook Pro Retina running OSX Yosemite
- Spark 1.1 with VMWare Fusion running Tableau 8.2 on Windows 7 Ultimate.
- Hadoop 2.5.0 and Hive 0.12 - On OSX, an easy way to perform this installation is to use HomeBrew. A handy reference to configure Hadoop 2.5 is Hadoop MapReduce Next Generation - Setting up a Single Node Cluster.
- Tableau 8.2 on Windows - This is because connectivity from Tableau to SparkSQL is through the Simba Spark ODBC Driver, which is only available on Windows (for now).
- Databricks ODBC Driver for Apache Spark - currently you can still get the driver from Datastax Driver downloads - 32-bit | 64-bit. For more information about the Simba Spark ODBC driver, check out the Simba Technologies’ blog post.
Sidebar: For more information about the Simba Spark ODBC driver, check out the Simba Technologies’ blog post.
Configuring Hive To connect Tableau to SparkSQL (via the Simba Spark ODBC driver), the connection will be submitted through the Thrift Server (more on how to get this running later in the post). For Spark 1.1, the Thrift Server is compiled against Hive 0.12 hence the reason you will want to ensure that your own Hive configuration matches. Note: Hive 0.13 will be back ported to work against Spark 1.1 in the near future). When connecting to the Thrift Server, you will ultimately connect to the Hive metastore which requires remote access. To simplify connectivity, you may want to upgrade the Hive metastore from the default derby database to MySQL. A good reference is Changing Default metastore [derby] of Hive to MySQL.Note: PostgreSQL is also a great option, I just happened to have MySQL already installed; below are some quick configuration tips. MySQL Configuration When you create a user account so that Hive (for both Hadoop and Spark) can connect to the MySQL database (e.g. “hiveuser”), ensure that this user account is accessible not only by the localhost but also by the hostname (in my case, my hostname for my laptop is gallifrey.local)
MySQL Java Connector
Once you have configured the MySQL database, ensure that you have placed the MySQL Java Connector into the Hive Classpath. For example, I used Homebrew to install Hive and my version of the MySQL Java Connector is “mysql-connector-java-5.1.27.jar”. I placed the jar file into the /usr/local/Cellar/hive/0.12.0/libexec/lib folder to ensure that Hive is able to connect to the MySQL database.
Many of the issues surrounding Spark’s interaction with Hive (via the HiveContext) can be attributed to with the initial Hive configuration to Hadoop/YARN itself.Below are some common configuration issues:
Ensure You Have Set Your Java Connection URL to a Hostname
The default hive-site.xml has a Java ConnectionURL that points to the localhost derby database. When you switch over to another database (e.g. MySQL), ensure you are using a hostname instead of “localhost”. This way you can make remote connections to the Hive metastore (not just local ones)
<property> <name>javax.jdo.option.ConectionURL</name> <value>jdbc:MySQL://gallifrey.local:3306/metastore_db?CreateDataba seIfNotExist=true</value> <description>JDBC connect string metastrore</description> </property>
Take Note of the javax.jdo.option.ConnectionUserName
More on this further down the blog post, but ensure you know the javax.jdo.option.ConnectionUserName value from the hive-site.xml that you use to connect Hive to the Hive metastore.In my case, my user is “hiveuser”. I’ve included additional troubleshooting steps at the end of this article. For more information, please reference the Hive Metastore Administration manual to properly configure Hive.
Spark and the Hive Context
Now, we can finally go back to working within Spark 1.1. To get SparkSQL / Tableau connectivity, you will need to have the Spark Thrift Server running. Follow the instructions within the Spark 1.1 - Spark SQL Programming Guide.The key call outs are noted below:
Rebuilding the Assembly Jar
As noted under the “Hive Tables”section of the Spark SQL Programming Guide, you will need to run “sbt/sbt -Dhadoop.version=2.5.0 -Phive assembly/assembly”to build a new assembly jar that includes Hive (i.e. Hive 0.12) via -Phive.To get Spark to utilize this Hive context, you will need to copy the hive-site.xml that you had configured in the above steps into the $SPARK_HOME/conf folder where $SPARK_HOME is the root folder for Spark (e.g. /usr/local/Cellar/spark). Note that this command also uses -Dhadoop.version=2.5.0 to correspond to the Hadoop version that you are using. Multiple -Dhadoop.version can be customized to any Hadoop version that you are using.
Connecting the Spark Hive Context to the Hive Metastore
In the previous section, you had created a new MySQL Hive Metastore - now you will need to allow Spark to connect to the same metastore. In addition to copying the hive-site.xml to the $SPARK_HOME/conf folder, you will need to copy the MySQL Java Connector to a folder that is accessible by the Spark classpath.For example, I had placed mysql-connector-java-5.1.27.jarin the $SPARK_HOME/lib folder. This will ensure that the Spark Hive Context has the necessary jars to connect to the MySQL database.
This contains various environment variables that Spark uses to configure its assorted operations. Important ones are SPARK_PUBLIC_DNS and SPARK_MASTER_IP - ensure they are set to the hostname so the various Spark operations understand which server to execute against.
Start the Spark Service
You can quickly start up the Spark Service from the bash terminal by using the following commands. It will set the SPARK_HOME and CLASSPATH variables and start the various services to run Spark.
Once the service is started, you will be able to see the service and any running applications by going to: [hostname]:8080
While it is not necessary for the purpose of running the Spark Thrift Service, you can use the following command to interact with the spark-shell to run the various samples and exercises within the Spark SQL Programming Guide.
./bin/spark-shell--master spark://gallifrey.local:7077--driver-class-path $CLASSPATH
Note: It also includes the --driver-class-path parameter so that Spark will be able to utilize the MySQL Java connector to connect to the Hive metastore.
Starting Spark Thrift Server
To connect Spark with Tableau, you will use the Simba Spark ODBC Driver on Tableau 8.2 on Windows. The Thrift Service allows you to interact with the Spark within the Hive Context, but the SQL Context requires a separate process. While running in localhost mode isn’t necessary, for this setup where the ODBC client and SparkSQL are on separate systems, you can configure the listening host and port in the same way as HiveServer2 of Hive:
Via environment variables
Via system properties
For the latter, you may pass the properties via command line:
$ ./sbin/start-thriftserver --hiveconf
This behavior is inherited from Hive since Spark SQL Thrift Server is a variant of HiveServer2. Thanks to Cheng Lian for clarifying this!
Continuing with my configuration, the bash commands I use are:
Classpath = $SPARK_HOME/lib/MySQL- connector-java-5.1.27.jar
spark://gallifrey.local:7077 --driver-class-path$CLASSPATH --
hiveconf hive.server2.thrift.bind.host gallifrey.local --hiveconf
I’m directing the Spark Thrift Server to connect to the Spark service and also utilize the --driver-class-path in order for Spark to be able to utilize the MySQL Java connector. And then ultimately connect to the Hive metastore.
Sidebar: The reason for using --driver-class-path instead of --jar to load the MySQL connector jar is because the --jar is a special class loading for Spark.But the MySQL connector jar is loaded by Hive and it is not aware of the --jar special class loader.On the other hand, the --driver-class-path is captured by the startup scripts and appended to classpath settings that is used to start the JVM running the driver.
Connecting Tableau to SparkSQL
Finally! At this point, it’s all about getting Tableau to connect to SparkSQL.To do this, ensure that you’ve already installed the Databricks ODBC Driver for Apache Spark. Currently, you can still get the driver from Datastax Driver downloads - 32-bit and 64-bit. Create ODBC connection You will be creating a new ODBC DSN setup within the ODBC Administrator. Be sure to choose the right bit-ness (i.e. 64-bit or 32-bit to match driver and Tableau version). The key parameters as noted below are:
- Host: The IP address of your Spark Server / OSX box
- Port: 10000 (this is the default value of the Thrift Server port)
- Database: Which database you are connecting to, typically this is default
- Shark Server: Specify “Shark Server 2”
- Authentication: For the authentication, choose “User Name”and for the username, type in the user account that you use to connect to Hive metastore (e.g. hiveuser).
Connect from Tableau
Once the ODBC connection is configured, you can now connect to SparkSQL via the Spark Thrift Server by connecting through the “Connected to Other Database (ODBC)”and choosing the Simba Spark ODBC connection you had made previously.
The key parameters are:
- Server: This will be defined by the ODBC connection you had just made. In my case its gallifrey.local.
- Database: Choose SHARK
- Schema: Choose default (typically)
At this point, you will be able to choose the tables that exist within the Hive Context through Spark. It is important to note that when you create Spark tables (for example, via.registerTempTable) these are operating within the Spark environment, which resides in a separate process from the Hive Metastore. This means that tables created within the Spark context are not available through the Thrift Server. To achieve this, within the Spark context save your temporary table into Hive - then the Spark Thrift Server will be able to see your table. Note, as part of Spark 1.2, there is work on pure SQL alternatives for accessing many of these data sources as part of the external datasource API.
Once the Spark Thrift Server is able to “see”this table, it will then be able to utilize the advantages and optimizations of Spark, to execute your queries. For example, the screenshot below is of Tableau 8.2 on Windows connecting to a sample dataset of mobile platforms (i.e. iOS, Android, etc.) and is displaying the results on a map.
And there you have it! The good news is that while the initial configuration steps take a little while to get used to, this is a one-time setup, which will allow you to use Tableau to connect to SparkSQL.
Last but not least, I wanted to give my sincere thanks to Cheng Lian, Michael Armbrust, Reynold Xin, Arsalan Tavakoli-Shiraji, and Pat McDonough at DataBricks for helping me put this method together. I would love to see what visuals you are able to create with your data!
If you run into any issues, trouble shoot with these steps:
Starting Hive / Schema Verification Issues
After configuring Hive, if you run into the issue below when starting Hive:
FAILED: Execution Error, return code 1from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
The solution is to ensure that within the hive-site.xml, the parameter hive.metastore.schema.verification is set to false.
Thanks to the StackOverflow post How to Configure MySQL 5.5.35 with hive 0.12.
Specified key was too long; max key length is 767 bytes error:
If your Hive metastore is a MySQL database and you run into the Specified key was too long; max key length is 767 bytes issue, it may be due to Latin1 not being the character set.To fix this, execute the following within MySQL:
ALTER DATABASE hive_metastore CHARACTER SET Latin1
Attribution to this solution can be found at Hive check execute failed HDP1.3.
For more information, please reference the Hive Metastore Administration manual to properly configure Hive.