Wednesday, September 23, 2015

Spark client configuration in a Custom Hadoop Client Environment

Hi,

In this post, I will show how to make a Spark client talk to a Hadoop cluster running on an Oracle BDA. Currently, with Cloudera Manager and CDH 5.4.0, the Spark client is not easily configurable: when you download the client configuration, you will see that the Spark-related files are not relevant to the target cluster. After a huge effort (!) I managed to start spark-shell preconfigured for the target BDA environment.


Previously, I configured the Hadoop and HDFS client configs in this blog.

But I think Spark client configuration is not well supported by CM. It took a long time, and I hit many Spark bugs along the way; thankfully they are all resolved in my version :)

In this post, I assume you have downloaded and installed the Spark RPMs, and that the spark-shell and spark-submit binaries are working on your system.
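A quick sanity check for this prerequisite (a minimal sketch; the binaries come from the Spark RPMs and should already be on the PATH):

$which spark-shell spark-submit    # both should resolve to the RPM-installed wrappers
$spark-submit --version            # prints the Spark version banner and exits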

Let's start by downloading the Spark client configuration from CM and looking at the files.
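If you prefer the command line to the CM web UI, the same client configuration zip can be fetched through the Cloudera Manager REST API (a sketch; the CM host, credentials, cluster name and Spark service name below are placeholders for your environment, and CM 5.4 exposes API v10):

$curl -u admin:admin -o spark-clientconfig.zip \
  "http://cmhost:7180/api/v10/clusters/cluster1/services/spark_on_yarn/clientConfig"
$unzip spark-clientconfig.zip      # unpacks the spark-conf directory discussed below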

The yarn-conf directory in the download has the necessary files, so let's check the spark-conf directory.

Here is the content of spark-defaults.conf. As you can see, it contains default values, not the real values of the target configuration (a quick way to check this against the target cluster is sketched after the listing).

spark.eventLog.dir=/user/spark/applicationHistory
spark.eventLog.enabled=true
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.shuffle.service.enabled=true
spark.shuffle.service.port=7337
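One way to see how far off these defaults are is to pull the real file from a gateway node of the target BDA and diff it (a sketch; the host name is a placeholder, and /etc/spark/conf is the usual client configuration path on CDH package installs):

$scp bdagateway01:/etc/spark/conf/spark-defaults.conf /tmp/bda-spark-defaults.conf
$diff spark-conf/spark-defaults.conf /tmp/bda-spark-defaults.conf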

And spark-env.sh also ships with default parameters, including unsubstituted {{...}} placeholders.

#!/usr/bin/env bash
##
# Generated by Cloudera Manager and should not be modified directly
##

if [ -z "$SPARK_CONF_DIR" ]; then
  export SPARK_CONF_DIR=$(cd $(dirname $BASH_SOURCE) && pwd)
fi

export SPARK_HOME={{SPARK_HOME}}
export DEFAULT_HADOOP_HOME={{HADOOP_HOME}}

### Path of Spark assembly jar in HDFS
export SPARK_JAR_HDFS_PATH=${SPARK_JAR_HDFS_PATH:-'{{SPARK_JAR_HDFS_PATH}}'}

### Extra libraries needed by some Spark subsystems.
CDH_HIVE_HOME=${HIVE_HOME:-'{{HIVE_HOME}}'}
CDH_FLUME_HOME=${FLUME_HOME:-'{{FLUME_HOME}}'}
CDH_PARQUET_HOME=${PARQUET_HOME:-'{{PARQUET_HOME}}'}
CDH_AVRO_HOME=${AVRO_HOME:-'{{AVRO_HOME}}'}
HADOOP_EXTRA_CLASSPATH=${HADOOP_CLASSPATH:-'{{HADOOP_EXTRA_CLASSPATH}}'}

export HADOOP_HOME=${HADOOP_HOME:-$DEFAULT_HADOOP_HOME}

if [ -n "$HADOOP_HOME" ]; then
  LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${HADOOP_HOME}/lib/native
fi

SPARK_EXTRA_LIB_PATH="{{SPARK_EXTRA_LIB_PATH}}"
if [ -n "$SPARK_EXTRA_LIB_PATH" ]; then
  LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$SPARK_EXTRA_LIB_PATH
fi

export LD_LIBRARY_PATH
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-$SPARK_CONF_DIR/yarn-conf}

# This is needed to support old CDH versions that use a forked version
# of compute-classpath.sh.
export SCALA_LIBRARY_PATH=${SPARK_HOME}/lib

# Set distribution classpath. This is only used in CDH 5.3 and later.
SPARK_DIST_CLASSPATH="$HADOOP_HOME/client/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:$($HADOOP_HOME/bin/hadoop --config $HADOOP_CONF_DIR classpath)"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:$CDH_HIVE_HOME/lib/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:$CDH_FLUME_HOME/lib/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:$CDH_PARQUET_HOME/lib/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:$CDH_AVRO_HOME/*"
if [ -n "$HADOOP_EXTRA_CLASSPATH" ]; then
  SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:$HADOOP_EXTRA_CLASSPATH"
fi
export SPARK_DIST_CLASSPATH


As in the previous custom client environment setup, we copy those files to /tmp/PRODCONF/spark.
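Roughly, the copy looks like this (a sketch, assuming the downloaded client configuration was unpacked into a local spark-conf directory):

$mkdir -p /tmp/PRODCONF/spark
$cp -r spark-conf /tmp/PRODCONF/spark/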

$ls /tmp/PRODCONF/spark/spark-conf
log4j.properties  spark-defaults.conf  spark-env.sh  yarn-conf

After that, we export the configuration directory variables for the target BDA environment, as we did previously.

export HADOOP_CONF_DIR=/tmp/PRODCONF/hdfs/hadoop-conf
export HIVE_CONF_DIR=/tmp/PRODCONF/hive/hive-conf
export YARN_CONF_DIR=/tmp/PRODCONF/yarn/yarn-conf
export SPARK_CONF_DIR=/tmp/PRODCONF/spark/spark-conf

After that, we launch spark-shell:


$spark-shell --verbose

/tmp/PRODCONF/spark/spark-conf/spark-env.sh: line 43: {{HADOOP_HOME}}/bin/hadoop: No such file or directory
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
        at org.apache.spark.deploy.SparkSubmitDriverBootstrapper$.main(SparkSubmitDriverBootstrapper.scala:71)
        at org.apache.spark.deploy.SparkSubmitDriverBootstrapper.main(SparkSubmitDriverBootstrapper.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FSDataInputStream
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        ... 2 more

Remember!

If you remember, I mentioned the irrelevant Spark files at the beginning. In the downloaded spark-env.sh we see the following lines:

export SPARK_HOME={{SPARK_HOME}}
export DEFAULT_HADOOP_HOME={{HADOOP_HOME}}

So even though I exported HADOOP_CONF_DIR to point at the Hadoop client configuration, spark-env.sh still cannot find the hadoop binary, because the {{HADOOP_HOME}} placeholder was never replaced with a real path.
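A quick way to confirm which template values Cloudera Manager left unsubstituted in the downloaded file (a sketch):

$grep -n '{{' /tmp/PRODCONF/spark/spark-conf/spark-env.sh    # every hit is an unresolved {{...}} placeholder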

Do the following. 

export SPARK_HOME=/usr/lib/spark
export HADOOP_HOME=/usr/lib/hadoop

After that, spark-shell starts successfully:

$spark-shell --verbose

Using properties file: /tmp/PRODCONF/spark/spark-conf/spark-defaults.conf
...
Spark properties used, including those specified through
--conf and those from the properties file /tmp/PRODCONF/spark/spark-conf/spark-defaults.conf:
  spark.eventLog.enabled -> true
  spark.serializer -> org.apache.spark.serializer.KryoSerializer
  spark.shuffle.service.enabled -> true
  spark.shuffle.service.port -> 7337
  spark.eventLog.dir -> /user/spark/applicationHistory
....
15/09/23 17:44:55 INFO Utils: Successfully started service 'HTTP class server' on port 40226.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.3.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_65)
Type in expressions to have them evaluated.
Type :help for more information.
15/09/23 17:44:58 INFO SparkContext: Running Spark version 1.3.0
15/09/23 17:44:58 INFO SecurityManager: Changing view acls to: erkanul,erkanul@KRBHOST
....
scala>

Here we see that spark-shell started successfully, but the property values imported from the configuration files are still not the ones relevant to the target system.

Remember! 

As I said at the beginning, the spark-defaults.conf file is different on the target BDA, and you should copy the real values into your client configuration manually.
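For illustration, a corrected spark-defaults.conf usually points the event log at HDFS on the BDA and references its shuffle and history services; the host names below are hypothetical placeholders, and the real values must be taken from the target cluster's own configuration:

spark.eventLog.enabled=true
spark.eventLog.dir=hdfs://bdanamenode:8020/user/spark/applicationHistory
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.shuffle.service.enabled=true
spark.shuffle.service.port=7337
spark.yarn.historyServer.address=http://bdahistoryserver:18088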

After that, run spark-shell again; behind the scenes it issues a spark-submit that submits a YARN application to the target BDA:

15/09/23 17:51:35 INFO YarnClientImpl: Submitted application application_1441267974514_0928
15/09/23 17:51:36 INFO Client: Application report for application_1441267974514_0928 (state: ACCEPTED)
15/09/23 17:51:36 INFO Client:
         client token: Token { kind: YARN_CLIENT_TOKEN, service:  }
         diagnostics: N/A
         ApplicationMaster host: N/A
         ApplicationMaster RPC port: -1
         queue: root.erkanul
         start time: 1443019895200
         final status: UNDEFINED
         tracking URL: BDAURL/proxy/application_1441267974514_0928/
         user: erkanul
15/09/23 17:51:37 INFO Client: Application report for application_1441267974514_0928 (state: ACCEPTED)
...
15/09/23 17:51:38 INFO JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
15/09/23 17:51:38 INFO Client: Application report for application_1441267974514_0928 (state: RUNNING)
15/09/23 17:51:38 INFO Client:
         client token: Token { kind: YARN_CLIENT_TOKEN, service:  }
         diagnostics: N/A
         ApplicationMaster host: BDAURL
         ApplicationMaster RPC port: 0
         queue: root.erkanul
         start time: 1443019895200
         final status: UNDEFINED
         tracking URL:BDAURL/proxy/application_1441267974514_0928/
         user: erkanul
15/09/23 17:51:38 INFO YarnClientSchedulerBackend: Application application_1441267974514_0928 has started running.
...
15/09/23 17:51:41 INFO SparkILoop: Created spark context..
Spark context available as sc.
...
SQL context available as sqlContext.


scala>




As you can see, our Spark client finally connected to the target BDA system.
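As a further smoke test against the same configuration, you can submit the bundled SparkPi example (a sketch; the examples jar path assumes the RPM layout under /usr/lib/spark and may differ on your system):

$spark-submit --master yarn-client \
  --class org.apache.spark.examples.SparkPi \
  /usr/lib/spark/lib/spark-examples*.jar 10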

After that, we can point Spark at another target BDA with the following changes in the environment; a small helper for switching is sketched after the exports.
Remember to make the spark-defaults.conf file identical to the one on that target BDA system.

export HADOOP_CONF_DIR=/tmp/TESTCONF/hdfs/hadoop-conf
export HIVE_CONF_DIR=/tmp/TESTCONF/hive/hive-conf
export YARN_CONF_DIR=/tmp/TESTCONF/yarn/yarn-conf
export SPARK_CONF_DIR=/tmp/TESTCONF/spark/spark-conf
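To avoid exporting the four variables by hand every time, a small shell function can switch between the client configuration sets (a sketch, assuming the /tmp/PRODCONF and /tmp/TESTCONF layout used above):

# Usage: use_bda PRODCONF   or   use_bda TESTCONF
use_bda() {
  local base=/tmp/$1
  export HADOOP_CONF_DIR=$base/hdfs/hadoop-conf
  export HIVE_CONF_DIR=$base/hive/hive-conf
  export YARN_CONF_DIR=$base/yarn/yarn-conf
  export SPARK_CONF_DIR=$base/spark/spark-conf
  export SPARK_HOME=/usr/lib/spark
  export HADOOP_HOME=/usr/lib/hadoop
}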

Ok, that is all

Thanks for reading.

Enjoy & share.

Source:
http://blog.cloudera.com/blog/category/spark/


