Installing the Application

Download and Install Docker

Download the Docker Toolbox from https://www.docker.com/toolbox and install Docker. After installation you will see Docker CLI and Kitematic icons on your desktop. Before clicking either icon, set the following values in your bash profile and environment configuration.

Type

docker info

to verify that Docker is running.

Open ~/.bash_profile and add:

export DOCKER_TLS_VERIFY="1"
export DOCKER_HOST="tcp://192.168.99.100:2376"
export DOCKER_CERT_PATH="/Users/<Username>/.docker/machine/machines/default"
export DOCKER_MACHINE_NAME="default"
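These variables point the Docker client at the Toolbox VM (192.168.99.100 is the Toolbox default address; `docker-machine ip default` prints yours). As a minimal sketch, the four variables can be checked before launching the CLI; `check_docker_env` is our own illustrative helper, not part of the Toolbox:

```shell
# Sketch: verify the Docker Toolbox variables are exported.
# check_docker_env is an illustrative helper, not part of the Toolbox.
check_docker_env() {
  for var in DOCKER_TLS_VERIFY DOCKER_HOST DOCKER_CERT_PATH DOCKER_MACHINE_NAME; do
    eval "val=\${$var:-}"
    if [ -z "$val" ]; then
      echo "missing: $var"
      return 1
    fi
  done
  echo "docker environment looks complete"
}
```

Run it in the same shell that sourced your .bash_profile; any variable it reports as missing will cause the Docker CLI to fail to reach the VM.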

Open ~/.ssh/config and add the following lines to the configuration file:

Host *

ServerAliveInterval 60

Build Docker Container

$ git clone https://github.com/fluxcapacitor/pipeline.git
$ cd ~/pipeline

[... Make Changes ...]

Click the Docker CLI icon on your desktop. It opens a terminal and starts Docker.

Type

$ docker build -t fluxcapacitor/pipeline .

The build takes roughly 30 minutes to complete.

Once the build succeeds, run the following command to confirm the image was created:

$ docker images

REPOSITORY               TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
fluxcapacitor/pipeline   latest              1b291babb48f        4 hours ago         5.15 GB
ubuntu                   14.04               0a17decee413        6 days ago          188.4 MB
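If you prefer a scripted check over eyeballing the listing, the output can be parsed. A sketch, where `image_exists` is our own hypothetical helper that reads real `docker images` output from stdin:

```shell
# Sketch: succeed only if the named repository appears in `docker images`
# output read from stdin. image_exists is an illustrative helper.
image_exists() {
  awk -v repo="$1" '$1 == repo { found = 1 } END { exit !found }'
}

# Usage: docker images | image_exists fluxcapacitor/pipeline && echo "image built"
```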

Run Docker

To run the container, use the command below.

$ docker run -it -h docker -m 8g -v ~/pipeline/notebooks:/root/pipeline/notebooks -p 30080:80 -p 34042:4042 -p 39160:9160 -p 39042:9042 -p 39200:9200 -p 37077:7077 -p 38080:38080 -p 38081:38081 -p 36060:6060 -p 36061:6061 -p 32181:2181 -p 38090:8090 -p 30000:10000 -p 30070:50070 -p 30090:50090 -p 39092:9092 -p 36066:6066 -p 39000:9000 -p 39999:19999 -p 36081:6081 -p 35601:5601 -p 37979:7979 -p 38989:8989 -p 34040:4040 -p 36379:6379 -p 38888:8888 -p 34321:54321 -p 38099:8099 fluxcapacitor/pipeline bash

This launches the Docker container.
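Each `-p host:container` flag above publishes one service port on the Docker host. To see at a glance which local port reaches which in-container service, the command line can be parsed; `list_port_mappings` is our own illustrative helper:

```shell
# Sketch: print "host container" port pairs for every -p flag in a
# docker run command read from stdin. Illustrative helper only.
list_port_mappings() {
  tr ' ' '\n' | awk 'prev == "-p" { sub(":", " "); print } { prev = $0 }'
}
```

For example, `-p 30080:80` means the service listening on port 80 inside the container is reachable at port 30080 on the Docker host (the Toolbox VM's IP, by default 192.168.99.100).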

Starting the services

At this point, you are inside the Docker container. Keep an eye on the prompt: root@docker$ means that you're inside Docker; otherwise, you're on your local laptop.

Setup the Environment

You must source the following script to set up and start the pipeline services. Don't forget the source below!

root@docker$ cd ~/pipeline && source ~/pipeline/flux-one-time-setup.sh

^^ Don't forget the source above! ^^

Verify that Setup Worked Correctly

Verify that the output of export contains PIPELINE_HOME among many other new exports:

root@docker$ export

If not, you'll need to do the following again:

root@docker$ source ~/.profile
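The environment check above can also be scripted. A minimal sketch, where `require_env` is our own helper name:

```shell
# Sketch: fail with a hint when a required variable such as PIPELINE_HOME
# is missing from the environment. require_env is an illustrative helper.
require_env() {
  if [ -z "$(printenv "$1")" ]; then
    echo "$1 is not set -- re-run: source ~/.profile"
    return 1
  fi
  echo "$1=$(printenv "$1")"
}
```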

Verify that the output of jps -l looks something like this:

root@docker$ jps -l

2082 play.core.server.NettyServer
2914 sun.tools.jps.Jps
2114 io.confluent.kafka.schemaregistry.rest.Main
2115 io.confluent.kafkarest.Main
1816 org.apache.zeppelin.server.ZeppelinServer
778 org.elasticsearch.bootstrap.Elasticsearch
1528 org.apache.zookeeper.server.quorum.QuorumPeerMain
1795 kafka.Kafka
779 org.jruby.Main
2045 org.apache.spark.deploy.history.HistoryServer
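Rather than eyeballing the jps listing, the expected services can be checked mechanically. A sketch, using class names taken from the sample output above; `missing_services` is our own illustrative helper:

```shell
# Sketch: read `jps -l` output on stdin and report any expected pipeline
# service that is absent. missing_services is an illustrative helper.
missing_services() {
  jps_out="$(cat)"
  for svc in kafka.Kafka \
             org.apache.zookeeper.server.quorum.QuorumPeerMain \
             org.apache.zeppelin.server.ZeppelinServer \
             org.elasticsearch.bootstrap.Elasticsearch; do
    case "$jps_out" in
      *"$svc"*) ;;
      *) echo "missing: $svc" ;;
    esac
  done
}

# Usage: jps -l | missing_services
```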



root@docker:~/pipeline# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                1
On-line CPU(s) list:   0
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 42
Stepping:              7
CPU MHz:               1999.999
BogoMIPS:              3999.99
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              6144K

Note: the lscpu output above is from a local Mac (a single-core Docker Toolbox VM) and differs from the documentation standard shown later in this guide.

Install Using Amazon AWS

On an Amazon Linux (EC2) instance, the equivalent steps are:

$ sudo yum update
$ curl -sSL https://get.docker.com/ | sh
$ sudo service docker start
$ sudo docker run hello-world
$ sudo yum install git
$ mkdir -p DeepLearning/fluxcapacitor && cd DeepLearning/fluxcapacitor
$ git clone https://github.com/fluxcapacitor/pipeline.git
$ cd pipeline
$ docker build -t fluxcapacitor/pipeline .

Start Services

At this point, you are inside the Docker container. Keep an eye on the prompt: root@docker$ means that you're inside Docker; otherwise, you're on your local laptop.

Setup the Environment

You must source the following script to set up and start the pipeline services. Don't forget the source below!

root@docker$ cd ~/pipeline && source ~/pipeline/flux-one-time-setup.sh

Don't forget the source above!

Verify that Setup Worked Correctly

Verify that the output of export contains PIPELINE_HOME among many other new exports:

root@docker$ export

If not, you'll need to do the following again:

root@docker$ source ~/.profile

Verify that the output of jps -l looks something like this:

root@docker$ jps -l

2374 kafka.Kafka <-- Kafka Server
3764 io.confluent.kafka.schemaregistry.rest.Main <-- Kafka Schema Registry
2373 org.apache.zookeeper.server.quorum.QuorumPeerMain <-- ZooKeeper
95 -- process information unavailable <-- Either ElasticSearch or Cassandra*
3765 io.confluent.kafkarest.Main <-- Kafka Rest Proxy
3762 play.core.server.NettyServer <-- Spark-Notebook
919 -- process information unavailable <-- Either ElasticSearch or Cassandra*
2435 org.apache.zeppelin.server.ZeppelinServer <-- Zeppelin WebApp
2743 org.apache.spark.deploy.master.Master <-- Spark Master
4074 sun.tools.jps.Jps <-- This jps Process
3599 tachyon.master.TachyonMaster <-- Tachyon Master
3718 tachyon.worker.TachyonWorker <-- Tachyon Worker
2908 org.apache.spark.deploy.worker.Worker <-- Spark Worker
Note that the "process information unavailable" message appears to be an OpenJDK bug.

Verify that the number of CPU cores matches what you expect (at least 4 CPU cores, or things won't work right). Note: knowing this number will help you troubleshoot Spark problems later, as you may hit Spark job resource starvation if you run too many long-running jobs at the same time (e.g. Hive ThriftServer, Spark Streaming, Spark Job Server).

root@docker$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8   <----------- Number of CPU Cores
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Stepping:              4
CPU MHz:               2500.094
BogoMIPS:              5000.18
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0-7
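The 4-core minimum can be checked in a script as well. A sketch, where `enough_cores` is our own helper; feed it the count from `nproc` or the `CPU(s)` line of `lscpu`:

```shell
# Sketch: compare a reported core count against the guide's stated
# minimum of 4 cores. enough_cores is an illustrative helper.
enough_cores() {
  if [ "$1" -ge 4 ]; then
    echo "ok: $1 cores"
  else
    echo "too few cores: $1 (need at least 4)"
  fi
}

# Usage: enough_cores "$(nproc)"
```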

Test from Inside the Docker Container

Kafka Native

root@docker$ kafka-topics --zookeeper 127.0.0.1:2181 --list
_schemas
likes
ratings

Spark Submit

root@docker$ cd ~/pipeline && spark-submit --class org.apache.spark.examples.SparkPi --master spark://127.0.0.1:7077 $SPARK_EXAMPLES_JAR 10 

Spark SQL

root@docker$ cd ~/pipeline && spark-sql --jars $MYSQL_CONNECTOR_JAR

...
spark-sql> show tables;

Cassandra

root@docker$ cqlsh
cqlsh> use fluxcapacitor;

cqlsh:fluxcapacitor> select fromuserid, touserid, rating, batchtime from ratings;

 fromuserid | touserid | rating | batchtime
------------+----------+--------+-----------

(0 rows)

cqlsh:fluxcapacitor> select fromuserid, touserid, batchtime from likes;

 fromuserid | touserid | batchtime
------------+----------+-----------

(0 rows)

cqlsh> describe fluxcapacitor;
...

cqlsh:fluxcapacitor> exit;

ZooKeeper

root@docker$ zookeeper-shell 127.0.0.1:2181

Connecting to 127.0.0.1:2181
Welcome to ZooKeeper!
JLine support is disabled

WATCHER::

WatchedEvent state:SyncConnected type:None path:null

quit

MySQL

root@docker$ mysql -u root -p 
Enter password:  password

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 47
Server version: 5.5.44-0ubuntu0.14.04.1 (Ubuntu)

Copyright (c) 2000, 2015, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql>

Redis

root@docker$ redis-cli
127.0.0.1:6379> ping
PONG

JDBC ODBC Hive ThriftServer

Start the Hive ThriftServer Service

root@docker$ cd ~/pipeline && ./flux-start-hive-thriftserver.sh

Verify that the following 2 new processes have been added:

root@docker$ jps -l

3641 org.apache.spark.executor.CoarseGrainedExecutorBackend <-- Long-running Executor for ThriftServer
3047 org.apache.spark.deploy.SparkSubmit <-- Long-running Spark Submit Process for ThriftServer

Run the following to query with the Beeline Hive client:

    root@docker$ beeline -u jdbc:hive2://127.0.0.1:10000 -n hiveuser -p ''
    0: jdbc:hive2://127.0.0.1:10000> SELECT id, gender FROM genders LIMIT 10;
+-----+---------+
| id  | gender  |
+-----+---------+
| 1   | F       |
| 2   | F       |
| 3   | U       |
| 4   | F       |
| 5   | F       |
| 6   | F       |
| 7   | F       |
| 8   | M       |
| 9   | M       |
| 10  | M       |

+-----+---------+

Make sure that you stop the Hive ThriftServer before continuing, as this process occupies Spark CPU cores and may cause CPU starvation later in your exploration:

root@docker$ cd ~/pipeline && $SPARK_HOME/sbin/flux-spark-submitted-job.sh

Verify that the 2 processes identified above for the Hive ThriftServer have been removed with jps -l.

Start Spark Streaming

This Spark Streaming app receives data from the ratings Kafka topic and writes to the ratings table in Cassandra.

root@docker$ cd $PIPELINE_HOME && ./flux-start-kafka-streaming-ratings.sh

...Starting Ratings Spark Streaming App...