Installing the Application
Download and Install Docker
Download Docker Toolbox from https://www.docker.com/toolbox and install it. After installation you will see Docker CLI and Kitematic icons on your desktop. Before clicking either icon, set the following values in your bash profile and environment configuration.
Type the following to make sure that Docker is running:
docker info
Open ~/.bash_profile and add the following:
export DOCKER_TLS_VERIFY="1"
export DOCKER_HOST="tcp://192.168.99.100:2376"
export DOCKER_CERT_PATH="/Users/<Username>/.docker/machine/machines/default"
export DOCKER_MACHINE_NAME="default"
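If you'd rather not hard-code these values, Docker Toolbox's docker-machine can print them for you; a quick sketch, assuming the VM is named default as above:
$ docker-machine env default
# prints the same DOCKER_* exports for the running VM
$ source ~/.bash_profile
# reload the profile so the current shell picks up the values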
Open ~/.ssh/config and add the following lines; ServerAliveInterval keeps long-running SSH sessions to the Docker VM from timing out.
Host *
ServerAliveInterval 60
Build Docker Container
$ git clone https://github.com/fluxcapacitor/pipeline.git
$ cd ~/pipeline
[... Make Changes ...]
Click the Docker CLI icon on your desktop. It will open a terminal and start Docker.
Type
$ docker build -t fluxcapacitor/pipeline .
The build can take around 30 minutes to complete.
Once the build succeeds, type the following command to verify that the image was created:
$ docker images
REPOSITORY               TAG      IMAGE ID       CREATED       VIRTUAL SIZE
fluxcapacitor/pipeline   latest   1b291babb48f   4 hours ago   5.15 GB
ubuntu                   14.04    0a17decee413   6 days ago    188.4 MB
Run Docker
To run the Docker container, type the command below.
$ docker run -it -h docker -m 8g -v ~/pipeline/notebooks:/root/pipeline/notebooks -p 30080:80 -p 34042:4042 -p 39160:9160 -p 39042:9042 -p 39200:9200 -p 37077:7077 -p 38080:38080 -p 38081:38081 -p 36060:6060 -p 36061:6061 -p 32181:2181 -p 38090:8090 -p 30000:10000 -p 30070:50070 -p 30090:50090 -p 39092:9092 -p 36066:6066 -p 39000:9000 -p 39999:19999 -p 36081:6081 -p 35601:5601 -p 37979:7979 -p 38989:8989 -p 34040:4040 -p 36379:6379 -p 38888:8888 -p 34321:54321 -p 38099:8099 fluxcapacitor/pipeline bash
This will launch the Docker container.
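From a second terminal on the host, you can confirm that the container is up and the ports are mapped (standard Docker commands; the container name will vary):
$ docker ps
# the PORTS column should list the 0.0.0.0:3XXXX->container-port mappings from the run command above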
Starting the Services
At this point, you are inside the Docker container. Keep an eye on the prompt: root@docker$ means you're inside Docker; otherwise, you're on your local laptop.
Setup the Environment
You must source the following script to set up and start the pipeline services. Don't forget the source below!
root@docker$ cd ~/pipeline && source ~/pipeline/flux-one-time-setup.sh
^^ Don't forget the source above! ^^
Verify that Setup Worked Correctly
Verify that the output of export contains $PIPELINE_HOME, among many other new exports:
root@docker$ export
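To check just this one variable instead of scanning the full list (plain shell filtering):
root@docker$ export | grep PIPELINE_HOME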
If not, you'll need to do the following again:
root@docker$ source ~/.profile
Verify that the output of jps -l looks something like this:
root@docker$ jps -l
2082 play.core.server.NettyServer
2914 sun.tools.jps.Jps
2114 io.confluent.kafka.schemaregistry.rest.Main
2115 io.confluent.kafkarest.Main
1816 org.apache.zeppelin.server.ZeppelinServer
778 org.elasticsearch.bootstrap.Elasticsearch
1528 org.apache.zookeeper.server.quorum.QuorumPeerMain
1795 kafka.Kafka
779 org.jruby.Main
2045 org.apache.spark.deploy.history.HistoryServer
Check the CPU resources available inside the container:
root@docker:~/pipeline# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 1
On-line CPU(s) list: 0
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 42
Stepping: 7
CPU MHz: 1999.999
BogoMIPS: 3999.99
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 6144K
Note: the lscpu output above is from my local Mac's Docker Toolbox VM (a single CPU core) and differs from the documentation standard shown in the AWS section below.

Install Using Amazon AWS
$ sudo yum update
$ curl -sSL https://get.docker.com/ | sh
$ sudo service docker start
$ sudo docker run hello-world
$ sudo yum install git
$ mkdir -p ~/DeepLearning/fluxcapacitor
$ cd ~/DeepLearning/fluxcapacitor
$ git clone https://github.com/fluxcapacitor/pipeline.git
$ cd pipeline
$ docker build -t fluxcapacitor/pipeline .
Then launch the container with the same docker run command shown in the Run Docker section above (adjusting the -v notebook path to your home directory on the instance) before continuing.
Start Services
At this point, you are inside the Docker container. Keep an eye on the prompt: root@docker$ means you're inside Docker; otherwise, you're on the host machine.
Setup the Environment
You must source the following script to set up and start the pipeline services. Don't forget the source below!
root@docker$ cd ~/pipeline && source ~/pipeline/flux-one-time-setup.sh
Don't forget the source above!
Verify that Setup Worked Correctly
Verify that the output of export contains $PIPELINE_HOME, among many other new exports:
root@docker$ export
If not, you'll need to do the following again:
root@docker$ source ~/.profile
Verify that the output of jps -l looks something like this:
root@docker$ jps -l
2374 kafka.Kafka <-- Kafka Server
3764 io.confluent.kafka.schemaregistry.rest.Main <-- Kafka Schema Registry
2373 org.apache.zookeeper.server.quorum.QuorumPeerMain <-- ZooKeeper
95 -- process information unavailable <-- Either ElasticSearch or Cassandra*
3765 io.confluent.kafkarest.Main <-- Kafka Rest Proxy
3762 play.core.server.NettyServer <-- Spark-Notebook
919 -- process information unavailable <-- Either ElasticSearch or Cassandra*
2435 org.apache.zeppelin.server.ZeppelinServer <-- Zeppelin WebApp
2743 org.apache.spark.deploy.master.Master <-- Spark Master
4074 sun.tools.jps.Jps <-- This jps Process
3599 tachyon.master.TachyonMaster <-- Tachyon Master
3718 tachyon.worker.TachyonWorker <-- Tachyon Worker
2908 org.apache.spark.deploy.worker.Worker <-- Spark Worker
Note that the "process information unavailable" message appears to be an OpenJDK bug.
Verify that the number of CPU cores matches what you expect (at least 4 CPU cores, or things won't work right). Note: knowing this number will help you troubleshoot Spark problems later, as you may hit Spark job resource starvation issues if you run too many long-running jobs at the same time (e.g., Hive ThriftServer, Spark Streaming, Spark Job Server).
root@docker$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8 <----------- Number of CPU Cores
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 62
Stepping: 4
CPU MHz: 2500.094
BogoMIPS: 5000.18
Hypervisor vendor: Xen
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 25600K
NUMA node0 CPU(s): 0-7
Test from Inside the Docker Container
Kafka Native
root@docker$ kafka-topics --zookeeper 127.0.0.1:2181 --list
_schemas
likes
ratings
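As a further smoke test, you can round-trip a message with the stock console clients. A sketch, assuming the Kafka 0.8-era CLI flags used elsewhere in this guide; smoketest is an arbitrary topic name:
root@docker$ kafka-topics --zookeeper 127.0.0.1:2181 --create --topic smoketest --partitions 1 --replication-factor 1
root@docker$ kafka-console-producer --broker-list 127.0.0.1:9092 --topic smoketest
hello
(Ctrl-C to exit the producer, then read the message back:)
root@docker$ kafka-console-consumer --zookeeper 127.0.0.1:2181 --topic smoketest --from-beginning
hello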
Spark Submit
root@docker$ cd ~/pipeline && spark-submit --class org.apache.spark.examples.SparkPi --master spark://127.0.0.1:7077 $SPARK_EXAMPLES_JAR 10
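The SparkPi example simply approximates pi; on success the driver output should include a line like "Pi is roughly 3.14...", confirming that jobs can be submitted to the master on port 7077.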
Spark SQL
root@docker$ cd ~/pipeline && spark-sql --jars $MYSQL_CONNECTOR_JAR
...
spark-sql> show tables;
Cassandra
root@docker$ cqlsh
cqlsh> use fluxcapacitor;
cqlsh:fluxcapacitor> select fromuserid, touserid, rating, batchtime from ratings;
fromuserid | touserid | rating | batchtime
------------+----------+--------+-----------
(0 rows)
cqlsh:fluxcapacitor> select fromuserid, touserid, batchtime from likes;
fromuserid | touserid | batchtime
------------+----------+-----------
(0 rows)
cqlsh> describe fluxcapacitor;
...
cqlsh:fluxcapacitor> exit;
ZooKeeper
root@docker$ zookeeper-shell 127.0.0.1:2181
Connecting to 127.0.0.1:2181
Welcome to ZooKeeper!
JLine support is disabled
WATCHER::
WatchedEvent state:SyncConnected type:None path:null
quit
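While connected (before typing quit), you can also list the top-level znodes as a sanity check; Kafka registers its brokers in ZooKeeper when it is healthy:
ls /
(expect entries such as brokers and zookeeper in the returned list)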
MySQL
root@docker$ mysql -u root -p
Enter password: password
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 47
Server version: 5.5.44-0ubuntu0.14.04.1 (Ubuntu)
Copyright (c) 2000, 2015, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql>
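A simple way to verify the connection is to list the databases (standard MySQL; the exact set will vary by setup):
mysql> show databases;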
Redis
root@docker$ redis-cli
127.0.0.1:6379> ping
PONG
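You can also round-trip a value with the core SET/GET commands (test:key is an arbitrary key name):
127.0.0.1:6379> set test:key hello
OK
127.0.0.1:6379> get test:key
"hello"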
JDBC ODBC Hive ThriftServer
Start the Hive ThriftServer Service
root@docker$ cd ~/pipeline && ./flux-start-hive-thriftserver.sh
Verify that the following 2 new processes have been added:
root@docker$ jps -l
3641 org.apache.spark.executor.CoarseGrainedExecutorBackend <-- Long-running Executor for ThriftServer
3047 org.apache.spark.deploy.SparkSubmit <-- Long-running Spark Submit Process for ThriftServer
Run the following to query with the Beeline Hive client:
root@docker$ beeline -u jdbc:hive2://127.0.0.1:10000 -n hiveuser -p ''
0: jdbc:hive2://127.0.0.1:10000> SELECT id, gender FROM genders LIMIT 10;
+-----+---------+
| id | gender |
+-----+---------+
| 1 | F |
| 2 | F |
| 3 | U |
| 4 | F |
| 5 | F |
| 6 | F |
| 7 | F |
| 8 | M |
| 9 | M |
| 10 | M |
+-----+---------+
Make sure that you stop the Hive ThriftServer before continuing, as this process occupies Spark CPU cores and may cause CPU starvation later in your exploration:
root@docker$ cd ~/pipeline && $SPARK_HOME/sbin/stop-thriftserver.sh
Verify that the 2 processes identified above for the Hive ThriftServer have been removed with jps -l.
Start Spark Streaming
This Spark Streaming app receives data from the ratings Kafka topic and writes to the ratings table in Cassandra.
root@docker$ cd $PIPELINE_HOME && ./flux-start-kafka-streaming-ratings.sh
...Starting Ratings Spark Streaming App...
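To watch the stream end-to-end, you can publish a test event to the ratings topic and then check Cassandra. Note: the comma-separated fromuserid,touserid,rating payload below is an assumption for illustration; check the streaming app's source for the actual message format it expects:
root@docker$ kafka-console-producer --broker-list 127.0.0.1:9092 --topic ratings
1,2,5
root@docker$ cqlsh
cqlsh> select fromuserid, touserid, rating from fluxcapacitor.ratings;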