Using Apache Spark in the Jupyter notebook to analyze Twitter
You need several things in place before this repository will be useful.
Visit Oracle's Java SE Downloads page and download the latest Java Platform (JDK) version for your machine.
On my Linux Mint machine, I've downloaded jdk-8u91-linux-x64.tar.gz (173MB) and placed it in /opt. After downloading, I performed the following.
$ sudo gunzip jdk-8u91-linux-x64.tar.gz
$ sudo tar xvf jdk-8u91-linux-x64.tar
$ sudo chown -R jason:jason jdk1.8.0_91
I now have the directory /opt/jdk1.8.0_91, which will become my JAVA_HOME directory later in our configuration.
On the Apache Spark download page, I've downloaded Spark 1.6.0 in the "pre-built for Hadoop 2.6 and later" version (289MB), and I now have spark-1.6.0-bin-hadoop2.6.tgz also sitting in my /opt directory. (I am using version 1.6.0 instead of the recently released 1.6.1 because I've had some memory management issues with the newer version.) I've extracted and chowned it as follows.
$ sudo tar zxvf spark-1.6.0-bin-hadoop2.6.tgz
$ sudo chown -R jason:jason spark-1.6.0-bin-hadoop2.6
I now have the directory /opt/spark-1.6.0-bin-hadoop2.6, which will become my SPARK_HOME directory later in our configuration.
On the Continuum Analytics Anaconda download page, download and install the version of Anaconda appropriate for your machine. (This installation typically involves running a Bash script. See Continuum's instructions.)
At this point, you should have conda callable at the command prompt (from the Anaconda installation), and you should have both Java and Spark installed in directories that we'll call JAVA_HOME and SPARK_HOME, respectively.
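Before going further, it can help to confirm that all three pieces are where you expect them. The following is a minimal sketch, not part of this repo: the helper name check_home is made up for illustration, and the default paths are simply the ones from my machine above, so adjust them for yours.

```shell
# Default paths are assumptions copied from the walkthrough above;
# an already-set JAVA_HOME or SPARK_HOME takes precedence.
java_home="${JAVA_HOME:-/opt/jdk1.8.0_91}"
spark_home="${SPARK_HOME:-/opt/spark-1.6.0-bin-hadoop2.6}"

# Hypothetical helper: report whether a directory contains the expected executable.
# $1 = label, $2 = directory, $3 = executable path relative to that directory
check_home() {
    if [ -x "$2/$3" ]; then
        echo "ok: $1 -> $2"
    else
        echo "missing: $1 ($2/$3 not found)"
    fi
}

check_home JAVA_HOME "$java_home" bin/java
check_home SPARK_HOME "$spark_home" bin/pyspark

# conda only needs to be somewhere on PATH.
if command -v conda >/dev/null 2>&1; then
    echo "ok: conda on PATH"
else
    echo "missing: conda"
fi
```

Any line starting with "missing:" points at the step above that needs another look.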
I have a conda environment created with the required dependencies. Setting that up looks something like the following.
$ conda create -n spark-jupyter python
...
$ source activate spark-jupyter
...
$ pip install matplotlib pandas seaborn simplejson requests requests_oauthlib
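To confirm that the environment picked up everything, you can try importing each package. This is only a sketch: check_imports and the PYTHON_BIN override are names I've made up here, and it assumes the spark-jupyter environment is still activated so that python resolves to the environment's interpreter.

```shell
# PYTHON_BIN is a hypothetical override; by default this uses whichever
# "python" is first on PATH, which should be the activated environment's.
PYTHON_BIN="${PYTHON_BIN:-python}"

# Try to import each named module and report the result, one line per module.
check_imports() {
    for mod in "$@"; do
        if "$PYTHON_BIN" -c "import $mod" >/dev/null 2>&1; then
            echo "ok: $mod"
        else
            echo "missing: $mod"
        fi
    done
}

check_imports matplotlib pandas seaborn simplejson requests requests_oauthlib
```

A "missing:" line usually means the pip install ran outside the activated environment.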
See the script in the bin directory of this repo and set your variables and options as needed. Keep in mind that Jupyter has many settings of its own (which port to run on, whether to open a browser, directory locations, etc.), and you may want to set those in your ~/.jupyter config.
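For orientation, here is a rough sketch of the kind of setup such a launch script performs. It is not a copy of the script in this repo. It assumes the install paths from my machine, relies on Spark's PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS environment variables to start Jupyter as the PySpark driver, and uses two names of my own invention: NOTEBOOK_PORT and launch_notebook.

```shell
# Install locations are assumptions taken from the steps above.
export JAVA_HOME="${JAVA_HOME:-/opt/jdk1.8.0_91}"
export SPARK_HOME="${SPARK_HOME:-/opt/spark-1.6.0-bin-hadoop2.6}"

# Tell bin/pyspark to start Jupyter instead of the plain Python shell.
# NOTEBOOK_PORT is a hypothetical variable; 8888 is Jupyter's usual default port.
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=${NOTEBOOK_PORT:-8888}"

# Hypothetical wrapper: pass any extra arguments straight through to pyspark.
launch_notebook() {
    "$SPARK_HOME/bin/pyspark" "$@"
}
```

With the spark-jupyter environment activated, running launch_notebook --master "local[2]" should start the notebook server with Spark running locally on two threads. Flags given in PYSPARK_DRIVER_PYTHON_OPTS override the corresponding values in your Jupyter config, so settle on one place for each setting.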