# Querying Optimizely Enriched Event data with Apache Spark
In this Lab we'll learn how to query Optimizely Enriched Event data with Apache Spark.
Spark is a powerful, widely-adopted engine for data processing. It's easy to run on a single machine and scales up to handle arbitrarily large workloads. It also works well with the Enriched Events dataset, which is stored using the Apache Parquet format.
In this Lab, we'll assume you're working with a standalone Spark cluster running on your computer. However, this and other Lab notebooks can be modified to work with remote Spark clusters as well.
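If you'd like a preview of what the notebook covers, the sketch below shows a minimal PySpark session reading Parquet-formatted Enriched Event data. The `example_data/type=decisions` path is an assumption based on the Enriched Events partition layout; adjust it to match your data.

```python
# Minimal sketch: read Enriched Event decision data with a local Spark session.
# The path below (example_data/type=decisions) is an assumed partition layout;
# point it at your own Enriched Events directory as needed.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")                  # standalone Spark on this machine
    .appName("enriched-events-preview")
    .getOrCreate()
)

decisions = spark.read.parquet("example_data/type=decisions")
decisions.printSchema()
decisions.show(5, truncate=False)

spark.stop()
```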
## Running this notebook
There are several ways to run this notebook locally:

- Using the `run.sh` script
- Using Docker with the `run-docker.sh` script
- Manually, using the `conda` CLI
## Running the notebook with `run.sh`

You can use the `run.sh` script to build your environment and run this notebook with a single command.
### Prerequisite: conda (version 4.4+)

You can install the `conda` CLI by installing Anaconda or Miniconda.
### Running Jupyter Lab

This lab directory contains a handy script for building your conda environment and running Jupyter Lab. To run it, simply use

```bash
bash bin/run.sh
```
That's it, you're done!
## Running this notebook with Docker
If you have Docker installed, you can run PySpark and Jupyter Lab without installing any other dependencies.
Execute `run-docker.sh` in the `./bin` directory to open Jupyter Lab in a Docker container:

```bash
bash bin/run-docker.sh
```
Note: Docker makes it easy to get started with PySpark, but it adds overhead and may require additional configuration to handle large workloads.
## Running this notebook manually
If you prefer to build and activate your conda environment manually, you can use the `conda` CLI and the environment specification files in the `./lab_env` directory to do so.
### Prerequisite: conda (version 4.4+)

You can install the `conda` CLI by installing Anaconda or Miniconda.
### Building and activating your Anaconda environment

Start by building (or updating) and activating your Anaconda environment. This step will install OpenJDK, PySpark, Jupyter Lab, and other necessary dependencies.

```bash
conda env update --file lab_env/base.yml --name optimizelylabs
conda env update --file lab_env/labs.yml --name optimizelylabs
conda activate optimizelylabs
```
Next, install a Jupyter kernel for this environment:

```bash
python -m ipykernel install --user \
    --name optimizelylabs \
    --display-name="Python 3 (Optimizely Labs Environment)"
```
Finally, start Jupyter Lab in your working directory:

```bash
jupyter lab .
```
## Specifying a custom data directory
The notebook in this lab will load Enriched Event data from `example_data/` in the lab directory. If you wish to load data from another directory, you can use the `OPTIMIZELY_DATA_DIR` environment variable. For example:

```bash
export OPTIMIZELY_DATA_DIR=~/optimizely_data
```
Once `OPTIMIZELY_DATA_DIR` has been set, launch Jupyter Lab using one of the approaches described above. The Lab notebook should load data from your custom directory.
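If you're curious how a notebook can honor this variable, here's a hedged sketch: it falls back to the bundled `example_data` directory when `OPTIMIZELY_DATA_DIR` is unset. The notebook's actual handling may differ.

```python
# Sketch of resolving the data directory from the environment.
# Falling back to "example_data" mirrors this lab's default; the
# notebook's real logic may differ.
import os

data_dir = os.environ.get("OPTIMIZELY_DATA_DIR", "example_data")
print(f"Loading Enriched Event data from: {data_dir}")
```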