Jupyter Notebooks are arguably among the most popular tools used by data engineers and data scientists worldwide. Data ETL, machine learning training, experimentation, model testing and model inference can all be done from the Jupyter Notebook interface itself. Notebooks are also excellent at generating visual reports and dashboards. While Jupyter Notebook is a great environment for these tasks, it is not easy to put a notebook into an automated pipeline that performs them on a recurring basis. Yet reports, dashboards and ML models need to be refreshed regularly as new data arrives.
Often people resort to converting their IPython notebook (.ipynb) files into Python script (.py) files, which can be deployed in a pipeline and invoked programmatically on a recurring schedule. Apart from the developer effort the conversion takes, its big drawback is the need to maintain and manage duplicate code bases.
Papermill solves this problem by letting you run a Jupyter Notebook (.ipynb) file as if it were a Python script (.py) file. Netflix is a contributor to the project and a big promoter of the idea of using Jupyter Notebooks in ETL and data pipelines. Papermill supports notebook parameterization, which lets us override, at invocation time, the values of variables the notebook defines as parameters. This opens up a new way of running automated ETL jobs and ML training, where the output notebook becomes a one-stop immutable record of each scheduled run, with report, dashboard, logs and error messages all in one place.
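As a rough illustration of parameterized execution, the sketch below calls Papermill's Python API (papermill.execute_notebook) and writes each run to its own timestamped output notebook. The helper names output_path_for and run_notebook, and the naming scheme for the output file, are assumptions made for this example; they are not part of Papermill or Clouderizer.

```python
from datetime import datetime, timezone
from pathlib import Path


def output_path_for(input_path, run_time):
    """Build a timestamped output path so every run keeps its own executed notebook."""
    src = Path(input_path)
    stamp = run_time.strftime("%Y%m%dT%H%M%SZ")
    return str(src.with_name(f"{src.stem}_{stamp}.ipynb"))


def run_notebook(input_path, parameters, run_time=None):
    """Execute a notebook with overridden parameters and return the output path."""
    # Imported lazily so the path helper works without Papermill installed
    # (assumes `pip install papermill` has been run before calling this).
    import papermill as pm

    run_time = run_time or datetime.now(timezone.utc)
    output_path = output_path_for(input_path, run_time)
    # Papermill injects `parameters` into the notebook and saves the
    # executed copy, including outputs and errors, to output_path.
    pm.execute_notebook(input_path, output_path, parameters=parameters)
    return output_path
```

A call such as run_notebook("etl.ipynb", {"input_dataset_url": "s3://example-bucket/data.csv"}) would then execute the notebook with that value in place of the default; the bucket name here is a placeholder.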
Clouderizer supports deploying Jupyter Notebooks as serverless functions using Papermill. There is no need to convert your .ipynb file to Python. Any Jupyter Notebook can be deployed to scalable serverless infrastructure with a single CLI command, in under 2 minutes!
*Note: Only Python notebook support is in production right now. R notebook support is in beta; please contact us if you want early access to it. If you need support for other kernels, please send us your request at email@example.com.
Deploy a Python ETL notebook (etl.ipynb) to Clouderizer as a serverless function. The notebook takes one S3 URL as input, from which it loads the data. This input parameter is held in a notebook variable named input_dataset_url.
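For a variable like input_dataset_url to be overridable, Papermill's usual convention is that the notebook assigns it a default in a cell tagged "parameters"; Papermill then adds a new cell right after it containing the values supplied at invocation. The sketch below shows roughly what those two cells could look like inside etl.ipynb. Both S3 URLs are hypothetical placeholders, and whether Clouderizer needs the tag to be present is an assumption here, not something stated above.

```python
# Cell tagged "parameters": the default used when the notebook is run by hand.
input_dataset_url = "s3://example-bucket/sample.csv"  # hypothetical placeholder

# Cell added by Papermill right after it, holding the value passed at invocation.
input_dataset_url = "s3://example-bucket/2024-01-02.csv"  # hypothetical injected value

# Every later cell sees the injected value, not the default.
print(f"Loading data from {input_dataset_url}")
```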
Deploy a TensorFlow deep learning notebook (tf_deeplearning.ipynb) as a serverless function with GPU support. The deployment takes two inputs: an S3 URL for the input dataset and an integer batch size. Both are defined as variables in the notebook, named input_dataset_url and batch_size. The notebook also generates a model file, which it exports to the local folder path held in the variable outputDir.
*Note the callback URL provided in the HTTP header.
The above is an asynchronous invocation. It returns immediately with a 202 Accepted HTTP response code. Once execution is complete, the callback URL specified in the request is called with the result, which contains the S3 URL of the model file generated during the notebook invocation.
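On the receiving side, the callback URL has to point at an endpoint you control. The sketch below is one minimal way to stand up such an endpoint with Python's standard library. It assumes the callback arrives as an HTTP POST with the result in the request body, which is not confirmed above, and it stores the raw body instead of parsing it because the payload format is not documented here. The names CallbackHandler, start_callback_server and received are made up for this example.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

received = []  # raw callback bodies, in arrival order


class CallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Assumption: the result is delivered in the body of a POST request.
        length = int(self.headers.get("Content-Length", 0))
        received.append(self.rfile.read(length))
        self.send_response(200)
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the example quiet; drop this override to see request logs


def start_callback_server(port=0):
    """Start the receiver in a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), CallbackHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

In a real deployment the receiver would need to be reachable from the public internet, and you would read the S3 URL of the model file out of the stored body once its format is known.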