Live version: https://bigquery.imadelhanafi.com
Github repo: https://github.com/imadelh/Bigquery-Streamlit
Storing and querying large datasets is an important step in data analysis and predictive modeling. BigQuery is a serverless data warehouse that allows storing data (up to terabytes) and running fast SQL queries without worrying about the underlying compute infrastructure. In this post, we will discover how to interact with BigQuery and render the results in an interactive dashboard built with Streamlit.
BigQuery lets users leverage a huge amount of computing resources and technology to run fast SQL queries in a serverless setup. A simple analysis in this article shows that, for a single query, one would need thousands of CPU cores to match BigQuery's running time.
BigQuery gives access to a large number of public datasets. A complete list is available here. For this project, we will use the following datasets:
| Dataset | BigQuery ID |
|---|---|
| Historical air pollution | `bigquery-public-data.epa_historical_air_quality` |
| Climatology data (temperature) | `bigquery-public-data.noaa_gsod` |
| Climatology data (precipitation) | `bigquery-public-data.ghcn_d` |
These datasets contain a massive amount of data about air pollution and climatology measurements from around the world. We will focus on aggregated statistics from US states.
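As an illustration, a query of the following shape can compute yearly averages per state from the air-pollution dataset. The table and column names used here (`pm25_frm_daily_summary`, `state_name`, `arithmetic_mean`, `date_local`) are assumptions about the dataset schema and should be verified in the BigQuery console:

```python
# Example aggregation query against the public air-quality dataset.
# NOTE: the table and column names below are assumptions; check them
# in the BigQuery console before running.
QUERY_PM25_BY_STATE = """
SELECT
  state_name,
  EXTRACT(YEAR FROM date_local) AS year,
  AVG(arithmetic_mean) AS avg_pm25
FROM
  `bigquery-public-data.epa_historical_air_quality.pm25_frm_daily_summary`
GROUP BY
  state_name, year
ORDER BY
  state_name, year
"""

# With credentials configured (see below), this would run as:
# df = client.query(QUERY_PM25_BY_STATE, location="US").to_dataframe()
```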
Streamlit is an app framework specifically designed for Data Science applications. It is a powerful tool to build rich UIs very quickly. Examples and demos are provided on Streamlit’s GitHub https://github.com/streamlit and the documentation explains the main concepts in detail.
Building the dashboard
The user will interact with the app via the UI created using Streamlit and SQL queries will be run using BigQuery Python API. After obtaining the results of the queries from the data warehouse, we use Altair and Plotly for the plots.
To use the BigQuery API in Python, we need to create a credentials file (a JSON key) and declare its path in the environment variable GOOGLE_APPLICATION_CREDENTIALS. To create the credentials file, follow the steps explained here: https://cloud.google.com/bigquery/docs/quickstarts/quickstart-client-libraries
- Running the app
After preparing the credentials file, we can run SQL queries on BigQuery from a Python script and get the results in a Pandas DataFrame (see modules/data_query.py):
```python
import os

from google.cloud import bigquery

# Define credentials
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'bigquery-credentials.json'

client = bigquery.Client()


def _fetch_data_bigquery(query):
    """
    Takes an SQL query in Standard SQL and returns a Pandas DataFrame of results.
    ref: https://cloud.google.com/bigquery/docs/reference/standard-sql/enabling-standard-sql
    """
    return client.query(query, location="US").to_dataframe()
```
Since queries against different datasets are I/O-bound tasks, we can take advantage of multi-threading and run all SQL queries concurrently. A helper function run_concurrent_queries(list_queries_fetchers) is available in modules/utils.py.
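A minimal sketch of such a helper, assuming each fetcher is a zero-argument callable that returns a DataFrame (an assumption about the real helper's signature in modules/utils.py, which may differ):

```python
from concurrent.futures import ThreadPoolExecutor


def run_concurrent_queries(list_queries_fetchers):
    """Run all fetchers concurrently and return their results in input order.

    Each element of list_queries_fetchers is assumed to be a zero-argument
    callable, e.g. lambda: _fetch_data_bigquery(query). Threads work well
    here because the queries are I/O-bound: each thread spends most of its
    time waiting on the BigQuery API.
    """
    with ThreadPoolExecutor(max_workers=len(list_queries_fetchers)) as executor:
        # Submit all fetchers, then collect results in submission order.
        futures = [executor.submit(fetcher) for fetcher in list_queries_fetchers]
        return [future.result() for future in futures]
```

Collecting `future.result()` in submission order keeps each result aligned with its query, so the caller can unpack them positionally.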
A Docker image that contains all requirements is created as follows:
```dockerfile
FROM python:3.7

ADD . /app
WORKDIR /app

RUN pip install -r requirements.txt

# Web server
EXPOSE 8080

# Default command to run the app
CMD streamlit run --browser.serverAddress='0.0.0.0' --server.port=8080 app.py
```
After building and running the container with the following commands, the application will be accessible at http://0.0.0.0:8080.
```bash
# Build the image
docker build -t app .

# Run the container
docker run --rm -p 8080:8080 app:latest
```
The same steps can be followed to run the application on a virtual machine in the cloud. Additional configuration (e.g. an NGINX reverse proxy) may be needed to set up a domain name.
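For example, an NGINX server block of the following shape could forward a domain to the Streamlit port. This is a sketch, not the configuration used for the live version: the domain name is a placeholder, and the websocket headers are included because Streamlit keeps a websocket connection open to the browser:

```nginx
server {
    listen 80;
    server_name example.com;  # placeholder domain

    location / {
        proxy_pass http://127.0.0.1:8080;
        # Streamlit uses websockets; forward the upgrade headers
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }
}
```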
In this post, we discovered how to use BigQuery to run heavy SQL queries on massive datasets and how to plot and analyze the outputs in a dashboard deployed on a virtual machine.