
Introduction

Storing and querying large datasets is an essential step in data analysis and predictive modeling. BigQuery is a serverless data warehouse that can store terabytes of data and run fast SQL queries without the user having to manage computing resources. In this post, we will discover how to interact with BigQuery and render the results in an interactive dashboard built with Streamlit.

Tools

BigQuery

BigQuery allows users to leverage a huge amount of computing resources and technologies to run fast SQL queries in a serverless setup. A simple analysis conducted in this article shows that, for a single query, thousands of CPU cores would be needed to match BigQuery's running time.

BigQuery also gives access to a large number of public datasets; a complete list is available here. For this project, we will be using the following datasets:

Name                              Dataset ID
Historical air pollution          bigquery-public-data.epa_historical_air_quality
Climatology data (temperature)    bigquery-public-data.noaa_gsod
Climatology data (precipitation)  bigquery-public-data.ghcn_d

These datasets contain a massive amount of air-pollution and climatology measurements from around the world. We will focus on aggregated statistics for US states.

Streamlit

Streamlit is an app framework designed specifically for data science applications. It is a powerful tool for building rich UIs very quickly. Examples and demos are available on Streamlit's GitHub (https://github.com/streamlit), and the documentation explains the main concepts in detail.

[Image: Streamlit demo, Uber pickups in New York City]
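To give a sense of the API, the snippet below is a minimal, self-contained Streamlit app; the widgets and data are purely illustrative and not part of this project.

import numpy as np
import pandas as pd
import streamlit as st

st.title("Demo dashboard")

# Simple widget: number of points to plot
n_points = st.slider("Number of points", min_value=10, max_value=500, value=100)

# Random data stands in for real query results
df = pd.DataFrame({"x": np.arange(n_points), "y": np.random.randn(n_points).cumsum()})

st.line_chart(df.set_index("x"))

Running streamlit run demo.py starts a local web server and re-renders the page whenever a widget value or the script itself changes.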

Building the dashboard

  • Architecture

The user interacts with the app through a UI built with Streamlit, and SQL queries are run through the BigQuery Python API. After obtaining the query results from the data warehouse, we use Altair and Plotly for the plots.

[Image: Overall architecture]
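As a rough sketch of that last step, a DataFrame returned by a query could be rendered inside the Streamlit app with Altair as follows; the column names here are illustrative and not the actual schema of the public datasets.

import altair as alt
import streamlit as st

def plot_time_series(df):
    """Line chart of a measurement over time, one line per state (illustrative columns)."""
    chart = (
        alt.Chart(df)
        .mark_line()
        .encode(x="date:T", y="value:Q", color="state:N")
    )
    st.altair_chart(chart, use_container_width=True)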

To use the BigQuery API in Python, we need to create a credentials file (a JSON file) and declare its path in the GOOGLE_APPLICATION_CREDENTIALS environment variable. The steps to create the credentials file are explained here: https://cloud.google.com/bigquery/docs/quickstarts/quickstart-client-libraries

  • Running the app

After preparing the credentials file, we can run SQL queries on BigQuery from a Python script and get the results as a Pandas DataFrame (see modules/data_query.py):

import os
from google.cloud import bigquery

# Point the client at the service-account credentials file
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'bigquery-credentials.json'
client = bigquery.Client()

def _fetch_data_bigquery(query):
    """
    Run a Standard SQL query on BigQuery and return the results as a Pandas DataFrame.
    ref: https://cloud.google.com/bigquery/docs/reference/standard-sql/enabling-standard-sql
    """
    return client.query(query, location="US").to_dataframe()
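For example, an aggregation over one of the public datasets can then be fetched in a single call. The table and column names below are taken from the EPA air-quality dataset, but treat the exact schema as an assumption and verify it against the dataset before running the query.

# Average daily PM2.5 per state (schema assumed, verify against the dataset)
query = """
    SELECT state_name, AVG(arithmetic_mean) AS avg_pm25
    FROM `bigquery-public-data.epa_historical_air_quality.pm25_frm_daily_summary`
    GROUP BY state_name
    ORDER BY avg_pm25 DESC
"""
df_pm25 = _fetch_data_bigquery(query)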

Since the queries run on different datasets and are I/O-bound tasks, we can take advantage of multi-threading and run them all concurrently. A helper function run_concurrent_queries(list_queries_fetchers) is available in modules/utils.py.
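The repository's helper is not reproduced here, but a minimal sketch of such a function could look like the following, using a thread pool since the work is I/O bound; the signature mirrors the one mentioned above, and the assumption that the argument is a list of (query, fetcher) pairs is mine.

from concurrent.futures import ThreadPoolExecutor

def run_concurrent_queries(list_queries_fetchers):
    """
    Run several (query, fetcher) pairs concurrently and return the results
    in the same order as the input list.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(list_queries_fetchers))) as executor:
        futures = [executor.submit(fetcher, query) for query, fetcher in list_queries_fetchers]
        return [future.result() for future in futures]

Each fetcher would typically be a thin wrapper such as _fetch_data_bigquery, so the app waits only for the slowest query rather than for the sum of all of them.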

Deployment

A Docker image that contains all the requirements is created as follows:

FROM python:3.7
ADD . /app
WORKDIR /app
RUN pip install -r requirements.txt

# web server
EXPOSE 8080

# Default command to run app
CMD streamlit run --browser.serverAddress='0.0.0.0' --server.port=8080 app.py

After building and running the container with the following commands, the application will be accessible at http://localhost:8080.

# Build the image
docker build -t app .

# Run the container
docker run --rm -p 8080:8080 app:latest

The same steps can be followed to run the application on a virtual machine in the cloud. Additional configuration (e.g., an NGINX reverse proxy) may be needed to set up a domain name.

Conclusion

In this post, we discovered how to use BigQuery to run heavy SQL queries on massive datasets and how to analyze and plot the results in an interactive dashboard that can be deployed on a virtual machine.