BigQuery Part 1: Getting Started
BigQuery, one of the many services available on Google Cloud Platform, is an enterprise data warehouse that lets you run fast queries on very large datasets using Google’s infrastructure. You can get started with it free of charge using one of the many publicly available datasets built into GCP.
In this installment, we’ll take a look at setup and basic usage. I put together a Jupyter notebook to demonstrate. The ‘Open in Colab’ link will take you to the interactive Colaboratory version, which runs in your browser. You may also need to use that link if the Gist embed below isn’t working (notebook embeds on Gist have occasionally been acting up lately).
View this notebook on Google Colaboratory
Getting started with the BigQuery API
We want to start exploring the BigQuery public datasets. Let’s begin by walking through the required setup steps, and then we can load and explore some data. Note that the procedure for using BigQuery outside of the Colaboratory notebook environment is different and will be described in a separate article.
Preliminaries
Follow this quickstart guide, which will explain how to:
- Create a Cloud Platform project if you don’t have one already.
- Enable billing for the project
- Enable the BigQuery API
Provide credentials to the runtime
Now we need to authenticate to gain access to the BigQuery API. The first time you run this cell, you will be given a link to a page with your authentication key, which you will need to paste back into the notebook.
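A minimal sketch of that cell, using the standard Colab authentication helper:

```python
from google.colab import auth

# Opens a link to an authentication page; paste the resulting key back in when prompted.
auth.authenticate_user()
print('Authenticated')
```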
Now that we’re authenticated, we need to load the BigQuery package, along with the google.colab.data_table package, which can display large pandas DataFrames as interactive data tables. Loading data_table is optional, but it will be useful for working with data in pandas.
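Something along these lines, assuming we enable the optional data table display and import the BigQuery client library:

```python
# Optional: render large pandas DataFrames as interactive tables in Colab.
%load_ext google.colab.data_table

from google.cloud import bigquery
```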
Typical API Usage (through google-cloud-bigquery)
Now we can create a client, specifying the project ID (replace ‘bigquery-test-project-256220’ with your own project ID, which you can find in the GCP console):
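For example (the project ID below is only a placeholder):

```python
project_id = 'bigquery-test-project-256220'  # replace with your own project ID
client = bigquery.Client(project=project_id)
```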
With the client created for our project, we can set up dataset references for two of the public datasets: Hacker News and Bitcoin.
See the full list of public datasets if you want to explore others.
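A sketch of the references, assuming the public dataset IDs ‘hacker_news’ and ‘crypto_bitcoin’ (the variable names are illustrative):

```python
hn_dataset_ref = bigquery.DatasetReference('bigquery-public-data', 'hacker_news')
btc_dataset_ref = bigquery.DatasetReference('bigquery-public-data', 'crypto_bitcoin')
```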
Now that we have dataset references, we can load the actual datasets and confirm their type.
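The loading step is roughly:

```python
hn_dataset = client.get_dataset(hn_dataset_ref)
btc_dataset = client.get_dataset(btc_dataset_ref)

type(btc_dataset)
```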
google.cloud.bigquery.dataset.Dataset
Use client.list_tables, along with a list comprehension, to get the table names in both datasets:
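For example:

```python
hn_tables = [table.table_id for table in client.list_tables(hn_dataset)]
btc_tables = [table.table_id for table in client.list_tables(btc_dataset)]
print(hn_tables, btc_tables)
```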
['comments', 'full', 'full_201510', 'stories'] ['blocks', 'inputs', 'outputs', 'transactions']
Let’s take a closer look at the ‘blocks’ table in the bitcoin dataset:
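One way to do that is to fetch the table and list its public attributes, along these lines:

```python
btc_blocks_ref = btc_dataset_ref.table('blocks')
btc_blocks = client.get_table(btc_blocks_ref)

# Show the table object's public attributes and methods.
[attr for attr in dir(btc_blocks) if not attr.startswith('_')]
```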
['clustering_fields', 'created', 'dataset_id', 'description', 'encryption_configuration', 'etag', 'expires', 'external_data_configuration', 'friendly_name', 'from_api_repr', 'from_string', 'full_table_id', 'labels', 'location', 'modified', 'num_bytes', 'num_rows', 'partition_expiration', 'partitioning_type', 'path', 'project', 'reference', 'schema', 'self_link', 'streaming_buffer', 'table_id', 'table_type', 'time_partitioning', 'to_api_repr', 'to_bqstorage', 'view_query', 'view_use_legacy_sql']
Let’s examine the schema to get more information:
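The schema is exposed as an attribute of the table object:

```python
btc_blocks.schema
```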
[SchemaField('hash', 'STRING', 'REQUIRED', 'Hash of this block', ()),
 SchemaField('size', 'INTEGER', 'NULLABLE', 'The size of block data in bytes', ()),
 SchemaField('stripped_size', 'INTEGER', 'NULLABLE', 'The size of block data in bytes excluding witness data', ()),
 SchemaField('weight', 'INTEGER', 'NULLABLE', 'Three times the base size plus the total size. https://github.com/bitcoin/bips/blob/master/bip-0141.mediawiki', ()),
 SchemaField('number', 'INTEGER', 'REQUIRED', 'The number of the block', ()),
 SchemaField('version', 'INTEGER', 'NULLABLE', 'Protocol version specified in block header', ()),
 SchemaField('merkle_root', 'STRING', 'NULLABLE', 'The root node of a Merkle tree, where leaves are transaction hashes', ()),
 SchemaField('timestamp', 'TIMESTAMP', 'REQUIRED', 'Block creation timestamp specified in block header', ()),
 SchemaField('timestamp_month', 'DATE', 'REQUIRED', 'Month of the block creation timestamp specified in block header', ()),
 SchemaField('nonce', 'STRING', 'NULLABLE', 'Difficulty solution specified in block header', ()),
 SchemaField('bits', 'STRING', 'NULLABLE', 'Difficulty threshold specified in block header', ()),
 SchemaField('coinbase_param', 'STRING', 'NULLABLE', 'Data specified in the coinbase transaction of this block', ()),
 SchemaField('transaction_count', 'INTEGER', 'NULLABLE', 'Number of transactions included in this block', ())]
The table schema gives us column names, types, modes, and descriptions, and (for the weight column) even a link to a GitHub repo.
The schema is a necessary input for one of the more common and useful BigQuery commands: list_rows(), which returns a slice of the table without scanning any other section of it. For that reason, it is often preferable to using a LIMIT clause in your query.
We want to select only a subset of the columns, but the selected_fields parameter requires a schema object as input, so we first need to build a subset of the schema to pass for that parameter.
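A sketch of that step, assuming we only want the hash, number, and timestamp columns:

```python
wanted_columns = ['hash', 'number', 'timestamp']
subset_schema = [field for field in btc_blocks.schema if field.name in wanted_columns]

# Fetch ten rows of just those columns, without scanning the rest of the table.
rows = client.list_rows(btc_blocks, selected_fields=subset_schema, max_results=10)
```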
Let’s take a look at the results, first in raw format (not very easy to look at), then in dictionary form (much neater):
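Something like:

```python
row_list = list(rows)
print(row_list)            # raw Row objects

for row in row_list:
    print(dict(row))       # one dictionary per row
```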
[Row(('000000000000000000824a2cbabfd94ded8a80c4179a9e662eebb205a6d22663', 487527, datetime.datetime(2017, 9, 29, 14, 25, 30, tzinfo=<UTC>)), {'hash': 0, 'number': 1, 'timestamp': 2}),
 Row(('0000000000000000000030b69821159aeabd59ce42f32d3c0b10dfa3fab984f1', 487001, datetime.datetime(2017, 9, 26, 5, 55, 18, tzinfo=<UTC>)), {'hash': 0, 'number': 1, 'timestamp': 2}),
 Row(('000000000000000000f6f0d7f30bb0c820b20c0a238d6b13747fb34b1ffda97d', 487132, datetime.datetime(2017, 9, 27, 2, 47, 7, tzinfo=<UTC>)), {'hash': 0, 'number': 1, 'timestamp': 2}),
 Row(('000000000000000000d0eaefb69506912073c17d0388727bd3c1463ecd2ad2a2', 486707, datetime.datetime(2017, 9, 24, 0, 58, 8, tzinfo=<UTC>)), {'hash': 0, 'number': 1, 'timestamp': 2}),
 Row(('0000000000000000007908cf4cab8d18b29bd5aa2b292a414f6c10675f9984eb', 487282, datetime.datetime(2017, 9, 28, 3, 58, 8, tzinfo=<UTC>)), {'hash': 0, 'number': 1, 'timestamp': 2}),
 Row(('000000000000000000e330bef17e39727e0ead7bce9b82f57b871be23add429e', 487003, datetime.datetime(2017, 9, 26, 6, 17, 5, tzinfo=<UTC>)), {'hash': 0, 'number': 1, 'timestamp': 2}),
 Row(('0000000000000000008820077a37573e5a3c0d6a8315513e4cccd59462771228', 485973, datetime.datetime(2017, 9, 19, 3, 12, 9, tzinfo=<UTC>)), {'hash': 0, 'number': 1, 'timestamp': 2}),
 Row(('00000000000000000000133426bc20b82bb1827108a626db6bde3c45adf93824', 486014, datetime.datetime(2017, 9, 19, 11, 19, 41, tzinfo=<UTC>)), {'hash': 0, 'number': 1, 'timestamp': 2}),
 Row(('00000000000000000073100b1b3cb9ed537c43666a39e388f86e4484e7c14e30', 486192, datetime.datetime(2017, 9, 20, 20, 1, 14, tzinfo=<UTC>)), {'hash': 0, 'number': 1, 'timestamp': 2}),
 Row(('0000000000000000000f9bc1d8095de15900a39db25b8323c71985778626a1e8', 487118, datetime.datetime(2017, 9, 27, 1, 15, 14, tzinfo=<UTC>)), {'hash': 0, 'number': 1, 'timestamp': 2})]
{'hash': '000000000000000000824a2cbabfd94ded8a80c4179a9e662eebb205a6d22663', 'number': 487527, 'timestamp': datetime.datetime(2017, 9, 29, 14, 25, 30, tzinfo=<UTC>)}
{'hash': '0000000000000000000030b69821159aeabd59ce42f32d3c0b10dfa3fab984f1', 'number': 487001, 'timestamp': datetime.datetime(2017, 9, 26, 5, 55, 18, tzinfo=<UTC>)}
{'hash': '000000000000000000f6f0d7f30bb0c820b20c0a238d6b13747fb34b1ffda97d', 'number': 487132, 'timestamp': datetime.datetime(2017, 9, 27, 2, 47, 7, tzinfo=<UTC>)}
{'hash': '000000000000000000d0eaefb69506912073c17d0388727bd3c1463ecd2ad2a2', 'number': 486707, 'timestamp': datetime.datetime(2017, 9, 24, 0, 58, 8, tzinfo=<UTC>)}
{'hash': '0000000000000000007908cf4cab8d18b29bd5aa2b292a414f6c10675f9984eb', 'number': 487282, 'timestamp': datetime.datetime(2017, 9, 28, 3, 58, 8, tzinfo=<UTC>)}
{'hash': '000000000000000000e330bef17e39727e0ead7bce9b82f57b871be23add429e', 'number': 487003, 'timestamp': datetime.datetime(2017, 9, 26, 6, 17, 5, tzinfo=<UTC>)}
{'hash': '0000000000000000008820077a37573e5a3c0d6a8315513e4cccd59462771228', 'number': 485973, 'timestamp': datetime.datetime(2017, 9, 19, 3, 12, 9, tzinfo=<UTC>)}
{'hash': '00000000000000000000133426bc20b82bb1827108a626db6bde3c45adf93824', 'number': 486014, 'timestamp': datetime.datetime(2017, 9, 19, 11, 19, 41, tzinfo=<UTC>)}
{'hash': '00000000000000000073100b1b3cb9ed537c43666a39e388f86e4484e7c14e30', 'number': 486192, 'timestamp': datetime.datetime(2017, 9, 20, 20, 1, 14, tzinfo=<UTC>)}
{'hash': '0000000000000000000f9bc1d8095de15900a39db25b8323c71985778626a1e8', 'number': 487118, 'timestamp': datetime.datetime(2017, 9, 27, 1, 15, 14, tzinfo=<UTC>)}
What if we had performed a full table scan, instead of using list_rows? How many resources would have been consumed? Let’s find out:
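One way to estimate this is a dry-run query over the same columns (the table path assumes the ‘crypto_bitcoin’ public dataset):

```python
# With dry_run=True the query is validated and costed, but never executed.
dry_run_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

query = """
    SELECT `hash`, number, timestamp
    FROM `bigquery-public-data.crypto_bitcoin.blocks`
"""

dry_run_job = client.query(query, job_config=dry_run_config)
print('A full scan would process {:.2f} GB.'.format(
    dry_run_job.total_bytes_processed / 1e9))
```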
Using BigQuery via magics
There’s another, shorthand way to access data, using the %%bigquery cell magic (replace ‘bigquery-test-project-256220’ with your project ID):
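Assuming the cell magic has been loaded (for example with %load_ext google.cloud.bigquery), a query cell might look like this; the query itself is just an illustration:

```python
%%bigquery --project bigquery-test-project-256220
SELECT `hash`, number, timestamp
FROM `bigquery-public-data.crypto_bitcoin.blocks`
ORDER BY timestamp DESC
LIMIT 10
```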
Here’s the whole notebook as a gist. Click the link at the top to view in Google Colab, so that you can run it interactively: