DevOps Data engineering
terraform + bigquery
May 13, 2024     4 minute read

Why would you use terraform in data engineering?

terraform is an industry standard for setting up cloud infrastructure. You might wonder, as I do, whether you should store your DDLs in terraform: it is certainly possible, but there are drawbacks, e.g. how do you handle database schema migrations? And how does it cooperate with other tools, like Apache Airflow? There are also interesting alternatives, like dbt or Dataform.

In this article I will not judge whether terraform is a good choice for your particular use case. But it could be.

Prerequisites

GCP project

A “project” is GCP’s abstraction for grouping resources, and each project has one billing account (one debit card attached). In GCP you always work inside a project, so we need to create one. It is very simple: go to the GCP Console, type “projects” into the search bar, choose “Create a project”, provide a name and click “Create”.

For this tutorial I created a project named bigquery-tutorial.

Theoretically we could create the project from terraform as well, but payment methods are usually set up manually, because you rarely change them programmatically afterwards. It is not the developer’s job.
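For completeness, here is what such a project declaration could look like. This is only a sketch: the google_project resource does exist in the google provider, but the project_id, org_id and billing_account values below are placeholders you would have to fill in.

```hcl
# Hypothetical sketch: declaring the project itself in terraform.
# All <...> values are placeholders, not real identifiers.
resource "google_project" "default" {
  name            = "bigquery-tutorial"
  project_id      = "bigquery-tutorial-<id>"  # must be globally unique
  org_id          = "<org-id>"
  billing_account = "<billing-account-id>"
}
```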

gcloud

GCP provides a nice CLI for working with its services: gcloud. terraform itself talks to the GCP APIs directly, but it can use the credentials that gcloud produces. To install it, follow these instructions.

After the installation you need to log in to your Google account with gcloud auth login. You might also need to run gcloud auth application-default login, and gcloud auth application-default set-quota-project bigquery-tutorial.

terraform config

setup

For a primer on terraform, please refer to this article.

terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "4.51.0"
    }
  }
}

locals {
  region = "europe-central2"
}

provider "google" {
  project = "bigquery-tutorial-<id>"
  region  = local.region
}

For this tutorial I chose the europe-central2 region (Warsaw, Poland), because this is where I live. After running terraform init, terraform installs the plugins it needs to work with GCP.

bucket

Creating a bucket is very simple:

resource "google_storage_bucket" "bucket" {
  name          = "some-bucket"  # bucket names are globally unique across all of GCP
  location      = local.region   # one-region bucket
  force_destroy = true

  public_access_prevention = "enforced"
}

You might notice some interesting config options here: force_destroy = true lets terraform delete the bucket even if it still contains objects, and public_access_prevention = "enforced" ensures the bucket can never be exposed publicly.

For further options you might want to refer to the documentation.
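As an illustration, a sketch of a few of those further options. The attribute names come from the google provider’s google_storage_bucket resource; the bucket name and the concrete values are made up.

```hcl
resource "google_storage_bucket" "bucket_with_extras" {
  name     = "some-bucket-with-extras"  # illustrative name
  location = local.region

  # keep previous generations of overwritten objects
  versioning {
    enabled = true
  }

  # delete objects automatically after 30 days
  lifecycle_rule {
    action {
      type = "Delete"
    }
    condition {
      age = 30
    }
  }

  # disable per-object ACLs; access is governed by IAM only
  uniform_bucket_level_access = true
}
```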

BigQuery

terraform allows you to create BigQuery datasets and tables:

resource "google_bigquery_dataset" "default" {
  dataset_id = "cars_per_year"
  location = local.region  # one-region dataset
}

resource "google_bigquery_table" "default" {
  dataset_id = google_bigquery_dataset.default.dataset_id
  table_id   = "cars"
  schema     = file("./cars_schema.json")
  # we want to be able to recreate this table anytime we modify its config
  deletion_protection = false

  external_data_configuration {
    autodetect    = false
    source_format = "CSV"

    csv_options {
      quote             = "\""
      skip_leading_rows = 1  # skips header
    }

    source_uris = [
      "${google_storage_bucket.bucket.url}/*.csv",
    ]
  }
  # no depends_on needed: referencing the dataset's dataset_id above
  # already creates an implicit dependency
}

where we read the table’s schema from a cars_schema.json file:

[
  {
    "name": "year",
    "type": "INTEGER"
  },
  {
    "name": "make",
    "type": "STRING"
  },
  {
    "name": "model",
    "type": "STRING"
  },
  {
    "name": "body_styles",
    "type": "STRING"
  }
]
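After terraform apply it can be handy to echo the created table’s full ID, so you can confirm it from the CLI. A small, optional addition:

```hcl
# Prints something like "projects/<project>/datasets/cars_per_year/tables/cars"
output "cars_table_id" {
  value = google_bigquery_table.default.id
}
```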

Why might terraform not be the best choice for BigQuery?