Snowplow

Installing and Configuring Snowplow Open Source on Google Cloud

Snowplow is a company based in London that offers a “Behavioral Data Platform” in the cloud and also an open source version. According to the official website, they are the third most used web tracking tool, behind Google Analytics and Facebook.

Motivation for this post

The process of installing and configuring Snowplow Open Source on Google Cloud is quite complex and, to make matters worse, the documentation is very outdated.

The documentation is also open-source, and I have even submitted pull requests to correct some of the errors I found.
Anyway, I documented everything in this article for future reference.
I hope it helps someone else too.

I went through the whole process between December 2022 and January 2023 using a MacBook Pro M1. For my specific use case, I activated both BigQuery and PostgreSQL, which costs around $240 per month.

Usually, only one Data Warehouse is used. In GCP, I recommend BigQuery because:

  • It handles large volumes of data well
  • Some dbt packages, such as snowplow_ecommerce, are not yet compatible with PostgreSQL.
  • It can cost less, depending on usage (even with extra Pub/Sub)
    • BigQuery is charged based on usage, not on online time.
    • The first terabyte of storage is free.

Overview

Explaining the architecture diagram item by item:

  • Trackers: This is where we want to collect information from.
    It is possible to track anything capable of making REST calls, such as websites, mobile apps, and databases.
    The SDKs help a lot in this process.
    The data is sent following an explicit schema that you define yourself.
  • Iglu Server: Compute Engine (with load balancer) and a Cloud SQL database.
    It is a schema repository.
    Schemas are defined in JSON Schema files and can be versioned, for example: YouTube JSON Schema and GA Ecommerce Schema.
  • Stream collector app: Compute Engine (with load balancer).
    It contains the HTTP endpoint that receives raw events, serializes them, and sends them to Pub/Sub topics (sp-raw-topic or sp-bad-1-topic).

If the event is too large, for example, it is sent to the bad data Pub/Sub topic.

  • Enrich PubSub: JVM application on Compute Engine (one or more VMs).
    At this stage, events from the Raw Stream (sp-raw-topic, written by the collector) are read and validated.

    • If the event does not match a schema defined in the Iglu Server, this event is sent to the Bad Stream.
      This prevents false or poor quality data from going into production and also allows analyzing the received schema, which helps to find possible errors.
      In Google Analytics, this data would simply be ignored.
    • If the event matches a schema, it will be “enriched” according to what you configured.
      For example:
      • Split the User Agent into parts:
        • Device type
        • Operating system
        • System version
        • etc.
      • Anonymize IPs
      • Filter personal data (PII)
      • Extend geolocation information

    There are several enrichments available.

    Once “enriched”, the event is sent to the Good Enriched topic (sp-enriched-topic).

  • Loader: loads enriched events into the warehouse tables (BigQuery and/or PostgreSQL).
    If the loader is unable to insert rows into BigQuery, they are sent to a stream of failed inserts and the insertion is attempted again.

  • Mutator: used when a schema needs to be changed or extended. All events go to the same large events table, so its columns have to be kept in sync with the schemas; a careless schema change can therefore cause big problems.
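
These topics only exist once the pipeline (deployed later in this guide) is up, but listing them is then a quick way to see the wiring between the components above; a sketch assuming the sp prefix and the project ID used later in this article:

    # List the Pub/Sub topics created for the pipeline
    # (the names should include sp-raw-topic, sp-bad-1-topic and sp-enriched-topic)
    gcloud pubsub topics list --project=projeto-snowplow --format="value(name)"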

Terraform

Instead of creating and configuring each resource manually through the Google Cloud console or the command line, with Terraform you declaratively define the infrastructure as code in HCL (HashiCorp Configuration Language).

Like any code, Terraform files can be versioned, which makes it easier to make modifications and rollbacks.

There is a Quick Start that deploys using Terraform. Unfortunately, the current modules do not cover all setups, and some of them are quite outdated. Still, I will use them as a base because Terraform reduces the risk of errors and makes it easier to maintain the infrastructure.

Prerequisites

  1. Creating a project

    Create a project on Google Cloud and activate billing. I will use the project ID projeto-snowplow in this article.

  2. Installing Terraform

    brew install terraform
    
  3. Installing Google Cloud CLI

    brew install google-cloud-sdk
    
    $ gcloud version
    Google Cloud SDK 412.0.0
    bq 2.0.83
    core 2022.12.09
    gcloud-crc32c 1.0.0
    gsutil 5.17
    
    gcloud auth login
    

    Add to ~/.zshrc:

    source "$(brew --prefix)/Caskroom/google-cloud-sdk/latest/google-cloud-sdk/path.zsh.inc
    

    This command will add the directory /opt/homebrew/Caskroom/google-cloud-sdk/latest/google-cloud-sdk/bin to the PATH.

  4. Service account

    1. Create a service account named spsetup
      Menu » IAM and admin » Service accounts » Create Service Account

    2. Create a key
      KEYS » Add Key » Key type: JSON
      Save the file as ~/Snowplow/key.json

    3. Point the environment variable to the file path

      export GOOGLE_APPLICATION_CREDENTIALS="/Users/julio/Snowplow/key.json"
      
  5. Activating APIs

    Activate the required APIs for the project through the Google Cloud console (APIs & Services), or from the command line, as sketched below.
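
    A possible command-line alternative is gcloud services enable; the list below assumes the services used in this guide (Compute Engine, Cloud SQL, Pub/Sub, BigQuery, and Cloud Storage), so adjust it to your setup:

    # Enable the APIs used by the resources created in this guide
    gcloud services enable \
        compute.googleapis.com \
        sqladmin.googleapis.com \
        pubsub.googleapis.com \
        bigquery.googleapis.com \
        storage.googleapis.com \
        --project=projeto-snowplow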

  6. VPC and Subnets

    The “secure” configuration of Snowplow requires a VPC. It can be created through the web interface, command line, or Terraform.

    • Through the command line:

      gcloud compute networks create \
          vpc-snowplow \
          --project=projeto-snowplow \
          --description=Snowplow\ VPC \
          --subnet-mode=custom \
          --mtu=1460 \
          --bgp-routing-mode=regional
      
      gcloud compute networks subnets create \
          vpc-snowplow-europe-west3 \
          --project=projeto-snowplow \
          --range=10.0.0.0/24 \
          --stack-type=IPV4_ONLY \
          --network=vpc-snowplow \
          --region=europe-west3 \
          --enable-private-ip-google-access
      
    • Through Terraform:

      # VPC
      # https://www.terraform.io/docs/providers/google/r/compute_network.html
      # Keep the default route (0.0.0.0/0) to allow internet access
      resource "google_compute_network" "snowplow_vpc" {
        name                            = "snowplow-vpc"
        routing_mode                    = "REGIONAL"
        auto_create_subnetworks         = false
        delete_default_routes_on_create = false
      }
      
      # Private Subnet
      # https://www.terraform.io/docs/providers/google/r/compute_subnetwork.html
      resource "google_compute_subnetwork" "snowplow_private_subnet" {
        name                     = "snowplow-private-subnet-europe-west3"
        ip_cidr_range            = "10.0.0.0/24"
        region                   = var.region
        network                  = google_compute_network.snowplow_vpc.id
        private_ip_google_access = true
      }
      
  7. Cloud Router

    • Through the command line:

      gcloud compute routers create router-snowplow \
      --project=projeto-snowplow \
      --description=To\ allow\ access\ to\ the\ internet \
      --region=europe-west3 \
      --network=vpc-snowplow \
      --set-advertisement-mode=custom
      
    • Through Terraform:

      # Cloud Router
      # https://www.terraform.io/docs/providers/google/r/compute_router.html
      resource "google_compute_router" "snowplow_router" {
        name    = "snowplow-router"
        region  = var.region
        network = google_compute_network.snowplow_vpc.id
        bgp {
          asn            = 64514
          advertise_mode = "CUSTOM"
        }
      }
      
  8. Cloud NAT

    Create a NAT gateway. This is how VMs will have internet access.

    • Through Terraform:

      # NAT Gateway
      # https://www.terraform.io/docs/providers/google/r/compute_router_nat.html
      resource "google_compute_router_nat" "snowplow_nat" {
        name                               = "snowplow-nat"
        router                             = google_compute_router.snowplow_router.name
        region                             = google_compute_router.snowplow_router.region
        nat_ip_allocate_option             = "AUTO_ONLY"
        source_subnetwork_ip_ranges_to_nat = "LIST_OF_SUBNETWORKS"
      
        subnetwork {
          name                    = google_compute_subnetwork.snowplow_private_subnet.id
          source_ip_ranges_to_nat = ["ALL_IP_RANGES"]
        }
      
        log_config {
          enable = true
          filter = "ERRORS_ONLY"
        }
      }
      
  9. Java

    brew install openjdk@11
    
    sudo ln -sfn \
      $(brew --prefix)/opt/openjdk@11/libexec/openjdk.jdk \
      /Library/Java/JavaVirtualMachines/openjdk-11.jdk
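
    igluctl (installed in the next step) runs on the JVM, so it is worth checking that the symlinked JDK is the one being picked up:

    # Should report a Java 11 runtime if the symlink above worked
    java -version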
    
  10. Igluctl

    Copy the link of the latest version available at https://github.com/snowplow/igluctl/releases/latest.

    wget https://github.com/snowplow/igluctl/releases/download/0.10.2/igluctl_0.10.2.zip
    unzip igluctl_0.10.2.zip
    rm igluctl_0.10.2.zip
    
    chmod +x igluctl
    mv igluctl ~/bin
    
    igluctl --version
    0.10.2
    

    Move the executable to a directory in the PATH. ~/bin in my case.

Quick Start

  1. Repository

    Clone the repository

    git clone https://github.com/snowplow/quickstart-examples
    
  2. Project Folder

    Copy only what will be used to a new folder:

    cp -r quickstart-examples/terraform/gcp/iglu_server/secure ~/Snowplow/iglu_server
    cp -r quickstart-examples/terraform/gcp/pipeline/secure ~/Snowplow/pipeline
    

    I copied only the secure folders, as they are the ones recommended for production environments. The difference is that Snowplow will be configured inside the VPC we created earlier.

Iglu Server

  1. terraform.tfvars File

    Edit the file ~/Snowplow/iglu_server/terraform.tfvars:

    prefix = "sp"
    project_id = "projeto-snowplow"
    region = "europe-west3"
    network    = "snowplow-vpc"
    subnetwork = "snowplow-private-subnet-europe-west3"
    ssh_ip_allowlist = ["88.217.100.200/32", "35.235.240.0/20"]
    ssh_key_pairs = [
      {
        user_name  = "snowplow"
        public_key = "ssh-ed25519 AAAAC3NzC1lZDI…E5Gv8/gQzPr3H/9B+Nxc/GInziPJX"
      }
    ]
    
    # --- Snowplow Iglu Server
    iglu_db_name     = "iglu"
    iglu_db_username = "iglu"
    iglu_db_password = "SENHA_DO_IGLU_BD"
    iglu_super_api_key = "01234567-890a-bcde-f012-34567890abcd"
    
    user_provided_id  = ""
    telemetry_enabled = false
    
    # --- SSL Configuration (optional)
    ssl_information = {
      certificate_id = ""
      enabled        = false
    }
    
    # --- Extra Labels to append to created resources (optional)
    labels = {}
    

    To generate the SSH key:

    ssh-keygen -t ed25519 -f snowplow -C "Snowplow"
    

    To generate a random UUID: https://duckduckgo.com/?q=uuid
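
    Alternatively, the uuidgen tool (available on macOS and most Linux distributions) generates one locally:

    # Random (version 4) UUID, lowercased to match the examples in this article
    uuidgen | tr '[:upper:]' '[:lower:]'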

  2. Terraform Commands

    cd ~/Snowplow/iglu_server
    terraform init
    terraform plan -out plan
    terraform apply "plan"
    

    Make a note of the output, which will be the Iglu Server IP.
    The output is also saved in the terraform.tfstate:

    $ grep -A1 iglu_server_ip_address terraform.tfstate
        "iglu_server_ip_address": {
          "value": "34.100.100.200",
    
  3. Add schemas to the Iglu server

    git clone https://github.com/snowplow/iglu-central
    cp -r iglu-central/schemas ~/Snowplow/
    rm -rf iglu-central
    
    cd ~/Snowplow
    rm -rf schemas/UNUSED_SCHEMAS
    igluctl static push --public schemas/ http://IGLU_SERVER_IP 01234567-890a-bcde-f012-34567890abcd
    

    Keep only the schemas you will use in the schemas folder.
    The UUID at the end of the command is the iglu_super_api_key we generated earlier.
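
    If you later add custom schemas, they follow the same layout as the iglu-central ones: one self-describing JSON Schema file per version, stored under schemas/<vendor>/<event_name>/jsonschema/<version> (for example schemas/com.example/button_click/jsonschema/1-0-0, where com.example and button_click are made-up names). Before pushing, igluctl can lint the whole folder:

    # Validate all schemas in the folder before pushing them to the Iglu Server
    igluctl lint schemas/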

  4. Inspect Iglu Server status

    The Iglu Server has some endpoints to inspect its own state.

    $ curl http://34.100.100.200/api/meta/health/db
    OK
    
    $ curl -s http://34.100.100.200/api/meta/server \
        -H "apikey: 01234567-890a-bcde-f012-34567890abcd" | jq
    {
      "version": "0.8.7",
      "authInfo": {
        "vendor": "",
        "schema": "CREATE_VENDOR",
        "key": [
          "CREATE",
          "DELETE"
        ]
      },
      "database": "postgres",
      "schemaCount": 0,
      "debug": false,
      "patchesAllowed": true
    }
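
    After the push in step 3, re-running the /api/meta/server call should show a non-zero schemaCount. The schema listing endpoint is another way to check (same API key; jq counts the entries it returns):

    curl -s http://34.100.100.200/api/schemas \
        -H "apikey: 01234567-890a-bcde-f012-34567890abcd" | jq 'length'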
    

Pipeline

  1. File bigquery.terraform.tfvars

    Edit the file ~/Snowplow/pipeline/bigquery.terraform.tfvars:

    prefix = "sp"
    project_id = "projeto-snowplow"
    region = "europe-west3"
    
    network    = "snowplow-vpc"
    subnetwork = "snowplow-private-subnet-europe-west3"
    
    ssh_ip_allowlist = ["88.217.100.200/32", "35.235.240.0/20"]
    
    ssh_key_pairs = [
      {
        user_name  = "snowplow"
        public_key = "ssh-ed25519 AAAAC3NzC1lZDI…E5Gv8/gQzPr3H/9B+Nxc/GInziPJX"
      }
    ]
    
    iglu_server_dns_name = "http://34.100.100.200"
    iglu_super_api_key = "01234567-890a-bcde-f012-34567890abcd"
    
    bigquery_db_enabled  = true
    bigquery_loader_dead_letter_bucket_deploy = true
    bigquery_loader_dead_letter_bucket_name = "sp-bq-loader-dead-letter_bucket"
    
    user_provided_id  = ""
    telemetry_enabled = false
    
    ssl_information = {
      certificate_id = ""
      enabled        = false
    }
    
    labels = {}
    
    ###########################################################################
    
    postgres_db_enabled  = true
    
    postgres_db_name     = "snowplow"
    postgres_db_username = "snowplow"
    postgres_db_password = "TYPE_PG_PASSWORD_HERE"
    
    postgres_db_authorized_networks = [
      {
        name  = "julio"
        value = "88.217.100.200/32"
      }
    ]
    
    postgres_db_tier = "db-g1-small"
    
  2. Terraform Execution

    cd ~/Snowplow/pipeline
    terraform init
    terraform apply -var-file=bigquery.terraform.tfvars
    
  3. Test Events

    Get the collector’s address:

    $ grep -A1 collector_ip_address terraform.tfstate
        "collector_ip_address": {
          "value": "34.111.120.123",
    

    Send the following test events:

    curl 'http://34.111.120.123/com.snowplowanalytics.snowplow/tp2' \
       -H 'Content-Type: application/json; charset=UTF-8' \
       -H 'Cookie: _sp=305902ac-8d59-479c-ad4c-82d4a2e6bb9c' \
       --data-raw \
      '{
        "schema": "iglu:com.snowplowanalytics.snowplow/payload_data/jsonschema/1-0-4",
        "data": [
          {
            "e": "pv",
            "url": "/docs/open-source-quick-start/quick-start-installation-guide-on-aws/send-test-events-to-your-pipeline/",
            "page": "Send test events to your pipeline - Snowplow Docs",
            "refr": "https://docs.snowplow.io/",
            "tv": "js-2.17.2",
            "tna": "spExample",
            "aid": "docs-example",
            "p": "web",
            "tz": "Europe/London",
            "lang": "en-GB",
            "cs": "UTF-8",
            "res": "3440x1440",
            "cd": "24",
            "cookie": "1",
            "eid": "4e35e8c6-03c4-4c17-8202-80de5bd9d953",
            "dtm": "1626182778191",
            "cx": "eyJzY2hlbWEiOiJpZ2x1Om…YS03ZjRlNzk2OTM3ZmEifX1dfQ",
            "vp": "863x1299",
            "ds": "848x5315",
            "vid": "3",
            "sid": "87c18fc8-2055-4ec4-8ad6-fff64081c2f3",
            "duid": "5f06dbb0-a893-472b-b61a-7844032ab3d6",
            "stm": "1626182778194"
          },
          {
            "e": "ue",
            "ue_px": "eyJzY2hlbWEiOiJpZ2x1Om…IlB1cnBsZSBTbm93cGxvdyBIb29kaWUifX19",
            "tv": "js-2.17.2",
            "tna": "spExample",
            "aid": "docs-example",
            "p": "web",
            "tz": "Europe/London",
            "lang": "en-GB",
            "cs": "UTF-8",
            "res": "3440x1440",
            "cd": "24",
            "cookie": "1",
            "eid": "542a79d3-a3b8-421c-99d6-543ff140a56a",
            "dtm": "1626182778193",
            "cx": "eyJzY2hlbWEiOiJpZ2x1Om…lNzk2OTM3ZmEifX1dfQ",
            "vp": "863x1299",
            "ds": "848x5315",
            "vid": "3",
            "sid": "87c18fc8-2055-4ec4-8ad6-fff64081c2f3",
            "duid": "5f06dbb0-a893-472b-b61a-7844032ab3d6",
            "refr": "https://docs.snowplow.io/",
            "url": "/docs/open-source-quick-start/quick-start-installation-guide-on-aws/send-test-events-to-your-pipeline/",
            "stm": "1626182778194"
          }
        ]
      }'
    

    The above command sends one event that is good and one that does not match any schema.
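
    If the events do not show up downstream, a first thing to check is whether the collector itself is healthy; the Stream Collector exposes a /health endpoint (using the IP from above):

    # The collector should answer with an HTTP 200 when it is healthy
    curl -i http://34.111.120.123/health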

  4. BigQuery Verification

    Go to https://console.cloud.google.com/bigquery?project=projeto-snowplow and find the table events in the dataset sp_snowplow_db. Check if there is any data in the PREVIEW tab.

    This is the table for “good data”.

    The pipeline for “bad data” is not included in the Quick Start and must be created manually.
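
    The same check can be done from the command line with the bq tool; a sketch assuming the dataset and table names above (the columns come from Snowplow’s standard atomic event model):

    # Show the most recently loaded events
    bq query --use_legacy_sql=false \
        'SELECT event_id, event_name, derived_tstamp
         FROM `projeto-snowplow.sp_snowplow_db.events`
         ORDER BY derived_tstamp DESC
         LIMIT 10'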

  5. PostgreSQL Verification

    Get the database’s IP:

    $ grep -A1 postgres_db_ip_address terraform.tfstate
      "postgres_db_ip_address": {
        "value": "34.89.100.100",
    

    Connect using a program like DBeaver:

    • Host: 34.89.100.100
    • Database: snowplow
    • Port: 5432
    • Username: snowplow
    • Password: TYPE_PG_PASSWORD_HERE

    The table for “good data” is atomic.events.
    It should be empty, as the enriched stream has already been consumed by the BigQuery loader.
    It is necessary to duplicate the queue (a separate Pub/Sub subscription for each loader) so that PostgreSQL also receives the events.

    There are several tables for “bad data”, one of them is atomic_bad.com_snowplowanalytics_snowplow_badrows_schema_violations_1.
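
    The same checks can be done from the command line with psql, using the connection details above (the password is prompted):

    # Count rows in the "good data" table and one of the "bad data" tables
    psql "host=34.89.100.100 port=5432 dbname=snowplow user=snowplow" \
        -c "SELECT count(*) FROM atomic.events;" \
        -c "SELECT count(*) FROM atomic_bad.com_snowplowanalytics_snowplow_badrows_schema_violations_1;"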

Next Steps

  • Configure domain and enable SSL
  • Create custom schemas
  • Bad data in BigQuery

Julio Batista Silva
Data Engineer

I’m a computer engineer passionate about science, technology, photography, and languages. Currently working as a Data Engineer in Germany.
