Google BigQuery

Configure automatic daily deposits of your event and dispatch data from S3 or Google Storage to Google BigQuery

Google BigQuery Data Warehouse Integration

The Google BigQuery Data Warehouse integration provides a powerful solution for storing and analyzing your event, dispatch, and visitor data. We utilize S3 or Google Cloud Storage as an intermediary storage layer, allowing you to easily import the data into BigQuery using Google's transfer integrations, enabling efficient querying and analysis of your complete event dataset.

How the Integration Works

  • Scheduled Cloud Storage Deposits: Events and dispatches are automatically deposited into your S3 bucket or Google Cloud Storage bucket on your chosen sync frequency — daily by default, or hourly on request. Hourly delivery seals each completed hour into its own immutable folder under a live/ prefix — files are written once and never overwritten, so it loads cleanly with the BigQuery Data Transfer Service and needs no extra bucket permissions (see Sync Frequency and BigQuery). Contact support@oursprivacy.com to enable it.
  • Visitor Updates: Visitor data contains records where the last seen timestamp is greater than or equal to yesterday, requiring upsert processing
  • Complete Data: All events and dispatches from your account are included in the deposits
  • Flexible Import: You can configure BigQuery to read directly from your S3 bucket or Google Cloud Storage bucket
  • Efficient Querying: BigQuery's SQL-based querying engine allows for fast, in-place analysis of your data
  • Scalable Storage: BigQuery's infrastructure handles large volumes of data efficiently

More Information

Data Organization

Source Structure

your-bucket/
  ├── events/                       # daily (canonical, sealed day)
  │   └── YYYY/MM/DD/*.parquet
  ├── dispatches/
  │   └── YYYY/MM/DD/*.parquet
  ├── visitors/
  │   └── YYYY/MM/DD/*.parquet
  └── live/                         # hourly (opt-in, one immutable folder per hour)
      ├── events/
      │   └── YYYY/MM/DD/HH/*.parquet
      ├── dispatches/
      │   └── YYYY/MM/DD/HH/*.parquet
      └── visitors/
          └── YYYY/MM/DD/HH/*.parquet

The daily feed (events/, dispatches/, visitors/) is always present and is the canonical record for each finished day. The live/ tree only appears when hourly sync is enabled.

Data Processing Considerations

Events and Dispatches

Events and dispatches are complete daily snapshots containing all data for that day. Each day's parquet files contain all events and dispatches that occurred on that specific date.

Visitors

Visitor data contains records that have been recently updated. This means you'll need to implement an upsert process to merge this incremental data into your BigQuery tables:

  1. Read the parquet files from the visitors directory for the current day
  2. Identify existing records in your BigQuery table using visitor identifiers
  3. Update existing records with new information from the parquet files
  4. Insert new records for visitors that don't exist in your table
  5. Handle conflicts based on your business logic (e.g., latest timestamp wins)

This incremental approach ensures you have the most up-to-date visitor information while maintaining data consistency across your BigQuery tables.

Sync Frequency and BigQuery

Warehouse Destinations sync daily by default and can opt into an hourly feed. Both are designed to work with the BigQuery Data Transfer Service's default append (WRITE_APPEND) behavior:

  • Daily writes one complete, sealed snapshot per day into events/YYYY/MM/DD/ (and the same for dispatches/visitors). Each file is written once and never modified, so a daily transfer loads each day exactly once.
  • Hourly writes each completed hour into its own new, immutable folder under live/events/YYYY/MM/DD/HH/. Because we never rewrite or overwrite a file, the Data Transfer Service — which reloads a file only when its modification time changes — loads each hour exactly once and never produces duplicates.

This is a deliberate design choice. The Data Transfer Service in append mode will re-load any file whose modification time changes, so an "overwrite the current day in place" scheme would silently duplicate rows on every run. By sealing immutable per-hour folders instead, the hourly feed stays safe for the default append configuration — you do not need WRITE_TRUNCATE or any special transfer setup.

Point your daily table's transfer at the events/, dispatches/, and visitors/ prefixes. If you also want the low-latency feed, set up a separate transfer against the live/ prefixes. Each hour is sealed about 15 minutes after it closes, so the live/ feed runs roughly an hour behind real time; the daily feed remains the canonical, complete record for each finished day.

Getting Started

To set up the Google BigQuery integration:

  1. Contact your account manager to enable the integration and provide you with the required permissions for your cloud storage bucket
  2. Configure your S3 bucket or Google Cloud Storage bucket to receive the daily data deposits
  3. Set up BigQuery access to your cloud storage using one of the methods below

Once configured, your event, dispatch, and visitor data will be automatically deposited into your cloud storage bucket daily, ready for import into BigQuery. Remember to implement the appropriate upsert logic for visitor data to maintain data consistency in your target systems.

Setting Up BigQuery Access to Cloud Storage

BigQuery provides integration options for both S3 and Google Cloud Storage:

Amazon S3 Integration

BigQuery can connect to your S3 bucket to import data directly using the BigQuery Data Transfer Service. For detailed setup instructions, refer to BigQuery's S3 transfer documentation.

Google Cloud Storage Integration

BigQuery also supports Google Cloud Storage for data import. For detailed setup instructions, refer to BigQuery's Google Cloud Storage documentation.

Simple Getting Started Guide

Note: This is the simplest way to get started without daily partitioning. The Ours Privacy data sync is well-configured for partitioning. For advanced partitioning options, see the BigQuery transfer documentation.

Follow these steps to set up your BigQuery integration:

  1. Create a table in BigQuery:

    • Download a sample file from your cloud storage bucket
    • Upload it to BigQuery to create your initial table structure
  2. Configure data transfer:

    • Set up the BigQuery Data Transfer Service to automatically populate your table daily
      • The File format is "PARQUET"
      • The source URI will depend on your cloud storage provider
        • Note: If you want to sync dispatches and visitors, you will have to setup separate data transfers for each data type (events, dispatches, and visitors).
        • Note: You can introduce data parameters to the source URI by following the BigQuery partitioning documentation. But, the easiest way to get setup is to just configure the wildcard pattern.

Below are examples of the table creation and data transfer configuration interfaces:

Example: Table creation via file upload


Example: Data transfer configuration

How is this guide?

On this page