Google Storage

Store your event, dispatch, and visitor data in Google Cloud Storage with daily parquet file deposits

Google Cloud Storage Data Warehouse Integration

The Google Cloud Storage Data Warehouse integration provides a native way to store your event, dispatch, and visitor data scalably and cost-effectively. Ours Privacy deposits data into your specified Google Cloud Storage bucket daily, in parquet format: every event and dispatch that occurred on your account, plus any visitor records that were recently updated.

How the Integration Works

  • Daily Deposits: Events and dispatches are automatically collected and deposited into your Google Cloud Storage bucket each day
  • Visitor Updates: Each visitor deposit contains records whose last-seen timestamp is on or after the previous day, so it must be upserted into your target system
  • Parquet Format: Data is stored in efficient parquet format, optimized for analytics and querying
  • Complete Data: All events and dispatches from your account are included in the daily deposits
  • Flexible Access: Once in your Google Cloud Storage bucket, you can process, analyze, or move the data as needed

Data Organization

The data in your Google Cloud Storage bucket is organized in a partitioned structure:

gs://your-bucket/
  ├── events/
  │   └── YYYY/
  │       └── MM/
  │           └── DD/
  │               └── *.parquet
  ├── dispatches/
  │   └── YYYY/
  │       └── MM/
  │           └── DD/
  │               └── *.parquet
  └── visitors/
      └── YYYY/
          └── MM/
              └── DD/
                  └── *.parquet

This partitioning by year/month/day makes it easy to:

  • Query specific time periods efficiently
  • Manage data retention policies
  • Process historical data in batches
  • Use partition projections for optimized querying
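For example, a single day or an entire month can be read straight off the partition paths. A minimal sketch in Python, assuming pandas with the pyarrow engine and the gcsfs package for gs:// URLs (the bucket name and dates are placeholders):

```python
import pandas as pd

# Directory reads are recursive with the pyarrow engine, so a month
# (or year) prefix pulls in every partition beneath it.
day = pd.read_parquet("gs://your-bucket/events/2024/06/01/")  # one day
month = pd.read_parquet("gs://your-bucket/events/2024/06/")   # whole month
```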

Data Processing Considerations

Events and Dispatches

Events and dispatches are complete daily snapshots: each day's parquet files contain every event and dispatch that occurred on that date.
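Because each deposit is a full snapshot, loads can be idempotent: replace the target day's partition rather than appending to it. A minimal sketch, with a hypothetical local warehouse path:

```python
import shutil
from pathlib import Path

import pandas as pd

def load_day(day: str) -> None:
    """(Re)load one day's events idempotently: overwrite, never append."""
    src = f"gs://your-bucket/events/{day}/"   # day formatted as "YYYY/MM/DD"
    dest = Path("warehouse/events") / day
    if dest.exists():
        shutil.rmtree(dest)                   # drop any previous load
    dest.mkdir(parents=True)
    pd.read_parquet(src).to_parquet(dest / "data.parquet", index=False)

load_day("2024/06/01")
```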

Visitors

Visitor data contains records that have been recently updated. This means you'll need to implement an upsert process to merge this incremental data into your data lake, warehouse, or database:

  1. Read the parquet files from the visitors directory for the current day
  2. Identify existing records in your target system using visitor identifiers
  3. Update existing records with new information from the parquet files
  4. Insert new records for visitors that don't exist in your system
  5. Handle conflicts based on your business logic (e.g., latest timestamp wins)

This incremental approach ensures you have the most up-to-date visitor information while maintaining data consistency across your analytics infrastructure.
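A minimal upsert sketch in pandas, assuming a visitor_id key column and a last_seen timestamp column (both hypothetical; substitute your actual schema) and a local parquet file as the target table:

```python
import pandas as pd

KEY = "visitor_id"   # assumed unique visitor identifier
TARGET = "warehouse/visitors.parquet"

existing = pd.read_parquet(TARGET)
updates = pd.read_parquet("gs://your-bucket/visitors/2024/06/01/")

# Keep existing rows that are not being replaced, then append the
# fresh records; "latest timestamp wins" resolves any duplicates.
merged = pd.concat([existing[~existing[KEY].isin(updates[KEY])], updates])
merged = (merged.sort_values("last_seen")
                .drop_duplicates(subset=KEY, keep="last"))
merged.to_parquet(TARGET, index=False)
```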

Getting Started

To set up the Google Cloud Storage Data Warehouse integration:

  1. Contact your account manager, who will enable the integration and provide the required policy updates for your Google Cloud Storage bucket
  2. Enter your Google Cloud Storage bucket details in the Ours Privacy App

Once configured, your event, dispatch, and visitor data will be automatically deposited into your Google Cloud Storage bucket daily, ready for your use in analytics, reporting, or other data processing workflows. Remember to implement the appropriate upsert logic for visitor data to maintain data consistency in your target systems.

Data Lake Integration

You can also process your data with other data lake and warehouse tools:

  • BigQuery: Use a scheduled daily load to import the parquet files into BigQuery
  • Databricks: Use the path partitioning for efficient Delta Lake operations
  • Snowflake: External tables can leverage the partitioning structure
  • Apache Spark: Direct parquet reading with partition discovery
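As one example, BigQuery external tables can map the YYYY/MM/DD path segments to partition columns using CUSTOM hive partitioning. A sketch with the google-cloud-bigquery client (the project, dataset, and partition column names are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()

# External parquet table over the events prefix.
external_config = bigquery.ExternalConfig("PARQUET")
external_config.source_uris = ["gs://your-bucket/events/*"]

# Map the YYYY/MM/DD path segments to year/month/day partition columns.
hive = bigquery.HivePartitioningOptions()
hive.mode = "CUSTOM"
hive.source_uri_prefix = (
    "gs://your-bucket/events/{year:INTEGER}/{month:INTEGER}/{day:INTEGER}"
)
hive.require_partition_filter = True  # force partition pruning in queries
external_config.hive_partitioning = hive

table = bigquery.Table("your-project.your_dataset.events_external")
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)
```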

Best Practices

  • Ensure your Google Cloud Storage bucket has appropriate access policies (you will need to contact a member of the Ours Privacy team for this)
  • Consider setting up lifecycle policies to manage data retention
  • Use BigQuery's partition filtering for efficient querying
  • Take advantage of the partitioning structure for cost optimization
  • Consider using BigQuery's external tables for seamless integration
  • Use partition projections when querying across multiple date ranges
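For instance, a retention rule can be attached to the bucket with the google-cloud-storage client (the 365-day window here is an arbitrary example):

```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("your-bucket")

# Delete objects more than 365 days old; adjust to your retention policy.
bucket.add_lifecycle_delete_rule(age=365)
bucket.patch()  # persist the updated lifecycle configuration
```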