5.2 Snowflake and BigQuery connectivity
Summary
5.2 focuses on expanding connectivity of the Harbr Data platform with Snowflake and BigQuery, both as sources for assets and as targets for export. You can now:
Create on-platform assets for Snowflake and BigQuery
Set up recurring updates for those assets to keep data up to date
Export Products or Assets to Snowflake and BigQuery locations (applies to assets that store tables)
Note: The above is currently available on the AWS platform only
Changes in Spaces
Engineering built ‘Open Source’ Space specs (replacing the functionality of the CS-deployed specs like for like)
Improved startup times
Superset dashboard import from a File Asset; customisable Superset icon and favicon
Create an ‘On Platform’ Snowflake asset
In addition to at-source Snowflake assets, you can now create on-platform Snowflake assets.
When creating an asset, you need to provide a connector and select the database and table/schema that will be used to create the asset.
These assets have all the same features as at-source Snowflake assets, but can also be exported, both individually and as part of a product.
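The asset creation form expects the same database, schema, and table identifiers you would use in Snowflake itself. As an illustration only (this is not part of the Harbr platform; the credentials and object names are placeholders), the snowflake-connector-python library can be used to list the tables in a schema before you enter them:

    import snowflake.connector  # pip install snowflake-connector-python

    # Placeholder credentials -- in practice these match the Snowflake
    # connector configured on the platform.
    conn = snowflake.connector.connect(
        account="my_account",
        user="my_user",
        password="my_password",
        warehouse="MY_WAREHOUSE",
    )
    cur = conn.cursor()
    # List the tables in the database/schema the asset will reference.
    cur.execute("SHOW TABLES IN SCHEMA MY_DATABASE.MY_SCHEMA")
    for row in cur.fetchall():
        print(row[1])  # 'name' is the second column of SHOW TABLES output
    cur.close()
    conn.close()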
Update your Snowflake asset
Once a Snowflake ‘On Platform’ asset is created, there are two ways to ensure it has the latest data:
Manual update - by selecting Run Update
Scheduled regular updates - by enabling and editing a schedule, which can be set to repeat at a given time and interval.
Any update refreshes the ‘last refresh date’ for the asset and triggers any exports or tasks that are configured against the asset.
Note that the time for a scheduled update is set in the UTC timezone.
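If you plan schedules in a local timezone, convert the intended run time to UTC before entering it. A minimal sketch using only the Python standard library (the timezone and time are hypothetical):

    from datetime import datetime
    from zoneinfo import ZoneInfo  # Python 3.9+

    # Hypothetical example: a 09:00 New York run time expressed in UTC.
    local_run = datetime(2024, 1, 15, 9, 0, tzinfo=ZoneInfo("America/New_York"))
    utc_run = local_run.astimezone(ZoneInfo("UTC"))
    print(utc_run.strftime("%H:%M"))  # 14:00 -- the value to enter in the schedule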
Export to Snowflake
Any on-platform table asset or product can now be exported to a Snowflake location, which is specified using a Snowflake connector and the database/schema that the data will be exported to.
A new table will be created in the specified database/schema, with a table name specified by the user before export. By default, the names are set to:
Assets: asset_provisionedname
Product: product_provisionedname_asset_provisionedname
The data is overwritten on repeat exports. Exports have been tested up to 400 GB.
Note that the connector selected for export requires permissions that allow the user to create tables; a connection test is performed before the export.
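One simple way to confirm the connector's role can create tables in the target schema is to create and drop a scratch table with the same credentials. This sketch is illustrative only; it is not the platform's own connection test, and the account, credentials, and scratch table name are placeholders:

    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account",
        user="export_user",
        password="my_password",
        warehouse="MY_WAREHOUSE",
        database="MY_DATABASE",
        schema="MY_SCHEMA",
    )
    cur = conn.cursor()
    # If both statements succeed, the role can create tables in the export target.
    cur.execute("CREATE TABLE _export_permission_check (x INT)")
    cur.execute("DROP TABLE _export_permission_check")
    cur.close()
    conn.close()
    print("Connector role can create tables in MY_DATABASE.MY_SCHEMA")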
Create a BigQuery connector
A connector is used to store the credentials used to connect to BigQuery when creating assets or exporting to a BigQuery location.
For a BigQuery connector, you will need to:
Provide basic details, such as name and description
Upload a GCP service account key file
Specify the BigQuery project ID if it wasn’t available in the file
Once the connector is created, a connection test is performed to ensure it is usable.
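As an illustration of what such a test involves (the key file path is a placeholder and this code is not part of the platform), the google-cloud-bigquery library can be used to check a service account key yourself:

    from google.cloud import bigquery  # pip install google-cloud-bigquery

    # Placeholder path to the GCP service account key file you would upload.
    client = bigquery.Client.from_service_account_json("service-account-key.json")

    # The project ID is normally embedded in the key file; this is the value
    # you would otherwise enter manually on the connector form.
    print("Project ID:", client.project)

    # A lightweight call that confirms the credentials are usable.
    for dataset in client.list_datasets():
        print(dataset.dataset_id)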
Create an ‘On Platform’ BigQuery asset
You can now create on platform BigQuery assets.
When creating an asset, you need to provide a connector and select the dataset and table that will be used to create the asset.
Once created, the asset behaves the same way as any other asset and can be added to a product or used in Spaces/Export.
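If you are unsure which dataset and table to select, the same client library can list them. Illustrative only; the dataset and key file names are placeholders:

    from google.cloud import bigquery

    client = bigquery.Client.from_service_account_json("service-account-key.json")

    # List the tables in the dataset the asset will reference.
    for table in client.list_tables("my_dataset"):
        print(table.table_id)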
Update your BigQuery asset
Once a BigQuery ‘On Platform’ asset is created, there are two ways to ensure it has the latest data:
Manual update - by selecting Run Update
Scheduled regular updates - by enabling and editing a schedule, which can be set to repeat at a given time and interval.
Any update refreshes the ‘last refresh date’ for the asset and triggers any exports or tasks that are configured against the asset.
Note that the time for a scheduled update is set in the UTC timezone.
Export to BigQuery
Any on-platform table asset or product can now be exported to a BigQuery location, which is specified using a BigQuery connector and the dataset that the data will be exported to.
A new table will be created in the specified dataset, with a table name specified by the user before export. By default, the names are set to:
Assets: asset_provisionedname
Product: product_provisionedname_asset_provisionedname
Table names cannot contain whitespace characters
The data is overwritten on repeat exports. Exports have been tested up to 400 GB.
Note that the connector selected for export requires permissions that allow the user to create tables; a connection test is performed before the export.
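As with Snowflake, a quick way to confirm the service account behind the connector can create tables in the target dataset is to create and delete a scratch table with the same key. Illustrative only; the dataset, key file, and scratch table names are placeholders:

    from google.cloud import bigquery

    client = bigquery.Client.from_service_account_json("service-account-key.json")

    # Placeholder target: the dataset the export will write into.
    table_id = f"{client.project}.my_dataset._export_permission_check"

    # If both calls succeed, the service account can create tables in the dataset.
    schema = [bigquery.SchemaField("x", "INTEGER")]
    client.create_table(bigquery.Table(table_id, schema=schema))
    client.delete_table(table_id)
    print("Service account can create tables in my_dataset")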
Open Source Space specs
Cluster service upgrades (Dataproc 2.1, EMR 6.11)
Superset upgraded to 2.1
Superset now supports custom icons and favicons
Trino upgraded to v423
RStudio (2023.06.0+421) & R (4.3.1) upgraded
New ‘SQL’ icon linking directly to Superset SQL Lab
Tools installed via Docker (increased stability post-release by preventing downstream library changes)
Jupyter running Spark 3.3.0
New Spark context enabling Trino within Python (see the illustrative sketch after this list)
JupyterLab git integration
Hue, Zeppelin, Hadoop & Spark removed as tool options from portal
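The Trino-from-Python item above refers to the new Spark context provided inside Spaces. As a generic illustration of querying Trino from Python (not the platform's specific API; the host, catalog, and schema are placeholders), the trino client library can be used like this:

    import trino  # pip install trino

    # Placeholder connection details; inside a Space these would point at the
    # cluster's Trino endpoint.
    conn = trino.dbapi.connect(
        host="trino.example.internal",
        port=8080,
        user="analyst",
        catalog="hive",
        schema="default",
    )
    cur = conn.cursor()
    cur.execute("SELECT 1")
    print(cur.fetchall())  # [[1]]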
Improvements
Ability to download the asset dictionary as a CSV file
When creating assets from a task, available tables can now be selected from a dropdown instead of having to enter the path within a space.
Fixed bugs
Organization interaction rules no longer override an Ecosystem Administrator’s ability to see organization names when viewing products on the platform in the manage products view.
When an Ecosystem Administrator visits the manage products view, their organization is selected in the organization filter by default. If they choose a different organization to filter on, that selection is now remembered as they navigate in and out of a product, rather than reverting to their own organization.
When managing subscriptions, the user details are now exposed. Previously this information was presented as ‘Single user’, and to find out who the user was, the subscription manager had to open the subscription management view.
Release Start Date: 7th December 2023
It might take a few days before the release is available in your platform(s).