Catalog and govern Amazon Athena federated queries with Amazon SageMaker Lakehouse

Yesterday, we announced Amazon SageMaker Unified Studio (Preview), an integrated experience for all your data and AI and Amazon SageMaker Lakehouse to unify data – from Amazon Simple Storage Service (S3) to third-party sources such as Snowflake. We’re excited by how SageMaker Lakehouse helps break down data silos, but we also know customers don’t want to compromise on data governance or introduce security and compliance risks as they expand data access.

With this new capability, data analysts can now securely access and query data stored outside S3 data lakes, including Amazon Redshift data warehouses and Amazon DynamoDB databases, all through a single, unified experience. Administrators can now apply access controls at different levels of granularity to ensure sensitive data remains protected while expanding data access. This allows organizations to accelerate data initiatives while maintaining security and compliance, leading to faster, data-driven decision-making.

In this post, we show how to connect to, govern, and run federated queries on data stored in Redshift, DynamoDB (Preview), and Snowflake (Preview). To query our data, we use Athena, which is seamlessly integrated with SageMaker Unified Studio. We use SageMaker Lakehouse to present data to end-users as federated catalogs, a new type of catalog object. Finally, we demonstrate how to use column-level security permissions in AWS Lake Formation to give analysts access to the data they need while restricting access to sensitive information.

Background

As data volumes grow, organizations often employ specialized storage systems to achieve optimal performance and cost-efficiency with different use cases. However, this approach can result in data silos, and makes it challenging to gain insights from data for several reasons. First, end-users often have to set up connections to data sources on their own. This is challenging because of configuration details that vary by source and technical connectivity properties they may not have access to. Second, data sources often have their own built-in access controls, which fragments data governance. Lastly, copying data from one storage system to another for the purposes of analysis adds cost and creates duplication risks.

SageMaker Lakehouse streamlines connecting to, cataloging, and managing permissions on data from multiple sources. It integrates with SageMaker Unified Studio, Athena, and other popular tools to give flexibility to end-users to work with data from their preferred tools.

As you create connections to data, SageMaker Lakehouse creates the underlying catalogs, databases, and tables, and integrates these resources with Lake Formation. Administrators can then define and centrally manage fine-grained access controls on these resources, without having to learn different access management concepts for each data source.

With the right access permissions in place, data discovery and analytics workflows are streamlined. Data analysts no longer need to connect to data sources on their own, saving time and frustration from setting up connectors with configurations that vary by source. Instead, analysts can simply run SQL queries on federated data catalogs, seamlessly accessing diverse data for various needs, which accelerates insights and enhances productivity.

Solution overview

This post presents a solution where a company is using multiple data sources containing customer data. Analysts want to query this data for analytics and AI and machine learning (ML) workloads. However, regulations require personally identifiable information (PII) data to be secured. The following diagram illustrates the solution architecture.

In our use case, an administrator is responsible for data governance and has administrator-level access to data sources – including Redshift, DynamoDB, and Snowflake. Existing regulations require administrators to safeguard sensitive PII data, such as customer mobile phone number, which is stored in multiple places. At the same time, there are business stakeholders in data analyst job functions who need access to these databases because they contain valuable business data that they need access to in order to gain insight on business health.

We will use an administrator account to create connections to Redshift, DynamoDB, and Snowflake, register these as catalogs in SageMaker Lakehouse, and then set up fine-grained access controls using Lake Formation. When complete, we use a data analyst account to query the data with Athena but we will be unable to access the data the role is not entitled to.

Prerequisites

Make sure you have the following prerequisites:

An AWS account with permission to create IAM roles and IAM policies
An AWS Identity and Access Management (IAM) user with an access key and secret key to configure the AWS Command Line Interface (AWS CLI)
Administrator access to SageMaker Lakehouse and the following roles:
A SageMaker Unified Studio domain and two projects using the SQL Analytics profile. To learn more, refer to the Amazon SageMaker Unified Studio Administrator Guide.
- An Admin project will be used to create connections
- A Data Analyst project will be used to analyze data and will include both administrator and analysts as members. Take note of the IAM role in the Data Analyst project from the Project Overview page. This IAM role will be referenced when granting access later on.
Administrator access to one or more of the following data sources, and data sources set up as shown in the appendix A and B:
- Redshift
- DynamoDB
- Snowflake

Set up federated catalogs

The first step is to set up federated catalogs for our data sources using an administrator account. The section below walks you through the end-to-end process with DynamoDB and demonstrates how to query the data when setup is complete. When you are done setting up and exploring the DynamoDB data, repeat these steps for Redshift and Snowflake.

On the SageMaker Unified Studio console, open your project.
Choose Data in the navigation pane.
In the data explorer, choose the plus icon to add a data source.
Under Add a data source, choose Add connection, then choose Amazon DynamoDB.
Enter your connection details, and choose Add data source.

Next, SageMaker Unified Studio connects to your data source, registers the data source as a federated catalog with SageMaker Lakehouse, and displays it in your data explorer.

To explore and query your data, click any SageMaker Lakehouse catalog to view its contents. Use the data explorer to drill down to a table and use the Actions menu to select Query with Athena.

This brings you to the query editor where your sample query is executed. Here, try different SQL statements to better understand your data and to gain familiarity with query development features in SageMaker Unified Studio. To learn more, see SQL analytics in the Amazon SageMaker Unified Studio User Guide.

Similarly, you can setup data source connection for Redshift and Snowflake and query the data. Please refer to Appendix B which contains screenshots capturing the details needed to create the connection and data catalog for Redshift and Snowflake sources.

Set up fine-grained access permissions on federated catalogs

Our next step is to set up access permissions on our federated catalogs. As mentioned in the prerequisites, you have already set up an IAM role with data analyst permissions and a SageMaker Studio data analyst project. We will grant permissions to the data analyst role and SageMaker studio data analyst project role to ensure that access controls you specify are enforced when the data is queried. The following steps show how to set up permissions on a Redshift federated catalog, but the steps are the same for each data source.

Navigate to Lake Formation in the AWS management console as an administrator.
In the Lake Formation console, under Data Catalog in the navigation pane, choose Catalogs. Here, you will see the federated catalogs that were set up previously in SageMaker Unified Studio.
Choose the federated catalog that you wish to set up permissions for. Here, you can see details for the catalog and any associated databases and tables, and manage permissions.
From the Actions menu, choose Grant to grant permissions to the data analyst role and SageMaker studio data analyst project role.
In Catalogs, choose the federated catalog name for the source you wish to grant permissions on.
In Databases, choose your Redshift schema, Snowflake schema, or default for DynamoDB.
In Database permissions, select Describe.
Choose Grant.

The next step is to grant the permission on the tables to the data analyst role and SageMaker studio data analyst project role. For this solution, assume you wish to restrict access to a sensitive column containing the mobile phone number for each customer.

In the Actions menu, choose Grant.
In Catalogs, choose your federated catalog.
In Databases, choose your Redshift schema, Snowflake schema, or default for DynamoDB.
In Tables, choose your tables.
In Table permissions, choose Select.
In Data permissions, choose Column-based access.
In Choose permission filter, choose Include columns.
In Select columns, choose one or more columns.
Choose Grant.

You have successfully set up fine-grained access permissions on your Redshift federated catalog. Repeat these steps to add permissions on your DynamoDB and Snowflake federated catalogs.

Validate fine-grained access permissions on federated catalogs

Now that you have set up federated catalogs with fine-grained access permissions, it’s time to run queries to confirm access permissions are working as expected.

First, access SageMaker Unified Studio using the data analyst role and navigate to your project, select Query Editor from the Build menu, and click on the DynamoDB catalog in the Data explorer. Next, drill down to a table and click Query with Athena to run a sample query. Note how permissions are working as expected because the query result does not include the mobile phone number column that was visible before.

Next, query the Redshift data source and note how the mobile phone number is not included in the query result.

Lastly, query the Snowflake data source and, like the previous examples, note how the result does not include the mobile phone number column.

In this example, we demonstrated how to set up a basic column-level filter to restrict access to sensitive data. However, SageMaker Lakehouse supports a broad range of fine-grained access control scenarios beyond column filters that allow you to meet complex security and compliance requirements across diverse data sources. To learn more, see Managing Permissions.

Clean up

Make sure you remove the SageMaker Lakehouse resources to mitigate any unexpected costs. Start by deleting the connections, catalogs, underlying data sources, projects, and domain that you created for this blog. For additional details, refer to the Amazon SageMaker Unified Studio Administrator Guide.

Conclusion

In this blog post, we utilized fine-grained access controls with federated queries in Athena. We demonstrated how this feature allows flexibility in choosing the right data storage solutions for your needs while securely expanding access to data. We showed how to create federated catalogs and set up access policies with Lake Formation, and then queried data with Athena where we saw permissions enforced on different sources. This approach unified data access controls and streamlined data discovery, saving end-users valuable time. To learn more about federated queries in Athena and the data sources that support fine-grained access controls today, see Register your connection as a Glue Data Catalog in the Athena User Guide.

We encourage you to try fine-grained access controls on federated queries today in SageMaker Unified Studio, and to share your feedback with us. To learn more, see Getting started in the Amazon SageMaker Unified Studio User Guide.

Appendix A: Set up data sources

In this section, we provide the steps to set up your data sources.

Redshift

You can create a new table customer_rs in your current database with columns cust_id, mobile, and zipcode and populate with sample data using the following SQL command:

CREATE TABLE "customer_rs" AS
SELECT 6 AS "cust_id",  66666666 AS "mobile", 6000 as "zipcode"
UNION ALL SELECT 7, 77777777, 7000
UNION ALL SELECT 8,  88888888, 8000
UNION ALL SELECT 9,  99999999, 9000
UNION ALL SELECT 10, 11112222, 1100

DynamoDB

You can create a new table in DynamoDB with the partition key cust_id and the sort key zipcode through AWS CloudShell with the following command:

aws dynamodb create-table \
    --table-name customer_ddb \
    --attribute-definitions \
        AttributeName=cust_id,AttributeType=N \
        AttributeName=zipcode,AttributeType=N \
    --key-schema \
        AttributeName=cust_id,KeyType=HASH \
        AttributeName=zipcode,KeyType=RANGE \
    --provisioned-throughput \
        ReadCapacityUnits=5,WriteCapacityUnits=5 \
    --table-class STANDARD

You can populate the DynamoDB table with the following commands:

aws dynamodb put-item \
    --table-name customer_ddb  \
    --item \
        ‘{“cust_id”: {“N”: “11”}, “zipcode”: {“N”: “2000”}, “mobile”: {“N”: “11113333”}}’

aws dynamodb put-item \
    --table-name customer_ddb  \
    --item \
              ‘{“cust_id”: {“N”: “12”}, “zipcode”: {“N”: “2000”}, “mobile”: {“N”: “22224444”}}’

aws dynamodb put-item \
    --table-name customer_ddb \
    --item \
               ‘{“cust_id”: {“N”: “13”}, “zipcode”: {“N”: “3000”}, “mobile”: {“N”: “33335555”}}’
                            
aws dynamodb put-item \
    --table-name customer_ddb \
    --item \
               ‘{“cust_id”: {“N”: “14”}, “zipcode”: {“N”: “4000”}, “mobile”: {“N”: “55556666”}}’

Snowflake

You can create your database, schema, and tables in Snowflake with the following SQL queries:

use database tasty_bytes_sample_data
create schema "sf_schema"

CREATE TABLE "customer_sf" AS
SELECT 1 AS "cust_id",  11111111 AS "mobile", 1000 as "zipcode" 
UNION ALL SELECT 2, 22222222 , 2000
UNION ALL SELECT 3,  33333333, 3000
UNION ALL SELECT 4,  44444444, 4000
UNION ALL SELECT 5, 55555555, 5000
UNION ALL SELECT 21, 12341234, 1234

Appendix B: Connection Properties for Redshift and Snowflake

Redshift Connection Properties:

Snowflake Connection Properties:

About the Authors

Sandeep Adwankar is a Senior Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.

Praveen Kumar is a Principal Analytics Solution Architect at AWS with expertise in designing, building, and implementing modern data and analytics platforms using cloud-centered services. His areas of interests are serverless technology, modern cloud data warehouses, streaming, and generative AI applications.

Stuti Deshpande is a Big Data Specialist Solutions Architect at AWS. She works with customers around the globe, providing them strategic and architectural guidance on implementing analytics solutions using AWS. She has extensive experience in big data, ETL, and analytics. In her free time, Stuti likes to travel, learn new dance forms, and enjoy quality time with family and friends.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.

Scott Rigney is a Senior Technical Product Manager with AWS and has expertise in analytics, data science, and machine learning. He is passionate about building software products that enable enterprises to make data-driven decisions and drive innovation.