feat(catalog): add lake formation support in catalog (#810)
* add lake formation support in catalog

---------

Co-authored-by: vgkowski <[email protected]>
vgkowski and vgkowski authored Feb 12, 2025
1 parent cf80fda commit 43f6e51
Showing 24 changed files with 1,864 additions and 189 deletions.
1 change: 1 addition & 0 deletions .gitignore

1 change: 1 addition & 0 deletions .projenrc.ts
@@ -74,6 +74,7 @@ const rootProject = new LernaProject({
    'cdk.out',
    '.DS_Store',
    'LICENSE.bak',
    'framework/test/e2e/mytest.e2e.test.ts',
  ],

  projenrcTs: true,
242 changes: 240 additions & 2 deletions framework/API.md

15 changes: 15 additions & 0 deletions framework/src/governance/README.md
@@ -9,6 +9,7 @@ AWS Glue Catalog database for an Amazon S3 dataset.
- The database default location is pointing to an S3 bucket location `s3://<locationBucket>/<locationPrefix>/`
- The database can store various tables structured in their respective prefixes, for example: `s3://<locationBucket>/<locationPrefix>/<table_prefix>/`
- By default, a database level crawler is scheduled to run once a day (00:01h local timezone). The crawler can be disabled and the schedule/frequency of the crawler can be modified with a cron expression.
- The permission model of the database can be IAM, Lake Formation, or Hybrid.
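
The location rule in the list above can be sketched as a small function. This is a minimal illustration, not the construct's implementation: `databaseLocationUri` is a hypothetical helper, and the slash normalization it performs is an assumption about how a prefix such as `/databasePath` maps to the `s3://<locationBucket>/<locationPrefix>/` form.

```typescript
// Hypothetical helper (not part of the framework): builds the database
// location URI in the form s3://<locationBucket>/<locationPrefix>/.
function databaseLocationUri(bucketName: string, locationPrefix?: string): string {
  // Assumption: leading and trailing slashes are insignificant, so
  // '/databasePath', 'databasePath/' and 'databasePath' are equivalent.
  const trimmed = (locationPrefix ?? '').replace(/^\/+|\/+$/g, '');
  // An empty or missing prefix falls back to the root of the bucket.
  return trimmed.length > 0 ? `s3://${bucketName}/${trimmed}/` : `s3://${bucketName}/`;
}

console.log(databaseLocationUri('my-bucket', '/databasePath')); // s3://my-bucket/databasePath/
console.log(databaseLocationUri('my-bucket')); // s3://my-bucket/
```

Tables would then live one level below this location, for example `s3://my-bucket/databasePath/<table_prefix>/`.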

![Data Catalog Database](../../../website/static/img/adsf-data-catalog.png)

@@ -20,6 +21,20 @@ The AWS Glue Data Catalog resources created by the `DataCatalogDatabase` constru

[example default usage](./examples/data-catalog-database-default.lit.ts)

## Using Lake Formation permission model

You can change the default permission model of the database to use [Lake Formation](https://docs.aws.amazon.com/lake-formation/latest/dg/how-it-works.html) exclusively or [hybrid mode](https://docs.aws.amazon.com/lake-formation/latest/dg/hybrid-access-mode.html).

Changing the permission model to Lake Formation or Hybrid has the following impact:
* The CDK provisioning role is added as a Lake Formation administrator so it can perform Lake Formation operations.
* The `IAMAllowedPrincipals` grant is removed from the database so that Lake Formation is the only permission model enforced (Lake Formation permission model only).
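
The two rules above can be summarized as a small decision function. This is only a sketch of the documented behavior, not the construct's code: the local `PermissionModel` enum is a stand-in for `dsf.utils.PermissionModel`, where only `LAKE_FORMATION` appears in the examples of this commit, so the `IAM` and `HYBRID` member names are assumptions.

```typescript
// Stand-in for dsf.utils.PermissionModel. Only LAKE_FORMATION is confirmed by
// the examples in this commit; IAM and HYBRID are assumed member names.
enum PermissionModel {
  IAM = 'IAM',
  LAKE_FORMATION = 'LAKE_FORMATION',
  HYBRID = 'HYBRID',
}

// Hypothetical shape describing the side effects listed in the README.
interface LakeFormationSideEffects {
  cdkRoleIsLakeFormationAdmin: boolean;
  iamAllowedPrincipalsRevoked: boolean;
}

function sideEffectsOf(model: PermissionModel): LakeFormationSideEffects {
  return {
    // Applies to both Lake Formation and Hybrid.
    cdkRoleIsLakeFormationAdmin: model !== PermissionModel.IAM,
    // Applies to the Lake Formation permission model only.
    iamAllowedPrincipalsRevoked: model === PermissionModel.LAKE_FORMATION,
  };
}

for (const model of [PermissionModel.IAM, PermissionModel.LAKE_FORMATION, PermissionModel.HYBRID]) {
  console.log(model, sideEffectsOf(model));
}
```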

:::caution Lake Formation Data Lake Settings
Lake Formation and Hybrid permission models are configured using the `PutDataLakeSettings` API call. Concurrent calls to this API can be throttled. If you create multiple `DataCatalogDatabase` constructs, we recommend adding dependencies between the `dataLakeSettings` resources exposed by each database so the calls run one after another. See the example in the `DataLakeCatalog` construct [here](https://github.com/awslabs/data-solutions-framework-on-aws/blob/main/framework/src/governance/lib/data-lake-catalog.ts#L137).
:::
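
One way to express the dependency chaining described in the caution above is a small helper that makes each resource depend on the previous one. The sketch below is self-contained and uses stand-ins: `Dependable` mimics the `node.addDependency()` method of CDK constructs, and `chainSequentially` and `FakeSettings` are hypothetical names that are not part of the framework. With real constructs you would pass the `dataLakeSettings` exposed by each database; the exact type and optionality of that property are not shown in this commit and are assumed here.

```typescript
// Minimal structural stand-in for a CDK construct: anything exposing node.addDependency().
interface Dependable {
  node: { addDependency(...deps: Dependable[]): void };
}

// Hypothetical helper: each item depends on the one before it, so the
// underlying PutDataLakeSettings calls are deployed sequentially.
function chainSequentially<T extends Dependable>(items: T[]): void {
  for (let i = 1; i < items.length; i++) {
    items[i].node.addDependency(items[i - 1]);
  }
}

// Fake resource used only to demonstrate the resulting dependency order.
class FakeSettings implements Dependable {
  readonly dependsOn: string[] = [];
  readonly node = {
    addDependency: (...deps: Dependable[]): void => {
      for (const dep of deps) {
        this.dependsOn.push((dep as FakeSettings).name);
      }
    },
  };
  constructor(readonly name: string) {}
}

const first = new FakeSettings('bronze');
const second = new FakeSettings('silver');
const third = new FakeSettings('gold');
chainSequentially([first, second, third]);

console.log(second.dependsOn); // [ 'bronze' ]
console.log(third.dependsOn); // [ 'silver' ]
```

Chaining each resource to its predecessor keeps the dependency graph linear, which is enough to serialize the calls without making every resource depend on all the others.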

[example lake formation permission model](./examples/data-catalog-database-permissions.lit.ts)

## Modifying the crawler behavior

You can change the default configuration of the AWS Glue Crawler to match your requirements:
@@ -0,0 +1,26 @@
// Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
// SPDX-License-Identifier: Apache-2.0

import * as cdk from 'aws-cdk-lib';
import { Bucket } from 'aws-cdk-lib/aws-s3';
import { Construct } from 'constructs';
import * as dsf from '../../index';

/// !show
class ExampleDefaultDataCatalogDatabaseStack extends cdk.Stack {
  constructor(scope: Construct, id: string) {
    super(scope, id);
    const bucket = new Bucket(this, 'DataCatalogBucket');

    new dsf.governance.DataCatalogDatabase(this, 'DataCatalogDatabase', {
      locationBucket: bucket,
      locationPrefix: '/databasePath',
      name: 'example-db',
      permissionModel: dsf.utils.PermissionModel.LAKE_FORMATION,
    });
  }
}
/// !hide

const app = new cdk.App();
new ExampleDefaultDataCatalogDatabaseStack(app, 'ExampleDefaultDataCatalogDatabaseStack');
@@ -0,0 +1,23 @@
// Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
// SPDX-License-Identifier: Apache-2.0

import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as dsf from '../../index';

/// !show
class ExampleDefaultDataLakeCatalogStack extends cdk.Stack {
  constructor(scope: Construct, id: string) {
    super(scope, id);
    const storage = new dsf.storage.DataLakeStorage(this, 'MyDataLakeStorage');

    new dsf.governance.DataLakeCatalog(this, 'DataCatalog', {
      dataLakeStorage: storage,
      permissionModel: dsf.utils.PermissionModel.LAKE_FORMATION,
    });
  }
}
/// !hide

const app = new cdk.App();
new ExampleDefaultDataLakeCatalogStack(app, 'ExampleDefaultDataLakeCatalogStack');
26 changes: 24 additions & 2 deletions framework/src/governance/lib/data-catalog-database-props.ts
@@ -7,6 +7,7 @@ import { IRole } from 'aws-cdk-lib/aws-iam';
import { IKey } from 'aws-cdk-lib/aws-kms';
import { IBucket } from 'aws-cdk-lib/aws-s3';
import { ISecret } from 'aws-cdk-lib/aws-secretsmanager';
import { PermissionModel } from '../../utils';

/**
* Properties for the `DataCatalogDatabase` construct
@@ -24,8 +25,7 @@ export interface DataCatalogDatabaseProps {

/**
* Top level location where table data is stored.
- * The location prefix cannot be empty if the `locationBucket` is set.
- * The minimal configuration is `/` for the root level in the Bucket.
+ * @default - the root of the bucket is used as the location prefix.
*/
readonly locationPrefix?: string;

@@ -87,4 +87,26 @@ export interface DataCatalogDatabaseProps {
* @default - The resources are not deleted (`RemovalPolicy.RETAIN`).
*/
readonly removalPolicy?: RemovalPolicy;

/**
* The permission model to apply to the Glue Database.
* @default - IAM permission model is used
*/
readonly permissionModel?: PermissionModel;

/**
* The IAM Role used by Lake Formation for [data access](https://docs.aws.amazon.com/lake-formation/latest/dg/registration-role.html).
* The role is assumed by Lake Formation to provide temporary credentials to query engines.
* Only needed when `permissionModel` is set to Lake Formation or Hybrid.
* @default - A new role is created
*/
readonly lakeFormationDataAccessRole?: IRole;

/**
* The IAM Role assumed by the construct resources to perform Lake Formation configuration.
* The role is assumed by Lambda functions to perform Lake Formation related operations.
* Only needed when `permissionModel` is set to Lake Formation or Hybrid.
* @default - A new role is created
*/
readonly lakeFormationConfigurationRole?: IRole;
}