Support for AWS Glue DynamicFrame #781

catworlddomination · 2024-01-18T23:29:23Z

When adding the Spline agent bundle to an AWS Glue Python script (Spark 3.3, Python 3), lineage is produced when using the standard patterns like df = spark.read.csv(file_path, header=True, inferSchema=True) and df.write... as expected.

However, AWS Glue does have a concept of Dynamic Frames usage of which which looks something like

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Script generated for node Amazon S3
f0 = glueContext.create_dynamic_frame.from_options(
    format_options={
        "quoteChar": -1,
        "withHeader": True,
        "separator": ",",
        "optimizePerformance": False,
    },
    connection_type="s3",
    format="csv",
    connection_options={
	## Need to replace with S3 path with file name
        "paths": ["s3://...."],
        "recurse": True,
    },
    transformation_ctx="f0",
)

# Script generated for node Amazon S3
f1 = glueContext.write_dynamic_frame.from_options(
    frame=f0,
    connection_type="s3",
    format="csv",
    connection_options={
	## Need to replace with S3 path
        "path": "s3://....",
        "partitionKeys": [],
    },
    transformation_ctx="f1",
)

job.commit()

Can Spline support this dynamic frame pattern in AWS Glue? I used the spark-3.3-spline-agent-bundle_2.12-2.0.0.jar bundle - Spline agent initialized successfully, but could not produce lineage.

The text was updated successfully, but these errors were encountered:

wajda · 2024-01-20T13:01:33Z

I didn't try it specifically, but from the AWS doc on the DynamicFrame there is a chance that it would not work. The crucial thing for Spline agent is the existence of the internal Spark write event that the agent can intercept and grab the execution plan from it. That only exists in the Spark SQL API, meaning the DataFrame. For instance RDD lineage isn't supported because of that very reason - Spark doesn't provide any usable (for lineage purposes) logical plan on RDD operations. I don't know how exactly the DynamicFrame is implemented (it's closed source), so it's unclear if DynamicFrame operations eventually translate to DataFrame ones or not. If they don't, Spline don't have ability to track them.

wajda · 2024-01-20T13:09:32Z

Try to look at the Spark driver's debug logs carefully. If Spline agent is notified on Glue write events there have to be messages. See:

cerveada added this to Spline Jan 18, 2024

github-project-automation bot moved this to New in Spline Jan 18, 2024

wajda added the feature label Jan 20, 2024

wajda changed the title ~~Can Spline support lineage for AWS glue Spark dynamic frames?~~ Support for AWS Glue DynamicFrame Feb 6, 2024

wajda mentioned this issue Feb 6, 2024

Can Spline support lineage for AWS glue Spark dynamic frames? #786

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support for AWS Glue DynamicFrame #781

Support for AWS Glue DynamicFrame #781

catworlddomination commented Jan 18, 2024 •

edited by wajda

Loading

wajda commented Jan 20, 2024

wajda commented Jan 20, 2024

Support for AWS Glue DynamicFrame #781

Support for AWS Glue DynamicFrame #781

Comments

catworlddomination commented Jan 18, 2024 • edited by wajda Loading

wajda commented Jan 20, 2024

wajda commented Jan 20, 2024

catworlddomination commented Jan 18, 2024 •

edited by wajda

Loading