When adding the Spline agent bundle to an AWS Glue Python script (Spark 3.3, Python 3), lineage is produced when using the standard patterns like df = spark.read.csv(file_path, header=True, inferSchema=True) and df.write... as expected.
However, AWS Glue also has the concept of Dynamic Frames, the usage of which looks something like this:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Script generated for node Amazon S3
f0 = glueContext.create_dynamic_frame.from_options(
    format_options={
        "quoteChar": -1,
        "withHeader": True,
        "separator": ",",
        "optimizePerformance": False,
    },
    connection_type="s3",
    format="csv",
    connection_options={
        ## Need to replace with S3 path with file name
        "paths": ["s3://...."],
        "recurse": True,
    },
    transformation_ctx="f0",
)

# Script generated for node Amazon S3
f1 = glueContext.write_dynamic_frame.from_options(
    frame=f0,
    connection_type="s3",
    format="csv",
    connection_options={
        ## Need to replace with S3 path
        "path": "s3://....",
        "partitionKeys": [],
    },
    transformation_ctx="f1",
)
job.commit()
Can Spline support this dynamic frame pattern in AWS Glue? I used the spark-3.3-spline-agent-bundle_2.12-2.0.0.jar bundle. The Spline agent initialized successfully, but no lineage was produced.
I didn't try it specifically, but judging from the AWS docs on DynamicFrame, there is a chance that it will not work. The crucial thing for the Spline agent is the existence of the internal Spark write event, which the agent intercepts to grab the execution plan. That event only exists in the Spark SQL API, meaning the DataFrame API. RDD lineage, for instance, isn't supported for that very reason: Spark doesn't provide any logical plan on RDD operations that is usable for lineage purposes. I don't know exactly how DynamicFrame is implemented (it's closed source), so it's unclear whether DynamicFrame operations eventually translate to DataFrame ones or not. If they don't, Spline has no way to track them.
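If the write event is the missing piece, one thing that might be worth trying is converting the DynamicFrame to a regular DataFrame before writing, so that the write goes through the Spark SQL API. The following is only a sketch and has not been verified against Glue or Spline: the helper name `write_with_dataframe_api` is hypothetical, the `overwrite` mode and `header` option are assumptions, and even if the write is captured, the read side may still show up without a usable source because the DynamicFrame reader is opaque to Spark.

```python
def write_with_dataframe_api(dynamic_frame, output_path):
    """Write a Glue DynamicFrame as CSV through the Spark DataFrame writer.

    Hypothetical helper: the idea is that df.write goes through the
    Spark SQL API, which is where the write event Spline listens for exists.
    """
    # DynamicFrame.toDF() returns a regular Spark DataFrame.
    df = dynamic_frame.toDF()

    # Assumed settings: overwrite the target and keep the header row.
    df.write.mode("overwrite").option("header", "true").csv(output_path)
    return df
```

In the script from the question this would replace the `glueContext.write_dynamic_frame.from_options(...)` call, for example `write_with_dataframe_api(f0, "s3://....")`. Reading with `spark.read.csv(...)` instead of `create_dynamic_frame` is the pattern already reported to produce lineage, so that remains the safer route where Glue-specific features are not needed.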