diff --git a/metadata-ingestion/docs/sources/kafka-connect/kafka-connect.md b/metadata-ingestion/docs/sources/kafka-connect/kafka-connect.md index 97af98c31bbe77..249edb940aefbb 100644 --- a/metadata-ingestion/docs/sources/kafka-connect/kafka-connect.md +++ b/metadata-ingestion/docs/sources/kafka-connect/kafka-connect.md @@ -105,6 +105,377 @@ source: **Note**: When `use_connect_topics_api` is `false`, topic information will not be extracted, which may impact lineage accuracy but improves performance and works in air-gapped environments. +### Enhanced Topic Resolution for Source and Sink Connectors + +DataHub now provides intelligent topic resolution that works reliably across all environments, including Confluent Cloud where the Kafka Connect topics API is unavailable. + +#### How It Works + +**Source Connectors** (Debezium, Snowflake CDC, JDBC): + +- Always derive expected topics from connector configuration (`table.include.list`, `database.include.list`) +- Apply configured transforms (RegexRouter, EventRouter, etc.) to predict final topic names +- When Kafka API is available: Filter to only topics that exist in Kafka +- When Kafka API is unavailable (Confluent Cloud): Create lineages for all configured tables without filtering + +**Sink Connectors** (S3, Snowflake, BigQuery, JDBC): + +- Support both explicit topic lists (`topics` field) and regex patterns (`topics.regex` field) +- When `topics.regex` is used: + - Priority 1: Match against `manifest.topic_names` from Kafka API (if available) + - Priority 2: Query DataHub for Kafka topics and match pattern (if `use_schema_resolver` enabled) + - Priority 3: Warn user that pattern cannot be expanded + +#### Configuration Examples + +**Source Connector with Pattern Expansion:** + +```yml +# Debezium PostgreSQL source with wildcard tables +connector.config: + table.include.list: "public.analytics_.*" + # When Kafka API unavailable, DataHub will: + # 1. Query DataHub for all PostgreSQL tables matching pattern + # 2. Derive expected topic names (server.schema.table format) + # 3. Apply transforms if configured + # 4. Create lineages without Kafka validation +``` + +**Sink Connector with topics.regex (Confluent Cloud):** + +```yml +# S3 sink connector consuming from pattern-matched topics +connector.config: + topics.regex: "analytics\\..*" # Match topics like analytics.users, analytics.orders + # When Kafka API unavailable, DataHub will: + # 1. Query DataHub for all Kafka topics (requires use_schema_resolver: true) + # 2. Match topics against the regex pattern + # 3. Create lineages for matched topics +``` + +**Enable DataHub Topic Querying for Sink Connectors:** + +```yml +source: + type: kafka-connect + config: + connect_uri: "https://api.confluent.cloud/connect/v1/environments/env-123/clusters/lkc-abc456" + username: "your-connect-api-key" + password: "your-connect-api-secret" + + # Enable DataHub schema resolver for topic pattern expansion + use_schema_resolver: true # Required for topics.regex fallback + + # Configure graph connection for DataHub queries + datahub_gms_url: "http://localhost:8080" # Your DataHub GMS endpoint +``` + +#### Key Benefits + +1. **Confluent Cloud Support**: Both source and sink connectors work correctly with pattern-based configurations +2. **Config as Source of Truth**: Source connectors always derive topics from configuration, not from querying all tables in DataHub +3. **Smart Fallback**: Sink connectors can query DataHub for Kafka topics when Kafka API is unavailable +4. 
**Pattern Expansion**: Wildcards in `table.include.list` and `topics.regex` are properly expanded +5. **Transform Support**: All transforms (RegexRouter, EventRouter, etc.) are applied correctly + +#### When DataHub Topic Querying is Used + +DataHub will query for topics in these scenarios: + +**Source Connectors:** + +- When expanding wildcard patterns in `table.include.list` (e.g., `ANALYTICS.PUBLIC.*`) +- Queries source platform (PostgreSQL, MySQL, etc.) for tables matching the pattern + +**Sink Connectors:** + +- When `topics.regex` is used AND Kafka API is unavailable (Confluent Cloud) +- Queries DataHub's Kafka platform for topics matching the regex pattern +- Requires `use_schema_resolver: true` in configuration + +**Important Notes:** + +- DataHub never queries "all tables" to create lineages - config is always the source of truth +- Source connectors query source platforms (databases) to expand table patterns +- Sink connectors query Kafka platform to expand topic regex patterns +- Both require appropriate DataHub credentials and connectivity + +### Using DataHub Schema Resolver for Pattern Expansion and Column-Level Lineage + +The Kafka Connect source can query DataHub for schema information to provide two capabilities: + +1. **Pattern Expansion** - Converts wildcard patterns like `database.*` into actual table names by querying DataHub +2. **Column-Level Lineage** - Generates field-level lineage by matching schemas between source tables and Kafka topics + +Both features require existing metadata in DataHub from your database and Kafka schema registry ingestion. + +#### Configuration Overview + +```yml +source: + type: kafka-connect + config: + connect_uri: "http://localhost:8083" + + # Enable DataHub schema querying + use_schema_resolver: true + + # Control which features to use (both default to true when schema resolver enabled) + schema_resolver_expand_patterns: true # Expand wildcard patterns + schema_resolver_finegrained_lineage: true # Generate column-level lineage + + # DataHub connection (required when use_schema_resolver=true) + datahub_api: + server: "http://localhost:8080" + token: "your-datahub-token" # Optional +``` + +#### Pattern Expansion + +Converts wildcard patterns in connector configurations into actual table names by querying DataHub. + +**Example: MySQL Source with Wildcards** + +```yml +# Connector config contains pattern +connector.config: + table.include.list: "analytics.user_*" # Pattern: matches user_events, user_profiles, etc. 
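+  # The pattern syntax here depends on the connector type (see pattern_matchers.py in this change):
+  # Debezium-based CDC connectors interpret table.include.list as a Java regex, so
+  # "analytics.user_*" literally means "analytics", any single character, "user", then zero or
+  # more "_" characters, while the Confluent Cloud Snowflake Source uses simple shell-style
+  # wildcards (* and ?), e.g. "ANALYTICS.PUBLIC.*". Write the pattern in the syntax your
+  # connector expects.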
+ +# DataHub config +source: + type: kafka-connect + config: + use_schema_resolver: true + schema_resolver_expand_patterns: true +# Result: DataHub queries for MySQL tables matching "analytics.user_*" +# Finds: user_events, user_profiles, user_sessions +# Creates lineage: +# mysql.analytics.user_events -> kafka.server.analytics.user_events +# mysql.analytics.user_profiles -> kafka.server.analytics.user_profiles +# mysql.analytics.user_sessions -> kafka.server.analytics.user_sessions +``` + +**When to use:** + +- Connector configs have wildcard patterns (`database.*`, `schema.table_*`) +- You want accurate lineage without manually listing every table +- Source metadata exists in DataHub from database ingestion + +**When to skip:** + +- Connector configs use explicit table lists (no patterns) +- Source metadata not yet in DataHub +- Want faster ingestion without DataHub API calls + +**Configuration:** + +```yml +source: + type: kafka-connect + config: + use_schema_resolver: true + schema_resolver_expand_patterns: true # Enable pattern expansion + + + # If you only want column-level lineage but NOT pattern expansion: + # schema_resolver_expand_patterns: false +``` + +**Behavior without schema resolver:** +Patterns are treated as literal table names, resulting in potentially incorrect lineage. + +#### Column-Level Lineage + +Generates field-level lineage by matching column names between source tables and Kafka topics. + +**Example: PostgreSQL to Kafka CDC** + +```yml +source: + type: kafka-connect + config: + use_schema_resolver: true + schema_resolver_finegrained_lineage: true +# Source table schema in DataHub: +# postgres.public.users: [user_id, email, created_at, updated_at] + +# Kafka topic schema in DataHub: +# kafka.server.public.users: [user_id, email, created_at, updated_at] + +# Result: Column-level lineage created: +# postgres.public.users.user_id -> kafka.server.public.users.user_id +# postgres.public.users.email -> kafka.server.public.users.email +# postgres.public.users.created_at -> kafka.server.public.users.created_at +# postgres.public.users.updated_at -> kafka.server.public.users.updated_at +``` + +**Requirements:** + +- Source table schema exists in DataHub (from database ingestion) +- Kafka topic schema exists in DataHub (from schema registry or Kafka ingestion) +- Column names match between source and target (case-insensitive matching) + +**Benefits:** + +- **Impact Analysis**: See which fields are affected by schema changes +- **Data Tracing**: Track specific data elements through pipelines +- **Schema Understanding**: Visualize how data flows at the field level + +**ReplaceField Transform Support:** + +Column-level lineage respects ReplaceField transforms that filter or rename columns: + +```yml +# Connector excludes specific fields +connector.config: + transforms: "removeFields" + transforms.removeFields.type: "org.apache.kafka.connect.transforms.ReplaceField$Value" + transforms.removeFields.exclude: "internal_id,temp_column" +# DataHub behavior: +# Source schema: [user_id, email, internal_id, temp_column] +# After transform: [user_id, email] +# Column lineage created only for: user_id, email +``` + +**Configuration:** + +```yml +source: + type: kafka-connect + config: + use_schema_resolver: true + schema_resolver_finegrained_lineage: true # Enable column-level lineage + + + # If you only want pattern expansion but NOT column-level lineage: + # schema_resolver_finegrained_lineage: false +``` + +**Behavior without schema resolver:** +Only dataset-level lineage is created 
(e.g., `postgres.users -> kafka.users`), without field-level detail. + +#### Complete Configuration Example + +```yml +source: + type: kafka-connect + config: + # Kafka Connect cluster + connect_uri: "http://localhost:8083" + cluster_name: "production-connect" + + # Enable schema resolver features + use_schema_resolver: true + schema_resolver_expand_patterns: true # Expand wildcard patterns + schema_resolver_finegrained_lineage: true # Generate column-level lineage + + # DataHub connection + datahub_api: + server: "http://datahub.company.com" + token: "${DATAHUB_TOKEN}" + + # Platform instances (if using multiple) + platform_instance_map: + postgres: "prod-postgres" + kafka: "prod-kafka" +``` + +#### Performance Impact + +**API Calls per Connector:** + +- Pattern expansion: 1 GraphQL query per unique wildcard pattern +- Column-level lineage: 2 GraphQL queries (source schema + target schema) +- Results cached for ingestion run duration + +**Optimization:** + +```yml +# Minimal configuration - no schema resolver +source: + type: kafka-connect + config: + connect_uri: "http://localhost:8083" + # use_schema_resolver: false # Default - no DataHub queries + +# Pattern expansion only +source: + type: kafka-connect + config: + use_schema_resolver: true + schema_resolver_expand_patterns: true + schema_resolver_finegrained_lineage: false # Skip column lineage for faster ingestion + +# Column lineage only +source: + type: kafka-connect + config: + use_schema_resolver: true + schema_resolver_expand_patterns: false # Skip pattern expansion + schema_resolver_finegrained_lineage: true +``` + +**Best Practice:** +Run database and Kafka schema ingestion before Kafka Connect ingestion to pre-populate DataHub with schema metadata. + +#### Troubleshooting + +**"Pattern expansion found no matches for: analytics.\*"** + +Causes: + +- Source database metadata not in DataHub +- Pattern syntax doesn't match DataHub dataset names +- Platform instance mismatch + +Solutions: + +1. Run database ingestion first to populate DataHub +2. Verify pattern matches table naming in source system +3. Check `platform_instance_map` matches database ingestion config +4. Use explicit table list to bypass pattern expansion temporarily + +**"SchemaResolver not available: DataHub graph connection is not available"** + +Causes: + +- Missing `datahub_api` configuration +- DataHub GMS not accessible + +Solutions: + +```yml +source: + type: kafka-connect + config: + use_schema_resolver: true + datahub_api: + server: "http://localhost:8080" # Add DataHub GMS URL + token: "your-token" # Add if authentication enabled +``` + +**Column-level lineage not appearing** + +Check: + +1. Source table schema exists: Search for table in DataHub UI +2. Kafka topic schema exists: Search for topic in DataHub UI +3. Column names match (case differences are handled automatically) +4. Check ingestion logs for warnings about missing schemas + +**Slow ingestion with schema resolver enabled** + +Profile: + +- Check logs for "Schema resolver cache hits: X, misses: Y" +- High misses indicate missing metadata in DataHub + +Temporarily disable to compare: + +```yml +use_schema_resolver: false +``` + ### Working with Platform Instances If you've multiple instances of kafka OR source/sink systems that are referred in your `kafka-connect` setup, you'd need to configure platform instance for these systems in `kafka-connect` recipe to generate correct lineage edges. You must have already set `platform_instance` in recipes of original source/sink systems. 
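+For example, a minimal sketch of the relevant part of a `kafka-connect` recipe (the instance names are illustrative and must match the `platform_instance` values already used in your original source/sink and Kafka ingestion recipes):
+
+```yml
+source:
+  type: kafka-connect
+  config:
+    connect_uri: "http://localhost:8083"
+    platform_instance_map:
+      postgres: "prod-postgres" # platform_instance from your PostgreSQL ingestion recipe
+      kafka: "prod-kafka" # platform_instance from your Kafka ingestion recipe
+```
+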
Refer the document [Working with Platform Instances](https://docs.datahub.com/docs/platform-instances) to understand more about this. diff --git a/metadata-ingestion/src/datahub/ingestion/source/kafka_connect/common.py b/metadata-ingestion/src/datahub/ingestion/source/kafka_connect/common.py index 84268412048db7..0d59e9c9a24010 100644 --- a/metadata-ingestion/src/datahub/ingestion/source/kafka_connect/common.py +++ b/metadata-ingestion/src/datahub/ingestion/source/kafka_connect/common.py @@ -1,6 +1,6 @@ import logging from dataclasses import dataclass, field -from typing import Dict, Final, List, Optional +from typing import TYPE_CHECKING, Callable, Dict, Final, List, Optional, TypedDict from pydantic import model_validator from pydantic.fields import Field @@ -10,6 +10,13 @@ DatasetLineageProviderConfigBase, PlatformInstanceConfigMixin, ) +from datahub.ingestion.source.kafka_connect.config_constants import ( + parse_comma_separated_list, +) +from datahub.ingestion.source.kafka_connect.pattern_matchers import JavaRegexMatcher +from datahub.ingestion.source.kafka_connect.transform_plugins import ( + get_transform_pipeline, +) from datahub.ingestion.source.state.stale_entity_removal_handler import ( StaleEntityRemovalSourceReport, StatefulStaleMetadataRemovalConfig, @@ -18,6 +25,10 @@ StatefulIngestionConfigBase, ) from datahub.utilities.lossy_collections import LossyList +from datahub.utilities.urns.dataset_urn import DatasetUrn + +if TYPE_CHECKING: + from datahub.sql_parsing.schema_resolver import SchemaResolver logger = logging.getLogger(__name__) @@ -57,70 +68,13 @@ } -class ConnectorConfigKeys: - """Centralized configuration keys to avoid magic strings throughout the codebase.""" - - # Core connector configuration - CONNECTOR_CLASS: Final[str] = "connector.class" - - # Topic configuration - TOPICS: Final[str] = "topics" - TOPICS_REGEX: Final[str] = "topics.regex" - KAFKA_TOPIC: Final[str] = "kafka.topic" - TOPIC: Final[str] = "topic" - TOPIC_PREFIX: Final[str] = "topic.prefix" - - # JDBC configuration - CONNECTION_URL: Final[str] = "connection.url" - TABLE_INCLUDE_LIST: Final[str] = "table.include.list" - TABLE_WHITELIST: Final[str] = "table.whitelist" - QUERY: Final[str] = "query" - MODE: Final[str] = "mode" - - # Debezium/CDC configuration - DATABASE_SERVER_NAME: Final[str] = "database.server.name" - DATABASE_HOSTNAME: Final[str] = "database.hostname" - DATABASE_PORT: Final[str] = "database.port" - DATABASE_DBNAME: Final[str] = "database.dbname" - DATABASE_INCLUDE_LIST: Final[str] = "database.include.list" - - # Kafka configuration - KAFKA_ENDPOINT: Final[str] = "kafka.endpoint" - BOOTSTRAP_SERVERS: Final[str] = "bootstrap.servers" - KAFKA_BOOTSTRAP_SERVERS: Final[str] = "kafka.bootstrap.servers" - - # BigQuery configuration - PROJECT: Final[str] = "project" - DEFAULT_DATASET: Final[str] = "defaultDataset" - DATASETS: Final[str] = "datasets" - TOPICS_TO_TABLES: Final[str] = "topicsToTables" - SANITIZE_TOPICS: Final[str] = "sanitizeTopics" - KEYFILE: Final[str] = "keyfile" - - # Snowflake configuration - SNOWFLAKE_DATABASE_NAME: Final[str] = "snowflake.database.name" - SNOWFLAKE_SCHEMA_NAME: Final[str] = "snowflake.schema.name" - SNOWFLAKE_TOPIC2TABLE_MAP: Final[str] = "snowflake.topic2table.map" - SNOWFLAKE_PRIVATE_KEY: Final[str] = "snowflake.private.key" - SNOWFLAKE_PRIVATE_KEY_PASSPHRASE: Final[str] = "snowflake.private.key.passphrase" - - # S3 configuration - S3_BUCKET_NAME: Final[str] = "s3.bucket.name" - TOPICS_DIR: Final[str] = "topics.dir" - AWS_ACCESS_KEY_ID: Final[str] = 
"aws.access.key.id" - AWS_SECRET_ACCESS_KEY: Final[str] = "aws.secret.access.key" - S3_SSE_CUSTOMER_KEY: Final[str] = "s3.sse.customer.key" - S3_PROXY_PASSWORD: Final[str] = "s3.proxy.password" - - # MongoDB configuration - - # Transform configuration - TRANSFORMS: Final[str] = "transforms" - - # Authentication configuration - VALUE_CONVERTER_BASIC_AUTH_USER_INFO: Final[str] = ( - "value.converter.basic.auth.user.info" - ) +class FineGrainedLineageDict(TypedDict): + """Structure for fine-grained (column-level) lineage mappings.""" + + upstreamType: str + downstreamType: str + upstreams: List[str] + downstreams: List[str] # Confluent Cloud connector class names @@ -129,10 +83,12 @@ class ConnectorConfigKeys: # - https://docs.confluent.io/cloud/current/connectors/cc-postgresql-cdc-source-v2.html # - https://docs.confluent.io/cloud/current/connectors/cc-postgresql-sink.html # - https://docs.confluent.io/cloud/current/connectors/cc-snowflake-sink.html +# - https://docs.confluent.io/cloud/current/connectors/cc-snowflake-source.html POSTGRES_CDC_SOURCE_CLOUD: Final[str] = "PostgresCdcSource" POSTGRES_CDC_SOURCE_V2_CLOUD: Final[str] = "PostgresCdcSourceV2" POSTGRES_SINK_CLOUD: Final[str] = "PostgresSink" SNOWFLAKE_SINK_CLOUD: Final[str] = "SnowflakeSink" +SNOWFLAKE_SOURCE_CLOUD: Final[str] = "SnowflakeSource" MYSQL_SOURCE_CLOUD: Final[str] = "MySqlSource" MYSQL_CDC_SOURCE_CLOUD: Final[str] = "MySqlCdcSource" MYSQL_SINK_CLOUD: Final[str] = "MySqlSink" @@ -257,6 +213,32 @@ class KafkaConnectSourceConfig( "This is the recommended approach for Confluent Cloud instead of manually constructing the full URI.", ) + # Schema resolver configuration for enhanced lineage + use_schema_resolver: bool = Field( + default=False, + description="Use DataHub's schema metadata to enhance CDC connector lineage. " + "When enabled (requires DataHub graph connection): " + "1) Expands table patterns (e.g., 'database.*') to actual tables using DataHub metadata " + "2) Generates fine-grained column-level lineage for CDC sources/sinks. " + "Disabled by default to maintain backward compatibility.", + ) + + schema_resolver_expand_patterns: Optional[bool] = Field( + default=None, + description="Enable table pattern expansion using DataHub schema metadata. " + "When use_schema_resolver=True, this controls whether to expand patterns like 'database.*' " + "to actual table names by querying DataHub. Only applies when use_schema_resolver is enabled. " + "Defaults to True when use_schema_resolver is enabled.", + ) + + schema_resolver_finegrained_lineage: Optional[bool] = Field( + default=None, + description="Enable fine-grained (column-level) lineage extraction using DataHub schema metadata. " + "When use_schema_resolver=True, this controls whether to generate column-level lineage " + "by matching schemas between source tables and Kafka topics. Only applies when use_schema_resolver is enabled. " + "Defaults to True when use_schema_resolver is enabled.", + ) + stateful_ingestion: Optional[StatefulStaleMetadataRemovalConfig] = None @model_validator(mode="before") @@ -291,6 +273,94 @@ def auto_construct_connect_uri(cls, values: Dict) -> Dict: return values + @model_validator(mode="after") + def validate_configuration_interdependencies(self) -> "KafkaConnectSourceConfig": + """ + Validate configuration field interdependencies and provide clear error messages. + + Checks: + 1. Schema resolver dependent fields require use_schema_resolver=True + 2. Kafka API credentials are complete (key + secret) + 3. 
Confluent Cloud IDs are complete (environment + cluster) + 4. Warn if conflicting configurations are provided + """ + # 1. Set schema resolver defaults if not explicitly configured + if self.use_schema_resolver: + # Schema resolver is enabled - set sensible defaults for sub-features + if self.schema_resolver_expand_patterns is None: + self.schema_resolver_expand_patterns = True + if self.schema_resolver_finegrained_lineage is None: + self.schema_resolver_finegrained_lineage = True + else: + # Schema resolver is disabled - set defaults to False + if self.schema_resolver_expand_patterns is None: + self.schema_resolver_expand_patterns = False + if self.schema_resolver_finegrained_lineage is None: + self.schema_resolver_finegrained_lineage = False + + # 2. Validate Kafka API credentials are complete + kafka_api_key_provided = self.kafka_api_key is not None + kafka_api_secret_provided = self.kafka_api_secret is not None + + if kafka_api_key_provided != kafka_api_secret_provided: + raise ValueError( + "Configuration error: Both 'kafka_api_key' and 'kafka_api_secret' must be provided together. " + f"Currently kafka_api_key={'set' if kafka_api_key_provided else 'not set'}, " + f"kafka_api_secret={'set' if kafka_api_secret_provided else 'not set'}." + ) + + # 3. Validate Confluent Cloud IDs are complete + env_id_provided = self.confluent_cloud_environment_id is not None + cluster_id_provided = self.confluent_cloud_cluster_id is not None + + if env_id_provided != cluster_id_provided: + raise ValueError( + "Configuration error: Both 'confluent_cloud_environment_id' and 'confluent_cloud_cluster_id' " + "must be provided together for automatic URI construction. " + f"Currently environment_id={'set' if env_id_provided else 'not set'}, " + f"cluster_id={'set' if cluster_id_provided else 'not set'}." + ) + + # 4. Warn if conflicting configurations (informational, not error) + if env_id_provided and cluster_id_provided: + # Confluent Cloud IDs provided - check for potential conflicts + if self.connect_uri and self.connect_uri != DEFAULT_CONNECT_URI: + # User explicitly set connect_uri AND provided Cloud IDs + constructed_uri = self.construct_confluent_cloud_uri( + self.confluent_cloud_environment_id, # type: ignore[arg-type] + self.confluent_cloud_cluster_id, # type: ignore[arg-type] + ) + if self.connect_uri != constructed_uri: + logger.warning( + f"Configuration conflict: Both 'connect_uri' and Confluent Cloud IDs are set. " + f"Using connect_uri='{self.connect_uri}' (ignoring environment/cluster IDs). " + f"Expected URI from IDs would be: '{constructed_uri}'. " + f"Remove connect_uri to use automatic URI construction." + ) + + # 5. Validate kafka_rest_endpoint format if provided + if self.kafka_rest_endpoint: + if not self.kafka_rest_endpoint.startswith(("http://", "https://")): + raise ValueError( + f"Configuration error: 'kafka_rest_endpoint' must be a valid HTTP(S) URL. " + f"Got: '{self.kafka_rest_endpoint}'. " + f"Expected format: https://pkc-xxxxx.region.provider.confluent.cloud" + ) + + # 6. Warn if schema resolver enabled but all features explicitly disabled + if self.use_schema_resolver: + if ( + self.schema_resolver_expand_patterns is False + and self.schema_resolver_finegrained_lineage is False + ): + logger.warning( + "Schema resolver is enabled but all features are disabled. " + "To fix: Either enable schema_resolver_expand_patterns=True or schema_resolver_finegrained_lineage=True, " + "or set use_schema_resolver=False to avoid unnecessary DataHub queries." 
+ ) + + return self + def get_connect_credentials(self) -> tuple[Optional[str], Optional[str]]: """Get the appropriate credentials for Connect API access.""" return self.username, self.password @@ -359,6 +429,7 @@ class KafkaConnectLineage: target_platform: str job_property_bag: Optional[Dict[str, str]] = None source_dataset: Optional[str] = None + fine_grained_lineages: Optional[List[FineGrainedLineageDict]] = None @dataclass @@ -374,26 +445,6 @@ class ConnectorManifest: lineages: List[KafkaConnectLineage] = field(default_factory=list) topic_names: List[str] = field(default_factory=list) - def extract_lineages( - self, config: "KafkaConnectSourceConfig", report: "KafkaConnectSourceReport" - ) -> List[KafkaConnectLineage]: - """Extract lineages for this connector using connector registry.""" - from datahub.ingestion.source.kafka_connect.connector_registry import ( - ConnectorRegistry, - ) - - return ConnectorRegistry.extract_lineages(self, config, report) - - def extract_flow_property_bag( - self, config: "KafkaConnectSourceConfig", report: "KafkaConnectSourceReport" - ) -> Optional[Dict[str, str]]: - """Extract flow property bag for this connector using connector registry.""" - from datahub.ingestion.source.kafka_connect.connector_registry import ( - ConnectorRegistry, - ) - - return ConnectorRegistry.extract_flow_property_bag(self, config, report) - def get_topics_from_config( self, config: "KafkaConnectSourceConfig", report: "KafkaConnectSourceReport" ) -> List[str]: @@ -424,35 +475,6 @@ def unquote( return string -def parse_comma_separated_list(value: str) -> List[str]: - """ - Safely parse a comma-separated list with robust error handling. - - Args: - value: Comma-separated string to parse - - Returns: - List of non-empty stripped items - - Handles edge cases: - - Empty/None values - - Leading/trailing commas - - Multiple consecutive commas - - Whitespace-only items - """ - if not value or not value.strip(): - return [] - - # Split on comma and clean up each item - items = [] - for item in value.split(","): - cleaned_item = item.strip() - if cleaned_item: # Only add non-empty items - items.append(cleaned_item) - - return items - - def validate_jdbc_url(url: str) -> bool: """Validate JDBC URL format and return whether it's well-formed.""" if not url or not isinstance(url, str): @@ -591,9 +613,13 @@ def get_platform_instance( ) elif config.platform_instance_map and config.platform_instance_map.get(platform): instance_name = config.platform_instance_map[platform] - logger.info( - f"Instance name assigned is: {instance_name} for Connector Name {connector_name} and platform {platform}" - ) + + # Only log when a platform instance is actually assigned (non-None) + if instance_name: + logger.debug( + f"Platform instance '{instance_name}' assigned for connector '{connector_name}' platform '{platform}'" + ) + return instance_name @@ -632,6 +658,7 @@ class BaseConnector: connector_manifest: ConnectorManifest config: KafkaConnectSourceConfig report: KafkaConnectSourceReport + schema_resolver: Optional["SchemaResolver"] = None def extract_lineages(self) -> List[KafkaConnectLineage]: """Extract lineage mappings for this connector. Override in subclasses.""" @@ -650,10 +677,515 @@ def supports_connector_class(connector_class: str) -> bool: """Check if this connector handles the given class. Override in subclasses.""" return False - @staticmethod - def get_platform(connector_class: str) -> str: - """Get the platform for this connector type. 
Override in subclasses.""" + def get_platform(self) -> str: + """Get the platform for this connector instance. Override in subclasses.""" return "unknown" + def _discover_tables_from_database( + self, database_name: str, platform: str + ) -> List[str]: + """ + Discover all tables in a database by querying DataHub. + + This method queries DataHub for all tables in the specified database and platform. + It's used when connectors don't have table.include.list configured, meaning they + capture ALL tables from the database. + + The method first tries to use cached URNs from SchemaResolver (populated from + previous ingestion runs or lineage resolution). If the cache is empty, it queries + DataHub's GraphQL API directly using the graph client's get_urns_by_filter() method. + + Args: + database_name: The database name (e.g., "mydb", "testdb") + platform: The platform name (e.g., "postgres", "mysql") -# Removed: TopicResolver and ConnectorTopicHandlerRegistry - logic moved directly to BaseConnector subclasses + Returns: + List of table names in schema.table format (e.g., ["public.users", "public.orders"]) + """ + if not self.schema_resolver: + logger.warning("SchemaResolver not available for table discovery") + return [] + + try: + # First try to get URNs from cache (fast path) + all_urns = self.schema_resolver.get_urns() + + # If cache is empty, query DataHub directly + if not all_urns: + if not self.schema_resolver.graph: + logger.warning( + "Cannot discover tables - no DataHub graph connection available" + ) + return [] + + logger.info( + f"SchemaResolver cache is empty. Querying DataHub for datasets " + f"with platform={platform}, env={self.schema_resolver.env}" + ) + + # Use graph.get_urns_by_filter() to get all datasets for this platform + # This is more efficient than a search query and uses the proper filtering API + all_urns = set( + self.schema_resolver.graph.get_urns_by_filter( + entity_types=["dataset"], + platform=platform, + platform_instance=self.schema_resolver.platform_instance, + env=self.schema_resolver.env, + ) + ) + + if not all_urns: + logger.warning( + f"No datasets found in DataHub for platform={platform}, env={self.schema_resolver.env}. " + f"Make sure you've ingested {platform} datasets into DataHub before running Kafka Connect ingestion." 
+ ) + return [] + + logger.info( + f"Found {len(all_urns)} datasets in DataHub for platform={platform}" + ) + + logger.debug( + f"Processing {len(all_urns)} URNs for platform={platform}, database={database_name}" + ) + + discovered_tables = [] + + for urn in all_urns: + # URN format: urn:li:dataset:(urn:li:dataPlatform:postgres,database.schema.table,PROD) + table_name = self._extract_table_name_from_urn(urn) + if not table_name: + continue + + # Filter by platform + if f"dataPlatform:{platform}" not in urn: + continue + + # Filter by database - check if table_name starts with database prefix + if database_name: + if table_name.lower().startswith(f"{database_name.lower()}."): + # Remove database prefix to get "schema.table" + schema_table = table_name[len(database_name) + 1 :] + discovered_tables.append(schema_table) + else: + # No database filtering - include all tables + discovered_tables.append(table_name) + + logger.info( + f"Discovered {len(discovered_tables)} tables from database '{database_name}' for platform '{platform}'" + ) + return discovered_tables + + except Exception as e: + logger.warning( + f"Failed to discover tables from database '{database_name}': {e}", + exc_info=True, + ) + return [] + + def _apply_replace_field_transform( + self, source_columns: List[str] + ) -> Dict[str, Optional[str]]: + """ + Apply ReplaceField SMT transformations to column mappings. + + ReplaceField transform can filter, rename, or drop fields: + - include: Keep only specified fields (all others dropped) + - exclude: Drop specified fields (all others kept) + - rename: Rename fields using from:to format + + Reference: https://docs.confluent.io/platform/current/connect/transforms/replacefield.html + + Args: + source_columns: List of source column names + + Returns: + Dictionary mapping source column -> target column name (None if dropped) + """ + # Parse transforms from connector config + transforms_config = self.connector_manifest.config.get("transforms", "") + if not transforms_config: + # No transforms - return 1:1 mapping + return {col: col for col in source_columns} + + transform_names = parse_comma_separated_list(transforms_config) + + # Build column mapping (source -> target, None means dropped) + column_mapping: Dict[str, Optional[str]] = {col: col for col in source_columns} + + # Apply each ReplaceField transform in order + for transform_name in transform_names: + transform_type = self.connector_manifest.config.get( + f"transforms.{transform_name}.type", "" + ) + + # Check if this is a ReplaceField$Value transform + # We only support Value transforms since those affect the column data + if ( + transform_type + != "org.apache.kafka.connect.transforms.ReplaceField$Value" + ): + continue + + # Get transform configuration + include_config = self.connector_manifest.config.get( + f"transforms.{transform_name}.include", "" + ) + exclude_config = self.connector_manifest.config.get( + f"transforms.{transform_name}.exclude", "" + ) + rename_config = self.connector_manifest.config.get( + f"transforms.{transform_name}.renames", "" + ) + + # Apply include filter (keep only specified fields) + if include_config: + include_fields = set(parse_comma_separated_list(include_config)) + for col in list(column_mapping.keys()): + if column_mapping[col] not in include_fields: + column_mapping[col] = None + + # Apply exclude filter (drop specified fields) + if exclude_config: + exclude_fields = set(parse_comma_separated_list(exclude_config)) + for col in list(column_mapping.keys()): + if column_mapping[col] in 
exclude_fields: + column_mapping[col] = None + + # Apply renames (format: "from:to,from2:to2") + if rename_config: + rename_pairs = parse_comma_separated_list(rename_config) + rename_map = {} + for pair in rename_pairs: + if ":" in pair: + from_field, to_field = pair.split(":", 1) + rename_map[from_field.strip()] = to_field.strip() + + # Apply renames to the column mapping + for col in list(column_mapping.keys()): + current_name = column_mapping[col] + if current_name and current_name in rename_map: + column_mapping[col] = rename_map[current_name] + + return column_mapping + + def _extract_fine_grained_lineage( + self, + source_dataset: str, + source_platform: str, + target_dataset: str, + target_platform: str = "kafka", + ) -> Optional[List[FineGrainedLineageDict]]: + """ + Extract column-level lineage using schema metadata from DataHub. + + This unified implementation works for all source connectors that preserve + column names in a 1:1 mapping (e.g., CDC connectors, JDBC polling connectors). + + Args: + source_dataset: Source table name (e.g., "database.schema.table") + source_platform: Source platform (e.g., "postgres", "snowflake", "mysql") + target_dataset: Target Kafka topic name + target_platform: Target platform (default: "kafka") + + Returns: + List of fine-grained lineage dictionaries or None if not available + """ + # Check if feature is enabled + if not self.config.use_schema_resolver: + return None + if not self.config.schema_resolver_finegrained_lineage: + return None + if not self.schema_resolver: + return None + + # Skip fine-grained lineage for Kafka source platform + # SchemaResolver is designed for database platforms, not Kafka topics + if source_platform.lower() == "kafka": + logger.debug( + f"Skipping fine-grained lineage extraction for Kafka topic {source_dataset} " + "- schema resolver only supports database platforms" + ) + return None + + try: + from datahub.emitter.mce_builder import make_schema_field_urn + from datahub.sql_parsing._models import _TableName + from datahub.utilities.urns.dataset_urn import DatasetUrn + + # Build source table reference + source_table = _TableName( + database=None, db_schema=None, table=source_dataset + ) + + # Resolve source table schema from DataHub + source_urn_str, source_schema = self.schema_resolver.resolve_table( + source_table + ) + + if not source_schema: + logger.debug( + f"No schema metadata found in DataHub for {source_platform} table {source_dataset}" + ) + return None + + # Build target URN using DatasetUrn helper with correct target platform + target_urn = DatasetUrn.create_from_ids( + platform_id=target_platform, + table_name=target_dataset, + env=self.config.env, + ) + + # Apply ReplaceField transforms to column mappings + # source_schema is Dict[str, str] mapping column names to types + column_mapping = self._apply_replace_field_transform( + list(source_schema.keys()) + ) + + # Create fine-grained lineage for each source column + fine_grained_lineages: List[FineGrainedLineageDict] = [] + + for source_col in source_schema: + target_col = column_mapping.get(source_col) + + # Skip if field was dropped by ReplaceField transform + if target_col is None: + logger.debug( + f"Skipping column '{source_col}' - dropped by ReplaceField transform" + ) + continue + + fine_grained_lineage: FineGrainedLineageDict = { + "upstreamType": "FIELD_SET", + "downstreamType": "FIELD", + "upstreams": [make_schema_field_urn(source_urn_str, source_col)], + "downstreams": [make_schema_field_urn(str(target_urn), target_col)], + } + 
fine_grained_lineages.append(fine_grained_lineage) + + if fine_grained_lineages: + logger.info( + f"Generated {len(fine_grained_lineages)} fine-grained lineages " + f"for {source_platform} table {source_dataset} → {target_dataset}" + ) + return fine_grained_lineages + + except Exception as e: + logger.debug( + f"Failed to extract fine-grained lineage for " + f"{source_dataset} → {target_dataset}: {e}" + ) + + return None + + def _extract_table_name_from_urn(self, urn: str) -> Optional[str]: + """ + Extract table name from DataHub URN using standard DatasetUrn parser. + + Args: + urn: DataHub dataset URN + Format: urn:li:dataset:(urn:li:dataPlatform:platform,table_name,ENV) + Example: urn:li:dataset:(urn:li:dataPlatform:snowflake,database.schema.table,PROD) + + Returns: + Extracted table name (e.g., "database.schema.table") or None if parsing fails + """ + try: + return DatasetUrn.from_string(urn).name + except Exception as e: + logger.debug(f"Failed to extract table name from URN {urn}: {e}") + return None + + def _extract_lineages_from_schema_resolver( + self, + source_platform: str, + topic_namer: Callable[[str], str], + transforms: List[Dict[str, str]], + connector_type: str = "connector", + ) -> List[KafkaConnectLineage]: + """ + Common helper to extract lineages using SchemaResolver. + + This unified implementation eliminates code duplication between Snowflake, Debezium, + and other source connectors that derive topic names from table names. + + Args: + source_platform: Source database platform (postgres, snowflake, mysql, etc.) + topic_namer: Callback function that converts table_name → base_topic_name + Example for Snowflake: lambda table: f"prefix{table}" + Example for Debezium: lambda table: f"server.{table}" + transforms: List of transform configurations to apply to derived topics + connector_type: Connector type name for logging (default: "connector") + + Returns: + List of lineage mappings from database tables to expected Kafka topics + """ + lineages: List[KafkaConnectLineage] = [] + + if not self.schema_resolver: + logger.debug( + "SchemaResolver not available, cannot derive topics from DataHub" + ) + return lineages + + try: + # Get all URNs from schema resolver and filter for the source platform + # The cache may contain URNs from other platforms if shared across runs + all_urns = self.schema_resolver.get_urns() + + # Filter URNs by platform using DatasetUrn parser + platform_urns = [] + for urn in all_urns: + try: + dataset_urn = DatasetUrn.from_string(urn) + if dataset_urn.platform == source_platform: + platform_urns.append(urn) + except Exception as e: + logger.debug(f"Failed to parse URN {urn}: {e}") + continue + + logger.info( + f"SchemaResolver returned {len(platform_urns)} URNs for platform={source_platform}, " + f"platform_instance={self.schema_resolver.platform_instance or 'None'}, " + f"will derive {connector_type} topics for connector '{self.connector_manifest.name}'" + ) + + if not platform_urns: + logger.warning( + f"No {source_platform} datasets found in DataHub SchemaResolver cache " + f"for platform_instance={self.schema_resolver.platform_instance or 'None'}. " + f"Make sure you've ingested {source_platform} datasets into DataHub before running Kafka Connect ingestion." 
+ ) + return lineages + + # Process each table and generate expected topic name + for urn in platform_urns: + table_name = self._extract_table_name_from_urn(urn) + if not table_name: + continue + + # Generate base topic name using connector-specific naming logic + expected_topic = topic_namer(table_name) + + # Apply transforms if configured + if transforms: + result = get_transform_pipeline().apply_forward( + [expected_topic], self.connector_manifest.config + ) + if result.warnings: + for warning in result.warnings: + logger.warning( + f"Transform warning for {self.connector_manifest.name}: {warning}" + ) + if result.topics and len(result.topics) > 0: + expected_topic = result.topics[0] + + # Extract fine-grained lineage if enabled + fine_grained = self._extract_fine_grained_lineage( + table_name, source_platform, expected_topic, KAFKA + ) + + # Create lineage mapping + lineage = KafkaConnectLineage( + source_dataset=table_name, + source_platform=source_platform, + target_dataset=expected_topic, + target_platform=KAFKA, + fine_grained_lineages=fine_grained, + ) + lineages.append(lineage) + + logger.info( + f"Created {len(lineages)} lineages from DataHub schemas for {connector_type} '{self.connector_manifest.name}'" + ) + return lineages + + except Exception as e: + logger.warning( + f"Failed to extract lineages from DataHub schemas for connector '{self.connector_manifest.name}': {e}", + exc_info=True, + ) + return [] + + def _expand_topic_regex_patterns( + self, + topics_regex: str, + available_topics: Optional[List[str]] = None, + ) -> List[str]: + """ + Expand topics.regex pattern against available Kafka topics using JavaRegexMatcher. + + This helper method is used by sink connectors to resolve topics.regex patterns + when the Kafka API is unavailable (e.g., Confluent Cloud). + + Priority order for topic sources: + 1. Use provided available_topics (from manifest.topic_names if Kafka API worked) + 2. Query DataHub for Kafka topics (if schema_resolver enabled) + 3. 
Return empty list and warn (can't expand without topic list) + + Args: + topics_regex: Java regex pattern from topics.regex config + available_topics: Optional list of available topics (from Kafka API) + + Returns: + List of topics matching the regex pattern + """ + matcher = JavaRegexMatcher() + + # Priority 1: Use provided available_topics (from Kafka API) + if available_topics: + matched_topics = matcher.filter_matches([topics_regex], available_topics) + if matched_topics: + logger.info( + f"Expanded topics.regex '{topics_regex}' to {len(matched_topics)} topics " + f"from {len(available_topics)} available Kafka topics" + ) + elif not matched_topics: + logger.warning( + f"Java regex pattern '{topics_regex}' did not match any of the {len(available_topics)} available topics" + ) + return matched_topics + + # Priority 2: Query DataHub for Kafka topics + if self.schema_resolver and self.schema_resolver.graph: + logger.info( + f"Kafka API unavailable for connector '{self.connector_manifest.name}' - " + f"querying DataHub for Kafka topics to expand pattern '{topics_regex}'" + ) + try: + # Query DataHub for all Kafka topics + kafka_topic_urns = list( + self.schema_resolver.graph.get_urns_by_filter( + platform="kafka", + env=self.schema_resolver.env, + entity_types=["dataset"], + ) + ) + + datahub_topics = [] + for urn in kafka_topic_urns: + topic_name = self._extract_table_name_from_urn(urn) + if topic_name: + datahub_topics.append(topic_name) + + matched_topics = matcher.filter_matches([topics_regex], datahub_topics) + + logger.info( + f"Found {len(matched_topics)} Kafka topics in DataHub matching pattern '{topics_regex}' " + f"(out of {len(datahub_topics)} total Kafka topics)" + ) + return matched_topics + + except Exception as e: + logger.warning( + f"Failed to query DataHub for Kafka topics to expand pattern '{topics_regex}': {e}", + exc_info=True, + ) + + # Priority 3: No topic sources available - warn and return empty + logger.warning( + f"Cannot expand topics.regex '{topics_regex}' for connector '{self.connector_manifest.name}' - " + f"Kafka API unavailable and DataHub query not available. " + f"Enable 'use_schema_resolver' in config to query DataHub for Kafka topics." + ) + return [] diff --git a/metadata-ingestion/src/datahub/ingestion/source/kafka_connect/config_constants.py b/metadata-ingestion/src/datahub/ingestion/source/kafka_connect/config_constants.py new file mode 100644 index 00000000000000..120a318cdfb687 --- /dev/null +++ b/metadata-ingestion/src/datahub/ingestion/source/kafka_connect/config_constants.py @@ -0,0 +1,106 @@ +""" +Shared configuration constants and utilities for Kafka Connect. + +This module contains constants and utility functions that are used across multiple +modules (common.py, transform_plugins.py, etc.) without creating circular dependencies. 
+""" + +import logging +from typing import Final, List + +logger = logging.getLogger(__name__) + + +class ConnectorConfigKeys: + """Centralized configuration keys to avoid magic strings throughout the codebase.""" + + # Core connector configuration + CONNECTOR_CLASS: Final[str] = "connector.class" + + # Topic configuration + TOPICS: Final[str] = "topics" + TOPICS_REGEX: Final[str] = "topics.regex" + KAFKA_TOPIC: Final[str] = "kafka.topic" + TOPIC: Final[str] = "topic" + TOPIC_PREFIX: Final[str] = "topic.prefix" + + # JDBC configuration + CONNECTION_URL: Final[str] = "connection.url" + TABLE_INCLUDE_LIST: Final[str] = "table.include.list" + TABLE_WHITELIST: Final[str] = "table.whitelist" + QUERY: Final[str] = "query" + MODE: Final[str] = "mode" + + # Debezium/CDC configuration + DATABASE_SERVER_NAME: Final[str] = "database.server.name" + DATABASE_HOSTNAME: Final[str] = "database.hostname" + DATABASE_PORT: Final[str] = "database.port" + DATABASE_DBNAME: Final[str] = "database.dbname" + DATABASE_INCLUDE_LIST: Final[str] = "database.include.list" + + # Kafka configuration + KAFKA_ENDPOINT: Final[str] = "kafka.endpoint" + BOOTSTRAP_SERVERS: Final[str] = "bootstrap.servers" + KAFKA_BOOTSTRAP_SERVERS: Final[str] = "kafka.bootstrap.servers" + + # BigQuery configuration + PROJECT: Final[str] = "project" + DEFAULT_DATASET: Final[str] = "defaultDataset" + DATASETS: Final[str] = "datasets" + TOPICS_TO_TABLES: Final[str] = "topicsToTables" + SANITIZE_TOPICS: Final[str] = "sanitizeTopics" + KEYFILE: Final[str] = "keyfile" + + # Snowflake configuration + SNOWFLAKE_DATABASE_NAME: Final[str] = "snowflake.database.name" + SNOWFLAKE_SCHEMA_NAME: Final[str] = "snowflake.schema.name" + SNOWFLAKE_TOPIC2TABLE_MAP: Final[str] = "snowflake.topic2table.map" + SNOWFLAKE_PRIVATE_KEY: Final[str] = "snowflake.private.key" + SNOWFLAKE_PRIVATE_KEY_PASSPHRASE: Final[str] = "snowflake.private.key.passphrase" + + # S3 configuration + S3_BUCKET_NAME: Final[str] = "s3.bucket.name" + TOPICS_DIR: Final[str] = "topics.dir" + AWS_ACCESS_KEY_ID: Final[str] = "aws.access.key.id" + AWS_SECRET_ACCESS_KEY: Final[str] = "aws.secret.access.key" + S3_SSE_CUSTOMER_KEY: Final[str] = "s3.sse.customer.key" + S3_PROXY_PASSWORD: Final[str] = "s3.proxy.password" + + # MongoDB configuration + + # Transform configuration + TRANSFORMS: Final[str] = "transforms" + + # Authentication configuration + VALUE_CONVERTER_BASIC_AUTH_USER_INFO: Final[str] = ( + "value.converter.basic.auth.user.info" + ) + + +def parse_comma_separated_list(value: str) -> List[str]: + """ + Safely parse a comma-separated list with robust error handling. 
+ + Args: + value: Comma-separated string to parse + + Returns: + List of non-empty stripped items + + Handles edge cases: + - Empty/None values + - Leading/trailing commas + - Multiple consecutive commas + - Whitespace-only items + """ + if not value or not value.strip(): + return [] + + # Split on comma and clean up each item + items = [] + for item in value.split(","): + cleaned_item = item.strip() + if cleaned_item: # Only add non-empty items + items.append(cleaned_item) + + return items diff --git a/metadata-ingestion/src/datahub/ingestion/source/kafka_connect/connector_registry.py b/metadata-ingestion/src/datahub/ingestion/source/kafka_connect/connector_registry.py index b33e176a6a8672..1d9e67a1381923 100644 --- a/metadata-ingestion/src/datahub/ingestion/source/kafka_connect/connector_registry.py +++ b/metadata-ingestion/src/datahub/ingestion/source/kafka_connect/connector_registry.py @@ -5,7 +5,7 @@ """ import logging -from typing import Dict, List, Optional +from typing import TYPE_CHECKING, List, Optional from datahub.ingestion.source.kafka_connect.common import ( CLOUD_JDBC_SOURCE_CLASSES, @@ -13,6 +13,7 @@ POSTGRES_SINK_CLOUD, SINK, SNOWFLAKE_SINK_CLOUD, + SNOWFLAKE_SOURCE_CLOUD, SOURCE, BaseConnector, ConnectorManifest, @@ -20,8 +21,13 @@ KafkaConnectLineage, KafkaConnectSourceConfig, KafkaConnectSourceReport, + get_platform_instance, ) +if TYPE_CHECKING: + from datahub.ingestion.api.common import PipelineContext + from datahub.sql_parsing.schema_resolver import SchemaResolver + logger = logging.getLogger(__name__) @@ -33,11 +39,76 @@ class ConnectorRegistry: corresponding lineage extraction implementations. """ + @staticmethod + def create_schema_resolver( + ctx: Optional["PipelineContext"], + config: KafkaConnectSourceConfig, + connector: BaseConnector, + ) -> Optional["SchemaResolver"]: + """ + Create SchemaResolver for enhanced lineage extraction if enabled. + + Args: + ctx: Pipeline context (contains graph connection) + config: Kafka Connect source configuration + connector: Connector instance to get platform from + + Returns: + SchemaResolver instance if feature is enabled and graph is available, None otherwise + """ + if not config.use_schema_resolver: + return None + + if not ctx: + logger.debug( + f"SchemaResolver not available for connector {connector.connector_manifest.name}: " + "PipelineContext is None" + ) + return None + + if not ctx.graph: + logger.warning( + f"SchemaResolver not available for connector {connector.connector_manifest.name}: " + "DataHub graph connection is not available. Make sure the ingestion is running with " + "a valid DataHub connection (datahub_api or sink configuration)." + ) + return None + + try: + from datahub.sql_parsing.schema_resolver import SchemaResolver + + # Get platform from connector instance (single source of truth) + platform = connector.get_platform() + + # Get platform instance if configured + platform_instance = get_platform_instance( + config, connector.connector_manifest.name, platform + ) + + logger.info( + f"Creating SchemaResolver for connector {connector.connector_manifest.name} " + f"with platform={platform}, platform_instance={platform_instance}" + ) + + return SchemaResolver( + platform=platform, + platform_instance=platform_instance, + env=config.env, + graph=ctx.graph, + ) + except Exception as e: + logger.warning( + f"Failed to create SchemaResolver for connector {connector.connector_manifest.name}: {e}. " + "Falling back to standard lineage extraction." 
+ ) + return None + @staticmethod def get_connector_for_manifest( manifest: ConnectorManifest, config: KafkaConnectSourceConfig, report: KafkaConnectSourceReport, + ctx: Optional["PipelineContext"] = None, ) -> Optional[BaseConnector]: """ Get the appropriate connector instance for a manifest. @@ -46,23 +117,52 @@ def get_connector_for_manifest( manifest: The connector manifest config: DataHub configuration report: Ingestion report + ctx: Pipeline context (optional, for schema resolver) Returns: Connector instance or None if no handler found """ connector_class_value = manifest.config.get("connector.class", "") - # Determine connector type based on manifest type + logger.info( + f"Processing connector '{manifest.name}' - type={manifest.type}, class={connector_class_value}" + ) + + # Create connector instance first if manifest.type == SOURCE: - return ConnectorRegistry._get_source_connector( + connector = ConnectorRegistry._get_source_connector( connector_class_value, manifest, config, report ) elif manifest.type == SINK: - return ConnectorRegistry._get_sink_connector( + connector = ConnectorRegistry._get_sink_connector( connector_class_value, manifest, config, report ) + else: + logger.warning( + f"Unknown connector type '{manifest.type}' for connector '{manifest.name}'" + ) + return None - return None + if connector: + # Log which handler was selected + handler_name = connector.__class__.__name__ + platform = connector.get_platform() + logger.info( + f"Connector '{manifest.name}' will be handled by {handler_name} (platform={platform})" + ) + + # Create and attach schema resolver using connector's platform + schema_resolver = ConnectorRegistry.create_schema_resolver( + ctx, config, connector + ) + if schema_resolver: + connector.schema_resolver = schema_resolver + else: + logger.debug( + f"No handler found for connector '{manifest.name}' with class '{connector_class_value}'" + ) + + return connector @staticmethod def _get_source_connector( @@ -79,11 +179,15 @@ def _get_source_connector( ConfluentJDBCSourceConnector, DebeziumSourceConnector, MongoSourceConnector, + SnowflakeSourceConnector, ) # Traditional JDBC source connector if connector_class_value == JDBC_SOURCE_CONNECTOR_CLASS: return ConfluentJDBCSourceConnector(manifest, config, report) + # Snowflake Source connector (Confluent Cloud managed) + elif connector_class_value == SNOWFLAKE_SOURCE_CLOUD: + return SnowflakeSourceConnector(manifest, config, report) # Cloud CDC connectors (use Debezium-style naming) elif ( connector_class_value in CLOUD_JDBC_SOURCE_CLASSES @@ -142,46 +246,36 @@ def _get_sink_connector( return None - @staticmethod - def extract_lineages( - manifest: ConnectorManifest, - config: KafkaConnectSourceConfig, - report: KafkaConnectSourceReport, - ) -> List[KafkaConnectLineage]: - """Extract lineages using the appropriate connector.""" - connector = ConnectorRegistry.get_connector_for_manifest( - manifest, config, report - ) - if connector: - return connector.extract_lineages() - return [] - - @staticmethod - def extract_flow_property_bag( - manifest: ConnectorManifest, - config: KafkaConnectSourceConfig, - report: KafkaConnectSourceReport, - ) -> Optional[Dict[str, str]]: - """Extract flow property bag using the appropriate connector.""" - connector = ConnectorRegistry.get_connector_for_manifest( - manifest, config, report - ) - if connector: - return connector.extract_flow_property_bag() - return None - @staticmethod def get_topics_from_config( manifest: ConnectorManifest, config: KafkaConnectSourceConfig, report: 
KafkaConnectSourceReport, + ctx: Optional["PipelineContext"] = None, ) -> List[str]: """Extract topics from config using the appropriate connector.""" + logger.debug( + f"get_topics_from_config called for connector '{manifest.name}' " + f"(type={manifest.type}, class={manifest.config.get('connector.class', 'unknown')})" + ) + connector = ConnectorRegistry.get_connector_for_manifest( - manifest, config, report + manifest, config, report, ctx ) if connector: - return connector.get_topics_from_config() + logger.debug( + f"Calling get_topics_from_config on {connector.__class__.__name__} for '{manifest.name}'" + ) + topics = connector.get_topics_from_config() + logger.info( + f"get_topics_from_config returned {len(topics)} topics for connector '{manifest.name}': " + f"{topics[:10] if topics else '[]'}" + ) + return topics + else: + logger.warning( + f"No connector handler found for '{manifest.name}' - cannot derive topics from config" + ) return [] @@ -236,7 +330,6 @@ def supports_connector_class(connector_class: str) -> bool: """Generic connector supports any unknown class.""" return True - @staticmethod - def get_platform(connector_class: str) -> str: + def get_platform(self) -> str: """Generic connectors have configurable platforms.""" return "unknown" diff --git a/metadata-ingestion/src/datahub/ingestion/source/kafka_connect/kafka_connect.py b/metadata-ingestion/src/datahub/ingestion/source/kafka_connect/kafka_connect.py index ffe2fc3d3e18a1..c7ebdd67824e39 100644 --- a/metadata-ingestion/src/datahub/ingestion/source/kafka_connect/kafka_connect.py +++ b/metadata-ingestion/src/datahub/ingestion/source/kafka_connect/kafka_connect.py @@ -175,7 +175,7 @@ def extract_connector_lineages(self, connector_manifest: ConnectorManifest) -> b # Try to get a connector handler from the registry connector = ConnectorRegistry.get_connector_for_manifest( - connector_manifest, self.config, self.report + connector_manifest, self.config, self.report, self.ctx ) if not connector: @@ -729,7 +729,7 @@ def _get_topics_from_connector_config( ) return ConnectorRegistry.get_topics_from_config( - connector_manifest, self.config, self.report + connector_manifest, self.config, self.report, self.ctx ) # Note: _get_topic_fields_for_connector and get_platform_from_connector_class removed @@ -857,11 +857,25 @@ def construct_job_workunits( ), ).as_workunit() + # Convert fine-grained lineage dictionaries to proper class instances + fine_grained_lineages_typed = None + if lineage.fine_grained_lineages: + fine_grained_lineages_typed = [ + models.FineGrainedLineageClass( + upstreamType=fg["upstreamType"], + downstreamType=fg["downstreamType"], + upstreams=fg.get("upstreams"), + downstreams=fg.get("downstreams"), + ) + for fg in lineage.fine_grained_lineages + ] + yield MetadataChangeProposalWrapper( entityUrn=job_urn, aspect=models.DataJobInputOutputClass( inputDatasets=inlets, outputDatasets=outlets, + fineGrainedLineages=fine_grained_lineages_typed, ), ).as_workunit() diff --git a/metadata-ingestion/src/datahub/ingestion/source/kafka_connect/pattern_matchers.py b/metadata-ingestion/src/datahub/ingestion/source/kafka_connect/pattern_matchers.py new file mode 100644 index 00000000000000..3e08a36bb69fd6 --- /dev/null +++ b/metadata-ingestion/src/datahub/ingestion/source/kafka_connect/pattern_matchers.py @@ -0,0 +1,195 @@ +"""Pattern matching utilities for Kafka Connect connectors. + +This module provides pluggable pattern matching for different connector types. +Each connector type (Debezium, Snowflake, etc.) 
uses a different pattern syntax, +so this abstraction allows each to implement its own matcher while sharing +common filtering logic. +""" + +import fnmatch +import logging +from typing import List, Protocol + +logger = logging.getLogger(__name__) + + +class PatternMatcher(Protocol): + """Protocol for pattern matching strategies. + + Different Kafka Connect connectors use different pattern syntaxes: + - Debezium connectors: Full Java regex (character classes, quantifiers, etc.) + - Snowflake connectors: Simple wildcards (* and ?) + + This protocol allows each connector to provide its own implementation. + """ + + def matches(self, pattern: str, text: str) -> bool: + """Check if text matches the given pattern. + + Args: + pattern: The pattern to match against (syntax depends on implementation) + text: The text to check for a match + + Returns: + True if text matches pattern, False otherwise + """ + ... + + def filter_matches(self, patterns: List[str], texts: List[str]) -> List[str]: + """Filter a list of texts to only those matching at least one pattern. + + Args: + patterns: List of patterns to match against + texts: List of texts to filter + + Returns: + List of texts that match at least one pattern + """ + ... + + +class JavaRegexMatcher: + """Pattern matcher using Java regex syntax. + + Used by Debezium connectors (PostgresCDCV2, etc.) which rely on Java's + java.util.regex.Pattern for table.include.list and table.exclude.list. + + Java regex supports full regex features: + - Character classes: [abc], [^abc], [a-zA-Z] + - Quantifiers: *, +, ?, {n}, {n,}, {n,m} + - Alternation: (pattern1|pattern2) + - Anchors: ^, $ + - Escaping: \\. for literal dots, etc. + + Examples: + "public\\.(users|orders).*" matches tables starting with users or orders in public schema + ".*\\.fact_.*" matches all fact_ tables in any schema + "reporting\\.dim_[0-9]+" matches dim_1, dim_2, etc. in reporting schema + + Note: Requires JPype and Java runtime. If Java is unavailable, matching will fail + and log warnings rather than silently producing incorrect results. + """ + + def matches(self, pattern: str, text: str) -> bool: + """Check if text matches Java regex pattern. + + Args: + pattern: Java regex pattern + text: Text to match against + + Returns: + True if text matches pattern, False if no match or Java unavailable + """ + try: + from java.util.regex import Pattern as JavaPattern + + regex_pattern = JavaPattern.compile(pattern) + return bool(regex_pattern.matcher(text).matches()) + + except (ImportError, RuntimeError) as e: + logger.warning( + f"Java regex library not available for pattern matching: {e}. " + f"Cannot match pattern '{pattern}' against '{text}'. " + f"Debezium uses Java regex and Python regex is not compatible." + ) + return False + except Exception as e: + logger.warning( + f"Failed to compile or match Java regex pattern '{pattern}': {e}" + ) + return False + + def filter_matches(self, patterns: List[str], texts: List[str]) -> List[str]: + """Filter texts by matching against Java regex patterns. + + Args: + patterns: List of Java regex patterns + texts: List of texts to filter + + Returns: + List of texts matching at least one pattern, or empty list if Java unavailable + """ + try: + from java.util.regex import Pattern as JavaPattern + except (ImportError, RuntimeError) as e: + logger.warning( + f"Java regex library not available for pattern matching: {e}. " + f"Cannot filter texts using patterns {patterns}. " + f"Debezium uses Java regex and Python regex is not compatible. 
" + f"Returning empty list." + ) + return [] + + matched_texts = [] + + for pattern in patterns: + try: + regex_pattern = JavaPattern.compile(pattern) + + for text in texts: + if ( + regex_pattern.matcher(text).matches() + and text not in matched_texts + ): + matched_texts.append(text) + + except Exception as e: + logger.warning( + f"Failed to compile or match Java regex pattern '{pattern}': {e}" + ) + + return matched_texts + + +class WildcardMatcher: + """Pattern matcher using simple wildcard syntax. + + Used by Snowflake Source connectors which use simple shell-style wildcards + for table.include.list and table.exclude.list. + + Supported wildcards: + - "*" matches any sequence of characters (zero or more) + - "?" matches any single character + + This is MUCH SIMPLER than Java regex - no character classes, quantifiers, + or alternation. Just basic wildcards. + + Examples: + "ANALYTICS.PUBLIC.*" matches all tables in ANALYTICS.PUBLIC schema + "*.PUBLIC.TABLE1" matches TABLE1 in PUBLIC schema across all databases + "DB.SCHEMA.USER?" matches USER1, USERS, etc. + + Implementation uses Python's fnmatch module which provides shell-style + wildcard matching without full regex complexity. + """ + + def matches(self, pattern: str, text: str) -> bool: + """Check if text matches wildcard pattern. + + Args: + pattern: Wildcard pattern using * and ? + text: Text to match against + + Returns: + True if text matches pattern, False otherwise + """ + return fnmatch.fnmatch(text, pattern) + + def filter_matches(self, patterns: List[str], texts: List[str]) -> List[str]: + """Filter texts by matching against wildcard patterns. + + Args: + patterns: List of wildcard patterns + texts: List of texts to filter + + Returns: + List of texts matching at least one pattern + """ + matched_texts = [] + + for pattern in patterns: + for text in texts: + if fnmatch.fnmatch(text, pattern) and text not in matched_texts: + matched_texts.append(text) + + return matched_texts diff --git a/metadata-ingestion/src/datahub/ingestion/source/kafka_connect/sink_connectors.py b/metadata-ingestion/src/datahub/ingestion/source/kafka_connect/sink_connectors.py index 1d62b6d924152d..c0210de879e905 100644 --- a/metadata-ingestion/src/datahub/ingestion/source/kafka_connect/sink_connectors.py +++ b/metadata-ingestion/src/datahub/ingestion/source/kafka_connect/sink_connectors.py @@ -8,17 +8,19 @@ from datahub.ingestion.source.kafka_connect.common import ( KAFKA, BaseConnector, - ConnectorConfigKeys, ConnectorManifest, KafkaConnectLineage, KafkaConnectSourceConfig, KafkaConnectSourceReport, get_dataset_name, has_three_level_hierarchy, - parse_comma_separated_list, remove_prefix, validate_jdbc_url, ) +from datahub.ingestion.source.kafka_connect.config_constants import ( + ConnectorConfigKeys, + parse_comma_separated_list, +) from datahub.ingestion.source.kafka_connect.transform_plugins import ( get_transform_pipeline, ) @@ -57,14 +59,31 @@ def _get_parser(self, connector_manifest: ConnectorManifest) -> S3SinkParser: ) def get_topics_from_config(self) -> List[str]: - """Extract topics from S3 sink connector configuration.""" + """ + Extract topics from S3 sink connector configuration. 
+ + Supports both explicit topic lists and regex patterns: + - topics: Comma-separated list of topic names + - topics.regex: Java regex pattern to match topics dynamically + """ config = self.connector_manifest.config - # S3 sink connectors use 'topics' field + # Priority 1: Explicit 'topics' field topics = config.get(ConnectorConfigKeys.TOPICS, "") if topics: return parse_comma_separated_list(topics) + # Priority 2: 'topics.regex' pattern + topics_regex = config.get(ConnectorConfigKeys.TOPICS_REGEX, "") + if topics_regex: + # Expand pattern using available sources + return self._expand_topic_regex_patterns( + topics_regex, + available_topics=self.connector_manifest.topic_names + if self.connector_manifest.topic_names + else None, + ) + return [] def extract_flow_property_bag(self) -> Dict[str, str]: @@ -139,6 +158,10 @@ def extract_lineages(self) -> List[KafkaConnectLineage]: return [] + def get_platform(self) -> str: + """Get the platform for S3 Sink connector.""" + return "s3" + @dataclass class SnowflakeSinkConnector(BaseConnector): @@ -216,14 +239,31 @@ def get_parser( ) def get_topics_from_config(self) -> List[str]: - """Extract topics from Snowflake sink connector configuration.""" + """ + Extract topics from Snowflake sink connector configuration. + + Supports both explicit topic lists and regex patterns: + - topics: Comma-separated list of topic names + - topics.regex: Java regex pattern to match topics dynamically + """ config = self.connector_manifest.config - # Snowflake sink connectors use 'topics' field + # Priority 1: Explicit 'topics' field topics = config.get(ConnectorConfigKeys.TOPICS, "") if topics: return parse_comma_separated_list(topics) + # Priority 2: 'topics.regex' pattern + topics_regex = config.get(ConnectorConfigKeys.TOPICS_REGEX, "") + if topics_regex: + # Expand pattern using available sources + return self._expand_topic_regex_patterns( + topics_regex, + available_topics=self.connector_manifest.topic_names + if self.connector_manifest.topic_names + else None, + ) + return [] def extract_flow_property_bag(self) -> Dict[str, str]: @@ -251,17 +291,31 @@ def extract_lineages(self) -> List[KafkaConnectLineage]: for topic, table in parser.topics_to_tables.items(): target_dataset: str = f"{parser.database_name}.{parser.schema_name}.{table}" + + # Extract column-level lineage if enabled (uses base class method) + fine_grained = self._extract_fine_grained_lineage( + source_dataset=topic, + source_platform=KAFKA, + target_dataset=target_dataset, + target_platform="snowflake", + ) + lineages.append( KafkaConnectLineage( source_dataset=topic, source_platform=KAFKA, target_dataset=target_dataset, target_platform="snowflake", + fine_grained_lineages=fine_grained, ) ) return lineages + def get_platform(self) -> str: + """Get the platform for Snowflake Sink connector.""" + return "snowflake" + @dataclass class BigQuerySinkConnector(BaseConnector): @@ -350,12 +404,15 @@ def get_list(self, property: str) -> Iterable[Tuple[str, str]]: logger.warning(f"Failed to parse mapping entry '{entry}': {e}") def get_dataset_for_topic_v1(self, topic: str, parser: BQParser) -> Optional[str]: + from datahub.ingestion.source.kafka_connect.pattern_matchers import ( + JavaRegexMatcher, + ) + topicregex_dataset_map: Dict[str, str] = dict(self.get_list(parser.datasets)) # type: ignore - from java.util.regex import Pattern + matcher = JavaRegexMatcher() for pattern, dataset in topicregex_dataset_map.items(): - patternMatcher = Pattern.compile(pattern).matcher(topic) - if patternMatcher.matches(): + 
if matcher.matches(pattern, topic): return dataset return None @@ -387,14 +444,17 @@ def get_dataset_table_for_topic( table = topic if parser.topicsToTables: + from datahub.ingestion.source.kafka_connect.pattern_matchers import ( + JavaRegexMatcher, + ) + topicregex_table_map: Dict[str, str] = dict( self.get_list(parser.topicsToTables) # type: ignore ) - from java.util.regex import Pattern + matcher = JavaRegexMatcher() for pattern, tbl in topicregex_table_map.items(): - patternMatcher = Pattern.compile(pattern).matcher(topic) - if patternMatcher.matches(): + if matcher.matches(pattern, topic): table = tbl break @@ -403,14 +463,31 @@ def get_dataset_table_for_topic( return f"{dataset}.{table}" def get_topics_from_config(self) -> List[str]: - """Extract topics from BigQuery sink connector configuration.""" + """ + Extract topics from BigQuery sink connector configuration. + + Supports both explicit topic lists and regex patterns: + - topics: Comma-separated list of topic names + - topics.regex: Java regex pattern to match topics dynamically + """ config = self.connector_manifest.config - # BigQuery sink connectors use 'topics' field + # Priority 1: Explicit 'topics' field topics = config.get(ConnectorConfigKeys.TOPICS, "") if topics: return parse_comma_separated_list(topics) + # Priority 2: 'topics.regex' pattern + topics_regex = config.get(ConnectorConfigKeys.TOPICS_REGEX, "") + if topics_regex: + # Expand pattern using available sources + return self._expand_topic_regex_patterns( + topics_regex, + available_topics=self.connector_manifest.topic_names + if self.connector_manifest.topic_names + else None, + ) + return [] def extract_flow_property_bag(self) -> Dict[str, str]: @@ -465,16 +542,29 @@ def extract_lineages(self) -> List[KafkaConnectLineage]: continue target_dataset: str = f"{project}.{dataset_table}" + # Extract column-level lineage if enabled (uses base class method) + fine_grained = self._extract_fine_grained_lineage( + source_dataset=original_topic, + source_platform=KAFKA, + target_dataset=target_dataset, + target_platform=target_platform, + ) + lineages.append( KafkaConnectLineage( source_dataset=original_topic, # Keep original topic as source source_platform=KAFKA, target_dataset=target_dataset, target_platform=target_platform, + fine_grained_lineages=fine_grained, ) ) return lineages + def get_platform(self) -> str: + """Get the platform for BigQuery Sink connector.""" + return "bigquery" + @dataclass class JdbcSinkParser: @@ -764,14 +854,31 @@ def get_table_name_from_topic(self, topic: str, table_format: str) -> str: return table_format def get_topics_from_config(self) -> List[str]: - """Extract topics from JDBC sink connector configuration.""" + """ + Extract topics from JDBC sink connector configuration. 
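The same `topics` / `topics.regex` priority recurs in every sink connector above. Below is a simplified standalone stand-in for that logic; the real code delegates regex expansion to the base-class helper `_expand_topic_regex_patterns` (not shown in this hunk), and Python's `re.fullmatch` is only an approximation of Java regex that holds for simple patterns.

```python
import re
from typing import Dict, List, Optional


def resolve_sink_topics(
    config: Dict[str, str], available_topics: Optional[List[str]]
) -> List[str]:
    # Priority 1: explicit comma-separated list
    explicit = config.get("topics", "")
    if explicit:
        return [t.strip() for t in explicit.split(",") if t.strip()]

    # Priority 2: expand topics.regex against whatever topic names are known
    pattern = config.get("topics.regex", "")
    if pattern and available_topics:
        regex = re.compile(pattern)
        return [t for t in available_topics if regex.fullmatch(t)]

    return []  # pattern present but nothing to expand it against


print(resolve_sink_topics({"topics": "orders, users"}, None))
print(resolve_sink_topics({"topics.regex": r"analytics\..*"}, ["analytics.users", "web.clicks"]))
```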
+ + Supports both explicit topic lists and regex patterns: + - topics: Comma-separated list of topic names + - topics.regex: Java regex pattern to match topics dynamically + """ config = self.connector_manifest.config - # JDBC sink connectors use 'topics' field + # Priority 1: Explicit 'topics' field topics = config.get(ConnectorConfigKeys.TOPICS, "") if topics: return parse_comma_separated_list(topics) + # Priority 2: 'topics.regex' pattern + topics_regex = config.get(ConnectorConfigKeys.TOPICS_REGEX, "") + if topics_regex: + # Expand pattern using available sources + return self._expand_topic_regex_patterns( + topics_regex, + available_topics=self.connector_manifest.topic_names + if self.connector_manifest.topic_names + else None, + ) + return [] def extract_flow_property_bag(self) -> Dict[str, str]: @@ -881,12 +988,21 @@ def extract_lineages(self) -> List[KafkaConnectLineage]: # Platform doesn't use schemas: database.table target_dataset = get_dataset_name(parser.database_name, table_name) + # Extract column-level lineage if enabled (uses base class method) + fine_grained = self._extract_fine_grained_lineage( + source_dataset=original_topic, + source_platform=KAFKA, + target_dataset=target_dataset, + target_platform=parser.target_platform, + ) + lineages.append( KafkaConnectLineage( source_dataset=original_topic, source_platform=KAFKA, target_dataset=target_dataset, target_platform=parser.target_platform, + fine_grained_lineages=fine_grained, ) ) @@ -911,6 +1027,10 @@ def extract_lineages(self) -> List[KafkaConnectLineage]: ) return [] + def get_platform(self) -> str: + """Get the platform for JDBC Sink connector.""" + return self.platform + BIGQUERY_SINK_CONNECTOR_CLASS: Final[str] = ( "com.wepay.kafka.connect.bigquery.BigQuerySinkConnector" diff --git a/metadata-ingestion/src/datahub/ingestion/source/kafka_connect/source_connectors.py b/metadata-ingestion/src/datahub/ingestion/source/kafka_connect/source_connectors.py index 7c732c35d10f34..34aa6f0a67da52 100644 --- a/metadata-ingestion/src/datahub/ingestion/source/kafka_connect/source_connectors.py +++ b/metadata-ingestion/src/datahub/ingestion/source/kafka_connect/source_connectors.py @@ -1,7 +1,7 @@ import logging import re -from dataclasses import dataclass -from typing import Dict, Final, Iterable, List, Optional, Tuple +from dataclasses import dataclass, field +from typing import Any, Dict, Final, Iterable, List, Optional, Tuple from sqlalchemy.engine.url import make_url @@ -12,6 +12,7 @@ KAFKA, KNOWN_TOPIC_ROUTING_TRANSFORMS, REGEXROUTER_TRANSFORM, + SNOWFLAKE_SOURCE_CLOUD, BaseConnector, ConnectorManifest, KafkaConnectLineage, @@ -22,6 +23,10 @@ unquote, validate_jdbc_url, ) +from datahub.ingestion.source.kafka_connect.pattern_matchers import ( + JavaRegexMatcher, + PatternMatcher, +) from datahub.ingestion.source.kafka_connect.transform_plugins import ( get_transform_pipeline, ) @@ -333,11 +338,20 @@ def _create_lineage_mapping( """Create a single lineage mapping from source table to topic.""" dataset_name = get_dataset_name(database_name, source_table) + # Extract column-level lineage if enabled (uses base class method) + fine_grained = self._extract_fine_grained_lineage( + source_dataset=dataset_name, + source_platform=source_platform, + target_dataset=topic, + target_platform=KAFKA, + ) + return KafkaConnectLineage( source_dataset=dataset_name if include_dataset else None, source_platform=source_platform, target_dataset=topic, target_platform=KAFKA, + fine_grained_lineages=fine_grained, ) def get_table_names(self) -> 
List[TableId]: @@ -1055,14 +1069,8 @@ def _derive_topics_from_config(self) -> List[str]: """Extract topics directly from connector configuration - most reliable approach.""" config = self.connector_manifest.config - # Use the connector registry for configuration-based topic derivation - from datahub.ingestion.source.kafka_connect.connector_registry import ( - ConnectorRegistry, - ) - - config_topics = ConnectorRegistry.get_topics_from_config( - self.connector_manifest, self.config, self.report - ) + # Call own get_topics_from_config method directly to avoid creating new instance + config_topics = self.get_topics_from_config() if config_topics: # Apply predictable transforms to get final topic names @@ -1490,6 +1498,525 @@ def _find_topics_by_advanced_prefix_patterns( return matching_topics + def get_platform(self) -> str: + """ + Get platform for JDBC connector. + + JDBC connectors can connect to multiple databases, so platform is inferred from + the connection URL in the connector configuration. + """ + try: + parser = self.get_parser(self.connector_manifest) + return parser.source_platform + except Exception as e: + logger.debug(f"Failed to get platform from parser: {e}") + # If parser fails, try to infer from JDBC URL directly + jdbc_url = self.connector_manifest.config.get("connection.url", "") + if jdbc_url: + return self._extract_platform_from_jdbc_url(jdbc_url) + return "unknown" + + +@dataclass +class SnowflakeSourceConnector(BaseConnector): + """ + Confluent Cloud Snowflake Source Connector. + + Reference: https://docs.confluent.io/cloud/current/connectors/cc-snowflake-source.html + + This connector uses JDBC-style polling (not CDC) to read from Snowflake tables. + Topic naming: + """ + + _cached_expanded_tables: Optional[List[str]] = field(default=None, init=False) + + @dataclass + class SnowflakeSourceParser: + source_platform: str + database_name: Optional[str] + topic_prefix: str + table_names: List[str] + transforms: List[Dict[str, str]] + + def get_parser( + self, + connector_manifest: ConnectorManifest, + ) -> SnowflakeSourceParser: + """Parse Snowflake Source connector configuration.""" + config = connector_manifest.config + + # Extract table names from table.include.list + table_config = config.get("table.include.list") or config.get( + "table.whitelist", "" + ) + table_names = parse_comma_separated_list(table_config) if table_config else [] + + # Extract database name from connection.url or snowflake.database.name + database_name = config.get("snowflake.database.name") or config.get( + "database.name" + ) + + # Topic prefix (used in topic naming pattern) + topic_prefix = config.get("topic.prefix", "") + + # Parse transforms + transforms_config = config.get("transforms", "") + transform_names = ( + parse_comma_separated_list(transforms_config) if transforms_config else [] + ) + + transforms = [] + for name in transform_names: + transform = {"name": name} + transforms.append(transform) + for key in config: + if key.startswith(f"transforms.{name}."): + transform[key.replace(f"transforms.{name}.", "")] = config[key] + + parser = self.SnowflakeSourceParser( + source_platform="snowflake", + database_name=database_name, + topic_prefix=topic_prefix, + table_names=table_names, + transforms=transforms, + ) + + return parser + + def get_topics_from_config(self) -> List[str]: + """ + Extract expected topics from Snowflake Source connector configuration. 
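A rough sketch of the Snowflake Source topic naming used by the parser above: the expected topic is `topic.prefix` followed by the lowercased `database.schema.table` identifier (transforms are applied afterwards in the real code). The prefix and table names are invented.

```python
from typing import List


def derive_snowflake_topics(topic_prefix: str, tables: List[str]) -> List[str]:
    # Lowercase to line up with DataHub's normalization of Snowflake names
    normalized = [t.lower() for t in tables]
    return [f"{topic_prefix}{t}" if topic_prefix else t for t in normalized]


print(derive_snowflake_topics("snowflake.", ["ANALYTICS.PUBLIC.USERS", "ANALYTICS.PUBLIC.ORDERS"]))
# -> ['snowflake.analytics.public.users', 'snowflake.analytics.public.orders']
```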
+ + This method performs pattern expansion early so that the manifest's topic_names + contains the actual expanded topics rather than patterns. This allows the + subsequent lineage extraction to correctly match topics. + """ + try: + parser = self.get_parser(self.connector_manifest) + topic_prefix = parser.topic_prefix + table_names = parser.table_names + + # Check if any table names contain patterns + has_patterns = any(self._is_pattern(table) for table in table_names) + + # If patterns exist, expand them using DataHub schema resolver + if has_patterns: + if self.schema_resolver: + logger.info( + f"Expanding table patterns in get_topics_from_config for connector '{self.connector_manifest.name}'" + ) + expanded_tables = self._expand_table_patterns( + table_names, parser.source_platform, parser.database_name + ) + if expanded_tables: + # Cache expanded tables for reuse in extract_lineages + self._cached_expanded_tables = expanded_tables + table_names = expanded_tables + logger.info( + f"Expanded patterns to {len(expanded_tables)} tables for topic derivation" + ) + else: + logger.warning( + f"No tables found matching patterns for connector '{self.connector_manifest.name}'" + ) + # Cache empty list to signal that expansion was attempted but found nothing + self._cached_expanded_tables = [] + return [] + else: + # Patterns detected but no schema resolver - cannot expand + logger.warning( + f"Table patterns detected for connector '{self.connector_manifest.name}' " + f"but schema resolver is not available. Cannot derive topics from patterns." + ) + # Don't cache anything - let extract_lineages handle the warning + return [] + else: + # No patterns - just lowercase explicit table names to match DataHub normalization + table_names = [table.lower() for table in table_names] + # Cache the lowercased explicit tables + self._cached_expanded_tables = table_names + + # Snowflake Source topics follow pattern: {topic_prefix}{database.schema.table} + topics = [] + for table_name in table_names: + # Topic name is prefix + full table identifier + if topic_prefix: + topic_name = f"{topic_prefix}{table_name}" + else: + topic_name = table_name + topics.append(topic_name) + + # Apply transforms if configured + if parser.transforms: + logger.debug( + f"Applying {len(parser.transforms)} transforms to {len(topics)} derived topics for connector '{self.connector_manifest.name}'" + ) + result = get_transform_pipeline().apply_forward( + topics, self.connector_manifest.config + ) + if result.warnings: + for warning in result.warnings: + logger.warning( + f"Transform warning for {self.connector_manifest.name}: {warning}" + ) + topics = result.topics + logger.info( + f"Topics after transforms for '{self.connector_manifest.name}': {topics[:10] if len(topics) <= 10 else f'{topics[:10]}... ({len(topics)} total)'}" + ) + + return topics + except Exception as e: + logger.debug( + f"Failed to derive topics from Snowflake Source connector config: {e}" + ) + return [] + + def _is_pattern(self, table_name: str) -> bool: + """ + Check if table name contains wildcard pattern characters. + + IMPORTANT: Snowflake Source connector uses SIMPLE WILDCARD MATCHING, not Java regex. + Supported wildcards: + - "*" matches any sequence of characters (zero or more) + - "?" matches any single character + + This is DIFFERENT from Debezium connectors which use full Java regex. 
+ + Examples: + - "ANALYTICS.PUBLIC.*" matches all tables in ANALYTICS.PUBLIC schema + - "*.PUBLIC.TABLE1" matches TABLE1 in PUBLIC schema across all databases + - "DB.SCHEMA.USER?" matches USER1, USERS, etc. + + Note: Without DataHub schema resolver, we cannot expand these patterns. + """ + # Check for wildcard characters (simple patterns only, not full regex) + # We check for all regex chars to detect if user accidentally used Java regex syntax + pattern_chars = [ + "*", + "+", + "?", + "[", + "]", + "(", + ")", + "|", + "{", + "}", + "^", + "$", + "\\", + ] + return any(char in table_name for char in pattern_chars) + + def _expand_table_patterns( + self, + table_patterns: List[str], + source_platform: str, + database_name: Optional[str], + ) -> List[str]: + """ + Expand table patterns using DataHub schema metadata. + + Examples: + - "ANALYTICS.PUBLIC.*" → ["ANALYTICS.PUBLIC.USERS", "ANALYTICS.PUBLIC.ORDERS", ...] + - "*.PUBLIC.TABLE1" → ["DB1.PUBLIC.TABLE1", "DB2.PUBLIC.TABLE1", ...] + - "ANALYTICS.PUBLIC.USERS" → ["ANALYTICS.PUBLIC.USERS"] (no expansion) + + Args: + table_patterns: List of table patterns from connector config + source_platform: Source platform (should be 'snowflake') + database_name: Database name for context (optional) + + Returns: + List of fully expanded table names + """ + if not self.schema_resolver: + logger.warning( + f"SchemaResolver not available for connector {self.connector_manifest.name} - cannot expand patterns" + ) + return [] + + expanded_tables = [] + + for pattern in table_patterns: + # Check if pattern needs expansion (contains regex special characters) + if self._is_pattern(pattern): + logger.info( + f"Expanding pattern '{pattern}' using DataHub schema metadata" + ) + tables = self._query_tables_from_datahub( + pattern, source_platform, database_name + ) + if tables: + logger.info( + f"Pattern expansion: '{pattern}' -> {len(tables)} tables found" + ) + logger.debug(f"Expanded tables: {tables}") + expanded_tables.extend(tables) + else: + logger.warning( + f"Pattern '{pattern}' did not match any tables in DataHub" + ) + else: + # Already explicit table name - no expansion needed + # Lowercase to match DataHub's normalization + expanded_tables.append(pattern.lower()) + + return expanded_tables + + def _query_tables_from_datahub( + self, + pattern: str, + platform: str, + database: Optional[str], + ) -> List[str]: + """ + Query DataHub for Snowflake tables matching the given pattern. + + Args: + pattern: Pattern (e.g., "ANALYTICS.PUBLIC.*", "*.PUBLIC.USERS") + platform: Source platform (should be "snowflake") + database: Database name for context + + Returns: + List of matching table names + """ + if not self.schema_resolver or not self.schema_resolver.graph: + return [] + + try: + # Query DataHub directly for tables matching the platform + # SchemaResolver's cache may be empty, so we use its graph connection directly + all_urns = list( + self.schema_resolver.graph.get_urns_by_filter( + platform=platform, + env=self.schema_resolver.env, + entity_types=["dataset"], + ) + ) + + if not all_urns: + logger.debug( + f"No {platform} datasets found in DataHub for pattern expansion" + ) + return [] + + matched_tables = [] + + # Convert pattern to Python regex + # Snowflake patterns are simpler than Debezium (just basic wildcard matching) + # Convert SQL-style patterns to Python regex: + # - "*" (any characters) → ".*" + # - "?" (single character) → "." 
+ # Note: DataHub normalizes Snowflake table names to lowercase in URNs, + # so we lowercase the pattern to match + normalized_pattern = pattern.lower() + regex_pattern = ( + normalized_pattern.replace(".", r"\.") + .replace("*", ".*") + .replace("?", ".") + ) + regex = re.compile(regex_pattern) + + # TODO: Performance optimization - This loops through ALL datasets in DataHub + # for the platform without filtering. For large DataHub instances with thousands + # of tables, this could be very slow. Consider using graph.get_urns_by_filter() + # with more specific filters or implementing pagination. + for urn in all_urns: + # URN format: urn:li:dataset:(urn:li:dataPlatform:snowflake,database.schema.table,PROD) + table_name = self._extract_table_name_from_urn(urn) + if not table_name: + continue + + # Check if URN is for Snowflake platform + if f"dataplatform:{platform.lower()}" not in urn.lower(): + continue + + # Try pattern match (table_name from DataHub is already lowercase) + if regex.fullmatch(table_name): + matched_tables.append(table_name) + + logger.debug( + f"Pattern '{pattern}' matched {len(matched_tables)} tables from DataHub" + ) + return matched_tables + + except (ConnectionError, TimeoutError) as e: + logger.error(f"Failed to connect to DataHub for pattern '{pattern}': {e}") + if self.report: + self.report.report_failure( + f"datahub_connection_{self.connector_manifest.name}", str(e) + ) + return [] + except Exception as e: + logger.warning( + f"Failed to query tables from DataHub for pattern '{pattern}': {e}", + exc_info=True, + ) + return [] + + def extract_lineages(self) -> List[KafkaConnectLineage]: + """ + Extract lineage mappings from Snowflake tables to Kafka topics. + + This method always uses table.include.list from config as the source of truth. + When manifest.topic_names is available (Kafka API accessible), it filters + lineages to only topics that exist in Kafka. When unavailable (Confluent Cloud, + air-gapped), it creates lineages for all configured tables without validating + that derived topic names actually exist in Kafka. + """ + parser = self.get_parser(self.connector_manifest) + lineages: List[KafkaConnectLineage] = [] + + logging.debug( + f"Extracting lineages for Snowflake Source connector: " + f"platform={parser.source_platform}, database={parser.database_name}" + ) + + # Get table names from config (cached from get_topics_from_config if available) + if self._cached_expanded_tables is not None: + table_names = self._cached_expanded_tables + if not table_names: + logger.debug( + f"Pattern expansion found no matching tables for connector '{self.connector_manifest.name}'" + ) + return [] + logger.debug( + f"Reusing {len(table_names)} cached expanded tables from get_topics_from_config()" + ) + else: + # Expand patterns if not already cached + table_names = parser.table_names + if not table_names: + logger.debug( + "No table.include.list configuration found for Snowflake Source connector" + ) + return [] + + # Check if any table names contain patterns + has_patterns = any(self._is_pattern(table) for table in table_names) + + # If patterns exist but schema resolver is not available, skip processing + if has_patterns and not self.schema_resolver: + self.report.warning( + f"Snowflake Source connector '{self.connector_manifest.name}' has table patterns " + f"in table.include.list but DataHub schema resolver is not available. " + f"Skipping lineage extraction to avoid generating invalid URNs. " + f"Enable 'use_schema_resolver' in config to support pattern expansion." 
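A standalone version of the wildcard-to-regex conversion used just above, under the same assumptions: dots are literal, `*` and `?` are the only wildcards, and everything is lowercased to match DataHub URNs.

```python
import re


def wildcard_to_regex(pattern: str) -> "re.Pattern[str]":
    normalized = pattern.lower()
    escaped = normalized.replace(".", r"\.").replace("*", ".*").replace("?", ".")
    return re.compile(escaped)


regex = wildcard_to_regex("ANALYTICS.PUBLIC.*")
print(bool(regex.fullmatch("analytics.public.users")))  # True
print(bool(regex.fullmatch("raw.public.users")))         # False
```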
+ ) + logger.warning( + f"Skipping lineage extraction for connector '{self.connector_manifest.name}' - " + f"patterns detected but schema resolver unavailable" + ) + return [] + + # If patterns exist and schema resolver is available, expand them + if has_patterns and self.schema_resolver: + logger.info( + f"Expanding table patterns for Snowflake Source connector '{self.connector_manifest.name}'" + ) + table_names = self._expand_table_patterns( + table_names, parser.source_platform, parser.database_name + ) + if not table_names: + logger.warning( + f"No tables found matching patterns for connector '{self.connector_manifest.name}'" + ) + return [] + else: + # No patterns - lowercase explicit table names + table_names = [table.lower() for table in table_names] + + topic_prefix = parser.topic_prefix + has_kafka_topics = bool(self.connector_manifest.topic_names) + + if not has_kafka_topics: + logger.info( + f"Kafka topics API not available for connector '{self.connector_manifest.name}' - " + f"creating lineages for all {len(table_names)} configured tables without validating " + f"that derived topic names actually exist in Kafka" + ) + + # Derive expected topics and apply transforms + for table_name in table_names: + # Build expected base topic name + if topic_prefix: + expected_topic = f"{topic_prefix}{table_name}" + else: + expected_topic = table_name + + # Apply transforms if configured + if parser.transforms: + result = get_transform_pipeline().apply_forward( + [expected_topic], self.connector_manifest.config + ) + if result.warnings: + for warning in result.warnings: + logger.warning( + f"Transform warning for {self.connector_manifest.name}: {warning}" + ) + if result.topics and len(result.topics) > 0: + expected_topic = result.topics[0] + + # Filter by Kafka topics if available + if ( + has_kafka_topics + and expected_topic not in self.connector_manifest.topic_names + ): + logger.debug( + f"Expected topic '{expected_topic}' not found in Kafka - skipping lineage for table '{table_name}'" + ) + continue + + # Extract column-level lineage if enabled + fine_grained = self._extract_fine_grained_lineage( + source_dataset=table_name, + source_platform=parser.source_platform, + target_dataset=expected_topic, + target_platform=KAFKA, + ) + + # Create lineage mapping + lineage = KafkaConnectLineage( + source_dataset=table_name, + source_platform=parser.source_platform, + target_dataset=expected_topic, + target_platform=KAFKA, + fine_grained_lineages=fine_grained, + ) + lineages.append(lineage) + logger.debug(f"Created lineage: {table_name} -> {expected_topic}") + + logger.info( + f"Created {len(lineages)} lineages for Snowflake connector '{self.connector_manifest.name}'" + ) + return lineages + + def extract_flow_property_bag(self) -> Dict[str, str]: + """Extract flow properties, masking sensitive information.""" + flow_property_bag = { + k: v + for k, v in self.connector_manifest.config.items() + if k + not in [ + "connection.password", + "connection.user", + "snowflake.private.key", + "snowflake.private.key.passphrase", + ] + } + + return flow_property_bag + + @staticmethod + def supports_connector_class(connector_class: str) -> bool: + """Check if this connector handles Snowflake Source.""" + return connector_class == SNOWFLAKE_SOURCE_CLOUD + + def get_platform(self) -> str: + """Get the platform for Snowflake Source connector.""" + return "snowflake" + @dataclass class MongoSourceConnector(BaseConnector): @@ -1552,6 +2079,10 @@ def extract_lineages(self) -> List[KafkaConnectLineage]: 
lineages.append(lineage) return lineages + def get_platform(self) -> str: + """Get the platform for Mongo Source connector.""" + return "mongodb" + @dataclass class DebeziumSourceConnector(BaseConnector): @@ -1579,6 +2110,19 @@ class DebeziumParser: source_platform: str server_name: Optional[str] database_name: Optional[str] + transforms: List[Dict[str, str]] + + def get_pattern_matcher(self) -> PatternMatcher: + """ + Get the pattern matcher for this connector type. + + Debezium connectors use Java regex for table.include.list and table.exclude.list, + which provides full regex capabilities including character classes, alternation, + quantifiers, etc. + + This differs from Snowflake Source connectors which use simple wildcard matching. + """ + return JavaRegexMatcher() def get_server_name(self, connector_manifest: ConnectorManifest) -> str: if "topic.prefix" in connector_manifest.config: @@ -1590,10 +2134,8 @@ def get_parser( self, connector_manifest: ConnectorManifest, ) -> DebeziumParser: - connector_class = connector_manifest.config.get(CONNECTOR_CLASS, "") - # Map connector class to platform - platform = self._get_platform_from_connector_class(connector_class) + platform = self.get_platform() # Map handler platform to parser platform (handler uses "sqlserver", parser expects "mssql") parser_platform = "mssql" if platform == "sqlserver" else platform @@ -1603,44 +2145,383 @@ def get_parser( platform, connector_manifest.config ) + # Parse transforms + config = connector_manifest.config + transforms_config = config.get("transforms", "") + transform_names = ( + parse_comma_separated_list(transforms_config) if transforms_config else [] + ) + + transforms = [] + for name in transform_names: + transform = {"name": name} + transforms.append(transform) + for key in config: + if key.startswith(f"transforms.{name}."): + transform[key.replace(f"transforms.{name}.", "")] = config[key] + return self.DebeziumParser( source_platform=parser_platform, server_name=self.get_server_name(connector_manifest), database_name=database_name, + transforms=transforms, ) def get_topics_from_config(self) -> List[str]: - """Extract expected topics from Debezium connector configuration.""" + """Extract expected topics from Debezium connector configuration. + + This method orchestrates the process of determining which Kafka topics + are produced by a Debezium connector by: + 1. Discovering tables from database or config + 2. Applying schema and table filters + 3. 
Deriving topic names from filtered tables + """ + logger.debug( + f"DebeziumSourceConnector.get_topics_from_config called for '{self.connector_manifest.name}'" + ) try: - parser = self.get_parser(self.connector_manifest) config = self.connector_manifest.config - server_name = parser.server_name or "" + logger.debug( + f"Debezium connector '{self.connector_manifest.name}' config keys: {list(config.keys())}" + ) - # Extract table names from configuration - table_config = config.get("table.include.list") or config.get( + # Get parser to extract database info + parser = self.get_parser(self.connector_manifest) + database_name = parser.database_name + source_platform = parser.source_platform + server_name = parser.server_name + + logger.debug( + f"Debezium connector '{self.connector_manifest.name}' - " + f"database='{database_name}', platform='{source_platform}', server_name='{server_name}'" + ) + + # Step 1: Get initial set of tables + table_names = self._get_table_names_from_config_or_discovery( + config, database_name, source_platform + ) + if not table_names: + return [] + + # Step 2: Apply schema filters (if tables were discovered from database) + if ( + self.schema_resolver + and self.config.use_schema_resolver + and database_name + ): + table_names = self._apply_schema_filters(config, table_names) + if not table_names: + return [] + + # Step 3: Apply table filters + table_names = self._apply_table_filters(config, table_names) + if not table_names: + return [] + + # Step 4: Derive topics from filtered tables + topics = self._derive_topics_from_tables(table_names, server_name) + + logger.info( + f"Derived {len(topics)} topics from Debezium connector '{self.connector_manifest.name}' config: " + f"{topics[:10] if len(topics) <= 10 else f'{topics[:10]}... ({len(topics)} total)'}" + ) + return topics + except Exception as e: + logger.warning( + f"Failed to derive topics from Debezium connector '{self.connector_manifest.name}' config: {e}", + exc_info=True, + ) + return [] + + def _get_table_names_from_config_or_discovery( + self, config: Dict[str, Any], database_name: Optional[str], source_platform: str + ) -> List[str]: + """Get table names either from config or by discovering from database. 
+ + Args: + config: Connector configuration + database_name: Database name from connector config + source_platform: Source platform (e.g., "postgres", "mysql") + + Returns: + List of table names in "schema.table" format, or empty list if none found + """ + if not self.schema_resolver or not self.config.use_schema_resolver: + # SchemaResolver not available - fall back to table.include.list only + table_config = config.get("table.include.list") or config.get( "table.whitelist" ) if not table_config: - logger.debug("No table configuration found in Debezium connector") + logger.info( + f"No table.include.list found and SchemaResolver not available for connector '{self.connector_manifest.name}' - " + f"cannot derive topics from config" + ) return [] table_names = parse_comma_separated_list(table_config) + logger.debug( + f"Using {len(table_names)} tables from config (SchemaResolver not available): {table_names[:5]}" + ) + return table_names - # Debezium topics follow pattern: {server_name}.{table} where table includes schema - topics = [] - for table_name in table_names: - if server_name: - # Table name already includes schema prefix (e.g., "public.users") - topic_name = f"{server_name}.{table_name}" - else: - topic_name = table_name - topics.append(topic_name) + # SchemaResolver is available - use database.dbname to discover tables + if not database_name: + logger.warning( + f"Cannot discover tables for connector '{self.connector_manifest.name}' - " + f"database.dbname not configured" + ) + # Fall back to table.include.list if no database name + table_config = config.get("table.include.list") or config.get( + "table.whitelist" + ) + if not table_config: + logger.info( + f"No database.dbname and no table.include.list for connector '{self.connector_manifest.name}'" + ) + return [] + return parse_comma_separated_list(table_config) - return topics - except Exception as e: - logger.debug(f"Failed to derive topics from Debezium connector config: {e}") + # Discover all tables from database + logger.info( + f"Discovering tables from database '{database_name}' using SchemaResolver for connector '{self.connector_manifest.name}'" + ) + + discovered_tables = self._discover_tables_from_database( + database_name, source_platform + ) + + if not discovered_tables: + logger.warning( + f"No tables found in database '{database_name}' from SchemaResolver. " + f"Make sure you've ingested {source_platform} datasets for database '{database_name}' " + f"into DataHub before running Kafka Connect ingestion." + ) return [] + logger.info( + f"Discovered {len(discovered_tables)} tables from database '{database_name}': " + f"{discovered_tables[:10]}" + + ( + f"... ({len(discovered_tables)} total)" + if len(discovered_tables) > 10 + else "" + ) + ) + + return discovered_tables + + def _apply_schema_filters( + self, config: Dict[str, Any], tables: List[str] + ) -> List[str]: + """Apply schema.include.list and schema.exclude.list filters to tables. 
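The decision order implemented above, reduced to a sketch: without a schema resolver (or without `database.dbname`) the configured table list is the only source of truth; otherwise tables are discovered from DataHub. `discover` is a stand-in for the DataHub query helper and may return an empty list if nothing has been ingested yet.

```python
from typing import Callable, Dict, List, Optional


def candidate_tables(
    config: Dict[str, str],
    database_name: Optional[str],
    have_schema_resolver: bool,
    discover: Callable[[str], List[str]],
) -> List[str]:
    configured = config.get("table.include.list") or config.get("table.whitelist") or ""
    configured_tables = [t.strip() for t in configured.split(",") if t.strip()]

    if not have_schema_resolver or not database_name:
        return configured_tables  # config is the only available source
    return discover(database_name)  # may be empty if nothing was ingested yet


print(candidate_tables({"table.include.list": "public.users"}, None, True, lambda db: []))
# -> ['public.users']
```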
+ + Args: + config: Connector configuration + tables: List of table names in "schema.table" format + + Returns: + Filtered list of table names + """ + # Apply schema.include.list filter if it exists + schema_include_config = config.get("schema.include.list") + if schema_include_config: + schema_include_patterns = parse_comma_separated_list(schema_include_config) + logger.info( + f"Applying schema.include.list filter with {len(schema_include_patterns)} patterns: {schema_include_patterns}" + ) + + # Filter by schema name (first part of "schema.table") + filtered_tables = [] + for table in tables: + # Extract schema name from "schema.table" format + if "." in table: + schema_name = table.split(".")[0] + if self._matches_any_pattern(schema_name, schema_include_patterns): + filtered_tables.append(table) + + tables = filtered_tables + logger.info( + f"After schema.include.list filtering, {len(tables)} tables remain" + ) + + if not tables: + logger.warning( + f"No tables matched schema.include.list patterns for connector '{self.connector_manifest.name}'" + ) + return [] + + # Apply schema.exclude.list filter if it exists + schema_exclude_config = config.get("schema.exclude.list") + if schema_exclude_config: + schema_exclude_patterns = parse_comma_separated_list(schema_exclude_config) + logger.info( + f"Applying schema.exclude.list filter with {len(schema_exclude_patterns)} patterns: {schema_exclude_patterns}" + ) + + # Filter out tables whose schema matches exclude patterns + before_count = len(tables) + filtered_tables = [] + for table in tables: + # Extract schema name from "schema.table" format + if "." in table: + schema_name = table.split(".")[0] + if not self._matches_any_pattern( + schema_name, schema_exclude_patterns + ): + filtered_tables.append(table) + else: + # Keep tables without schema separator + filtered_tables.append(table) + + tables = filtered_tables + excluded_count = before_count - len(tables) + logger.info( + f"After schema.exclude.list filtering, excluded {excluded_count} tables, {len(tables)} tables remain" + ) + + if not tables: + logger.warning( + f"All tables were excluded by schema.exclude.list for connector '{self.connector_manifest.name}'" + ) + return [] + + return tables + + def _apply_table_filters( + self, config: Dict[str, Any], tables: List[str] + ) -> List[str]: + """Apply table.include.list and table.exclude.list filters to tables. + + Args: + config: Connector configuration + tables: List of table names in "schema.table" format + + Returns: + Filtered list of table names + """ + # Apply table.include.list filter if it exists + table_config = config.get("table.include.list") or config.get("table.whitelist") + + if table_config: + # Parse patterns from config + table_patterns = parse_comma_separated_list(table_config) + logger.info( + f"Applying table.include.list filter with {len(table_patterns)} patterns: {table_patterns}" + ) + + # Filter tables using patterns + filtered_tables = self._filter_tables_by_patterns(tables, table_patterns) + + logger.info( + f"After include filtering, {len(filtered_tables)} tables match the patterns: " + f"{filtered_tables[:10]}" + + ( + f"... 
({len(filtered_tables)} total)" + if len(filtered_tables) > 10 + else "" + ) + ) + + if not filtered_tables: + logger.warning( + f"No tables matched the include patterns for connector '{self.connector_manifest.name}'" + ) + return [] + + tables = filtered_tables + else: + # No filter - use all tables + logger.info( + f"No table.include.list filter - using all {len(tables)} tables" + ) + + # Apply table.exclude.list filter if it exists + exclude_config = config.get("table.exclude.list") or config.get( + "table.blacklist" + ) + + if exclude_config: + exclude_patterns = parse_comma_separated_list(exclude_config) + logger.info( + f"Applying table.exclude.list filter with {len(exclude_patterns)} patterns: {exclude_patterns}" + ) + + excluded_tables = self._filter_tables_by_patterns(tables, exclude_patterns) + + tables = [t for t in tables if t not in excluded_tables] + + logger.info( + f"After exclude filtering, {len(tables)} tables remain: " + f"{tables[:10]}" + + (f"... ({len(tables)} total)" if len(tables) > 10 else "") + ) + + if not tables: + logger.warning( + f"All tables were excluded by table.exclude.list for connector '{self.connector_manifest.name}'" + ) + return [] + + return tables + + def _derive_topics_from_tables( + self, table_names: List[str], server_name: Optional[str] + ) -> List[str]: + """Derive Kafka topic names from table names. + + Debezium topics follow pattern: {server_name}.{schema.table} + + Args: + table_names: List of table names in "schema.table" format + server_name: Server name from connector config + + Returns: + List of derived topic names + """ + topics = [] + for table_name in table_names: + # Table name already includes schema prefix (e.g., "public.users") + topic_name = f"{server_name}.{table_name}" if server_name else table_name + topics.append(topic_name) + + return topics + + def _filter_tables_by_patterns( + self, tables: List[str], patterns: List[str] + ) -> List[str]: + """ + Filter tables by matching against patterns using the connector's pattern matcher. + + This method uses the connector-specific pattern matcher (Java regex for Debezium) + to filter table names. + + Args: + tables: List of table names to filter (e.g., ["public.users", "public.orders"]) + patterns: List of patterns from table.include.list or table.exclude.list + + Returns: + List of tables that match at least one pattern + """ + # Get the appropriate pattern matcher for this connector type + matcher = self.get_pattern_matcher() + + # Use the matcher's filter_matches method + return matcher.filter_matches(patterns, tables) + + def _matches_any_pattern(self, text: str, patterns: List[str]) -> bool: + """ + Check if text matches any of the given patterns using the connector's pattern matcher. 
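Taken together, the include/exclude steps above behave roughly like the sketch below. `re.fullmatch` stands in for the anchored Java-regex matching that `JavaRegexMatcher` performs; the two are equivalent for the simple patterns shown, but not for every Java regex feature.

```python
import re
from typing import List


def filter_tables(tables: List[str], include: List[str], exclude: List[str]) -> List[str]:
    def matches_any(name: str, patterns: List[str]) -> bool:
        return any(re.fullmatch(p, name) for p in patterns)

    kept = [t for t in tables if not include or matches_any(t, include)]
    return [t for t in kept if not exclude or not matches_any(t, exclude)]


tables = ["public.users", "public.orders", "audit.events"]
print(filter_tables(tables, include=[r"public\..*"], exclude=[r".*\.orders"]))
# -> ['public.users']
```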
+ + Args: + text: Text to check (e.g., schema name like "public") + patterns: List of patterns to match against (Java regex for Debezium) + + Returns: + True if text matches at least one pattern, False otherwise + """ + matcher = self.get_pattern_matcher() + return any(matcher.matches(pattern, text) for pattern in patterns) + def _get_database_name_for_platform( self, platform: str, config: Dict[str, str] ) -> Optional[str]: @@ -1662,8 +2543,9 @@ def _get_database_name_for_platform( # postgres, oracle, db2 use database.dbname return config.get("database.dbname") - def _get_platform_from_connector_class(self, connector_class: str) -> str: + def get_platform(self) -> str: """Map Debezium connector class to platform name.""" + connector_class = self.connector_manifest.config.get(CONNECTOR_CLASS, "") # Map based on well-known Debezium connector classes if "mysql" in connector_class.lower(): return "mysql" @@ -1683,6 +2565,15 @@ def _get_platform_from_connector_class(self, connector_class: str) -> str: return "unknown" def extract_lineages(self) -> List[KafkaConnectLineage]: + """ + Extract lineage mappings from Debezium source tables to Kafka topics. + + This method always uses table.include.list from config as the source of truth. + When manifest.topic_names is available (Kafka API accessible), it filters + lineages to only topics that exist in Kafka. When unavailable (Confluent Cloud, + air-gapped), it creates lineages for all configured tables without validating + that derived topic names actually exist in Kafka. + """ lineages: List[KafkaConnectLineage] = list() try: @@ -1691,9 +2582,6 @@ def extract_lineages(self) -> List[KafkaConnectLineage]: server_name = parser.server_name database_name = parser.database_name - if not self.connector_manifest.topic_names: - return lineages - # Check for EventRouter transform - requires special handling if self._has_event_router_transform(): logger.debug( @@ -1703,51 +2591,133 @@ def extract_lineages(self) -> List[KafkaConnectLineage]: source_platform, database_name ) - # Standard Debezium topic processing - # Escape server_name to handle cases where topic.prefix contains dots - # Some users configure topic.prefix like "my.server" which breaks the regex - server_name = server_name or "" - # Regex pattern (\w+\.\w+(?:\.\w+)?) 
supports BOTH 2-part and 3-part table names - topic_naming_pattern = rf"({re.escape(server_name)})\.(\w+\.\w+(?:\.\w+)?)" + # Get table names from config + table_config = self.connector_manifest.config.get( + "table.include.list" + ) or self.connector_manifest.config.get("table.whitelist") + + has_kafka_topics = bool(self.connector_manifest.topic_names) + + # When table.include.list is not specified, Debezium captures ALL tables from the database + # In this case, we use the actual topic names from Kafka API to reverse-engineer which tables are being captured + if not table_config: + if not has_kafka_topics: + logger.warning( + f"Debezium connector {self.connector_manifest.name} has no table.include.list config " + f"and no topics available from Kafka API - cannot extract lineages" + ) + return lineages + + if not server_name: + logger.warning( + f"Debezium connector {self.connector_manifest.name} has no server_name - cannot extract lineages from topics" + ) + return lineages + + # Use standard Debezium topic processing to extract table names from topics + logger.info( + f"Debezium connector {self.connector_manifest.name} has no table.include.list config - " + f"deriving table names from {len(self.connector_manifest.topic_names)} Kafka topics" + ) + return self._extract_lineages_from_topics( + source_platform, server_name, database_name + ) + + # Expand patterns if needed (requires SchemaResolver) + table_names = self._expand_table_patterns( + table_config, source_platform, database_name + ) + + if not table_names: + logger.warning( + f"No tables found after expanding patterns for connector '{self.connector_manifest.name}'" + ) + return lineages + + if not has_kafka_topics: + logger.info( + f"Kafka topics API not available for connector '{self.connector_manifest.name}' - " + f"creating lineages for all {len(table_names)} configured tables without validating " + f"that derived topic names actually exist in Kafka" + ) # Handle connectors with 2-level container (database + schema) in topic pattern connector_class = self.connector_manifest.config.get(CONNECTOR_CLASS, "") - maybe_duplicated_database_name = ( + includes_database_in_topic = ( connector_class in self.DEBEZIUM_CONNECTORS_WITH_2_LEVEL_CONTAINER_IN_PATTERN ) - for topic in self.connector_manifest.topic_names: - found = re.search(re.compile(topic_naming_pattern), topic) - logger.debug( - f"Processing topic: '{topic}' with regex pattern '{topic_naming_pattern}', found: {found}" - ) + # Derive expected topics and apply transforms + for table_name in table_names: + # For Debezium, derive expected topic name: {server_name}.{schema.table} + # Table name may include schema (e.g., "public.users") or database.schema (e.g., "testdb.public.users") + # SQL Server special case: {server_name}.{database}.{schema.table} + + # Extract schema.table part (remove database if present for 3-tier platforms) + if database_name and table_name.startswith(f"{database_name}."): + # Remove database prefix: "testdb.public.users" -> "public.users" + schema_table = table_name[len(database_name) + 1 :] + else: + schema_table = table_name - if found: - # Extract the table part after server_name - table_part = found.group(2) + # Build full dataset name with database + dataset_name = get_dataset_name(database_name, schema_table) - if ( - maybe_duplicated_database_name - and database_name - and table_part.startswith(f"{database_name}.") - ): - table_part = table_part[len(database_name) + 1 :] + # Generate expected Debezium topic name + if includes_database_in_topic and 
database_name: + # SQL Server: server.database.schema.table + if server_name: + expected_topic = f"{server_name}.{database_name}.{schema_table}" + else: + expected_topic = f"{database_name}.{schema_table}" + elif server_name: + # Standard: server.schema.table + expected_topic = f"{server_name}.{schema_table}" + else: + expected_topic = schema_table - logger.debug( - f"Extracted table part: '{table_part}' from topic '{topic}'" + # Apply transforms if configured + if parser.transforms: + result = get_transform_pipeline().apply_forward( + [expected_topic], self.connector_manifest.config ) - # Apply database name to create final dataset name - table_name = get_dataset_name(database_name, table_part) - logger.debug(f"Final table name: '{table_name}'") + if result.warnings: + for warning in result.warnings: + logger.warning( + f"Transform warning for {self.connector_manifest.name}: {warning}" + ) + if result.topics and len(result.topics) > 0: + expected_topic = result.topics[0] - lineage = KafkaConnectLineage( - source_dataset=table_name, - source_platform=source_platform, - target_dataset=topic, - target_platform=KAFKA, + # Filter by Kafka topics if available + if ( + has_kafka_topics + and expected_topic not in self.connector_manifest.topic_names + ): + logger.debug( + f"Expected topic '{expected_topic}' not found in Kafka - skipping lineage for table '{table_name}'" ) - lineages.append(lineage) + continue + + # Extract fine-grained lineage if enabled + fine_grained = self._extract_fine_grained_lineage( + dataset_name, source_platform, expected_topic, KAFKA + ) + + # Create lineage mapping + lineage = KafkaConnectLineage( + source_dataset=dataset_name, + source_platform=source_platform, + target_dataset=expected_topic, + target_platform=KAFKA, + fine_grained_lineages=fine_grained, + ) + lineages.append(lineage) + + logger.info( + f"Created {len(lineages)} lineages for Debezium connector '{self.connector_manifest.name}' from config" + ) return lineages except Exception as e: self.report.warning( @@ -1800,7 +2770,10 @@ def _extract_lineages_for_event_router( ) return lineages - table_names = parse_comma_separated_list(table_config) + # Expand table patterns if schema resolver is enabled + table_names = self._expand_table_patterns( + table_config, source_platform, database_name + ) # Try to filter topics using RegexRouter replacement pattern (if available) filtered_topics = self._filter_topics_for_event_router() @@ -1889,6 +2862,346 @@ def _filter_topics_for_event_router(self) -> List[str]: ) return list(self.connector_manifest.topic_names) + def _extract_lineages_from_topics( + self, + source_platform: str, + server_name: str, + database_name: Optional[str], + ) -> List[KafkaConnectLineage]: + """ + Extract lineages by reverse-engineering table names from Kafka topic names. + + This is used when table.include.list is not configured, meaning Debezium captures + ALL tables from the database. We parse the actual topic names from Kafka API to + determine which tables are being captured. 
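Forward derivation of expected Debezium topic names from configured tables, as performed above: the SQL Server family keeps the database in the topic name, the others use `{server_name}.{schema.table}`. Server, database, and table values below are invented.

```python
from typing import List, Optional


def expected_debezium_topics(
    tables: List[str],
    server_name: Optional[str],
    database_name: Optional[str],
    includes_database_in_topic: bool,
) -> List[str]:
    topics = []
    for table in tables:
        # Strip a leading "<database>." so the remainder is "schema.table"
        schema_table = (
            table[len(database_name) + 1 :]
            if database_name and table.startswith(f"{database_name}.")
            else table
        )
        parts = [
            p
            for p in (server_name, database_name if includes_database_in_topic else None, schema_table)
            if p
        ]
        topics.append(".".join(parts))
    return topics


print(expected_debezium_topics(["public.users"], "pg-server", "testdb", False))
# -> ['pg-server.public.users']
print(expected_debezium_topics(["dbo.orders"], "mssql01", "sales", True))
# -> ['mssql01.sales.dbo.orders']
```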
+ + Debezium topic naming patterns: + - MySQL: {server_name}.{database}.{table} + - PostgreSQL: {server_name}.{schema}.{table} + - SQL Server: {server_name}.{database}.{schema}.{table} + - Oracle: {server_name}.{schema}.{table} + """ + lineages: List[KafkaConnectLineage] = [] + + if not self.connector_manifest.topic_names: + return lineages + + connector_class = self.connector_manifest.config.get(CONNECTOR_CLASS, "") + includes_database_in_topic = ( + connector_class + in self.DEBEZIUM_CONNECTORS_WITH_2_LEVEL_CONTAINER_IN_PATTERN + ) + + for topic in self.connector_manifest.topic_names: + # Skip internal Debezium topics (schema history, transaction metadata, etc.) + if any( + internal in topic + for internal in [ + ".schema-changes", + ".transaction", + "dbhistory", + "__debezium", + ] + ): + logger.debug(f"Skipping internal Debezium topic: {topic}") + continue + + # Parse topic name to extract table information + # Expected format: {server_name}.{container}.{table} or {server_name}.{table} + parts = topic.split(".") + + # Skip topics that don't match expected Debezium format + if len(parts) < 2: + logger.debug( + f"Skipping topic '{topic}' - does not match Debezium naming pattern" + ) + continue + + # Try to match server_name prefix + if server_name and topic.startswith(f"{server_name}."): + # Remove server_name prefix + remaining = topic[len(server_name) + 1 :] + remaining_parts = remaining.split(".") + + # Extract table information based on connector type + if includes_database_in_topic: + # SQL Server: {server}.{database}.{schema}.{table} + if len(remaining_parts) >= 3: + db_name = remaining_parts[0] + schema_table = ".".join(remaining_parts[1:]) + elif len(remaining_parts) == 2: + # Fallback: {database}.{table} + db_name = remaining_parts[0] + schema_table = remaining_parts[1] + else: + logger.debug( + f"Skipping topic '{topic}' - unexpected format for 2-level container connector" + ) + continue + + # Use database from topic if available, otherwise use configured database + if database_name and db_name != database_name: + logger.debug( + f"Skipping topic '{topic}' - database '{db_name}' does not match configured '{database_name}'" + ) + continue + + dataset_name = get_dataset_name(db_name, schema_table) + else: + # Standard: {server}.{schema}.{table} or {server}.{table} + schema_table = remaining + + # Build dataset name + if database_name: + dataset_name = get_dataset_name(database_name, schema_table) + else: + dataset_name = schema_table + else: + # No server_name prefix or doesn't match - try best effort parsing + logger.debug( + f"Topic '{topic}' does not start with expected server_name '{server_name}' - attempting best-effort parsing" + ) + + if includes_database_in_topic: + # Assume format: {database}.{schema}.{table} + if len(parts) >= 2: + db_name = parts[0] + schema_table = ".".join(parts[1:]) + dataset_name = get_dataset_name(db_name, schema_table) + else: + continue + else: + # Assume format: {schema}.{table} or just {table} + if database_name: + dataset_name = get_dataset_name(database_name, topic) + else: + dataset_name = topic + + # Extract fine-grained lineage if enabled + fine_grained = self._extract_fine_grained_lineage( + dataset_name, source_platform, topic, KAFKA + ) + + # Create lineage mapping + lineage = KafkaConnectLineage( + source_dataset=dataset_name, + source_platform=source_platform, + target_dataset=topic, + target_platform=KAFKA, + fine_grained_lineages=fine_grained, + ) + lineages.append(lineage) + + logger.info( + f"Extracted {len(lineages)} lineages 
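The reverse direction sketched above (no `table.include.list` configured): recover the table from each topic by stripping the server prefix and skipping Debezium's internal topics. The topic names in the example are invented.

```python
from typing import List, Tuple


def tables_from_topics(topics: List[str], server_name: str) -> List[Tuple[str, str]]:
    pairs = []
    for topic in topics:
        # Skip schema-history / transaction / internal Debezium topics
        if any(marker in topic for marker in (".schema-changes", ".transaction", "dbhistory", "__debezium")):
            continue
        if topic.startswith(f"{server_name}."):
            pairs.append((topic[len(server_name) + 1 :], topic))
    return pairs


print(tables_from_topics(["pg-server.public.users", "pg-server.schema-changes.inventory"], "pg-server"))
# -> [('public.users', 'pg-server.public.users')]
```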
from {len(self.connector_manifest.topic_names)} topics " + f"for connector '{self.connector_manifest.name}'" + ) + return lineages + + def _expand_table_patterns( + self, + table_config: str, + source_platform: str, + database_name: Optional[str], + ) -> List[str]: + """ + Expand table patterns using DataHub schema metadata. + + Examples: + - "mydb.*" → ["mydb.table1", "mydb.table2", ...] + - "public.*" → ["public.table1", "public.table2", ...] + - "schema1.table1" → ["schema1.table1"] (no expansion) + + Args: + table_config: Comma-separated table patterns from connector config + source_platform: Source platform (e.g., 'postgres', 'mysql') + database_name: Database name for context (optional) + + Returns: + List of fully expanded table names + """ + # Check if feature is enabled + if ( + not self.config.use_schema_resolver + or not self.config.schema_resolver_expand_patterns + ): + # Fall back to original behavior - parse as-is + return parse_comma_separated_list(table_config) + + if not self.schema_resolver: + logger.debug( + f"SchemaResolver not available for connector {self.connector_manifest.name} - skipping pattern expansion" + ) + return parse_comma_separated_list(table_config) + + patterns = parse_comma_separated_list(table_config) + expanded_tables = [] + + logger.info( + f"Processing table patterns for connector {self.connector_manifest.name}: " + f"platform={source_platform}, database={database_name}, patterns={patterns}" + ) + + for pattern in patterns: + # Check if pattern needs expansion (contains regex special characters) + if self._is_regex_pattern(pattern): + logger.info( + f"Pattern '{pattern}' contains wildcards - will query DataHub for matching tables " + f"(platform={source_platform}, database={database_name})" + ) + tables = self._query_tables_from_datahub( + pattern, source_platform, database_name + ) + if tables: + logger.info( + f"Expanded pattern '{pattern}' to {len(tables)} tables: {tables[:5]}..." + ) + expanded_tables.extend(tables) + else: + logger.warning( + f"Pattern '{pattern}' did not match any tables in DataHub - keeping as-is" + ) + expanded_tables.append(pattern) + else: + # Already explicit table name - no expansion needed + logger.debug( + f"Table '{pattern}' is explicit (no wildcards) - using as-is without querying DataHub" + ) + expanded_tables.append(pattern) + + return expanded_tables + + def _is_regex_pattern(self, pattern: str) -> bool: + """ + Check if pattern contains Java regex special characters. + + Debezium uses Java regex for table.include.list, which supports: + - Wildcards: * (zero or more), + (one or more), ? (zero or one) + - Character classes: [abc], [0-9], [a-z] + - Grouping and alternation: (pattern1|pattern2) + - Quantifiers: {n}, {n,}, {n,m} + - Anchors and boundaries: ^, $, \\b + - Escapes: \\ (backslash for escaping special chars like \\.) + + Returns: + True if pattern contains regex special characters + """ + # Common regex special characters that indicate a pattern needs expansion + # Note: backslash (\) is included to detect escaped patterns like "public\\.users" + regex_chars = ["*", "+", "?", "[", "]", "(", ")", "|", "{", "}", "^", "$", "\\"] + return any(char in pattern for char in regex_chars) + + def _query_tables_from_datahub( + self, + pattern: str, + platform: str, + database: Optional[str], + ) -> List[str]: + """ + Query DataHub for tables matching the given pattern. + + Debezium uses Java regex for table.include.list patterns. Patterns are anchored, + meaning they must match the entire fully-qualified table name. 
+
+        Args:
+            pattern: Java regex pattern (e.g., "public.*", "mydb\\.(users|orders)")
+            platform: Source platform
+            database: Database name for context
+
+        Returns:
+            List of matching table names
+        """
+        if not self.schema_resolver:
+            return []
+
+        try:
+            # Get all URNs from schema resolver cache
+            all_urns = self.schema_resolver.get_urns()
+
+            logger.info(
+                f"SchemaResolver returned {len(all_urns)} cached URNs for platform={platform}, "
+                f"database={database}, will match against pattern='{pattern}'"
+            )
+
+            if not all_urns:
+                logger.warning(
+                    f"No cached schemas available in SchemaResolver for platform={platform}. "
+                    f"Make sure you've ingested {platform} datasets into DataHub before running Kafka Connect ingestion."
+                )
+                return []
+
+            matched_tables = []
+
+            # Try to use Java regex for exact compatibility with Debezium
+            try:
+                from java.util.regex import Pattern as JavaPattern
+
+                regex_pattern = JavaPattern.compile(pattern)
+                use_java_regex = True
+            except (ImportError, RuntimeError):
+                # Fall back to the Python re module for testing/environments without JPype
+                logger.debug(
+                    "Java regex not available, falling back to Python re module"
+                )
+                # Java and Python regex are compatible for the simple patterns Debezium
+                # typically uses (e.g., "public.*"). Normalize doubled backslashes
+                # (e.g. "public\\.users") to a single escaped dot ("public\.users")
+                # so Python interprets the literal dot correctly.
+                python_pattern = pattern.replace(r"\\.", r"\.")
+                regex_pattern = re.compile(python_pattern)
+                use_java_regex = False
+
+            for urn in all_urns:
+                # URN format: urn:li:dataset:(urn:li:dataPlatform:postgres,database.schema.table,PROD)
+                table_name = self._extract_table_name_from_urn(urn)
+                if not table_name:
+                    continue
+
+                # Try direct match first (handles patterns like "mydb.schema.*")
+                full_name_matches = (
+                    regex_pattern.matcher(table_name).matches()
+                    if use_java_regex
+                    else regex_pattern.fullmatch(table_name) is not None
+                )
+
+                if full_name_matches:
+                    matched_tables.append(table_name)
+                    continue
+
+                # For patterns without database prefix (e.g., "schema.*" or "public.*"),
+                # also try matching against the table name without the first component.
+                # This handles Debezium patterns that don't include the database name:
+                # - PostgreSQL: "public.*" matches "testdb.public.users" (3-tier URN)
+                # - MySQL: "mydb.*" matches "mydb.table1" (2-tier URN, already matched above)
+                if "." 
in table_name: + table_without_database = table_name.split(".", 1)[1] + schema_name_matches = ( + regex_pattern.matcher(table_without_database).matches() + if use_java_regex + else regex_pattern.fullmatch(table_without_database) is not None + ) + + if schema_name_matches: + matched_tables.append(table_name) + + logger.debug( + f"Pattern '{pattern}' matched {len(matched_tables)} tables from DataHub" + ) + return matched_tables + + except (ConnectionError, TimeoutError) as e: + logger.error(f"Failed to connect to DataHub for pattern '{pattern}': {e}") + if self.report: + self.report.report_failure( + f"datahub_connection_{self.connector_manifest.name}", str(e) + ) + return [] + except Exception as e: + logger.warning( + f"Failed to query tables from DataHub for pattern '{pattern}': {e}", + exc_info=True, + ) + return [] + @dataclass class ConfigDrivenSourceConnector(BaseConnector): diff --git a/metadata-ingestion/src/datahub/ingestion/source/kafka_connect/transform_plugins.py b/metadata-ingestion/src/datahub/ingestion/source/kafka_connect/transform_plugins.py index 8ea69453fe7937..88024f14a08625 100644 --- a/metadata-ingestion/src/datahub/ingestion/source/kafka_connect/transform_plugins.py +++ b/metadata-ingestion/src/datahub/ingestion/source/kafka_connect/transform_plugins.py @@ -17,7 +17,7 @@ from functools import lru_cache from typing import Dict, List, Optional -from datahub.ingestion.source.kafka_connect.common import ( +from datahub.ingestion.source.kafka_connect.config_constants import ( ConnectorConfigKeys, parse_comma_separated_list, ) @@ -189,6 +189,37 @@ def should_apply_automatically(cls) -> bool: return False # Complex transforms require explicit user configuration +class ReplaceFieldPlugin(TransformPlugin): + """ + Plugin for ReplaceField transforms. + + ReplaceField transforms only affect message field names (include/exclude/rename), + not topic names, so they're a no-op for topic transformation but need to be + registered as known transforms to avoid warnings. 
+ """ + + SUPPORTED_TYPES = { + "org.apache.kafka.connect.transforms.ReplaceField$Value", + "org.apache.kafka.connect.transforms.ReplaceField$Key", + } + + @classmethod + def supports_transform_type(cls, transform_type: str) -> bool: + return transform_type in cls.SUPPORTED_TYPES + + def apply_forward(self, topics: List[str], config: TransformConfig) -> List[str]: + """ReplaceField doesn't affect topic names, only field names within messages.""" + return topics + + def apply_reverse(self, topics: List[str], config: TransformConfig) -> List[str]: + """ReplaceField doesn't affect topic names, only field names within messages.""" + return topics + + @classmethod + def should_apply_automatically(cls) -> bool: + return True # Safe to apply automatically - it's a no-op for topic names + + class TransformPluginRegistry: """Registry for transform plugins.""" @@ -200,6 +231,7 @@ def _register_default_plugins(self): """Register default transform plugins.""" self.register(RegexRouterPlugin()) self.register(ComplexTransformPlugin()) + self.register(ReplaceFieldPlugin()) def register(self, plugin: TransformPlugin) -> None: """Register a transform plugin.""" diff --git a/metadata-ingestion/tests/unit/test_kafka_connect.py b/metadata-ingestion/tests/unit/test_kafka_connect.py index 6320502e58dd60..702a10b9dfc581 100644 --- a/metadata-ingestion/tests/unit/test_kafka_connect.py +++ b/metadata-ingestion/tests/unit/test_kafka_connect.py @@ -1,5 +1,5 @@ import logging -from typing import Any, Dict, List, Tuple +from typing import Any, Dict, List, Optional, Tuple from unittest.mock import Mock, patch import jpype @@ -8,6 +8,8 @@ # Import the classes we're testing from datahub.ingestion.source.kafka_connect.common import ( + CLOUD_JDBC_SOURCE_CLASSES, + POSTGRES_CDC_SOURCE_CLOUD, ConnectorManifest, KafkaConnectLineage, KafkaConnectSourceConfig, @@ -21,16 +23,29 @@ SnowflakeSinkConnector, ) from datahub.ingestion.source.kafka_connect.source_connectors import ( + JDBC_SOURCE_CONNECTOR_CLASS, ConfluentJDBCSourceConnector, + DebeziumSourceConnector, MongoSourceConnector, ) from datahub.ingestion.source.kafka_connect.transform_plugins import ( get_transform_pipeline, ) +from datahub.sql_parsing.schema_resolver import SchemaResolverInterface logger = logging.getLogger(__name__) +def create_mock_kafka_connect_config() -> Mock: + """Helper to create a properly configured KafkaConnectSourceConfig mock.""" + config = Mock(spec=KafkaConnectSourceConfig) + config.use_schema_resolver = False + config.schema_resolver_expand_patterns = True + config.schema_resolver_finegrained_lineage = True + config.env = "PROD" + return config + + @pytest.fixture(scope="session", autouse=True) def ensure_jvm_started(): """Ensure JVM is started for all tests requiring Java regex.""" @@ -188,7 +203,10 @@ def create_mock_manifest(self, config: Dict[str, str]) -> ConnectorManifest: def create_mock_dependencies(self) -> Tuple[Mock, Mock]: """Helper to create mock dependencies.""" - config: Mock = Mock(spec=KafkaConnectSourceConfig) + config: Mock = create_mock_kafka_connect_config() + config.use_schema_resolver = False + config.schema_resolver_expand_patterns = True + config.schema_resolver_finegrained_lineage = True report: Mock = Mock(spec=KafkaConnectSourceReport) return config, report @@ -301,7 +319,7 @@ def test_s3_with_regex_router(self) -> None: } manifest: ConnectorManifest = self.create_mock_manifest(connector_config) - config: Mock = Mock(spec=KafkaConnectSourceConfig) + config: Mock = create_mock_kafka_connect_config() report: Mock = 
Mock(spec=KafkaConnectSourceReport) connector: ConfluentS3SinkConnector = ConfluentS3SinkConnector( @@ -349,7 +367,7 @@ def test_bigquery_with_regex_router(self) -> None: } manifest: ConnectorManifest = self.create_mock_manifest(connector_config) - config: Mock = Mock(spec=KafkaConnectSourceConfig) + config: Mock = create_mock_kafka_connect_config() report: Mock = Mock(spec=KafkaConnectSourceReport) connector: BigQuerySinkConnector = BigQuerySinkConnector( @@ -394,7 +412,7 @@ def test_snowflake_with_regex_router(self) -> None: } manifest: ConnectorManifest = self.create_mock_manifest(connector_config) - config: Mock = Mock(spec=KafkaConnectSourceConfig) + config: Mock = create_mock_kafka_connect_config() report: Mock = Mock(spec=KafkaConnectSourceReport) connector: SnowflakeSinkConnector = SnowflakeSinkConnector( @@ -444,7 +462,7 @@ def test_mysql_source_with_regex_router(self) -> None: } manifest: ConnectorManifest = self.create_mock_manifest(connector_config) - config: Mock = Mock(spec=KafkaConnectSourceConfig) + config: Mock = create_mock_kafka_connect_config() report: Mock = Mock(spec=KafkaConnectSourceReport) connector: ConfluentJDBCSourceConnector = ConfluentJDBCSourceConnector( @@ -488,7 +506,7 @@ def test_end_to_end_bigquery_transformation(self) -> None: topic_names=["raw_users_data", "raw_orders_data", "other_topic"], ) - config: Mock = Mock(spec=KafkaConnectSourceConfig) + config: Mock = create_mock_kafka_connect_config() report: Mock = Mock(spec=KafkaConnectSourceReport) connector: BigQuerySinkConnector = BigQuerySinkConnector( @@ -536,7 +554,7 @@ def test_regex_router_error_handling(self) -> None: topic_names=["test-topic"], ) - config: Mock = Mock(spec=KafkaConnectSourceConfig) + config: Mock = create_mock_kafka_connect_config() report: Mock = Mock(spec=KafkaConnectSourceReport) # Should not raise an exception @@ -576,7 +594,7 @@ def test_mongo_source_lineage_topic_parsing(self) -> None: } manifest: ConnectorManifest = self.create_mock_manifest(connector_config) - config: Mock = Mock(spec=KafkaConnectSourceConfig) + config: Mock = create_mock_kafka_connect_config() report: Mock = Mock(spec=KafkaConnectSourceReport) connector: MongoSourceConnector = MongoSourceConnector(manifest, config, report) @@ -627,7 +645,7 @@ def test_platform_postgres_source_connector(self) -> None: } manifest: ConnectorManifest = self.create_platform_manifest(connector_config) - config: Mock = Mock(spec=KafkaConnectSourceConfig) + config: Mock = create_mock_kafka_connect_config() report: Mock = Mock(spec=KafkaConnectSourceReport) connector: ConfluentJDBCSourceConnector = ConfluentJDBCSourceConnector( @@ -667,7 +685,7 @@ def test_cloud_postgres_source_connector(self) -> None: } manifest: ConnectorManifest = self.create_cloud_manifest(connector_config) - config: Mock = Mock(spec=KafkaConnectSourceConfig) + config: Mock = create_mock_kafka_connect_config() report: Mock = Mock(spec=KafkaConnectSourceReport) connector: ConfluentJDBCSourceConnector = ConfluentJDBCSourceConnector( @@ -716,7 +734,7 @@ def test_cloud_mysql_source_connector(self) -> None: ], ) - config: Mock = Mock(spec=KafkaConnectSourceConfig) + config: Mock = create_mock_kafka_connect_config() report: Mock = Mock(spec=KafkaConnectSourceReport) connector: ConfluentJDBCSourceConnector = ConfluentJDBCSourceConnector( @@ -755,7 +773,7 @@ def test_mixed_field_name_fallback(self) -> None: topic_names=["cloud_server.public.cloud_table"], ) - config: Mock = Mock(spec=KafkaConnectSourceConfig) + config: Mock = create_mock_kafka_connect_config() report: 
Mock = Mock(spec=KafkaConnectSourceReport) connector: ConfluentJDBCSourceConnector = ConfluentJDBCSourceConnector( @@ -791,7 +809,7 @@ def test_cloud_connector_missing_required_fields(self) -> None: topic_names=[], ) - config: Mock = Mock(spec=KafkaConnectSourceConfig) + config: Mock = create_mock_kafka_connect_config() report: Mock = Mock(spec=KafkaConnectSourceReport) connector: ConfluentJDBCSourceConnector = ConfluentJDBCSourceConnector( @@ -847,7 +865,7 @@ def test_lineage_generation_platform_vs_cloud(self) -> None: topic_names=["db-server.public.users"], ) - config: Mock = Mock(spec=KafkaConnectSourceConfig) + config: Mock = create_mock_kafka_connect_config() report: Mock = Mock(spec=KafkaConnectSourceReport) # Test Platform connector @@ -918,7 +936,7 @@ def validate_lineage_fields( topic_names=topic_names, ) - mock_config: Mock = Mock(spec=KafkaConnectSourceConfig) + mock_config: Mock = create_mock_kafka_connect_config() mock_report: Mock = Mock(spec=KafkaConnectSourceReport) connector: ConfluentJDBCSourceConnector = ConfluentJDBCSourceConnector( @@ -2984,7 +3002,7 @@ def test_extract_flow_property_bag_masks_credentials(self) -> None: topic_names=[], ) - config = Mock(spec=KafkaConnectSourceConfig) + config = create_mock_kafka_connect_config() report = Mock(spec=KafkaConnectSourceReport) connector = ConfluentJDBCSourceConnector(manifest, config, report) @@ -3098,7 +3116,7 @@ def test_debezium_postgres_lineage_extraction(self) -> None: topic_names=["myserver.public.users", "myserver.public.orders"], ) - config = Mock(spec=KafkaConnectSourceConfig) + config = create_mock_kafka_connect_config() report = Mock(spec=KafkaConnectSourceReport) connector = DebeziumSourceConnector(manifest, config, report) @@ -3139,7 +3157,7 @@ def test_debezium_mysql_lineage_extraction(self) -> None: ], ) - config = Mock(spec=KafkaConnectSourceConfig) + config = create_mock_kafka_connect_config() report = Mock(spec=KafkaConnectSourceReport) connector = DebeziumSourceConnector(manifest, config, report) @@ -3172,7 +3190,7 @@ def test_debezium_sqlserver_with_database_and_schema(self) -> None: topic_names=["sqlserver.mydb.dbo.customers"], ) - config = Mock(spec=KafkaConnectSourceConfig) + config = create_mock_kafka_connect_config() report = Mock(spec=KafkaConnectSourceReport) connector = DebeziumSourceConnector(manifest, config, report) @@ -3205,7 +3223,7 @@ def test_debezium_with_topic_prefix(self) -> None: topic_names=["my-prefix.public.events"], ) - config = Mock(spec=KafkaConnectSourceConfig) + config = create_mock_kafka_connect_config() report = Mock(spec=KafkaConnectSourceReport) connector = DebeziumSourceConnector(manifest, config, report) @@ -3216,6 +3234,293 @@ def test_debezium_with_topic_prefix(self) -> None: assert lineages[0].target_dataset == "my-prefix.public.events" +class TestDebeziumDatabaseDiscovery: + """Test Debezium database discovery via SchemaResolver and filtering.""" + + def test_database_discovery_without_table_include_list(self) -> None: + """Test discovering tables from database.dbname when table.include.list is not configured.""" + from datahub.ingestion.source.kafka_connect.source_connectors import ( + DebeziumSourceConnector, + ) + + connector_config = { + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.server.name": "myserver", + "database.dbname": "appdb", + } + + manifest = ConnectorManifest( + name="postgres-cdc-discovery", + type="source", + config=connector_config, + tasks=[], + topic_names=[], + ) + + config = 
create_mock_kafka_connect_config() + config.use_schema_resolver = True + report = Mock(spec=KafkaConnectSourceReport) + + mock_schema_resolver = Mock() + mock_schema_resolver.get_urns.return_value = [ + "urn:li:dataset:(urn:li:dataPlatform:postgres,appdb.public.users,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:postgres,appdb.public.orders,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:postgres,appdb.public.products,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:mysql,other_db.table1,PROD)", + ] + + connector = DebeziumSourceConnector( + manifest, config, report, schema_resolver=mock_schema_resolver + ) + + topics = connector.get_topics_from_config() + + assert len(topics) == 3 + assert "myserver.public.users" in topics + assert "myserver.public.orders" in topics + assert "myserver.public.products" in topics + + def test_database_discovery_with_include_filter(self) -> None: + """Test discovering tables from database then filtering with table.include.list.""" + from datahub.ingestion.source.kafka_connect.source_connectors import ( + DebeziumSourceConnector, + ) + + connector_config = { + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.server.name": "myserver", + "database.dbname": "appdb", + "table.include.list": "public.users,public.orders", + } + + manifest = ConnectorManifest( + name="postgres-cdc-filtered", + type="source", + config=connector_config, + tasks=[], + topic_names=[], + ) + + config = create_mock_kafka_connect_config() + config.use_schema_resolver = True + report = Mock(spec=KafkaConnectSourceReport) + + mock_schema_resolver = Mock() + mock_schema_resolver.get_urns.return_value = [ + "urn:li:dataset:(urn:li:dataPlatform:postgres,appdb.public.users,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:postgres,appdb.public.orders,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:postgres,appdb.public.products,PROD)", + ] + + connector = DebeziumSourceConnector( + manifest, config, report, schema_resolver=mock_schema_resolver + ) + + topics = connector.get_topics_from_config() + + assert len(topics) == 2 + assert "myserver.public.users" in topics + assert "myserver.public.orders" in topics + assert "myserver.public.products" not in topics + + def test_database_discovery_with_exclude_filter(self) -> None: + """Test discovering tables from database then filtering with table.exclude.list.""" + from datahub.ingestion.source.kafka_connect.source_connectors import ( + DebeziumSourceConnector, + ) + + connector_config = { + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.server.name": "myserver", + "database.dbname": "appdb", + "table.exclude.list": "public.products", + } + + manifest = ConnectorManifest( + name="postgres-cdc-excluded", + type="source", + config=connector_config, + tasks=[], + topic_names=[], + ) + + config = create_mock_kafka_connect_config() + config.use_schema_resolver = True + report = Mock(spec=KafkaConnectSourceReport) + + mock_schema_resolver = Mock() + mock_schema_resolver.get_urns.return_value = [ + "urn:li:dataset:(urn:li:dataPlatform:postgres,appdb.public.users,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:postgres,appdb.public.orders,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:postgres,appdb.public.products,PROD)", + ] + + connector = DebeziumSourceConnector( + manifest, config, report, schema_resolver=mock_schema_resolver + ) + + topics = connector.get_topics_from_config() + + assert len(topics) == 2 + assert "myserver.public.users" in topics + assert "myserver.public.orders" in topics 
+ assert "myserver.public.products" not in topics + + def test_database_discovery_with_include_and_exclude_filters(self) -> None: + """Test include filter followed by exclude filter.""" + from datahub.ingestion.source.kafka_connect.source_connectors import ( + DebeziumSourceConnector, + ) + + connector_config = { + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.server.name": "myserver", + "database.dbname": "testdb", + "table.include.list": "public.test_.*", + "table.exclude.list": "public.test_temp_.*", + } + + manifest = ConnectorManifest( + name="postgres-cdc-both-filters", + type="source", + config=connector_config, + tasks=[], + topic_names=[], + ) + + config = create_mock_kafka_connect_config() + config.use_schema_resolver = True + report = Mock(spec=KafkaConnectSourceReport) + + mock_schema_resolver = Mock() + mock_schema_resolver.get_urns.return_value = [ + "urn:li:dataset:(urn:li:dataPlatform:postgres,testdb.public.test_users,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:postgres,testdb.public.test_orders,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:postgres,testdb.public.test_temp_data,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:postgres,testdb.public.production_users,PROD)", + ] + + connector = DebeziumSourceConnector( + manifest, config, report, schema_resolver=mock_schema_resolver + ) + + topics = connector.get_topics_from_config() + + assert len(topics) == 2 + assert "myserver.public.test_users" in topics + assert "myserver.public.test_orders" in topics + assert "myserver.public.test_temp_data" not in topics + assert "myserver.public.production_users" not in topics + + def test_pattern_matching_with_java_regex(self) -> None: + """Test pattern matching uses Java regex for compatibility with Debezium.""" + from datahub.ingestion.source.kafka_connect.source_connectors import ( + DebeziumSourceConnector, + ) + + connector_config = { + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.server.name": "myserver", + "database.dbname": "testdb", + "table.include.list": "public\\.users,schema_.*\\.orders", + } + + manifest = ConnectorManifest( + name="postgres-cdc-regex", + type="source", + config=connector_config, + tasks=[], + topic_names=[], + ) + + config = create_mock_kafka_connect_config() + config.use_schema_resolver = True + report = Mock(spec=KafkaConnectSourceReport) + + mock_schema_resolver = Mock() + mock_schema_resolver.get_urns.return_value = [ + "urn:li:dataset:(urn:li:dataPlatform:postgres,testdb.public.users,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:postgres,testdb.schema_v1.orders,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:postgres,testdb.schema_v2.orders,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:postgres,testdb.public.products,PROD)", + ] + + connector = DebeziumSourceConnector( + manifest, config, report, schema_resolver=mock_schema_resolver + ) + + topics = connector.get_topics_from_config() + + assert len(topics) == 3 + assert "myserver.public.users" in topics + assert "myserver.schema_v1.orders" in topics + assert "myserver.schema_v2.orders" in topics + assert "myserver.public.products" not in topics + + def test_fallback_when_schema_resolver_unavailable(self) -> None: + """Test fallback to table.include.list only when SchemaResolver is not available.""" + from datahub.ingestion.source.kafka_connect.source_connectors import ( + DebeziumSourceConnector, + ) + + connector_config = { + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + 
"database.server.name": "myserver", + "database.dbname": "appdb", + "table.include.list": "public.users,public.orders", + } + + manifest = ConnectorManifest( + name="postgres-cdc-no-resolver", + type="source", + config=connector_config, + tasks=[], + topic_names=[], + ) + + config = create_mock_kafka_connect_config() + config.use_schema_resolver = False + report = Mock(spec=KafkaConnectSourceReport) + + connector = DebeziumSourceConnector(manifest, config, report) + + topics = connector.get_topics_from_config() + + assert len(topics) == 2 + assert "myserver.public.users" in topics + assert "myserver.public.orders" in topics + + def test_no_topics_when_no_config_and_no_resolver(self) -> None: + """Test returns empty list when no table.include.list and no SchemaResolver.""" + from datahub.ingestion.source.kafka_connect.source_connectors import ( + DebeziumSourceConnector, + ) + + connector_config = { + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.server.name": "myserver", + "database.dbname": "appdb", + } + + manifest = ConnectorManifest( + name="postgres-cdc-no-config", + type="source", + config=connector_config, + tasks=[], + topic_names=[], + ) + + config = create_mock_kafka_connect_config() + config.use_schema_resolver = False + report = Mock(spec=KafkaConnectSourceReport) + + connector = DebeziumSourceConnector(manifest, config, report) + + topics = connector.get_topics_from_config() + + assert len(topics) == 0 + + class TestErrorHandling: """Test error handling in connector parsing and lineage extraction.""" @@ -3235,7 +3540,7 @@ def test_parser_creation_with_missing_database_in_url(self) -> None: topic_names=[], ) - config = Mock(spec=KafkaConnectSourceConfig) + config = create_mock_kafka_connect_config() report = Mock(spec=KafkaConnectSourceReport) connector = ConfluentJDBCSourceConnector(manifest, config, report) @@ -3258,7 +3563,7 @@ def test_parser_creation_with_cloud_connector_missing_fields(self) -> None: topic_names=[], ) - config = Mock(spec=KafkaConnectSourceConfig) + config = create_mock_kafka_connect_config() report = Mock(spec=KafkaConnectSourceReport) connector = ConfluentJDBCSourceConnector(manifest, config, report) @@ -3282,7 +3587,7 @@ def test_query_based_connector_with_no_source_dataset(self) -> None: topic_names=["custom_topic"], ) - config = Mock(spec=KafkaConnectSourceConfig) + config = create_mock_kafka_connect_config() report = Mock(spec=KafkaConnectSourceReport) connector = ConfluentJDBCSourceConnector(manifest, config, report) @@ -3309,7 +3614,7 @@ def test_get_topics_from_config_handles_exceptions_gracefully(self) -> None: topic_names=[], ) - config = Mock(spec=KafkaConnectSourceConfig) + config = create_mock_kafka_connect_config() report = Mock(spec=KafkaConnectSourceReport) connector = ConfluentJDBCSourceConnector(manifest, config, report) @@ -3343,7 +3648,7 @@ def test_infer_mappings_single_table_mode(self) -> None: topic_names=["users_topic"], ) - config = Mock(spec=KafkaConnectSourceConfig) + config = create_mock_kafka_connect_config() report = Mock(spec=KafkaConnectSourceReport) connector = ConfluentJDBCSourceConnector(manifest, config, report) @@ -3373,7 +3678,7 @@ def test_infer_mappings_multi_table_mode(self) -> None: topic_names=["db_products", "db_customers", "db_orders"], ) - config = Mock(spec=KafkaConnectSourceConfig) + config = create_mock_kafka_connect_config() report = Mock(spec=KafkaConnectSourceReport) connector = ConfluentJDBCSourceConnector(manifest, config, report) @@ -3408,7 +3713,7 @@ def 
test_infer_mappings_with_topic_prefix_only(self) -> None: topic_names=["staging_events", "staging_logs"], ) - config = Mock(spec=KafkaConnectSourceConfig) + config = create_mock_kafka_connect_config() report = Mock(spec=KafkaConnectSourceReport) connector = ConfluentJDBCSourceConnector(manifest, config, report) @@ -3440,7 +3745,7 @@ def test_infer_mappings_with_schema_qualified_tables(self) -> None: topic_names=["public.users", "analytics.events"], ) - config = Mock(spec=KafkaConnectSourceConfig) + config = create_mock_kafka_connect_config() report = Mock(spec=KafkaConnectSourceReport) connector = ConfluentJDBCSourceConnector(manifest, config, report) @@ -3469,7 +3774,7 @@ def test_infer_mappings_with_no_matching_topics(self) -> None: topic_names=[], ) - config = Mock(spec=KafkaConnectSourceConfig) + config = create_mock_kafka_connect_config() report = Mock(spec=KafkaConnectSourceReport) connector = ConfluentJDBCSourceConnector(manifest, config, report) @@ -3495,7 +3800,7 @@ def test_infer_mappings_handles_unknown_platform(self) -> None: topic_names=["test_topic"], ) - config = Mock(spec=KafkaConnectSourceConfig) + config = create_mock_kafka_connect_config() report = Mock(spec=KafkaConnectSourceReport) connector = ConfluentJDBCSourceConnector(manifest, config, report) @@ -3527,7 +3832,7 @@ def test_cloud_extraction_with_no_topics(self) -> None: topic_names=[], # No topics available ) - config = Mock(spec=KafkaConnectSourceConfig) + config = create_mock_kafka_connect_config() report = Mock(spec=KafkaConnectSourceReport) connector = ConfluentJDBCSourceConnector(manifest, config, report) @@ -3559,7 +3864,7 @@ def test_cloud_extraction_with_transforms_but_no_source_tables(self) -> None: topic_names=["transformed"], ) - config = Mock(spec=KafkaConnectSourceConfig) + config = create_mock_kafka_connect_config() report = Mock(spec=KafkaConnectSourceReport) connector = ConfluentJDBCSourceConnector(manifest, config, report) @@ -3603,7 +3908,7 @@ def test_cloud_environment_detection(self) -> None: topic_names=["topic1"], ) - config = Mock(spec=KafkaConnectSourceConfig) + config = create_mock_kafka_connect_config() report = Mock(spec=KafkaConnectSourceReport) # Both should work but use different extraction logic @@ -3637,7 +3942,7 @@ def test_extract_platform_from_jdbc_url_postgres(self) -> None: topic_names=[], ) - config = Mock(spec=KafkaConnectSourceConfig) + config = create_mock_kafka_connect_config() report = Mock(spec=KafkaConnectSourceReport) connector = ConfluentJDBCSourceConnector(manifest, config, report) @@ -3661,7 +3966,7 @@ def test_extract_platform_from_jdbc_url_mysql(self) -> None: topic_names=[], ) - config = Mock(spec=KafkaConnectSourceConfig) + config = create_mock_kafka_connect_config() report = Mock(spec=KafkaConnectSourceReport) connector = ConfluentJDBCSourceConnector(manifest, config, report) @@ -3685,7 +3990,7 @@ def test_extract_platform_from_jdbc_url_sqlserver(self) -> None: topic_names=[], ) - config = Mock(spec=KafkaConnectSourceConfig) + config = create_mock_kafka_connect_config() report = Mock(spec=KafkaConnectSourceReport) connector = ConfluentJDBCSourceConnector(manifest, config, report) @@ -3709,7 +4014,7 @@ def test_extract_platform_from_invalid_jdbc_url(self) -> None: topic_names=[], ) - config = Mock(spec=KafkaConnectSourceConfig) + config = create_mock_kafka_connect_config() report = Mock(spec=KafkaConnectSourceReport) connector = ConfluentJDBCSourceConnector(manifest, config, report) @@ -3731,7 +4036,7 @@ def test_extract_platform_from_empty_jdbc_url(self) -> 
None: topic_names=[], ) - config = Mock(spec=KafkaConnectSourceConfig) + config = create_mock_kafka_connect_config() report = Mock(spec=KafkaConnectSourceReport) connector = ConfluentJDBCSourceConnector(manifest, config, report) @@ -3797,3 +4102,734 @@ def test_regex_router_reverse_transform(self) -> None: # Reverse is not fully supported, should return topics unchanged result = plugin.apply_reverse(["transformed_topic"], config) assert result == ["transformed_topic"] + + +# ============================================================================ +# Integration Tests for Schema Resolver, Fine-Grained Lineage, and Environment Detection +# ============================================================================ + + +class MockSchemaResolver(SchemaResolverInterface): + """Mock SchemaResolver for integration testing.""" + + def __init__( + self, + platform: str, + mock_urns: Optional[List[str]] = None, + raise_on_resolve: bool = False, + ): + self._platform = platform + self._mock_urns = set(mock_urns or []) + self._schemas: Dict[str, Dict[str, str]] = {} + self._raise_on_resolve = raise_on_resolve + self.graph = None + self.env = "PROD" + self.platform_instance = None + + @property + def platform(self) -> str: + """Return the platform.""" + return self._platform + + def includes_temp_tables(self) -> bool: + """Return whether temp tables are included.""" + return False + + def get_urns(self): + """Return mock URNs.""" + return self._mock_urns + + def resolve_table(self, table: Any) -> Tuple[str, Optional[Dict[str, str]]]: + """Mock table resolution.""" + if self._raise_on_resolve: + raise Exception("Schema resolver error") + + table_name = table.table + urn = f"urn:li:dataset:(urn:li:dataPlatform:{self.platform},{table_name},PROD)" + schema = self._schemas.get(table_name) + return urn, schema + + def get_urn_for_table(self, table: Any) -> str: + """Mock URN generation for table.""" + table_name = table.table + return f"urn:li:dataset:(urn:li:dataPlatform:{self.platform},{table_name},PROD)" + + def add_schema(self, table_name: str, schema: Dict[str, str]) -> None: + """Add a schema for testing.""" + self._schemas[table_name] = schema + + +class TestSchemaResolverFallback: + """Test schema resolver fallback behavior when DataHub is unavailable or errors occur.""" + + def test_fallback_when_schema_resolver_not_configured(self): + """Test fallback to config-based approach when schema resolver is not configured.""" + config = KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + ) + report = KafkaConnectSourceReport() + + connector_manifest = ConnectorManifest( + name="postgres-source", + type="source", + config={ + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.dbname": "testdb", + "table.include.list": "public.users,public.orders", + "database.server.name": "testserver", + }, + tasks=[], + topic_names=["testserver.public.users", "testserver.public.orders"], + ) + + connector = DebeziumSourceConnector( + connector_manifest=connector_manifest, + config=config, + report=report, + schema_resolver=None, + ) + + lineages = connector.extract_lineages() + + assert len(lineages) == 2 + assert all(lin.fine_grained_lineages is None for lin in lineages) + source_datasets = {lin.source_dataset for lin in lineages} + assert "testdb.public.users" in source_datasets + assert "testdb.public.orders" in source_datasets + + def test_fallback_when_schema_resolver_throws_error(self): + """Test graceful fallback when schema resolver throws an 
error during resolution.""" + config = KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + use_schema_resolver=True, + schema_resolver_finegrained_lineage=True, + ) + report = KafkaConnectSourceReport() + + connector_manifest = ConnectorManifest( + name="postgres-source", + type="source", + config={ + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.dbname": "testdb", + "table.include.list": "public.users", + "database.server.name": "testserver", + }, + tasks=[], + topic_names=["testserver.public.users"], + ) + + mock_resolver = MockSchemaResolver(platform="postgres", raise_on_resolve=True) + + connector = DebeziumSourceConnector( + connector_manifest=connector_manifest, + config=config, + report=report, + schema_resolver=mock_resolver, # type: ignore[arg-type] + ) + + lineages = connector.extract_lineages() + + assert len(lineages) == 1 + assert lineages[0].fine_grained_lineages is None + assert lineages[0].source_dataset == "testdb.public.users" + assert lineages[0].target_dataset == "testserver.public.users" + + def test_fallback_when_no_schema_metadata_found(self): + """Test fallback when schema resolver returns empty schema metadata.""" + config = KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + use_schema_resolver=True, + schema_resolver_finegrained_lineage=True, + ) + report = KafkaConnectSourceReport() + + connector_manifest = ConnectorManifest( + name="postgres-source", + type="source", + config={ + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.dbname": "testdb", + "table.include.list": "public.users", + "database.server.name": "testserver", + }, + tasks=[], + topic_names=["testserver.public.users"], + ) + + mock_resolver = MockSchemaResolver(platform="postgres") + + connector = DebeziumSourceConnector( + connector_manifest=connector_manifest, + config=config, + report=report, + schema_resolver=mock_resolver, # type: ignore[arg-type] + ) + + lineages = connector.extract_lineages() + + assert len(lineages) == 1 + assert lineages[0].source_dataset == "testdb.public.users" + assert lineages[0].fine_grained_lineages is None + + def test_fallback_pattern_expansion_no_matches(self): + """Test fallback when pattern expansion finds no matching tables in DataHub.""" + config = KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + use_schema_resolver=True, + schema_resolver_expand_patterns=True, + ) + report = KafkaConnectSourceReport() + + connector_manifest = ConnectorManifest( + name="postgres-source", + type="source", + config={ + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.dbname": "testdb", + "table.include.list": "nonexistent.*", + "database.server.name": "testserver", + }, + tasks=[], + topic_names=[], + ) + + mock_resolver = MockSchemaResolver( + platform="postgres", + mock_urns=[ + "urn:li:dataset:(urn:li:dataPlatform:postgres,testdb.public.users,PROD)", + ], + ) + + connector = DebeziumSourceConnector( + connector_manifest=connector_manifest, + config=config, + report=report, + schema_resolver=mock_resolver, # type: ignore[arg-type] + ) + + result = connector._expand_table_patterns("nonexistent.*", "postgres", "testdb") + assert result == ["nonexistent.*"] + + lineages = connector.extract_lineages() + assert len(lineages) == 1 + assert lineages[0].fine_grained_lineages is None + + def test_fallback_with_partial_schema_availability(self): + """Test behavior when 
schemas are available for some tables but not others.""" + config = KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + use_schema_resolver=True, + schema_resolver_finegrained_lineage=True, + ) + report = KafkaConnectSourceReport() + + connector_manifest = ConnectorManifest( + name="postgres-source", + type="source", + config={ + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.dbname": "testdb", + "table.include.list": "public.users,public.orders", + "database.server.name": "testserver", + }, + tasks=[], + topic_names=["testserver.public.users", "testserver.public.orders"], + ) + + mock_resolver = MockSchemaResolver(platform="postgres") + mock_resolver.add_schema( + "testdb.public.users", + {"id": "INT", "name": "VARCHAR"}, + ) + + connector = DebeziumSourceConnector( + connector_manifest=connector_manifest, + config=config, + report=report, + schema_resolver=mock_resolver, # type: ignore[arg-type] + ) + + lineages = connector.extract_lineages() + + assert len(lineages) == 2 + + users_lineage = next( + (lin for lin in lineages if "users" in lin.target_dataset), None + ) + orders_lineage = next( + (lin for lin in lineages if "orders" in lin.target_dataset), None + ) + + assert users_lineage is not None + assert users_lineage.fine_grained_lineages is not None + assert len(users_lineage.fine_grained_lineages) == 2 + + assert orders_lineage is not None + assert orders_lineage.fine_grained_lineages is None + + +class TestFineGrainedLineageWithReplaceField: + """Integration tests for fine-grained lineage with ReplaceField transforms.""" + + def test_fine_grained_lineage_with_field_exclusion(self): + """Test that excluded fields are dropped from fine-grained lineage.""" + config = KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + use_schema_resolver=True, + schema_resolver_finegrained_lineage=True, + ) + report = KafkaConnectSourceReport() + + connector_manifest = ConnectorManifest( + name="postgres-source", + type="source", + config={ + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.dbname": "testdb", + "table.include.list": "public.users", + "database.server.name": "testserver", + "transforms": "dropPassword", + "transforms.dropPassword.type": "org.apache.kafka.connect.transforms.ReplaceField$Value", + "transforms.dropPassword.exclude": "password", + }, + tasks=[], + topic_names=["testserver.public.users"], + ) + + mock_resolver = MockSchemaResolver(platform="postgres") + mock_resolver.add_schema( + "testdb.public.users", + { + "id": "INT", + "username": "VARCHAR", + "password": "VARCHAR", + "email": "VARCHAR", + }, + ) + + connector = DebeziumSourceConnector( + connector_manifest=connector_manifest, + config=config, + report=report, + schema_resolver=mock_resolver, # type: ignore[arg-type] + ) + + lineages = connector.extract_lineages() + + assert len(lineages) == 1 + lineage = lineages[0] + + assert lineage.fine_grained_lineages is not None + assert len(lineage.fine_grained_lineages) == 3 + + downstream_fields = [] + for fg_lineage in lineage.fine_grained_lineages: + for downstream_urn in fg_lineage["downstreams"]: + field_name = downstream_urn.split(",")[-1].rstrip(")") + downstream_fields.append(field_name) + + assert "password" not in downstream_fields + assert "id" in downstream_fields + assert "username" in downstream_fields + assert "email" in downstream_fields + + def test_fine_grained_lineage_with_field_inclusion(self): + """Test that only 
included fields appear in fine-grained lineage.""" + config = KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + use_schema_resolver=True, + schema_resolver_finegrained_lineage=True, + ) + report = KafkaConnectSourceReport() + + connector_manifest = ConnectorManifest( + name="postgres-source", + type="source", + config={ + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.dbname": "testdb", + "table.include.list": "public.users", + "database.server.name": "testserver", + "transforms": "keepOnly", + "transforms.keepOnly.type": "org.apache.kafka.connect.transforms.ReplaceField$Value", + "transforms.keepOnly.include": "id,username,email", + }, + tasks=[], + topic_names=["testserver.public.users"], + ) + + mock_resolver = MockSchemaResolver(platform="postgres") + mock_resolver.add_schema( + "testdb.public.users", + { + "id": "INT", + "username": "VARCHAR", + "password": "VARCHAR", + "email": "VARCHAR", + "internal_notes": "TEXT", + }, + ) + + connector = DebeziumSourceConnector( + connector_manifest=connector_manifest, + config=config, + report=report, + schema_resolver=mock_resolver, # type: ignore[arg-type] + ) + + lineages = connector.extract_lineages() + + assert len(lineages) == 1 + lineage = lineages[0] + + assert lineage.fine_grained_lineages is not None + assert len(lineage.fine_grained_lineages) == 3 + + downstream_fields = [] + for fg_lineage in lineage.fine_grained_lineages: + for downstream_urn in fg_lineage["downstreams"]: + field_name = downstream_urn.split(",")[-1].rstrip(")") + downstream_fields.append(field_name) + + assert set(downstream_fields) == {"id", "username", "email"} + assert "password" not in downstream_fields + assert "internal_notes" not in downstream_fields + + def test_fine_grained_lineage_with_field_renaming(self): + """Test that renamed fields appear with correct names in fine-grained lineage.""" + config = KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + use_schema_resolver=True, + schema_resolver_finegrained_lineage=True, + ) + report = KafkaConnectSourceReport() + + connector_manifest = ConnectorManifest( + name="postgres-source", + type="source", + config={ + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.dbname": "testdb", + "table.include.list": "public.users", + "database.server.name": "testserver", + "transforms": "renameFields", + "transforms.renameFields.type": "org.apache.kafka.connect.transforms.ReplaceField$Value", + "transforms.renameFields.renames": "user_id:id,user_name:name", + }, + tasks=[], + topic_names=["testserver.public.users"], + ) + + mock_resolver = MockSchemaResolver(platform="postgres") + mock_resolver.add_schema( + "testdb.public.users", + { + "user_id": "INT", + "user_name": "VARCHAR", + "email": "VARCHAR", + }, + ) + + connector = DebeziumSourceConnector( + connector_manifest=connector_manifest, + config=config, + report=report, + schema_resolver=mock_resolver, # type: ignore[arg-type] + ) + + lineages = connector.extract_lineages() + + assert len(lineages) == 1 + lineage = lineages[0] + + assert lineage.fine_grained_lineages is not None + assert len(lineage.fine_grained_lineages) == 3 + + downstream_fields = [] + for fg_lineage in lineage.fine_grained_lineages: + for downstream_urn in fg_lineage["downstreams"]: + field_name = downstream_urn.split(",")[-1].rstrip(")") + downstream_fields.append(field_name) + + assert "id" in downstream_fields + assert "name" in downstream_fields + assert 
"email" in downstream_fields + assert "user_id" not in downstream_fields + assert "user_name" not in downstream_fields + + def test_fine_grained_lineage_with_chained_transforms(self): + """Test fine-grained lineage with multiple chained ReplaceField transforms.""" + config = KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + use_schema_resolver=True, + schema_resolver_finegrained_lineage=True, + ) + report = KafkaConnectSourceReport() + + connector_manifest = ConnectorManifest( + name="postgres-source", + type="source", + config={ + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.dbname": "testdb", + "table.include.list": "public.users", + "database.server.name": "testserver", + "transforms": "dropSensitive,renameFields", + "transforms.dropSensitive.type": "org.apache.kafka.connect.transforms.ReplaceField$Value", + "transforms.dropSensitive.exclude": "password,ssn", + "transforms.renameFields.type": "org.apache.kafka.connect.transforms.ReplaceField$Value", + "transforms.renameFields.renames": "user_id:id,user_name:name", + }, + tasks=[], + topic_names=["testserver.public.users"], + ) + + mock_resolver = MockSchemaResolver(platform="postgres") + mock_resolver.add_schema( + "testdb.public.users", + { + "user_id": "INT", + "user_name": "VARCHAR", + "email": "VARCHAR", + "password": "VARCHAR", + "ssn": "VARCHAR", + }, + ) + + connector = DebeziumSourceConnector( + connector_manifest=connector_manifest, + config=config, + report=report, + schema_resolver=mock_resolver, # type: ignore[arg-type] + ) + + lineages = connector.extract_lineages() + + assert len(lineages) == 1 + lineage = lineages[0] + + assert lineage.fine_grained_lineages is not None + assert len(lineage.fine_grained_lineages) == 3 + + downstream_fields = [] + for fg_lineage in lineage.fine_grained_lineages: + for downstream_urn in fg_lineage["downstreams"]: + field_name = downstream_urn.split(",")[-1].rstrip(")") + downstream_fields.append(field_name) + + assert set(downstream_fields) == {"id", "name", "email"} + assert "password" not in downstream_fields + assert "ssn" not in downstream_fields + assert "user_id" not in downstream_fields + assert "user_name" not in downstream_fields + + +class TestPlatformCloudEnvironmentDetection: + """Integration tests for Platform vs Cloud environment detection.""" + + def test_platform_environment_detection_jdbc(self): + """Test that self-hosted JDBC connector is detected as Platform environment.""" + config = KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + ) + report = KafkaConnectSourceReport() + + connector_manifest = ConnectorManifest( + name="jdbc-source", + type="source", + config={ + "connector.class": JDBC_SOURCE_CONNECTOR_CLASS, + "connection.url": "jdbc:postgresql://localhost:5432/testdb", + "table.include.list": "public.users,public.orders", + "topic.prefix": "db_", + }, + tasks=[], + topic_names=["db_users", "db_orders"], + ) + + connector = ConfluentJDBCSourceConnector( + connector_manifest=connector_manifest, + config=config, + report=report, + ) + + lineages = connector.extract_lineages() + + assert len(lineages) == 2 + assert all(lin.target_platform == "kafka" for lin in lineages) + + target_datasets = {lin.target_dataset for lin in lineages} + assert "db_users" in target_datasets + assert "db_orders" in target_datasets + + def test_cloud_environment_detection_postgres_cdc(self): + """Test that Confluent Cloud CDC connector is detected as Cloud environment.""" + config = 
KafkaConnectSourceConfig( + connect_uri="https://api.confluent.cloud/connect/v1/environments/env-123/clusters/lkc-456", + cluster_name="test", + ) + report = KafkaConnectSourceReport() + + connector_manifest = ConnectorManifest( + name="postgres-cloud-source", + type="source", + config={ + "connector.class": POSTGRES_CDC_SOURCE_CLOUD, + "database.hostname": "postgres.example.com", + "database.port": "5432", + "database.dbname": "testdb", + "table.include.list": "public.users,public.orders", + "database.server.name": "cloudserver", + }, + tasks=[], + topic_names=["cloudserver.public.users", "cloudserver.public.orders"], + ) + + connector = DebeziumSourceConnector( + connector_manifest=connector_manifest, + config=config, + report=report, + ) + + assert connector_manifest.config["connector.class"] in CLOUD_JDBC_SOURCE_CLASSES + + lineages = connector.extract_lineages() + + assert len(lineages) == 2 + target_datasets = {lin.target_dataset for lin in lineages} + assert "cloudserver.public.users" in target_datasets + assert "cloudserver.public.orders" in target_datasets + + def test_platform_with_transforms_uses_actual_topics(self): + """Test that Platform environment uses actual runtime topics from API when transforms are present.""" + config = KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + ) + report = KafkaConnectSourceReport() + + connector_manifest = ConnectorManifest( + name="jdbc-with-transforms", + type="source", + config={ + "connector.class": JDBC_SOURCE_CONNECTOR_CLASS, + "connection.url": "jdbc:postgresql://localhost:5432/testdb", + "table.include.list": "public.users", + "topic.prefix": "db_", + "transforms": "route", + "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter", + "transforms.route.regex": "db_(.*)", + "transforms.route.replacement": "prod_$1", + }, + tasks=[], + topic_names=["prod_users"], + ) + + connector = ConfluentJDBCSourceConnector( + connector_manifest=connector_manifest, + config=config, + report=report, + ) + + lineages = connector.extract_lineages() + + assert len(lineages) == 1 + assert lineages[0].target_dataset == "prod_users" + source_dataset = lineages[0].source_dataset + assert source_dataset is not None + assert "users" in source_dataset + + def test_cloud_with_transforms_without_jpype(self): + """Test that Cloud environment handles gracefully when JPype is not available.""" + config = KafkaConnectSourceConfig( + connect_uri="https://api.confluent.cloud/connect/v1/environments/env-123/clusters/lkc-456", + cluster_name="test", + ) + report = KafkaConnectSourceReport() + + connector_manifest = ConnectorManifest( + name="cloud-with-transforms", + type="source", + config={ + "connector.class": POSTGRES_CDC_SOURCE_CLOUD, + "database.hostname": "postgres.example.com", + "database.port": "5432", + "database.dbname": "testdb", + "table.include.list": "public.users,public.orders", + "database.server.name": "cloudserver", + }, + tasks=[], + topic_names=[ + "cloudserver.public.users", + "cloudserver.public.orders", + "other_connector_topic", + ], + ) + + connector = DebeziumSourceConnector( + connector_manifest=connector_manifest, + config=config, + report=report, + ) + + lineages = connector.extract_lineages() + + assert len(lineages) == 2 + target_datasets = {lin.target_dataset for lin in lineages} + assert "cloudserver.public.users" in target_datasets + assert "cloudserver.public.orders" in target_datasets + + def test_platform_single_table_multi_topic_transform(self): + """Test Platform environment 
with single source table producing multiple topics.""" + config = KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + ) + report = KafkaConnectSourceReport() + + connector_manifest = ConnectorManifest( + name="jdbc-multi-topic", + type="source", + config={ + "connector.class": JDBC_SOURCE_CONNECTOR_CLASS, + "connection.url": "jdbc:postgresql://localhost:5432/testdb", + "table.include.list": "public.events", + "topic.prefix": "db_", + "transforms": "extractTopic", + "transforms.extractTopic.type": "io.confluent.connect.transforms.ExtractTopic$Value", + "transforms.extractTopic.field": "event_type", + }, + tasks=[], + topic_names=["user_events", "order_events", "system_events"], + ) + + connector = ConfluentJDBCSourceConnector( + connector_manifest=connector_manifest, + config=config, + report=report, + ) + + lineages = connector.extract_lineages() + + assert len(lineages) == 3 + source_datasets = {lin.source_dataset for lin in lineages} + assert len(source_datasets) == 1 + source_dataset = list(source_datasets)[0] + assert source_dataset is not None + assert "events" in source_dataset + + target_datasets = {lin.target_dataset for lin in lineages} + assert "user_events" in target_datasets + assert "order_events" in target_datasets + assert "system_events" in target_datasets diff --git a/metadata-ingestion/tests/unit/test_kafka_connect_config_validation.py b/metadata-ingestion/tests/unit/test_kafka_connect_config_validation.py new file mode 100644 index 00000000000000..24d65396a486d8 --- /dev/null +++ b/metadata-ingestion/tests/unit/test_kafka_connect_config_validation.py @@ -0,0 +1,305 @@ +""" +Tests for Kafka Connect configuration validation. + +This module tests the pydantic validators that ensure proper configuration +interdependencies and provide clear error messages for invalid combinations. 
+""" + +from typing import Any + +import pytest + +from datahub.ingestion.source.kafka_connect.common import KafkaConnectSourceConfig + + +class TestConfigurationValidation: + """Tests for KafkaConnectSourceConfig validators.""" + + def test_schema_resolver_defaults_when_enabled(self): + """Test that schema resolver features default to True when use_schema_resolver=True.""" + config = KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + use_schema_resolver=True, + ) + + # Should auto-set both features to True + assert config.use_schema_resolver is True + assert config.schema_resolver_expand_patterns is True + assert config.schema_resolver_finegrained_lineage is True + + def test_schema_resolver_defaults_when_disabled(self): + """Test that schema resolver features default to False when use_schema_resolver=False.""" + config = KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + use_schema_resolver=False, + ) + + # Should auto-set both features to False + assert config.use_schema_resolver is False + assert config.schema_resolver_expand_patterns is False + assert config.schema_resolver_finegrained_lineage is False + + def test_schema_resolver_explicit_override_when_enabled(self): + """Test that explicit values override defaults when use_schema_resolver=True.""" + config = KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + use_schema_resolver=True, + schema_resolver_expand_patterns=False, + schema_resolver_finegrained_lineage=True, + ) + + assert config.use_schema_resolver is True + assert config.schema_resolver_expand_patterns is False + assert config.schema_resolver_finegrained_lineage is True + + def test_schema_resolver_explicit_override_when_disabled(self): + """Test that explicit True values are preserved even when use_schema_resolver=False.""" + # This allows users to pre-configure features before enabling schema resolver + config = KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + use_schema_resolver=False, + schema_resolver_expand_patterns=True, + schema_resolver_finegrained_lineage=False, + ) + + assert config.use_schema_resolver is False + # Explicit values are preserved (though they won't take effect until use_schema_resolver=True) + assert config.schema_resolver_expand_patterns is True + assert config.schema_resolver_finegrained_lineage is False + + def test_kafka_api_key_without_secret_raises_error(self): + """Test that kafka_api_key without kafka_api_secret raises error.""" + with pytest.raises(ValueError) as exc_info: + KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + kafka_api_key="test-key", + kafka_api_secret=None, + ) + + assert "kafka_api_key" in str(exc_info.value) + assert "kafka_api_secret" in str(exc_info.value) + assert "must be provided together" in str(exc_info.value) + + def test_kafka_api_secret_without_key_raises_error(self): + """Test that kafka_api_secret without kafka_api_key raises error.""" + with pytest.raises(ValueError) as exc_info: + KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + kafka_api_key=None, + kafka_api_secret="test-secret", + ) + + assert "kafka_api_key" in str(exc_info.value) + assert "kafka_api_secret" in str(exc_info.value) + assert "must be provided together" in str(exc_info.value) + + def test_kafka_api_credentials_valid_when_both_provided(self): + """Test that kafka API credentials work when both key and secret provided.""" + config = 
KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + kafka_api_key="test-key", + kafka_api_secret="test-secret", + ) + + assert config.kafka_api_key == "test-key" + assert config.kafka_api_secret == "test-secret" + + def test_environment_id_without_cluster_id_raises_error(self): + """Test that confluent_cloud_environment_id without cluster_id raises error.""" + with pytest.raises(ValueError) as exc_info: + KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + confluent_cloud_environment_id="env-123", + confluent_cloud_cluster_id=None, + ) + + assert "confluent_cloud_environment_id" in str(exc_info.value) + assert "confluent_cloud_cluster_id" in str(exc_info.value) + assert "must be provided together" in str(exc_info.value) + + def test_cluster_id_without_environment_id_raises_error(self): + """Test that confluent_cloud_cluster_id without environment_id raises error.""" + with pytest.raises(ValueError) as exc_info: + KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + confluent_cloud_environment_id=None, + confluent_cloud_cluster_id="lkc-123", + ) + + assert "confluent_cloud_environment_id" in str(exc_info.value) + assert "confluent_cloud_cluster_id" in str(exc_info.value) + assert "must be provided together" in str(exc_info.value) + + def test_cloud_ids_valid_when_both_provided(self): + """Test that Confluent Cloud IDs work when both provided.""" + config = KafkaConnectSourceConfig( + cluster_name="test", + confluent_cloud_environment_id="env-123", + confluent_cloud_cluster_id="lkc-456", + ) + + assert config.confluent_cloud_environment_id == "env-123" + assert config.confluent_cloud_cluster_id == "lkc-456" + # URI should be auto-constructed + assert "env-123" in config.connect_uri + assert "lkc-456" in config.connect_uri + + def test_invalid_kafka_rest_endpoint_raises_error(self): + """Test that invalid kafka_rest_endpoint format raises error.""" + with pytest.raises(ValueError) as exc_info: + KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + kafka_rest_endpoint="invalid-endpoint", + ) + + assert "kafka_rest_endpoint" in str(exc_info.value) + assert "HTTP" in str(exc_info.value) + + def test_valid_kafka_rest_endpoint_https(self): + """Test that valid HTTPS kafka_rest_endpoint is accepted.""" + config = KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + kafka_rest_endpoint="https://pkc-12345.us-west-2.aws.confluent.cloud", + ) + + assert ( + config.kafka_rest_endpoint + == "https://pkc-12345.us-west-2.aws.confluent.cloud" + ) + + def test_valid_kafka_rest_endpoint_http(self): + """Test that valid HTTP kafka_rest_endpoint is accepted.""" + config = KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + kafka_rest_endpoint="http://localhost:8082", + ) + + assert config.kafka_rest_endpoint == "http://localhost:8082" + + def test_auto_construct_uri_from_cloud_ids(self): + """Test automatic URI construction from Confluent Cloud IDs.""" + config = KafkaConnectSourceConfig( + cluster_name="test", + confluent_cloud_environment_id="env-xyz123", + confluent_cloud_cluster_id="lkc-abc456", + ) + + expected_uri = ( + "https://api.confluent.cloud/connect/v1/" + "environments/env-xyz123/" + "clusters/lkc-abc456" + ) + assert config.connect_uri == expected_uri + + def test_explicit_uri_not_overwritten_by_cloud_ids(self): + """Test that explicit connect_uri is preserved even with Cloud IDs.""" + explicit_uri = 
"http://my-custom-connect:8083" + config = KafkaConnectSourceConfig( + connect_uri=explicit_uri, + cluster_name="test", + confluent_cloud_environment_id="env-123", + confluent_cloud_cluster_id="lkc-456", + ) + + # Should keep explicit URI (but validator warns) + assert config.connect_uri == explicit_uri + + def test_default_config_is_valid(self): + """Test that default configuration is valid.""" + config = KafkaConnectSourceConfig( + cluster_name="test", + ) + + # Should not raise any validation errors + assert config.use_schema_resolver is False + # When use_schema_resolver=False, features should default to False + assert config.schema_resolver_expand_patterns is False + assert config.schema_resolver_finegrained_lineage is False + assert config.kafka_api_key is None + assert config.kafka_api_secret is None + + def test_schema_resolver_enabled_with_all_features_disabled_warns( + self, caplog: Any + ) -> None: + """Test warning when schema resolver enabled but all features disabled.""" + import logging + + with caplog.at_level(logging.WARNING): + KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + use_schema_resolver=True, + schema_resolver_expand_patterns=False, + schema_resolver_finegrained_lineage=False, + ) + + # Should log a warning about no features enabled + assert any( + "Schema resolver is enabled but all features are disabled" in record.message + for record in caplog.records + ) + + def test_multiple_validation_errors_caught_independently(self): + """Test that each validation error is caught independently.""" + # Test kafka_api_key without secret + with pytest.raises(ValueError, match="kafka_api_key.*kafka_api_secret"): + KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + kafka_api_key="key", + ) + + # Test environment_id without cluster_id + with pytest.raises( + ValueError, + match="confluent_cloud_environment_id.*confluent_cloud_cluster_id", + ): + KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + confluent_cloud_environment_id="env-123", + ) + + # Test invalid kafka_rest_endpoint format + with pytest.raises(ValueError, match="kafka_rest_endpoint.*HTTP"): + KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + kafka_rest_endpoint="not-a-url", + ) + + def test_complex_valid_configuration(self): + """Test a complex but valid configuration with multiple features.""" + config = KafkaConnectSourceConfig( + cluster_name="test", + confluent_cloud_environment_id="env-123", + confluent_cloud_cluster_id="lkc-456", + kafka_api_key="api-key", + kafka_api_secret="api-secret", + kafka_rest_endpoint="https://pkc-12345.us-west-2.aws.confluent.cloud", + use_schema_resolver=True, + schema_resolver_expand_patterns=True, + schema_resolver_finegrained_lineage=True, + use_connect_topics_api=True, + ) + + # All fields should be properly set + assert config.confluent_cloud_environment_id == "env-123" + assert config.confluent_cloud_cluster_id == "lkc-456" + assert config.kafka_api_key == "api-key" + assert config.kafka_api_secret == "api-secret" + assert config.use_schema_resolver is True + assert config.schema_resolver_expand_patterns is True + assert config.schema_resolver_finegrained_lineage is True diff --git a/metadata-ingestion/tests/unit/test_kafka_connect_debezium_table_discovery.py b/metadata-ingestion/tests/unit/test_kafka_connect_debezium_table_discovery.py new file mode 100644 index 00000000000000..db3e73ed40435a --- /dev/null +++ 
b/metadata-ingestion/tests/unit/test_kafka_connect_debezium_table_discovery.py @@ -0,0 +1,661 @@ +"""Tests for Debezium connector table discovery and filtering functionality.""" + +from typing import Optional +from unittest.mock import Mock, patch + +import jpype +import jpype.imports +import pytest + +from datahub.ingestion.source.kafka_connect.common import ( + ConnectorManifest, + KafkaConnectSourceConfig, + KafkaConnectSourceReport, +) +from datahub.ingestion.source.kafka_connect.source_connectors import ( + DebeziumSourceConnector, +) +from datahub.sql_parsing.schema_resolver import SchemaResolver + + +@pytest.fixture(scope="session", autouse=True) +def ensure_jvm_started(): + """Ensure JVM is started for all tests requiring Java regex.""" + if not jpype.isJVMStarted(): + jpype.startJVM(jpype.getDefaultJVMPath()) + yield + + +def create_debezium_connector( + connector_config: dict, + schema_resolver: Optional[SchemaResolver] = None, + use_schema_resolver: bool = False, +) -> DebeziumSourceConnector: + """Helper to create a DebeziumSourceConnector instance for testing.""" + manifest = ConnectorManifest( + name="test-connector", + type="source", + config=connector_config, + tasks=[], + topic_names=[], + ) + + config = Mock(spec=KafkaConnectSourceConfig) + config.use_schema_resolver = use_schema_resolver + config.schema_resolver_expand_patterns = True + config.env = "PROD" + + report = Mock(spec=KafkaConnectSourceReport) + + connector = DebeziumSourceConnector(manifest, config, report) + connector.schema_resolver = schema_resolver + + return connector + + +class TestGetTableNamesFromConfigOrDiscovery: + """Tests for _get_table_names_from_config_or_discovery method.""" + + def test_no_schema_resolver_uses_table_include_list(self) -> None: + """When SchemaResolver is not available, should use table.include.list from config.""" + connector_config = { + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.server.name": "myserver", + "database.dbname": "testdb", + "table.include.list": "public.users,public.orders,private.data", + } + + connector = create_debezium_connector( + connector_config, use_schema_resolver=False + ) + + result = connector._get_table_names_from_config_or_discovery( + connector_config, "testdb", "postgres" + ) + + assert result == ["public.users", "public.orders", "private.data"] + + def test_no_schema_resolver_no_table_config_returns_empty(self) -> None: + """When SchemaResolver disabled and no table config, should return empty list.""" + connector_config = { + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.server.name": "myserver", + "database.dbname": "testdb", + } + + connector = create_debezium_connector( + connector_config, use_schema_resolver=False + ) + + result = connector._get_table_names_from_config_or_discovery( + connector_config, "testdb", "postgres" + ) + + assert result == [] + + def test_schema_resolver_no_database_name_falls_back_to_config(self) -> None: + """When SchemaResolver enabled but no database name, should fall back to table.include.list.""" + connector_config = { + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.server.name": "myserver", + "table.include.list": "public.users", + } + + schema_resolver = Mock(spec=SchemaResolver) + connector = create_debezium_connector( + connector_config, schema_resolver=schema_resolver, use_schema_resolver=True + ) + + result = connector._get_table_names_from_config_or_discovery( + connector_config, None, 
"postgres" + ) + + assert result == ["public.users"] + + def test_schema_resolver_discovers_tables_from_database(self) -> None: + """When SchemaResolver enabled with database name, should discover tables.""" + connector_config = { + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.server.name": "myserver", + "database.dbname": "testdb", + } + + # Mock SchemaResolver to return discovered tables + schema_resolver = Mock(spec=SchemaResolver) + connector = create_debezium_connector( + connector_config, schema_resolver=schema_resolver, use_schema_resolver=True + ) + + # Mock _discover_tables_from_database to return tables + with patch.object( + connector, + "_discover_tables_from_database", + return_value=["public.users", "public.orders", "public.products"], + ): + result = connector._get_table_names_from_config_or_discovery( + connector_config, "testdb", "postgres" + ) + + assert result == ["public.users", "public.orders", "public.products"] + + def test_schema_resolver_no_tables_found_returns_empty(self) -> None: + """When SchemaResolver finds no tables, should return empty list.""" + connector_config = { + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.server.name": "myserver", + "database.dbname": "testdb", + } + + schema_resolver = Mock(spec=SchemaResolver) + connector = create_debezium_connector( + connector_config, schema_resolver=schema_resolver, use_schema_resolver=True + ) + + # Mock _discover_tables_from_database to return empty + with patch.object(connector, "_discover_tables_from_database", return_value=[]): + result = connector._get_table_names_from_config_or_discovery( + connector_config, "testdb", "postgres" + ) + + assert result == [] + + +class TestApplySchemaFilters: + """Tests for _apply_schema_filters method.""" + + def test_no_schema_filters_returns_all_tables(self) -> None: + """When no schema filters configured, should return all input tables.""" + connector_config = { + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.server.name": "myserver", + "database.dbname": "testdb", + } + + connector = create_debezium_connector(connector_config) + + tables = ["public.users", "public.orders", "private.data"] + result = connector._apply_schema_filters(connector_config, tables) + + assert result == tables + + def test_schema_include_list_filters_by_schema_name(self) -> None: + """schema.include.list should filter tables by schema name.""" + connector_config = { + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.server.name": "myserver", + "database.dbname": "testdb", + "schema.include.list": "public", + } + + connector = create_debezium_connector(connector_config) + + tables = ["public.users", "public.orders", "private.data"] + result = connector._apply_schema_filters(connector_config, tables) + + assert result == ["public.users", "public.orders"] + + def test_schema_include_list_with_regex_pattern(self) -> None: + """schema.include.list should support Java regex patterns.""" + connector_config = { + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.server.name": "myserver", + "database.dbname": "testdb", + "schema.include.list": "schema_v[0-9]+", + } + + connector = create_debezium_connector(connector_config) + + tables = [ + "schema_v1.orders", + "schema_v2.orders", + "schema_vX.orders", + "public.users", + ] + result = connector._apply_schema_filters(connector_config, tables) + + assert result == 
["schema_v1.orders", "schema_v2.orders"] + + def test_schema_include_list_multiple_patterns(self) -> None: + """schema.include.list should support multiple comma-separated patterns.""" + connector_config = { + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.server.name": "myserver", + "database.dbname": "testdb", + "schema.include.list": "public,analytics", + } + + connector = create_debezium_connector(connector_config) + + tables = [ + "public.users", + "analytics.events", + "private.data", + "staging.temp", + ] + result = connector._apply_schema_filters(connector_config, tables) + + assert result == ["public.users", "analytics.events"] + + def test_schema_include_list_no_matches_returns_empty(self) -> None: + """schema.include.list with no matches should return empty list.""" + connector_config = { + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.server.name": "myserver", + "database.dbname": "testdb", + "schema.include.list": "nonexistent", + } + + connector = create_debezium_connector(connector_config) + + tables = ["public.users", "private.data"] + result = connector._apply_schema_filters(connector_config, tables) + + assert result == [] + + def test_schema_exclude_list_filters_out_matching_schemas(self) -> None: + """schema.exclude.list should filter out tables with matching schema names.""" + connector_config = { + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.server.name": "myserver", + "database.dbname": "testdb", + "schema.exclude.list": "private", + } + + connector = create_debezium_connector(connector_config) + + tables = ["public.users", "public.orders", "private.data", "private.secrets"] + result = connector._apply_schema_filters(connector_config, tables) + + assert result == ["public.users", "public.orders"] + + def test_schema_exclude_list_with_regex_pattern(self) -> None: + """schema.exclude.list should support Java regex patterns.""" + connector_config = { + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.server.name": "myserver", + "database.dbname": "testdb", + "schema.exclude.list": "temp.*", + } + + connector = create_debezium_connector(connector_config) + + tables = ["public.users", "temp_staging.data", "temp_test.data", "prod.users"] + result = connector._apply_schema_filters(connector_config, tables) + + assert result == ["public.users", "prod.users"] + + def test_schema_include_and_exclude_combined(self) -> None: + """schema.include.list and schema.exclude.list should work together.""" + connector_config = { + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.server.name": "myserver", + "database.dbname": "testdb", + "schema.include.list": "public,analytics", + "schema.exclude.list": "analytics", + } + + connector = create_debezium_connector(connector_config) + + tables = [ + "public.users", + "public.orders", + "analytics.events", + "private.data", + ] + result = connector._apply_schema_filters(connector_config, tables) + + # Include public and analytics, then exclude analytics + assert result == ["public.users", "public.orders"] + + def test_schema_filters_skip_tables_without_schema_separator(self) -> None: + """Schema filters should handle tables without '.' 
separator.""" + connector_config = { + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.server.name": "myserver", + "database.dbname": "testdb", + "schema.include.list": "public", + } + + connector = create_debezium_connector(connector_config) + + tables = ["public.users", "just_table_name", "private.data"] + result = connector._apply_schema_filters(connector_config, tables) + + # Tables without '.' are skipped by include filter + assert result == ["public.users"] + + +class TestApplyTableFilters: + """Tests for _apply_table_filters method.""" + + def test_no_table_filters_returns_all_tables(self) -> None: + """When no table filters configured, should return all input tables.""" + connector_config = { + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.server.name": "myserver", + "database.dbname": "testdb", + } + + connector = create_debezium_connector(connector_config) + + tables = ["public.users", "public.orders", "private.data"] + result = connector._apply_table_filters(connector_config, tables) + + assert result == tables + + def test_table_include_list_exact_match(self) -> None: + """table.include.list should match exact table names.""" + connector_config = { + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.server.name": "myserver", + "database.dbname": "testdb", + "table.include.list": "public.users,public.orders", + } + + connector = create_debezium_connector(connector_config) + + tables = ["public.users", "public.orders", "public.products", "private.data"] + result = connector._apply_table_filters(connector_config, tables) + + assert result == ["public.users", "public.orders"] + + def test_table_include_list_with_wildcard(self) -> None: + """table.include.list should support wildcard patterns.""" + connector_config = { + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.server.name": "myserver", + "database.dbname": "testdb", + "table.include.list": "public.*", + } + + connector = create_debezium_connector(connector_config) + + tables = ["public.users", "public.orders", "private.data"] + result = connector._apply_table_filters(connector_config, tables) + + assert result == ["public.users", "public.orders"] + + def test_table_include_list_with_character_class(self) -> None: + """table.include.list should support Java regex character classes.""" + connector_config = { + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.server.name": "myserver", + "database.dbname": "testdb", + "table.include.list": "public.user[0-9]+", + } + + connector = create_debezium_connector(connector_config) + + tables = ["public.user1", "public.user2", "public.userX", "public.orders"] + result = connector._apply_table_filters(connector_config, tables) + + assert result == ["public.user1", "public.user2"] + + def test_table_include_list_no_matches_returns_empty(self) -> None: + """table.include.list with no matches should return empty list.""" + connector_config = { + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.server.name": "myserver", + "database.dbname": "testdb", + "table.include.list": "nonexistent.*", + } + + connector = create_debezium_connector(connector_config) + + tables = ["public.users", "private.data"] + result = connector._apply_table_filters(connector_config, tables) + + assert result == [] + + def test_table_exclude_list_filters_out_matching_tables(self) -> None: + 
"""table.exclude.list should filter out matching tables.""" + connector_config = { + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.server.name": "myserver", + "database.dbname": "testdb", + "table.exclude.list": "public.temp.*", + } + + connector = create_debezium_connector(connector_config) + + tables = ["public.users", "public.temp_staging", "public.temp_test"] + result = connector._apply_table_filters(connector_config, tables) + + assert result == ["public.users"] + + def test_table_include_and_exclude_combined(self) -> None: + """table.include.list and table.exclude.list should work together.""" + connector_config = { + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.server.name": "myserver", + "database.dbname": "testdb", + "table.include.list": "public.*", + "table.exclude.list": "public.temp.*", + } + + connector = create_debezium_connector(connector_config) + + tables = [ + "public.users", + "public.orders", + "public.temp_staging", + "private.data", + ] + result = connector._apply_table_filters(connector_config, tables) + + # Include public.*, then exclude public.temp.* + assert result == ["public.users", "public.orders"] + + def test_table_whitelist_legacy_config_name(self) -> None: + """Should support legacy 'table.whitelist' config name.""" + connector_config = { + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.server.name": "myserver", + "database.dbname": "testdb", + "table.whitelist": "public.users", + } + + connector = create_debezium_connector(connector_config) + + tables = ["public.users", "public.orders"] + result = connector._apply_table_filters(connector_config, tables) + + assert result == ["public.users"] + + def test_table_blacklist_legacy_config_name(self) -> None: + """Should support legacy 'table.blacklist' config name.""" + connector_config = { + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.server.name": "myserver", + "database.dbname": "testdb", + "table.blacklist": "public.temp.*", + } + + connector = create_debezium_connector(connector_config) + + tables = ["public.users", "public.temp_staging"] + result = connector._apply_table_filters(connector_config, tables) + + assert result == ["public.users"] + + +class TestDeriveTopicsFromTables: + """Tests for _derive_topics_from_tables method.""" + + def test_derive_topics_with_server_name(self) -> None: + """Should derive topics in format {server_name}.{schema.table}.""" + connector_config = { + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.server.name": "myserver", + "database.dbname": "testdb", + } + + connector = create_debezium_connector(connector_config) + + table_names = ["public.users", "public.orders", "analytics.events"] + result = connector._derive_topics_from_tables(table_names, "myserver") + + assert result == [ + "myserver.public.users", + "myserver.public.orders", + "myserver.analytics.events", + ] + + def test_derive_topics_without_server_name(self) -> None: + """Should use table name directly when no server name.""" + connector_config = { + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.dbname": "testdb", + } + + connector = create_debezium_connector(connector_config) + + table_names = ["public.users", "public.orders"] + result = connector._derive_topics_from_tables(table_names, None) + + assert result == ["public.users", "public.orders"] + + def 
test_derive_topics_preserves_schema_table_format(self) -> None: + """Should preserve schema.table format in topic names.""" + connector_config = { + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.server.name": "myserver", + "database.dbname": "testdb", + } + + connector = create_debezium_connector(connector_config) + + table_names = ["schema1.table1", "schema2.table2"] + result = connector._derive_topics_from_tables(table_names, "myserver") + + assert result == ["myserver.schema1.table1", "myserver.schema2.table2"] + + +class TestGetTopicsFromConfigIntegration: + """Integration tests for the full get_topics_from_config flow.""" + + def test_full_flow_without_schema_resolver(self) -> None: + """Test complete flow using only table.include.list.""" + connector_config = { + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.server.name": "myserver", + "database.dbname": "testdb", + "table.include.list": "public.users,public.orders", + } + + connector = create_debezium_connector( + connector_config, use_schema_resolver=False + ) + + result = connector.get_topics_from_config() + + assert result == ["myserver.public.users", "myserver.public.orders"] + + def test_full_flow_with_schema_and_table_filters(self) -> None: + """Test complete flow with both schema and table filters.""" + connector_config = { + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.server.name": "myserver", + "database.dbname": "testdb", + "schema.include.list": "public", + "table.include.list": "public.*", + } + + schema_resolver = Mock(spec=SchemaResolver) + connector = create_debezium_connector( + connector_config, schema_resolver=schema_resolver, use_schema_resolver=True + ) + + # Mock table discovery to return multiple schemas + with patch.object( + connector, + "_discover_tables_from_database", + return_value=[ + "public.users", + "public.orders", + "private.data", + "analytics.events", + ], + ): + result = connector.get_topics_from_config() + + # Should only include public schema tables + assert result == ["myserver.public.users", "myserver.public.orders"] + + def test_full_flow_with_exclude_filters(self) -> None: + """Test complete flow with exclude filters.""" + connector_config = { + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.server.name": "myserver", + "database.dbname": "testdb", + "schema.exclude.list": "temp.*", + "table.exclude.list": "public.test.*", + } + + schema_resolver = Mock(spec=SchemaResolver) + connector = create_debezium_connector( + connector_config, schema_resolver=schema_resolver, use_schema_resolver=True + ) + + # Mock table discovery + with patch.object( + connector, + "_discover_tables_from_database", + return_value=[ + "public.users", + "public.test_data", + "temp_staging.data", + "analytics.events", + ], + ): + result = connector.get_topics_from_config() + + # Should exclude temp_staging schema and public.test_* tables + assert result == ["myserver.public.users", "myserver.analytics.events"] + + def test_full_flow_no_tables_found(self) -> None: + """Test flow when no tables match filters.""" + connector_config = { + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.server.name": "myserver", + "database.dbname": "testdb", + "schema.include.list": "nonexistent", + } + + schema_resolver = Mock(spec=SchemaResolver) + connector = create_debezium_connector( + connector_config, schema_resolver=schema_resolver, 
use_schema_resolver=True + ) + + with patch.object( + connector, + "_discover_tables_from_database", + return_value=["public.users", "public.orders"], + ): + result = connector.get_topics_from_config() + + assert result == [] + + def test_full_flow_handles_errors_gracefully(self) -> None: + """Test that errors in discovery are handled gracefully.""" + connector_config = { + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.server.name": "myserver", + "database.dbname": "testdb", + } + + schema_resolver = Mock(spec=SchemaResolver) + connector = create_debezium_connector( + connector_config, schema_resolver=schema_resolver, use_schema_resolver=True + ) + + # Mock discovery to raise exception + with patch.object( + connector, + "_discover_tables_from_database", + side_effect=Exception("Connection error"), + ): + result = connector.get_topics_from_config() + + # Should return empty list instead of crashing + assert result == [] diff --git a/metadata-ingestion/tests/unit/test_kafka_connect_pattern_matchers.py b/metadata-ingestion/tests/unit/test_kafka_connect_pattern_matchers.py new file mode 100644 index 00000000000000..b30605dcd7fb03 --- /dev/null +++ b/metadata-ingestion/tests/unit/test_kafka_connect_pattern_matchers.py @@ -0,0 +1,262 @@ +"""Tests for Kafka Connect pattern matching utilities.""" + +from unittest.mock import Mock + +import jpype +import jpype.imports +import pytest + +from datahub.ingestion.source.kafka_connect.common import ( + ConnectorManifest, + KafkaConnectSourceConfig, + KafkaConnectSourceReport, +) +from datahub.ingestion.source.kafka_connect.pattern_matchers import ( + JavaRegexMatcher, + WildcardMatcher, +) +from datahub.ingestion.source.kafka_connect.source_connectors import ( + DebeziumSourceConnector, +) + + +@pytest.fixture(scope="session", autouse=True) +def ensure_jvm_started(): + """Ensure JVM is started for all tests requiring Java regex.""" + if not jpype.isJVMStarted(): + jpype.startJVM(jpype.getDefaultJVMPath()) + yield + + +class TestJavaRegexMatcher: + """Tests for JavaRegexMatcher with Java regex syntax.""" + + def test_simple_pattern_match(self) -> None: + matcher = JavaRegexMatcher() + + assert matcher.matches("public\\.users", "public.users") + assert not matcher.matches("public\\.users", "public.orders") + + def test_wildcard_pattern(self) -> None: + matcher = JavaRegexMatcher() + + assert matcher.matches("public\\..*", "public.users") + assert matcher.matches("public\\..*", "public.orders") + assert not matcher.matches("public\\..*", "private.users") + + def test_alternation_pattern(self) -> None: + matcher = JavaRegexMatcher() + + assert matcher.matches("public\\.(users|orders)", "public.users") + assert matcher.matches("public\\.(users|orders)", "public.orders") + assert not matcher.matches("public\\.(users|orders)", "public.products") + + def test_character_class_pattern(self) -> None: + matcher = JavaRegexMatcher() + + assert matcher.matches("schema_v[0-9]+\\.orders", "schema_v1.orders") + assert matcher.matches("schema_v[0-9]+\\.orders", "schema_v2.orders") + assert not matcher.matches("schema_v[0-9]+\\.orders", "schema_vX.orders") + + def test_filter_matches_basic(self) -> None: + matcher = JavaRegexMatcher() + + tables = ["public.users", "public.orders", "private.users", "public.products"] + patterns = ["public\\.users", "public\\.orders"] + + result = matcher.filter_matches(patterns, tables) + + assert "public.users" in result + assert "public.orders" in result + assert "private.users" not in result + 
assert "public.products" not in result + + def test_filter_matches_with_wildcard(self) -> None: + matcher = JavaRegexMatcher() + + tables = [ + "public.users", + "public.orders", + "private.users", + "schema_v1.orders", + "schema_v2.orders", + ] + patterns = ["public\\..*", "schema_.*\\.orders"] + + result = matcher.filter_matches(patterns, tables) + + assert "public.users" in result + assert "public.orders" in result + assert "schema_v1.orders" in result + assert "schema_v2.orders" in result + assert "private.users" not in result + + def test_filter_matches_no_duplicates(self) -> None: + matcher = JavaRegexMatcher() + + tables = ["public.users", "public.orders"] + patterns = ["public\\.users", "public\\..*"] + + result = matcher.filter_matches(patterns, tables) + + assert result.count("public.users") == 1 + + def test_invalid_pattern_returns_false(self) -> None: + matcher = JavaRegexMatcher() + + assert not matcher.matches("[invalid(regex", "public.users") + + def test_filter_matches_with_invalid_pattern(self) -> None: + matcher = JavaRegexMatcher() + + tables = ["public.users", "public.orders"] + patterns = ["[invalid(regex", "public\\.orders"] + + result = matcher.filter_matches(patterns, tables) + + assert "public.orders" in result + assert "public.users" not in result + + +class TestWildcardMatcher: + """Tests for WildcardMatcher with simple wildcard syntax.""" + + def test_exact_match(self) -> None: + matcher = WildcardMatcher() + + assert matcher.matches("ANALYTICS.PUBLIC.USERS", "ANALYTICS.PUBLIC.USERS") + assert not matcher.matches("ANALYTICS.PUBLIC.USERS", "ANALYTICS.PUBLIC.ORDERS") + + def test_star_wildcard(self) -> None: + matcher = WildcardMatcher() + + assert matcher.matches("ANALYTICS.PUBLIC.*", "ANALYTICS.PUBLIC.USERS") + assert matcher.matches("ANALYTICS.PUBLIC.*", "ANALYTICS.PUBLIC.ORDERS") + assert not matcher.matches("ANALYTICS.PUBLIC.*", "ANALYTICS.PRIVATE.USERS") + + def test_question_mark_wildcard(self) -> None: + matcher = WildcardMatcher() + + assert matcher.matches("DB.SCHEMA.USER?", "DB.SCHEMA.USER1") + assert matcher.matches("DB.SCHEMA.USER?", "DB.SCHEMA.USERS") + assert not matcher.matches("DB.SCHEMA.USER?", "DB.SCHEMA.USER12") + + def test_mixed_wildcards(self) -> None: + matcher = WildcardMatcher() + + assert matcher.matches("*.PUBLIC.TABLE?", "DB1.PUBLIC.TABLE1") + assert matcher.matches("*.PUBLIC.TABLE?", "DB2.PUBLIC.TABLEX") + assert not matcher.matches("*.PUBLIC.TABLE?", "DB1.PRIVATE.TABLE1") + + def test_filter_matches_basic(self) -> None: + matcher = WildcardMatcher() + + tables = [ + "ANALYTICS.PUBLIC.USERS", + "ANALYTICS.PUBLIC.ORDERS", + "ANALYTICS.PRIVATE.USERS", + ] + patterns = ["ANALYTICS.PUBLIC.USERS", "ANALYTICS.PUBLIC.ORDERS"] + + result = matcher.filter_matches(patterns, tables) + + assert "ANALYTICS.PUBLIC.USERS" in result + assert "ANALYTICS.PUBLIC.ORDERS" in result + assert "ANALYTICS.PRIVATE.USERS" not in result + + def test_filter_matches_with_wildcards(self) -> None: + matcher = WildcardMatcher() + + tables = [ + "ANALYTICS.PUBLIC.USERS", + "ANALYTICS.PUBLIC.ORDERS", + "ANALYTICS.PRIVATE.USERS", + "SALES.PUBLIC.CUSTOMERS", + ] + patterns = ["ANALYTICS.PUBLIC.*", "*.PUBLIC.CUSTOMERS"] + + result = matcher.filter_matches(patterns, tables) + + assert "ANALYTICS.PUBLIC.USERS" in result + assert "ANALYTICS.PUBLIC.ORDERS" in result + assert "SALES.PUBLIC.CUSTOMERS" in result + assert "ANALYTICS.PRIVATE.USERS" not in result + + def test_filter_matches_no_duplicates(self) -> None: + matcher = WildcardMatcher() + + tables = 
["ANALYTICS.PUBLIC.USERS", "ANALYTICS.PUBLIC.ORDERS"] + patterns = ["ANALYTICS.PUBLIC.USERS", "ANALYTICS.PUBLIC.*"] + + result = matcher.filter_matches(patterns, tables) + + assert result.count("ANALYTICS.PUBLIC.USERS") == 1 + + def test_case_sensitivity(self) -> None: + matcher = WildcardMatcher() + + assert matcher.matches("analytics.public.users", "analytics.public.users") + assert not matcher.matches("ANALYTICS.PUBLIC.USERS", "analytics.public.users") + + +class TestDebeziumSourceConnectorPatternMatching: + """Tests for DebeziumSourceConnector pattern matcher integration.""" + + def test_get_pattern_matcher_returns_java_regex_matcher(self) -> None: + connector_config = { + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.server.name": "myserver", + "database.dbname": "testdb", + } + + manifest = ConnectorManifest( + name="test-connector", + type="source", + config=connector_config, + tasks=[], + topic_names=[], + ) + + config = Mock(spec=KafkaConnectSourceConfig) + report = Mock(spec=KafkaConnectSourceReport) + connector = DebeziumSourceConnector(manifest, config, report) + + matcher = connector.get_pattern_matcher() + + assert isinstance(matcher, JavaRegexMatcher) + + def test_filter_tables_by_patterns_uses_java_regex(self) -> None: + connector_config = { + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.server.name": "myserver", + "database.dbname": "testdb", + } + + manifest = ConnectorManifest( + name="test-connector", + type="source", + config=connector_config, + tasks=[], + topic_names=[], + ) + + config = Mock(spec=KafkaConnectSourceConfig) + report = Mock(spec=KafkaConnectSourceReport) + connector = DebeziumSourceConnector(manifest, config, report) + + tables = [ + "public.users", + "public.orders", + "schema_v1.products", + "schema_v2.products", + "private.data", + ] + patterns = ["public\\..*", "schema_v[0-9]+\\.products"] + + result = connector._filter_tables_by_patterns(tables, patterns) + + assert "public.users" in result + assert "public.orders" in result + assert "schema_v1.products" in result + assert "schema_v2.products" in result + assert "private.data" not in result diff --git a/metadata-ingestion/tests/unit/test_kafka_connect_replace_field_transform.py b/metadata-ingestion/tests/unit/test_kafka_connect_replace_field_transform.py new file mode 100644 index 00000000000000..8b4e2863658a9c --- /dev/null +++ b/metadata-ingestion/tests/unit/test_kafka_connect_replace_field_transform.py @@ -0,0 +1,489 @@ +""" +Tests for ReplaceField SMT (Single Message Transform) support in Kafka Connect lineage extraction. 
+ +This module tests the implementation of ReplaceField transformations that can: +- Filter fields using include/exclude +- Rename fields using from:to format +- Apply multiple transformations in sequence + +Reference: https://docs.confluent.io/platform/current/connect/transforms/replacefield.html +""" + +import pytest + +from datahub.ingestion.source.kafka_connect.common import ( + BaseConnector, + ConnectorManifest, + KafkaConnectSourceConfig, + KafkaConnectSourceReport, +) + + +@pytest.fixture +def config(): + """Create test configuration.""" + return KafkaConnectSourceConfig( + connect_uri="http://localhost:8083", cluster_name="test" + ) + + +@pytest.fixture +def report(): + """Create test report.""" + return KafkaConnectSourceReport() + + +def test_no_transforms(): + """Test that columns pass through unchanged when no transforms are configured.""" + manifest = ConnectorManifest( + name="test-connector", + type="source", + config={ + "connector.class": "TestConnector", + }, + tasks=[], + ) + + config = KafkaConnectSourceConfig( + connect_uri="http://localhost:8083", cluster_name="test" + ) + report = KafkaConnectSourceReport() + connector = BaseConnector(manifest, config, report) + + source_columns = ["id", "name", "email", "created_at"] + column_mapping = connector._apply_replace_field_transform(source_columns) + + # All columns should map 1:1 when no transforms present + assert column_mapping == { + "id": "id", + "name": "name", + "email": "email", + "created_at": "created_at", + } + + +def test_exclude_single_field(): + """Test excluding a single field from the output.""" + manifest = ConnectorManifest( + name="test-connector", + type="source", + config={ + "connector.class": "TestConnector", + "transforms": "dropSensitive", + "transforms.dropSensitive.type": "org.apache.kafka.connect.transforms.ReplaceField$Value", + "transforms.dropSensitive.exclude": "password", + }, + tasks=[], + ) + + config = KafkaConnectSourceConfig( + connect_uri="http://localhost:8083", cluster_name="test" + ) + report = KafkaConnectSourceReport() + connector = BaseConnector(manifest, config, report) + + source_columns = ["id", "username", "password", "email"] + column_mapping = connector._apply_replace_field_transform(source_columns) + + assert column_mapping == { + "id": "id", + "username": "username", + "password": None, # Excluded field mapped to None + "email": "email", + } + + +def test_exclude_multiple_fields(): + """Test excluding multiple fields from the output.""" + manifest = ConnectorManifest( + name="test-connector", + type="source", + config={ + "connector.class": "TestConnector", + "transforms": "dropSensitive", + "transforms.dropSensitive.type": "org.apache.kafka.connect.transforms.ReplaceField$Value", + "transforms.dropSensitive.exclude": "password,ssn,credit_card", + }, + tasks=[], + ) + + config = KafkaConnectSourceConfig( + connect_uri="http://localhost:8083", cluster_name="test" + ) + report = KafkaConnectSourceReport() + connector = BaseConnector(manifest, config, report) + + source_columns = ["id", "name", "password", "ssn", "email", "credit_card"] + column_mapping = connector._apply_replace_field_transform(source_columns) + + assert column_mapping == { + "id": "id", + "name": "name", + "password": None, + "ssn": None, + "email": "email", + "credit_card": None, + } + + +def test_include_only_specified_fields(): + """Test keeping only specified fields (all others dropped).""" + manifest = ConnectorManifest( + name="test-connector", + type="source", + config={ + "connector.class": 
"TestConnector", + "transforms": "keepOnly", + "transforms.keepOnly.type": "org.apache.kafka.connect.transforms.ReplaceField$Value", + "transforms.keepOnly.include": "id,name,email", + }, + tasks=[], + ) + + config = KafkaConnectSourceConfig( + connect_uri="http://localhost:8083", cluster_name="test" + ) + report = KafkaConnectSourceReport() + connector = BaseConnector(manifest, config, report) + + source_columns = ["id", "name", "email", "password", "ssn", "internal_notes"] + column_mapping = connector._apply_replace_field_transform(source_columns) + + assert column_mapping == { + "id": "id", + "name": "name", + "email": "email", + "password": None, # Not in include list + "ssn": None, # Not in include list + "internal_notes": None, # Not in include list + } + + +def test_rename_single_field(): + """Test renaming a single field.""" + manifest = ConnectorManifest( + name="test-connector", + type="source", + config={ + "connector.class": "TestConnector", + "transforms": "renameField", + "transforms.renameField.type": "org.apache.kafka.connect.transforms.ReplaceField$Value", + "transforms.renameField.renames": "user_id:id", + }, + tasks=[], + ) + + config = KafkaConnectSourceConfig( + connect_uri="http://localhost:8083", cluster_name="test" + ) + report = KafkaConnectSourceReport() + connector = BaseConnector(manifest, config, report) + + source_columns = ["user_id", "name", "email"] + column_mapping = connector._apply_replace_field_transform(source_columns) + + assert column_mapping == { + "user_id": "id", # Renamed + "name": "name", + "email": "email", + } + + +def test_rename_multiple_fields(): + """Test renaming multiple fields.""" + manifest = ConnectorManifest( + name="test-connector", + type="source", + config={ + "connector.class": "TestConnector", + "transforms": "renameFields", + "transforms.renameFields.type": "org.apache.kafka.connect.transforms.ReplaceField$Value", + "transforms.renameFields.renames": "user_id:id,user_name:name,user_email:email", + }, + tasks=[], + ) + + config = KafkaConnectSourceConfig( + connect_uri="http://localhost:8083", cluster_name="test" + ) + report = KafkaConnectSourceReport() + connector = BaseConnector(manifest, config, report) + + source_columns = ["user_id", "user_name", "user_email", "created_at"] + column_mapping = connector._apply_replace_field_transform(source_columns) + + assert column_mapping == { + "user_id": "id", + "user_name": "name", + "user_email": "email", + "created_at": "created_at", + } + + +def test_exclude_and_rename_combined(): + """Test combining exclude and rename operations.""" + manifest = ConnectorManifest( + name="test-connector", + type="source", + config={ + "connector.class": "TestConnector", + "transforms": "cleanupFields", + "transforms.cleanupFields.type": "org.apache.kafka.connect.transforms.ReplaceField$Value", + "transforms.cleanupFields.exclude": "password,ssn", + "transforms.cleanupFields.renames": "user_id:id,user_name:name", + }, + tasks=[], + ) + + config = KafkaConnectSourceConfig( + connect_uri="http://localhost:8083", cluster_name="test" + ) + report = KafkaConnectSourceReport() + connector = BaseConnector(manifest, config, report) + + source_columns = ["user_id", "user_name", "email", "password", "ssn"] + column_mapping = connector._apply_replace_field_transform(source_columns) + + assert column_mapping == { + "user_id": "id", # Renamed + "user_name": "name", # Renamed + "email": "email", # Unchanged + "password": None, # Excluded + "ssn": None, # Excluded + } + + +def 
test_multiple_transforms_in_sequence(): + """Test applying multiple ReplaceField transforms in sequence.""" + manifest = ConnectorManifest( + name="test-connector", + type="source", + config={ + "connector.class": "TestConnector", + "transforms": "first,second", + "transforms.first.type": "org.apache.kafka.connect.transforms.ReplaceField$Value", + "transforms.first.renames": "old_id:user_id", + "transforms.second.type": "org.apache.kafka.connect.transforms.ReplaceField$Value", + "transforms.second.renames": "user_id:id", + }, + tasks=[], + ) + + config = KafkaConnectSourceConfig( + connect_uri="http://localhost:8083", cluster_name="test" + ) + report = KafkaConnectSourceReport() + connector = BaseConnector(manifest, config, report) + + source_columns = ["old_id", "name"] + column_mapping = connector._apply_replace_field_transform(source_columns) + + # First transform: old_id -> user_id + # Second transform: user_id -> id + assert column_mapping == { + "old_id": "id", # Chained renames applied + "name": "name", + } + + +def test_non_replacefield_transforms_ignored(): + """Test that non-ReplaceField transforms are ignored.""" + manifest = ConnectorManifest( + name="test-connector", + type="source", + config={ + "connector.class": "TestConnector", + "transforms": "mask,rename", + "transforms.mask.type": "org.apache.kafka.connect.transforms.MaskField$Value", + "transforms.mask.fields": "password", + "transforms.rename.type": "org.apache.kafka.connect.transforms.ReplaceField$Value", + "transforms.rename.renames": "user_id:id", + }, + tasks=[], + ) + + config = KafkaConnectSourceConfig( + connect_uri="http://localhost:8083", cluster_name="test" + ) + report = KafkaConnectSourceReport() + connector = BaseConnector(manifest, config, report) + + source_columns = ["user_id", "name", "password"] + column_mapping = connector._apply_replace_field_transform(source_columns) + + # Only the ReplaceField transform should be applied + assert column_mapping == { + "user_id": "id", + "name": "name", + "password": "password", # MaskField ignored, password unchanged + } + + +def test_replacefield_key_transform_ignored(): + """Test that ReplaceField$Key transforms are ignored (we only support $Value).""" + manifest = ConnectorManifest( + name="test-connector", + type="source", + config={ + "connector.class": "TestConnector", + "transforms": "keyTransform,valueTransform", + "transforms.keyTransform.type": "org.apache.kafka.connect.transforms.ReplaceField$Key", + "transforms.keyTransform.exclude": "internal_key", + "transforms.valueTransform.type": "org.apache.kafka.connect.transforms.ReplaceField$Value", + "transforms.valueTransform.exclude": "password", + }, + tasks=[], + ) + + config = KafkaConnectSourceConfig( + connect_uri="http://localhost:8083", cluster_name="test" + ) + report = KafkaConnectSourceReport() + connector = BaseConnector(manifest, config, report) + + source_columns = ["id", "password", "internal_key"] + column_mapping = connector._apply_replace_field_transform(source_columns) + + # Only Value transform applied, Key transform ignored + assert column_mapping == { + "id": "id", + "password": None, # Excluded by Value transform + "internal_key": "internal_key", # Key transform ignored + } + + +def test_empty_transform_config(): + """Test handling of empty transform configurations.""" + manifest = ConnectorManifest( + name="test-connector", + type="source", + config={ + "connector.class": "TestConnector", + "transforms": "empty", + "transforms.empty.type": 
"org.apache.kafka.connect.transforms.ReplaceField$Value", + # No include, exclude, or renames specified + }, + tasks=[], + ) + + config = KafkaConnectSourceConfig( + connect_uri="http://localhost:8083", cluster_name="test" + ) + report = KafkaConnectSourceReport() + connector = BaseConnector(manifest, config, report) + + source_columns = ["id", "name", "email"] + column_mapping = connector._apply_replace_field_transform(source_columns) + + # Empty config should result in 1:1 mapping + assert column_mapping == { + "id": "id", + "name": "name", + "email": "email", + } + + +def test_whitespace_in_field_names(): + """Test that whitespace in configuration is handled correctly.""" + manifest = ConnectorManifest( + name="test-connector", + type="source", + config={ + "connector.class": "TestConnector", + "transforms": "cleanup", + "transforms.cleanup.type": "org.apache.kafka.connect.transforms.ReplaceField$Value", + "transforms.cleanup.exclude": " password , ssn ", # Extra whitespace + "transforms.cleanup.renames": " user_id : id , user_name : name ", # Extra whitespace + }, + tasks=[], + ) + + config = KafkaConnectSourceConfig( + connect_uri="http://localhost:8083", cluster_name="test" + ) + report = KafkaConnectSourceReport() + connector = BaseConnector(manifest, config, report) + + source_columns = ["user_id", "user_name", "password", "ssn", "email"] + column_mapping = connector._apply_replace_field_transform(source_columns) + + # Whitespace should be trimmed + assert column_mapping == { + "user_id": "id", + "user_name": "name", + "password": None, + "ssn": None, + "email": "email", + } + + +def test_integration_with_fine_grained_lineage(): + """ + Test that ReplaceField transforms are properly integrated with fine-grained lineage extraction. + + This is an integration test that verifies the transform is applied when extracting column-level lineage. 
+ """ + from unittest.mock import Mock + + from datahub.emitter.mce_builder import schema_field_urn_to_key + + manifest = ConnectorManifest( + name="test-connector", + type="source", + config={ + "connector.class": "TestConnector", + "transforms": "cleanup", + "transforms.cleanup.type": "org.apache.kafka.connect.transforms.ReplaceField$Value", + "transforms.cleanup.exclude": "password", + "transforms.cleanup.renames": "user_id:id", + }, + tasks=[], + ) + + config = KafkaConnectSourceConfig( + connect_uri="http://localhost:8083", + cluster_name="test", + use_schema_resolver=True, + schema_resolver_finegrained_lineage=True, + ) + report = KafkaConnectSourceReport() + connector = BaseConnector(manifest, config, report) + + # Mock schema resolver + mock_resolver = Mock() + mock_resolver.resolve_table.return_value = ( + "urn:li:dataset:(urn:li:dataPlatform:postgres,public.users,PROD)", + { # SchemaInfo is Dict[str, str] mapping column names to types + "user_id": "INT", + "name": "VARCHAR", + "email": "VARCHAR", + "password": "VARCHAR", + }, + ) + connector.schema_resolver = mock_resolver + + # Extract fine-grained lineage + lineages = connector._extract_fine_grained_lineage( + source_dataset="public.users", + source_platform="postgres", + target_dataset="users_topic", + target_platform="kafka", + ) + + # Verify lineages were generated + assert lineages is not None + assert len(lineages) == 3 # 4 columns - 1 excluded (password) = 3 + + # Verify password was excluded and fields were renamed + downstream_fields = [] + for lineage in lineages: + for downstream_urn in lineage["downstreams"]: + # Use proper URN parser to extract field path + key = schema_field_urn_to_key(downstream_urn) + if key: + downstream_fields.append(key.fieldPath) + + assert "password" not in downstream_fields + assert "id" in downstream_fields # user_id renamed to id + assert "name" in downstream_fields + assert "email" in downstream_fields diff --git a/metadata-ingestion/tests/unit/test_kafka_connect_schema_resolver.py b/metadata-ingestion/tests/unit/test_kafka_connect_schema_resolver.py new file mode 100644 index 00000000000000..c1d7d51141a6e3 --- /dev/null +++ b/metadata-ingestion/tests/unit/test_kafka_connect_schema_resolver.py @@ -0,0 +1,1014 @@ +"""Tests for Kafka Connect schema resolver integration.""" + +import logging +from typing import Any, Dict, List, Optional, Tuple + +from datahub.ingestion.source.kafka_connect.common import ( + ConnectorManifest, + KafkaConnectSourceConfig, + KafkaConnectSourceReport, +) +from datahub.ingestion.source.kafka_connect.source_connectors import ( + DebeziumSourceConnector, +) +from datahub.sql_parsing.schema_resolver import SchemaResolverInterface + +logger = logging.getLogger(__name__) + + +class MockSchemaResolver(SchemaResolverInterface): + """Mock SchemaResolver for testing.""" + + def __init__(self, platform: str, mock_urns: Optional[List[str]] = None): + self._platform = platform + self._mock_urns = set(mock_urns or []) + self._schemas: Dict[str, Dict[str, str]] = {} + # Additional attributes that production code may access + self.graph = None + self.env = "PROD" + + @property + def platform(self) -> str: + """Return the platform.""" + return self._platform + + def includes_temp_tables(self) -> bool: + """Return whether temp tables are included.""" + return False + + def get_urns(self): + """Return mock URNs.""" + return self._mock_urns + + def resolve_table(self, table: Any) -> Tuple[str, Optional[Dict[str, str]]]: + """Mock table resolution.""" + table_name = table.table + 
urn = f"urn:li:dataset:(urn:li:dataPlatform:{self.platform},{table_name},PROD)" + schema = self._schemas.get(table_name) + return urn, schema + + def get_urn_for_table(self, table: Any) -> str: + """Mock URN generation for table.""" + table_name = table.table + return f"urn:li:dataset:(urn:li:dataPlatform:{self.platform},{table_name},PROD)" + + def add_schema(self, table_name: str, schema: Dict[str, str]) -> None: + """Add a schema for testing.""" + self._schemas[table_name] = schema + + +class TestSchemaResolverTableExpansion: + """Tests for table pattern expansion using SchemaResolver.""" + + def test_pattern_expansion_disabled_by_default(self): + """Test that pattern expansion is disabled by default.""" + config = KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + ) + report = KafkaConnectSourceReport() + + connector_manifest = ConnectorManifest( + name="postgres-source", + type="source", + config={ + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.dbname": "testdb", + "table.include.list": "public.*", + "database.server.name": "testserver", + }, + tasks=[], + ) + + connector = DebeziumSourceConnector( + connector_manifest=connector_manifest, + config=config, + report=report, + schema_resolver=None, + ) + + # Pattern should not be expanded + result = connector._expand_table_patterns("public.*", "postgres", "testdb") + + # Should return the pattern as-is (not expanded) + assert result == ["public.*"] + + def test_pattern_expansion_with_wildcard(self): + """Test expanding pattern with wildcard using SchemaResolver.""" + config = KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + use_schema_resolver=True, + schema_resolver_expand_patterns=True, + ) + report = KafkaConnectSourceReport() + + connector_manifest = ConnectorManifest( + name="postgres-source", + type="source", + config={ + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.dbname": "testdb", + "table.include.list": "public.*", + "database.server.name": "testserver", + }, + tasks=[], + ) + + # Create mock schema resolver with matching URNs + mock_resolver = MockSchemaResolver( + platform="postgres", + mock_urns=[ + "urn:li:dataset:(urn:li:dataPlatform:postgres,testdb.public.users,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:postgres,testdb.public.orders,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:postgres,testdb.private.secrets,PROD)", + ], + ) + + connector = DebeziumSourceConnector( + connector_manifest=connector_manifest, + config=config, + report=report, + schema_resolver=mock_resolver, # type: ignore[arg-type] + ) + + # Pattern should be expanded + result = connector._expand_table_patterns("public.*", "postgres", "testdb") + + # Should match only public schema tables + assert len(result) == 2 + assert "testdb.public.users" in result + assert "testdb.public.orders" in result + assert "testdb.private.secrets" not in result + + def test_pattern_expansion_no_matches(self): + """Test pattern expansion when no tables match.""" + config = KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + use_schema_resolver=True, + schema_resolver_expand_patterns=True, + ) + report = KafkaConnectSourceReport() + + connector_manifest = ConnectorManifest( + name="postgres-source", + type="source", + config={ + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.dbname": "testdb", + "table.include.list": "nonexistent.*", + 
"database.server.name": "testserver", + }, + tasks=[], + ) + + # Create mock schema resolver with URNs that don't match the pattern + mock_resolver = MockSchemaResolver( + platform="postgres", + mock_urns=[ + "urn:li:dataset:(urn:li:dataPlatform:postgres,testdb.public.users,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:postgres,testdb.public.orders,PROD)", + ], + ) + + connector = DebeziumSourceConnector( + connector_manifest=connector_manifest, + config=config, + report=report, + schema_resolver=mock_resolver, # type: ignore[arg-type] + ) + + # Pattern should not match anything, return as-is + result = connector._expand_table_patterns("nonexistent.*", "postgres", "testdb") + + # Should keep the pattern as-is since no matches found + assert result == ["nonexistent.*"] + + def test_pattern_expansion_mixed_patterns_and_explicit(self): + """Test expanding a mix of patterns and explicit table names.""" + config = KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + use_schema_resolver=True, + schema_resolver_expand_patterns=True, + ) + report = KafkaConnectSourceReport() + + connector_manifest = ConnectorManifest( + name="postgres-source", + type="source", + config={ + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.dbname": "testdb", + "table.include.list": "public.*,private.accounts", + "database.server.name": "testserver", + }, + tasks=[], + ) + + # Create mock schema resolver + mock_resolver = MockSchemaResolver( + platform="postgres", + mock_urns=[ + "urn:li:dataset:(urn:li:dataPlatform:postgres,testdb.public.users,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:postgres,testdb.public.orders,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:postgres,testdb.private.accounts,PROD)", + ], + ) + + connector = DebeziumSourceConnector( + connector_manifest=connector_manifest, + config=config, + report=report, + schema_resolver=mock_resolver, # type: ignore[arg-type] + ) + + # Mixed pattern and explicit should work + result = connector._expand_table_patterns( + "public.*,private.accounts", "postgres", "testdb" + ) + + # Should expand pattern and keep explicit table name + assert len(result) == 3 + assert "testdb.public.users" in result + assert "testdb.public.orders" in result + assert "private.accounts" in result + + def test_pattern_expansion_disabled_via_config(self): + """Test that pattern expansion can be disabled via config.""" + config = KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + use_schema_resolver=True, + schema_resolver_expand_patterns=False, # Disabled + ) + report = KafkaConnectSourceReport() + + connector_manifest = ConnectorManifest( + name="postgres-source", + type="source", + config={ + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.dbname": "testdb", + "table.include.list": "public.*", + "database.server.name": "testserver", + }, + tasks=[], + ) + + mock_resolver = MockSchemaResolver( + platform="postgres", + mock_urns=[ + "urn:li:dataset:(urn:li:dataPlatform:postgres,testdb.public.users,PROD)", + ], + ) + + connector = DebeziumSourceConnector( + connector_manifest=connector_manifest, + config=config, + report=report, + schema_resolver=mock_resolver, # type: ignore[arg-type] + ) + + # Pattern should not be expanded even though schema resolver is available + result = connector._expand_table_patterns("public.*", "postgres", "testdb") + + assert result == ["public.*"] + + +class TestSchemaResolverFineGrainedLineage: + """Tests for 
fine-grained lineage extraction using SchemaResolver.""" + + def test_fine_grained_lineage_disabled_by_default(self): + """Test that fine-grained lineage is disabled by default.""" + config = KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + ) + report = KafkaConnectSourceReport() + + connector_manifest = ConnectorManifest( + name="postgres-source", + type="source", + config={ + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.dbname": "testdb", + "table.include.list": "public.users", + "database.server.name": "testserver", + }, + tasks=[], + ) + + connector = DebeziumSourceConnector( + connector_manifest=connector_manifest, + config=config, + report=report, + schema_resolver=None, + ) + + # Should return None (feature disabled) + result = connector._extract_fine_grained_lineage( + "testdb.public.users", + "postgres", + "testserver.public.users", + "kafka", + ) + + assert result is None + + def test_fine_grained_lineage_generation(self): + """Test generating fine-grained column-level lineage.""" + config = KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + use_schema_resolver=True, + schema_resolver_finegrained_lineage=True, + ) + report = KafkaConnectSourceReport() + + connector_manifest = ConnectorManifest( + name="postgres-source", + type="source", + config={ + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.dbname": "testdb", + "table.include.list": "public.users", + "database.server.name": "testserver", + }, + tasks=[], + ) + + # Create mock schema resolver with schema metadata + mock_resolver = MockSchemaResolver(platform="postgres") + mock_resolver.add_schema( + "testdb.public.users", + { + "id": "INTEGER", + "username": "VARCHAR", + "email": "VARCHAR", + "created_at": "TIMESTAMP", + }, + ) + + connector = DebeziumSourceConnector( + connector_manifest=connector_manifest, + config=config, + report=report, + schema_resolver=mock_resolver, # type: ignore[arg-type] + ) + + # Should generate fine-grained lineage + result = connector._extract_fine_grained_lineage( + "testdb.public.users", + "postgres", + "testserver.public.users", + "kafka", + ) + + # Should have 4 column lineages (one for each column) + assert result is not None + assert len(result) == 4 + + # Check structure of first lineage + first_lineage = result[0] + assert first_lineage["upstreamType"] == "FIELD_SET" + assert first_lineage["downstreamType"] == "FIELD" + assert len(first_lineage["upstreams"]) == 1 + assert len(first_lineage["downstreams"]) == 1 + + # Verify all columns are present + columns = [] + for lineage in result: + # Extract column name from URN + upstream_urn = lineage["upstreams"][0] + column_name = upstream_urn.split(",")[-1].rstrip(")") + columns.append(column_name) + + assert "id" in columns + assert "username" in columns + assert "email" in columns + assert "created_at" in columns + + def test_fine_grained_lineage_no_schema_metadata(self): + """Test fine-grained lineage when schema metadata is unavailable.""" + config = KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + use_schema_resolver=True, + schema_resolver_finegrained_lineage=True, + ) + report = KafkaConnectSourceReport() + + connector_manifest = ConnectorManifest( + name="postgres-source", + type="source", + config={ + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.dbname": "testdb", + "table.include.list": "public.users", + 
"database.server.name": "testserver", + }, + tasks=[], + ) + + # Create mock schema resolver without any schema metadata + mock_resolver = MockSchemaResolver(platform="postgres") + + connector = DebeziumSourceConnector( + connector_manifest=connector_manifest, + config=config, + report=report, + schema_resolver=mock_resolver, # type: ignore[arg-type] + ) + + # Should return None when schema metadata is unavailable + result = connector._extract_fine_grained_lineage( + "testdb.public.users", + "postgres", + "testserver.public.users", + "kafka", + ) + + assert result is None + + def test_fine_grained_lineage_disabled_via_config(self): + """Test that fine-grained lineage can be disabled via config.""" + config = KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + use_schema_resolver=True, + schema_resolver_finegrained_lineage=False, # Disabled + ) + report = KafkaConnectSourceReport() + + connector_manifest = ConnectorManifest( + name="postgres-source", + type="source", + config={ + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.dbname": "testdb", + "table.include.list": "public.users", + "database.server.name": "testserver", + }, + tasks=[], + ) + + mock_resolver = MockSchemaResolver(platform="postgres") + mock_resolver.add_schema( + "testdb.public.users", + {"id": "INTEGER", "username": "VARCHAR"}, + ) + + connector = DebeziumSourceConnector( + connector_manifest=connector_manifest, + config=config, + report=report, + schema_resolver=mock_resolver, # type: ignore[arg-type] + ) + + # Should return None even though schema metadata is available + result = connector._extract_fine_grained_lineage( + "testdb.public.users", + "postgres", + "testserver.public.users", + "kafka", + ) + + assert result is None + + +class TestSchemaResolverIntegration: + """Integration tests for schema resolver with lineage extraction.""" + + def test_lineage_extraction_with_fine_grained_lineage(self): + """Test that extract_lineages includes fine-grained lineage.""" + config = KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + use_schema_resolver=True, + schema_resolver_finegrained_lineage=True, + ) + report = KafkaConnectSourceReport() + + connector_manifest = ConnectorManifest( + name="postgres-source", + type="source", + config={ + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.dbname": "testdb", + "table.include.list": "public.users", + "database.server.name": "testserver", + }, + tasks=[], + topic_names=["testserver.public.users"], + ) + + # Create mock schema resolver with schema metadata + mock_resolver = MockSchemaResolver(platform="postgres") + mock_resolver.add_schema( + "testdb.public.users", + {"id": "INTEGER", "username": "VARCHAR", "email": "VARCHAR"}, + ) + + connector = DebeziumSourceConnector( + connector_manifest=connector_manifest, + config=config, + report=report, + schema_resolver=mock_resolver, # type: ignore[arg-type] + ) + + # Extract lineages + lineages = connector.extract_lineages() + + # Should have one lineage + assert len(lineages) == 1 + + # Check that fine-grained lineage is included + lineage = lineages[0] + assert lineage.fine_grained_lineages is not None + assert len(lineage.fine_grained_lineages) == 3 # 3 columns + assert lineage.source_dataset == "testdb.public.users" + assert lineage.target_dataset == "testserver.public.users" + + def test_lineage_extraction_without_schema_resolver(self): + """Test that lineage extraction works without schema 
resolver.""" + config = KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + ) + report = KafkaConnectSourceReport() + + connector_manifest = ConnectorManifest( + name="postgres-source", + type="source", + config={ + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.dbname": "testdb", + "table.include.list": "public.users", + "database.server.name": "testserver", + }, + tasks=[], + topic_names=["testserver.public.users"], + ) + + connector = DebeziumSourceConnector( + connector_manifest=connector_manifest, + config=config, + report=report, + schema_resolver=None, # No schema resolver + ) + + # Extract lineages + lineages = connector.extract_lineages() + + # Should have one lineage without fine-grained lineage + assert len(lineages) == 1 + lineage = lineages[0] + assert lineage.fine_grained_lineages is None + assert lineage.source_dataset == "testdb.public.users" + assert lineage.target_dataset == "testserver.public.users" + + +class TestSchemaResolverEdgeCases: + """Test edge cases and error handling for schema resolver.""" + + def test_extract_table_name_from_urn_valid(self): + """Test extracting table name from valid URN.""" + config = KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + ) + report = KafkaConnectSourceReport() + + connector_manifest = ConnectorManifest( + name="test", + type="source", + config={}, + tasks=[], + ) + + connector = DebeziumSourceConnector( + connector_manifest=connector_manifest, + config=config, + report=report, + ) + + urn = "urn:li:dataset:(urn:li:dataPlatform:postgres,testdb.public.users,PROD)" + result = connector._extract_table_name_from_urn(urn) + + assert result == "testdb.public.users" + + def test_extract_table_name_from_urn_invalid(self): + """Test extracting table name from invalid URN.""" + config = KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + ) + report = KafkaConnectSourceReport() + + connector_manifest = ConnectorManifest( + name="test", + type="source", + config={}, + tasks=[], + ) + + connector = DebeziumSourceConnector( + connector_manifest=connector_manifest, + config=config, + report=report, + ) + + # Invalid URN format + urn = "invalid-urn" + result = connector._extract_table_name_from_urn(urn) + + assert result is None + + def test_pattern_expansion_empty_urns(self): + """Test pattern expansion when schema resolver has no URNs.""" + config = KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + use_schema_resolver=True, + schema_resolver_expand_patterns=True, + ) + report = KafkaConnectSourceReport() + + connector_manifest = ConnectorManifest( + name="postgres-source", + type="source", + config={ + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.dbname": "testdb", + "table.include.list": "public.*", + "database.server.name": "testserver", + }, + tasks=[], + ) + + # Create mock schema resolver with no URNs + mock_resolver = MockSchemaResolver(platform="postgres", mock_urns=[]) + + connector = DebeziumSourceConnector( + connector_manifest=connector_manifest, + config=config, + report=report, + schema_resolver=mock_resolver, # type: ignore[arg-type] + ) + + # Should return pattern as-is when no URNs available + result = connector._expand_table_patterns("public.*", "postgres", "testdb") + + assert result == ["public.*"] + + +class TestJavaRegexPatternMatching: + """Tests for Java regex pattern matching in table expansion.""" + + def 
test_alternation_pattern(self): + """Test alternation pattern: public\\.(bg|cp)_.*""" + config = KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + use_schema_resolver=True, + schema_resolver_expand_patterns=True, + ) + report = KafkaConnectSourceReport() + + connector_manifest = ConnectorManifest( + name="postgres-source", + type="source", + config={ + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.dbname": "testdb", + "table.include.list": "public\\.(bg|cp)_.*", + "database.server.name": "testserver", + }, + tasks=[], + ) + + mock_resolver = MockSchemaResolver( + platform="postgres", + mock_urns=[ + "urn:li:dataset:(urn:li:dataPlatform:postgres,testdb.public.bg_users,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:postgres,testdb.public.cp_orders,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:postgres,testdb.public.fg_data,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:postgres,testdb.public.users,PROD)", + ], + ) + + connector = DebeziumSourceConnector( + connector_manifest=connector_manifest, + config=config, + report=report, + schema_resolver=mock_resolver, # type: ignore[arg-type] + ) + + result = connector._expand_table_patterns( + "public\\.(bg|cp)_.*", "postgres", "testdb" + ) + + # Should match only tables starting with bg_ or cp_ in public schema + assert len(result) == 2 + assert "testdb.public.bg_users" in result + assert "testdb.public.cp_orders" in result + assert "testdb.public.fg_data" not in result + assert "testdb.public.users" not in result + + def test_character_class_pattern(self): + """Test character class pattern: public\\.test[0-9]+""" + config = KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + use_schema_resolver=True, + schema_resolver_expand_patterns=True, + ) + report = KafkaConnectSourceReport() + + connector_manifest = ConnectorManifest( + name="postgres-source", + type="source", + config={ + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.dbname": "testdb", + "table.include.list": "public\\.test[0-9]+", + "database.server.name": "testserver", + }, + tasks=[], + ) + + mock_resolver = MockSchemaResolver( + platform="postgres", + mock_urns=[ + "urn:li:dataset:(urn:li:dataPlatform:postgres,testdb.public.test1,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:postgres,testdb.public.test23,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:postgres,testdb.public.test,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:postgres,testdb.public.testA,PROD)", + ], + ) + + connector = DebeziumSourceConnector( + connector_manifest=connector_manifest, + config=config, + report=report, + schema_resolver=mock_resolver, # type: ignore[arg-type] + ) + + result = connector._expand_table_patterns( + "public\\.test[0-9]+", "postgres", "testdb" + ) + + # Should match only test followed by one or more digits + assert len(result) == 2 + assert "testdb.public.test1" in result + assert "testdb.public.test23" in result + assert "testdb.public.test" not in result + assert "testdb.public.testA" not in result + + def test_complex_grouping_pattern(self): + """Test complex grouping: (public|private)\\.(users|orders)""" + config = KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + use_schema_resolver=True, + schema_resolver_expand_patterns=True, + ) + report = KafkaConnectSourceReport() + + connector_manifest = ConnectorManifest( + name="postgres-source", + type="source", + config={ + "connector.class": 
"io.debezium.connector.postgresql.PostgresConnector", + "database.dbname": "testdb", + "table.include.list": "(public|private)\\.(users|orders)", + "database.server.name": "testserver", + }, + tasks=[], + ) + + mock_resolver = MockSchemaResolver( + platform="postgres", + mock_urns=[ + "urn:li:dataset:(urn:li:dataPlatform:postgres,testdb.public.users,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:postgres,testdb.public.orders,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:postgres,testdb.private.users,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:postgres,testdb.private.orders,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:postgres,testdb.public.products,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:postgres,testdb.admin.users,PROD)", + ], + ) + + connector = DebeziumSourceConnector( + connector_manifest=connector_manifest, + config=config, + report=report, + schema_resolver=mock_resolver, # type: ignore[arg-type] + ) + + result = connector._expand_table_patterns( + "(public|private)\\.(users|orders)", "postgres", "testdb" + ) + + # Should match exactly: public.users, public.orders, private.users, private.orders + assert len(result) == 4 + assert "testdb.public.users" in result + assert "testdb.public.orders" in result + assert "testdb.private.users" in result + assert "testdb.private.orders" in result + assert "testdb.public.products" not in result + assert "testdb.admin.users" not in result + + def test_mysql_two_tier_pattern(self): + """Test MySQL 2-tier pattern: mydb\\.user.*""" + config = KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + use_schema_resolver=True, + schema_resolver_expand_patterns=True, + ) + report = KafkaConnectSourceReport() + + connector_manifest = ConnectorManifest( + name="mysql-source", + type="source", + config={ + "connector.class": "io.debezium.connector.mysql.MySqlConnector", + "database.dbname": "mydb", + "table.include.list": "mydb\\.user.*", + "database.server.name": "mysqlserver", + }, + tasks=[], + ) + + mock_resolver = MockSchemaResolver( + platform="mysql", + mock_urns=[ + "urn:li:dataset:(urn:li:dataPlatform:mysql,mydb.users,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:mysql,mydb.user_roles,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:mysql,mydb.user_permissions,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:mysql,mydb.orders,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:mysql,otherdb.users,PROD)", + ], + ) + + connector = DebeziumSourceConnector( + connector_manifest=connector_manifest, + config=config, + report=report, + schema_resolver=mock_resolver, # type: ignore[arg-type] + ) + + result = connector._expand_table_patterns("mydb\\.user.*", "mysql", "mydb") + + # Should match mydb tables starting with "user" + assert len(result) == 3 + assert "mydb.users" in result + assert "mydb.user_roles" in result + assert "mydb.user_permissions" in result + assert "mydb.orders" not in result + assert "otherdb.users" not in result + + def test_escaped_dots_vs_any_char(self): + """Test that escaped dots (\\.) 
match literal dots, not any character.""" + config = KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + use_schema_resolver=True, + schema_resolver_expand_patterns=True, + ) + report = KafkaConnectSourceReport() + + connector_manifest = ConnectorManifest( + name="postgres-source", + type="source", + config={ + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.dbname": "testdb", + "table.include.list": "public\\.user", + "database.server.name": "testserver", + }, + tasks=[], + ) + + mock_resolver = MockSchemaResolver( + platform="postgres", + mock_urns=[ + "urn:li:dataset:(urn:li:dataPlatform:postgres,testdb.public.user,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:postgres,testdb.publicXuser,PROD)", + ], + ) + + connector = DebeziumSourceConnector( + connector_manifest=connector_manifest, + config=config, + report=report, + schema_resolver=mock_resolver, # type: ignore[arg-type] + ) + + result = connector._expand_table_patterns("public\\.user", "postgres", "testdb") + + # Escaped dot should match only literal dot, not any character + assert len(result) == 1 + assert "testdb.public.user" in result + assert "testdb.publicXuser" not in result + + def test_postgres_schema_without_database_prefix(self): + """Test PostgreSQL pattern without database prefix: public\\..*""" + config = KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + use_schema_resolver=True, + schema_resolver_expand_patterns=True, + ) + report = KafkaConnectSourceReport() + + connector_manifest = ConnectorManifest( + name="postgres-source", + type="source", + config={ + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.dbname": "testdb", + "table.include.list": "public\\..*", + "database.server.name": "testserver", + }, + tasks=[], + ) + + mock_resolver = MockSchemaResolver( + platform="postgres", + mock_urns=[ + "urn:li:dataset:(urn:li:dataPlatform:postgres,testdb.public.users,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:postgres,testdb.public.orders,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:postgres,testdb.private.secrets,PROD)", + ], + ) + + connector = DebeziumSourceConnector( + connector_manifest=connector_manifest, + config=config, + report=report, + schema_resolver=mock_resolver, # type: ignore[arg-type] + ) + + result = connector._expand_table_patterns("public\\..*", "postgres", "testdb") + + # Should match all tables in public schema (without database in pattern) + assert len(result) == 2 + assert "testdb.public.users" in result + assert "testdb.public.orders" in result + assert "testdb.private.secrets" not in result + + def test_quantifier_patterns(self): + """Test various quantifiers: +, *""" + config = KafkaConnectSourceConfig( + connect_uri="http://test:8083", + cluster_name="test", + use_schema_resolver=True, + schema_resolver_expand_patterns=True, + ) + report = KafkaConnectSourceReport() + + connector_manifest = ConnectorManifest( + name="postgres-source", + type="source", + config={ + "connector.class": "io.debezium.connector.postgresql.PostgresConnector", + "database.dbname": "testdb", + "table.include.list": "public\\.user_[a-z]+", + "database.server.name": "testserver", + }, + tasks=[], + ) + + mock_resolver = MockSchemaResolver( + platform="postgres", + mock_urns=[ + "urn:li:dataset:(urn:li:dataPlatform:postgres,testdb.public.user_ab,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:postgres,testdb.public.user_abc,PROD)", + 
"urn:li:dataset:(urn:li:dataPlatform:postgres,testdb.public.user_abcd,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:postgres,testdb.public.user_,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:postgres,testdb.public.user_123,PROD)", + ], + ) + + connector = DebeziumSourceConnector( + connector_manifest=connector_manifest, + config=config, + report=report, + schema_resolver=mock_resolver, # type: ignore[arg-type] + ) + + result = connector._expand_table_patterns( + "public\\.user_[a-z]+", "postgres", "testdb" + ) + + # Should match only tables with one or more lowercase letters after user_ + assert len(result) == 3 + assert "testdb.public.user_ab" in result + assert "testdb.public.user_abc" in result + assert "testdb.public.user_abcd" in result + assert "testdb.public.user_" not in result + assert "testdb.public.user_123" not in result diff --git a/metadata-ingestion/tests/unit/test_kafka_connect_snowflake_source.py b/metadata-ingestion/tests/unit/test_kafka_connect_snowflake_source.py new file mode 100644 index 00000000000000..792b0c7e2dc89c --- /dev/null +++ b/metadata-ingestion/tests/unit/test_kafka_connect_snowflake_source.py @@ -0,0 +1,653 @@ +""" +Tests for Snowflake Source Connector lineage extraction. + +This module tests the SnowflakeSourceConnector class which handles +lineage extraction for Confluent Cloud Snowflake Source connectors. +""" + +import pytest + +from datahub.ingestion.source.kafka_connect.common import ( + SNOWFLAKE_SOURCE_CLOUD, + ConnectorManifest, + KafkaConnectSourceConfig, + KafkaConnectSourceReport, +) +from datahub.ingestion.source.kafka_connect.connector_registry import ( + ConnectorRegistry, +) +from datahub.ingestion.source.kafka_connect.source_connectors import ( + SnowflakeSourceConnector, +) + + +@pytest.fixture +def config(): + """Create test configuration.""" + return KafkaConnectSourceConfig( + connect_uri="http://localhost:8083", cluster_name="test" + ) + + +@pytest.fixture +def report(): + """Create test report.""" + return KafkaConnectSourceReport() + + +def test_snowflake_source_connector_supports_class(): + """Test that SnowflakeSourceConnector recognizes correct connector class.""" + assert SnowflakeSourceConnector.supports_connector_class(SNOWFLAKE_SOURCE_CLOUD) + assert SnowflakeSourceConnector.supports_connector_class("SnowflakeSource") + assert not SnowflakeSourceConnector.supports_connector_class("SnowflakeSink") + assert not SnowflakeSourceConnector.supports_connector_class("PostgresCdcSource") + + +def test_snowflake_source_connector_platform(config, report): + """Test that SnowflakeSourceConnector returns correct platform.""" + manifest = ConnectorManifest( + name="test-connector", + type="source", + config={"connector.class": SNOWFLAKE_SOURCE_CLOUD}, + tasks=[], + ) + connector = SnowflakeSourceConnector(manifest, config, report) + platform = connector.get_platform() + assert platform == "snowflake" + + +def test_snowflake_source_parser_basic(config, report): + """Test basic parsing of Snowflake Source connector configuration.""" + manifest = ConnectorManifest( + name="snowflake-source-test", + type="source", + config={ + "connector.class": SNOWFLAKE_SOURCE_CLOUD, + "snowflake.database.name": "ANALYTICS", + "table.include.list": "ANALYTICS.PUBLIC.USERS,ANALYTICS.PUBLIC.ORDERS", + "topic.prefix": "snowflake_", + }, + tasks=[], + topic_names=[], + lineages=[], + ) + + connector = SnowflakeSourceConnector(manifest, config, report) + parser = connector.get_parser(manifest) + + assert parser.source_platform == "snowflake" + assert 
parser.database_name == "ANALYTICS" + assert parser.topic_prefix == "snowflake_" + assert len(parser.table_names) == 2 + assert "ANALYTICS.PUBLIC.USERS" in parser.table_names + assert "ANALYTICS.PUBLIC.ORDERS" in parser.table_names + + +def test_snowflake_source_get_topics_from_config(config, report): + """Test topic generation from Snowflake Source configuration.""" + manifest = ConnectorManifest( + name="snowflake-source-test", + type="source", + config={ + "connector.class": SNOWFLAKE_SOURCE_CLOUD, + "snowflake.database.name": "ANALYTICS", + "table.include.list": "ANALYTICS.PUBLIC.USERS,ANALYTICS.PUBLIC.ORDERS", + "topic.prefix": "snowflake_", + }, + tasks=[], + topic_names=[], + lineages=[], + ) + + connector = SnowflakeSourceConnector(manifest, config, report) + topics = connector.get_topics_from_config() + + assert len(topics) == 2 + # Topics are lowercased to match DataHub's normalization + assert "snowflake_analytics.public.users" in topics + assert "snowflake_analytics.public.orders" in topics + + +def test_snowflake_source_get_topics_without_prefix(config, report): + """Test topic generation without topic prefix.""" + manifest = ConnectorManifest( + name="snowflake-source-test", + type="source", + config={ + "connector.class": SNOWFLAKE_SOURCE_CLOUD, + "snowflake.database.name": "ANALYTICS", + "table.include.list": "ANALYTICS.PUBLIC.USERS,ANALYTICS.PUBLIC.ORDERS", + }, + tasks=[], + topic_names=[], + lineages=[], + ) + + connector = SnowflakeSourceConnector(manifest, config, report) + topics = connector.get_topics_from_config() + + assert len(topics) == 2 + # Topics are lowercased to match DataHub's normalization + assert "analytics.public.users" in topics + assert "analytics.public.orders" in topics + + +def test_snowflake_source_lineage_extraction(config, report): + """Test complete lineage extraction for Snowflake Source connector.""" + manifest = ConnectorManifest( + name="snowflake-source-test", + type="source", + config={ + "connector.class": SNOWFLAKE_SOURCE_CLOUD, + "snowflake.database.name": "ANALYTICS", + "table.include.list": "ANALYTICS.PUBLIC.USERS,ANALYTICS.PUBLIC.ORDERS", + "topic.prefix": "snowflake_", + }, + tasks=[], + topic_names=[ + # Topics are lowercased to match DataHub's normalization + "snowflake_analytics.public.users", + "snowflake_analytics.public.orders", + ], + lineages=[], + ) + + connector = SnowflakeSourceConnector(manifest, config, report) + lineages = connector.extract_lineages() + + assert len(lineages) == 2 + + # Verify first lineage (table names and topics are lowercase) + users_lineage = next( + (lineage for lineage in lineages if "users" in lineage.target_dataset), None + ) + assert users_lineage is not None + assert users_lineage.source_dataset == "analytics.public.users" + assert users_lineage.source_platform == "snowflake" + assert users_lineage.target_dataset == "snowflake_analytics.public.users" + assert users_lineage.target_platform == "kafka" + + # Verify second lineage (table names and topics are lowercase) + orders_lineage = next( + (lineage for lineage in lineages if "orders" in lineage.target_dataset), None + ) + assert orders_lineage is not None + assert orders_lineage.source_dataset == "analytics.public.orders" + assert orders_lineage.source_platform == "snowflake" + assert orders_lineage.target_dataset == "snowflake_analytics.public.orders" + assert orders_lineage.target_platform == "kafka" + + +def test_snowflake_source_lineage_no_matching_topics(config, report): + """Test lineage extraction when topics don't match 
configuration.""" + manifest = ConnectorManifest( + name="snowflake-source-test", + type="source", + config={ + "connector.class": SNOWFLAKE_SOURCE_CLOUD, + "snowflake.database.name": "ANALYTICS", + "table.include.list": "ANALYTICS.PUBLIC.USERS", + "topic.prefix": "snowflake_", + }, + tasks=[], + topic_names=["other_topic"], # Topic doesn't match expected pattern + lineages=[], + ) + + connector = SnowflakeSourceConnector(manifest, config, report) + lineages = connector.extract_lineages() + + # Should return empty list when no topics match + assert len(lineages) == 0 + + +def test_snowflake_source_flow_property_bag(config, report): + """Test that sensitive fields are excluded from flow property bag.""" + manifest = ConnectorManifest( + name="snowflake-source-test", + type="source", + config={ + "connector.class": SNOWFLAKE_SOURCE_CLOUD, + "snowflake.database.name": "ANALYTICS", + "connection.user": "admin", + "connection.password": "secret123", + "snowflake.private.key": "private_key_data", + "snowflake.private.key.passphrase": "passphrase123", + "table.include.list": "ANALYTICS.PUBLIC.USERS", + }, + tasks=[], + topic_names=[], + lineages=[], + ) + + connector = SnowflakeSourceConnector(manifest, config, report) + flow_props = connector.extract_flow_property_bag() + + # Verify sensitive fields are excluded + assert "connection.password" not in flow_props + assert "connection.user" not in flow_props + assert "snowflake.private.key" not in flow_props + assert "snowflake.private.key.passphrase" not in flow_props + + # Verify non-sensitive fields are included + assert "connector.class" in flow_props + assert "snowflake.database.name" in flow_props + assert "table.include.list" in flow_props + + +def test_connector_registry_recognizes_snowflake_source(config, report): + """Test that ConnectorRegistry creates SnowflakeSourceConnector for Snowflake Source.""" + manifest = ConnectorManifest( + name="snowflake-source-test", + type="source", + config={ + "connector.class": SNOWFLAKE_SOURCE_CLOUD, + "snowflake.database.name": "ANALYTICS", + "table.include.list": "ANALYTICS.PUBLIC.USERS", + }, + tasks=[], + topic_names=[], + lineages=[], + ) + + connector = ConnectorRegistry.get_connector_for_manifest(manifest, config, report) + + assert connector is not None + assert isinstance(connector, SnowflakeSourceConnector) + + +def test_snowflake_source_with_database_name_fallback(config, report): + """Test parsing with database.name fallback.""" + manifest = ConnectorManifest( + name="snowflake-source-test", + type="source", + config={ + "connector.class": SNOWFLAKE_SOURCE_CLOUD, + "database.name": "ANALYTICS", # Fallback field + "table.include.list": "ANALYTICS.PUBLIC.USERS", + }, + tasks=[], + topic_names=[], + lineages=[], + ) + + connector = SnowflakeSourceConnector(manifest, config, report) + parser = connector.get_parser(manifest) + + assert parser.database_name == "ANALYTICS" + + +def test_snowflake_source_with_table_whitelist(config, report): + """Test parsing with deprecated table.whitelist field.""" + manifest = ConnectorManifest( + name="snowflake-source-test", + type="source", + config={ + "connector.class": SNOWFLAKE_SOURCE_CLOUD, + "snowflake.database.name": "ANALYTICS", + "table.whitelist": "ANALYTICS.PUBLIC.USERS,ANALYTICS.PUBLIC.ORDERS", + }, + tasks=[], + topic_names=[], + lineages=[], + ) + + connector = SnowflakeSourceConnector(manifest, config, report) + parser = connector.get_parser(manifest) + + assert len(parser.table_names) == 2 + assert "ANALYTICS.PUBLIC.USERS" in parser.table_names 
+ assert "ANALYTICS.PUBLIC.ORDERS" in parser.table_names + + +def test_snowflake_source_with_patterns_no_schema_resolver(config, report): + """Test that patterns without schema resolver are skipped with warning.""" + manifest = ConnectorManifest( + name="snowflake-source-test", + type="source", + config={ + "connector.class": SNOWFLAKE_SOURCE_CLOUD, + "snowflake.database.name": "ANALYTICS", + "table.include.list": "ANALYTICS.PUBLIC.*,ANALYTICS.PRIVATE.USER_*", + }, + tasks=[], + topic_names=["some_topic"], + lineages=[], + ) + + connector = SnowflakeSourceConnector(manifest, config, report) + # schema_resolver is None by default + assert connector.schema_resolver is None + + lineages = connector.extract_lineages() + + # Should return empty lineages when patterns exist but no schema resolver + assert len(lineages) == 0 + + # Should have warning in report + assert len(report.warnings) > 0 + warning_messages = [w.message for w in report.warnings if hasattr(w, "message")] + assert any("table patterns" in msg.lower() for msg in warning_messages) + assert any( + "schema resolver is not available" in msg.lower() for msg in warning_messages + ) + + +def test_snowflake_source_static_tables_without_patterns(config, report): + """Test that static table lists work without schema resolver.""" + manifest = ConnectorManifest( + name="snowflake-source-test", + type="source", + config={ + "connector.class": SNOWFLAKE_SOURCE_CLOUD, + "snowflake.database.name": "ANALYTICS", + "table.include.list": "ANALYTICS.PUBLIC.USERS,ANALYTICS.PUBLIC.ORDERS", + "topic.prefix": "snowflake_", + }, + tasks=[], + topic_names=[ + # Topics are lowercased to match DataHub's normalization + "snowflake_analytics.public.users", + "snowflake_analytics.public.orders", + ], + lineages=[], + ) + + connector = SnowflakeSourceConnector(manifest, config, report) + # schema_resolver is None by default + assert connector.schema_resolver is None + + lineages = connector.extract_lineages() + + # Should work fine with static table lists + assert len(lineages) == 2 + + # Should have no warnings + assert len(report.warnings) == 0 + + +def test_snowflake_source_pattern_expansion_with_schema_resolver(config, report): + """Test successful pattern expansion when schema resolver is available.""" + from unittest.mock import Mock + + manifest = ConnectorManifest( + name="snowflake-source-test", + type="source", + config={ + "connector.class": SNOWFLAKE_SOURCE_CLOUD, + "snowflake.database.name": "ANALYTICS", + "table.include.list": "ANALYTICS.PUBLIC.*", + "topic.prefix": "snowflake_", + }, + tasks=[], + topic_names=[ + "snowflake_analytics.public.users", + "snowflake_analytics.public.orders", + "snowflake_analytics.public.products", + ], + lineages=[], + ) + + connector = SnowflakeSourceConnector(manifest, config, report) + + # Mock schema resolver with test URNs + # Note: DataHub normalizes table names to lowercase in URNs + mock_resolver = Mock() + mock_graph = Mock() + # Mock graph.get_urns_by_filter to return test URNs + mock_graph.get_urns_by_filter.return_value = iter( + [ + "urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.public.users,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.public.orders,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.public.products,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.private.secrets,PROD)", + ] + ) + mock_resolver.graph = mock_graph + mock_resolver.env = "PROD" + connector.schema_resolver = mock_resolver + + lineages = 
connector.extract_lineages() + + # Should expand pattern to 3 matching tables + assert len(lineages) == 3 + + # Verify lineages (table names are lowercase as returned by DataHub) + source_datasets = {lin.source_dataset for lin in lineages} + assert "analytics.public.users" in source_datasets + assert "analytics.public.orders" in source_datasets + assert "analytics.public.products" in source_datasets + + # Verify no warnings + assert len(report.warnings) == 0 + + +def test_snowflake_source_pattern_expansion_mixed_patterns_and_explicit(config, report): + """Test pattern expansion with mix of patterns and explicit table names.""" + from unittest.mock import Mock + + manifest = ConnectorManifest( + name="snowflake-source-test", + type="source", + config={ + "connector.class": SNOWFLAKE_SOURCE_CLOUD, + "snowflake.database.name": "ANALYTICS", + "table.include.list": "ANALYTICS.PUBLIC.*,ANALYTICS.PRIVATE.SPECIFIC_TABLE", + "topic.prefix": "snowflake_", + }, + tasks=[], + topic_names=[ + "snowflake_analytics.public.users", + "snowflake_analytics.public.orders", + "snowflake_analytics.private.specific_table", + ], + lineages=[], + ) + + connector = SnowflakeSourceConnector(manifest, config, report) + + # Mock schema resolver (table names in lowercase as in real DataHub) + mock_resolver = Mock() + mock_graph = Mock() + # Mock graph.get_urns_by_filter to return test URNs + mock_graph.get_urns_by_filter.return_value = iter( + [ + "urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.public.users,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.public.orders,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.private.specific_table,PROD)", + ] + ) + mock_resolver.graph = mock_graph + mock_resolver.env = "PROD" + connector.schema_resolver = mock_resolver + + lineages = connector.extract_lineages() + + # Should have 3 lineages (2 from pattern + 1 explicit) + assert len(lineages) == 3 + + source_datasets = {lin.source_dataset for lin in lineages} + assert "analytics.public.users" in source_datasets + assert "analytics.public.orders" in source_datasets + assert "analytics.private.specific_table" in source_datasets + + +def test_snowflake_source_pattern_expansion_no_matches(config, report): + """Test pattern expansion when no tables match the pattern.""" + from unittest.mock import Mock + + manifest = ConnectorManifest( + name="snowflake-source-test", + type="source", + config={ + "connector.class": SNOWFLAKE_SOURCE_CLOUD, + "snowflake.database.name": "ANALYTICS", + "table.include.list": "ANALYTICS.PUBLIC.*", + "topic.prefix": "snowflake_", + }, + tasks=[], + topic_names=["snowflake_ANALYTICS.PUBLIC.USERS"], + lineages=[], + ) + + connector = SnowflakeSourceConnector(manifest, config, report) + + # Mock schema resolver with no matching tables + mock_resolver = Mock() + mock_resolver.get_urns.return_value = [ + "urn:li:dataset:(urn:li:dataPlatform:snowflake,ANALYTICS.PRIVATE.USERS,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:postgres,ANALYTICS.PUBLIC.ORDERS,PROD)", + ] + connector.schema_resolver = mock_resolver + + lineages = connector.extract_lineages() + + # Should return empty list when no matches + assert len(lineages) == 0 + + +def test_snowflake_source_pattern_expansion_multiple_patterns(config, report): + """Test expansion of multiple patterns.""" + from unittest.mock import Mock + + manifest = ConnectorManifest( + name="snowflake-source-test", + type="source", + config={ + "connector.class": SNOWFLAKE_SOURCE_CLOUD, + "snowflake.database.name": "ANALYTICS", + 
"table.include.list": "ANALYTICS.PUBLIC.USER_*,ANALYTICS.PUBLIC.ORDER_*", + "topic.prefix": "snowflake_", + }, + tasks=[], + topic_names=[ + "snowflake_analytics.public.user_profiles", + "snowflake_analytics.public.user_settings", + "snowflake_analytics.public.order_items", + "snowflake_analytics.public.order_history", + ], + lineages=[], + ) + + connector = SnowflakeSourceConnector(manifest, config, report) + + # Mock schema resolver (table names in lowercase as in real DataHub) + mock_resolver = Mock() + mock_graph = Mock() + # Mock graph.get_urns_by_filter to return test URNs + # Use side_effect to return a fresh iterator each time (needed for multiple pattern expansions) + mock_graph.get_urns_by_filter.side_effect = lambda **kwargs: iter( + [ + "urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.public.user_profiles,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.public.user_settings,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.public.order_items,PROD)", + "urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.public.order_history,PROD)", + ] + ) + mock_resolver.graph = mock_graph + mock_resolver.env = "PROD" + connector.schema_resolver = mock_resolver + + lineages = connector.extract_lineages() + + # Should expand both patterns + assert len(lineages) == 4 + + source_datasets = {lin.source_dataset for lin in lineages} + assert "analytics.public.user_profiles" in source_datasets + assert "analytics.public.user_settings" in source_datasets + assert "analytics.public.order_items" in source_datasets + assert "analytics.public.order_history" in source_datasets + + +def test_snowflake_source_pattern_expansion_empty_datahub_response(config, report): + """ + Test pattern expansion when DataHub returns no tables at all. + + This is a realistic scenario where: + 1. User configures a pattern (e.g., "ANALYTICS.PUBLIC.*") + 2. Schema resolver queries DataHub successfully + 3. DataHub returns empty result (no tables matching the pattern exist) + 4. 
Connector should handle this gracefully and log a warning + """ + from unittest.mock import Mock + + manifest = ConnectorManifest( + name="snowflake-source-test", + type="source", + config={ + "connector.class": SNOWFLAKE_SOURCE_CLOUD, + "snowflake.database.name": "ANALYTICS", + "table.include.list": "ANALYTICS.NONEXISTENT.*", + "topic.prefix": "snowflake_", + }, + tasks=[], + topic_names=[], + lineages=[], + ) + + connector = SnowflakeSourceConnector(manifest, config, report) + + # Mock schema resolver that returns empty results from DataHub + mock_resolver = Mock() + mock_graph = Mock() + # DataHub returns empty iterator - no tables found + mock_graph.get_urns_by_filter.return_value = iter([]) + mock_resolver.graph = mock_graph + mock_resolver.env = "PROD" + connector.schema_resolver = mock_resolver + + # Test get_topics_from_config - should return empty list + topics = connector.get_topics_from_config() + assert len(topics) == 0 + + # Test extract_lineages - should also return empty list + lineages = connector.extract_lineages() + assert len(lineages) == 0 + + # Verify that cached expanded tables is set to empty list + assert connector._cached_expanded_tables == [] + + +def test_snowflake_source_parser_extracts_transforms(config, report): + """Test that parser correctly extracts transform configuration.""" + manifest = ConnectorManifest( + name="snowflake-source-test", + type="source", + config={ + "connector.class": SNOWFLAKE_SOURCE_CLOUD, + "snowflake.database.name": "ANALYTICS", + "table.include.list": "ANALYTICS.PUBLIC.USERS", + "transforms": "route,timestamp", + "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter", + "transforms.route.regex": "snowflake_(.*)", + "transforms.route.replacement": "prod_$1", + "transforms.timestamp.type": "org.apache.kafka.connect.transforms.TimestampConverter", + "transforms.timestamp.field": "updated_at", + }, + tasks=[], + topic_names=[], + lineages=[], + ) + + connector = SnowflakeSourceConnector(manifest, config, report) + parser = connector.get_parser(manifest) + + # Verify transforms are parsed + assert len(parser.transforms) == 2 + + # Verify first transform (route) + route_transform = parser.transforms[0] + assert route_transform["name"] == "route" + assert route_transform["type"] == "org.apache.kafka.connect.transforms.RegexRouter" + assert route_transform["regex"] == "snowflake_(.*)" + assert route_transform["replacement"] == "prod_$1" + + # Verify second transform (timestamp) + timestamp_transform = parser.transforms[1] + assert timestamp_transform["name"] == "timestamp" + assert ( + timestamp_transform["type"] + == "org.apache.kafka.connect.transforms.TimestampConverter" + ) + assert timestamp_transform["field"] == "updated_at"
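For readers following the Java-regex expansion tests above, here is a minimal, hypothetical sketch of the matching behaviour they assert: a `table.include.list` pattern is compiled as a regex and full-matched against known table names, falling back to the raw pattern when nothing matches. The helper name `expand_table_pattern` and the bare `schema.table` inputs are illustrative assumptions, not DataHub's actual `_expand_table_patterns` implementation.

```python
import re
from typing import List


def expand_table_pattern(pattern: str, table_names: List[str]) -> List[str]:
    """Hypothetical helper mirroring the behaviour asserted in the tests above.

    Assumes Java-style patterns from `table.include.list` (escaped dots are
    literal; `.*`, `[0-9]+`, and `(a|b)` groups behave as in java.util.regex),
    which for these cases is compatible with Python's `re` module.
    """
    compiled = re.compile(pattern)
    # Full-match each candidate so `public\.user` does not match `publicXuser`.
    matches = [name for name in table_names if compiled.fullmatch(name)]
    # Keep the raw pattern when nothing matches, mirroring the
    # "return pattern as-is" assertions in the expansion tests.
    return matches or [pattern]


# Usage, mirroring test_alternation_pattern (database prefix omitted for brevity):
tables = ["public.bg_users", "public.cp_orders", "public.fg_data", "public.users"]
assert expand_table_pattern(r"public\.(bg|cp)_.*", tables) == [
    "public.bg_users",
    "public.cp_orders",
]
```

The tests suggest the real implementation also accounts for the database prefix (for example, matching `public\.(bg|cp)_.*` yet returning `testdb.public.bg_users`), which this sketch omits for clarity.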