+/**
+ * This class provides the command-line interface for running the TLS-Crawler in two modes:
+ * controller and worker.
+ *
+ * The application uses RabbitMQ for communication between controllers and workers, and MongoDB
+ * for persistence of scan results and job status.
+ *
+ * Usage examples:
+ *
+ * {@code
+ * java -jar crawler-core.jar controller --config controller.properties
+ * java -jar crawler-core.jar worker --config worker.properties
+ * }
+ *
+ * @see Controller
+ * @see Worker
+ * @see ControllerCommandConfig
+ * @see WorkerCommandConfig
+ */
public class CommonMain {
private static final Logger LOGGER = LogManager.getLogger();
+ /** Private constructor to prevent instantiation of utility class. */
+ private CommonMain() {
+ // Utility class should not be instantiated
+ }
+ /**
+ * Main entry point for the TLS-Crawler application.
+ *
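+ * A typical scanner module delegates its own entry point here (a sketch; the config class
+ * name is illustrative):
+ *
+ * {@code
+ * public static void main(String[] args) {
+ *     CommonMain.main(args, new ExampleControllerConfig(), new WorkerCommandConfig());
+ * }
+ * }
+ *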
+ * Parses command line arguments to determine whether to run as a controller or worker,
+ * initializes the appropriate configuration and dependencies, and starts the selected mode.
+ *
+ * @param args command line arguments including the mode ("controller" or "worker") and
+ *     configuration parameters
+ * @param controllerCommandConfig configuration for controller mode
+ * @param workerCommandConfig configuration for worker mode
+ */
public static void main(
String[] args,
ControllerCommandConfig controllerCommandConfig,
@@ -71,6 +112,15 @@ public static void main(
}
}
+ /**
+ * Convenience method for running the application with only controller configuration.
+ *
+ * Creates a default worker configuration and delegates to the main method. This is useful
+ * when only controller functionality is needed.
+ *
+ * @param args command line arguments
+ * @param controllerConfig configuration for controller mode
+ */
public static void main(String[] args, ControllerCommandConfig controllerConfig) {
main(args, controllerConfig, new WorkerCommandConfig());
}
diff --git a/src/main/java/de/rub/nds/crawler/config/ControllerCommandConfig.java b/src/main/java/de/rub/nds/crawler/config/ControllerCommandConfig.java
index becc425..31527f5 100644
--- a/src/main/java/de/rub/nds/crawler/config/ControllerCommandConfig.java
+++ b/src/main/java/de/rub/nds/crawler/config/ControllerCommandConfig.java
@@ -22,6 +22,56 @@
import org.apache.commons.validator.routines.UrlValidator;
import org.quartz.CronScheduleBuilder;
+/**
+ * Abstract base configuration class for TLS-Crawler controller command-line arguments.
+ *
+ * This class defines the common configuration parameters needed by controller implementations to
+ * orchestrate large-scale TLS scanning operations. It uses JCommander annotations for command-line
+ * parsing and provides comprehensive validation of input parameters.
+ *
+ * Target List Priority: When multiple target sources are specified, a fixed priority order
+ * determines which source is used; see getTargetListProvider().
+ *
+ * Validation Rules: all parameters are checked for consistency after parsing; see validate().
+ *
+ * Extension Points: Subclasses must implement getScanConfig() and getScannerClassForVersion().
+ *
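+ * A minimal subclass might look like this (a sketch; the names and the ScanConfig subtype are
+ * illustrative):
+ *
+ * {@code
+ * public class ExampleControllerConfig extends ControllerCommandConfig {
+ *     @Override
+ *     public ScanConfig getScanConfig() {
+ *         return new ExampleScanConfig();
+ *     }
+ *
+ *     @Override
+ *     public Class<?> getScannerClassForVersion() {
+ *         return ExampleScanner.class;
+ *     }
+ * }
+ * }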
+ */
public abstract class ControllerCommandConfig {
+ /**
+ * Creates a new controller command configuration with default delegate instances.
+ *
+ * This constructor initializes the delegate objects that handle RabbitMQ and MongoDB
+ * configuration parameters. The delegates use JCommander's @ParametersDelegate annotation to
+ * include their parameters in the overall command-line parsing.
+ */
+ /**
+ * Validates the parsed command-line configuration.
+ *
+ * This method performs comprehensive validation of all configuration parameters to ensure
+ * they form a valid and consistent configuration. It checks for required parameters, validates
+ * dependencies between parameters, and verifies format requirements.
+ */
+ /**
+ * JCommander validator for non-negative integer parameters.
+ *
+ * This validator ensures that integer parameters have non-negative values (>= 0). It is used
+ * for timeout and reexecution parameters where negative values would be meaningless.
+ *
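+ * A validator of this shape satisfies JCommander's IParameterValidator contract (a sketch;
+ * the actual class may differ):
+ *
+ * {@code
+ * public class PositiveIntegerValidator implements IParameterValidator {
+ *     @Override
+ *     public void validate(String name, String value) throws ParameterException {
+ *         if (Integer.parseInt(value) < 0) {
+ *             throw new ParameterException(name + " must be >= 0 (found " + value + ")");
+ *         }
+ *     }
+ * }
+ * }
+ */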
+ /**
+ * JCommander validator for Quartz cron expressions.
+ *
+ * This validator ensures that cron expression parameters conform to valid Quartz cron
+ * syntax. It is used for the scanCronInterval parameter to validate recurring scan schedules.
+ *
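+ * Validation can be delegated to Quartz itself (a sketch; the exact check may differ):
+ *
+ * {@code
+ * // CronScheduleBuilder.cronSchedule throws a RuntimeException for invalid expressions
+ * CronScheduleBuilder.cronSchedule("0 0 3 * * ?");
+ * }
+ */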
+ /**
+ * Creates the target list provider for this configuration.
+ *
+ * This method implements the target source priority logic, selecting the appropriate
+ * provider based on which parameters were specified. It provides a single point of target list
+ * creation with consistent priority ordering.
+ */
+ /**
+ * Returns the scan configuration for the specific scanner type.
+ *
+ * This abstract method must be implemented by subclasses to provide the appropriate
+ * ScanConfig instance for their specific scanner type. The scan configuration defines how
+ * individual scan jobs should be executed.
+ *
+ * Implementation Requirements: Subclasses should create a ScanConfig that includes all
+ * scanner-specific parameters the workers need to execute a scan.
+ */
public abstract ScanConfig getScanConfig();
+ /**
+ * Creates a new BulkScan instance for a scanning campaign.
+ *
+ * This factory method constructs a BulkScan object with all necessary metadata and
+ * configuration for a scanning campaign. The BulkScan serves as the central coordination object
+ * for the entire scanning operation.
+ */
+ /**
+ * Returns the controller implementation class for version tracking.
+ *
+ * This method provides the controller implementation class for tracking which version of the
+ * crawler was used to create a bulk scan. This information is stored in the BulkScan metadata
+ * for debugging and compatibility purposes.
+ *
+ * @return the concrete controller class that extends this configuration
+ */
public Class<?> getCrawlerClassForVersion() {
return this.getClass();
}
+ /**
+ * Returns the scanner implementation class for version tracking.
+ *
+ * This abstract method must be implemented by subclasses to provide the specific scanner
+ * class they use. This information is stored in BulkScan metadata for version tracking and
+ * worker compatibility verification.
+ */
public abstract Class<?> getScannerClassForVersion();
diff --git a/src/main/java/de/rub/nds/crawler/config/WorkerCommandConfig.java b/src/main/java/de/rub/nds/crawler/config/WorkerCommandConfig.java
--- a/src/main/java/de/rub/nds/crawler/config/WorkerCommandConfig.java
+++ b/src/main/java/de/rub/nds/crawler/config/WorkerCommandConfig.java
+/**
+ * Configuration class for TLS-Crawler worker command-line arguments.
+ *
+ * This class defines the configuration parameters needed by worker instances to participate in
+ * distributed TLS scanning operations. Workers consume scan jobs from the message queue, execute
+ * TLS scans, and store results in the database. The configuration controls worker performance,
+ * concurrency, and integration with the distributed infrastructure.
+ *
+ * Key configuration areas are the threading architecture (parallel scan and connection
+ * threads), timeout coordination with the message queue, and resource management.
+ *
+ * Infrastructure Integration: Uses delegate pattern for RabbitMQ and MongoDB
+ * configuration to maintain separation of concerns and enable reuse across controller and worker
+ * configurations.
+ *
+ * @see RabbitMqDelegate
+ * @see MongoDbDelegate
+ * @see ControllerCommandConfig
+ */
public class WorkerCommandConfig {
@ParametersDelegate private final RabbitMqDelegate rabbitMqDelegate;
@@ -38,39 +91,130 @@ public class WorkerCommandConfig {
+ "After the timeout the worker tries to shut down the scan, but a shutdown cannot be guaranteed due to the TLS-Scanner implementation.")
private int scanTimeout = 840000;
+ /**
+ * Creates a new worker command configuration with default delegate instances.
+ *
+ * This constructor initializes the delegate objects that handle RabbitMQ and MongoDB
+ * configuration parameters. The delegates use JCommander's @ParametersDelegate annotation to
+ * include their parameters in the worker's command-line parsing.
+ */
+ /**
+ * Gets the number of parallel scan threads.
+ *
+ * Each scan thread runs a separate TLS scanner instance, allowing the worker to process
+ * multiple scan jobs simultaneously. The default value equals the number of available CPU cores
+ * for optimal processor utilization.
+ *
+ * @return the number of parallel scan threads (default: CPU count)
+ */
public int getParallelScanThreads() {
return parallelScanThreads;
}
+ /**
+ * Gets the number of parallel connection threads for network operations.
+ *
+ * These threads are shared across all scan threads within a bulk scan to handle concurrent
+ * network connections efficiently. A higher count allows more simultaneous connections but
+ * increases resource usage.
+ *
+ * @return the number of parallel connection threads (default: 20)
+ */
public int getParallelConnectionThreads() {
return parallelConnectionThreads;
}
+ /**
+ * Gets the overall timeout for individual scan operations.
+ *
+ * Critical Timing Constraint: This timeout must be lower than the RabbitMQ
+ * consumer acknowledgment timeout (default 15 minutes) to prevent connection closure due to
+ * unacknowledged messages.
+ *
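+ * For example, the default of 840000 ms (14 minutes) stays below a 15-minute consumer
+ * acknowledgment timeout (a sketch; the required margin depends on the broker configuration):
+ *
+ * {@code
+ * workerConfig.setScanTimeout(840_000); // below the 15 min RabbitMQ ACK timeout
+ * }
+ *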
+ * @return the scan timeout in milliseconds (default: 840000)
+ */
public int getScanTimeout() {
return scanTimeout;
}
+ /**
+ * Sets the number of parallel scan threads.
+ *
+ * Configures how many TLS scanner instances can run simultaneously within this worker.
+ * Higher values increase throughput but also CPU and memory usage.
+ *
+ * @param parallelScanThreads the number of parallel scan threads
+ */
public void setParallelScanThreads(int parallelScanThreads) {
this.parallelScanThreads = parallelScanThreads;
}
+ /**
+ * Sets the number of parallel connection threads for network operations.
+ *
+ * Configures the shared thread pool size for concurrent network connections across all scan
+ * operations, balancing connection capacity against resource usage.
+ *
+ * @param parallelConnectionThreads the number of parallel connection threads
+ */
public void setParallelConnectionThreads(int parallelConnectionThreads) {
this.parallelConnectionThreads = parallelConnectionThreads;
}
+ /**
+ * Sets the overall timeout for individual scan operations.
+ *
+ * Important: Must be less than the RabbitMQ consumer ACK timeout to prevent
+ * message queue connection issues.
+ *
+ * @param scanTimeout the scan timeout in milliseconds
+ */
public void setScanTimeout(int scanTimeout) {
this.scanTimeout = scanTimeout;
}
diff --git a/src/main/java/de/rub/nds/crawler/config/delegate/MongoDbDelegate.java b/src/main/java/de/rub/nds/crawler/config/delegate/MongoDbDelegate.java
index 3cfd571..5a293ab 100644
--- a/src/main/java/de/rub/nds/crawler/config/delegate/MongoDbDelegate.java
+++ b/src/main/java/de/rub/nds/crawler/config/delegate/MongoDbDelegate.java
@@ -10,8 +10,56 @@
import com.beust.jcommander.Parameter;
+/**
+ * Configuration delegate for MongoDB database connection parameters in TLS-Crawler.
+ *
+ * The MongoDbDelegate encapsulates all MongoDB-specific configuration parameters used for
+ * database connectivity in the TLS-Crawler distributed architecture. It uses JCommander annotations
+ * to provide command-line parameter parsing and supports both password-based and file-based
+ * authentication methods.
+ *
+ * Authentication Methods: the password can be given directly (-mongoDbPass) or read from a
+ * file (mongoDbPassFile); the authentication source database is configurable.
+ *
+ * Usage Pattern: This delegate is embedded in both ControllerCommandConfig and
+ * WorkerCommandConfig using JCommander's @ParametersDelegate annotation, allowing the same MongoDB
+ * configuration to be shared across all application components.
+ *
+ * Security Considerations: prefer the password file over a plain command-line password so
+ * that credentials do not leak into process listings or shell history.
+ *
+ * Default Behavior: All parameters are optional and default to null, allowing
+ * for environment-specific configuration or default MongoDB connection settings.
+ *
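+ * Example invocation (values are placeholders; parameter names assumed to mirror the field
+ * names):
+ *
+ * {@code
+ * -mongoDbHost localhost -mongoDbPort 27017 -mongoDbUser crawler
+ * -mongoDbPassFile /run/secrets/mongo_pass -mongoDbAuthSource admin
+ * }
+ *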
+ * Used by ControllerCommandConfig and WorkerCommandConfig for database configuration. Its
+ * parameters are used to create IPersistenceProvider instances, typically a
+ * MongoPersistenceProvider.
+ */
public class MongoDbDelegate {
+ /** Creates a new MongoDB configuration delegate with default settings. */
+ public MongoDbDelegate() {
+ // Default constructor for JCommander parameter injection
+ }
@Parameter(
names = "-mongoDbHost",
description = "Host of the MongoDB instance this crawler saves to.")
@@ -42,50 +90,119 @@ public class MongoDbDelegate {
description = "The DB within the MongoDB instance, in which the user:pass is defined.")
private String mongoDbAuthSource;
+ /**
+ * Gets the MongoDB host address.
+ *
+ * @return the MongoDB hostname or IP address, or null if not configured
+ */
public String getMongoDbHost() {
return mongoDbHost;
}
+ /**
+ * Gets the MongoDB port number.
+ *
+ * @return the MongoDB port number, or 0 if not configured (uses MongoDB default)
+ */
public int getMongoDbPort() {
return mongoDbPort;
}
+ /**
+ * Gets the MongoDB authentication username.
+ *
+ * @return the username for MongoDB authentication, or null if not configured
+ */
public String getMongoDbUser() {
return mongoDbUser;
}
+ /**
+ * Gets the MongoDB authentication password.
+ *
+ * Security Note: Consider using mongoDbPassFile for production deployments
+ * to avoid exposing passwords in command-line arguments.
+ *
+ * @return the password for MongoDB authentication, or null if not configured
+ */
public String getMongoDbPass() {
return mongoDbPass;
}
+ /**
+ * Gets the path to the MongoDB password file.
+ *
+ * This provides a more secure alternative to specifying passwords directly in command-line
+ * arguments by reading the password from a file.
+ *
+ * @return the path to the password file, or null if not configured
+ */
public String getMongoDbPassFile() {
return mongoDbPassFile;
}
+ /**
+ * Gets the MongoDB authentication source database.
+ *
+ * This specifies which database contains the user credentials for authentication. Commonly
+ * set to "admin" for centralized user management.
+ *
+ * @return the authentication source database name, or null if not configured
+ */
public String getMongoDbAuthSource() {
return mongoDbAuthSource;
}
+ /**
+ * Sets the MongoDB host address.
+ *
+ * @param mongoDbHost the MongoDB hostname or IP address
+ */
public void setMongoDbHost(String mongoDbHost) {
this.mongoDbHost = mongoDbHost;
}
+ /**
+ * Sets the MongoDB port number.
+ *
+ * @param mongoDbPort the MongoDB port number (typically 27017)
+ */
public void setMongoDbPort(int mongoDbPort) {
this.mongoDbPort = mongoDbPort;
}
+ /**
+ * Sets the MongoDB authentication username.
+ *
+ * @param mongoDbUser the username for MongoDB authentication
+ */
public void setMongoDbUser(String mongoDbUser) {
this.mongoDbUser = mongoDbUser;
}
+ /**
+ * Sets the MongoDB authentication password.
+ *
+ * @param mongoDbPass the password for MongoDB authentication
+ */
public void setMongoDbPass(String mongoDbPass) {
this.mongoDbPass = mongoDbPass;
}
+ /**
+ * Sets the path to the MongoDB password file.
+ *
+ * @param mongoDbPassFile the path to the file containing the MongoDB password
+ */
public void setMongoDbPassFile(String mongoDbPassFile) {
this.mongoDbPassFile = mongoDbPassFile;
}
+ /**
+ * Sets the MongoDB authentication source database.
+ *
+ * @param mongoDbAuthSource the database name containing user credentials
+ */
public void setMongoDbAuthSource(String mongoDbAuthSource) {
this.mongoDbAuthSource = mongoDbAuthSource;
}
diff --git a/src/main/java/de/rub/nds/crawler/config/delegate/RabbitMqDelegate.java b/src/main/java/de/rub/nds/crawler/config/delegate/RabbitMqDelegate.java
index 9d89180..03454dc 100644
--- a/src/main/java/de/rub/nds/crawler/config/delegate/RabbitMqDelegate.java
+++ b/src/main/java/de/rub/nds/crawler/config/delegate/RabbitMqDelegate.java
@@ -10,8 +10,63 @@
import com.beust.jcommander.Parameter;
+/**
+ * Configuration delegate for RabbitMQ message queue connection parameters in TLS-Crawler.
+ *
+ * The RabbitMqDelegate encapsulates all RabbitMQ-specific configuration parameters used for
+ * message queue connectivity in the TLS-Crawler distributed architecture. It provides connection
+ * settings, authentication credentials, and security options for the messaging infrastructure that
+ * coordinates work between controllers and workers.
+ *
+ * Authentication Methods: the password can be given directly (-rabbitMqPass) or read from a
+ * file (rabbitMqPassFile).
+ *
+ * Security Configuration: TLS encryption for broker connections can be enabled with
+ * -rabbitMqTLS.
+ *
+ * Usage Pattern: This delegate is embedded in both ControllerCommandConfig and
+ * WorkerCommandConfig using JCommander's @ParametersDelegate annotation, ensuring consistent
+ * RabbitMQ configuration across all distributed components.
+ *
+ * Distributed Architecture: RabbitMQ serves as the central coordination
+ * mechanism in TLS-Crawler, handling scan job distribution, completion notifications, and progress
+ * monitoring between controllers and multiple worker instances.
+ *
+ * Default Behavior: All parameters are optional and default to appropriate
+ * values (null for strings, false for TLS, 0 for port), allowing for environment-specific
+ * configuration or RabbitMQ default connection settings.
+ *
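+ * Example invocation (values are placeholders; parameter names assumed to mirror the field
+ * names):
+ *
+ * {@code
+ * -rabbitMqHost broker.example.com -rabbitMqPort 5671 -rabbitMqTLS
+ * -rabbitMqUser crawler -rabbitMqPassFile /run/secrets/rabbit_pass
+ * }
+ *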
+ * Used by ControllerCommandConfig and WorkerCommandConfig for message queue configuration. Its
+ * parameters are used to create IOrchestrationProvider instances, typically a
+ * RabbitMqOrchestrationProvider.
+ */
public class RabbitMqDelegate {
+ /** Creates a new RabbitMQ configuration delegate with default settings. */
+ public RabbitMqDelegate() {
+ // Default constructor for JCommander parameter injection
+ }
@Parameter(names = "-rabbitMqHost")
private String rabbitMqHost;
@@ -30,50 +85,120 @@ public class RabbitMqDelegate {
@Parameter(names = "-rabbitMqTLS")
private boolean rabbitMqTLS;
+ /**
+ * Gets the RabbitMQ broker host address.
+ *
+ * @return the RabbitMQ hostname or IP address, or null if not configured
+ */
public String getRabbitMqHost() {
return rabbitMqHost;
}
+ /**
+ * Gets the RabbitMQ broker port number.
+ *
+ * @return the RabbitMQ port number, or 0 if not configured (uses RabbitMQ defaults: 5672 for
+ *     plain, 5671 for TLS)
+ */
public int getRabbitMqPort() {
return rabbitMqPort;
}
+ /**
+ * Gets the RabbitMQ authentication username.
+ *
+ * @return the username for RabbitMQ authentication, or null if not configured
+ */
public String getRabbitMqUser() {
return rabbitMqUser;
}
+ /**
+ * Gets the RabbitMQ authentication password.
+ *
+ * Security Note: Consider using rabbitMqPassFile for production deployments
+ * to avoid exposing passwords in command-line arguments.
+ *
+ * @return the password for RabbitMQ authentication, or null if not configured
+ */
public String getRabbitMqPass() {
return rabbitMqPass;
}
+ /**
+ * Gets the path to the RabbitMQ password file.
+ *
+ * This provides a more secure alternative to specifying passwords directly in command-line
+ * arguments by reading the password from a file.
+ *
+ * @return the path to the password file, or null if not configured
+ */
public String getRabbitMqPassFile() {
return rabbitMqPassFile;
}
+ /**
+ * Checks if TLS encryption is enabled for RabbitMQ connections.
+ *
+ * When TLS is enabled, all communication between the application and RabbitMQ broker is
+ * encrypted. This typically requires connecting to port 5671 instead of the default port 5672.
+ *
+ * @return true if TLS is enabled, false otherwise
+ */
public boolean isRabbitMqTLS() {
return rabbitMqTLS;
}
+ /**
+ * Sets the RabbitMQ broker host address.
+ *
+ * @param rabbitMqHost the RabbitMQ hostname or IP address
+ */
public void setRabbitMqHost(String rabbitMqHost) {
this.rabbitMqHost = rabbitMqHost;
}
+ /**
+ * Sets the RabbitMQ broker port number.
+ *
+ * @param rabbitMqPort the RabbitMQ port number (typically 5672 for plain or 5671 for TLS)
+ */
public void setRabbitMqPort(int rabbitMqPort) {
this.rabbitMqPort = rabbitMqPort;
}
+ /**
+ * Sets the RabbitMQ authentication username.
+ *
+ * @param rabbitMqUser the username for RabbitMQ authentication
+ */
public void setRabbitMqUser(String rabbitMqUser) {
this.rabbitMqUser = rabbitMqUser;
}
+ /**
+ * Sets the RabbitMQ authentication password.
+ *
+ * @param rabbitMqPass the password for RabbitMQ authentication
+ */
public void setRabbitMqPass(String rabbitMqPass) {
this.rabbitMqPass = rabbitMqPass;
}
+ /**
+ * Sets the path to the RabbitMQ password file.
+ *
+ * @param rabbitMqPassFile the path to the file containing the RabbitMQ password
+ */
public void setRabbitMqPassFile(String rabbitMqPassFile) {
this.rabbitMqPassFile = rabbitMqPassFile;
}
+ /**
+ * Sets whether TLS encryption should be used for RabbitMQ connections.
+ *
+ * @param rabbitMqTLS true to enable TLS encryption, false for plain connections
+ */
public void setRabbitMqTLS(boolean rabbitMqTLS) {
this.rabbitMqTLS = rabbitMqTLS;
}
diff --git a/src/main/java/de/rub/nds/crawler/constant/CruxListNumber.java b/src/main/java/de/rub/nds/crawler/constant/CruxListNumber.java
index 8eafb0e..0efd885 100644
--- a/src/main/java/de/rub/nds/crawler/constant/CruxListNumber.java
+++ b/src/main/java/de/rub/nds/crawler/constant/CruxListNumber.java
@@ -8,13 +8,71 @@
*/
package de.rub.nds.crawler.constant;
+/**
+ * Enumeration of supported Chrome UX Report (CrUX) target list sizes for distributed TLS scanning.
+ *
+ * The CruxListNumber enum defines predefined target list sizes available from the Chrome User
+ * Experience Report dataset. These lists contain popular websites ranked by real user traffic
+ * patterns, providing realistic target sets for TLS security evaluations.
+ *
+ * List Sizes: the available sizes range from the top 1,000 to the top 1,000,000 sites;
+ * smaller lists suit quick evaluations, larger lists full-scale measurement studies.
+ *
+ * Usage Example:
+ *
+ * {@code
+ * CruxListProvider provider = new CruxListProvider(CruxListNumber.TOP_10K);
+ * List<String> targets = provider.getTargetList();
+ * }
+ *
+ * Used by CruxListProvider to configure target list sizes. Part of the ITargetListProvider system
+ * for scan target management.
+ */
public enum CruxListNumber {
+ /** Top 1,000 most popular websites from Chrome UX Report data. */
TOP_1k(1000),
+ /** Top 5,000 most popular websites from Chrome UX Report data. */
TOP_5K(5000),
+ /** Top 10,000 most popular websites from Chrome UX Report data. */
TOP_10K(10000),
+ /** Top 50,000 most popular websites from Chrome UX Report data. */
TOP_50K(50000),
+ /** Top 100,000 most popular websites from Chrome UX Report data. */
TOP_100K(100000),
+ /** Top 500,000 most popular websites from Chrome UX Report data. */
TOP_500k(500000),
+ /** Top 1,000,000 most popular websites from Chrome UX Report data. */
TOP_1M(1000000);
private final int number;
@@ -23,6 +81,11 @@ public enum CruxListNumber {
this.number = number;
}
+ /**
+ * Returns the numeric value representing the number of targets in this list size.
+ *
+ * @return the number of targets (e.g., 1000 for TOP_1k, 10000 for TOP_10K)
+ */
public int getNumber() {
return number;
}
}
diff --git a/src/main/java/de/rub/nds/crawler/constant/JobStatus.java b/src/main/java/de/rub/nds/crawler/constant/JobStatus.java
index fe6d26d..051b8fb 100644
--- a/src/main/java/de/rub/nds/crawler/constant/JobStatus.java
+++ b/src/main/java/de/rub/nds/crawler/constant/JobStatus.java
@@ -8,6 +8,60 @@
*/
package de.rub.nds.crawler.constant;
+/**
+ * Enumeration of possible scan job execution statuses in the TLS-Crawler distributed system.
+ *
+ * The JobStatus enum categorizes the final outcome of scan job processing, providing detailed
+ * status information for monitoring, debugging, and result analysis. Each status indicates both the
+ * execution outcome and whether it represents an error condition.
+ *
+ * Usage in Monitoring:
+ *
+ * {@code
+ * // Error rate calculation
+ * long errorCount = results.stream()
+ *         .map(ScanResult::getJobStatus)
+ *         .filter(JobStatus::isError)
+ *         .count();
+ *
+ * // Status-specific handling
+ * switch (jobStatus) {
+ *     case SUCCESS -> processResult(result);
+ *     case UNRESOLVABLE -> logDNSIssue(target);
+ *     case ERROR -> reportError(error);
+ * }
+ * }
+ *
+ * Used by the ScanJobDescription.getStatus() and ScanResult.getJobStatus() methods. Set during
+ * processing by the Worker.handleScanJob(ScanJobDescription) method.
+ */
public enum JobStatus {
/** Job is waiting to be executed. */
TO_BE_EXECUTED(false),
@@ -42,6 +96,22 @@ public enum JobStatus {
this.isError = isError;
}
+ /**
+ * Determines whether this status represents an error condition.
+ *
+ * This method categorizes job statuses into successful and error states for monitoring and
+ * reporting purposes. Error states indicate problems that prevented normal scan completion,
+ * while non-error states represent successful processing (even if no data was obtained).
+ *
+ * @return true if this status represents an error condition, false otherwise
+ */
public boolean isError() {
return isError;
}
}
diff --git a/src/main/java/de/rub/nds/crawler/core/BulkScanWorker.java b/src/main/java/de/rub/nds/crawler/core/BulkScanWorker.java
--- a/src/main/java/de/rub/nds/crawler/core/BulkScanWorker.java
+++ b/src/main/java/de/rub/nds/crawler/core/BulkScanWorker.java
+/**
+ * Abstract base class for bulk scan workers that execute scans for individual targets.
+ *
+ * This class provides the framework for implementing specific scanner workers that can process
+ * multiple scan targets concurrently. It handles the lifecycle management, thread pool
+ * coordination, and resource cleanup for scanning operations.
+ *
+ * Implementations must provide the scan logic itself (scan) as well as worker-specific
+ * initialization and cleanup (initInternal, cleanupInternal).
+ *
+ * Thread Safety: This class is designed to be thread-safe and can handle
+ * multiple concurrent scan requests. The initialization and cleanup methods are synchronized to
+ * prevent race conditions.
+ *
+ * Resource Management: The worker automatically manages its lifecycle,
+ * performing initialization on first use and cleanup when no active jobs remain.
+ *
+ * @param <T> the type of scan configuration used by this worker
+ */
+ /**
+ * Executor that runs individual scans so that they can be subjected to a timeout.
+ *
+ * and concurrent execution of multiple scans.
*/
private final ThreadPoolExecutor timeoutExecutor;
+ /**
+ * Creates a new BulkScanWorker with the specified configuration and thread pool size.
+ *
+ * @param bulkScanId the identifier of the bulk scan this worker belongs to
+ * @param scanConfig the scan configuration containing scan parameters
+ * @param parallelScanThreads the number of threads to use for parallel scanning
+ */
protected BulkScanWorker(String bulkScanId, T scanConfig, int parallelScanThreads) {
this.bulkScanId = bulkScanId;
this.scanConfig = scanConfig;
@@ -47,6 +103,21 @@ protected BulkScanWorker(String bulkScanId, T scanConfig, int parallelScanThread
new NamedThreadFactory("crawler-worker: scan executor"));
}
+ /**
+ * Performs the scan of the specified target. It is invoked by the worker's handle method,
+ * which manages the complete lifecycle of a scan request.
+ *
+ * This method must be implemented by concrete worker classes to provide the specific
+ * scanning logic for their scanner type.
+ *
+ * @param scanTarget the target to scan
+ * @return a MongoDB document containing the scan results
+ */
public abstract Document scan(ScanTarget scanTarget);
+ /**
+ * Initializes the worker if not already initialized.
+ *
+ * This method ensures thread-safe initialization using double-checked locking. Only one
+ * thread will perform the actual initialization, while others will wait for completion.
+ *
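+ * A simplified sketch of the pattern (the flag name is illustrative):
+ *
+ * {@code
+ * if (!initialized.get()) {
+ *     synchronized (this) {
+ *         if (!initialized.getAndSet(true)) {
+ *             initInternal();
+ *             return true;
+ *         }
+ *     }
+ * }
+ * return false;
+ * }
+ *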
+ * @return true if this call performed the initialization, false if already initialized
+ */
public final boolean init() {
// synchronize such that no thread runs before being initialized
// but only synchronize if not already initialized
@@ -77,6 +165,15 @@ public final boolean init() {
return false;
}
+ /**
+ * Cleans up the worker resources if no jobs are currently active.
+ *
+ * This method performs thread-safe cleanup using synchronization to prevent race conditions
+ * with initialization and active jobs. If jobs are still running, cleanup is deferred until all
+ * jobs complete.
+ *
+ * @return true if cleanup was performed, false if deferred or already cleaned up
+ */
public final boolean cleanup() {
// synchronize such that init and cleanup do not run simultaneously
// but only synchronize if already initialized
@@ -98,7 +195,19 @@ public final boolean cleanup() {
return false;
}
+ /**
+ * Performs worker-specific initialization.
+ *
+ * This method is called once during the worker's lifecycle and should set up any resources
+ * needed for scanning operations.
+ */
protected abstract void initInternal();
+ /**
+ * Performs worker-specific cleanup.
+ *
+ * This method is called when the worker is being shut down and should release any resources
+ * allocated during initialization.
+ */
protected abstract void cleanupInternal();
}
diff --git a/src/main/java/de/rub/nds/crawler/core/BulkScanWorkerManager.java b/src/main/java/de/rub/nds/crawler/core/BulkScanWorkerManager.java
index d9df6cb..4861882 100644
--- a/src/main/java/de/rub/nds/crawler/core/BulkScanWorkerManager.java
+++ b/src/main/java/de/rub/nds/crawler/core/BulkScanWorkerManager.java
@@ -22,10 +22,68 @@
import org.apache.logging.log4j.Logger;
import org.bson.Document;
+/**
+ * Singleton manager for bulk scan workers that handles worker lifecycle and caching.
+ *
+ * This class implements a caching mechanism for {@link BulkScanWorker} instances to optimize
+ * resource usage in distributed scanning operations. Workers are cached by bulk scan ID and
+ * automatically cleaned up after periods of inactivity.
+ *
+ * Key responsibilities: creating workers on demand, caching them per bulk scan ID, and
+ * cleaning them up after a period of inactivity.
+ *
+ * Thread Safety: This class is thread-safe and can handle concurrent worker
+ * requests from multiple threads. The underlying Guava cache provides the necessary synchronization
+ * guarantees.
+ *
+ * Usage Example:
+ *
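+ * A minimal sketch (thread counts are illustrative):
+ *
+ * {@code
+ * Future<Document> result =
+ *         BulkScanWorkerManager.handleStatic(scanJobDescription, 20, 8);
+ * }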
+ */
+ /**
+ * Returns the singleton BulkScanWorkerManager instance.
+ *
+ * This method implements lazy initialization of the singleton instance. The instance is
+ * created on first access and reused for subsequent calls.
+ *
+ * @return the singleton BulkScanWorkerManager instance
+ */
public static BulkScanWorkerManager getInstance() {
if (instance == null) {
instance = new BulkScanWorkerManager();
@@ -33,6 +91,18 @@ public static BulkScanWorkerManager getInstance() {
return instance;
}
+ /**
+ * Static convenience method for handling scan jobs without explicit instance management.
+ *
+ * This method provides a simplified interface for processing scan jobs by automatically
+ * obtaining the singleton instance and delegating to the instance method.
+ *
+ * @param scanJobDescription the scan job to execute
+ * @param parallelConnectionThreads the number of threads for connection management
+ * @param parallelScanThreads the number of threads for parallel scanning
+ * @return a Future representing the scan operation result
+ * @see #handle(ScanJobDescription, int, int)
+ */
public static Future<Document> handleStatic(
ScanJobDescription scanJobDescription,
int parallelConnectionThreads,
int parallelScanThreads) {
return getInstance()
.handle(scanJobDescription, parallelConnectionThreads, parallelScanThreads);
}
+ /**
+ * Creates the manager and initializes the worker cache.
+ *
+ *
+ * Workers are evicted from the cache after a period of inactivity and cleaned up on removal.
+ */
+ /**
+ * Returns the worker for the given bulk scan, creating and caching it if necessary.
+ *
+ * This method implements the core caching logic for worker management.
+ *
+ * Thread Safety: This method is thread-safe and can be called concurrently.
+ * The cache handles synchronization of worker creation.
+ *
+ * @param bulkScanId the unique identifier of the bulk scan
+ * @param scanConfig the scan configuration for creating new workers
+ * @param parallelConnectionThreads the number of threads for connection management
+ * @param parallelScanThreads the number of threads for parallel scanning
+ * @return the cached or newly created bulk scan worker
+ * @throws UncheckedException if worker creation fails
+ */
public BulkScanWorker<?> getBulkScanWorker(
String bulkScanId,
ScanConfig scanConfig,
@@ -79,6 +183,28 @@ public BulkScanWorker> getBulkScanWorker(
}
}
+ /**
+ * Handles a scan job by obtaining the appropriate worker and executing the scan.
+ *
+ * This method orchestrates the complete scan job execution.
+ *
+ * The method leverages worker caching to ensure efficient resource utilization across
+ * multiple scan jobs belonging to the same bulk scan operation.
+ *
+ * @param scanJobDescription the scan job containing target and configuration information
+ * @param parallelConnectionThreads the number of threads for connection management
+ * @param parallelScanThreads the number of threads for parallel scanning
+ * @return a Future representing the scan operation result as a MongoDB document
+ * @throws UncheckedException if worker creation or initialization fails
+ * @see ScanJobDescription
+ * @see BulkScanWorker#handle(de.rub.nds.crawler.data.ScanTarget)
+ */
public Future<Document> handle(
ScanJobDescription scanJobDescription,
int parallelConnectionThreads,
int parallelScanThreads) {
diff --git a/src/main/java/de/rub/nds/crawler/core/Controller.java b/src/main/java/de/rub/nds/crawler/core/Controller.java
--- a/src/main/java/de/rub/nds/crawler/core/Controller.java
+++ b/src/main/java/de/rub/nds/crawler/core/Controller.java
+/**
+ * Controller for TLS-Crawler bulk scan campaigns.
+ *
+ * Central coordination component managing TLS scanning campaigns. Uses Quartz scheduler for
+ * timing, integrates with orchestration providers for job distribution, and supports progress
+ * monitoring.
+ *
+ * @see ControllerCommandConfig
+ * @see PublishBulkScanJob
+ * @see IOrchestrationProvider
+ * @see IPersistenceProvider
+ */
public class Controller {
private static final Logger LOGGER = LogManager.getLogger();
+ /** Provider for distributing scan jobs to worker instances. */
private final IOrchestrationProvider orchestrationProvider;
+
+ /** Provider for scan result storage and retrieval. */
private final IPersistenceProvider persistenceProvider;
+
+ /** Configuration containing controller parameters and scheduling options. */
private final ControllerCommandConfig config;
+
+ /** Optional provider for filtering prohibited scan targets. */
private IDenylistProvider denylistProvider;
+ /**
+ * Creates a new Controller with the specified configuration and providers.
+ *
+ * This constructor initializes the controller with all necessary dependencies for
+ * orchestrating bulk scanning operations. If a denylist file is specified in the configuration,
+ * a denylist provider is automatically created.
+ *
+ * @param config the controller configuration containing scheduling and scan parameters
+ * @param orchestrationProvider the provider for distributing scan jobs to workers
+ * @param persistenceProvider the provider for storing and retrieving scan results
+ */
public Controller(
ControllerCommandConfig config,
IOrchestrationProvider orchestrationProvider,
@@ -45,6 +74,31 @@ public Controller(
}
}
+ /**
+ * Starts the controller and begins scheduling bulk scan operations.
+ *
+ * This method performs the complete initialization and startup sequence.
+ *
+ * Progress Monitoring: If monitoring is enabled in the configuration, a
+ * {@link ProgressMonitor} is created to track scan progress and send notifications.
+ *
+ * Automatic Shutdown: The scheduler is configured to automatically shut
+ * down when all scheduled jobs complete execution.
+ *
+ * @throws RuntimeException if scheduler initialization or startup fails
+ * @see ControllerCommandConfig#isMonitored()
+ * @see PublishBulkScanJob
+ * @see ProgressMonitor
+ */
public void start() {
ITargetListProvider targetListProvider = config.getTargetListProvider();
@@ -82,6 +136,21 @@ public void start() {
}
}
+ /**
+ * Creates the appropriate schedule builder based on configuration.
+ *
+ * This method determines the scheduling strategy: a recurring cron-based schedule when a
+ * cron interval is configured, otherwise a schedule that fires only once.
+ */
+ /**
+ * Shuts down the scheduler once no trigger can fire again.
+ *
+ * This utility method provides graceful scheduler shutdown by checking the state of all
+ * registered triggers. The scheduler is shut down only when no triggers are capable of firing
+ * again, indicating that all scheduled work is complete.
+ *
+ * Each registered trigger is checked via Trigger.mayFireAgain().
+ *
+ * Error Handling: If trigger state cannot be determined due to scheduler
+ * exceptions, the trigger is conservatively treated as still active to prevent premature
+ * shutdown.
+ *
+ * @param scheduler the Quartz scheduler to potentially shut down
+ * @see Scheduler#shutdown()
+ * @see Trigger#mayFireAgain()
+ */
public static void shutdownSchedulerIfAllTriggersFinalized(Scheduler scheduler) {
try {
boolean allTriggersFinalized =
diff --git a/src/main/java/de/rub/nds/crawler/core/ProgressMonitor.java b/src/main/java/de/rub/nds/crawler/core/ProgressMonitor.java
index 5965801..813c670 100644
--- a/src/main/java/de/rub/nds/crawler/core/ProgressMonitor.java
+++ b/src/main/java/de/rub/nds/crawler/core/ProgressMonitor.java
@@ -29,9 +29,52 @@
import org.quartz.SchedulerException;
/**
- * The ProgressMonitor keeps track of the progress of the running bulk scans. It consumes the done
- * notifications from the workers and counts for each bulk scan how many scans are done, how many
- * timed out and how many results were written to the DB.
+ * Real-time progress monitoring system for TLS-Crawler bulk scanning operations.
+ *
+ * The ProgressMonitor provides comprehensive tracking and reporting of bulk scan progress by
+ * consuming completion notifications from worker instances. It maintains detailed statistics,
+ * calculates performance metrics, and provides estimated completion times for running scans.
+ *
+ * It consumes done notifications from the workers, maintains per-status job counters, and
+ * computes a moving average of scan durations from which estimated completion times are
+ * derived.
+ *
+ * Status Categories: Tracks completion status including SUCCESS, EMPTY,
+ * TIMEOUT, ERROR, SERIALIZATION_ERROR, and INTERNAL_ERROR for detailed failure analysis.
+ *
+ * Notification Integration: Supports HTTP POST notifications with
+ * JSON-serialized BulkScan objects for external system integration and workflow automation.
+ *
+ * @see BulkScanJobCounters
+ * @see IOrchestrationProvider
+ * @see IPersistenceProvider
+ * @see DoneNotificationConsumer
+ * @see JobStatus
*/
public class ProgressMonitor {
@@ -47,6 +90,29 @@ public class ProgressMonitor {
private boolean listenerRegistered;
+ /**
+ * Creates a new progress monitor with required dependencies for scan tracking.
+ *
+ * This constructor initializes the progress monitoring system with the necessary components
+ * for tracking bulk scan progress, managing job counters, and coordinating with the distributed
+ * scanning infrastructure.
+ *
+ * Initialization: Sets up the internal job counter map and prepares the
+ * monitor for tracking multiple concurrent bulk scan operations.
+ *
+ * @param orchestrationProvider the provider for worker communication and notifications
+ * @param persistenceProvider the provider for database operations and result storage
+ * @param scheduler the Quartz scheduler for controller lifecycle management
+ */
public ProgressMonitor(
IOrchestrationProvider orchestrationProvider,
IPersistenceProvider persistenceProvider,
@@ -57,6 +123,30 @@ public ProgressMonitor(
this.scheduler = scheduler;
}
+ /**
+ * Inner class that implements completion notification consumption for individual bulk scans.
+ *
+ * This class handles the real-time processing of scan job completion notifications,
+ * maintaining performance metrics, calculating ETAs, and providing detailed progress logging
+ * for a specific bulk scan operation.
+ *
+ * Logging Features: Provides comprehensive progress logging including
+ * completion counts, performance metrics, status breakdowns, and estimated completion times.
+ *
+ * @see DoneNotificationConsumer
+ * @see BulkScan
+ * @see BulkScanJobCounters
+ */
private class BulkscanMonitor implements DoneNotificationConsumer {
private final BulkScan bulkScan;
private final BulkScanJobCounters counters;
@@ -64,12 +154,37 @@ private class BulkscanMonitor implements DoneNotificationConsumer {
private double movingAverageDuration = -1;
private long lastTime = System.currentTimeMillis();
+ /**
+ * Creates a new bulk scan monitor for the specified scan and counters.
+ *
+ * @param bulkScan the bulk scan to monitor
+ * @param counters the job counters for tracking completion statistics
+ */
public BulkscanMonitor(BulkScan bulkScan, BulkScanJobCounters counters) {
this.bulkScan = bulkScan;
this.counters = counters;
this.bulkScanId = bulkScan.get_id();
}
+ /**
+ * Formats a time duration in milliseconds into a human-readable string.
+ *
+ * This method provides adaptive time formatting that automatically selects the most
+ * appropriate time unit based on the magnitude of the duration.
+ *
+ */
+ /**
+ * Consumes a done notification for a completed scan job and updates progress statistics.
+ *
+ * This method implements the core progress tracking logic, updating job counters,
+ * calculating performance metrics, logging progress information, and determining when the
+ * bulk scan is complete.
+ *
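+ * A sketch of the moving-average and ETA update (the weighting factor is illustrative):
+ *
+ * {@code
+ * long duration = System.currentTimeMillis() - lastTime;
+ * movingAverageDuration = 0.9 * movingAverageDuration + 0.1 * duration;
+ * long etaMillis = (long) (movingAverageDuration * remainingJobs);
+ * }
+ *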
+ * When the last job of a bulk scan completes, monitoring is stopped and the bulk scan is
+ * finalized.
+ */
+ /**
+ * Starts monitoring the progress of a bulk scan.
+ *
+ * This method sets up real-time progress tracking for the specified bulk scan by creating
+ * job counters, registering notification consumers, and preparing the monitoring infrastructure
+ * for scan job completion notifications.
+ *
+ * Job counters are created for the scan and a done-notification consumer is registered with
+ * the orchestration provider.
*
- * @param bulkScan that should be monitored
+ *
+ * Note: The listener registration is performed only once per
+ * ProgressMonitor instance to avoid duplicate registrations.
+ *
+ * @param bulkScan the bulk scan operation to monitor for progress
+ * @see BulkScanJobCounters
+ * @see BulkscanMonitor
+ * @see IOrchestrationProvider#registerDoneNotificationConsumer(BulkScan,
+ * DoneNotificationConsumer)
*/
public void startMonitoringBulkScanProgress(BulkScan bulkScan) {
final BulkScanJobCounters counters = new BulkScanJobCounters(bulkScan);
@@ -158,10 +330,39 @@ public void startMonitoringBulkScanProgress(BulkScan bulkScan) {
}
/**
- * Finishes the monitoring, updates the bulk scan in DB, sends HTTP notification if configured
- * and shuts the controller down if all bulk scans are finished.
+ * Finalizes a completed bulk scan and performs cleanup operations.
+ *
+ * This method handles the complete finalization workflow when a bulk scan reaches
+ * completion, including database updates, notification delivery, resource cleanup, and
+ * controller shutdown coordination.
*
- * @param bulkScanId of the bulk scan for which the monitoring should be stopped.
+ * Automatic Shutdown: When all monitored bulk scans complete and the
+ * scheduler is shut down, automatically closes orchestration provider connections for clean
+ * termination.
+ *
+ * @param bulkScanId the unique identifier of the bulk scan to finalize
+ * @see #notify(BulkScan)
+ * @see IPersistenceProvider#updateBulkScan(BulkScan)
+ * @see IOrchestrationProvider#closeConnection()
*/
public void stopMonitoringAndFinalizeBulkScan(String bulkScanId) {
LOGGER.info("BulkScan '{}' is finished", bulkScanId);
@@ -209,11 +410,35 @@ public void stopMonitoringAndFinalizeBulkScan(String bulkScanId) {
}
/**
- * Sends an HTTP POST request containing the bulk scan object as json as body to the url that is
- * specified for the bulk scan.
+ * Sends an HTTP POST notification with bulk scan completion data.
+ *
+ * This method implements the HTTP notification feature for external system integration. It
+ * serializes the completed BulkScan object as JSON and sends it via HTTP POST to the configured
+ * notification URL.
+ *
+ * JSON Serialization: Uses Jackson ObjectMapper with default
+ * pretty-printing to create a comprehensive JSON representation including all scan metadata,
+ * statistics, and results.
+ *
+ * HTTP Client: Uses Java 11+ HttpClient for modern, efficient HTTP
+ * communication with automatic connection management.
*
- * @param bulkScan for which a done notification request should be sent
- * @return body of the http response as string
+ * @param bulkScan the completed bulk scan to send notification for
+ * @return the HTTP response body as a string
+ * @throws IOException if network communication fails
+ * @throws InterruptedException if the HTTP request is interrupted
+ * @see ObjectMapper
+ * @see HttpClient
+ * @see HttpRequest
*/
private static String notify(BulkScan bulkScan) throws IOException, InterruptedException {
ObjectMapper objectMapper = new ObjectMapper();
diff --git a/src/main/java/de/rub/nds/crawler/core/Worker.java b/src/main/java/de/rub/nds/crawler/core/Worker.java
index 1608e10..67fb2dc 100644
--- a/src/main/java/de/rub/nds/crawler/core/Worker.java
+++ b/src/main/java/de/rub/nds/crawler/core/Worker.java
@@ -21,8 +21,81 @@
import org.bson.Document;
/**
- * Worker that subscribe to scan job queue, initializes thread pool and submits received scan jobs
- * to thread pool.
+ * Distributed TLS-Crawler worker instance responsible for consuming scan jobs and executing TLS
+ * scans.
+ *
+ * The Worker forms the core execution unit of the TLS-Crawler distributed scanning architecture.
+ * It consumes scan job messages from the orchestration provider (typically RabbitMQ), executes TLS
+ * scans using configurable thread pools, and persists results to the database. Each worker instance
+ * can handle multiple concurrent scan jobs while providing comprehensive error handling and timeout
+ * management.
+ *
+ * Scan jobs are consumed from the message queue, executed in a dedicated thread pool under a
+ * configurable timeout, and their outcomes are categorized via JobStatus before persistence.
+ *
+ * Resource Safety: The worker ensures proper resource cleanup through thread
+ * pool management, graceful shutdown handling, and comprehensive exception catching to prevent
+ * resource leaks in long-running distributed environments.
+ *
+ * @see WorkerCommandConfig
+ * @see IOrchestrationProvider
+ * @see IPersistenceProvider
+ * @see BulkScanWorkerManager
+ * @see ScanJobDescription
+ * @see ScanResult
+ * @see JobStatus
*/
public class Worker {
private static final Logger LOGGER = LogManager.getLogger();
@@ -38,11 +111,29 @@ public class Worker {
private final ThreadPoolExecutor workerExecutor;
/**
- * TLS-Crawler constructor.
+ * Creates a new TLS-Crawler worker with the specified configuration and providers.
*
- * @param commandConfig The config for this worker.
- * @param orchestrationProvider A non-null orchestration provider.
- * @param persistenceProvider A non-null persistence provider.
+ * This constructor initializes the worker with all necessary components for distributed TLS
+ * scanning operations. It extracts configuration parameters from the command config and sets up
+ * the thread pool executor for result handling.
+ *
+ * Configuration Extraction: The constructor extracts key parameters from
+ * the WorkerCommandConfig including thread counts and timeout values for scan execution.
+ *
+ * @param commandConfig the worker configuration containing thread counts and timeout settings
+ * @param orchestrationProvider the provider for message queue communication and job consumption
+ * @param persistenceProvider the provider for database operations and result storage
+ * @throws NullPointerException if any parameter is null
*/
public Worker(
WorkerCommandConfig commandConfig,
@@ -64,11 +155,62 @@ public Worker(
new NamedThreadFactory("crawler-worker: result handler"));
}
+ /**
+ * Starts the worker by registering for scan job consumption from the orchestration provider.
+ *
+ * This method initiates the worker's primary function by subscribing to the scan job queue.
+ * The orchestration provider will begin delivering scan jobs to this worker's handleScanJob
+ * method based on the configured parallel scan thread count.
+ *
+ * Post-Start Behavior: After calling this method, the worker will begin
+ * receiving and processing scan jobs asynchronously until the application shuts down or the
+ * orchestration provider connection is closed.
+ */
public void start() {
this.orchestrationProvider.registerScanJobConsumer(
this::handleScanJob, this.parallelScanThreads);
}
+ /**
+ * Waits for scan completion and handles timeout scenarios with graceful cancellation.
+ *
+ * This method implements the core timeout and cancellation logic for scan jobs. It waits for
+ * the scan to complete within the configured timeout period, and if the timeout is exceeded, it
+ * attempts graceful cancellation before enforcing a final deadline.
+ *
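+ * A simplified sketch of the pattern (not the exact implementation):
+ *
+ * {@code
+ * try {
+ *     return scanFuture.get(scanTimeout, TimeUnit.MILLISECONDS);
+ * } catch (TimeoutException e) {
+ *     scanFuture.cancel(true);
+ *     // record the job as timed out
+ * }
+ * }
+ *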
+ * The completed (or timed-out) scan is converted into a ScanResult for persistence.
+ */
+ /**
+ * Handles a scan job received from the orchestration provider.
+ *
+ * This method serves as the main entry point for scan job processing. It receives scan job
+ * descriptions from the orchestration provider, delegates the actual scanning to
+ * BulkScanWorkerManager, and submits the result handling to the worker thread pool.
+ *
+ * Result Persistence: All scan results are persisted unless an
+ * InterruptedException occurs, indicating the worker is shutting down and persistence should be
+ * avoided.
+ *
+ * @param scanJobDescription the scan job to process, containing target and configuration
+ * details
+ */
private void handleScanJob(ScanJobDescription scanJobDescription) {
LOGGER.info("Received scan job for {}", scanJobDescription.getScanTarget());
Future<Document> scanResultFuture =
BulkScanWorkerManager.handleStatic(
scanJobDescription, parallelConnectionThreads, parallelScanThreads);
+ /**
+ * Persists the scan result and notifies the orchestration provider that the job is done.
+ *
+ * This method handles the final phase of scan job processing by storing results in the
+ * persistence layer and sending completion notifications to the orchestration provider. It
+ * provides comprehensive error handling to ensure completion notifications are always sent,
+ * even if persistence fails.
+ *
+ * Status Synchronization: The method ensures the ScanJobDescription status
+ * matches the ScanResult status before persistence, maintaining consistency across the system.
+ *
+ * @param scanJobDescription the job description to update and use for notification
+ * @param scanResult the scan result to persist, may be null in error scenarios
+ */
private void persistResult(ScanJobDescription scanJobDescription, ScanResult scanResult) {
try {
if (scanResult != null) {
diff --git a/src/main/java/de/rub/nds/crawler/core/jobs/PublishBulkScanJob.java b/src/main/java/de/rub/nds/crawler/core/jobs/PublishBulkScanJob.java
index 1459b1a..ebc0b7e 100644
--- a/src/main/java/de/rub/nds/crawler/core/jobs/PublishBulkScanJob.java
+++ b/src/main/java/de/rub/nds/crawler/core/jobs/PublishBulkScanJob.java
@@ -26,10 +26,119 @@
import org.quartz.JobExecutionContext;
import org.quartz.JobExecutionException;
+/**
+ * Quartz job implementation responsible for initializing and publishing bulk scan operations.
+ *
+ * The PublishBulkScanJob serves as the main orchestration component that transforms a bulk scan
+ * configuration into individual scan jobs distributed to worker instances. It handles the complete
+ * job creation workflow including target list processing, filtering, validation, and submission to
+ * the message queue infrastructure.
+ *
+ * Targets are read from the configured provider, filtered against the denylist, resolved via
+ * DNS, and submitted to the message queue as individual scan jobs, while statistics are
+ * recorded on the BulkScan.
+ *
+ * Error Handling: The job implements comprehensive error handling that
+ * categorizes failures into specific JobStatus types (UNRESOLVABLE, DENYLISTED, RESOLUTION_ERROR)
+ * and persists error results for analysis while continuing processing of valid targets.
+ *
+ * Parallel Processing: Uses Java parallel streams for efficient processing of
+ * large target lists, with the JobSubmitter functional interface handling individual target
+ * processing and submission.
+ *
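+ * A sketch of this pattern (the collector shape is illustrative):
+ *
+ * {@code
+ * Map<JobStatus, Long> statusCounts =
+ *         targetList.parallelStream()
+ *                 .map(jobSubmitter)
+ *                 .collect(Collectors.groupingBy(s -> s, Collectors.counting()));
+ * }
+ *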
+ * Monitoring Integration: For monitored scans, sets up ProgressMonitor tracking
+ * and handles the special case where no jobs are submitted (immediate completion).
+ *
+ * @see Job
+ * @see ControllerCommandConfig
+ * @see BulkScan
+ * @see ScanJobDescription
+ * @see ProgressMonitor
+ * @see IOrchestrationProvider
+ * @see ITargetListProvider
+ */
public class PublishBulkScanJob implements Job {
private static final Logger LOGGER = LogManager.getLogger();
+ /**
+ * Creates a new bulk scan job publisher instance.
+ *
+ * Default constructor required by the Quartz scheduler framework. The job execution context
+ * provides all necessary configuration and dependencies at execution time.
+ */
+ public PublishBulkScanJob() {
+ // Default constructor for Quartz scheduler instantiation
+ }
+
+ /**
+ * Executes the bulk scan job creation and publication process.
+ *
+ * This method implements the Quartz Job interface and performs the complete workflow for
+ * transforming a bulk scan configuration into individual scan jobs distributed to workers. It
+ * handles all aspects of job creation including filtering, validation, and submission while
+ * providing comprehensive error handling and statistics collection.
+ *
+ * The merged JobDataMap must contain the controller configuration as well as the
+ * orchestration and persistence providers.
+ *
+ * Error Handling: Any exception during execution is caught, logged, and
+ * converted to a JobExecutionException with unscheduleAllTriggers=true to prevent retry
+ * attempts that would likely fail with the same error.
+ *
+ * @param context the Quartz job execution context containing configuration and providers
+ * @throws JobExecutionException if any error occurs during job execution
+ */
public void execute(JobExecutionContext context) throws JobExecutionException {
try {
JobDataMap data = context.getMergedJobDataMap();
@@ -102,6 +211,35 @@ public void execute(JobExecutionContext context) throws JobExecutionException {
}
}
+ /**
+ * Functional interface implementation for processing individual target strings into scan jobs.
+ *
+ * The JobSubmitter class implements the Function interface to enable parallel processing of
+ * target lists using Java streams. Each instance processes target strings by parsing,
+ * validating, filtering, and either submitting valid jobs or persisting error results.
+ *
+ * Error Persistence: All error cases result in ScanResult objects being
+ * persisted to maintain complete audit trails and enable analysis of filtering effectiveness
+ * and target list quality.
+ */
private static class JobSubmitter implements Function<String, JobStatus> {
+ /**
+ * Processes a single target string, submitting a scan job or persisting an error result.
+ *
+ * This method implements the core target processing logic, handling parsing, validation,
+ * filtering, and job submission or error persistence. It uses the
+ * ScanTarget.fromTargetString method for DNS resolution and denylist checking.
+ *
+ * Multi-target Support: For hostnames that resolve to multiple IP
+ * addresses, multiple ScanJobDescription objects are created and processed. The returned
+ * JobStatus represents the primary outcome, with TO_BE_EXECUTED taking precedence if any
+ * targets were successfully submitted.
+ *
+ * Error Handling: Exceptions during target parsing are caught and
+ * result in RESOLUTION_ERROR status with the exception persisted in the ScanResult for
+ * debugging purposes.
+ *
+ * @param targetString the target string to process (e.g., "example.com:443")
+ * @return the JobStatus indicating how the target was processed (TO_BE_EXECUTED if any
+ * targets were submitted successfully, otherwise the error status)
+ */
@Override
public JobStatus apply(String targetString) {
- ScanJobDescription jobDescription;
- ScanResult errorResult = null;
try {
- var targetInfo =
+ var targetInfoList =
ScanTarget.fromTargetString(targetString, defaultPort, denylistProvider);
- jobDescription =
- new ScanJobDescription(
- targetInfo.getLeft(), bulkScan, targetInfo.getRight());
+
+ boolean hasSuccessfulSubmission = false;
+ JobStatus primaryStatus = JobStatus.RESOLUTION_ERROR;
+
+ for (var targetInfo : targetInfoList) {
+ ScanJobDescription jobDescription =
+ new ScanJobDescription(
+ targetInfo.getLeft(), bulkScan, targetInfo.getRight());
+
+ if (jobDescription.getStatus() == JobStatus.TO_BE_EXECUTED) {
+ orchestrationProvider.submitScanJob(jobDescription);
+ hasSuccessfulSubmission = true;
+ primaryStatus = JobStatus.TO_BE_EXECUTED;
+ } else {
+ ScanResult errorResult = new ScanResult(jobDescription, null);
+ persistenceProvider.insertScanResult(errorResult, jobDescription);
+
+ // Update primary status if we haven't had a successful submission
+ if (!hasSuccessfulSubmission) {
+ primaryStatus = jobDescription.getStatus();
+ }
+ }
+ }
+
+ return primaryStatus;
} catch (Exception e) {
- jobDescription =
+ ScanJobDescription jobDescription =
new ScanJobDescription(
new ScanTarget(), bulkScan, JobStatus.RESOLUTION_ERROR);
- errorResult = ScanResult.fromException(jobDescription, e);
+ String errorContext = "Failed to parse target string: '" + targetString + "'";
+ ScanResult errorResult = ScanResult.fromException(jobDescription, e, errorContext);
LOGGER.error(
"Error while creating ScanJobDescription for target '{}'", targetString, e);
- }
-
- if (jobDescription.getStatus() == JobStatus.TO_BE_EXECUTED) {
- orchestrationProvider.submitScanJob(jobDescription);
- } else {
- if (errorResult == null) {
- errorResult = new ScanResult(jobDescription, null);
- }
persistenceProvider.insertScanResult(errorResult, jobDescription);
+ return JobStatus.RESOLUTION_ERROR;
}
- return jobDescription.getStatus();
}
}
}
diff --git a/src/main/java/de/rub/nds/crawler/data/BulkScan.java b/src/main/java/de/rub/nds/crawler/data/BulkScan.java
index 980c089..8b80366 100644
--- a/src/main/java/de/rub/nds/crawler/data/BulkScan.java
+++ b/src/main/java/de/rub/nds/crawler/data/BulkScan.java
@@ -17,45 +17,95 @@
import java.util.Map;
import javax.persistence.Id;
+/**
+ * Represents a bulk scanning operation with configuration, progress tracking, and metadata.
+ *
+ * Encapsulates large-scale TLS scanning operations with distributed coordination, progress
+ * monitoring, version tracking, and time recording. Designed for MongoDB persistence.
+ *
+ * @see ScanConfig
+ * @see JobStatus
+ * @see ScanTarget
+ */
public class BulkScan implements Serializable {
+ /** Unique identifier for the bulk scan (managed by MongoDB). */
@Id private String _id;
+ /** Human-readable name for the scan operation. */
private String name;
+ /** MongoDB collection name where scan results are stored (auto-generated). */
private String collectionName;
+ /** Configuration parameters for the scanning operation. */
private ScanConfig scanConfig;
+ /** Whether this scan should be monitored for progress updates. */
private boolean monitored;
+ /** Whether the scan operation has completed. */
private boolean finished;
+ /** Start time of the scan operation (epoch milliseconds). */
private long startTime;
+ /** End time of the scan operation (epoch milliseconds). */
private long endTime;
+ /** Total number of targets provided for scanning. */
private int targetsGiven;
+ /** Number of scan jobs successfully published to worker queues. */
private long scanJobsPublished;
+
+ /** Number of targets that failed hostname resolution. */
private long scanJobsResolutionErrors;
+
+ /** Number of targets excluded due to denylist filtering. */
private long scanJobsDenylisted;
+ /** Number of successfully completed scans. */
private int successfulScans;
+ /** Counters for tracking job states during scan execution. */
private Map<JobStatus, Integer> jobStatusCounters;
+ /**
+ * Private no-argument constructor for serialization.
+ *
+ * This constructor is used by serialization frameworks and should not be called directly.
+ */
@SuppressWarnings("unused")
private BulkScan() {}
+ /**
+ * Creates a new BulkScan with the specified configuration and metadata.
+ *
+ * This constructor initializes a new bulk scan operation with version information extracted
+ * from the provided scanner and crawler classes. The collection name is automatically generated
+ * using the scan name and start time.
+ *
+ * @param scannerClass the scanner class to extract version information from
+ * @param crawlerClass the crawler class to extract version information from
+ * @param name the human-readable name for this scan operation
+ * @param scanConfig the scan configuration defining scan parameters
+ * @param startTime the start time in epoch milliseconds
+ * @param monitored whether this scan should be monitored for progress
+ * @param notifyUrl optional URL for completion notifications (may be null)
+ */
public BulkScan(
Class<?> scannerClass,
Class<?> crawlerClass,
@@ -76,140 +126,325 @@ public BulkScan(
this.notifyUrl = notifyUrl;
}
- // Getter naming important for correct serialization, do not change!
+ /**
+ * Gets the unique identifier for this bulk scan.
+ *
+ * Important: Getter naming is critical for MongoDB serialization. Do not
+ * change this method name without considering serialization compatibility.
+ *
+ * @return the MongoDB document ID
+ */
public String get_id() {
return _id;
}
+ /**
+ * Gets the human-readable name of the bulk scan.
+ *
+ * @return the scan name
+ */
public String getName() {
return this.name;
}
+ /**
+ * Gets the MongoDB collection name where scan results are stored.
+ *
+ * The collection name is automatically generated from the scan name and start time in the
+ * format: {name}_{yyyy-MM-dd_HH-mm}
+ *
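+ * For example, a scan named tranco started on 2023-05-01 at 14:30 would use the collection
+ * name tranco_2023-05-01_14-30.
+ *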
+ * @return the collection name for scan results
+ */
public String getCollectionName() {
return this.collectionName;
}
+ /**
+ * Gets the scan configuration for this bulk scan.
+ *
+ * @return the scan configuration containing scan parameters
+ */
public ScanConfig getScanConfig() {
return this.scanConfig;
}
+ /**
+ * Checks whether this bulk scan is being monitored for progress updates.
+ *
+ * @return true if monitoring is enabled, false otherwise
+ */
public boolean isMonitored() {
return this.monitored;
}
+ /**
+ * Checks whether the bulk scan operation has completed.
+ *
+ * A scan is considered finished when all target processing and job publishing has been
+ * completed, regardless of individual job success or failure.
+ *
+ * @return true if the scan is finished, false otherwise
+ */
public boolean isFinished() {
return this.finished;
}
+ /**
+ * Gets the start time of the bulk scan operation.
+ *
+ * @return the start time in epoch milliseconds
+ */
public long getStartTime() {
return this.startTime;
}
+ /**
+ * Gets the end time of the bulk scan operation.
+ *
+ * @return the end time in epoch milliseconds, or 0 if not finished
+ */
public long getEndTime() {
return this.endTime;
}
+ /**
+ * Gets the total number of targets provided for this bulk scan.
+ *
+ * @return the number of targets given
+ */
public int getTargetsGiven() {
return this.targetsGiven;
}
+ /**
+ * Gets the number of scan jobs successfully published to worker queues.
+ *
+ * @return the number of published scan jobs
+ */
public long getScanJobsPublished() {
return this.scanJobsPublished;
}
+ /**
+ * Gets the number of successfully completed scans.
+ *
+ * @return the number of successful scans
+ */
public int getSuccessfulScans() {
return this.successfulScans;
}
+ /**
+ * Gets the notification URL for scan completion callbacks.
+ *
+ * @return the notification URL, or null if not configured
+ */
public String getNotifyUrl() {
return this.notifyUrl;
}
+ /**
+ * Gets the version of the TLS scanner used for this scan.
+ *
+ * @return the scanner version string
+ */
public String getScannerVersion() {
return this.scannerVersion;
}
+ /**
+ * Gets the version of the crawler framework used for this scan.
+ *
+ * @return the crawler version string
+ */
public String getCrawlerVersion() {
return this.crawlerVersion;
}
- // Setter naming important for correct serialization, do not change!
+ /**
+ * Sets the unique identifier for this bulk scan.
+ *
+ * Important: Setter naming is critical for MongoDB serialization. Do not
+ * change this method name without considering serialization compatibility.
+ *
+ * @param _id the MongoDB document ID
+ */
public void set_id(String _id) {
this._id = _id;
}
+ /**
+ * Sets the human-readable name of the bulk scan.
+ *
+ * @param name the scan name
+ */
public void setName(String name) {
this.name = name;
}
+ /**
+ * Sets the MongoDB collection name for scan results.
+ *
+ * @param collectionName the collection name
+ */
public void setCollectionName(String collectionName) {
this.collectionName = collectionName;
}
+ /**
+ * Sets the scan configuration for this bulk scan.
+ *
+ * @param scanConfig the scan configuration
+ */
public void setScanConfig(ScanConfig scanConfig) {
this.scanConfig = scanConfig;
}
+ /**
+ * Sets whether this bulk scan should be monitored for progress updates.
+ *
+ * @param monitored true to enable monitoring, false otherwise
+ */
public void setMonitored(boolean monitored) {
this.monitored = monitored;
}
+ /**
+ * Sets whether the bulk scan operation has completed.
+ *
+ * @param finished true if the scan is finished, false otherwise
+ */
public void setFinished(boolean finished) {
this.finished = finished;
}
+ /**
+ * Sets the start time of the bulk scan operation.
+ *
+ * @param startTime the start time in epoch milliseconds
+ */
public void setStartTime(long startTime) {
this.startTime = startTime;
}
+ /**
+ * Sets the end time of the bulk scan operation.
+ *
+ * @param endTime the end time in epoch milliseconds
+ */
public void setEndTime(long endTime) {
this.endTime = endTime;
}
+ /**
+ * Sets the total number of targets provided for this bulk scan.
+ *
+ * @param targetsGiven the number of targets given
+ */
public void setTargetsGiven(int targetsGiven) {
this.targetsGiven = targetsGiven;
}
+ /**
+ * Sets the number of scan jobs successfully published to worker queues.
+ *
+ * @param scanJobsPublished the number of published scan jobs
+ */
public void setScanJobsPublished(long scanJobsPublished) {
this.scanJobsPublished = scanJobsPublished;
}
+ /**
+ * Sets the number of successfully completed scans.
+ *
+ * @param successfulScans the number of successful scans
+ */
public void setSuccessfulScans(int successfulScans) {
this.successfulScans = successfulScans;
}
+ /**
+ * Sets the notification URL for scan completion callbacks.
+ *
+ * @param notifyUrl the notification URL, or null to disable notifications
+ */
public void setNotifyUrl(String notifyUrl) {
this.notifyUrl = notifyUrl;
}
+ /**
+ * Sets the version of the TLS scanner used for this scan.
+ *
+ * @param scannerVersion the scanner version string
+ */
public void setScannerVersion(String scannerVersion) {
this.scannerVersion = scannerVersion;
}
+ /**
+ * Sets the version of the crawler framework used for this scan.
+ *
+ * @param crawlerVersion the crawler version string
+ */
public void setCrawlerVersion(String crawlerVersion) {
this.crawlerVersion = crawlerVersion;
}
+ /**
+ * Gets the job status counters for tracking scan progress.
+ *
+ * This map contains counters for each {@link JobStatus} value, allowing real-time monitoring
+ * of scan progress and completion rates.
+ *
+ * @return a map of job statuses to their respective counts
+ * @see JobStatus
+ */
public Map<JobStatus, Integer> getJobStatusCounters() {
return this.jobStatusCounters;
}
}
diff --git a/src/main/java/de/rub/nds/crawler/data/BulkScanInfo.java b/src/main/java/de/rub/nds/crawler/data/BulkScanInfo.java
--- a/src/main/java/de/rub/nds/crawler/data/BulkScanInfo.java
+++ b/src/main/java/de/rub/nds/crawler/data/BulkScanInfo.java
+/**
+ * Lightweight, serializable carrier of bulk scan metadata for worker-side job execution.
+ *
+ * The BulkScanInfo class serves as a lightweight, serializable representation of essential bulk
+ * scan metadata that workers need to execute individual scan jobs correctly. It contains only the
+ * core information required for job execution while avoiding the overhead of transmitting the
+ * complete BulkScan object to every worker.
+ *
+ * Contained Information: the bulk scan ID, the scan configuration, and the monitoring
+ * flag; nothing else is transmitted to workers.
+ *
+ * Lifecycle and Usage: created once by the controller when a bulk scan starts, then
+ * embedded in every ScanJobDescription distributed to workers.
+ *
+ * Immutability Guarantee: The class is designed to remain unchanged for the
+ * entire duration of a bulk scan operation, ensuring consistent configuration across all
+ * distributed workers and preventing configuration drift during long-running scans.
+ *
+ * Serialization: Implements Serializable for efficient transmission via message
+ * queues between controller and worker instances in the distributed architecture.
+ *
+ * @see BulkScan
+ * @see ScanConfig
+ * @see ScanJobDescription
*/
public class BulkScanInfo implements Serializable {
+ /** Unique identifier for the bulk scan operation. */
private final String bulkScanId;
+ /** Configuration settings for individual scan jobs within this bulk operation. */
private final ScanConfig scanConfig;
+ /** Flag indicating whether this bulk scan should be monitored for progress tracking. */
private final boolean isMonitored;
+ /**
+ * Creates a new bulk scan info object by extracting essential metadata from a bulk scan.
+ *
+ * This constructor extracts only the core information needed by workers for scan execution,
+ * creating a lightweight representation that can be efficiently serialized and distributed via
+ * message queues.
+ *
+ * Extracted Information: the bulk scan ID, the scan configuration, and the monitoring flag.
+ *
+ * @param bulkScan the bulk scan to extract metadata from
+ */
public BulkScanInfo(BulkScan bulkScan) {
this.bulkScanId = bulkScan.get_id();
this.scanConfig = bulkScan.getScanConfig();
this.isMonitored = bulkScan.isMonitored();
}
+ /**
+ * Gets the unique identifier of the bulk scan operation.
+ *
+ * This ID is used for correlating individual scan job results back to their originating bulk
+ * scan operation and for progress tracking.
+ *
+ * @return the bulk scan unique identifier
+ */
public String getBulkScanId() {
return bulkScanId;
}
+ /**
+ * Gets the scan configuration for this bulk scan operation.
+ *
+ * The scan configuration contains scanner-specific settings and parameters that control how
+ * individual scan jobs should be executed.
+ *
+ * @return the scan configuration object
+ */
public ScanConfig getScanConfig() {
return scanConfig;
}
+ /**
+ * Gets the scan configuration cast to a specific scanner implementation type.
+ *
+ * This method provides type-safe access to scanner-specific configuration implementations,
+ * allowing workers to access configuration details specific to their scanner type without
+ * manual casting.
+ *
+ * Usage Example:
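+ *
+ * A minimal sketch (the concrete configuration type is illustrative):
+ *
+ * MyScanConfig config = bulkScanInfo.getScanConfig(MyScanConfig.class);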
+ *
+ * @param <T> the expected scan configuration subtype
+ * @param clazz the scan configuration class to cast to
+ * @return the scan configuration cast to the requested type
+ */
public <T extends ScanConfig> T getScanConfig(Class<T> clazz) {
return clazz.cast(scanConfig);
}
+ /**
+ * Checks whether this bulk scan is monitored for progress tracking.
+ *
+ * When monitoring is enabled, workers send completion notifications that are used for
+ * progress tracking, performance metrics, and completion callbacks.
+ *
+ * @return true if progress monitoring is enabled, false otherwise
+ */
public boolean isMonitored() {
return isMonitored;
}
diff --git a/src/main/java/de/rub/nds/crawler/data/BulkScanJobCounters.java b/src/main/java/de/rub/nds/crawler/data/BulkScanJobCounters.java
index bfaac3a..fab7020 100644
--- a/src/main/java/de/rub/nds/crawler/data/BulkScanJobCounters.java
+++ b/src/main/java/de/rub/nds/crawler/data/BulkScanJobCounters.java
@@ -13,6 +13,57 @@
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
+/**
+ * Thread-safe job status counters for tracking bulk scan progress and completion statistics.
+ *
+ * The BulkScanJobCounters class provides atomic counting and tracking of scan job completion
+ * status across all worker threads in a distributed TLS scanning operation. It maintains separate
+ * counters for each job status type and provides thread-safe access to progress metrics used by the
+ * monitoring and progress tracking systems.
+ *
+ * Each tracked completion status has its own AtomicInteger counter, alongside an overall
+ * done count used for progress calculations.
+ *
+ * Excluded Status: The TO_BE_EXECUTED status is not tracked as it represents
+ * jobs that haven't completed yet, and this class only tracks completion statistics.
+ *
+ * Performance Metrics: The counters support real-time calculation of completion
+ * rates, error rates, and progress percentages for monitoring dashboards and ETA calculations.
+ *
+ * Memory Efficiency: Uses EnumMap for optimal memory usage and access speed
+ * when dealing with the finite set of JobStatus enum values.
+ *
+ * Used by ProgressMonitor for tracking bulk scan completion statistics.
+ *
+ * @see BulkScan
+ * @see JobStatus
+ * @see AtomicInteger
+ */
public class BulkScanJobCounters {
private final BulkScan bulkScan;
@@ -20,6 +71,19 @@ public class BulkScanJobCounters {
private final AtomicInteger totalJobDoneCount = new AtomicInteger(0);
private final Map<JobStatus, AtomicInteger> jobStatusCounters =
new EnumMap<>(JobStatus.class);
+ /**
+ * Creates a new set of job status counters for the given bulk scan.
+ *
+ * This constructor initializes atomic counters for all completion status types, excluding
+ * TO_BE_EXECUTED which represents jobs that haven't completed yet. Each counter starts at zero
+ * and is thread-safe for concurrent updates.
+ *
+ * Counter Initialization: Creates AtomicInteger instances for each
+ * JobStatus enum value except TO_BE_EXECUTED, ensuring thread-safe access from multiple worker
+ * threads and monitoring components.
+ *
+ * @param bulkScan the bulk scan operation to track counters for
+ */
public BulkScanJobCounters(BulkScan bulkScan) {
this.bulkScan = bulkScan;
for (JobStatus jobStatus : JobStatus.values()) {
@@ -30,10 +94,29 @@ public BulkScanJobCounters(BulkScan bulkScan) {
}
}
+ /**
+ * Gets the bulk scan operation that these counters are tracking.
+ *
+ * @return the associated bulk scan object
+ */
public BulkScan getBulkScan() {
return bulkScan;
}
+ /**
+ * Creates a snapshot copy of all job status counters at the current moment.
+ *
+ * This method provides a thread-safe way to get a consistent view of all counter values
+ * without holding locks. The returned map contains the current count for each job status type
+ * and can be safely used for reporting or persistence without affecting the ongoing counter
+ * updates.
+ *
+ * Thread Safety: While individual counter reads are atomic, the overall
+ * snapshot may not be perfectly consistent if updates occur during iteration. However, this
+ * provides a reasonable approximation for monitoring purposes.
+ *
+ * @return a new EnumMap containing current counter values for all job statuses
+ */
public Map<JobStatus, Integer> getJobStatusCountersCopy() {
EnumMap<JobStatus, Integer> copy = new EnumMap<>(JobStatus.class);
for (Map.Entry<JobStatus, AtomicInteger> entry : jobStatusCounters.entrySet()) {
copy.put(entry.getKey(), entry.getValue().get());
}
return copy;
}
+ /**
+ * Gets the current count for a specific job status.
+ *
+ * This method provides thread-safe access to individual counter values, returning the
+ * current count for the specified job status.
+ *
+ * @param jobStatus the job status type to get the count for
+ * @return the current count for the specified job status
+ * @throws NullPointerException if jobStatus is TO_BE_EXECUTED (not tracked)
+ */
public int getJobStatusCount(JobStatus jobStatus) {
return jobStatusCounters.get(jobStatus).get();
}
+ /**
+ * Atomically increments the counter for a specific job status and returns the new total.
+ *
+ * This method performs two atomic operations: incrementing the specific job status counter
+ * and incrementing the overall completion count. The operations are performed in sequence but
+ * are individually atomic, ensuring thread safety but not perfect consistency between the two
+ * counters at any given instant.
+ *
+ * Usage: Called by workers when scan jobs complete with a specific status,
+ * providing real-time updates for progress monitoring and statistics.
+ *
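+ * A worker-side sketch of the intended call pattern:
+ *
+ * int totalDone = counters.increaseJobStatusCount(JobStatus.SUCCESS);
+ * int successes = counters.getJobStatusCount(JobStatus.SUCCESS);
+ *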
+ * @param jobStatus the job status type to increment
+ * @return the new total count of completed jobs across all status types
+ * @throws NullPointerException if jobStatus is TO_BE_EXECUTED (not tracked)
+ */
public int increaseJobStatusCount(JobStatus jobStatus) {
jobStatusCounters.get(jobStatus).incrementAndGet();
return totalJobDoneCount.incrementAndGet();
diff --git a/src/main/java/de/rub/nds/crawler/data/ErrorContext.java b/src/main/java/de/rub/nds/crawler/data/ErrorContext.java
new file mode 100644
index 0000000..28f64a0
--- /dev/null
+++ b/src/main/java/de/rub/nds/crawler/data/ErrorContext.java
@@ -0,0 +1,108 @@
+/*
+ * TLS-Crawler - A TLS scanning tool to perform large scale scans with the TLS-Scanner
+ *
+ * Copyright 2018-2023 Ruhr University Bochum, Paderborn University, and Hackmanit GmbH
+ *
+ * Licensed under Apache License, Version 2.0
+ * http://www.apache.org/licenses/LICENSE-2.0.txt
+ */
+package de.rub.nds.crawler.data;
+
+/**
+ * Utility class for creating structured error context information in scan results.
+ *
+ * This class provides static methods to generate standardized error context strings that can be
+ * used with {@link ScanResult#fromException(ScanJobDescription, Exception, String)} to provide
+ * detailed debugging information for scan failures.
+ *
+ * The error context strings follow a consistent format to facilitate parsing and analysis of
+ * error patterns across large-scale scan operations. Each context type includes relevant
+ * operational details and failure specifics.
+ *
+ * Context Categories: target parsing, hostname resolution, denylist filtering, and scan
+ * execution failures.
+ *
+ * Usage Example:
+ *
+ * String context = "Failed to parse target string: '" + targetString + "'";
+ * ScanResult result = ScanResult.fromException(jobDescription, e, context);
+ *
+ * @see ScanResult#fromException(ScanJobDescription, Exception, String)
+ */
+public final class ErrorContext {
+
+    /** Private constructor to prevent instantiation of utility class. */
+    private ErrorContext() {}
+}
diff --git a/src/main/java/de/rub/nds/crawler/data/ScanConfig.java b/src/main/java/de/rub/nds/crawler/data/ScanConfig.java
--- a/src/main/java/de/rub/nds/crawler/data/ScanConfig.java
+++ b/src/main/java/de/rub/nds/crawler/data/ScanConfig.java
+/**
+ * Abstract base class for scanner-specific scan configuration.
+ *
+ * The ScanConfig class provides the foundation for scanner-specific configuration in the
+ * TLS-Crawler distributed architecture. It defines common scanning parameters that apply across
+ * different TLS scanner implementations while allowing concrete subclasses to add scanner-specific
+ * configuration options.
+ *
+ * Configuration Parameters: the scanner detail level, the number of reexecutions for
+ * failed scans, and the per-scan timeout.
+ *
+ * Factory Pattern: The abstract createWorker() method implements the factory
+ * pattern, allowing each scanner implementation to create appropriately configured worker instances
+ * that match the scanner's requirements and capabilities.
+ *
+ * Serialization Support: The class implements Serializable and includes a
+ * no-argument constructor for compatibility with serialization frameworks used in distributed
+ * messaging and database persistence.
+ *
+ * Extension Points: subclasses add scanner-specific parameters and implement the
+ * createWorker() factory method.
+ *
+ * Common Usage Pattern: Configuration instances are created by controllers,
+ * serialized and distributed to workers via message queues, then used to create scanner-specific
+ * worker instances that execute the actual TLS scans.
+ *
+ * @see BulkScanWorker
+ * @see ScannerDetail
+ * @see BulkScan
+ */
public abstract class ScanConfig implements Serializable {
+ /** Scanner implementation details and configuration parameters. */
private ScannerDetail scannerDetail;
+ /** Number of retry attempts for failed scan operations. */
private int reexecutions;
+ /** Maximum execution time in milliseconds for individual scan operations. */
private int timeout;
@SuppressWarnings("unused")
private ScanConfig() {}
+ /**
+ * Creates a new scan configuration with the specified parameters.
+ *
+ * This protected constructor is intended for use by subclasses to initialize the common
+ * configuration parameters that apply to all scanner implementations.
+ *
+ * @param scannerDetail the scanner detail level controlling scan comprehensiveness
+ * @param reexecutions the number of retry attempts for failed scans
+ * @param timeout the maximum execution time per scan in milliseconds
+ */
protected ScanConfig(ScannerDetail scannerDetail, int reexecutions, int timeout) {
this.scannerDetail = scannerDetail;
this.reexecutions = reexecutions;
this.timeout = timeout;
}
+ /**
+ * Gets the scanner detail level configuration.
+ *
+ * The scanner detail level controls how comprehensive the TLS scanning should be, affecting
+ * factors like the number of probes executed, the depth of analysis, and the amount of data
+ * collected.
+ *
+ * @return the scanner detail level
+ */
public ScannerDetail getScannerDetail() {
return this.scannerDetail;
}
+ /**
+ * Gets the number of reexecution attempts for failed scans.
+ *
+ * When a scan fails due to network issues or other transient problems, the scanner will
+ * retry the scan up to this many times before marking it as failed.
+ *
+ * @return the number of retry attempts (typically 3)
+ */
public int getReexecutions() {
return this.reexecutions;
}
+ /**
+ * Gets the timeout value for individual scan operations.
+ *
+ * This timeout controls how long the scanner will wait for a single scan to complete before
+ * considering it failed. The timeout applies to the TLS-Scanner execution, not the overall
+ * worker timeout.
+ *
+ * @return the scan timeout in milliseconds (typically 2000ms)
+ */
public int getTimeout() {
return this.timeout;
}
+ /**
+ * Sets the scanner detail level configuration.
+ *
+ * @param scannerDetail the scanner detail level to use
+ */
public void setScannerDetail(ScannerDetail scannerDetail) {
this.scannerDetail = scannerDetail;
}
+ /**
+ * Sets the number of reexecution attempts for failed scans.
+ *
+ * @param reexecutions the number of retry attempts
+ */
public void setReexecutions(int reexecutions) {
this.reexecutions = reexecutions;
}
+ /**
+ * Sets the timeout value for individual scan operations.
+ *
+ * @param timeout the scan timeout in milliseconds
+ */
public void setTimeout(int timeout) {
this.timeout = timeout;
}
+ /**
+ * Factory method for creating scanner-specific worker instances.
+ *
+ * This abstract method must be implemented by subclasses to create appropriate
+ * BulkScanWorker instances that are compatible with their specific scanner implementation. The
+ * worker will use this configuration to control scanning behavior.
+ *
+ * Worker Creation: The created worker should be properly configured with
+ * the scanner implementation, threading parameters, and this configuration instance.
+ *
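+ * A sketch of a typical subclass implementation (class names illustrative, assuming the
+ * factory signature below):
+ *
+ * public class MyScanConfig extends ScanConfig {
+ *     public BulkScanWorker<MyScanConfig> createWorker(
+ *             String bulkScanId, int parallelConnectionThreads, int parallelScanThreads) {
+ *         return new MyBulkScanWorker(bulkScanId, this, parallelScanThreads);
+ *     }
+ * }
+ *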
+ * Threading Parameters: the number of parallel connection threads and the number of
+ * parallel scan threads available to the created worker.
+ *
+ * @param bulkScanId the identifier of the bulk scan the worker will execute jobs for
+ * @param parallelConnectionThreads number of threads used for parallel connections
+ * @param parallelScanThreads number of threads used for parallel scan execution
+ * @return a scanner-specific bulk scan worker configured from this instance
+ */
public abstract BulkScanWorker<? extends ScanConfig> createWorker(
String bulkScanId, int parallelConnectionThreads, int parallelScanThreads);
}
diff --git a/src/main/java/de/rub/nds/crawler/data/ScanJobDescription.java b/src/main/java/de/rub/nds/crawler/data/ScanJobDescription.java
--- a/src/main/java/de/rub/nds/crawler/data/ScanJobDescription.java
+++ b/src/main/java/de/rub/nds/crawler/data/ScanJobDescription.java
+/**
+ * Self-contained description of a single scan job exchanged between controller and workers.
+ *
+ * The ScanJobDescription serves as the primary communication unit between the controller and
+ * worker nodes in the TLS-Crawler system. It encapsulates all information necessary for a worker to
+ * execute a TLS scan and store the results, including the scan target, execution status, database
+ * storage location, and message queue metadata.
+ *
+ * Lifecycle Management: the mutable status field tracks a job from TO_BE_EXECUTED through
+ * its final completion state.
+ *
+ * Message Queue Integration: a transient RabbitMQ delivery tag supports message
+ * acknowledgment after processing.
+ *
+ * Database Storage: the database and collection names identify where results are persisted.
+ *
+ * Immutability: Most fields are final to ensure job definitions remain
+ * consistent throughout processing, with only the status field being mutable to track execution
+ * progress.
+ *
+ * Serialization: The class supports Java serialization for message queue
+ * transport while handling the transient delivery tag appropriately during deserialization.
+ *
+ * @see ScanTarget
+ * @see BulkScanInfo
+ * @see BulkScan
+ * @see JobStatus
+ */
public class ScanJobDescription implements Serializable {
+ /** Target specification containing hostname, IP address, and port information. */
private final ScanTarget scanTarget;
// Metadata
private transient Optional<Long> deliveryTag = Optional.empty();
private final String dbName;
private final String collectionName;
private final BulkScanInfo bulkScanInfo;
private JobStatus status;
+ /**
+ * Creates a new scan job description with explicit storage parameters.
+ *
+ * This constructor allows precise control over where scan results will be stored by
+ * specifying the database name and collection name directly. It's primarily used for advanced
+ * scenarios where custom storage locations are needed.
+ *
+ * @param scanTarget the target host and port to scan
+ * @param bulkScanInfo metadata about the parent bulk scan operation
+ * @param dbName the database name where results should be stored
+ * @param collectionName the collection/table name for result storage
+ * @param status the initial job status (typically TO_BE_EXECUTED)
+ */
public ScanJobDescription(
ScanTarget scanTarget,
BulkScanInfo bulkScanInfo,
@@ -43,6 +119,17 @@ public ScanJobDescription(
this.status = status;
}
+ /**
+ * Creates a new scan job description from a bulk scan configuration.
+ *
+ * This convenience constructor extracts storage configuration from the bulk scan object,
+ * using the bulk scan name as the database name and the bulk scan's collection name for result
+ * storage. This is the most common way to create scan jobs.
+ *
+ * @param scanTarget the target host and port to scan
+ * @param bulkScan the parent bulk scan containing storage and configuration details
+ * @param status the initial job status (typically TO_BE_EXECUTED)
+ */
public ScanJobDescription(ScanTarget scanTarget, BulkScan bulkScan, JobStatus status) {
this(
scanTarget,
@@ -52,6 +139,17 @@ public ScanJobDescription(ScanTarget scanTarget, BulkScan bulkScan, JobStatus st
status);
}
+ /**
+ * Custom deserialization method to properly initialize transient fields.
+ *
+ * This method ensures that the transient deliveryTag field is properly initialized to an
+ * empty Optional after deserialization. The delivery tag is transport-specific and should not
+ * be serialized across message boundaries.
+ *
+ * @param in the object input stream for deserialization
+ * @throws IOException if an I/O error occurs during deserialization
+ * @throws ClassNotFoundException if a class cannot be found during deserialization
+ */
private void readObject(java.io.ObjectInputStream in)
throws IOException, ClassNotFoundException {
// handle deserialization, cf. https://stackoverflow.com/a/3960558
@@ -59,30 +157,80 @@ private void readObject(java.io.ObjectInputStream in)
deliveryTag = Optional.empty();
}
+ /**
+ * Gets the scan target containing the host and port to be scanned.
+ *
+ * @return the scan target specifying what should be scanned
+ */
public ScanTarget getScanTarget() {
return scanTarget;
}
+ /**
+ * Gets the database name where scan results should be stored.
+ *
+ * @return the target database name for result persistence
+ */
public String getDbName() {
return dbName;
}
+ /**
+ * Gets the collection/table name where scan results should be stored.
+ *
+ * @return the target collection name for result persistence
+ */
public String getCollectionName() {
return collectionName;
}
+ /**
+ * Gets the current execution status of this scan job.
+ *
+ * The status tracks the job's progress through its lifecycle from initial creation
+ * (TO_BE_EXECUTED) through completion (SUCCESS, ERROR, etc.).
+ *
+ * @return the current job execution status
+ */
public JobStatus getStatus() {
return status;
}
+ /**
+ * Updates the execution status of this scan job.
+ *
+ * This method is used to track the job's progress as it moves through the execution
+ * pipeline, from queued to running to completed states.
+ *
+ * @param status the new job execution status
+ */
public void setStatus(JobStatus status) {
this.status = status;
}
+ /**
+ * Gets the RabbitMQ delivery tag for message acknowledgment.
+ *
+ * The delivery tag is used by workers to acknowledge message processing back to the RabbitMQ
+ * broker. This ensures reliable message delivery in the distributed system.
+ *
+ * @return the RabbitMQ delivery tag
+ * @throws java.util.NoSuchElementException if no delivery tag has been set
+ */
public long getDeliveryTag() {
return deliveryTag.get();
}
+ /**
+ * Sets the RabbitMQ delivery tag for this job message.
+ *
+ * This method is called by the orchestration provider when a job message is received from
+ * the queue. The delivery tag can only be set once to prevent accidental overwrites that could
+ * break message acknowledgment.
+ *
+ * @param deliveryTag the RabbitMQ delivery tag for message acknowledgment
+ * @throws IllegalStateException if a delivery tag has already been set
+ */
public void setDeliveryTag(Long deliveryTag) {
if (this.deliveryTag.isPresent()) {
throw new IllegalStateException("Delivery tag already set");
@@ -90,6 +238,14 @@ public void setDeliveryTag(Long deliveryTag) {
this.deliveryTag = Optional.of(deliveryTag);
}
+ /**
+ * Gets the bulk scan metadata for this individual job.
+ *
+ * The bulk scan info provides traceability back to the parent bulk scan operation and
+ * contains configuration details needed for job execution.
+ *
+ * @return the bulk scan information object
+ */
public BulkScanInfo getBulkScanInfo() {
return bulkScanInfo;
}
diff --git a/src/main/java/de/rub/nds/crawler/data/ScanResult.java b/src/main/java/de/rub/nds/crawler/data/ScanResult.java
index ebd5de5..d3a797d 100644
--- a/src/main/java/de/rub/nds/crawler/data/ScanResult.java
+++ b/src/main/java/de/rub/nds/crawler/data/ScanResult.java
@@ -14,16 +14,32 @@
import java.util.UUID;
import org.bson.Document;
+/**
+ * Immutable container for TLS scan results and metadata.
+ *
+ * Encapsulates scan outcome including target, status, result data, and traceability. Supports
+ * both successful results and error conditions. Uses Jackson/BSON for persistence.
+ *
+ * @see ScanJobDescription
+ * @see ScanTarget
+ * @see JobStatus
+ * @see BulkScanInfo
+ */
public class ScanResult implements Serializable {
+ /** Unique identifier for this scan result record. */
private String id;
+ /** Identifier of the bulk scan operation that produced this result. */
private final String bulkScan;
+ /** Target specification that was scanned to produce this result. */
private final ScanTarget scanTarget;
+ /** Final execution status indicating success, failure, or error condition. */
private final JobStatus jobStatus;
+ /** MongoDB document containing the actual scan results or error information. */
private final Document result;
private ScanResult(
@@ -35,6 +51,25 @@ private ScanResult(
this.result = result;
}
+ /**
+ * Creates a new scan result from a completed scan job description and result document.
+ *
+ * This is the primary constructor for creating scan results from successful or failed scan
+ * operations. It extracts metadata from the scan job description and associates it with the
+ * result document from the scanning process.
+ *
+ * Status Validation: The constructor validates that the scan job has
+ * completed execution by checking that its status is not TO_BE_EXECUTED. This ensures that only
+ * completed scan jobs are converted to results.
+ *
+ * Metadata Extraction: The constructor extracts key information from the
+ * scan job description including the bulk scan ID, scan target, and execution status to
+ * populate the result object.
+ *
+ * @param scanJobDescription the completed scan job containing metadata and final status
+ * @param result the BSON document containing scan results, may be null for empty results
+ * @throws IllegalArgumentException if the scan job is still in TO_BE_EXECUTED state
+ */
public ScanResult(ScanJobDescription scanJobDescription, Document result) {
this(
scanJobDescription.getBulkScanInfo().getBulkScanId(),
@@ -47,37 +82,159 @@ public ScanResult(ScanJobDescription scanJobDescription, Document result) {
}
}
+ /**
+ * Creates scan result from exception during scan execution.
+ *
+ * Creates structured error document with exception details. Validates scan job is in error
+ * state.
+ *
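+ * The resulting error document contains fields such as (values illustrative):
+ *
+ * exceptionType: "UnknownHostException"
+ * exceptionMessage: "example.com"
+ * exceptionCause: null
+ * timestamp: 1684939200000
+ * targetHostname: "example.com"
+ *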
+ * @param scanJobDescription scan job in error state
+ * @param e exception that caused scan failure
+ * @return ScanResult containing exception details
+ * @throws IllegalArgumentException if scan job not in error state
+ */
public static ScanResult fromException(ScanJobDescription scanJobDescription, Exception e) {
if (!scanJobDescription.getStatus().isError()) {
throw new IllegalArgumentException("ScanJobDescription must be in an error state");
}
Document errorDocument = new Document();
- errorDocument.put("exception", e);
+
+ // Store structured exception information for better analysis and debugging
+ errorDocument.put("exceptionType", e.getClass().getSimpleName());
+ errorDocument.put("exceptionMessage", e.getMessage());
+ errorDocument.put("exceptionCause", e.getCause() != null ? e.getCause().toString() : null);
+ errorDocument.put("timestamp", System.currentTimeMillis());
+
+ // Include target information if available for context
+ ScanTarget target = scanJobDescription.getScanTarget();
+ if (target != null) {
+ errorDocument.put("targetHostname", target.getHostname());
+ errorDocument.put("targetIp", target.getIp());
+ errorDocument.put("targetPort", target.getPort());
+
+ // Include additional error context from the target if available
+ if (target.getErrorMessage() != null) {
+ errorDocument.put("targetErrorMessage", target.getErrorMessage());
+ }
+ if (target.getErrorType() != null) {
+ errorDocument.put("targetErrorType", target.getErrorType());
+ }
+ }
+
+ return new ScanResult(scanJobDescription, errorDocument);
+ }
+
+ /**
+ * Creates scan result from exception with additional error context.
+ *
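+ * A usage sketch mirroring the controller-side error path in this changeset:
+ *
+ * String context = "Failed to parse target string: '" + targetString + "'";
+ * ScanResult result = ScanResult.fromException(jobDescription, e, context);
+ * persistenceProvider.insertScanResult(result, jobDescription);
+ *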
+ * @param scanJobDescription scan job in error state
+ * @param e the exception that caused the scan to fail
+ * @param errorContext additional error context as key-value pairs
+ * @return a new ScanResult containing the exception details and additional context
+ * @throws IllegalArgumentException if the scan job is not in an error state
+ */
+ public static ScanResult fromException(
+ ScanJobDescription scanJobDescription, Exception e, String errorContext) {
+ if (!scanJobDescription.getStatus().isError()) {
+ throw new IllegalArgumentException("ScanJobDescription must be in an error state");
+ }
+ Document errorDocument = new Document();
+
+ // Store structured exception information
+ errorDocument.put("exceptionType", e.getClass().getSimpleName());
+ errorDocument.put("exceptionMessage", e.getMessage());
+ errorDocument.put("exceptionCause", e.getCause() != null ? e.getCause().toString() : null);
+ errorDocument.put("timestamp", System.currentTimeMillis());
+ errorDocument.put("errorContext", errorContext);
+
+ // Include target information if available for context
+ ScanTarget target = scanJobDescription.getScanTarget();
+ if (target != null) {
+ errorDocument.put("targetHostname", target.getHostname());
+ errorDocument.put("targetIp", target.getIp());
+ errorDocument.put("targetPort", target.getPort());
+
+ // Include additional error context from the target if available
+ if (target.getErrorMessage() != null) {
+ errorDocument.put("targetErrorMessage", target.getErrorMessage());
+ }
+ if (target.getErrorType() != null) {
+ errorDocument.put("targetErrorType", target.getErrorType());
+ }
+ }
+
return new ScanResult(scanJobDescription, errorDocument);
}
+ /**
+ * Gets the unique identifier for this scan result.
+ *
+ * The ID is a UUID string that serves as the primary key for database storage and unique
+ * identification of scan results across the system.
+ *
+ * @return the unique ID string for this scan result
+ */
@JsonProperty("_id")
public String getId() {
return this.id;
}
+ /**
+ * Sets the unique identifier for this scan result.
+ *
+ * This method is primarily used by serialization frameworks and database drivers to set the
+ * ID when loading results from persistent storage.
+ *
+ * @param id the unique ID string to assign to this scan result
+ */
@JsonProperty("_id")
public void setId(String id) {
this.id = id;
}
+ /**
+ * Gets the bulk scan ID that this result belongs to.
+ *
+ * This provides traceability back to the bulk scanning campaign that generated this
+ * individual scan result.
+ *
+ * @return the bulk scan ID string
+ */
public String getBulkScan() {
return this.bulkScan;
}
+ /**
+ * Gets the scan target (host and port) that was scanned.
+ *
+ * @return the scan target containing hostname and port information
+ */
public ScanTarget getScanTarget() {
return this.scanTarget;
}
+ /**
+ * Gets the result document containing scan findings or error details.
+ *
+ * For successful scans, this contains the TLS scanner output in BSON format. For failed
+ * scans created via fromException(), this contains exception details. May be null for scans
+ * that completed but produced no results.
+ *
+ * @return the BSON document containing scan results or error information, may be null
+ */
public Document getResult() {
return this.result;
}
+ /**
+ * Gets the final execution status of the scan job.
+ *
+ * This status indicates how the scan completed, including success, various error conditions,
+ * timeouts, and cancellations.
+ *
+ * @return the final job status for this scan result
+ * @see JobStatus
+ */
public JobStatus getResultStatus() {
return jobStatus;
}
diff --git a/src/main/java/de/rub/nds/crawler/data/ScanTarget.java b/src/main/java/de/rub/nds/crawler/data/ScanTarget.java
index b5299b6..5d5e836 100644
--- a/src/main/java/de/rub/nds/crawler/data/ScanTarget.java
+++ b/src/main/java/de/rub/nds/crawler/data/ScanTarget.java
@@ -13,24 +13,39 @@
import java.io.Serializable;
import java.net.InetAddress;
import java.net.UnknownHostException;
+import java.util.ArrayList;
+import java.util.List;
import org.apache.commons.lang3.tuple.Pair;
import org.apache.commons.validator.routines.InetAddressValidator;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
+/**
+ * Represents a target for TLS scanning operations.
+ *
+ * Encapsulates network location (hostname/IP and port) and optional metadata (Tranco ranking).
+ * Supports parsing various string formats: hostnames, IPs (IPv4/IPv6), ports, ranks, and URL
+ * prefixes. Performs hostname resolution and denylist checking.
+ *
+ * @see JobStatus
+ * @see IDenylistProvider
+ */
public class ScanTarget implements Serializable {
private static final Logger LOGGER = LogManager.getLogger();
/**
- * Initializes a ScanTarget object from a string that potentially contains a hostname, an ip, a
- * port, the tranco rank.
+ * Creates ScanTarget(s) from a target string with parsing and validation.
*
- * @param targetString from which to create the ScanTarget object
- * @param defaultPort that used if no port is present in targetString
- * @param denylistProvider which provides info if a host is denylisted
- * @return ScanTarget object
+ * Parses various formats (rank,hostname, URLs, ports), performs hostname resolution, and
+ * checks denylists. Creates separate targets for multi-homed hosts.
+ *
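+ * Example (combined input format illustrative): "42,example.com:8443" would yield a target
+ * with hostname example.com, port 8443, and Tranco rank 42.
+ *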
+ * @param targetString string to parse (hostname, IP, with optional rank/port)
+ * @param defaultPort port to use when none specified
+ * @param denylistProvider optional denylist checker (may be null)
+ * @return list of (ScanTarget, JobStatus) pairs - multiple for multi-homed hosts
+ * @throws NumberFormatException if port or rank parsing fails
*/
-    public static Pair<ScanTarget, JobStatus> fromTargetString(
+    public static List<Pair<ScanTarget, JobStatus>> fromTargetString(
            String targetString, int defaultPort, IDenylistProvider denylistProvider) {
+ /**
+ * Creates an empty scan target.
+ *
+ * All fields will be initialized to default values. This constructor is primarily used for
+ * deserialization and testing purposes.
+ */
public ScanTarget() {}
+ /**
+ * Returns a string representation of the scan target.
+ *
+ * @return the hostname if available, otherwise the IP address
+ */
@Override
public String toString() {
return hostname != null ? hostname : ip;
}
+ /**
+ * Gets the resolved IP address of the target.
+ *
+ * @return the IP address as a string
+ */
public String getIp() {
return this.ip;
}
+ /**
+ * Gets the hostname of the target.
+ *
+ * @return the hostname, or null if the target was specified as an IP address
+ */
public String getHostname() {
return this.hostname;
}
+ /**
+ * Gets the port number for the scan target.
+ *
+ * @return the port number (1-65534)
+ */
public int getPort() {
return this.port;
}
+ /**
+ * Gets the Tranco ranking of the target.
+ *
+ * The Tranco ranking is a research-oriented top sites ranking that provides a more stable
+ * and transparent alternative to other web ranking services.
+ *
+ * @return the Tranco rank, or 0 if not available
+ * @see <a href="https://tranco-list.eu/">Tranco: A Research-Oriented Top Sites Ranking</a>
+ */
public int getTrancoRank() {
return this.trancoRank;
}
+ /**
+ * Sets the IP address of the target.
+ *
+ * @param ip the IP address as a string (IPv4 or IPv6 format)
+ */
public void setIp(String ip) {
this.ip = ip;
}
+ /**
+ * Sets the hostname of the target.
+ *
+ * @param hostname the hostname (may be null if target is IP-only)
+ */
public void setHostname(String hostname) {
this.hostname = hostname;
}
+ /**
+ * Sets the port number for the scan target.
+ *
+ * @param port the port number (should be between 1 and 65534)
+ */
public void setPort(int port) {
this.port = port;
}
+ /**
+ * Sets the Tranco ranking of the target.
+ *
+ * @param trancoRank the Tranco rank (use 0 if not available)
+ */
public void setTrancoRank(int trancoRank) {
this.trancoRank = trancoRank;
}
+
+ /**
+ * Gets the error message associated with this target.
+ *
+ * The error message provides detailed information about why target processing failed,
+ * including specific exception messages, DNS resolution failures, or parsing errors.
+ *
+ * @return the error message, or null if no error occurred
+ */
+ public String getErrorMessage() {
+ return this.errorMessage;
+ }
+
+ /**
+ * Sets the error message for this target.
+ *
+ * @param errorMessage the error message describing the failure
+ */
+ public void setErrorMessage(String errorMessage) {
+ this.errorMessage = errorMessage;
+ }
+
+ /**
+ * Gets the error type classification for this target.
+ *
+ * The error type provides a high-level classification of the failure type, such as
+ * "UnknownHostException", "NumberFormatException", or "DenylistRejection".
+ *
+ * @return the error type, or null if no error occurred
+ */
+ public String getErrorType() {
+ return this.errorType;
+ }
+
+ /**
+ * Sets the error type classification for this target.
+ *
+ * @param errorType the error type classification
+ */
+ public void setErrorType(String errorType) {
+ this.errorType = errorType;
+ }
}
diff --git a/src/main/java/de/rub/nds/crawler/denylist/DenylistFileProvider.java b/src/main/java/de/rub/nds/crawler/denylist/DenylistFileProvider.java
index b480d2f..4ad9e22 100644
--- a/src/main/java/de/rub/nds/crawler/denylist/DenylistFileProvider.java
+++ b/src/main/java/de/rub/nds/crawler/denylist/DenylistFileProvider.java
@@ -26,8 +26,74 @@
import org.apache.logging.log4j.Logger;
/**
- * Reads the specified denylist file. Supports hostnames, ips and complete subnets as denylist
- * entries.
+ * File-based denylist provider supporting hostnames, IP addresses, and CIDR subnet filtering.
+ *
+ * The DenylistFileProvider implements IDenylistProvider by reading filtering rules from a local
+ * text file. It supports multiple entry types to provide comprehensive target filtering
+ * capabilities for compliance, security, and resource management requirements.
+ *
+ * Key features:
+ *
+ * Supported Entry Types:
+ *
+ * File Format: Plain text file with one entry per line. Invalid entries are
+ * silently ignored. Comments and empty lines are processed as invalid entries.
+ *
+ * Example Denylist File:
+ *
+ * Validation and Processing:
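+ * (illustrative entries, using documentation-reserved example addresses)
+ *
+ * example.com
+ * 192.0.2.1
+ * 198.51.100.0/24
+ *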
+ *
+ * Performance Characteristics:
+ *
+ * Thread Safety: The isDenylisted method is synchronized to ensure thread-safe
+ * access during concurrent scanning operations.
+ *
+ * @see IDenylistProvider
+ * @see ScanTarget
+ * @see SubnetUtils
*/
public class DenylistFileProvider implements IDenylistProvider {
@@ -37,6 +103,15 @@ public class DenylistFileProvider implements IDenylistProvider {
private final List<String> denylistedDomains;
+ /**
+ * Creates a new denylist provider that loads entries from the given file.
+ *
+ * The constructor reads and parses the denylist file, categorizing entries by type (domain,
+ * IP, CIDR) and storing them in optimized data structures for fast lookup. File access errors
+ * are logged but don't prevent provider creation.
+ *
+ * @param denylistFilename the path to the denylist file to read
+ */
public DenylistFileProvider(String denylistFilename) {
List<String> denylistFileContent;
diff --git a/src/main/java/de/rub/nds/crawler/denylist/IDenylistProvider.java b/src/main/java/de/rub/nds/crawler/denylist/IDenylistProvider.java
--- a/src/main/java/de/rub/nds/crawler/denylist/IDenylistProvider.java
+++ b/src/main/java/de/rub/nds/crawler/denylist/IDenylistProvider.java
+/**
+ * Provider interface for denylist-based filtering of scan targets.
+ *
+ * The IDenylistProvider defines the contract for target filtering and access control in the
+ * TLS-Crawler system. It enables implementations to block specific hosts, IP ranges, or domains
+ * from being scanned, supporting compliance requirements, ethical scanning practices, and resource
+ * management policies.
+ *
+ * Typical filtering criteria are hostnames, individual IP addresses, and CIDR subnet ranges;
+ * DenylistFileProvider is the standard file-based implementation.
+ *
+ * Integration Points: Denylist providers are typically used during target
+ * processing in PublishBulkScanJob and ScanTarget.fromTargetString() to filter targets before scan
+ * job creation.
+ *
+ * @see ScanTarget
+ * @see ScanTarget#fromTargetString(String, int, IDenylistProvider)
+ * @see DenylistFileProvider
+ */
public interface IDenylistProvider {
+ /**
+ * Determines if a scan target should be excluded from scanning based on denylist rules.
+ *
+ * This method evaluates the provided scan target against the configured denylist criteria
+ * and returns true if the target should be blocked from scanning. The implementation should
+ * consider all relevant target attributes including hostname, IP address, and port when making
+ * the determination.
+ *
+ * Evaluation Criteria: the target's hostname, its resolved IP address, and membership in
+ * any denylisted subnet.
+ *
+ * Performance Considerations: This method may be called frequently during
+ * target processing, so implementations should optimize for fast evaluation, especially with
+ * large denylists.
+ *
+ * Thread Safety: This method must be thread-safe as it will be called
+ * concurrently during parallel target processing.
+ *
+ * @param target the scan target to evaluate against denylist rules
+ * @return true if the target is denylisted and should not be scanned, false otherwise
+ */
boolean isDenylisted(ScanTarget target);
}
diff --git a/src/main/java/de/rub/nds/crawler/orchestration/DoneNotificationConsumer.java b/src/main/java/de/rub/nds/crawler/orchestration/DoneNotificationConsumer.java
index 9af1769..f157aa2 100644
--- a/src/main/java/de/rub/nds/crawler/orchestration/DoneNotificationConsumer.java
+++ b/src/main/java/de/rub/nds/crawler/orchestration/DoneNotificationConsumer.java
@@ -10,8 +10,97 @@
import de.rub.nds.crawler.data.ScanJobDescription;
+/**
+ * Functional interface for consuming scan job completion notifications in distributed TLS scanning.
+ *
+ * The DoneNotificationConsumer defines the contract for controllers and monitoring systems to
+ * receive notifications when scan jobs complete processing. It enables real-time progress tracking,
+ * statistics collection, and completion event handling in the TLS-Crawler distributed architecture.
+ *
+ * As a functional interface it can be implemented with a lambda; it is typically used by
+ * ProgressMonitor instances to collect completion statistics.
+ *
+ * Thread Safety: Implementations must be thread-safe as they may be called
+ * concurrently by multiple message handling threads from the orchestration provider.
+ *
+ * Consumer Tag Usage: The consumer tag parameter identifies the specific
+ * message queue consumer that delivered the notification, useful for debugging and routing.
+ *
+ * Typical Usage:
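+ *
+ * A registration sketch (handler body illustrative):
+ *
+ * orchestrationProvider.registerDoneNotificationConsumer(
+ *         bulkScan,
+ *         (consumerTag, job) -> LOGGER.info("Job done with status {}", job.getStatus()));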
+ *
+ * @see ScanJobDescription
+ */
public interface DoneNotificationConsumer {
+ /**
+ * Consumes a notification that a scan job has completed processing.
+ *
+ * This method is called asynchronously by the orchestration provider when a scan job
+ * completes processing. The implementation should update progress tracking, statistics, and any
+ * monitoring systems based on the completed job information.
+ *
+ * Processing Responsibilities: updating progress counters and statistics, and triggering
+ * completion handling once all jobs of the bulk scan are done.
+ *
+ * Thread Safety: This method may be called concurrently from multiple
+ * threads, so implementations must handle synchronization appropriately.
+ *
+ * Exception Handling: Implementations should catch all exceptions
+ * internally to prevent disruption of the notification delivery system.
+ *
+ * @param consumerTag the message queue consumer tag that delivered this notification
+ * @param scanJobDescription the completed scan job with final status and metadata
+ */
void consumeDoneNotification(String consumerTag, ScanJobDescription scanJobDescription);
}
diff --git a/src/main/java/de/rub/nds/crawler/orchestration/IOrchestrationProvider.java b/src/main/java/de/rub/nds/crawler/orchestration/IOrchestrationProvider.java
index c39f41b..e92ae6e 100644
--- a/src/main/java/de/rub/nds/crawler/orchestration/IOrchestrationProvider.java
+++ b/src/main/java/de/rub/nds/crawler/orchestration/IOrchestrationProvider.java
@@ -12,45 +12,177 @@
import de.rub.nds.crawler.data.ScanJobDescription;
/**
- * Interface for the orchestration provider. Its job is to accept jobs from the controller and to
- * submit them to the worker. The provider may open a connection in its constructor, which must be
- * closed in {@link #closeConnection()}.
+ * Orchestration provider interface for distributed job coordination in TLS-Crawler.
+ *
+ * The IOrchestrationProvider defines the contract for coordinating scan job distribution between
+ * controllers and workers in the TLS-Crawler distributed architecture. It abstracts the underlying
+ * message queue implementation (RabbitMQ, etc.) and provides a reliable communication mechanism for
+ * job submission, consumption, and completion notifications.
+ *
+ * Key responsibilities: job submission, consumer registration, completion notification, and
+ * connection lifecycle management. RabbitMqOrchestrationProvider is the standard
+ * RabbitMQ-backed implementation.
+ *
+ * @see ScanJobDescription
+ * @see BulkScan
+ */
public interface IOrchestrationProvider {
+ /**
+ * Submits a scan job for asynchronous processing by worker nodes.
+ *
+ * This method queues a scan job for processing by worker nodes, using the underlying message
+ * queue system to ensure reliable delivery. The job will be routed to an available worker based
+ * on the provider's load balancing strategy.
+ *
+ * Delivery Behavior: The implementation should ensure that jobs are
+ * persistently queued and will be delivered even if no workers are currently available,
+ * supporting fault-tolerant distributed processing.
*
- * @param scanJobDescription The scan job to be submitted.
+ * @param scanJobDescription the scan job to submit for processing
+ * @throws RuntimeException if the job cannot be submitted (implementation-specific)
*/
void submitScanJob(ScanJobDescription scanJobDescription);
/**
- * Register a scan job consumer. It has to confirm that the job is done using {@link
- * #notifyOfDoneScanJob(ScanJobDescription)}.
+ * Registers a scan job consumer to receive jobs from the orchestration provider.
*
- * @param scanJobConsumer The scan job consumer to be registered.
- * @param prefetchCount Number of unacknowledged jobs that may be sent to the consumer.
+ * This method registers a worker to receive scan jobs from the message queue. The consumer
+ * will be called for each available job, and must acknowledge completion using {@link
+ * #notifyOfDoneScanJob(ScanJobDescription)} to ensure reliable processing.
+ *
+ * Flow Control: The prefetchCount parameter controls how many
+ * unacknowledged jobs can be delivered to this consumer simultaneously, enabling back-pressure
+ * management and preventing worker overload.
+ *
+ * Consumer Lifecycle: The consumer remains active until the connection is
+ * closed or the application terminates. Implementations should handle consumer failures
+ * gracefully and support reregistration.
+ *
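+ * A worker-side registration sketch (handler body illustrative); the consumer must later
+ * acknowledge via notifyOfDoneScanJob(ScanJobDescription):
+ *
+ * orchestrationProvider.registerScanJobConsumer(job -> worker.handle(job), 10);
+ *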
+ * @param scanJobConsumer the functional interface to handle incoming scan jobs
+ * @param prefetchCount maximum number of unacknowledged jobs to deliver simultaneously
+ * @throws RuntimeException if the consumer cannot be registered (implementation-specific)
*/
void registerScanJobConsumer(ScanJobConsumer scanJobConsumer, int prefetchCount);
/**
- * Register a done notification consumer. It is called when a scan job is done.
+ * Registers a completion notification consumer for a specific bulk scan operation.
+ *
+ * This method enables controllers to receive notifications when individual scan jobs within
+ * a bulk scan complete. The consumer will be called for each job completion, enabling real-time
+ * progress tracking and statistics collection.
*
- * @param bulkScan The bulk scan for which the consumer accepts notifications.
- * @param doneNotificationConsumer The done notification consumer to be registered.
+ * Bulk Scan Scope: The consumer is registered specifically for the provided
+ * bulk scan and will only receive notifications for jobs belonging to that bulk scan operation.
+ *
+ * Monitoring Integration: This mechanism is typically used by
+ * ProgressMonitor instances to track scan progress and calculate completion statistics.
+ *
+ * @param bulkScan the bulk scan operation to monitor for completion notifications
+ * @param doneNotificationConsumer the consumer to handle job completion events
+ * @throws RuntimeException if the consumer cannot be registered (implementation-specific)
*/
void registerDoneNotificationConsumer(
BulkScan bulkScan, DoneNotificationConsumer doneNotificationConsumer);
/**
- * Send an acknowledgment that a scan job received by a scan consumer is finished.
+ * Acknowledges completion of a scan job and triggers completion notifications.
+ *
+ * This method performs dual functions: it acknowledges successful processing of a scan job
+ * to the message queue system, and it publishes completion notifications to registered done
+ * notification consumers for progress monitoring.
+ *
+ * Acknowledgment Behavior: The method confirms to the message queue that
+ * the job has been successfully processed and can be removed from the queue, preventing
+ * redelivery to other workers.
*
- * @param scanJobDescription The scan job that is finished. Its status should reflect the status
- * of the results.
+ * Notification Publishing: Simultaneously publishes the completion event to
+ * any registered done notification consumers, enabling real-time progress tracking and
+ * statistics updates.
+ *
+ * Status Consistency: The scan job description's status field should
+ * accurately reflect the final processing outcome before calling this method.
+ *
+ * @param scanJobDescription the completed scan job with final status information
+ * @throws RuntimeException if acknowledgment or notification fails (implementation-specific)
*/
void notifyOfDoneScanJob(ScanJobDescription scanJobDescription);
- /** Close any connection to the orchestration provider, freeing resources. */
+ /**
+ * Closes connections and releases resources used by the orchestration provider.
+ *
+ * This method performs cleanup of all resources including message queue connections, thread
+ * pools, and any other resources allocated during provider operation. It should be called when
+ * the application is shutting down or when the provider is no longer needed.
+ *
+ * Cleanup Responsibilities:
+ *
+ * Thread Safety: This method should be safe to call from any thread and
+ * should handle concurrent calls gracefully.
+ */
void closeConnection();
}
diff --git a/src/main/java/de/rub/nds/crawler/orchestration/RabbitMqOrchestrationProvider.java b/src/main/java/de/rub/nds/crawler/orchestration/RabbitMqOrchestrationProvider.java
index 9f9e144..5dd5fcf 100644
--- a/src/main/java/de/rub/nds/crawler/orchestration/RabbitMqOrchestrationProvider.java
+++ b/src/main/java/de/rub/nds/crawler/orchestration/RabbitMqOrchestrationProvider.java
@@ -32,8 +32,16 @@
import org.apache.logging.log4j.Logger;
/**
- * Provides all methods required for the communication with RabbitMQ for the controller and the
- * worker.
+ * RabbitMQ-based orchestration provider for TLS-Crawler.
+ *
+ * Implements distributed messaging for scan coordination using RabbitMQ. Handles job
+ * distribution, load balancing, progress monitoring, and TLS connections.
+ *
+ * @see IOrchestrationProvider
+ * @see RabbitMqDelegate
+ * @see ScanJobDescription
+ * @see ScanJobConsumer
+ * @see DoneNotificationConsumer
*/
public class RabbitMqOrchestrationProvider implements IOrchestrationProvider {
@@ -54,6 +62,47 @@ public class RabbitMqOrchestrationProvider implements IOrchestrationProvider {
private Set<String> declaredQueues = new HashSet<>();
+ /**
+ * Creates a new RabbitMQ orchestration provider and connects to the broker.
+ *
+ * This constructor performs complete initialization of the RabbitMQ connection including
+ * authentication, TLS setup, and queue declaration. It establishes the foundation for all
+ * subsequent messaging operations.
+ *
+ * Authentication Methods: username/password credentials taken from the RabbitMqDelegate
+ * configuration.
+ *
+ * Security Features: optional TLS-encrypted connections to the broker.
+ *
+ * Thread Management: Uses a named thread factory to ensure proper thread
+ * identification for monitoring and debugging purposes.
+ *
+ * @param rabbitMqDelegate the RabbitMQ configuration containing connection parameters
+ * @throws RuntimeException if connection to RabbitMQ cannot be established
+ * @see RabbitMqDelegate
+ * @see ConnectionFactory
+ */
public RabbitMqOrchestrationProvider(RabbitMqDelegate rabbitMqDelegate) {
ConnectionFactory factory = new ConnectionFactory();
factory.setHost(rabbitMqDelegate.getRabbitMqHost());
@@ -92,6 +141,29 @@ public RabbitMqOrchestrationProvider(RabbitMqDelegate rabbitMqDelegate) {
}
}
+ /**
+ * Gets or creates a notification queue for the specified bulk scan.
+ *
+ * This method implements lazy queue creation for bulk scan completion notifications. Each
+ * bulk scan gets its own dedicated notification queue to enable isolated progress monitoring
+ * without interference between different scanning campaigns.
+ *
+ * Queue Properties: each queue is dedicated to a single bulk scan and declared with the
+ * shared DONE_NOTIFY_QUEUE_PROPERTIES.
+ *
+ * Cleanup Strategy: Queues are automatically deleted by RabbitMQ after 5
+ * minutes of inactivity to prevent resource accumulation from completed scans.
+ *
+ * @param bulkScanId the unique identifier of the bulk scan
+ * @return the notification queue name for the specified bulk scan
+ * @see #DONE_NOTIFY_QUEUE_PROPERTIES
+ */
private String getDoneNotifyQueue(String bulkScanId) {
String queueName = "done-notify-queue_" + bulkScanId;
if (!declaredQueues.contains(queueName)) {
@@ -106,6 +178,30 @@ private String getDoneNotifyQueue(String bulkScanId) {
return queueName;
}
+ /**
+ * Submits a scan job to the RabbitMQ queue for processing by available workers.
+ *
+ * This method publishes scan job descriptions to the main scan job queue where they are
+ * distributed to available worker instances using RabbitMQ's round-robin load balancing. The
+ * method uses Java object serialization for reliable data transmission.
+ *
+ * Publishing Details: jobs are serialized via Java object serialization and published to
+ * the shared scan job queue for round-robin delivery to workers.
+ *
+ * Error Handling: Network and I/O errors are logged but do not throw
+ * exceptions, allowing the controller to continue operating even if some job submissions fail.
+ *
+ * @param scanJobDescription the scan job to submit for processing by workers
+ * @see IOrchestrationProvider#submitScanJob(ScanJobDescription)
+ * @see ScanJobDescription
+ * @see #SCAN_JOB_QUEUE
+ */
@Override
public void submitScanJob(ScanJobDescription scanJobDescription) {
try {
@@ -116,6 +212,40 @@ public void submitScanJob(ScanJobDescription scanJobDescription) {
}
}
+ /**
+ * Registers a consumer to receive and process scan jobs from the RabbitMQ queue.
+ *
+ * This method sets up a worker instance to consume scan jobs from the main queue. It
+ * configures message prefetching, deserialization handling, and error recovery to ensure
+ * reliable job processing.
+ *
+ * Consumer Configuration:
+ *
+ * Message Processing:
+ *
+ * Error Recovery: Malformed or undeserializable messages are rejected and
+ * dropped rather than being requeued, preventing infinite processing loops.
+ *
+ * @param scanJobConsumer the consumer instance that will process received scan jobs
+ * @param prefetchCount the maximum number of unacknowledged messages per worker
+ * @see IOrchestrationProvider#registerScanJobConsumer(ScanJobConsumer, int)
+ * @see ScanJobConsumer
+ * @see ScanJobDescription
+ */
@Override
public void registerScanJobConsumer(ScanJobConsumer scanJobConsumer, int prefetchCount) {
DeliverCallback deliverCallback =
@@ -143,6 +273,24 @@ public void registerScanJobConsumer(ScanJobConsumer scanJobConsumer, int prefetchCount) {
}
}
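A minimal sketch of the consumer wiring described above, assuming the documented reject-and-drop strategy for undeserializable payloads:

    import com.rabbitmq.client.Channel;
    import com.rabbitmq.client.DeliverCallback;
    import de.rub.nds.crawler.data.ScanJobDescription;
    import de.rub.nds.crawler.orchestration.ScanJobConsumer;
    import java.io.IOException;
    import org.apache.commons.lang3.SerializationUtils;

    /** Sketch only; not the project's exact code. */
    final class ConsumerSketch {
        static void register(Channel channel, String queue, ScanJobConsumer consumer, int prefetch)
                throws IOException {
            channel.basicQos(prefetch); // at most `prefetch` unacked messages per worker
            DeliverCallback callback =
                    (consumerTag, delivery) -> {
                        try {
                            ScanJobDescription job =
                                    SerializationUtils.deserialize(delivery.getBody());
                            consumer.consumeScanJob(job);
                        } catch (RuntimeException e) {
                            // Drop (requeue=false) malformed messages to avoid poison loops.
                            channel.basicReject(delivery.getEnvelope().getDeliveryTag(), false);
                        }
                    };
            channel.basicConsume(queue, false, callback, consumerTag -> {});
        }
    }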
+ /**
+ * Sends message acknowledgment to RabbitMQ for the specified delivery tag.
+ *
+ * This private method handles the RabbitMQ message acknowledgment protocol. Acknowledgments
+ * confirm that a message has been successfully processed and can be removed from the queue.
+ *
+ * Acknowledgment Details:
+ *
+ * This method sets up monitoring for bulk scan progress by registering a consumer on the
+ * scan's dedicated notification queue. It enables real-time tracking of scan completion and
+ * progress monitoring.
+ *
+ * Consumer Configuration:
+ *
+ * Monitoring Features:
+ *
+ * Queue Management: The notification queue is created lazily when first
+ * accessed and automatically cleaned up after the scan completes due to TTL configuration.
+ *
+ * @param bulkScan the bulk scan to monitor for completion notifications
+ * @param doneNotificationConsumer the consumer to handle completion notifications
+ * @see IOrchestrationProvider#registerDoneNotificationConsumer(BulkScan,
+ * DoneNotificationConsumer)
+ * @see DoneNotificationConsumer
+ * @see #getDoneNotifyQueue(String)
+ */
@Override
public void registerDoneNotificationConsumer(
BulkScan bulkScan, DoneNotificationConsumer doneNotificationConsumer) {
@@ -170,6 +353,40 @@ public void registerDoneNotificationConsumer(
}
}
+ /**
+ * Notifies completion of a scan job and sends progress notification if monitoring is enabled.
+ *
+ * This method handles the completion workflow for scan jobs by acknowledging the original
+ * message and optionally sending progress notifications for monitored scans. It ensures
+ * reliable message processing and enables progress tracking.
+ *
+ * Completion Workflow:
+ *
+ * Monitoring Integration:
+ *
+ * Error Handling: Message acknowledgment always occurs regardless of
+ * notification success, ensuring scan jobs don't get stuck in the queue due to monitoring
+ * issues.
+ *
+ * @param scanJobDescription the completed scan job to acknowledge and notify
+ * @see IOrchestrationProvider#notifyOfDoneScanJob(ScanJobDescription)
+ * @see #sendAck(long)
+ * @see #getDoneNotifyQueue(String)
+ */
@Override
public void notifyOfDoneScanJob(ScanJobDescription scanJobDescription) {
sendAck(scanJobDescription.getDeliveryTag());
@@ -186,6 +403,35 @@ public void notifyOfDoneScanJob(ScanJobDescription scanJobDescription) {
}
}
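A minimal sketch of the completion workflow described above: acknowledge unconditionally, then publish a notification only for monitored scans. The null check standing in for the monitoring lookup is an assumption:

    import com.rabbitmq.client.Channel;
    import de.rub.nds.crawler.data.ScanJobDescription;
    import java.io.IOException;
    import org.apache.commons.lang3.SerializationUtils;

    /** Sketch only; not the project's exact code. */
    final class DoneNotifySketch {
        static void notifyDone(Channel channel, String doneQueueOrNull, ScanJobDescription job)
                throws IOException {
            // Ack first so the job can never get stuck because of monitoring issues.
            channel.basicAck(job.getDeliveryTag(), false);
            if (doneQueueOrNull != null) { // only monitored bulk scans have a notify queue
                channel.basicPublish(
                        "", doneQueueOrNull, null, SerializationUtils.serialize(job));
            }
        }
    }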
+ /**
+ * Closes the RabbitMQ connection and associated resources.
+ *
+ * This method performs clean shutdown of the RabbitMQ connection by closing the channel and
+ * connection in the proper order. It handles potential errors during shutdown gracefully to
+ * ensure resources are released.
+ *
+ * Shutdown Sequence:
+ *
+ * Resource Management:
+ *
+ * Error Handling: Shutdown errors are logged but do not prevent the method
+ * from completing, ensuring that cleanup attempts continue even if some resources fail to
+ * close.
+ *
+ * @see IOrchestrationProvider#closeConnection()
+ */
@Override
public void closeConnection() {
try {
diff --git a/src/main/java/de/rub/nds/crawler/orchestration/ScanJobConsumer.java b/src/main/java/de/rub/nds/crawler/orchestration/ScanJobConsumer.java
index 628b0ee..85c511a 100644
--- a/src/main/java/de/rub/nds/crawler/orchestration/ScanJobConsumer.java
+++ b/src/main/java/de/rub/nds/crawler/orchestration/ScanJobConsumer.java
@@ -10,8 +10,88 @@
import de.rub.nds.crawler.data.ScanJobDescription;
+/**
+ * Functional interface for consuming scan jobs from the orchestration provider in distributed TLS
+ * scanning.
+ *
+ * The ScanJobConsumer defines the contract for worker instances to receive and process scan jobs
+ * from the message queue system. It serves as the callback mechanism that enables asynchronous job
+ * processing in the TLS-Crawler distributed architecture.
+ *
+ * Key characteristics:
+ *
+ * Implementation Pattern:
+ *
+ * Thread Safety: Implementations must be thread-safe as they may be called
+ * concurrently by the orchestration provider's message handling threads.
+ *
+ * Error Handling: Implementations should handle all exceptions internally and
+ * ensure proper acknowledgment even in error scenarios to prevent message redelivery issues.
+ *
+ * Typical Usage:
+ *
+ * This method is called asynchronously by the orchestration provider when a scan job becomes
+ * available for processing. The implementation must handle the complete job lifecycle including
+ * execution, result storage, and acknowledgment.
+ *
+ * Processing Responsibilities:
+ *
+ * Thread Safety: This method may be called concurrently from multiple
+ * threads, so implementations must be thread-safe or handle synchronization appropriately.
+ *
+ * Exception Handling: Implementations should catch all exceptions
+ * internally and not allow them to propagate, as uncaught exceptions may disrupt the message
+ * queue processing loop.
+ *
+ * @param scanJobDescription the scan job to process, containing target and configuration
+ * details
+ */
void consumeScanJob(ScanJobDescription scanJobDescription);
}
diff --git a/src/main/java/de/rub/nds/crawler/persistence/IPersistenceProvider.java b/src/main/java/de/rub/nds/crawler/persistence/IPersistenceProvider.java
index 50e3626..06cc97e 100644
--- a/src/main/java/de/rub/nds/crawler/persistence/IPersistenceProvider.java
+++ b/src/main/java/de/rub/nds/crawler/persistence/IPersistenceProvider.java
@@ -11,33 +11,163 @@
import de.rub.nds.crawler.data.BulkScan;
import de.rub.nds.crawler.data.ScanJobDescription;
import de.rub.nds.crawler.data.ScanResult;
+import java.util.List;
/**
- * Persistence provider interface. Exposes methods to write out the different stages of a task to a
- * file/database/api.
+ * Persistence provider interface for database operations in the TLS-Crawler distributed
+ * architecture.
+ *
+ * The IPersistenceProvider defines the contract for storing and retrieving scan data throughout
+ * the TLS-Crawler workflow. It abstracts the underlying storage implementation (MongoDB, file
+ * system, etc.) and provides a consistent interface for controllers and workers to persist scan
+ * metadata, results, and progress information.
+ *
+ * Key responsibilities:
+ *
+ * Implementation Requirements:
+ *
+ * Storage Workflow:
+ *
+ * Data Relationships:
+ *
+ * Common Implementations:
+ *
+ * This method stores the complete outcome of a scan job execution, including the scan
+ * findings, execution status, and metadata for traceability. The implementation must ensure the
+ * result is correctly linked to its parent bulk scan.
+ *
+ * Storage Requirements:
+ *
+ * Thread Safety: This method must be thread-safe as it will be called
+ * concurrently by multiple worker threads processing scan jobs.
*
- * @param scanResult The scan result to insert.
- * @param job The job that was used to create the scan result.
+ * @param scanResult the scan result containing findings and execution status
+ * @param job the job description containing metadata and configuration details
+ * @throws RuntimeException if the result cannot be persisted (implementation-specific)
*/
void insertScanResult(ScanResult scanResult, ScanJobDescription job);
/**
- * Insert a bulk scan into the database. This is used to store metadata about the bulk scan.
- * This adds an ID to the bulk scan.
+ * Creates a new bulk scan record in the database and assigns a unique identifier.
*
- * @param bulkScan The bulk scan to insert.
+ * This method initializes a bulk scan operation by persisting its configuration and metadata
+ * to the database. The implementation must generate and assign a unique ID to the bulk scan
+ * object, which will be used to correlate individual scan results.
+ *
+ * Initialization Responsibilities:
+ *
+ * ID Generation: The implementation must ensure the generated ID is unique
+ * across all bulk scans and suitable for use as a foreign key reference in scan result records.
+ *
+ * @param bulkScan the bulk scan object to persist (ID will be assigned)
+ * @throws RuntimeException if the bulk scan cannot be created (implementation-specific)
*/
void insertBulkScan(BulkScan bulkScan);
/**
- * Update a bulk scan in the database. This updated the whole bulk scan.
+ * Updates an existing bulk scan record with current progress and statistics.
+ *
+ * This method replaces the existing bulk scan record with updated information, typically
+ * called to record progress updates, final statistics, or completion status. The bulk scan ID
+ * must remain unchanged during updates.
+ *
+ * Update Scenarios:
+ *
+ * Consistency Requirements: The implementation should ensure that updates
+ * are atomic and maintain data consistency, especially when called concurrently with scan
+ * result insertions.
*
- * @param bulkScan The bulk scan to update.
+ * @param bulkScan the bulk scan object with updated information
+ * @throws RuntimeException if the bulk scan cannot be updated (implementation-specific)
*/
void updateBulkScan(BulkScan bulkScan);
+
+ /**
+ * Retrieves scan results for a specific target hostname or IP address.
+ *
+ * @param dbName the database name where the scan results are stored
+ * @param collectionName the collection name where the scan results are stored
+ * @param target the hostname or IP address to search for
+ * @return a list of scan results matching the target
+ */
+ List<ScanResult> getScanResultsByTarget(String dbName, String collectionName, String target);
}
diff --git a/src/main/java/de/rub/nds/crawler/persistence/MongoPersistenceProvider.java b/src/main/java/de/rub/nds/crawler/persistence/MongoPersistenceProvider.java
--- a/src/main/java/de/rub/nds/crawler/persistence/MongoPersistenceProvider.java
+++ b/src/main/java/de/rub/nds/crawler/persistence/MongoPersistenceProvider.java
+/**
+ * MongoDB-based persistence provider for the TLS-Crawler.
+ *
+ * Provides MongoDB-based storage with separate databases per scan, collection caching, custom
+ * serialization, automatic indexing, and error recovery.
+ *
+ * @see IPersistenceProvider
+ * @see MongoDbDelegate
+ * @see BulkScan
+ * @see ScanResult
+ */
public class MongoPersistenceProvider implements IPersistenceProvider {
private static final Logger LOGGER = LogManager.getLogger();
@@ -54,6 +64,22 @@ public class MongoPersistenceProvider implements IPersistenceProvider {
private static final Set<JsonSerializer<?>> serializers = new HashSet<>();

+ /**
+ * Registers multiple custom JSON serializers for MongoDB document serialization.
+ *
+ * Must be registered before the first MongoPersistenceProvider instance is created.
+ *
+ * This convenience method allows bulk registration of multiple Jackson serializers. All
+ * serializers will be applied during JSON serialization of scan results before storing them in
+ * MongoDB.
+ *
+ * This method delegates to {@link #registerSerializer(JsonSerializer)} for each provided
+ * serializer, maintaining the same registration lifecycle restrictions.
+ *
+ * @param serializers vararg array of JsonSerializers to register for MongoDB serialization
+ * @throws RuntimeException if called after MongoPersistenceProvider initialization
+ * @see #registerSerializer(JsonSerializer)
+ * @see #registerModule(Module...)
+ */
public static void registerSerializer(JsonSerializer<?>... serializers) {
for (JsonSerializer<?> serializer : serializers) {
registerSerializer(serializer);
}
}
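A hypothetical startup wiring showing the register-before-construct lifecycle these methods enforce; the MongoDbDelegate package and the module names are assumptions:

    import com.fasterxml.jackson.databind.module.SimpleModule;
    import com.fasterxml.jackson.datatype.jsr310.JavaTimeModule;
    import de.rub.nds.crawler.config.delegate.MongoDbDelegate;

    /** Sketch only: registration must precede the first provider instance. */
    final class SerializerWiringSketch {
        static MongoPersistenceProvider create(MongoDbDelegate delegate) {
            // After the first MongoPersistenceProvider is constructed, these calls
            // throw a RuntimeException, so do all registration up front.
            MongoPersistenceProvider.registerModule(
                    new JavaTimeModule(), new SimpleModule("crawler-custom"));
            return new MongoPersistenceProvider(delegate);
        }
    }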
+ /**
+ * Registers a custom Jackson module for extended JSON serialization functionality.
+ *
+ * This method allows registration of Jackson modules that extend the ObjectMapper's
+ * serialization capabilities. Modules can provide custom serializers, deserializers, type
+ * handlers, and other Jackson extensions for MongoDB document processing.
+ *
+ * Module Registration:
+ *
+ * This convenience method allows bulk registration of multiple Jackson modules. Each module
+ * will extend the ObjectMapper's serialization capabilities for MongoDB document processing.
+ *
+ * This method delegates to {@link #registerModule(Module)} for each provided module,
+ * maintaining the same registration lifecycle restrictions.
+ *
+ * @param modules vararg array of Jackson Modules to register for enhanced serialization
+ * @throws RuntimeException if called after MongoPersistenceProvider initialization
+ * @see #registerModule(Module)
+ * @see #registerSerializer(JsonSerializer...)
+ */
public static void registerModule(Module... modules) {
for (Module module : modules) {
registerModule(module);
@@ -87,6 +162,36 @@ public static void registerModule(Module... modules) {
private final LoadingCache<Pair<String, String>, JacksonMongoCollection<ScanResult>>
resultCollectionCache;
private JacksonMongoCollection<BulkScan> bulkScanCollection;

+ /**
+ * Creates and configures the MongoDB client from the delegate's connection parameters.
+ *
+ * This static factory method handles the complete MongoDB client setup including connection
+ * string construction, credential management, and client configuration. It supports both direct
+ * password provision and password file reading.
+ *
+ * Connection Configuration:
+ *
+ * Password Handling:
+ *
+ * This static factory method creates a fully configured ObjectMapper that handles the
+ * complex serialization requirements of TLS scan results. The mapper integrates custom
+ * serializers, modules, and specific configuration for MongoDB storage.
+ *
+ * Configuration Features:
+ *
+ * Serialization Strategy:
+ *
+ * This constructor performs complete initialization of the MongoDB persistence layer
+ * including client connection, ObjectMapper configuration, and cache setup. It establishes the
+ * foundation for all subsequent database operations.
+ *
+ * Initialization Sequence:
*
- * @param mongoDbDelegate Mongodb command line configuration parameters
+ * Cache Configuration:
+ *
+ * Error Handling: Connection failures are wrapped in RuntimeException to
+ * ensure proper error propagation during application startup.
+ *
+ * @param mongoDbDelegate MongoDB command line configuration parameters
+ * @throws RuntimeException if MongoDB connection cannot be established
+ * @see MongoDbDelegate
+ * @see #createMapper()
+ * @see #createMongoClient(MongoDbDelegate)
*/
public MongoPersistenceProvider(MongoDbDelegate mongoDbDelegate) {
isInitialized = true;
@@ -175,11 +339,58 @@ public MongoPersistenceProvider(MongoDbDelegate mongoDbDelegate) {
key.getLeft(), key.getRight())));
}
+ /**
+ * Initializes a MongoDB database connection for the specified database name.
+ *
+ * This method is used by the database cache to lazily initialize database connections as
+ * they are requested. It provides the foundation for all database operations within a specific
+ * scan context.
+ *
+ * Database Naming Strategy: Each bulk scan typically uses its own database
+ * to ensure data isolation and simplified management of scan results.
+ *
+ * @param dbName the name of the database to initialize
+ * @return initialized MongoDatabase instance ready for collection operations
+ * @see #databaseCache
+ */
private MongoDatabase initDatabase(String dbName) {
LOGGER.info("Initializing database: {}.", dbName);
return mongoClient.getDatabase(dbName);
}
+ /**
+ * Initializes a MongoDB collection for storing scan results with performance optimization.
+ *
+ * This method is used by the collection cache to lazily initialize collections as they are
+ * requested. It creates properly configured MongoJack collections with automatic indexing for
+ * optimal query performance.
+ *
+ * Collection Configuration:
+ *
+ * Performance Indexing:
+ *
+ * Index Management: Index creation is idempotent, so repeated calls will
+ * not create duplicate indexes.
+ *
+ * @param dbName the database name containing the collection
+ * @param collectionName the name of the collection to initialize
+ * @return configured JacksonMongoCollection ready for scan result storage
+ * @see #resultCollectionCache
+ * @see ScanResult
+ */
private JacksonMongoCollection<ScanResult> initCollection(String dbName, String collectionName) {

+ /**
+ * Gets or lazily creates the MongoDB collection that stores bulk scan metadata.
+ *
+ * This method implements lazy initialization of the bulk scan collection, creating it only
+ * when first accessed. The collection stores high-level information about bulk scanning
+ * operations separate from individual scan results.
+ *
+ * Collection Purpose:
+ *
+ * Singleton Pattern: The collection instance is cached after first creation
+ * to avoid repeated initialization overhead for subsequent access.
+ *
+ * @param dbName the database name containing the bulk scan collection
+ * @return JacksonMongoCollection configured for BulkScan document storage
+ * @see BulkScan
+ * @see #BULK_SCAN_COLLECTION_NAME
+ */
private JacksonMongoCollection<BulkScan> getBulkScanCollection(String dbName) {

+ /**
+ * Inserts a new bulk scan record into the bulk scan collection.
+ *
+ * This method stores the bulk scan metadata in the appropriate database and collection. The
+ * bulk scan document contains configuration, progress tracking, and high-level information
+ * about the scanning campaign.
+ *
+ * Storage Location: The bulk scan is stored in a collection named
+ * "bulkScans" within the database corresponding to the bulk scan's name.
+ *
+ * @param bulkScan the bulk scan metadata to insert into the database
+ * @throws IllegalArgumentException if bulkScan is null
+ * @see IPersistenceProvider#insertBulkScan(BulkScan)
+ * @see BulkScan
+ */
@Override
public void insertBulkScan(@NonNull BulkScan bulkScan) {
this.getBulkScanCollection(bulkScan.getName()).insertOne(bulkScan);
}
+ /**
+ * Updates an existing bulk scan record in the MongoDB collection.
+ *
+ * This method implements a replace strategy for updating bulk scan metadata. It removes the
+ * existing document and inserts the updated version to ensure complete replacement of all
+ * fields.
+ *
+ * Update Strategy:
+ *
+ * Atomicity Consideration: This implementation is not atomic. In production
+ * environments with high concurrency, consider using MongoDB's replaceOne operation for atomic
+ * updates.
+ *
+ * @param bulkScan the updated bulk scan metadata to store in the database
+ * @throws IllegalArgumentException if bulkScan is null
+ * @see IPersistenceProvider#updateBulkScan(BulkScan)
+ * @see #insertBulkScan(BulkScan)
+ */
@Override
public void updateBulkScan(@NonNull BulkScan bulkScan) {
this.getBulkScanCollection(bulkScan.getName()).removeById(bulkScan.get_id());
this.insertBulkScan(bulkScan);
}
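A minimal sketch of the atomic alternative the Javadoc mentions, replacing remove-then-insert with a single replaceOne call:

    // Requires: import com.mongodb.client.model.Filters;
    // Sketch only: an atomic variant of updateBulkScan.
    public void updateBulkScanAtomically(BulkScan bulkScan) {
        this.getBulkScanCollection(bulkScan.getName())
                .replaceOne(Filters.eq("_id", bulkScan.get_id()), bulkScan);
    }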
+ /**
+ * Writes a scan result to the appropriate MongoDB collection.
+ *
+ * This private method handles the actual database insertion of scan results. It uses the
+ * collection cache to obtain the appropriate collection and performs the insertion with logging
+ * for monitoring purposes.
+ *
+ * Collection Resolution: The method uses the collection cache with a
+ * composite key of database name and collection name to obtain the properly configured MongoDB
+ * collection.
+ *
+ * Performance Optimization: Collections are cached to avoid repeated
+ * initialization overhead during high-volume scanning operations.
+ *
+ * @param dbName the database name for the scan result storage
+ * @param collectionName the collection name for the scan result storage
+ * @param scanResult the scan result to write to the database
+ * @see #resultCollectionCache
+ * @see ScanResult
+ */
private void writeResultToDatabase(
String dbName, String collectionName, ScanResult scanResult) {
LOGGER.info(
@@ -234,6 +526,35 @@ private void writeResultToDatabase(
resultCollectionCache.getUnchecked(Pair.of(dbName, collectionName)).insertOne(scanResult);
}
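As an illustration, the new getScanResultsByTarget interface method could be implemented on top of the collection cache shown above; the document field path used in the filter is an assumption:

    // Requires: import com.mongodb.client.model.Filters; import java.util.ArrayList;
    // Sketch only; the actual field layout of ScanResult documents may differ.
    @Override
    public List<ScanResult> getScanResultsByTarget(
            String dbName, String collectionName, String target) {
        return resultCollectionCache
                .getUnchecked(Pair.of(dbName, collectionName))
                .find(Filters.eq("scanTarget.hostname", target))
                .into(new ArrayList<>());
    }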
+ /**
+ * Inserts a scan result into the MongoDB collection with comprehensive error handling.
+ *
+ * This method implements the core persistence logic for individual scan results. It includes
+ * validation, error recovery, and recursive error handling to ensure that scan results are
+ * never lost due to serialization issues.
+ *
+ * Validation: The method validates that the scan result status matches the
+ * job description status to ensure data consistency before insertion.
+ *
+ * Error Recovery Strategy:
+ *
+ * Status Consistency: The method ensures that scan results and job
+ * descriptions maintain consistent status information throughout the persistence process.
+ *
+ * @param scanResult the scan result to insert into the database
+ * @param scanJobDescription the job description containing storage location and status
+ * @throws IllegalArgumentException if result status doesn't match job description status
+ * @see IPersistenceProvider#insertScanResult(ScanResult, ScanJobDescription)
+ * @see ScanResult#fromException(ScanJobDescription, Exception)
+ * @see JobStatus
+ */
@Override
public void insertScanResult(ScanResult scanResult, ScanJobDescription scanJobDescription) {
if (scanResult.getResultStatus() != scanJobDescription.getStatus()) {
diff --git a/src/main/java/de/rub/nds/crawler/targetlist/CruxListProvider.java b/src/main/java/de/rub/nds/crawler/targetlist/CruxListProvider.java
index b979ae8..79e7747 100644
--- a/src/main/java/de/rub/nds/crawler/targetlist/CruxListProvider.java
+++ b/src/main/java/de/rub/nds/crawler/targetlist/CruxListProvider.java
@@ -14,8 +14,51 @@
import java.util.stream.Stream;
/**
- * Target list provider that downloads the most recent crux list (...) and extracts the top x hosts from it.
+ * Chrome UX Report (CrUX) target list provider for distributed TLS scanning operations.
+ *
+ * The CruxListProvider downloads and processes the most recent Chrome User Experience Report
+ * data to extract popular website targets for TLS security scanning. It provides access to
+ * real-world web traffic patterns based on actual Chrome browser usage statistics.
+ *
+ * Key features:
+ *
+ * Data Source: The provider downloads compressed CSV data from the official
+ * CrUX Top Lists repository maintained by zakird on GitHub. This data is updated regularly to
+ * reflect current web usage patterns.
+ *
+ * Processing Pipeline:
+ *
+ * CSV Format: Each line contains "protocol://domain, crux_rank" where the rank
+ * indicates popularity based on Chrome usage statistics.
+ *
+ * Target Selection: Only HTTPS websites with ranks <= configured number are
+ * included, ensuring TLS-capable targets for security scanning.
+ *
+ * Usage Example:
+ *
+ * The constructor configures the provider to download and process the current CrUX data,
+ * extracting up to the specified number of top-ranked HTTPS websites for TLS scanning
+ * operations.
+ *
+ * @param cruxListNumber the desired list size determining maximum number of targets
+ */
public CruxListProvider(CruxListNumber cruxListNumber) {
super(cruxListNumber.getNumber(), SOURCE, ZIP_FILENAME, FILENAME, "Crux");
}
diff --git a/src/main/java/de/rub/nds/crawler/targetlist/ITargetListProvider.java b/src/main/java/de/rub/nds/crawler/targetlist/ITargetListProvider.java
index 5e4662f..311b428 100644
--- a/src/main/java/de/rub/nds/crawler/targetlist/ITargetListProvider.java
+++ b/src/main/java/de/rub/nds/crawler/targetlist/ITargetListProvider.java
@@ -10,7 +10,81 @@
import java.util.List;
+/**
+ * Target list provider interface for supplying scan targets to TLS-Crawler operations.
+ *
+ * The ITargetListProvider defines the contract for obtaining lists of scan targets from various
+ * sources including files, web services, databases, and curated lists. It abstracts the target
+ * acquisition mechanism and provides a consistent interface for controllers to obtain targets for
+ * bulk scanning operations.
+ *
+ * Key responsibilities:
+ *
+ * Target Format:
+ *
+ * Common Implementations:
+ *
+ * Implementation Guidelines:
+ *
+ * Usage Pattern: Target list providers are typically configured based on
+ * command-line arguments and used by controllers during bulk scan initialization to obtain the
+ * complete list of targets for processing.
+ *
+ * @see TargetFileProvider
+ * @see TrancoListProvider
+ * @see CruxListProvider Configured via ControllerCommandConfig.getTargetListProvider() method.
+ */
public interface ITargetListProvider {
+ /**
+ * Retrieves the complete list of scan targets from the configured source.
+ *
+ * This method fetches all available targets from the provider's source and returns them as a
+ * list of string representations. The implementation should handle any necessary data
+ * retrieval, parsing, and formatting to produce valid target strings.
+ *
+ * Target Format: Each string should represent a valid scan target in
+ * hostname[:port] format, suitable for parsing by ScanTarget.fromTargetString().
+ *
+ * Error Handling: Implementations should handle source-specific errors
+ * (network failures, file not found, etc.) and either throw appropriate exceptions or return
+ * empty lists based on the error recovery strategy.
+ *
+ * Performance Considerations: This method may perform expensive operations
+ * like network requests or large file parsing. Consider implementing caching or streaming
+ * strategies for large target lists.
+ *
+ * @return a list of target strings in hostname[:port] format
+ * @throws RuntimeException if targets cannot be retrieved (implementation-specific)
+ */
+ List<String> getTargetList();
}
diff --git a/src/main/java/de/rub/nds/crawler/targetlist/TargetFileProvider.java b/src/main/java/de/rub/nds/crawler/targetlist/TargetFileProvider.java
--- a/src/main/java/de/rub/nds/crawler/targetlist/TargetFileProvider.java
+++ b/src/main/java/de/rub/nds/crawler/targetlist/TargetFileProvider.java
+/**
+ * File-based target list provider that reads scan targets from a local text file.
+ *
+ * The TargetFileProvider implements ITargetListProvider to supply scan targets by reading from a
+ * local text file. It supports common file formats with comment filtering and empty line handling,
+ * making it suitable for managing static target lists in development, testing, and production
+ * environments.
+ *
+ * Key features:
+ *
+ * File Format:
+ *
+ * Example File Content:
+ *
+ * Error Handling: File access errors (file not found, permission denied, I/O
+ * errors) are wrapped in RuntimeException with descriptive messages for troubleshooting.
+ *
+ * Performance Characteristics:
+ *
+ * Usage Example:
+ *
+ * The constructor stores the file path for later use when getTargetList() is called. The
+ * file is not validated or accessed during construction, allowing for flexible deployment
+ * scenarios where the file may be created after the provider is instantiated.
+ *
+ * @param filename the path to the target list file to read
+ */
public TargetFileProvider(String filename) {
this.filename = filename;
}
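A minimal sketch of the read-and-filter behavior the class Javadoc describes (skipping '#' comments and blank lines); the real implementation may differ in error handling:

    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    /** Sketch only; not the project's exact code. */
    final class TargetFileSketch {
        static List<String> readTargets(String filename) {
            try (Stream<String> lines = Files.lines(Paths.get(filename))) {
                return lines.map(String::trim)
                        .filter(line -> !line.isEmpty() && !line.startsWith("#"))
                        .collect(Collectors.toList());
            } catch (IOException e) {
                // Matches the documented contract: wrap I/O failures in a RuntimeException.
                throw new UncheckedIOException("Could not read target file " + filename, e);
            }
        }
    }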
+ /**
+ * Reads and returns the complete list of scan targets from the configured file.
+ *
+ * This method opens the file, reads all lines, and filters out comments (lines starting with
+ * '#') and empty lines. The remaining lines are returned as scan targets in the order they
+ * appear in the file.
+ *
+ * Processing Steps:
+ *
+ * File Format Requirements:
+ *
+ * The TrancoEmailListProvider builds upon existing target list providers (typically Tranco
+ * rankings) to discover and extract mail server hostnames through DNS MX record resolution. This
+ * enables TLS scanning of email infrastructure associated with popular websites.
+ *
+ * Key capabilities:
+ *
+ * Processing Pipeline:
+ *
+ * DNS Resolution: Uses Java's InitialDirContext to perform DNS queries for MX
+ * records. Failed lookups are logged but don't prevent processing of other domains.
+ *
+ * Error Handling:
+ *
+ * Use Cases:
+ *
+ * Usage Example:
+ *
+ * The constructor configures the provider to use any ITargetListProvider implementation as
+ * the source for domain names, which will be queried for MX records to discover associated mail
+ * servers.
+ *
+ * @param trancoList the target list provider to obtain domains from for MX record lookup
+ */
public TrancoEmailListProvider(ITargetListProvider trancoList) {
this.trancoList = trancoList;
}
diff --git a/src/main/java/de/rub/nds/crawler/targetlist/TrancoListProvider.java b/src/main/java/de/rub/nds/crawler/targetlist/TrancoListProvider.java
index 47d8784..483e175 100644
--- a/src/main/java/de/rub/nds/crawler/targetlist/TrancoListProvider.java
+++ b/src/main/java/de/rub/nds/crawler/targetlist/TrancoListProvider.java
@@ -13,8 +13,54 @@
import java.util.stream.Stream;
/**
- * Target list provider that downloads the most recent tranco list (...) and extracts the top x hosts from it.
+ * Tranco ranking target list provider for research-grade TLS scanning operations.
+ *
+ * The TrancoListProvider downloads and processes the most recent Tranco ranking data to extract
+ * popular website targets for TLS security scanning. Tranco provides a research-oriented
+ * alternative to commercial rankings, designed specifically for security and privacy studies.
+ *
+ * Key advantages:
+ *
+ * Data Source: Downloads the top 1 million domain ranking from tranco-list.eu,
+ * which aggregates data from multiple sources including Alexa, Umbrella, Majestic, and Quantcast to
+ * provide robust and manipulation-resistant rankings.
+ *
+ * Processing Characteristics:
+ *
+ * Usage Scenarios:
+ *
+ * Usage Example:
+ *
+ * The constructor configures the provider to download the current Tranco top 1 million
+ * ranking and extract the specified number of highest-ranked domains for scanning.
+ *
+ * @param number the maximum number of domains to extract from the ranking (1 to 1,000,000)
+ */
public TrancoListProvider(int number) {
super(number, SOURCE, ZIP_FILENAME, FILENAME, "Tranco");
}
diff --git a/src/main/java/de/rub/nds/crawler/targetlist/ZipFileProvider.java b/src/main/java/de/rub/nds/crawler/targetlist/ZipFileProvider.java
index ee1419d..053df02 100644
--- a/src/main/java/de/rub/nds/crawler/targetlist/ZipFileProvider.java
+++ b/src/main/java/de/rub/nds/crawler/targetlist/ZipFileProvider.java
@@ -23,15 +23,94 @@
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
+/**
+ * Abstract base class for target list providers that download and extract targets from compressed
+ * archives.
+ *
+ * The ZipFileProvider provides a foundation for implementing target list providers that obtain
+ * scan targets from remote compressed files (ZIP, GZIP). It handles the complete workflow of
+ * downloading, extracting, parsing, and cleaning up temporary files, allowing subclasses to focus
+ * on the specific target extraction logic.
+ *
+ * Key capabilities:
+ *
+ * Processing Workflow:
+ *
+ * Supported Formats:
+ *
+ * Error Handling:
+ *
+ * Performance Considerations:
+ *
+ * Implementation Requirements: Subclasses must implement
+ * getTargetListFromLines() to define how targets are extracted from the decompressed file content.
+ *
+ * Common Subclasses:
+ *
+ * This method implements the complete workflow for obtaining targets from a remote
+ * compressed file. It downloads the file, extracts the content, processes it through the
+ * subclass implementation, and cleans up temporary files.
+ *
+ * Processing Steps:
+ *
+ * Error Recovery: Download and extraction errors are logged but don't
+ * prevent processing from continuing. Cleanup errors are logged but don't affect the returned
+ * target list.
+ *
+ * @return a list of target strings extracted from the compressed file
+ * @throws RuntimeException if the extracted file cannot be read
+ */
public List<String> getTargetList() {

+ /**
+ * Opens a decompression stream for the downloaded archive file.
+ *
+ * This method automatically detects the compression format based on the filename and returns
+ * the appropriate decompression stream. It supports GZIP and ZIP formats.
+ *
+ * @param filename the name of the compressed file to open
+ * @return an InflaterInputStream for reading decompressed content
+ * @throws IOException if the file cannot be opened
+ */
private InflaterInputStream getZipInputStream(String filename) throws IOException {
if (filename.contains(".gz")) {
return new GZIPInputStream(new FileInputStream(filename));
@@ -99,5 +211,24 @@ private InflaterInputStream getZipInputStream(String filename) throws IOException {
}
}
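A minimal sketch of the format detection described above; since both GZIPInputStream and ZipInputStream extend InflaterInputStream, a single return type covers both branches:

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.InflaterInputStream;
    import java.util.zip.ZipInputStream;

    /** Sketch only; the ZIP branch and its entry handling are assumptions. */
    final class ZipStreamSketch {
        static InflaterInputStream open(String filename) throws IOException {
            if (filename.endsWith(".gz")) {
                return new GZIPInputStream(new FileInputStream(filename));
            }
            ZipInputStream zipStream = new ZipInputStream(new FileInputStream(filename));
            zipStream.getNextEntry(); // position the stream at the first archive entry
            return zipStream;
        }
    }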
+ /**
+ * Extracts scan targets from the decompressed file content.
+ *
+ * This abstract method must be implemented by subclasses to define how targets are extracted
+ * from the decompressed file lines. Different target list formats require different parsing
+ * logic.
+ *
+ * Implementation Guidelines:
+ *
+ * The CanceallableThreadPoolExecutor extends ThreadPoolExecutor to use CancellableFuture
+ * instances instead of standard FutureTask objects. This enables tasks to preserve their results
+ * even after being cancelled, which is valuable for timeout scenarios and graceful degradation in
+ * distributed scanning operations.
+ *
+ * Key features:
+ *
+ * Use Cases:
+ *
+ * Behavior: All submitted tasks are wrapped in CancellableFuture instances,
+ * which provide the enhanced cancellation behavior. The executor maintains standard
+ * ThreadPoolExecutor semantics for all other operations.
+ *
+ * @see CancellableFuture
+ * @see ThreadPoolExecutor
+ */
public class CanceallableThreadPoolExecutor extends ThreadPoolExecutor {
+ /**
+ * Creates a new cancellable thread pool executor with basic configuration.
+ *
+ * @param corePoolSize the number of threads to keep in the pool
+ * @param maximumPoolSize the maximum number of threads to allow in the pool
+ * @param keepAliveTime when the number of threads is greater than the core, this is the maximum
+ * time that excess idle threads will wait for new tasks before terminating
+ * @param unit the time unit for the keepAliveTime argument
+ * @param workQueue the queue to use for holding tasks before they are executed
+ */
public CanceallableThreadPoolExecutor(
int corePoolSize,
int maximumPoolSize,
@@ -20,6 +63,17 @@ public CanceallableThreadPoolExecutor(
super(corePoolSize, maximumPoolSize, keepAliveTime, unit, workQueue);
}
+ /**
+ * Creates a new cancellable thread pool executor with custom thread factory.
+ *
+ * @param corePoolSize the number of threads to keep in the pool
+ * @param maximumPoolSize the maximum number of threads to allow in the pool
+ * @param keepAliveTime when the number of threads is greater than the core, this is the maximum
+ * time that excess idle threads will wait for new tasks before terminating
+ * @param unit the time unit for the keepAliveTime argument
+ * @param workQueue the queue to use for holding tasks before they are executed
+ * @param threadFactory the factory to use when the executor creates a new thread
+ */
public CanceallableThreadPoolExecutor(
int corePoolSize,
int maximumPoolSize,
@@ -30,6 +84,18 @@ public CanceallableThreadPoolExecutor(
super(corePoolSize, maximumPoolSize, keepAliveTime, unit, workQueue, threadFactory);
}
+ /**
+ * Creates a new cancellable thread pool executor with custom rejection handler.
+ *
+ * @param corePoolSize the number of threads to keep in the pool
+ * @param maximumPoolSize the maximum number of threads to allow in the pool
+ * @param keepAliveTime when the number of threads is greater than the core, this is the maximum
+ * time that excess idle threads will wait for new tasks before terminating
+ * @param unit the time unit for the keepAliveTime argument
+ * @param workQueue the queue to use for holding tasks before they are executed
+ * @param handler the handler to use when execution is blocked because the thread bounds and
+ * queue capacities are reached
+ */
public CanceallableThreadPoolExecutor(
int corePoolSize,
int maximumPoolSize,
@@ -40,6 +106,19 @@ public CanceallableThreadPoolExecutor(
super(corePoolSize, maximumPoolSize, keepAliveTime, unit, workQueue, handler);
}
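The wrapping behavior the class Javadoc describes is typically achieved by overriding newTaskFor; a sketch, assuming CancellableFuture implements RunnableFuture:

    // Requires: import java.util.concurrent.Callable; import java.util.concurrent.RunnableFuture;
    // Sketch only, inside CanceallableThreadPoolExecutor.
    @Override
    protected <T> RunnableFuture<T> newTaskFor(Callable<T> callable) {
        return new CancellableFuture<>(callable);
    }

    @Override
    protected <T> RunnableFuture<T> newTaskFor(Runnable runnable, T value) {
        return new CancellableFuture<>(runnable, value);
    }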
+ /**
+ * Creates a new cancellable thread pool executor with full configuration options.
+ *
+ * @param corePoolSize the number of threads to keep in the pool
+ * @param maximumPoolSize the maximum number of threads to allow in the pool
+ * @param keepAliveTime when the number of threads is greater than the core, this is the maximum
+ * time that excess idle threads will wait for new tasks before terminating
+ * @param unit the time unit for the keepAliveTime argument
+ * @param workQueue the queue to use for holding tasks before they are executed
+ * @param threadFactory the factory to use when the executor creates a new thread
+ * @param handler the handler to use when execution is blocked because the thread bounds and
+ * queue capacities are reached
+ */
public CanceallableThreadPoolExecutor(
int corePoolSize,
int maximumPoolSize,
diff --git a/src/main/java/de/rub/nds/crawler/util/CancellableFuture.java b/src/main/java/de/rub/nds/crawler/util/CancellableFuture.java
index d7706b1..25f9317 100644
--- a/src/main/java/de/rub/nds/crawler/util/CancellableFuture.java
+++ b/src/main/java/de/rub/nds/crawler/util/CancellableFuture.java
@@ -12,12 +12,61 @@
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicReference;
+/**
+ * Enhanced Future implementation that preserves results even after cancellation.
+ *
+ * The CancellableFuture provides a specialized Future implementation that allows retrieval of
+ * results even after the future has been cancelled. This is particularly useful in scenarios where
+ * partial results are valuable and should not be lost due to timeout or cancellation.
+ *
+ * Key features:
+ *
+ * Cancellation Behavior: Unlike standard FutureTask, this implementation allows
+ * access to the computed result even after the future is cancelled. The result is captured
+ * atomically before the cancellation takes effect.
+ *
+ * Synchronization Mechanism: Uses a Semaphore to coordinate access to results
+ * after cancellation, ensuring thread-safe retrieval without blocking indefinitely.
+ *
+ * Use Cases:
+ *
+ * Thread Safety: All operations are thread-safe through atomic references and
+ * semaphore synchronization. Multiple threads can safely access the future concurrently.
+ *
+ * @param <V> the type of result produced by this future
+ */
public class CancellableFuture<V> implements Future<V> {

+ /**
+ * Creates a cancellable future that executes the given callable.
+ *
+ * The future wraps the callable in a FutureTask that captures the result atomically and
+ * signals completion via semaphore release, enabling result access even after cancellation.
+ *
+ * @param callable the task to execute that produces a result
+ */
public CancellableFuture(Callable<V> callable) {

+ /**
+ * Creates a cancellable future that runs the given runnable and returns the given result.
+ *
+ * The future wraps the runnable in a FutureTask that executes the task and returns the
+ * provided result value, with atomic result capture for post-cancellation access.
+ *
+ * @param runnable the task to execute
+ * @param res the result value to return upon successful completion
+ */
public CancellableFuture(Runnable runnable, V res) {
innerFuture =
new FutureTask<>(
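A self-contained sketch of the capture-then-signal mechanism the Javadoc describes (atomic result capture plus a semaphore that unblocks post-cancellation readers); it is not the project's exact implementation:

    import java.util.concurrent.Callable;
    import java.util.concurrent.FutureTask;
    import java.util.concurrent.Semaphore;
    import java.util.concurrent.atomic.AtomicReference;

    /** Sketch only; a real implementation must also handle tasks cancelled before they run. */
    final class CaptureSketch<V> {
        private final AtomicReference<V> result = new AtomicReference<>();
        private final Semaphore resultAvailable = new Semaphore(0);
        final FutureTask<V> inner;

        CaptureSketch(Callable<V> callable) {
            inner =
                    new FutureTask<>(
                            () -> {
                                V value = callable.call(); // compute
                                result.set(value); // atomic capture
                                resultAvailable.release(); // signal readers
                                return value;
                            });
        }

        /** Returns the captured value even if inner.cancel() was called after the task ran. */
        V getEvenIfCancelled() throws InterruptedException {
            resultAvailable.acquire();
            resultAvailable.release(); // keep the permit for other readers
            return result.get();
        }
    }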
diff --git a/src/test/java/de/rub/nds/crawler/core/ControllerTest.java b/src/test/java/de/rub/nds/crawler/core/ControllerTest.java
index afddf0f..6f49eda 100644
--- a/src/test/java/de/rub/nds/crawler/core/ControllerTest.java
+++ b/src/test/java/de/rub/nds/crawler/core/ControllerTest.java
@@ -40,7 +40,11 @@ void submitting() throws IOException, InterruptedException {
Thread.sleep(1000);
- Assertions.assertEquals(2, orchestrationProvider.jobQueue.size());
+ // With multi-IP hostname support, we expect at least 2 jobs (one per hostname)
+ // but may get more if hostnames resolve to multiple IPs
+ Assertions.assertTrue(
+ orchestrationProvider.jobQueue.size() >= 2,
+ "Expected at least 2 jobs but got " + orchestrationProvider.jobQueue.size());
Assertions.assertEquals(0, orchestrationProvider.unackedJobs.size());
}
}
diff --git a/src/test/java/de/rub/nds/crawler/data/ScanTargetTest.java b/src/test/java/de/rub/nds/crawler/data/ScanTargetTest.java
new file mode 100644
index 0000000..3c8d3f2
--- /dev/null
+++ b/src/test/java/de/rub/nds/crawler/data/ScanTargetTest.java
@@ -0,0 +1,275 @@
+/*
+ * TLS-Crawler - A TLS scanning tool to perform large scale scans with the TLS-Scanner
+ *
+ * Copyright 2018-2023 Ruhr University Bochum, Paderborn University, and Hackmanit GmbH
+ *
+ * Licensed under Apache License, Version 2.0
+ * http://www.apache.org/licenses/LICENSE-2.0.txt
+ */
+package de.rub.nds.crawler.data;
+
+import static org.junit.jupiter.api.Assertions.*;
+
+import de.rub.nds.crawler.constant.JobStatus;
+import java.util.List;
+import org.apache.commons.lang3.tuple.Pair;
+import org.junit.jupiter.api.Test;
+
+/** Tests for ScanTarget parsing functionality, particularly IPv6 address handling. */
+class ScanTargetTest {
+
+ private static final int DEFAULT_PORT = 443;
+
+ @Test
+ void testIPv4AddressWithPort() {
+ List
+ *
+ *
+ * @param scanTarget the target to scan
+ * @return a Future representing the scan operation result
+ */
public Future<Document> handle(ScanTarget scanTarget) {
+ *
+ *
+ *
+ *
+ *
+ *
+ * // Static convenience method
+ * Future<Document> result = BulkScanWorkerManager.handleStatic(
+ * scanJobDescription, 4, 8);
+ *
+ * // Instance usage
+ * BulkScanWorkerManager manager = BulkScanWorkerManager.getInstance();
+ * Future<Document> result = manager.handle(scanJobDescription, 4, 8);
+ *
+ *
+ * @see BulkScanWorker
+ * @see ScanJobDescription
+ * @see ScanConfig
+ */
public class BulkScanWorkerManager {
private static final Logger LOGGER = LogManager.getLogger();
+
+ /** Singleton instance of the worker manager. */
private static BulkScanWorkerManager instance;
+ /**
+ * Gets the singleton instance of the BulkScanWorkerManager.
+ *
+ *
+ *
+ */
private BulkScanWorkerManager() {
bulkScanWorkers =
CacheBuilder.newBuilder()
@@ -58,6 +140,28 @@ private BulkScanWorkerManager() {
.build();
}
+ /**
+ * Gets or creates a bulk scan worker for the specified bulk scan.
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ * @return the appropriate ScheduleBuilder for the configured scheduling strategy
+ * @see ControllerCommandConfig#getScanCronInterval()
+ */
private ScheduleBuilder<?> getScanSchedule() {
if (config.getScanCronInterval() != null) {
return CronScheduleBuilder.cronSchedule(config.getScanCronInterval())
@@ -91,6 +160,30 @@ private ScheduleBuilder<?> getScanSchedule() {
}
}
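A sketch of the branch logic described above using Quartz's schedule builders; the fixed-interval fallback and its getter are assumptions:

    import org.quartz.CronScheduleBuilder;
    import org.quartz.ScheduleBuilder;
    import org.quartz.SimpleScheduleBuilder;
    import org.quartz.Trigger;

    /** Sketch only; not the project's exact code. */
    final class ScheduleSketch {
        static ScheduleBuilder<? extends Trigger> scanSchedule(String cron, int intervalMinutes) {
            if (cron != null) {
                // Cron takes precedence when configured; misfire policy elided here.
                return CronScheduleBuilder.cronSchedule(cron);
            }
            // Fall back to a fixed interval that repeats until shutdown.
            return SimpleScheduleBuilder.simpleSchedule()
                    .withIntervalInMinutes(intervalMinutes)
                    .repeatForever();
        }
    }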
+ /**
+ * Conditionally shuts down the scheduler if all triggers have completed.
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ * @param millis the duration in milliseconds to format
+ * @return formatted time string with appropriate units
+ */
private String formatTime(double millis) {
if (millis < 1000) {
return String.format("%4.0f ms", millis);
@@ -93,6 +208,35 @@ private String formatTime(double millis) {
return String.format("%.1f d", days);
}
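The hunk shows only the millisecond branch and the final day formatting; a complete unit ladder consistent with those fragments might look like this, with the intermediate thresholds as assumptions:

    /** Sketch only; the ms and day branches match the fragments shown above. */
    static String formatTime(double millis) {
        if (millis < 1000) {
            return String.format("%4.0f ms", millis);
        }
        double seconds = millis / 1000;
        if (seconds < 60) {
            return String.format("%4.1f s", seconds);
        }
        double minutes = seconds / 60;
        if (minutes < 60) {
            return String.format("%4.1f min", minutes);
        }
        double hours = minutes / 60;
        if (hours < 24) {
            return String.format("%4.1f h", hours);
        }
        double days = hours / 24;
        return String.format("%.1f d", days);
    }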
+ /**
+ * Processes a scan job completion notification and updates progress metrics.
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ * @param consumerTag the RabbitMQ consumer tag for this notification
+ * @param scanJob the completed scan job description
+ */
@Override
public void consumeDoneNotification(String consumerTag, ScanJobDescription scanJob) {
try {
@@ -141,10 +285,38 @@ public void consumeDoneNotification(String consumerTag, ScanJobDescription scanJ
}
/**
- * Adds a listener for the done notification queue that updates the counters for the bulk scans
- * and checks if a bulk scan is finished.
+ * Initiates progress monitoring for a bulk scan operation.
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ * @param resultFuture the future representing the ongoing scan operation
+ * @param scanJobDescription the job description to update with final status
+ * @return a ScanResult containing the job description and result document
+ * @throws ExecutionException if the scan execution encounters an error
+ * @throws InterruptedException if the current thread is interrupted while waiting
+ * @throws TimeoutException if the scan cannot be cancelled within the grace period
+ */
private ScanResult waitForScanResult(
Future<Document> resultFuture, ScanJobDescription scanJobDescription)
throws ExecutionException, InterruptedException, TimeoutException {
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ * @param bulkScan the source bulk scan to extract metadata from
+ */
public BulkScanInfo(BulkScan bulkScan) {
this.bulkScanId = bulkScan.get_id();
this.scanConfig = bulkScan.getScanConfig();
this.isMonitored = bulkScan.isMonitored();
}
+ /**
+ * Gets the unique identifier of the bulk scan this metadata represents.
+ *
+ *
+ * TlsServerScanConfig tlsConfig = info.getScanConfig(TlsServerScanConfig.class);
+ *
+ *
+ * @param
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ * {@code
+ * try {
+ * // Perform hostname resolution
+ * InetAddress.getAllByName(hostname);
+ * } catch (UnknownHostException e) {
+ * String context = ErrorContext.dnsResolutionFailure(hostname, "A record lookup failed");
+ * ScanResult errorResult = ScanResult.fromException(jobDescription, e, context);
+ * }
+ * }
+ *
+ * @see ScanResult#fromException(ScanJobDescription, Exception, String)
+ * @see de.rub.nds.crawler.constant.JobStatus
+ */
+public final class ErrorContext {
+
+ private ErrorContext() {
+ // Utility class - prevent instantiation
+ }
+
+ /**
+ * Creates error context for DNS resolution failures.
+ *
+ * @param hostname the hostname that failed to resolve
+ * @param reason the specific DNS failure reason
+ * @return formatted error context string
+ */
+ public static String dnsResolutionFailure(String hostname, String reason) {
+ return String.format("DNS resolution failed for hostname '%s': %s", hostname, reason);
+ }
+
+ /**
+ * Creates error context for denylist rejections.
+ *
+ * @param target the target that was rejected
+ * @param ruleType the type of denylist rule that triggered (IP, domain, etc.)
+ * @return formatted error context string
+ */
+ public static String denylistRejection(String target, String ruleType) {
+ return String.format("Target '%s' rejected by %s denylist rule", target, ruleType);
+ }
+
+ /**
+ * Creates error context for target string parsing failures.
+ *
+ * @param targetString the unparseable target string
+ * @param parseStage the parsing stage where failure occurred
+ * @return formatted error context string
+ */
+ public static String targetParsingFailure(String targetString, String parseStage) {
+ return String.format(
+ "Failed to parse target string '%s' during %s", targetString, parseStage);
+ }
+
+ /**
+ * Creates error context for port parsing failures.
+ *
+ * @param portString the invalid port string
+ * @param targetString the full target string for context
+ * @return formatted error context string
+ */
+ public static String portParsingFailure(String portString, String targetString) {
+ return String.format("Invalid port '%s' in target string '%s'", portString, targetString);
+ }
+
+ /**
+ * Creates error context for general target processing failures.
+ *
+ * @param targetString the target string being processed
+ * @param operation the operation that failed
+ * @return formatted error context string
+ */
+ public static String targetProcessingFailure(String targetString, String operation) {
+ return String.format(
+ "Target processing failed for '%s' during %s", targetString, operation);
+ }
+}
diff --git a/src/main/java/de/rub/nds/crawler/data/ScanConfig.java b/src/main/java/de/rub/nds/crawler/data/ScanConfig.java
index 8f91fc2..6b95cb8 100644
--- a/src/main/java/de/rub/nds/crawler/data/ScanConfig.java
+++ b/src/main/java/de/rub/nds/crawler/data/ScanConfig.java
@@ -12,47 +12,175 @@
import de.rub.nds.scanner.core.config.ScannerDetail;
import java.io.Serializable;
+/**
+ * Abstract base configuration class for TLS scanner implementations in distributed scanning.
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ * @param bulkScanID the ID of the bulk scan this worker belongs to
+ * @param parallelConnectionThreads the number of threads for parallel connections
+ * @param parallelScanThreads the number of parallel scanner instances
+ * @return a new BulkScanWorker instance configured for this scanner type
+ */
public abstract BulkScanWorker<? extends ScanConfig> createWorker(
String bulkScanID, int parallelConnectionThreads, int parallelScanThreads);
}
diff --git a/src/main/java/de/rub/nds/crawler/data/ScanJobDescription.java b/src/main/java/de/rub/nds/crawler/data/ScanJobDescription.java
index 841b410..9d1cbf3 100644
--- a/src/main/java/de/rub/nds/crawler/data/ScanJobDescription.java
+++ b/src/main/java/de/rub/nds/crawler/data/ScanJobDescription.java
@@ -13,23 +13,99 @@
import java.io.Serializable;
import java.util.Optional;
+/**
+ * Data transfer object representing a single TLS scan job in the distributed scanning architecture.
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ * # Private networks
+ * 192.168.0.0/16
+ * 10.0.0.0/8
+ * 172.16.0.0/12
+ *
+ * # Specific domains
+ * government.gov
+ * sensitive.internal
+ *
+ * # Individual IPs
+ * 203.0.113.1
+ * 2001:db8::1
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ * {@code
+ * // Lambda implementation
+ * DoneNotificationConsumer consumer = (tag, job) -> {
+ * updateProgress(job.getStatus());
+ * logCompletion(job);
+ * };
+ *
+ * // Method reference
+ * DoneNotificationConsumer consumer = this::handleCompletion;
+ *
+ * // Registration with orchestration provider
+ * orchestrationProvider.registerDoneNotificationConsumer(bulkScan, consumer);
+ * }
+ *
+ * @see ScanJobDescription
+ * @see IOrchestrationProvider#registerDoneNotificationConsumer(de.rub.nds.crawler.data.BulkScan,
+ * DoneNotificationConsumer) Typically implemented by
+ * ProgressMonitor.BulkscanMonitor.consumeDoneNotification method.
+ */
@FunctionalInterface
public interface DoneNotificationConsumer {
+ /**
+ * Processes a scan job completion notification from the orchestration provider.
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ * @see ScanJobDescription
+ * @see ScanJobConsumer
+ * @see DoneNotificationConsumer
+ * @see BulkScan
+ * @see RabbitMqOrchestrationProvider
*/
public interface IOrchestrationProvider {
/**
- * Submit a scan job to the orchestration provider.
+ * Submits a scan job for distribution to available worker instances.
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ * @param deliveryTag the unique delivery tag of the message to acknowledge
+ * @see #notifyOfDoneScanJob(ScanJobDescription)
+ */
private void sendAck(long deliveryTag) {
try {
channel.basicAck(deliveryTag, false);
@@ -151,6 +299,41 @@ private void sendAck(long deliveryTag) {
}
}
+ /**
+ * Registers a consumer to receive completion notifications for a specific bulk scan.
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ * {@code
+ * // Lambda implementation
+ * ScanJobConsumer consumer = jobDescription -> {
+ * // Process the scan job
+ * processJob(jobDescription);
+ * };
+ *
+ * // Method reference
+ * ScanJobConsumer consumer = this::handleScanJob;
+ *
+ * // Registration with orchestration provider
+ * orchestrationProvider.registerScanJobConsumer(consumer, prefetchCount);
+ * }
+ *
+ * @see ScanJobDescription
+ * @see IOrchestrationProvider#registerScanJobConsumer(ScanJobConsumer, int) Typically implemented
+ * by Worker.handleScanJob(ScanJobDescription) method.
+ */
@FunctionalInterface
public interface ScanJobConsumer {
+ /**
+ * Processes a scan job received from the orchestration provider.
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ * @see BulkScan
+ * @see ScanResult
+ * @see ScanJobDescription
+ * @see MongoPersistenceProvider
*/
public interface IPersistenceProvider {
/**
- * Insert a scan result into the database.
+ * Persists a scan result and its associated job metadata to the database.
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ * @param serializer the custom JsonSerializer to register for MongoDB serialization
+ * @throws RuntimeException if called after MongoPersistenceProvider initialization
+ * @see #registerSerializer(JsonSerializer...)
+ * @see #registerModule(Module)
+ */
public static void registerSerializer(JsonSerializer<?> serializer) {
if (isInitialized) {
throw new RuntimeException("Cannot register serializer after initialization");
@@ -61,12 +87,47 @@ public static void registerSerializer(JsonSerializer<?> serializer) {
serializers.add(serializer);
}
+ /**
+ * Registers multiple custom JSON serializers for use in MongoDB document serialization.
+ *
+ *
+ *
+ *
+ * @param module the Jackson Module to register for enhanced serialization support
+ * @throws RuntimeException if called after MongoPersistenceProvider initialization
+ * @see #registerModule(Module...)
+ * @see #registerSerializer(JsonSerializer)
+ */
public static void registerModule(Module module) {
if (isInitialized) {
throw new RuntimeException("Cannot register module after initialization");
@@ -74,6 +135,20 @@ public static void registerModule(Module module) {
modules.add(module);
}
+ /**
+ * Registers multiple Jackson modules for extended JSON serialization functionality.
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ * @param mongoDbDelegate the MongoDB configuration containing connection parameters
+ * @return configured MongoClient ready for database operations
+ * @see MongoDbDelegate
+ * @see MongoClientSettings
+ */
private static MongoClient createMongoClient(MongoDbDelegate mongoDbDelegate) {
ConnectionString connectionString =
new ConnectionString(
@@ -120,6 +225,36 @@ private static MongoClient createMongoClient(MongoDbDelegate mongoDbDelegate) {
return MongoClients.create(mongoClientSettings);
}
+ /**
+ * Creates and configures a Jackson ObjectMapper for MongoDB document serialization.
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ * @return configured ObjectMapper ready for MongoDB document serialization
+ * @see #registerSerializer(JsonSerializer)
+ * @see #registerModule(Module)
+ * @see JavaTimeModule
+ */
private static ObjectMapper createMapper() {
ObjectMapper mapper = new ObjectMapper();
@@ -143,9 +278,38 @@ private static ObjectMapper createMapper() {
}
/**
- * Initialize connection to mongodb and setup MongoJack PojoToBson mapper.
+ * Initializes connection to MongoDB and sets up MongoJack PojoToBson mapper.
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ * {@code
+ * CruxListProvider provider = new CruxListProvider(CruxListNumber.TOP_10K);
+ * List
+ *
+ * @see ZipFileProvider
+ * @see CruxListNumber
+ * @see ITargetListProvider
*/
public class CruxListProvider extends ZipFileProvider {
@@ -24,6 +67,15 @@ public class CruxListProvider extends ZipFileProvider {
private static final String ZIP_FILENAME = "current.csv.gz";
private static final String FILENAME = "current.csv";
+ /**
+ * Creates a new CrUX list provider for the specified target list size.
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ * # TLS Crawler Target List
+ * # Production servers
+ * example.com:443
+ * api.example.com
+ * secure.example.org:8443
+ *
+ * # Test servers
+ * test.example.com:443
+ *
+ *
+ *
+ *
+ *
+ * {@code
+ * TargetFileProvider provider = new TargetFileProvider("/path/to/targets.txt");
+ * List
+ *
+ * @see ITargetListProvider Configured via ControllerCommandConfig.getTargetListProvider() method.
+ */
public class TargetFileProvider implements ITargetListProvider {
private static final Logger LOGGER = LogManager.getLogger();
private String filename;
+ /**
+ * Creates a new target file provider for the specified file path.
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ * @return a list of target strings read from the file
+ * @throws RuntimeException if the file cannot be read (file not found, I/O error, etc.)
+ */
@Override
public List<String> getTargetList() {
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ * {@code
+ * TrancoListProvider domains = new TrancoListProvider(10000);
+ * TrancoEmailListProvider emailProvider = new TrancoEmailListProvider(domains);
+ * List
+ *
+ * @see ITargetListProvider
+ * @see TrancoListProvider
+ * @see CruxListProvider
*/
public class TrancoEmailListProvider implements ITargetListProvider {
@@ -29,6 +84,15 @@ public class TrancoEmailListProvider implements ITargetListProvider {
private final ITargetListProvider trancoList;
+ /**
+ * Creates a new email list provider using the specified domain list provider.
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ * {@code
+ * TrancoListProvider provider = new TrancoListProvider(10000);
+ * List
+ *
+ * @see ZipFileProvider
+ * @see ITargetListProvider
+ * @see <a href="https://tranco-list.eu/">Tranco Ranking Project</a>
*/
public class TrancoListProvider extends ZipFileProvider {
@@ -22,6 +68,14 @@ public class TrancoListProvider extends ZipFileProvider {
private static final String ZIP_FILENAME = "tranco-1m.csv.zip";
private static final String FILENAME = "tranco-1m.csv";
+ /**
+ * Creates a new Tranco list provider for the specified number of top-ranked domains.
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ * @see ITargetListProvider
+ * @see TrancoListProvider
+ * @see CruxListProvider
+ */
public abstract class ZipFileProvider implements ITargetListProvider {
+ /** Logger instance for tracking download and extraction operations. */
protected static final Logger LOGGER = LogManager.getLogger();
+
+ /** Maximum number of targets to extract from the target list. */
protected final int number;
+
private final String sourceUrl;
private final String zipFilename;
private final String outputFile;
private final String listName;
+ /**
+ * Creates a new ZIP file provider with the specified configuration.
+ *
+ * @param number the maximum number of targets to extract from the list
+ * @param sourceUrl the URL to download the compressed file from
+ * @param zipFilename the local filename for the downloaded compressed file
+ * @param outputFile the local filename for the extracted content
+ * @param listName the human-readable name of the list for logging
+ */
protected ZipFileProvider(
int number, String sourceUrl, String zipFilename, String outputFile, String listName) {
this.number = number;
@@ -41,6 +120,29 @@ protected ZipFileProvider(
this.listName = listName;
}
+ /**
+ * Downloads, extracts, and processes the compressed target list file.
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ * @param lines a stream of lines from the extracted file
+ * @return a list of target strings formatted for scanning
+ */
protected abstract List<String> getTargetListFromLines(Stream<String> lines);
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *