
No direct AWS dependency

Right now, the Atum server contains a direct dependency on the AWS SDK and services for retrieving database credentials. This makes it difficult to use Atum in environments where AWS services are not available.

The best solution would be an abstract interface for retrieving the credentials, perhaps defaulting to reading them from a config file, with an implementation in the form of a plugin that uses the AWS SDK to retrieve them from AWS Secrets Manager. This way, users who don't want to use AWS can provide their own implementation.
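A minimal sketch of what such an abstraction could look like follows. The trait and class names (DbCredentials, DbCredentialsProvider, ConfigFileCredentialsProvider) and the config keys are illustrative assumptions, not part of the current codebase; an AWS plugin would simply provide another implementation of the same trait.

```scala
import com.typesafe.config.{Config, ConfigFactory}

// Illustrative names only; nothing here exists in Atum yet.
final case class DbCredentials(username: String, password: String)

trait DbCredentialsProvider {
  def credentials: DbCredentials
}

// Default implementation: read the credentials from the application config file.
final class ConfigFileCredentialsProvider(config: Config = ConfigFactory.load())
    extends DbCredentialsProvider {
  override def credentials: DbCredentials =
    DbCredentials(
      username = config.getString("atum.db.username"),   // hypothetical config keys
      password = config.getString("atum.db.password")
    )
}

// An AWS-specific provider would live in a separate plugin module and implement
// the same trait by calling AWS Secrets Manager through the AWS SDK.
```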

Securing REST endpoints

The Atum server currently has no authentication or authorization mechanism in place for its REST endpoints. While a detailed authorization mechanism is intentionally left out - that is the responsibility of the application incorporating the Atum libraries - basic authentication should be implemented to prevent unauthorized access to the Atum server.

Some time ago, we concluded that token-based authentication would be the best approach. The Atum server would accept a predefined token in the HTTP headers of each request and validate it before processing the request. The token could be configured via an environment variable, a configuration file, AWS Secrets Manager, or any other secure method. The suggested implementation should mirror the one proposed in the previous section for retrieving database credentials: default to reading from a configuration file, with the possibility to provide a custom implementation via a plugin. The key point is that this should be easy to set up and use: configure the method of retrieving the token when integrating the library into an application, and the token is then provided in the HTTP headers of each request to the Atum server automatically. The token should NOT need to be provided when calling the Atum classes facing the application code.
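As a rough, non-authoritative sketch of the client side: a token provider defaults to the configuration file, and the dispatcher attaches the token to every request automatically. The header name x-atum-token, the config key, and the endpoint URL are assumptions made up for the example.

```scala
import com.typesafe.config.ConfigFactory
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

trait TokenProvider {
  def token: String
}

// Default: read the token from the configuration file; a plugin could instead
// fetch it from AWS Secrets Manager or another secret store.
object ConfigFileTokenProvider extends TokenProvider {
  override def token: String = ConfigFactory.load().getString("atum.server.token")  // hypothetical key
}

object TokenExample {
  def main(args: Array[String]): Unit = {
    // The client-side dispatcher would add the header to every request automatically,
    // so application code never handles the token itself.
    val request = HttpRequest.newBuilder()
      .uri(URI.create("http://localhost:8080/api/v2/checkpoints"))  // made-up URL
      .header("x-atum-token", ConfigFileTokenProvider.token)        // made-up header name
      .GET()
      .build()
    val response = HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
    println(response.statusCode())
  }
}
```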

N.B. Perhaps there could be two tokens - one for read-only access (for the Reader module), and another one for write access (for the Agent module).

Streaming/incremental processing support

The original design was created with streaming/incremental processing in mind, but the current implementation does not fully support it yet.

The idea is to add a startCheckpoint method to AtumContext, followed by addPartialCheckpoint calls, and finally closeCheckpoint. This would allow a checkpoint to be created in multiple steps, which is essential for streaming/incremental processing. The server would store these partial data in a separate table and, upon closing of the checkpoint, retrieve them, aggregate them, and store the final checkpoint data similarly to the one-step checkpoint creation.
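A minimal sketch of how this could look from the caller's side, assuming the checkpoint is built up inside a Structured Streaming foreachBatch sink. Only the three method names come from the idea above; the signatures and the StreamingAtumContext trait are illustrative placeholders.

```scala
import org.apache.spark.sql.DataFrame

// Placeholder for the proposed additions to AtumContext; signatures are illustrative.
trait StreamingAtumContext {
  def startCheckpoint(checkpointName: String): Unit
  def addPartialCheckpoint(checkpointName: String, microBatch: DataFrame): Unit
  def closeCheckpoint(checkpointName: String): Unit
}

// Intended usage, e.g. inside a Structured Streaming foreachBatch sink:
//
//   ctx.startCheckpoint("daily-ingest")
//   stream.writeStream.foreachBatch { (batch: DataFrame, batchId: Long) =>
//     ctx.addPartialCheckpoint("daily-ingest", batch)  // measured per micro-batch, stored as partial data
//   }
//   ...
//   ctx.closeCheckpoint("daily-ingest")                // the server aggregates the partials into the final checkpoint
```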

Steps to implement the capability (not exhaustive):

  • new methods in AtumContext for starting, adding partial data, and closing checkpoint
  • new REST endpoints in the Atum server for starting and closing a checkpoint and handling the partial checkpoint data
  • new table in the database for storing partial checkpoint data
  • marking the measures that can be calculated incrementally (not every measure function is additive); only those would be calculated in the partial checkpoints
  • adding to the Atum server the logic for aggregating the partial checkpoint data upon closing the checkpoint (a sketch of such aggregation follows this list)
  • perhaps extending the Reader module to be able to read partial checkpoint data too
  • include support for optional identification of the micro-batches within the partial checkpoints (to avoid double counting in case of re-processing of a micro-batch)
  • an advanced feature would be the ability to apply a filter when computing the partial checkpoints, for cases where a micro-batch contains data of multiple data batches (e.g. using watermarking techniques).
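The aggregation step could look roughly like the sketch below, assuming partial measurements arrive as simple records; PartialMeasurement and its fields are invented for the example, and only additive measures are handled.

```scala
// Illustrative record of one partial measurement stored by the server.
final case class PartialMeasurement(
  microBatchId: Long,     // optional micro-batch identifier, used here for de-duplication
  measureName: String,    // e.g. "recordCount" or "sumAmount"
  value: BigDecimal
)

object PartialCheckpointAggregator {
  // Only additive measures (counts, sums) can be combined like this; non-additive
  // measures would have to be excluded when the partial checkpoints are taken.
  def aggregate(partials: Seq[PartialMeasurement]): Map[String, BigDecimal] =
    partials
      .groupBy(p => (p.measureName, p.microBatchId))
      .map { case (_, duplicates) => duplicates.head }  // re-processed micro-batches are counted once
      .toSeq
      .groupBy(_.measureName)
      .map { case (name, ms) => name -> ms.map(_.value).sum }
}
```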

Reader to be able to compare checkpoints

This is just a very fresh idea based on often voiced user requirements. At the same time, it might turn out to be relatively easy to implement.

The checkpoint data are there and easily retrievable. Comparing two checkpoints for similarity should be easy, even for a sequence of them. And as the type of the value is known, the comparison logic can be type-specific and carry more information than just equal/distinct.
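A minimal sketch of such a type-aware comparison, assuming the measurement results of a checkpoint can be exposed as a map from measure name to value; the types and names below are illustrative, not the Reader module's actual model.

```scala
// Illustrative result of comparing one measure across two checkpoints.
sealed trait MeasureDiff
case object Match extends MeasureDiff
final case class NumericDrift(expected: BigDecimal, actual: BigDecimal, delta: BigDecimal) extends MeasureDiff
final case class ValueMismatch(expected: String, actual: String) extends MeasureDiff
case object MissingInOther extends MeasureDiff

object CheckpointComparator {
  // Because the value type is known, numeric measures can report the size of the
  // difference instead of a plain equal/not-equal verdict.
  def compare(left: Map[String, Any], right: Map[String, Any]): Map[String, MeasureDiff] =
    left.map { case (measure, lValue) =>
      val diff = right.get(measure) match {
        case None => MissingInOther
        case Some(rValue) =>
          (lValue, rValue) match {
            case (l: BigDecimal, r: BigDecimal) => if (l == r) Match else NumericDrift(l, r, r - l)
            case (l, r)                         => if (l == r) Match else ValueMismatch(l.toString, r.toString)
          }
      }
      measure -> diff
    }
}
```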

Measuring optimization

The current implementation of the measures requires a pass through the data for each measurement. But their nature is such that they can all be calculated in a single pass through the data, or at most two. This would significantly improve performance when computing multiple measures on the same data.
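For Spark-based measures this essentially means collecting all requested aggregations into a single agg call, so they are computed in one job. A hedged sketch follows; the column names amount and fee are made up for the example.

```scala
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.{count, lit, sum}

object SinglePassMeasures {
  // All requested measures go into one df.agg(...) call, i.e. a single pass over
  // the data, instead of one Spark job per measure.
  def measure(df: DataFrame): Map[String, Any] = {
    val row: Row = df
      .agg(
        count(lit(1)).as("recordCount"),
        sum("amount").as("sumAmount"),   // "amount" and "fee" are example columns
        sum("fee").as("sumFee")
      )
      .collect()
      .head
    row.getValuesMap[Any](Seq("recordCount", "sumAmount", "sumFee"))
  }
}
```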

Memoization of the requests

Most REST requests to the Atum server require the partitioning or feed id, but this is re-requested every time despite being constant. Memoizing these requests would improve performance and reduce the load on the server.
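A simple memoization wrapper around the lookup would be enough; the sketch below caches results in a ConcurrentHashMap. The lookup function and key type are stand-ins for the real partitioning/feed id request.

```scala
import java.util.concurrent.ConcurrentHashMap

// Wraps any expensive lookup and remembers its results per key.
final class MemoizedLookup[K, V](fetch: K => V) extends (K => V) {
  private val cache = new ConcurrentHashMap[K, V]()
  // computeIfAbsent calls the server at most once per key; the id is constant
  // for a given partitioning, so later calls are served from memory.
  override def apply(key: K): V = cache.computeIfAbsent(key, k => fetch(k))
}

object MemoizedLookupExample {
  def main(args: Array[String]): Unit = {
    // Stand-in for the actual REST call resolving a partitioning to its id.
    val getPartitioningId: String => Long = { partitioningJson =>
      println(s"calling the server for $partitioningJson")
      42L
    }
    val memoized = new MemoizedLookup(getPartitioningId)
    memoized("""{"dataset":"credit_cards"}""")  // hits the server
    memoized("""{"dataset":"credit_cards"}""")  // served from the cache
  }
}
```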

UI

There's no intention to have a full-fledged UI for Atum. The reason is the above-mentioned complexity of authorization and the fact that Atum is supposed to be integrated into other applications that would provide their own UI.

However, a simple Angular-based UI component that would display the checkpoint data for a provided partitioning or feed would greatly help the integration. The form and content of such a component are to be discussed.

Patterns

The idea of patterns is to automate the assignment of measures and (expected) additional data to newly created partitionings based on matching criteria.

For example, a pattern with the following designation

source_system = EB
dataset = credit_cards
report_date = NULL

would have the following defined

  • measures: row_count, sum(amount)
  • additional data: owner = finance_team, sensitivity = high, processing_time = NULL

Then a partitioning created with

source_system = EB
dataset = credit_cards
report_date = 2024-06-01

would have the above measures and additional data automatically assigned to it (where processing_time would be an expected additional data entry).

Flows would have the same capability.
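The matching rule itself could be as simple as the sketch below, where a pattern is a map in which a None value plays the role of the NULL wildcard from the example above; the types are illustrative, not the actual Atum model.

```scala
object PatternMatching {
  type Partitioning = Map[String, String]
  type Pattern      = Map[String, Option[String]]

  // A partitioning matches a pattern when it has exactly the pattern's keys and
  // every non-NULL pattern value equals the corresponding partitioning value.
  def matches(pattern: Pattern, partitioning: Partitioning): Boolean =
    pattern.keySet == partitioning.keySet &&
      pattern.forall {
        case (key, Some(expected)) => partitioning.get(key).contains(expected)
        case (_, None)             => true
      }

  def main(args: Array[String]): Unit = {
    val pattern = Map(
      "source_system" -> Some("EB"),
      "dataset"       -> Some("credit_cards"),
      "report_date"   -> None            // NULL: matches any report_date
    )
    val partitioning = Map(
      "source_system" -> "EB",
      "dataset"       -> "credit_cards",
      "report_date"   -> "2024-06-01"
    )
    println(matches(pattern, partitioning))  // true: measures and additional data would be assigned
  }
}
```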

_INFO file support

The original implementation of Atum included support for reading control measurements and additional data from an _INFO file placed next to the data file, and for writing such a file when writing new data. The Atum service might want to revive this ability.

Creating an _INFO file is rather straightforward: the Reader module provides methods to gather all the required data; it only remains to combine them into a JSON structure and output it, possibly writing it into a file.
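A minimal sketch of assembling and writing such a JSON document, assuming the gathered data can be reduced to two flat maps; the field layout is invented for the example and does not reproduce the original _INFO schema.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

object InfoFileWriter {
  // Naive JSON rendering for the sketch only; a real implementation would use a
  // proper JSON library and escape the values.
  private def quote(s: String): String = "\"" + s + "\""

  private def jsonObject(fields: Map[String, String]): String =
    fields.map { case (k, v) => quote(k) + ": " + v }.mkString("{", ", ", "}")

  def render(partitioning: Map[String, String],
             measurements: Map[String, BigDecimal]): String =
    jsonObject(Map(
      "partitioning" -> jsonObject(partitioning.map { case (k, v) => k -> quote(v) }),
      "measurements" -> jsonObject(measurements.map { case (k, v) => k -> v.toString })
    ))

  def write(path: String, content: String): Unit =
    Files.write(Paths.get(path), content.getBytes(StandardCharsets.UTF_8))
}
```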

The opposite operation is to take an _INFO file, parse its content, and feed it into Atum Service. For proper operation, the correct partitioning must be identified first; then the data from the _INFO file can be uploaded in a series of createCheckpointOnProvidedData and addAdditionalData calls.
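The ingestion side could look roughly as follows; ParsedInfoFile and the AtumIngestion trait are placeholders invented for the sketch, and the two method signatures only mirror the call names mentioned above, not their real shape.

```scala
// Simplified view of the parsed _INFO content.
final case class ParsedInfoFile(
  partitioning: Map[String, String],
  measurements: Map[String, BigDecimal],
  additionalData: Map[String, String]
)

// Placeholder for whatever component ends up exposing these calls; the real
// signatures in Atum may differ.
trait AtumIngestion {
  def createCheckpointOnProvidedData(checkpointName: String, measurements: Map[String, BigDecimal]): Unit
  def addAdditionalData(additionalData: Map[String, String]): Unit
}

object InfoFileImporter {
  // The partitioning identified by parsed.partitioning must be resolved first;
  // only then are the measurements and additional data uploaded.
  def importInfoFile(parsed: ParsedInfoFile, ingestion: AtumIngestion): Unit = {
    ingestion.createCheckpointOnProvidedData("_INFO import", parsed.measurements)
    ingestion.addAdditionalData(parsed.additionalData)
  }
}
```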