Fixing task ID replacement for MNP jobs on AWS Batch #2574

vymao · 2025-08-26T20:54:29Z

If we aren't using the Metaflow metadata service provider, Metaflow defaults to generating task IDs locally. These task IDs are plain integers, derived from how many tasks/steps already exist and incremented sequentially by new_task_id in metaflow/plugins/metadata_providers/local.py. This is a problem for AWS Batch MNP (multi-node parallel) jobs, because we currently mass-replace every occurrence of the task ID in the secondary command. When the task ID is a plain integer, that replacement also hits many unrelated places.

For example, if the task ID is "3", every "3" in the secondary command gets rewritten with "-node-$AWS_BATCH_JOB_NODE_INDEX", when really we only want to rewrite the actual task ID.

Here, I've identified the two places in the command that should be the only ones containing the actual task ID and needing replacement: the task ID passed via --task-id and the task ID in MF_PATHSPEC. Matching these with more specific regexes is safer than a blanket replace.
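To make the difference concrete, here is a minimal sketch contrasting a blanket string replace with two targeted regexes. The command string, the function names, and the exact patterns are assumptions made up for illustration; they are not the regexes used in this PR.

```python
import re

NODE_SUFFIX = "-node-$AWS_BATCH_JOB_NODE_INDEX"


def naive_replace(command: str, task_id: str) -> str:
    """Blanket replace: rewrites every occurrence of the task ID string."""
    return command.replace(task_id, task_id + NODE_SUFFIX)


def targeted_replace(command: str, task_id: str) -> str:
    """Rewrite the task ID only after --task-id and at the end of MF_PATHSPEC."""
    tid = re.escape(task_id)

    def add_suffix(match):
        # A function replacement avoids any escaping concerns in the suffix.
        return match.group(1) + task_id + NODE_SUFFIX

    # Illustrative pattern: "--task-id 3" or "--task-id=3".
    command = re.sub(rf"(--task-id[ =]){tid}(?=\s|$)", add_suffix, command)
    # Illustrative pattern: MF_PATHSPEC=<flow>/<run>/<step>/<task_id>.
    command = re.sub(
        rf"(MF_PATHSPEC=[^\s/]+/[^\s/]+/[^\s/]+/){tid}(?=\s|$)",
        add_suffix,
        command,
    )
    return command


# Hypothetical secondary command with a locally generated task ID of "3".
command = (
    "export MF_PATHSPEC=MyFlow/13/train/3 && "
    "python3 flow.py step train --run-id 13 --task-id 3 --retry-count 3"
)

print(naive_replace(command, "3"))
print(targeted_replace(command, "3"))
```

In this made-up command the blanket replace touches six places, including `python3` and `--run-id 13`, while the targeted version touches only the two intended ones.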

Furthermore, if there is no metadata provider, control MNP jobs now use a new completion check that looks at the S3 datastore instead.
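A rough sketch of what such a datastore-based wait could look like is below. The names `wait_for_done` and `is_done` are hypothetical; in the PR, the per-task check shown in the review diff is `tds.has_metadata(TaskDataStore.METADATA_DONE_SUFFIX)`, which is abstracted here behind a callable so the example runs on its own.

```python
import time


def wait_for_done(pathspecs, is_done, timeout=600.0, poll_interval=5.0,
                  sleep=time.sleep):
    """Poll until is_done(pathspec) is True for every pathspec.

    is_done is a hypothetical callable standing in for a datastore lookup.
    Returns the number of polling rounds; raises TimeoutError on timeout.
    """
    pending = set(pathspecs)
    deadline = time.monotonic() + timeout
    rounds = 0
    while pending:
        rounds += 1
        # Iterate over a sorted copy so we can shrink the set while looping.
        for ps in sorted(pending):
            try:
                if is_done(ps):
                    pending.discard(ps)
            except Exception as e:
                # A failed lookup is treated as "not done yet" and retried.
                print("Datastore wait: error checking %s: %s" % (ps, e))
        if pending:
            if time.monotonic() >= deadline:
                raise TimeoutError(
                    "Tasks not done after %ss: %s" % (timeout, sorted(pending))
                )
            sleep(poll_interval)
    return rounds
```

Treating lookup errors as transient keeps one flaky read from failing the whole control task, at the cost of waiting until the timeout if the error is permanent.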

saikonen

functionally this PR is now working as expected. Had some suggestions for cleanup

saikonen · 2025-10-27T21:33:41Z

metaflow/plugins/aws/batch/batch_client.py

+            # Set the ulimit of number of open files to 65536. This is because we cannot set it easily once worker processes start on Batch.
+            # job_definition["containerProperties"]["linuxParameters"]["ulimits"] = [
+            #     {
+            #         "name": "nofile",
+            #         "softLimit": 65536,
+            #         "hardLimit": 65536,
+            #     }
+            # ]


can this be cleaned up?

Yep! Removed.

saikonen · 2025-10-27T21:34:38Z

metaflow/plugins/aws/batch/batch_client.py

+        # Prefer the task role by default when running inside AWS Batch containers
+        # by temporarily removing higher-precedence env credentials for this process.
+        # This avoids AMI-injected AWS_* env vars from overriding the task role.
+        # Outside of Batch, we leave env vars untouched unless explicitly opted-in.
+        if "AWS_BATCH_JOB_ID" in os.environ:
+            _aws_env_keys = [
+                "AWS_ACCESS_KEY_ID",
+                "AWS_SECRET_ACCESS_KEY",
+                "AWS_SESSION_TOKEN",
+                "AWS_PROFILE",
+                "AWS_DEFAULT_PROFILE",
+            ]
+            _present = [k for k in _aws_env_keys if k in os.environ]
+            print(
+                "[Metaflow] AWS credential-related env vars present before Batch client init:",
+                _present,
+            )
+            _saved_env = {
+                k: os.environ.pop(k) for k in _aws_env_keys if k in os.environ
+            }
+            try:
+                self._client = get_aws_client("batch")
+            finally:
+                # Restore prior env for the rest of the process
+                for k, v in _saved_env.items():
+                    os.environ[k] = v
+        else:
+            self._client = get_aws_client("batch")


is this change relevant to the batch parallel issue, or something different? the PR seems to work fine without this part as well

Indeed it works. This was to cover cases where particular AWS keys have already been set in the environment, which broke getting the AWS client. It is relevant to the Batch flow because we're now using the Batch client.
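For reference, the save-and-restore pattern in the diff above can be expressed as a small context manager. This is only a sketch of the pattern: `without_env` is a hypothetical helper name, and the key list is copied from the diff. The motivation is that boto3 resolves credentials from environment variables before it falls back to container or instance role credentials.

```python
import os
from contextlib import contextmanager

# Same keys as in the diff under review.
AWS_CREDENTIAL_ENV_KEYS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_SESSION_TOKEN",
    "AWS_PROFILE",
    "AWS_DEFAULT_PROFILE",
)


@contextmanager
def without_env(keys):
    """Temporarily remove the given environment variables (hypothetical helper)."""
    saved = {k: os.environ.pop(k) for k in keys if k in os.environ}
    try:
        # Yield what was removed so the caller can log the key names if needed.
        yield saved
    finally:
        # Restore the prior values for the rest of the process.
        os.environ.update(saved)
```

A caller would wrap only the client construction in `with without_env(AWS_CREDENTIAL_ENV_KEYS):`, so the rest of the process still sees the original environment.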

saikonen · 2025-11-05T10:23:20Z

metaflow/plugins/aws/batch/batch_decorator.py

+                    if tds.has_metadata(TaskDataStore.METADATA_DONE_SUFFIX):
+                        completed += 1
+                except Exception as e:
+                    self.logger.warning("Datastore wait: error checking %s: %s", ps, e)


self.logger doesn't actually have any methods; it is just click.secho being passed in. Using it also adds unnecessary (duplicate) timestamps to the log lines, so sticking to print for now is fine.

Also note all other instances of self.logger.
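The failure mode is easy to reproduce with a stand-in. In the sketch below, `secho_like` is a hypothetical plain function playing the role of click.secho, and `WaitReporter` is a made-up class, not the actual decorator.

```python
def secho_like(message, **style):
    """Hypothetical stand-in for click.secho: a plain function, not a Logger."""
    print(message)


class WaitReporter:
    """Made-up holder for a function-style logger, for illustration only."""

    def __init__(self, logger):
        self.logger = logger

    def report_with_logger_method(self, ps, err):
        # Fails: a plain function has no .warning attribute.
        self.logger.warning("Datastore wait: error checking %s: %s", ps, err)

    def report_with_print(self, ps, err):
        # Works regardless of what kind of object self.logger is.
        message = "Datastore wait: error checking %s: %s" % (ps, err)
        print(message)
        return message


reporter = WaitReporter(secho_like)
try:
    reporter.report_with_logger_method("MyFlow/13/train/3", "boom")
except AttributeError as e:
    print("logger call failed:", e)
reporter.report_with_print("MyFlow/13/train/3", "boom")
```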

saikonen · 2025-11-05T10:25:44Z

metaflow/plugins/aws/batch/batch_client.py

What confuses me here is that the same environment variables should also prevent the task running inside the Batch job from getting a working S3 client, which would break datastore access.

Do you have an example AMI that exhibits this issue, or steps to reproduce it? I'm not running into any issues with my test setup even without these additions.

saikonen · 2025-11-05T10:29:14Z

metaflow/plugins/aws/batch/batch_client.py

If the environment modification part is not critical to this fix working for your use case, could you introduce that part as a separate PR?

Fixing task ID replacement for MNP jobs on AWS Batch

2a5c211

vymao mentioned this pull request Aug 26, 2025

Is it possible to use @metaflow_ray with foreach on AWS Batch? #2564

Open

savingoyal reviewed Aug 26, 2025

metaflow/plugins/aws/batch/batch_client.py (outdated)

savingoyal requested a review from saikonen August 26, 2025 21:43

Victor Mao (main) added 4 commits August 28, 2025 10:48

Modifying so that we can have better/earlier matches for places to replace with node-index

ce15127

Fixing step_kwargs conflict with logs writing

86c3b84

Making it so that the [NODE-INDEX] substitution gets passed to MF_PATHSPEC

f2ee285

Merge remote-tracking branch 'upstream/master' into fix/local-metadata-aws-batch-mnp

21a62ac

vymao requested a review from savingoyal August 28, 2025 20:23

Merge branch 'master' into fix/local-metadata-aws-batch-mnp

26fa49c

saikonen linked an issue Sep 5, 2025 that may be closed by this pull request

Is it possible to use @metaflow_ray with foreach on AWS Batch? #2564

Open

Victor Mao (main) and others added 2 commits September 22, 2025 23:46

Updating flow for MNP

cc5b44e

Merge branch 'master' into fix/local-metadata-aws-batch-mnp

96259ac

saikonen reviewed Oct 27, 2025

Victor Mao (main) and others added 3 commits October 30, 2025 16:45

Resolving comments

9fa4391

Merge branch 'master' into fix/local-metadata-aws-batch-mnp

526c81c

Cleaning up code

a0f68ca

vymao requested a review from saikonen October 30, 2025 20:47

saikonen reviewed Nov 5, 2025
