Fixing task ID replacement for MNP jobs on AWS Batch #2574
Conversation
…place with node-index
saikonen left a comment
functionally this PR is now working as expected. Had some suggestions for cleanup
```python
# Set the ulimit of number of open files to 65536. This is because we cannot
# set it easily once worker processes start on Batch.
# job_definition["containerProperties"]["linuxParameters"]["ulimits"] = [
#     {
#         "name": "nofile",
#         "softLimit": 65536,
#         "hardLimit": 65536,
#     }
# ]
```
can this be cleaned up?
Yep! Removed.
```python
# Prefer the task role by default when running inside AWS Batch containers
# by temporarily removing higher-precedence env credentials for this process.
# This avoids AMI-injected AWS_* env vars from overriding the task role.
# Outside of Batch, we leave env vars untouched unless explicitly opted-in.
if "AWS_BATCH_JOB_ID" in os.environ:
    _aws_env_keys = [
        "AWS_ACCESS_KEY_ID",
        "AWS_SECRET_ACCESS_KEY",
        "AWS_SESSION_TOKEN",
        "AWS_PROFILE",
        "AWS_DEFAULT_PROFILE",
    ]
    _present = [k for k in _aws_env_keys if k in os.environ]
    print(
        "[Metaflow] AWS credential-related env vars present before Batch client init:",
        _present,
    )
    _saved_env = {k: os.environ.pop(k) for k in _aws_env_keys if k in os.environ}
    try:
        self._client = get_aws_client("batch")
    finally:
        # Restore prior env for the rest of the process
        for k, v in _saved_env.items():
            os.environ[k] = v
else:
    self._client = get_aws_client("batch")
```
is this change relevant to the batch parallel issue, or something different? the PR seems to work fine without this part as well
Indeed it works, this was to cover the instances where particular AWS keys have already been set in the environment, which messed up getting the AWS client. This is relevant for the batch process given that we're using the batch client now.
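For context, a minimal sketch (not part of this PR) of how to check which credential source boto3 actually resolved inside the Batch container; if AMI-injected AWS_* env vars are present, they take precedence over the task role:

```python
import boto3

# Inspect which credential source botocore resolved for this process.
# With AWS_ACCESS_KEY_ID etc. set in the environment, this typically
# reports "env" rather than the ECS/Batch task-role provider.
session = boto3.session.Session()
creds = session.get_credentials()
print(creds.method if creds else "no credentials resolved")
```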
what confuses me with these is that the same environment variables should then also interfere with the task running inside a batch process from getting a working S3 client as well, which would break datastore access.
do you have an example AMI which exhibits this issue, or an example on how to reproduce the issue? I'm not running into any issues with my test setup even without these additions.
If the environment modification part is not critical to this fix working for your use case, could you introduce that part as a separate PR?
```python
    if tds.has_metadata(TaskDataStore.METADATA_DONE_SUFFIX):
        completed += 1
except Exception as e:
    self.logger.warning("Datastore wait: error checking %s: %s", ps, e)
```
self.logger doesn't actually have any methods, it is just click.secho being passed in. This also adds unnecessary (duplicate) timestamps to the log lines so sticking to print for now is fine.
also note all other instances of self.logger
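A minimal sketch of the suggested fix, assuming self.logger is the click.secho-style callable Metaflow passes in (so it is called directly, or swapped for print); the helper name and arguments here are illustrative, not the PR's code:

```python
def wait_for_done(pathspecs, check_done, logger=print):
    # self.logger in the Batch code is a plain callable (click.secho), not a
    # logging.Logger, so logger.warning(...) would raise AttributeError.
    completed = 0
    for ps in pathspecs:
        try:
            if check_done(ps):
                completed += 1
        except Exception as e:
            # Call the callable directly, or just use print as the review suggests.
            logger("Datastore wait: error checking %s: %s" % (ps, e))
    return completed

# Example usage with plain print as the logger:
print(wait_for_done(["MyFlow/3/train/3"], lambda ps: True))
```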
If we aren't using the Metaflow metadata service provider, Metaflow defaults to generating task IDs locally. These task IDs are simple integers, incremented sequentially by new_task_id in metaflow/plugins/metadata_providers/local.py. This presents a problem for AWS Batch MNP, because we currently do a blanket replacement of the task ID throughout the secondary command. If the task ID is a small integer, that replacement hits many unintended places: for example, if the task ID is "3", every occurrence of "3" in the secondary command gets rewritten with "-node-$AWS_BATCH_JOB_NODE_INDEX", when we really only want to rewrite the actual task ID.
Here, I've identified the only two places in the command that carry the actual task ID and need replacing: the input task ID passed via --task-id, and the task ID inside MF_PATHSPEC. Using more specific regexes for these is safer. Furthermore, if there is no metadata provider, I've added a new check for control MNP jobs to finish by checking the S3 datastore instead.
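A minimal sketch of the targeted replacement described above, assuming the secondary command contains the task ID as "--task-id <id>" and as the last component of an MF_PATHSPEC=flow/run/step/<id> assignment; the function name and exact pathspec layout are illustrative, not the PR's exact code:

```python
import re

def replace_task_id(cmd, task_id):
    # Rewrite only the two known occurrences of the task ID, appending the
    # Batch node index, instead of replacing every substring match.
    node_suffix = "-node-$AWS_BATCH_JOB_NODE_INDEX"
    # 1. The --task-id argument.
    cmd = re.sub(
        r"(--task-id\s+)%s\b" % re.escape(task_id),
        r"\g<1>%s%s" % (task_id, node_suffix),
        cmd,
    )
    # 2. The task ID at the end of the MF_PATHSPEC value (flow/run/step/task_id).
    cmd = re.sub(
        r"(MF_PATHSPEC=[^ ]*/)%s\b" % re.escape(task_id),
        r"\g<1>%s%s" % (task_id, node_suffix),
        cmd,
    )
    return cmd

# A task ID of "3" no longer clobbers other occurrences of "3" in the command:
cmd = "python flow.py step train --run-id 3 --task-id 3 --retries 3"
cmd += " && export MF_PATHSPEC=MyFlow/3/train/3"
print(replace_task_id(cmd, "3"))
```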