Skip to content

Conversation

@manics
Copy link
Member

@manics manics commented Oct 3, 2020

Prototype of uploading repo2docker build logs to S3 (jupyterhub/binderhub#1156)

If you don't have access to an S3 server run a local one:

docker run -it --rm -e MINIO_ACCESS_KEY=minio -e MINIO_SECRET_KEY=minio123 -p 9000:9000 minio/minio server /data

Download an S3 client e.g. the minio client from https://min.io/download

Add temporary client config called mybindertest for the local test server

export MC_HOST_mybindertest=http://minio:minio123@localhost:9000

Create a bucket called mybinder

mc mb mybindertest/mybinder

Allow public downloads from the bucket

mc policy set download mybindertest/mybinder

Repo2docker S3 logs config

cat << EOF > s3config.py 
c.Repo2Docker.s3_logs_endpoint = 'http://localhost:9000'
c.Repo2Docker.s3_logs_access_key = 'minio'
c.Repo2Docker.s3_logs_secret_key = 'minio123'
c.Repo2Docker.s3_logs_bucket = 'mybinder'
EOF

Build with image name asd123

repo2docker --config s3config.py --image-name asd123 https://github.com/binder-examples/requirements

Logs should be available at
http://localhost:9000/mybinder/buildlogs/asd123/repo2docker.log

If you can't be bothered to setup your own Minio server you can use the public demo one at play.min.io:

export MC_HOST_mybindertest=https://Q3AM3UQ867SPQQA43P2F:[email protected]
mc mb mybindertest/mybinder
mc policy set download mybindertest/mybinder

Those demo credentials are public, I've already run the above so assuming no-one else has modified it you just need:

cat << EOF > s3config.py 
c.Repo2Docker.s3_logs_endpoint = 'https://play.min.io'
c.Repo2Docker.s3_logs_access_key = 'Q3AM3UQ867SPQQA43P2F'
c.Repo2Docker.s3_logs_secret_key = 'zuf+tfteSlswRu7BJ86wekitnifILbZam1KYY3TG'
c.Repo2Docker.s3_logs_bucket = 'mybinder'
EOF

repo2docker --config s3config.py --no-run https://github.com/binder-examples/conda
...
Successfully tagged r2dhttps-3a-2f-2fgithub-2ecom-2fbinder-2dexamples-2fconda0349319:latest
Uploading log to mybinder/buildlogs/r2dhttps-3a-2f-2fgithub-2ecom-2fbinder-2dexamples-2fconda0349319/repo2docker.log

https://play.minio.io/mybinder/buildlogs/r2dhttps-3a-2f-2fgithub-2ecom-2fbinder-2dexamples-2fconda0349319/repo2docker.log

Copy link
Member

@minrk minrk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! I think this is a great approach. I had been thinking we should do it at the deployment/mybinder.org level, but this does seem lots simpler.

and self.s3_logs_secret_key
and self.s3_logs_bucket
):
logfile = tempfile.NamedTemporaryFile("w", delete=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These builds logs have been useful in the past, maybe they should be kept anyway, even if not uploading to s3?

@manics
Copy link
Member Author

manics commented Oct 5, 2020

Thanks for reviewing! I started with repo2docker for this prototype because it was easier to get the logs, but I've since figured out how to do it on the BinderHub side: jupyterhub/binderhub#1161. Obviously all your review comments from here also apply there.

I think BinderHub is a better location since it integrates multiple components though the repo2docker is a bit simpler. Let me know which you prefer and I'll put more time into it.

@betatim
Copy link
Member

betatim commented Oct 7, 2020

Came here to find out why "store logs in s3" needed changes in repo2docker. My guess at an implementation would have either been changes to BinderHub chart (inject a sidecar) or in BinderHub itself. Is it worth discussing the pros and cons?

Having thought about it for a bit while writing this I am starting to think that r2d is the right place (sidecar or "native"), not bhub. By putting the code into r2d we can restart the hub pod without losing logs and maintain the kind of "fire and forget" setup we have right now for builds (bhub asks for a build to start and then doesn't need to maintain any state about it).

@betatim
Copy link
Member

betatim commented Oct 7, 2020

About naming the bucket: I think there are two options:

  1. use a name derived from the resolved ref (in the Git case that would be a SHA1 and URL)
  2. use a name derived from the unresolved ref (in the Git case that would be a branch/tag name and URL)

(the "derived from" is meant to represent us hashing/mangling/escaping/processing the inputs to generate a bucket name)

I like naming it after the resolved ref as that generates unique names for free and we can have a history of previous builds. However I can't (yet) think of a nice way for a user to find those logs again. The scenario i am thinking about is this:

  1. user builds master of org/repo thinking "let me quickly build this repo"
  2. build fails with some error that looks easy to fix
  3. user makes a commit in org/repo
  4. user builds master of org/repo again
  5. error persists, "ok, maybe let me look at that original log again, must have overlooked something"
  6. user now has to go find the original commit's SHA1 in order to get access to the log??

Basically in a situation where you build and rebuild and rerebuild the same repo by adding commits to your repository it becomes quite tricky to (re)find the log of a previous build.

I think this is a user interface problem which we could solve in BinderHub with some GUI or by clever structuring of the bucket names or some other way. I think the right level of complexity for a user would be presenting them with a chronologically sorted list of build logs that has a link back to the source repository (in the state it was when we built it) and maybe even the human recognisable ref that they typed into the BinderHub form.

@minrk
Copy link
Member

minrk commented Oct 7, 2020

Basically in a situation where you build and rebuild and rerebuild the same repo by adding commits to your repository it becomes quite tricky to (re)find the log of a previous build.

I see two pieces of information useful for solving this problem:

  1. We could use a directory prefix (s3 doesn't really have directories) based on the repo only, so that it's easy to list all build logs for a repo. This may also come in handy for per-repo quotas/culling purposes, though e.g. 30-day internal bucket expiry may be sufficient for our purposes.
  2. We can use metadata fields like we do in the analytics-archive to store information like resolved_ref, unresolved_ref, repo, date, etc.

Key information: prefix can be used to filter files to a short manageable list per-repo, metadata cannot, hence the prefix. bucket-storage like S3/GCS does not support things like list-files-matching-metadata.

Then we can have a BinderHub API based on this info to retrieve the build log history for a repo, e.g.

  • GET /api/build-logs/$provider/$repo -> list of build logs for repo, sorted by date with info about which ref they were building
  • GET /api/build-log/$provider/$repo/$ref -> single log after resolving ref

@manics
Copy link
Member Author

manics commented Oct 7, 2020

About naming the bucket: I think there are two options:

use a name derived from the resolved ref (in the Git case that would be a SHA1 and URL)
use a name derived from the unresolved ref (in the Git case that would be a branch/tag name and URL)

For the benefit of anyone new to S3 I thought I'd clear up a bit of terminology. You have an S3 server, a bucket, and a key which references an object. With many public clouds you also have a region but ignore that for now.

This means the URL to your object (in this case a log-file) is something like https://<s3-server.example.org>/<bucket>/<key>

The bucket is a container for multiple objects.
The key is a single string, / has no meaning. As @minrk mentioned you can simulate a directory structure by doing a prefix search, but since it's not a real directory you can confuse people by creating both /bucket/foo and /bucket/foo/bar as objects 😀 (some S3 implementations will prevent this).

So in my original example:
https://play.minio.io/mybinder/buildlogs/r2dhttps-3a-2f-2fgithub-2ecom-2fbinder-2dexamples-2fconda0349319/repo2docker.log

Server: https://play.minio.io
Bucket: mybinder
Key: buildlogs/r2dhttps-3a-2f-2fgithub-2ecom-2fbinder-2dexamples-2fconda0349319/repo2docker.log

@manics
Copy link
Member Author

manics commented Oct 16, 2020

I think this is quite close to being ready. I took @minrk's suggestion (thanks!) of using a Configurable which I think makes things a lot nicer, the changes to app.py are fairly small, most of the logic is in logstore.S3LogStore.

The main issue left is how to configure S3LogStore, in particular how to set the log name when integrated into BinderHub. At present S3LogStore knows nothing about the R2D configuration, so it relies on S3LogStore.logname to be set to the desired S3 key/path, which means it has to be set in a config file passed as repo2docker --config my_config.py:

c.Repo2Docker.logstore = 'repo2docker.logstore.S3LogStore'
c.S3LogStore.endpoint = 'http://192.168.1.1:9000'
c.S3LogStore.access_key = 'minio'
c.S3LogStore.secret_key = 'minio123'
c.S3LogStore.bucket = 'mybinder'
c.S3LogStore.logname = 'buildlogs/repo2docker.log'
c.S3LogStore.metadata = {"R2D-Key": "value",}

This is probably OK for the static S3 connection parameters, but the logname will change with every build so the config would need to be rewritten. Options:

  1. Don't change anything, always rewrite the config file
  2. Add a new repo2docker argument to set the logname, and possibly the metadata if we want it.
  3. Allow configurables to be set on the command line: --S3LogStore.logname=foo/bar.txt

Any thoughts? I quite like option 3 as it matches other Jupyter/Ipython applications, and hides the new feature from a typical user. It's not currently working in R2D though, was this a deliberately design choice? It seemed easy enough to add:

diff --git a/repo2docker/__main__.py b/repo2docker/__main__.py
index 907460d..5ea22d8 100644
--- a/repo2docker/__main__.py
+++ b/repo2docker/__main__.py
@@ -232,7 +232,8 @@ def make_r2d(argv=None):
         print(__version__)
         sys.exit(0)
 
-    args = get_argparser().parse_args(argv)
+    print(argv)
+    args, unknown_args = get_argparser().parse_known_args(argv)
 
     r2d = Repo2Docker()
 
@@ -240,6 +241,9 @@ def make_r2d(argv=None):
         r2d.log_level = logging.DEBUG
 
     r2d.load_config_file(args.config)
+    if unknown_args:
+        print("Unknown args: ", unknown_args)
+        r2d.parse_command_line(unknown_args)
     if args.appendix:
         r2d.appendix = args.appendix

@manics
Copy link
Member Author

manics commented Oct 16, 2020

On K8S sidecar vs native in BinderHub: AFAIK since the repo2docker container logs to stdout it's not possible to capture them in a sidecar without changing repo2docker to log to a location the sidecar logging agent can read, such as a file on a shared volume or a network socket. I might be wrong on this though.

BinderHub obtains the logs using the K8s API
https://github.com/jupyterhub/binderhub/blob/acdab1952385eead7650590464bd218f1892fca6/binderhub/build.py#L350

@consideRatio consideRatio changed the title WIP: s3 logs prototype s3 logs prototype Oct 30, 2022
@consideRatio consideRatio changed the title s3 logs prototype Add support for uploading build logs to S3 Oct 31, 2022
@consideRatio consideRatio changed the title Add support for uploading build logs to S3 Add LogStore, and a S3LogStore implementation to retaining build logs Oct 31, 2022
@consideRatio consideRatio changed the title Add LogStore, and a S3LogStore implementation to retaining build logs Add LogStore, and a S3LogStore implementation to retain build logs Oct 31, 2022
@consideRatio consideRatio changed the title Add LogStore, and a S3LogStore implementation to retain build logs Add LogStore to retain build logs, and a S3LogStore implementation Oct 31, 2022
@consideRatio
Copy link
Member

I'm onboard with the idea and I love that you have created an abstract LogStore class where s3 is one implementation of it.

Happy to help with review efforts if you want to bring this to life again @manics!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

4 participants