Add a docker-compose environment for local/integration testing #58
Conversation
This is so cool. Thank you so much for the hard work here, @dmateusp! I took it for a spin locally and was amazed by the ease of setup. I think this is going to enable local integration testing and containerized CI in a way that accelerates the pace of contribution.
Given that this is a fork off of @aaronsteers' work in #55, is everyone on board with preferring this one? I prefer the addition of a Postgres dependency to a MySQL one. I'm open to hearing if there's significant functionality supported in the other approach and omitted here.
If it's okay with you, I want to wait on merging this until @beckjake has a chance to give it a once-over. (He's on vacation this week.)
I agree! Fantastic work - and thank you @dmateusp for your effort in getting this revamped and cleaned up.
@jtcohen6 - YES - for my part, at least, I do agree - this is far cleaner than the initial version I posted in #55; I am happy to close or deprioritize #55 in favor of this approach. I may still iterate on some version of this for my own needs in a standalone image, but I can use the core Dockerfile as the source image for further downstream work. And meanwhile, the core image here will be leaner and easier to maintain.
docker/thrift/Dockerfile
Outdated
ARG HADOOP_VERSION=2.7.7
ARG HADOOP_MINOR_VERSION=2.7
Rather than having to declare two args and keep both in sync, what about calculating HADOOP_MINOR_VERSION from HADOOP_VERSION?
I'm not an expert at bash substitution by any means, but I believe this does the trick:

# Get 2-part minor version string (e.g. `2.7.7` -> `2.7`)
ENV HADOOP_MINOR_VERSION=${HADOOP_VERSION%.*}
Step 4/26 : ARG HADOOP_MINOR_VERSION=${HADOOP_VERSION%.*}
ERROR: Service 'thrift' failed to build: failed to process "${HADOOP_VERSION%.*}": missing ':' in substitution
I don't think this is supported by Docker; curious if/how you made it work.
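(One possible workaround, a sketch rather than anything from this PR: Dockerfile ARG/ENV only understand the limited `${var:-default}` / `${var:+value}` forms, but a full shell is available inside RUN, so the minor version can be derived there; the echo is a placeholder for wherever the value is actually used.)

ARG HADOOP_VERSION=2.7.7
# Bash-style parameter expansion is not supported in ARG/ENV, but RUN executes
# in a shell where the ARG is visible as an environment variable.
RUN HADOOP_MINOR_VERSION="${HADOOP_VERSION%.*}" && \
    echo "Hadoop minor version: ${HADOOP_MINOR_VERSION}"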
docker/thrift/Dockerfile
Outdated
ARG HADOOP_MINOR_VERSION=2.7
ARG HADOOP_HOME=/usr/local/hdp
ARG SPARK_NAME=spark-${SPARK_VERSION}-bin-hadoop${HADOOP_MINOR_VERSION}
ARG SPARK_HOME=/usr/local/spark
Just curious - Do we need SPARK_HOME as an arg, or would a simple ENV do the trick? (I'm not sure what the use case for having this as an ARG would be.)
Same question also for SPARK_NAME, since we already have args for the Spark and Hadoop version strings.
Thanks for the feedback here! I removed some of those ARGs. To clean up the usage of ARG and ENV, I also re-use the base image.
This seems to be a known trick in Docker to propagate the environment, and I'm happy with how it simplified the ARG/ENV usage in the Dockerfile.
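(For illustration, the base-image trick works because ENV values set in a base image carry over to anything built FROM it, while ARG values are scoped to a single build. A minimal sketch with hypothetical image and script names, not the exact Dockerfiles from this PR:)

# base/Dockerfile (hypothetical)
FROM openjdk:8-jre-slim
ENV SPARK_HOME=/usr/local/spark
ENV PATH="${SPARK_HOME}/bin:${PATH}"

# thrift/Dockerfile (hypothetical): inherits SPARK_HOME and PATH from the base
# image, so they do not need to be re-declared as ARGs here.
FROM dbt-spark-base:latest
COPY start-thrift.sh /scripts/start-thrift.sh
CMD ["/scripts/start-thrift.sh"]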
I've left a few comments/questions/suggestions inline with the code, but I see no blockers or glaring issues. I would still wait on approval from someone else on the core team, but for my part, I would be happy to see this move forward.
Also, a general disclaimer: since I am working on another project now, I haven't had time to do any real testing, so I will have to lean on others for testing-based feedback.
Swap Spark instructions and Hadoop instructions; re-use base image to share ENV; remove some ARGs
Co-Authored-By: Aaron Steers <[email protected]>
Hi @aaronsteers, I've updated our Spark Dockerhub image to be able to run a Thrift server: https://github.com/godatadriven-dockerhub/spark
I've included a Thrift server example docker-compose file in the root of the repo.
LGTM, thanks for picking this up @dmateusp 👍
docker-compose.yml
Outdated
services:

  thrift:
    build: docker/thrift
I would be interested in using @NielsZeilemaker's suggestion of using a pre-existing image instead of building one from scratch.
We're also open to PRs :-)
oh of course, didn't know there was something out there
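(In docker-compose.yml terms, the swap being suggested would look roughly like this; a sketch, with the image name and tag assumed for illustration:)

services:
  thrift:
    # Pull a pre-built image instead of building docker/thrift locally
    image: godatadriven/spark:latest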
docker-compose.yml
Outdated
    depends_on:
      - hive-metastore
    volumes:
      - ./.spark-warehouse/:/usr/local/spark/spark-warehouse
Just curious, when would you use this? The data will be mounted onto the root fs, while all the metadata is inside the docker images.
I mount another volume here: https://github.com/fishtown-analytics/dbt-spark/pull/58/files#diff-4e5e90c6228fd48698d074241c2ba760R20
So you have both the metadata and the data persisted locally. I just think it's nicer to know you can run docker-compose down in case something goes wrong but still keep your metadata/data somewhere.
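(Put together, the two bind mounts look roughly like this in the compose file; a sketch based on the directories mentioned here, with the metastore container path being an assumption:)

services:
  thrift:
    volumes:
      # Table data lives on the host, so it survives `docker-compose down`
      - ./.spark-warehouse/:/usr/local/spark/spark-warehouse
  hive-metastore:
    volumes:
      # Metastore database files are persisted on the host as well
      - ./.hive-metastore/:/var/lib/postgresql/data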
Co-Authored-By: Fokko Driesprong <[email protected]>
@Fokko @NielsZeilemaker thanks for sharing the godatadriven image! However, would you look into godatadriven-dockerhub/spark#1? My docker-compose sometimes crashes (typically on the first run, when the database files aren't initialized yet); I solved it in this repository by adding a retry mechanism in the entrypoint. Once that's solved, I don't see any objection to removing the Docker image I created here in favor of the godatadriven-hosted image.
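(That kind of retry mechanism typically looks something like the following in an entrypoint script; a sketch rather than the exact script from this PR, with the metastore host, port, and retry counts as assumptions:)

#!/bin/bash
# entrypoint.sh (hypothetical): wait until the metastore database is reachable
# before handing off to the container's main command.
set -e

MAX_RETRIES=10
for i in $(seq 1 "$MAX_RETRIES"); do
  if nc -z hive-metastore 5432; then
    echo "Metastore is reachable, starting up"
    break
  fi
  echo "Metastore not ready yet (attempt $i/$MAX_RETRIES), retrying in 5s..."
  sleep 5
done

exec "$@"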
@NielsZeilemaker committed a fix: godatadriven-dockerhub/spark@990e234
I cleaned up this PR to reuse godatadriven's image.
Awesome work @dmateusp! LGTM
docker-compose.yml
Outdated
version: "3.7"
services:

  thrift:
Maybe thrift isn't the best name. I'd rather call it Spark, or Spark2. It would be great if we could test DBT against Spark 2 and 3 in the future :-)
I changed the names to be more specific
Really exciting stuff in here, folks! @beckjake Could you give this a once-over when you get a chance?
Would be great if we could run this in CI, without having to wait for the manual integration tests :)
I can look into #61 in a separate PR, hopefully it helps with 0.15.0!
Right on. Additionally, there are a lot of conversations happening on our end currently around how we can write and implement better Spark integration tests. The goal is to find some happy medium between dbt-core integration tests and the proof-of-concept dbt-integration-tests repo.
This looks great to me! I have some tiny suggested docs changes, but with those tweaks to my profiles.yml I was able to get this up and running.
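(For reference, a local profiles.yml along these lines is what points dbt-spark at the docker-compose Thrift server; a sketch, with the profile name, schema, and port being assumptions rather than the exact values from the PR docs:)

dbt_spark:
  target: local
  outputs:
    local:
      type: spark
      method: thrift      # connect to the local Thrift server
      host: localhost
      port: 10000         # assumed Thrift/HiveServer2 port exposed by the compose file
      schema: analytics   # placeholder schema name
      threads: 1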
Co-Authored-By: Jacob Beck <[email protected]>
Co-Authored-By: Jacob Beck <[email protected]>
Looks great, I love this! It was so easy to get spark up and running locally.
Hi there,
Relevant:
@aaronsteers shared his original local environment; I diverged quite a bit from it, so I decided to open a fresh PR.
What this PR adds:
- A docker-compose environment for local/integration testing, with data and metadata persisted locally in ./.spark-warehouse and ./.hive-metastore
- The Spark UI exposed at localhost:4040, where users can see queries being executed

I hope this helps with integration testing and makes it easier for people to get started with dbt-spark or develop the plug-in.
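(For anyone picking this up, bringing the environment up and tearing it down is roughly the following; a sketch, and the service names may differ after the renaming discussed above:)

# Start the Thrift server and the Hive metastore in the background
docker-compose up -d

# Tear everything down; data and metadata are kept in ./.spark-warehouse and ./.hive-metastore
docker-compose down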
cc @Fokko