Create a Flintrock repository to host Hadoop and Spark releases #238
Comments
Working with Requester Pays S3 buckets turned out to be more difficult than expected. Using Requester Pays means all requests need to be authenticated and authorized through IAM. This implies a number of things:
To make this all work without requiring users to make a bunch of manual changes related to IAM, Flintrock would need to do some extra work whenever a user tries to launch a cluster without providing explicit download sources for Spark and Hadoop:
It's a fair bit of work, unfortunately. And this is on top of needing to maintain the repository of Spark and Hadoop releases itself.

The alternative to doing all this is to stick to the Apache mirror network as the default download source for both Hadoop and Spark. Flintrock can warn users about the performance and reliability of Apache mirrors and leave it up to them to set up their own alternate download sources if they choose.

For the sake of expediency, I'm now thinking of pursuing this less exciting strategy first. It will enable Flintrock users to launch Spark 2.3 clusters, and it takes much less work. Once that's in place, I can revisit the grander strategy of maintaining a repository of Spark and Hadoop releases.

If you're reading along and have questions or suggestions, please chime in. I wish there were a better solution. Short of Apache hosting Spark and Hadoop releases on a fast and modern CDN, this is the best I could come up with.
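To make the fallback strategy concrete, here is a minimal sketch of how a tool could prefer a user-supplied download source and warn when it falls back to the Apache mirrors. This is not Flintrock's actual code: the function name `resolve_download_source`, the constant `APACHE_MIRROR_TEMPLATE`, the mirror URL layout, and the `{v}` version placeholder are all assumptions made for illustration.

```python
import warnings
from typing import Optional

# Hypothetical default: an Apache mirror URL template. Both the URL layout
# and the "{v}" version placeholder are assumptions made for this sketch.
APACHE_MIRROR_TEMPLATE = (
    "https://www.apache.org/dyn/closer.lua/spark/spark-{v}/"
    "spark-{v}-bin-hadoop2.7.tgz?action=download"
)


def resolve_download_source(version: str, user_source: Optional[str] = None) -> str:
    """Return the URL to download from, preferring an explicit user source."""
    if user_source:
        # The user configured their own source, so no warning is needed.
        return user_source.format(v=version)
    # No explicit source: fall back to the mirrors and say so.
    warnings.warn(
        "No download source given; falling back to the Apache mirror network, "
        "which may be slow or unreliable. Consider configuring your own source.",
        UserWarning,
    )
    return APACHE_MIRROR_TEMPLATE.format(v=version)
```

Calling `resolve_download_source("2.3.0")` emits the warning and returns a mirror URL, while passing a second argument such as `"https://example.com/spark/{v}/spark.tgz"` returns that URL with the version filled in and no warning.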
Since its creation, Flintrock has sourced Spark releases from `s3://spark-related-packages`, an S3 bucket hosted by the AMPLab and kept up-to-date by the Apache Spark project. As of Spark 2.2.1, the Spark committers have confirmed that this bucket will no longer receive updates (alternate reference).

This is a big change for Flintrock's out-of-the-box experience. Users today can configure Flintrock to download Spark from a custom location via the `--spark-download-source` option, but by default Flintrock downloads Spark from `s3://spark-related-packages`. This gives users a fast, reliable, and convenient source of Spark releases to use with Flintrock without users needing to do any work. Now that the bucket is being retired, we're stuck with the Apache mirror network as a default download source. Flintrock already uses Apache mirrors as a default source for Hadoop, and as Flintrock users know, they are slow and often unreliable (#66).

To preserve a strong out-of-the-box experience for Flintrock, I have begrudgingly decided to maintain a repository of Spark and Hadoop releases on S3 for use with Flintrock. I am loath to maintain new infrastructure, but in the absence of a fast CDN hosting public Spark and Hadoop releases, I think this is the only way.
To summarize the changes I plan to make:
[...] `s3://spark-related-packages`.
[...] `--spark-download-source` and `--hdfs-download-source`.

When this change is complete, Flintrock will no longer depend on external sources for Spark and Hadoop, and clusters that use Hadoop will launch faster by default since they will now download Hadoop from S3 as opposed to the Apache mirror network.
Thank you to the AMPLab and to the Apache Spark project for graciously hosting Spark releases on S3 for as long as they did (and footing the bill!), and to Matei for the suggestion to use a Requester Pays bucket with Flintrock.