diff --git a/pkg/samplerepo/assets/sample/README.md.tmpl b/pkg/samplerepo/assets/sample/README.md.tmpl index 023d75baf34..3926f649f68 100644 --- a/pkg/samplerepo/assets/sample/README.md.tmpl +++ b/pkg/samplerepo/assets/sample/README.md.tmpl @@ -10,9 +10,9 @@ With lakeFS, you can use concepts on your data lake such as **branch** to create This quickstart will introduce you to some of the core ideas in lakeFS and show what you can do by illustrating the concept of branching, merging, and rolling back changes to data. It's laid out in four short sections. -* ![Query icon](images/quickstart-step-01-query.png) [Query](#query) the pre-populated data on the `main` branch +* ![Query icon](images/quickstart-step-01-query.png) [Query](#query) the pre-populated data on the default branch (`{{.RepoDefaultBranch}}`) * ![Branch icon](images/quickstart-step-02-branch.png) [Make changes](#branch) to the data on a new branch -* ![Merge icon](images/quickstart-step-03-merge.png) [Merge](#commit-and-merge) the changed data back to the `main` branch +* ![Merge icon](images/quickstart-step-03-merge.png) [Merge](#commit-and-merge) the changed data back to the default branch * ![Rollback icon](images/quickstart-step-04-rollback.png) [Change our mind](#rollback) and rollback the changes * ![Actions and Hooks icon](images/quickstart-step-05-actions.png) Learn about [actions and hooks](#actions-and-hooks) in lakeFS @@ -64,15 +64,17 @@ Follow the prompts to enter your credentials that you created when you first set _We'll start off by querying the sample data to orient ourselves around what it is we're working with. The lakeFS server has been loaded with a sample parquet datafile. Fittingly enough for a piece of software to help users of data lakes, the `lakes.parquet` file holds data about lakes around the world._ -_You'll notice that the branch is set to `main`. 
This is conceptually the same as your main branch in Git against which you develop software code._ +_You'll notice that the branch is set to `{{.RepoDefaultBranch}}`. This is your default branch in lakeFS, conceptually the same as your main branch in Git against which you develop software code._ -The lakeFS objects list with a highlight to indicate that the branch is set to main. +_Note: Screenshots in this guide show `main` as the default branch name._ -_Let's have a look at the data, ahead of making some changes to it on a branch in the following steps._. +The lakeFS Objects tab showing the branch selector set to the default branch (main shown here). -Click on [`lakes.parquet`](object?ref=main&path=lakes.parquet) from the object browser and notice that the built-it DuckDB runs a query to show a preview of the file's contents. +_Let's have a look at the data, ahead of making some changes to it on a branch in the following steps._ -The lakeFS object viewer with embedded DuckDB to query parquet files. A query has run automagically to preview the contents of the selected parquet file. +Click on [`lakes.parquet`](object?ref={{.RepoDefaultBranch}}&path=lakes.parquet) from the object browser and notice that the built-it DuckDB runs a query to show a preview of the file's contents. + +The lakeFS object viewer with embedded DuckDB to query parquet files. A query has run automagically to preview the contents of the selected parquet file. _Now we'll run our own query on it to look at the top five countries represented in the data_. @@ -80,15 +82,15 @@ Copy and paste the following SQL statement into the DuckDB query panel and click ```sql SELECT country, COUNT(*) -FROM READ_PARQUET('lakefs://{{.RepoName}}/main/lakes.parquet') +FROM READ_PARQUET('lakefs://{{.RepoName}}/{{.RepoDefaultBranch}}/lakes.parquet') GROUP BY country ORDER BY COUNT(*) DESC LIMIT 5; ``` -An embedded DuckDB query showing a count of rows per country in the dataset. 
+An embedded DuckDB query showing a count of rows per country in the dataset. -_Next we're going to make some changes to the data—but on a development branch so that the data in the main branch remains untouched._ +_Next we're going to make some changes to the data—but on a development branch so that the data in the `{{.RepoDefaultBranch}}` branch remains untouched._ # Create a Branch 🪓 @@ -120,13 +122,13 @@ In a new terminal window run the following: docker exec lakefs \ lakectl branch create \ lakefs://{{.RepoName}}/denmark-lakes \ - --source lakefs://{{.RepoName}}/main + --source lakefs://{{.RepoName}}/{{.RepoDefaultBranch}} ``` _You should get a confirmation message like this:_ ```bash -Source ref: lakefs://{{.RepoName}}/main +Source ref: lakefs://{{.RepoName}}/{{.RepoDefaultBranch}} created branch 'denmark-lakes' 3384cd7cdc4a2cd5eb6249b52f0a709b49081668bb1574ce8f1ef2d956646816 ``` @@ -141,13 +143,13 @@ In a new terminal window run the following: ```bash lakectl branch create \ lakefs://{{.RepoName}}/denmark-lakes \ - --source lakefs://{{.RepoName}}/main + --source lakefs://{{.RepoName}}/{{.RepoDefaultBranch}} ``` _You should get a confirmation message like this:_ ```bash -Source ref: lakefs://{{.RepoName}}/main +Source ref: lakefs://{{.RepoName}}/{{.RepoDefaultBranch}} created branch 'denmark-lakes' 3384cd7cdc4a2cd5eb6249b52f0a709b49081668bb1574ce8f1ef2d956646816 ``` @@ -158,9 +160,9 @@ _Now we'll make a change to the data. lakeFS has several native clients, as well We're going to use DuckDB which is embedded within the web interface of lakeFS. 
-From the lakeFS [**Objects** page](/repositories/{{.RepoName}}/objects?ref=main) select the [`lakes.parquet`](/repositories/{{.RepoName}}/object?ref=main&path=lakes.parquet) file to open the DuckDB editor: +From the lakeFS [**Objects** page](/repositories/{{.RepoName}}/objects?ref={{.RepoDefaultBranch}}) select the [`lakes.parquet`](/repositories/{{.RepoName}}/object?ref={{.RepoDefaultBranch}}&path=lakes.parquet) file to open the DuckDB editor: -The lakeFS object viewer with embedded DuckDB to query parquet files. A query has run automagically to preview the contents of the selected parquet file. +The lakeFS object viewer with embedded DuckDB to query parquet files. A query has run automagically to preview the contents of the selected parquet file. To start with, we'll load the lakes data into a DuckDB table so that we can manipulate it. Replace the previous text in the DuckDB editor with this: @@ -181,7 +183,7 @@ ORDER BY COUNT(*) DESC LIMIT 5; ``` -The DuckDB editor pane querying the lakes table +The DuckDB editor pane querying the lakes table ### Making a Change to the Data @@ -192,7 +194,7 @@ Now we can change our table, which was loaded from the original `lakes.parquet`, DELETE FROM lakes WHERE Country != 'Denmark'; ``` -The DuckDB editor pane deleting rows from the lakes table +The DuckDB editor pane deleting rows from the lakes table We can verify that it's worked by reissuing the same query as before: @@ -204,19 +206,19 @@ ORDER BY COUNT(*) DESC LIMIT 5; ``` -The DuckDB editor pane querying the lakes table showing only rows for Denmark remain +The DuckDB editor pane querying the lakes table showing only rows for Denmark remain ## Write the Data back to lakeFS _The changes so far have only been to DuckDB's copy of the data. 
Let's now push it back to lakeFS._ -_Note the lakeFS path is different this time as we're writing it to the `denmark-lakes` branch, not `main`._ +_Note the lakeFS path is different this time as we're writing it to the `denmark-lakes` branch, not `{{.RepoDefaultBranch}}`._ ```sql COPY lakes TO 'lakefs://{{.RepoName}}/denmark-lakes/lakes.parquet'; ``` -The DuckDB editor pane writing data back to the denmark-lakes branch +The DuckDB editor pane writing data back to the denmark-lakes branch ## Verify that the Data's Changed on the Branch @@ -234,29 +236,29 @@ ORDER BY COUNT(*) DESC LIMIT 5; ``` -The DuckDB editor pane show the parquet file on denmark-lakes branch has been changed +The DuckDB editor pane show the parquet file on denmark-lakes branch has been changed -## What about the data in `main`? +## What about the data in `{{.RepoDefaultBranch}}`? -_So we've changed the data in our `denmark-lakes` branch, deleting swathes of the dataset. What's this done to our original data in the `main` branch? Absolutely nothing!_ +_So we've changed the data in our `denmark-lakes` branch, deleting swathes of the dataset. What's this done to our original data in the `{{.RepoDefaultBranch}}` branch? Absolutely nothing!_ -See for yourself by returning to [the lakeFS object view](object?ref=main&path=lakes.parquet) and re-running the same query: +See for yourself by returning to [the lakeFS object view](object?ref={{.RepoDefaultBranch}}&path=lakes.parquet) and re-running the same query: ```sql SELECT country, COUNT(*) -FROM READ_PARQUET('lakefs://{{.RepoName}}/main/lakes.parquet') +FROM READ_PARQUET('lakefs://{{.RepoName}}/{{.RepoDefaultBranch}}/lakes.parquet') GROUP BY country ORDER BY COUNT(*) DESC LIMIT 5; ``` -The lakeFS object browser showing DuckDB querying lakes.parquet on the main branch. The results are the same as they were before we made the changes to the denmark-lakes branch, which is as expected. 
+The lakeFS object browser showing DuckDB querying lakes.parquet on the default branch (main shown here). The results are the same as they were before we made the changes to the denmark-lakes branch, which is as expected. -_In the next step we'll see how to merge our branch back into main._ +_In the next step we'll see how to merge our branch back into `{{.RepoDefaultBranch}}`._ # Committing Changes in lakeFS 🤝🏻 -_In the previous step we branched our data from `main` into a new `denmark-lakes` branch, and overwrote the `lakes.parquet` to hold solely information about lakes in Denmark. Now we're going to commit that change (just like Git) and merge it back to main (just like Git)._ +_In the previous step we branched our data from `{{.RepoDefaultBranch}}` into a new `denmark-lakes` branch, and overwrote the `lakes.parquet` to hold solely information about lakes in Denmark. Now we're going to commit that change (just like Git) and merge it back to `{{.RepoDefaultBranch}}` (just like Git)._ _Having make the change to the datafile in the `denmark-lakes` branch, we now want to commit it. There are various options for interacting with lakeFS' API, including the web interface, [a Python client](https://pydocs.lakefs.io/), and `lakectl`._ @@ -325,7 +327,7 @@ Parents: 3384cd7cdc4a2cd5eb6249b52f0a709b49081668bb1574ce8f1ef2d956646816 -_With our change committed, it's now time to merge it to back to the `main` branch._ +_With our change committed, it's now time to merge it to back to the `{{.RepoDefaultBranch}}` branch._ # Merging Branches in lakeFS 🔀 @@ -334,7 +336,7 @@ _As with most operations in lakeFS, merging can be done through a variety of int
 Web UI
 
-1. Click [here](./compare?ref=main&compare=denmark-lakes), or manually go to the **Compare** tab and set the **Compared to branch** to `denmark-lakes`.
+1. Click [here](./compare?ref={{.RepoDefaultBranch}}&compare=denmark-lakes), or manually go to the **Compare** tab and set the **Compared to branch** to `denmark-lakes`.
 
    ![Merge dialog in lakeFS](images/merge01.png)
 
@@ -355,7 +357,7 @@ Run this from a terminal window.
 docker exec lakefs \
   lakectl merge \
     lakefs://{{.RepoName}}/denmark-lakes \
-    lakefs://{{.RepoName}}/main
+    lakefs://{{.RepoName}}/{{.RepoDefaultBranch}}
 ```
@@ -370,15 +372,15 @@ Run this from a terminal window.
 ```bash
 lakectl merge \
   lakefs://{{.RepoName}}/denmark-lakes \
-  lakefs://{{.RepoName}}/main
+  lakefs://{{.RepoName}}/{{.RepoDefaultBranch}}
 ```
 
-_We can confirm that this has worked by returning to the same object view of [`lakes.parquet`](object?ref=main&path=lakes.parquet) as before and clicking on **Execute** to rerun the same query. You'll see that the country row counts have changed, and only Denmark is left in the data._
+_We can confirm that this has worked by returning to the same object view of [`lakes.parquet`](object?ref={{.RepoDefaultBranch}}&path=lakes.parquet) as before and clicking on **Execute** to rerun the same query. You'll see that the country row counts have changed, and only Denmark is left in the data._
 
-The lakeFS object browser with a DuckDB query on lakes.parquet showing that there is only data for Denmark.
+The lakeFS object browser with a DuckDB query on lakes.parquet showing that there is only data for Denmark.
 
 **But…oh no!** 😬 A slow chill creeps down your spine, and the bottom drops out of your stomach. What have you done! 😱 *You were supposed to create **a separate file** of Denmark's lakes - not replace the original one* 🤦🏻🤦🏻‍♀.
 
@@ -389,7 +391,7 @@ _Have no fear; lakeFS can revert changes. 
Keep reading for the final part of the # Rolling back Changes in lakeFS ↩️ -_Our intrepid user (you) merged a change back into the `main` branch and realised that they had made a mistake 🤦🏻._ +_Our intrepid user (you) merged a change back into the `{{.RepoDefaultBranch}}` branch and realised that they had made a mistake 🤦🏻._ _The good news for them (you) is that lakeFS can revert changes made, similar to how you would in Git 😅._ @@ -401,15 +403,15 @@ From your terminal window run `lakectl` with the `revert` command: ```bash docker exec -it lakefs \ lakectl branch revert \ - lakefs://{{.RepoName}}/main \ - main --parent-number 1 --yes + lakefs://{{.RepoName}}/{{.RepoDefaultBranch}} \ + {{.RepoDefaultBranch}} --parent-number 1 --yes ``` _You should see a confirmation of a successful rollback:_ ```bash -Branch: lakefs://{{.RepoName}}/main -commit main successfully reverted +Branch: lakefs://{{.RepoName}}/{{.RepoDefaultBranch}} +commit {{.RepoDefaultBranch}} successfully reverted ``` @@ -421,22 +423,22 @@ From your terminal window run `lakectl` with the `revert` command: ```bash lakectl branch revert \ - lakefs://{{.RepoName}}/main \ - main --parent-number 1 --yes + lakefs://{{.RepoName}}/{{.RepoDefaultBranch}} \ + {{.RepoDefaultBranch}} --parent-number 1 --yes ``` _You should see a confirmation of a successful rollback:_ ```bash -Branch: lakefs://{{.RepoName}}/main -commit main successfully reverted +Branch: lakefs://{{.RepoName}}/{{.RepoDefaultBranch}} +commit {{.RepoDefaultBranch}} successfully reverted ``` Back in the object page and the DuckDB query we can see that the original file is now back to how it was. -The lakeFS object viewer with DuckDB query showing that the lakes dataset on main branch has been successfully returned to state prior to the merge. +The lakeFS object viewer with DuckDB query showing that the lakes dataset on the default branch has been successfully returned to state prior to the merge. 
@@ -462,7 +464,7 @@ _Hooks_ can be either a Lua script that lakeFS will execute itself, an external docker exec lakefs \ lakectl branch create \ lakefs://quickstart/add_action \ - --source lakefs://quickstart/main + --source lakefs://quickstart/{{.RepoDefaultBranch}} ``` 1. Open up your favorite text editor (or emacs), and paste the following YAML: @@ -520,13 +522,13 @@ _Hooks_ can be either a Lua script that lakeFS will execute itself, an external 1. Go to the **Uncommitted Changes** tab in the UI, and make sure that you see the new file in the path shown: - lakeFS Uncommitted Changes view showing a file called `check_commit_metadata.yml` under the path `_lakefs_actions/` + lakeFS Uncommitted Changes view showing a file called `check_commit_metadata.yml` under the path `_lakefs_actions/` Click **Commit Changes** and enter a suitable message to commit this new file to the branch. -1. Now we'll merge this new branch into `main`. From the **Compare** tab in the UI compare the `main` branch with `add_action` and click **Merge** +1. Now we'll merge this new branch into `{{.RepoDefaultBranch}}`. From the **Compare** tab in the UI compare the `{{.RepoDefaultBranch}}` branch with `add_action` and click **Merge** - lakeFS Compare view showing the difference between `main` and `add_action` branches + lakeFS Compare view showing the difference between the default and `add_action` branches ## Testing the Action @@ -540,11 +542,11 @@ Let's remind ourselves what the rules are that the action is going to enforce. We'll start by creating a branch that's going to match the `etl` pattern, and then go ahead and commit a change and see how the action works. -1. Create a new branch (see above instructions on how to do this if necessary) called `etl_20230504`. Make sure you use `main` as the source branch. +1. Create a new branch (see above instructions on how to do this if necessary) called `etl_20230504`. Make sure you use `{{.RepoDefaultBranch}}` as the source branch. 
In your new branch you should see the action that you created and merged above: - lakeFS branch etl_20230504 with object /_lakefs_actions/check_commit_metadata.yml + lakeFS branch etl_20230504 with object /_lakefs_actions/check_commit_metadata.yml 1. To simulate an ETL job we'll use the built-in DuckDB editor to run some SQL and write the result back to the lakeFS branch. @@ -562,7 +564,7 @@ We'll start by creating a branch that's going to match the `etl` pattern, and th 1. Head to the **Uncommitted Changes** tab in the UI and notice that there is now a file called `top10_lakes.parquet` waiting to be committed. - lakeFS branch etl_20230504 with uncommitted file top10_lakes.parquet + lakeFS branch etl_20230504 with uncommitted file top10_lakes.parquet Now we're ready to start trying out the commit rules, and seeing what happens if we violate them. @@ -576,11 +578,11 @@ We'll start by creating a branch that's going to match the `etl` pattern, and th `❌ A commit message must be provided` - lakeFS blocking an attempt to commit with no commit message + lakeFS blocking an attempt to commit with no commit message 1. Do the same as the previous step, but provide a message this time: - A commit to lakeFS with commit message in place + A commit to lakeFS with commit message in place The commit still fails as we need to include metadata too, which is what the error tells us @@ -588,7 +590,7 @@ We'll start by creating a branch that's going to match the `etl` pattern, and th 1. 
Repeat the **Commit Changes** dialog and use the **Add Metadata field** to add the required metadata: - A commit to lakeFS with commit message and metadata in place + A commit to lakeFS with commit message and metadata in place We're almost there, but this still fails (as it should), since the version is not entirely numeric but includes `v` and `ß`: @@ -596,20 +598,20 @@ We'll start by creating a branch that's going to match the `etl` pattern, and th Repeat the commit attempt specify the version as `1.00` this time, and rejoice as the commit succeeds - Commit history in lakeFS showing that the commit met the rules set by the action and completed successfully. + Commit history in lakeFS showing that the commit met the rules set by the action and completed successfully. --- You can view the history of all action runs from the **Action** tab: -Action run history in lakeFS +Action run history in lakeFS ## Bonus Challenge And so with that, this quickstart for lakeFS draws to a close. If you're simply having _too much fun_ to stop then here's an exercise for you. -Implement the requirement from above *correctly*, such that you write `denmark-lakes.parquet` in the respective branch and successfully merge it back into main. Look up how to list the contents of the `main` branch and verify that it looks like this: +Implement the requirement from above *correctly*, such that you write `denmark-lakes.parquet` in the respective branch and successfully merge it back into `{{.RepoDefaultBranch}}`. 
Look up how to list the contents of the `{{.RepoDefaultBranch}}` branch and verify that it looks like this:
 
 ```bash
 object 2023-03-21 17:33:51 +0000 UTC 20.9 kB denmark-lakes.parquet
diff --git a/pkg/samplerepo/samplecontent.go b/pkg/samplerepo/samplecontent.go
index 56786c24dfb..36101dcdafb 100644
--- a/pkg/samplerepo/samplecontent.go
+++ b/pkg/samplerepo/samplecontent.go
@@ -31,7 +31,8 @@ func PopulateSampleRepo(ctx context.Context, repo *catalog.Repository, cat *cata
 	// we also skip checking if the file exists, since we know the repo is empty
 	const tmplSuffix = ".tmpl"
 	config := map[string]string{
-		"RepoName": repo.Name,
+		"RepoName":          repo.Name,
+		"RepoDefaultBranch": repo.DefaultBranch,
 	}
 
 	err := fs.WalkDir(assets.SampleData, sampleRepoFSRootPath, func(p string, d fs.DirEntry, topLevelErr error) error {
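
Reviewer note: the sketch below shows how the new `RepoDefaultBranch` key in `config` would reach the `{{.RepoDefaultBranch}}` placeholders in `README.md.tmpl`. It assumes the `.tmpl` assets are rendered with Go's standard `text/template` package, which the diff does not show. `renderSample` and the `missingkey=error` option are hypothetical choices for illustration, not code from this PR, and the sample values `quickstart` and `trunk` are made up.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// renderSample is a hypothetical helper (not part of the PR). It renders a
// template string against a map shaped like the `config` map in the diff.
func renderSample(tmplText string, config map[string]string) (string, error) {
	// missingkey=error turns a placeholder with no matching map key into an
	// execution error instead of silently rendering an empty string.
	t, err := template.New("sample").Option("missingkey=error").Parse(tmplText)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, config); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	// Keys match the diff; the values are invented for this example.
	config := map[string]string{
		"RepoName":          "quickstart",
		"RepoDefaultBranch": "trunk",
	}
	out, err := renderSample("lakefs://{{.RepoName}}/{{.RepoDefaultBranch}}/lakes.parquet", config)
	if err != nil {
		fmt.Println("render failed:", err)
		return
	}
	fmt.Println(out) // prints "lakefs://quickstart/trunk/lakes.parquet"
}
```

Under these assumptions, a repository whose default branch is not `main` gets links, SQL paths and `lakectl` commands that point at its actual default branch. Any placeholder added to the template without a matching key in `config` would fail at render time rather than produce a broken path.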