DOCS-3198: Add offline data pipelines, SDK docs, hot data store fixes #4440
Conversation
Started reviewing, but this is a philosophical one about the extent to which we want to redundantly document the APIs and CLI commands. It feels like a new precedent: each item (for example, "Delete a pipeline") has a 1:1 matching section in the API docs as well as an example in the CLI page. I'll leave this to Naomi to review.
@@ -2587,7 +2622,7 @@ User-defined metadata is billed as data.

**Parameters:**

- `robot_id` ([str](https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str)) (required): The ID of the robot with which to associate the user-defined metadata. You can obtain your robot ID from your machine's page.
- `robot_id` ([str](https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str)) (required): The ID of the robot with which to associate the user-defined metadata. You can obtain your robot ID from the machine page.
This seems less helpful
This is generated documentation from the SDK doc. I'll file a PR to improve this wording but I think it's out of scope and would hold this PR up for a while while we fix it upstream.
static/include/app/apis/overrides/protos/data.ListDataPipelines.md
static/include/app/apis/overrides/protos/data.GetDataPipeline.md
When you address the merge conflicts with /generated/app.md, note that we changed a couple of things manually in #4431, but I created this upstream PR to get them to stick.
Thanks Nathan!
static/include/app/apis/overrides/protos/data.ListDataPipelineRuns.md
static/include/app/apis/overrides/protos/data.ListDataPipelines.md
{{% /tab %}}
{{< /tabs >}}

### Update a pipeline
I'm tempted to not include this in the documentation. We don't want people changing pipeline schedules or queries after a pipeline has started inserting query results. Might end up just being the name that we allow them to update. Is it ok to leave this out?
FYI, it'll still be in the SDK docs unless we manually hide it from each SDK. If we think it's dangerous for users to edit existing pipeline queries or schedules, we should either:
- not provide that functionality
- warn users about the dangers
So I would strongly prefer to keep documentation in place, with a warning explaining the danger, unless we decide to remove that functionality. Otherwise users will still wind up using the API and Ask AI will scrape the SDK doc and recommend using the method anyway, just without any warnings.
I'm adding a warning for now; let me know how you feel about the warning once I've pushed it.
Maybe a question for @jdamon96
if we don’t “want” the users to update schedule/pipeline once documents have been written, i think we should just disallow that to be done at all. that said, don't want to block this docs update for now, so we can revisit and edit the docs when/if we actually make that change
(I'll hold off on review until comments are resolved)
Hot data page looks good to me!
added link to hot data store mention
{{% /tab %}}
{{< /tabs >}}

To create a pipeline that reads data from the [hot data store](/data-ai/data/hot-data-store/), specify a `dataSourceType` in your pipeline configuration.

{{< alert title="Caution" color="caution" >}}

Avoid specifying an `_id` value in your pipeline's final group stage unless you can guarantee its uniqueness across all pipeline runs.
I probably commented incorrectly in my last review. I believe $group needs an _id. So if users want to include a $group stage, there should be a stage after (such as $project) to remove the _id field.
```
{
  $group: {
    _id: "$location_id",
    count: { $sum: 1 }
  }
},
{
  $project: {
    location: "$_id",
    count: 1,
    _id: 0
  }
}
```
I see. So the bigger takeaway is that data pipelines should never end with the `$group` stage.
I'll update the docs accordingly, but I wonder if this issue is dangerous enough that the UI (and maybe the backend?) should either warn users, log a message, or even add a sneaky silent `$project` stage to rename `_id` fields to something else. I imagine a lot of data pipelines will wind up using `$group`, so users are bound to run into this eventually!
@jdamon96 thoughts? If you like one of these product solutions, happy to file a ticket for it (and update the docs once it lands).
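Put together, the pattern under discussion is a `$group` stage whose `_id` is immediately stripped by a trailing `$project`, so the pipeline never ends on `$group`. A sketch of a full pipeline in that shape (the `$match` filter and field names are illustrative, not necessarily the exact example that landed in the docs):

```json
[
  { "$match": { "component_name": "sensor-1" } },
  {
    "$group": {
      "_id": "$location_id",
      "count": { "$sum": 1 }
    }
  },
  {
    "$project": {
      "location": "$_id",
      "count": 1,
      "_id": 0
    }
  }
]
```

The final `$project` renames the grouped `_id` to `location` and excludes `_id` from the output, so each pipeline run produces documents without a caller-supplied `_id` that could collide across runs.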
agree it could make sense to address this on a product level; for now i'd prefer to release as-is w/ this docs level warning and wait a bit to see how users interact with the product and use that to inform our solution approach
Sounds good. I updated the create example with the `$project` stage and added more information about this in the subsequent admonition. I'll keep an eye on Ask AI to see if this comes up at all.
Great, thank you
Co-authored-by: Naomi Pentrel <[email protected]>
docs/data-ai/data/hot-data-store.md
@@ -2,7 +2,7 @@
linkTitle: "Cache recent data"
title: "Cache recent data"
weight: 25
description: "Make processed automatically available for faster, simpler queries."
description: "Store the last 24 hours of data in a shared recent-data database, while continuing to write all data to blob storage."
Can you explain please why this says 24h?
Sorry, I tried using the line you liked from the old hot data store section (https://github.com/viamrobotics/docs/pull/4440/files#r2192540953) but I should have removed the explicit 24h condition.
Fixing.
what does shared mean in this context?
docs/data-ai/data/hot-data-store.md
@@ -12,11 +12,11 @@ platformarea: ["data", "cli"]
date: "2024-12-03"
---

The hot data store enables faster queries on recent sensor data.
The hot data store caches the last 24 hours of data in a shared recent-data database, while continuing to write all data to blob storage.
see above
Queries typically execute on blob storage.
To query data from the hot data store instead of blob storage, specify hot storage as your data source in your query.
Queries execute on blob storage by default, which is slower than queries to a hot data store.
If you have configured a hot data store, you must specify it as the data source in any query that should use it.
Again: Can you explain what happens if some of the data is no longer in the hot storage and I specify that as the data_source? Is there fallback? Either way we should explain that.
Reaching out to @dmhilly for a concrete answer.
Hi! There is no fallback; if you specify hot_data_store and there is no data in the store, you get an empty result set, even if data that matches your query exists in blob storage.
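To make the no-fallback semantics concrete, here is a minimal in-memory sketch. The `query` helper and the store contents are hypothetical illustrations of the behavior described above, not the Viam SDK API:

```python
# Hypothetical model of the two stores. The hot data store holds only
# recent documents; blob storage holds everything ever synced.
hot_store = [{"ts": 100, "temp": 21.5}]
blob_store = [{"ts": 1, "temp": 19.0}, {"ts": 100, "temp": 21.5}]

def query(store, predicate):
    """Return matching documents from exactly the store that was asked.

    There is no fallback: a miss in one store never triggers a lookup
    in the other.
    """
    return [doc for doc in store if predicate(doc)]

# A query for old data against the hot store returns an empty result set,
# even though a matching document exists in blob storage.
is_old = lambda doc: doc["ts"] < 50
print(query(hot_store, is_old))   # []
print(query(blob_store, is_old))  # [{'ts': 1, 'temp': 19.0}]
```

This is why the choice of data source matters: the empty result is the correct answer to the question asked, not an error.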
to chime in here: i'm not sure there is anything to explain; it's expected behavior for a query to return exactly what is being asked for (i.e. query = specific question, result = answer), and in this case if the user is querying for data that isn't there, the empty set is that answer
I think the word `cache` being in the title is what made me question this and is throwing me off. Maybe that needs to change? When I think cache, I think that if there's a cache miss I'll still get the data; it'll just take longer. Absolutely tell me I'm wrong, but if I am stumbling on this, others likely will too. I'd prefer we remove any doubt for users rather than leave a possible stumbling block.
@npentrel I just removed all usage of the word cache from this page, including the title. What do you think?
I personally am a fan of not using cache to describe the hot data store.
docs/data-ai/data/data-pipelines.md
## Prerequisites

Before creating a data pipeline, you must enable data capture from at least one component and begin syncing data with Viam.
@nathan-contino why did you resolve this? You didn't change anything. Generally unless it's optional feedback, I'd expect a comment or a change. Resolving comments should in most cases be left to the commenter
docs/data-ai/data/hot-data-store.md
linkTitle: "Cache recent data"
title: "Cache recent data"
weight: 25
description: "Cache recent data while continuing to write all data to blob storage."
might provide more context
description: "Cache recent data while continuing to write all data to blob storage."
description: "Cache recent data to allow faster access, while continuing to write all data to blob storage."
Reworded a bit differently based on your statements elsewhere regarding the term 'cache'. Hopefully my changes are mostly in the same spirit.
"capture_frequency_hz": 0.5,
"additional_params": {},
"recent_data_store": {
  "stored_hours": 24
}
I asked for the example to stay in the docs in some format because the description added a nice explanation of the feature. That really didn't mean please use this as THE thing to tell people to add - because that doesn't help with explanation. A tab with an example configuration or an example configuration just below the configure section might work.
I really don't think you should tell people to add this, then show an example, without explaining that it's an example and that the stored_hours are configurable.
Thanks for the explanation. I didn't understand that you were referring to the sentence and the example snippet before. Sorry for misunderstanding.
Reworded this to explicitly state that the configuration is an example, lifted the sentence about `stored_hours` configuration above the snippet, and added an introductory sentence to help users connect the 24 in the example to 24 stored hours of data.
@npentrel updated based on your feedback, let me know what other suggestions you have so i can keep this moving
docs/data-ai/data/hot-data-store.md
@@ -0,0 +1,164 @@
--- | |||
linkTitle: "Speed up queries to recent data" |
this is a bit long for the sidebar.
Maybe?
linkTitle: "Speed up queries to recent data"
linkTitle: "Optimize recent data queries"
Anything that stays in imperative mood but doesn't spread across two lines
LGTM % changing the hot data store link title
Is this ready to be merged?
docs/data-ai/data/data-pipelines.md
docs/data-ai/data/hot-data-store.md