
DOCS-3198: Add offline data pipelines, SDK docs, hot data store fixes #4440


Merged

Conversation

nathan-contino (Member)

  • adds information about offline data pipelines (docs/data-ai/data/data-pipelines.md)
    • previously had no page in the docs besides some minimal CLI doc discussing pipelines; this introduces that missing page
    • includes examples in all supported languages (Python, Go, TypeScript) for basic data pipeline tasks
      • note that Go snippets link directly to Go API reference -- see SDK docs notes for more info
  • updates generated SDK documentation to include Python and TypeScript data pipelines APIs (no Flutter yet)
    • no Go, because our data page doesn't seem to have or support any Go snippets (I'm guessing there's an out-of-scope story here)
  • yanked hot data store out into its own page (docs/data-ai/data/hot-data-store.md), since:
    • it has a lot in common with data pipelines, which could easily lead to user confusion
    • it was very buried (increasing the likelihood of user confusion)
    • the recent API improvements for data pipelines broke our existing hot data store examples
  • slight reorder of 'Advanced data capture and sync configurations' since some short-but-useful sections were buried all the way at the end of a very long page of complex, niche examples
  • note that the alias for hot data store doesn't work -- leaving it for now in the hopes that someone can suggest a better alternative for relocating a single section of a still-existing page to another page


netlify bot commented Jul 2, 2025

Deploy Preview for viam-docs ready!

Name Link
🔨 Latest commit 599fe80
🔍 Latest deploy log https://app.netlify.com/projects/viam-docs/deploys/6874fc66ba7dd400087ba20e
😎 Deploy Preview https://deploy-preview-4440--viam-docs.netlify.app
Lighthouse
1 paths audited
Performance: 54 (no change from production)
Accessibility: 100 (no change from production)
Best Practices: 100 (no change from production)
SEO: 92 (no change from production)
PWA: 70 (no change from production)

@viambot viambot added the safe to build This pull request is marked safe to build from a trusted zone label Jul 2, 2025
@JessamyT (Collaborator) left a comment

Started reviewing but this is a philosophical one about the extent to which we want to redundantly document the APIs and CLI commands. Feels like a new precedent, since for each item for example "Delete a pipeline," there's a 1:1 matching section in the API docs as well as an example in the CLI page, so will leave to Naomi to review.

@@ -2587,7 +2622,7 @@ User-defined metadata is billed as data.

**Parameters:**

- `robot_id` ([str](https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str)) (required): The ID of the robot with which to associate the user-defined metadata. You can obtain your robot ID from your machine's page.
- `robot_id` ([str](https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str)) (required): The ID of the robot with which to associate the user-defined metadata. You can obtain your robot ID from the machine page.
Collaborator

This seems less helpful

Member Author

This is generated documentation from the SDK doc. I'll file a PR to improve this wording but I think it's out of scope and would hold this PR up for a while while we fix it upstream.

@JessamyT (Collaborator) commented Jul 2, 2025

When you address the merge conflicts with /generated/app.md, note that we changed a couple things manually in #4431 but I created this upstream PR to get them to stick.

@vijayvuyyuru (Member) left a comment

Thanks Nathan!

{{% /tab %}}
{{< /tabs >}}

### Update a pipeline
Member

I'm tempted to not include this in the documentation. We don't want people changing pipeline schedules or queries after a pipeline has started inserting query results. Might end up just being the name that we allow them to update. Is it ok to leave this out?

Member Author

FYI, it'll still be in the SDK docs unless we manually hide it from each SDK. If we think it's dangerous for users to edit existing pipeline queries or schedules, we should either:

  • not provide that functionality
  • warn users about the dangers

So I would strongly prefer to keep documentation in place, with a warning explaining the danger, unless we decide to remove that functionality. Otherwise users will still wind up using the API and Ask AI will scrape the SDK doc and recommend using the method anyway, just without any warnings.

Member Author

I'm adding a warning for now; let me know how you feel about the warning once I've pushed it.

Member

Maybe a question for @jdamon96

Contributor

if we don’t “want” the users to update schedule/pipeline once documents have been written, i think we should just disallow that to be done at all. that said, don't want to block this docs update for now, so we can revisit and edit the docs when/if we actually make that change

@npentrel npentrel self-requested a review July 4, 2025 14:42
@npentrel (Collaborator) commented Jul 4, 2025

(I'll hold off on review until comments are resolved)

@vijayvuyyuru (Member) left a comment

Hot data page looks good to me!

@dmhilly dmhilly requested a review from jdamon96 July 7, 2025 16:36
added link to hot data store mention
CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

{{% /tab %}}
{{< /tabs >}}

To create a pipeline that reads data from the [hot data store](/data-ai/data/hot-data-store/), specify a `dataSourceType` in your pipeline configuration.

{{< alert title="Caution" color="caution" >}}

Avoid specifying an `_id` value in your pipeline's final group stage unless you can guarantee its uniqueness across all pipeline runs.
Member

I probably commented incorrectly in my last review. I believe $group needs an _id. So if users want to include a $group stage, there should be a stage after (such as $project) to remove the _id field.

$group: {
  _id: "$location_id",
  count: { $sum: 1 }
},
$project: {
  location: "$_id",
  count: 1,
  _id: 0,
}
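The pitfall discussed here can be shown with a small pure-Python simulation. This is an illustrative toy model, not Viam's implementation: it mimics a $group stage whose output documents carry the group key as `_id`, an `_id`-keyed upsert into the results collection, and the $project stage that strips `_id` so each pipeline run's documents get fresh keys.

```python
# Toy model of why a pipeline should not end with $group: each run emits
# documents whose _id is the group key, so a later run over new data
# upserts onto the same _id values and overwrites earlier results.

def group_by_location(docs):
    """Simulate {$group: {_id: "$location_id", count: {$sum: 1}}}."""
    counts = {}
    for doc in docs:
        counts[doc["location_id"]] = counts.get(doc["location_id"], 0) + 1
    return [{"_id": key, "count": n} for key, n in counts.items()]

def project_away_id(docs):
    """Simulate {$project: {location: "$_id", count: 1, _id: 0}}."""
    return [{"location": d["_id"], "count": d["count"]} for d in docs]

def upsert(store, docs):
    """Writes keyed on _id collide across runs; docs without one get fresh keys."""
    for doc in docs:
        store[doc.get("_id", object())] = doc  # unique sentinel if _id absent

run1 = [{"location_id": "kitchen"}, {"location_id": "kitchen"}]
run2 = [{"location_id": "kitchen"}]

# Ending with $group: run 2 overwrites run 1's "kitchen" document.
bad_store = {}
upsert(bad_store, group_by_location(run1))
upsert(bad_store, group_by_location(run2))

# Adding $project removes _id, so both runs' results are preserved.
good_store = {}
upsert(good_store, project_away_id(group_by_location(run1)))
upsert(good_store, project_away_id(group_by_location(run2)))
```

With the trailing $project, `good_store` keeps one document per run; without it, `bad_store` silently retains only the latest run's counts.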

Member Author

I see. So the bigger takeaway is that data pipelines should never end with the $group stage.

I'll update the docs accordingly, but I wonder if this issue is dangerous enough that the UI (and maybe the backend?) should either warn users, log a message, or even add a sneaky silent $project stage to rename _id fields to something else. I imagine a lot of data pipelines will wind up using $group, so users are bound to run into this eventually!

@jdamon96 thoughts? If you like one of these product solutions, happy to file a ticket for it (and update the docs once it lands).

@jdamon96 (Contributor) commented Jul 7, 2025

agree it could make sense to address this on a product level; for now i'd prefer to release as-is w/ this docs level warning and wait a bit to see how users interact with the product and use that to inform our solution approach

Member Author

Sounds good. I updated the create example with the $project stage and added more information about this in the subsequent admonition. I'll keep an eye on Ask AI to see if this comes up at all.

Contributor

Great, thank you

@@ -2,7 +2,7 @@
linkTitle: "Cache recent data"
title: "Cache recent data"
weight: 25
description: "Make processed automatically available for faster, simpler queries."
description: "Store the last 24 hours of data in a shared recent-data database, while continuing to write all data to blob storage."
Collaborator

Can you explain please why this says 24h?

Member Author

Sorry, I tried using the line you liked from the old hot data store section (https://github.com/viamrobotics/docs/pull/4440/files#r2192540953) but I should have removed the explicit 24h condition.

Fixing.

Contributor

what does shared mean in this context?

@@ -12,11 +12,11 @@ platformarea: ["data", "cli"]
date: "2024-12-03"
---

The hot data store enables faster queries on recent sensor data.
The hot data store caches the last 24 hours of data in a shared recent-data database, while continuing to write all data to blob storage.
Collaborator

see above

Queries typically execute on blob storage.
To query data from hot data store instead of blob storage, specify hot storage as your data source in your query.
Queries execute on blob storage by default which is slower than queries to a hot data store.
If you have configured a hot data store, you must specify it in any queries as the data source to be used for the query.
Collaborator

Again: Can you explain what happens if some of the data is no longer in the hot storage and I specify that as the data_source? Is there fallback? Either way we should explain that.

Member Author

Reaching out to @dmhilly for a concrete answer.

@dmhilly (Member) commented Jul 9, 2025

Hi! There is no fallback; if you specify hot_data_store and there is no data in the store, you get an empty result set, even if data that matches your query exists in blob storage.
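The no-fallback behavior described here can be sketched as a toy model. All names below are illustrative, not Viam SDK calls: blob storage keeps everything, the hot data store holds only readings inside its retention window, and a query reads from exactly one source.

```python
# Illustrative sketch of "no fallback": a hot-data-store query only sees
# what the hot store currently holds, even when matching data exists in
# blob storage. Names and structures here are assumptions for the sketch.

RETENTION_HOURS = 24

blob_store = [{"hours_ago": 1}, {"hours_ago": 30}]  # blob keeps everything
hot_store = [d for d in blob_store if d["hours_ago"] <= RETENTION_HOURS]

def run_query(predicate, data_source):
    """A query reads from its chosen source only; there is no fallback."""
    source = hot_store if data_source == "hot_data_store" else blob_store
    return [doc for doc in source if predicate(doc)]

# Query for data older than the retention window:
old = lambda d: d["hours_ago"] > RETENTION_HOURS
hot_result = run_query(old, "hot_data_store")   # empty: data has aged out
blob_result = run_query(old, "blob_storage")    # finds the 30-hour-old reading
```

The 30-hour-old reading exists, but the hot-store query returns an empty result set because that document is outside the retention window, which matches the behavior dmhilly describes.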

Contributor

to chime in here: i'm not sure there is anything to explain; it's expected behavior for a query to return exactly what is being asked for (i.e. query = specific question, result = answer), and in this case if the user is querying for data that isn't there, the empty set is that answer

Collaborator

I think the word cache being in the title is what made me question this and is throwing me off. Maybe that needs to change? When I think cache I think if there's a cache miss I'll still get the data, it'll just take longer. Absolutely tell me I'm wrong but if I am stumbling on this, others likely will too. I'd prefer we avoid any doubt for users than leave a possible stumbling block.

Member Author

@npentrel I just removed all usage of the word cache from this page, including the title. What do you think?

Contributor

I personally am a fan of not using cache to describe the hot data store.


## Prerequisites

Before creating a data pipeline, you must enable data capture from at least one component and begin syncing data with Viam.
Collaborator

@nathan-contino why did you resolve this? You didn't change anything. Generally unless it's optional feedback, I'd expect a comment or a change. Resolving comments should in most cases be left to the commenter

linkTitle: "Cache recent data"
title: "Cache recent data"
weight: 25
description: "Cache recent data while continuing to write all data to blob storage."
Collaborator

might provide more context

Suggested change
description: "Cache recent data while continuing to write all data to blob storage."
description: "Cache recent data to allow faster access, while continuing to write all data to blob storage."

Member Author

Reworded a bit differently based on your statements elsewhere regarding the term 'cache'. Hopefully my changes are mostly in the same spirit.

"capture_frequency_hz": 0.5,
"additional_params": {},
"recent_data_store": {
"stored_hours": 24
@npentrel (Collaborator) commented Jul 10, 2025

I asked for the example to stay in the docs in some format because the description added a nice explanation of the feature. That really didn't mean please use this as THE thing to tell people to add - because that doesn't help with explanation. A tab with an example configuration or an example configuration just below the configure section might work.

I really don't think you should tell people to add this, then show an example, without explaning it's an example and that the stored_hours are configurable.

Member Author

Thanks for the explanation. I didn't understand that you were referring to the sentence and the example snippet before. Sorry for misunderstanding.

Reworded this to explicitly state that the configuration is an example, lifted the sentence about stored_hours configuration above the snippet, and added an introductory sentence to help users connect the 24 in the example to 24 stored hours of data.
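For reference, the fragment quoted above sits inside a single capture-method entry. A fuller illustrative shape might look like the following; the `"method": "Readings"` name and the surrounding braces are assumptions for context, since only the inner fields appear in this diff:

```json
{
  "method": "Readings",
  "capture_frequency_hz": 0.5,
  "additional_params": {},
  "recent_data_store": {
    "stored_hours": 24
  }
}
```

Here `stored_hours` sets the hot data store's retention window; as noted in this thread, it is configurable, with 24 shown only as an example value.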

@npentrel npentrel changed the title DOCS-3198: offline data pipelines, SDK docs, hot data store fixes DOCS-3198: Add offline data pipelines, SDK docs, hot data store fixes Jul 10, 2025
@nathan-contino (Member Author)

@npentrel updated based on your feedback, let me know what other suggestions you have so i can keep this moving

@nathan-contino nathan-contino requested a review from npentrel July 11, 2025 12:51
@@ -0,0 +1,164 @@
---
linkTitle: "Speed up queries to recent data"
Collaborator

this is a bit long for the side bar.

Collaborator

Maybe?

Suggested change
linkTitle: "Speed up queries to recent data"
linkTitle: "Optimize recent data queries"

Anything that stays in imperative mood but doesn't spread across two lines

@npentrel (Collaborator) left a comment

LGTM % changing the hot data store link title

@npentrel (Collaborator)

Is this ready to be merged?

@nathan-contino nathan-contino merged commit 8560cd8 into viamrobotics:main Jul 14, 2025
11 of 12 checks passed
@nathan-contino nathan-contino deleted the DOCS-3198-data-pipelines-prose branch July 14, 2025 12:50

🔎💬 Inkeep AI search and chat service is syncing content for source 'Viam Docs'

9 participants