File storage API is missing an atomic-write primitive — every ability persisting JSON state has a tearing window #253

@realdecimalist

Hi — while refactoring the orynq-ai-auditability ability for #249, we hit a durability gap in the file-storage API that probably affects every community ability doing JSON persistence.

The gap

check_if_file_exists, read_file, write_file, and delete_file are the only storage primitives documented in docs/OpenHome_SDK_Reference.md §8. There is no atomic-replace:

  • No replace_file(src, dst) / rename equivalent
  • No atomic_write_file(path, contents) helper
  • No fsync-after-write guarantee

The docs prescribe delete-then-write or mode="w" for overwrites. Both have a crash window: if the device loses power, the app is killed, or the write fails partway through, the file ends up empty / truncated / corrupt, and the ability boots with zero state on next run.

For abilities that maintain append-only audit chains, monotonic counters, or any state where "lose half of it" is worse than "have the old version," that window is a correctness issue, not a performance one.

What we did as a workaround (PR #249 commit e404687)

Forward-journal pattern on top of the four existing primitives:

  1. Write new contents to <path>.tmp
  2. Read back and verify (size check + JSON parse)
  3. delete_file(<path>)
  4. write_file(<path>, contents)
  5. delete_file(<path>.tmp)
  6. On load: if <path> is missing/corrupt and <path>.tmp parses, recover from .tmp and promote on next write
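A minimal sketch of the steps above, for illustration only. It assumes the four primitives take a path (plus contents for write_file) and that write_file overwrites an existing file; the real SDK signatures may differ. `MemoryStorage`, `journaled_save`, and `journaled_load` are hypothetical names, not the code from the PR, and the sketch assumes the stored state is never JSON null.

```python
import json


class MemoryStorage:
    """Hypothetical in-memory stand-in for the SDK's four storage primitives."""

    def __init__(self):
        self.files = {}

    def check_if_file_exists(self, path):
        return path in self.files

    def read_file(self, path):
        return self.files[path]

    def write_file(self, path, contents):
        # Assumption: write_file overwrites an existing file.
        self.files[path] = contents

    def delete_file(self, path):
        self.files.pop(path, None)


def _parse_or_none(storage, path):
    """Return the parsed JSON at path, or None if it is missing or unparseable."""
    if not storage.check_if_file_exists(path):
        return None
    try:
        return json.loads(storage.read_file(path))
    except ValueError:
        return None


def journaled_save(storage, path, state):
    tmp = path + ".tmp"
    # Promote a surviving journal first, so the only good copy is never
    # overwritten in place by the next journal write.
    if _parse_or_none(storage, path) is None and _parse_or_none(storage, tmp) is not None:
        storage.delete_file(path)
        storage.write_file(path, storage.read_file(tmp))
    contents = json.dumps(state)
    storage.write_file(tmp, contents)            # step 1
    readback = storage.read_file(tmp)            # step 2: size check + JSON parse
    if len(readback) != len(contents):
        raise OSError("journal size mismatch")
    json.loads(readback)                         # raises ValueError if torn
    storage.delete_file(path)                    # step 3
    storage.write_file(path, contents)           # step 4
    storage.delete_file(tmp)                     # step 5


def journaled_load(storage, path, default=None):
    state = _parse_or_none(storage, path)        # step 6
    if state is None:
        state = _parse_or_none(storage, path + ".tmp")
    return default if state is None else state
```

If a crash lands between steps 3 and 4, only <path>.tmp exists, and `journaled_load` falls back to it.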

This makes a crash recoverable for our ability, but:

  • Steps 3–4 still have a brief crash window where the real file doesn't exist on disk. Only a true rename is fully atomic.
  • Every ability author doing JSON persistence now has to re-implement this pattern.
  • Most won't — they'll follow the docs' "delete-then-write" prescription and eat the tearing risk silently.

Proposed fix

A single primitive in the SDK:

sdk.atomic_write_file(path: str, contents: str | bytes, in_ability_directory: bool = True) -> None

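Implementation is the standard write-to-tmp + fsync + os.replace sequence. os.replace overwrites an existing destination on both POSIX and Windows (atomically on POSIX, and in practice on NTFS), whereas os.rename raises on Windows if the destination already exists. ~5 lines plus error handling.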

That one helper closes the class of bug across every ability without each author having to know POSIX filesystem semantics. The existing four primitives can stay; this is additive.
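A rough sketch of what such a helper could look like on a local filesystem, using only the Python standard library. It is not the SDK's implementation: the in_ability_directory path resolution is left out because the SDK's path mapping isn't described here, and it skips fsync on the containing directory, which a stricter version might add on POSIX.

```python
import os
import tempfile


def atomic_write_file(path, contents):
    """Replace path so readers see either the old file or the new one."""
    data = contents.encode("utf-8") if isinstance(contents, str) else contents
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so the final rename stays
    # on one filesystem (a cross-filesystem rename would fail).
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())    # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # overwrites an existing destination
    except BaseException:
        # Best-effort cleanup; the original file has not been touched.
        try:
            os.unlink(tmp_path)
        except OSError:
            pass
        raise
```

Note that mkstemp creates the temp file with owner-only permissions, so a real implementation may want to copy the original file's mode before the rename.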

Alternatively (cheaper)

If atomic rename isn't trivial given the SDK's deployment model (sandboxed FS, remote storage, etc.), a clearly-flagged write_file_with_backup that does copy real -> .bak; write real; delete .bak would at least give ability authors a recoverable path. Our forward-journal pattern is close to this shape but journals forward instead of backward. Either direction is strictly better than the current docs-prescribed flow.
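For comparison, a sketch of the backward-journal shape built on the same four primitives. As before, `DictStorage` is a hypothetical in-memory stand-in, the primitive signatures and overwrite behaviour of write_file are assumptions, and `read_file_with_backup` is an illustrative name for the matching load path. The sketch only backs up a file that still parses as JSON, so a torn file never overwrites a good .bak.

```python
import json


class DictStorage:
    """Hypothetical in-memory stand-in for the four documented primitives."""

    def __init__(self):
        self.files = {}

    def check_if_file_exists(self, path):
        return path in self.files

    def read_file(self, path):
        return self.files[path]

    def write_file(self, path, contents):
        self.files[path] = contents  # assumption: overwrites in place

    def delete_file(self, path):
        self.files.pop(path, None)


def _is_json(text):
    try:
        json.loads(text)
        return True
    except ValueError:
        return False


def write_file_with_backup(storage, path, contents):
    bak = path + ".bak"
    if storage.check_if_file_exists(path):
        current = storage.read_file(path)
        if _is_json(current):
            storage.write_file(bak, current)   # copy real -> .bak
    storage.delete_file(path)
    storage.write_file(path, contents)         # write real
    storage.delete_file(bak)                   # delete .bak


def read_file_with_backup(storage, path):
    """Return the real file if it parses, else the backup, else None."""
    for candidate in (path, path + ".bak"):
        if storage.check_if_file_exists(candidate):
            text = storage.read_file(candidate)
            if _is_json(text):
                return text
    return None
```

A crash while the real file is being written leaves the previous version in .bak, so the ability boots with the old state rather than none.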

Scope

Happy to open a follow-up PR with the implementation if the core team agrees on the shape. We're already carrying the workaround in community/orynq-ai-auditability/ and can migrate it back to the SDK-provided primitive once it lands.
