Add additional dataset format support (pyarrow, csv, tsv, etc.) #6

@Cameron7195

Summary
Support viewing non-JSONL datasets (PyArrow-based, CSV, TSV) while keeping the inspection experience users already have for JSONL: readable rows, pretty-printed nested values, and key selection.

Problem
Many real-world ML datasets aren’t stored as JSONL; they may be in PyArrow/Parquet, CSV, or TSV. Right now, the viewer is effectively JSONL-centric, which limits its usefulness for ML engineers working with diverse formats.

Proposed Solution

  • Define a minimal abstraction for “row-based dataset” independent of underlying storage.
  • For each supported format:
    • CSV/TSV:
      • Parse headers as column names.
      • Treat each row as a flat object; if a cell contains JSON, optionally detect and pretty-print it.
    • PyArrow / Arrow-backed formats (initially read-only, assuming well-formed input):
      • Use a Node/JS-accessible reader or a conversion step to present rows and columns.
      • Preserve column names and basic types; optionally detect JSON strings similarly.
  • Keep the UI consistent: reuse the existing pretty-printing and jq-style key selection wherever they apply.
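
A rough sketch of what the row-based abstraction and the CSV/TSV path could look like. All names here (`RowDataset`, `parseDelimited`, `maybeParseJson`, `splitLine`) are hypothetical and not taken from the existing codebase. The parser is deliberately minimal: it handles double-quoted fields with `""` escapes but not newlines inside quoted fields, so a real implementation would probably lean on an established CSV library.

```typescript
// Hypothetical names throughout; nothing here exists in the viewer today.
type Cell = string | number | boolean | null | object;

// Storage-independent view of a dataset: ordered column names plus flat rows.
interface RowDataset {
  columns: string[];
  rows: Record<string, Cell>[];
}

// Return the parsed value if the cell looks like a JSON object or array,
// otherwise return the raw string unchanged.
function maybeParseJson(cell: string): Cell {
  const trimmed = cell.trim();
  const looksLikeJson =
    (trimmed.startsWith("{") && trimmed.endsWith("}")) ||
    (trimmed.startsWith("[") && trimmed.endsWith("]"));
  if (!looksLikeJson) return cell;
  try {
    return JSON.parse(trimmed);
  } catch {
    return cell; // looked like JSON but was not valid; keep the raw text
  }
}

// Split one line into fields, honoring double-quoted fields and "" escapes.
function splitLine(line: string, delimiter: string): string[] {
  const fields: string[] = [];
  let current = "";
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line.charAt(i);
    if (inQuotes) {
      if (ch === '"' && line.charAt(i + 1) === '"') {
        current += '"'; // escaped quote inside a quoted field
        i++;
      } else if (ch === '"') {
        inQuotes = false;
      } else {
        current += ch;
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === delimiter) {
      fields.push(current);
      current = "";
    } else {
      current += ch;
    }
  }
  fields.push(current);
  return fields;
}

// Parse CSV (delimiter ",") or TSV (delimiter "\t") text into a RowDataset.
// The first non-empty line is treated as the header row.
function parseDelimited(
  text: string,
  delimiter: "," | "\t",
  detectJson = true
): RowDataset {
  const lines = text.split(/\r?\n/).filter((l) => l.length > 0);
  const header = lines[0];
  if (header === undefined) return { columns: [], rows: [] };
  const columns = splitLine(header, delimiter);
  const rows = lines.slice(1).map((line) => {
    const fields = splitLine(line, delimiter);
    const row: Record<string, Cell> = {};
    columns.forEach((name, i) => {
      const raw = fields[i] ?? ""; // short rows get empty cells
      row[name] = detectJson ? maybeParseJson(raw) : raw;
    });
    return row;
  });
  return { columns, rows };
}
```

With this shape, an Arrow-backed reader would only need to produce the same `RowDataset` for the rest of the UI to stay unchanged. Note that the sketch leaves non-JSON cells as strings; type inference for numeric columns is a separate decision.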

Acceptance Criteria

  • User can open CSV/TSV files and:
    • See rows/columns with column headers.
    • Optionally expand JSON-like cells as nested JSON views.
  • Basic support for at least one Arrow-based dataset path (even if via a conversion step or a restricted subset).
  • Errors for unsupported or malformed files are clear and non-crashing.
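
One possible way to meet the "clear and non-crashing" criterion is to return a result value instead of throwing. The sketch below is illustrative only: `OpenResult`, `detectFormat`, `safeLoad`, and the extension table are assumptions, not existing code, and the issue does not specify which file extensions should map to which format.

```typescript
// Hypothetical names and extension mapping; adjust to whatever the viewer supports.
type DatasetFormat = "jsonl" | "csv" | "tsv" | "arrow" | "parquet";

// Either a value or a human-readable error; callers never need try/catch.
type OpenResult<T> =
  | { ok: true; value: T }
  | { ok: false; error: string };

const EXTENSIONS: Record<string, DatasetFormat> = {
  ".jsonl": "jsonl",
  ".csv": "csv",
  ".tsv": "tsv",
  ".arrow": "arrow",
  ".parquet": "parquet",
};

// Map a filename to a format, or explain why it is unsupported.
function detectFormat(filename: string): OpenResult<DatasetFormat> {
  const dot = filename.lastIndexOf(".");
  const ext = dot >= 0 ? filename.slice(dot).toLowerCase() : "";
  const format = EXTENSIONS[ext];
  if (format === undefined) {
    return {
      ok: false,
      error: `Unsupported file type "${ext || "(none)"}" for ${filename}`,
    };
  }
  return { ok: true, value: format };
}

// Wrap any loader so a malformed file yields a message instead of a crash.
function safeLoad<T>(filename: string, load: () => T): OpenResult<T> {
  try {
    return { ok: true, value: load() };
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    return { ok: false, error: `Could not read ${filename}: ${reason}` };
  }
}
```

The UI can then render `error` in place of the table when `ok` is false, which keeps unsupported and malformed files on the same code path.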
