Summary
Support viewing non-JSONL datasets (PyArrow-based, CSV, TSV) in a way that still gives users a good inspection experience.
Problem
Many real-world ML datasets aren’t stored as JSONL; they may be in PyArrow/Parquet, CSV, or TSV. Right now, the viewer is effectively JSONL-centric, which limits its usefulness for ML engineers working with diverse formats.
Proposed Solution
- Define a minimal abstraction for “row-based dataset” independent of underlying storage.
- For each supported format:
  - CSV/TSV:
    - Parse headers as column names.
    - Treat each row as a flat object; if a cell contains JSON, optionally detect and pretty-print it.
  - PyArrow / Arrow-backed formats (initially optimistic / read-only assumptions):
    - Use a Node/JS-accessible reader or a conversion step to present rows and columns.
    - Preserve column names and basic types; optionally detect JSON strings similarly.
- Keep the UI consistent: reuse the same pretty-printing and jq-style key selection concepts wherever they apply.
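
As a rough illustration of the CSV/TSV part of this proposal, here is a minimal TypeScript sketch of a storage-independent row abstraction and a delimited-text parser with optional JSON cell detection. Every name in it (`Cell`, `Row`, `RowDataset`, `maybeParseJson`, `splitLine`, `parseDelimited`) is hypothetical and not part of the existing viewer. It assumes the whole file fits in memory and that quoted fields do not contain line breaks; a real implementation would probably rely on an established CSV library instead.

```typescript
// Hypothetical types: a "row-based dataset" independent of the storage format.
type Cell = string | number | boolean | null | object;

interface Row {
  [column: string]: Cell;
}

interface RowDataset {
  columns: string[];
  rows: Row[];
}

// Return parsed JSON when the cell looks like an object or array; otherwise keep the raw string.
function maybeParseJson(cell: string): Cell {
  const trimmed = cell.trim();
  const looksLikeJson =
    (trimmed.startsWith("{") && trimmed.endsWith("}")) ||
    (trimmed.startsWith("[") && trimmed.endsWith("]"));
  if (!looksLikeJson) return cell;
  try {
    return JSON.parse(trimmed) as Cell;
  } catch {
    return cell; // Not valid JSON after all: fall back to the plain string.
  }
}

// Split one line into fields, honoring double-quoted fields and "" escapes.
// Assumption: no line breaks inside quoted fields.
function splitLine(line: string, delimiter: string): string[] {
  const fields: string[] = [];
  let current = "";
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line.charAt(i);
    if (inQuotes) {
      if (ch === '"' && line.charAt(i + 1) === '"') {
        current += '"'; // Escaped quote inside a quoted field.
        i++;
      } else if (ch === '"') {
        inQuotes = false;
      } else {
        current += ch;
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === delimiter) {
      fields.push(current);
      current = "";
    } else {
      current += ch;
    }
  }
  fields.push(current);
  return fields;
}

// Parse CSV (",") or TSV ("\t") text: the first line supplies the column names.
function parseDelimited(
  text: string,
  delimiter: "," | "\t",
  detectJson: boolean = true
): RowDataset {
  const lines = text.split(/\r?\n/).filter((line) => line.length > 0);
  if (lines.length === 0) return { columns: [], rows: [] };
  const columns = splitLine(lines[0] ?? "", delimiter);
  const rows = lines.slice(1).map((line) => {
    const fields = splitLine(line, delimiter);
    const row: Row = {};
    columns.forEach((name, index) => {
      const raw = fields[index] ?? ""; // Missing trailing cells become empty strings.
      row[name] = detectJson ? maybeParseJson(raw) : raw;
    });
    return row;
  });
  return { columns, rows };
}

// Usage: the "meta" cell of the first row is detected as JSON and parsed.
const sample = 'id,meta\n1,"{""label"": ""cat""}"\n2,plain';
const dataset = parseDelimited(sample, ",");
```

Keeping JSON detection behind a flag matches the "optionally detect" wording above: the viewer can show raw strings by default and only expand cells when the user asks for it.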
Acceptance Criteria
- User can open CSV/TSV files and:
  - See rows/columns with column headers.
  - Optionally expand JSON-like cells as nested JSON views.
- Basic support for at least one Arrow-based dataset path (even if via a conversion step or a restricted subset).
- Errors for unsupported or malformed files are clear and non-crashing.
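
One possible way to meet the "clear and non-crashing" criterion is to return a result value instead of throwing, so the UI can always render a message. The sketch below is only an assumption about how that could look: `SupportedFormat`, `OpenResult`, `EXTENSION_TO_FORMAT`, and `detectFormat` are hypothetical names, and the extension list (including `.arrow` for an Arrow file) is illustrative rather than a decided set.

```typescript
// Hypothetical set of formats; the real list is still to be decided.
type SupportedFormat = "jsonl" | "csv" | "tsv" | "arrow";

// A result type lets callers show an error message without catching exceptions.
type OpenResult =
  | { ok: true; format: SupportedFormat }
  | { ok: false; message: string };

// Assumed mapping from file extension to format.
const EXTENSION_TO_FORMAT = new Map<string, SupportedFormat>([
  ["jsonl", "jsonl"],
  ["csv", "csv"],
  ["tsv", "tsv"],
  ["arrow", "arrow"],
]);

// Decide the format from the file name; unsupported types yield a readable error.
function detectFormat(fileName: string): OpenResult {
  const dot = fileName.lastIndexOf(".");
  const extension = dot >= 0 ? fileName.slice(dot + 1).toLowerCase() : "";
  const format = EXTENSION_TO_FORMAT.get(extension);
  if (format === undefined) {
    const supported = Array.from(EXTENSION_TO_FORMAT.keys()).join(", ");
    const shown = extension === "" ? "(none)" : extension;
    return {
      ok: false,
      message: `Unsupported file type "${shown}" for ${fileName}. Supported types: ${supported}.`,
    };
  }
  return { ok: true, format };
}

// Usage: the caller branches on `ok` rather than wrapping the call in try/catch.
const result = detectFormat("weights.bin");
const statusLine = result.ok ? `Opening as ${result.format}` : result.message;
```

Malformed content inside a supported file could reuse the same `OpenResult` shape, with the parser reporting the first offending line in `message`.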