Commit efaa27d

Merge pull request #249 from lincc-frameworks/internals_diagram
Add About Section to docs
2 parents f75d700 + 764abc1 commit efaa27d

7 files changed: +190 −2 lines


.pre-commit-config.yaml

Lines changed: 1 addition & 1 deletion
@@ -17,10 +17,10 @@ repos:
       name: Clear output from Jupyter notebooks
       description: Clear output from Jupyter notebooks.
       files: \.ipynb$
-      exclude: ^docs/pre_executed
       stages: [pre-commit]
       language: system
       entry: jupyter nbconvert --clear-output
+      exclude: docs/pre_executed
   # Prevents committing directly branches named 'main' and 'master'.
   - repo: https://github.com/pre-commit/pre-commit-hooks
     rev: v4.4.0

docs/about.rst

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+About Nested-Pandas
+===================
+
+
+.. toctree::
+
+    Internal Representation of Nested Data <about/internals>
+    Performance Impact of Nested-Pandas <pre_executed/performance>

docs/about/internals.rst

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
+Internal Representation of Nested Data
+======================================
+"Dataframes within Dataframes" is a useful heuristic for understanding the
+API and workings of a NestedFrame. However, the actual storage representation
+leverages pyarrow and materializes the nested dataframes as a view of the
+data. The following diagram details the actual storage representation of
+nested-pandas:
+
+.. image:: ./npd_internals.png
+  :width: 400
+  :align: center
+  :alt: Internal representation of nested-pandas
+
+
+The advantage of this approach is that each sub-column ("field" in pyarrow) is
+stored in a flat array, with an offset array used to slice the data into the
+respective sub-dataframes. This allows for efficient transformations to other
+data representations (dataframes, list-arrays, flat arrays, etc.), which are
+used internally to minimize the overhead of operations involving nested data.
+
+Nested Serialization to Parquet
+-------------------------------
+The internal design of nested columns is backed by valid pyarrow struct-list
+objects. This allows nested columns to be serialized directly to the parquet
+format. nested-pandas automatically writes nested columns to parquet as valid
+pyarrow dtypes, which allows them to be read by other parquet readers that
+support complex types. Additionally, nested-pandas will attempt to cast
+pyarrow struct-list columns to nested columns directly when reading from
+parquet.
+
+
+Multi-level Nesting Support
+---------------------------
+At this time, nested-pandas only supports a single level of nesting. We
+intend to support multiple levels of nesting in the future, and community
+use cases that would benefit from it would provide additional motivation.

docs/about/npd_internals.png

44.4 KB (binary file added)

docs/index.rst

Lines changed: 4 additions & 0 deletions
@@ -79,6 +79,9 @@ API-level information about nested-pandas is viewable in the
 :doc:`API Reference <reference>`
 section.
 
+The :doc:`About Nested-Pandas <about>` section provides information on the
+design and performance advantages of nested-pandas.
+
 Learn more about contributing to this repository in our :doc:`Contribution Guide <gettingstarted/contributing>`.
 
 .. toctree::
@@ -88,3 +91,4 @@ Learn more about contributing to this repository in our :doc:`Contribution Guide
    Getting Started <gettingstarted>
    Tutorials <tutorials>
    API Reference <reference>
+   About Nested-Pandas <about>

docs/pre_executed/performance.ipynb

Lines changed: 139 additions & 0 deletions
@@ -0,0 +1,139 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Performance Impact of `nested-pandas`\n",
+    "\n",
+    "For use-cases involving nesting data, `nested-pandas` can offer significant speedups compared to using the native `pandas` API. Below is a brief example workflow comparison between `pandas` and `nested-pandas`, where this example workflow calculates the amplitude of photometric fluxes after a few filtering steps."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import nested_pandas as npd\n",
+    "import pandas as pd\n",
+    "import light_curve as licu\n",
+    "import numpy as np\n",
+    "\n",
+    "from nested_pandas.utils import count_nested"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Pandas"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "498 ms ± 3.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%timeit\n",
+    "\n",
+    "# Read data\n",
+    "object_df = pd.read_parquet(\"objects.parquet\")\n",
+    "source_df = pd.read_parquet(\"ztf_sources.parquet\")\n",
+    "\n",
+    "# Filter on object\n",
+    "filtered_object = object_df.query(\"ra > 10.0\")\n",
+    "# sync object to source --removes any index values of source not found in object\n",
+    "filtered_source = filtered_object[[]].join(source_df, how=\"left\")\n",
+    "\n",
+    "# Count number of observations per photometric band and add it to the object table\n",
+    "band_counts = (\n",
+    "    source_df.groupby(level=0)\n",
+    "    .apply(lambda x: x[[\"band\"]].value_counts().reset_index())\n",
+    "    .pivot_table(values=\"count\", index=\"index\", columns=\"band\", aggfunc=\"sum\")\n",
+    ")\n",
+    "filtered_object = filtered_object.join(band_counts[[\"g\", \"r\"]])\n",
+    "\n",
+    "# Filter on our nobs\n",
+    "filtered_object = filtered_object.query(\"g > 520\")\n",
+    "filtered_source = filtered_object[[]].join(source_df, how=\"left\")\n",
+    "\n",
+    "# Calculate Amplitude\n",
+    "amplitude = licu.Amplitude()\n",
+    "filtered_source.groupby(level=0).apply(lambda x: amplitude(np.array(x.mjd), np.array(x.flux)))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Nested-Pandas"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "228 ms ± 2.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%timeit\n",
+    "\n",
+    "# Read in parquet data\n",
+    "# nesting sources into objects\n",
+    "nf = npd.read_parquet(\"objects.parquet\")\n",
+    "nf = nf.add_nested(npd.read_parquet(\"ztf_sources.parquet\"), \"ztf_sources\")\n",
+    "\n",
+    "# Filter on object\n",
+    "nf = nf.query(\"ra > 10.0\")\n",
+    "\n",
+    "# Count number of observations per photometric band and add it as a column\n",
+    "nf = count_nested(nf, \"ztf_sources\", by=\"band\", join=True)  # use an existing utility\n",
+    "\n",
+    "# Filter on our nobs\n",
+    "nf = nf.query(\"n_ztf_sources_g > 520\")\n",
+    "\n",
+    "# Calculate Amplitude\n",
+    "amplitude = licu.Amplitude()\n",
+    "nf.reduce(amplitude, \"ztf_sources.mjd\", \"ztf_sources.flux\")"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "lsdb",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.11"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}

docs/requirements.txt

Lines changed: 2 additions & 1 deletion
@@ -10,4 +10,5 @@ sphinx-copybutton
 sphinx-book-theme
 astroquery
 astropy
-matplotlib
+matplotlib
+light-curve
