Add About Section to docs #249

Merged: 12 commits, May 5, 2025
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -17,10 +17,10 @@ repos:
name: Clear output from Jupyter notebooks
description: Clear output from Jupyter notebooks.
files: \.ipynb$
exclude: ^docs/pre_executed
stages: [pre-commit]
language: system
entry: jupyter nbconvert --clear-output
exclude: docs/pre_executed
# Prevents committing directly branches named 'main' and 'master'.
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.4.0
8 changes: 8 additions & 0 deletions docs/about.rst
@@ -0,0 +1,8 @@
About Nested-Pandas
===================


.. toctree::

Internal Representation of Nested Data <about/internals>
Performance Impact of Nested-Pandas <pre_executed/performance>
36 changes: 36 additions & 0 deletions docs/about/internals.rst
@@ -0,0 +1,36 @@
Internal Representation of Nested Data
======================================
"Dataframes within Dataframes" is a useful heuristic for understanding the
API and behavior of a NestedFrame. However, the actual storage representation
leverages pyarrow, materializing the nested dataframes as views of the
underlying data. The following diagram details the actual storage
representation of nested-pandas:

.. image:: ./npd_internals.png
:width: 400
:align: center
:alt: Internal representation of nested-pandas


The advantage of this approach is that each sub-column ("field" in pyarrow) is
stored as a flat array, with an offset array used to slice the data into the
respective sub-dataframes. This allows for efficient transformations to other
data representations (dataframes, list-arrays, flat arrays, etc.), which are
used internally to minimize the overhead of operations involving nested data.

Nested Serialization to Parquet
-------------------------------
Nested columns are backed by valid pyarrow struct-list objects, which allows
them to be serialized directly to the parquet format. nested-pandas
automatically writes nested columns to parquet as valid pyarrow dtypes, so
they can be read by any other parquet reader that supports complex types.
Conversely, when reading from parquet, nested-pandas attempts to cast pyarrow
struct-list columns directly to nested columns.


Multi-level Nesting Support
---------------------------
At this time, nested-pandas supports only a single level of nesting. We
intend to support multiple levels of nesting in the future, and community use
cases that would benefit from this would provide additional motivation.
Binary file added docs/about/npd_internals.png
4 changes: 4 additions & 0 deletions docs/index.rst
@@ -79,6 +79,9 @@ API-level information about nested-pandas is viewable in the
:doc:`API Reference <reference>`
section.

The :doc:`About Nested-Pandas <about>` section provides information on the
design and performance advantages of nested-pandas.

Learn more about contributing to this repository in our :doc:`Contribution Guide <gettingstarted/contributing>`.

.. toctree::
@@ -88,3 +91,4 @@ Learn more about contributing to this repository in our :doc:`Contribution Guide
Getting Started <gettingstarted>
Tutorials <tutorials>
API Reference <reference>
About Nested-Pandas <about>
139 changes: 139 additions & 0 deletions docs/pre_executed/performance.ipynb
@@ -0,0 +1,139 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Performance Impact of `nested-pandas`\n",
"\n",
"For use-cases involving nesting data, `nested-pandas` can offer significant speedups compared to using the native `pandas` API. Below is a brief example workflow comparison between `pandas` and `nested-pandas`, where this example workflow calculates the amplitude of photometric fluxes after a few filtering steps."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import nested_pandas as npd\n",
"import pandas as pd\n",
"import light_curve as licu\n",
"import numpy as np\n",
"\n",
"from nested_pandas.utils import count_nested"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Pandas"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"498 ms ± 3.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%%timeit\n",
"\n",
"# Read data\n",
"object_df = pd.read_parquet(\"objects.parquet\")\n",
"source_df = pd.read_parquet(\"ztf_sources.parquet\")\n",
"\n",
"# Filter on object\n",
"filtered_object = object_df.query(\"ra > 10.0\")\n",
"# Sync object to source -- removes any index values of source not found in object\n",
"filtered_source = filtered_object[[]].join(source_df, how=\"left\")\n",
"\n",
"# Count number of observations per photometric band and add it to the object table\n",
"band_counts = (\n",
" source_df.groupby(level=0)\n",
" .apply(lambda x: x[[\"band\"]].value_counts().reset_index())\n",
" .pivot_table(values=\"count\", index=\"index\", columns=\"band\", aggfunc=\"sum\")\n",
")\n",
"filtered_object = filtered_object.join(band_counts[[\"g\", \"r\"]])\n",
"\n",
"# Filter on our nobs\n",
"filtered_object = filtered_object.query(\"g > 520\")\n",
"filtered_source = filtered_object[[]].join(source_df, how=\"left\")\n",
"\n",
"# Calculate Amplitude\n",
"amplitude = licu.Amplitude()\n",
"filtered_source.groupby(level=0).apply(lambda x: amplitude(np.array(x.mjd), np.array(x.flux)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Nested-Pandas"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"228 ms ± 2.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%%timeit\n",
"\n",
"# Read in parquet data\n",
"# nesting sources into objects\n",
"nf = npd.read_parquet(\"objects.parquet\")\n",
"nf = nf.add_nested(npd.read_parquet(\"ztf_sources.parquet\"), \"ztf_sources\")\n",
"\n",
"# Filter on object\n",
"nf = nf.query(\"ra > 10.0\")\n",
"\n",
"# Count number of observations per photometric band and add it as a column\n",
"nf = count_nested(nf, \"ztf_sources\", by=\"band\", join=True) # use an existing utility\n",
"\n",
"# Filter on our nobs\n",
"nf = nf.query(\"n_ztf_sources_g > 520\")\n",
"\n",
"# Calculate Amplitude\n",
"amplitude = licu.Amplitude()\n",
"nf.reduce(amplitude, \"ztf_sources.mjd\", \"ztf_sources.flux\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "lsdb",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.11"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
3 changes: 2 additions & 1 deletion docs/requirements.txt
Expand Up @@ -10,4 +10,5 @@ sphinx-copybutton
sphinx-book-theme
astroquery
astropy
matplotlib
matplotlib
light-curve