Commit efaa27d

Merge pull request #249 from lincc-frameworks/internals_diagram
Add About Section to docs
2 parents f75d700 + 764abc1 commit efaa27d

7 files changed: +190 −2 lines


.pre-commit-config.yaml

Lines changed: 1 addition & 1 deletion
@@ -17,10 +17,10 @@ repos:
       name: Clear output from Jupyter notebooks
       description: Clear output from Jupyter notebooks.
       files: \.ipynb$
-      exclude: ^docs/pre_executed
       stages: [pre-commit]
       language: system
       entry: jupyter nbconvert --clear-output
+      exclude: docs/pre_executed
   # Prevents committing directly branches named 'main' and 'master'.
   - repo: https://github.com/pre-commit/pre-commit-hooks
     rev: v4.4.0

docs/about.rst

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+About Nested-Pandas
+===================
+
+
+.. toctree::
+
+    Internal Representation of Nested Data <about/internals>
+    Performance Impact of Nested-Pandas <pre_executed/performance>

docs/about/internals.rst

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
+Internal Representation of Nested Data
+======================================
+"Dataframes within Dataframes" is a useful heuristic for understanding the
+API and workings of a NestedFrame. However, the actual storage representation
+leverages pyarrow and materializes the nested dataframes as a view of the
+data. The following diagram details the actual storage representation of
+nested-pandas:
+
+.. image:: ./npd_internals.png
+  :width: 400
+  :align: center
+  :alt: Internal representation of nested-pandas
+
+
+The advantage of this approach is that each sub-column ("field" in pyarrow) is
+stored in a flat array, with an offset array used to slice the data into the
+respective sub-dataframes. This allows for efficient transformations to other
+data representations (dataframes, list-arrays, flat arrays, etc.), which are
+used internally to minimize the overhead of operations involving nested data.
+
+Nested Serialization to Parquet
+-------------------------------
+The internal design of nested columns is backed by valid pyarrow struct-list
+objects. This allows nested columns to be serialized directly to the parquet
+format. nested-pandas automatically writes nested columns to parquet as valid
+pyarrow dtypes, which allows them to be read by other parquet readers that
+support complex types. Additionally, nested-pandas will attempt to cast
+pyarrow struct-list columns to nested columns directly when reading from
+parquet.
+
+
+Multi-level Nesting Support
+---------------------------
+At this time, nested-pandas only supports a single level of nesting. We
+intend to support multiple levels of nesting in the future, and community
+use cases that would benefit from it would provide additional motivation.

docs/about/npd_internals.png

44.4 KB (binary file added)

docs/index.rst

Lines changed: 4 additions & 0 deletions
@@ -79,6 +79,9 @@ API-level information about nested-pandas is viewable in the
 :doc:`API Reference <reference>`
 section.
 
+The :doc:`About Nested-Pandas <about>` section provides information on the
+design and performance advantages of nested-pandas.
+
 Learn more about contributing to this repository in our :doc:`Contribution Guide <gettingstarted/contributing>`.
 
 .. toctree::
@@ -88,3 +91,4 @@ Learn more about contributing to this repository in our :doc:`Contribution Guide
    Getting Started <gettingstarted>
    Tutorials <tutorials>
    API Reference <reference>
+   About Nested-Pandas <about>

docs/pre_executed/performance.ipynb

Lines changed: 139 additions & 0 deletions
@@ -0,0 +1,139 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Performance Impact of `nested-pandas`\n",
+    "\n",
+    "For use-cases involving nesting data, `nested-pandas` can offer significant speedups compared to using the native `pandas` API. Below is a brief example workflow comparison between `pandas` and `nested-pandas`, where this example workflow calculates the amplitude of photometric fluxes after a few filtering steps."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import nested_pandas as npd\n",
+    "import pandas as pd\n",
+    "import light_curve as licu\n",
+    "import numpy as np\n",
+    "\n",
+    "from nested_pandas.utils import count_nested"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Pandas"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "498 ms ± 3.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%timeit\n",
+    "\n",
+    "# Read data\n",
+    "object_df = pd.read_parquet(\"objects.parquet\")\n",
+    "source_df = pd.read_parquet(\"ztf_sources.parquet\")\n",
+    "\n",
+    "# Filter on object\n",
+    "filtered_object = object_df.query(\"ra > 10.0\")\n",
+    "# sync object to source --removes any index values of source not found in object\n",
+    "filtered_source = filtered_object[[]].join(source_df, how=\"left\")\n",
+    "\n",
+    "# Count number of observations per photometric band and add it to the object table\n",
+    "band_counts = (\n",
+    "    source_df.groupby(level=0)\n",
+    "    .apply(lambda x: x[[\"band\"]].value_counts().reset_index())\n",
+    "    .pivot_table(values=\"count\", index=\"index\", columns=\"band\", aggfunc=\"sum\")\n",
+    ")\n",
+    "filtered_object = filtered_object.join(band_counts[[\"g\", \"r\"]])\n",
+    "\n",
+    "# Filter on our nobs\n",
+    "filtered_object = filtered_object.query(\"g > 520\")\n",
+    "filtered_source = filtered_object[[]].join(source_df, how=\"left\")\n",
+    "\n",
+    "# Calculate Amplitude\n",
+    "amplitude = licu.Amplitude()\n",
+    "filtered_source.groupby(level=0).apply(lambda x: amplitude(np.array(x.mjd), np.array(x.flux)))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Nested-Pandas"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "228 ms ± 2.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%timeit\n",
+    "\n",
+    "# Read in parquet data\n",
+    "# nesting sources into objects\n",
+    "nf = npd.read_parquet(\"objects.parquet\")\n",
+    "nf = nf.add_nested(npd.read_parquet(\"ztf_sources.parquet\"), \"ztf_sources\")\n",
+    "\n",
+    "# Filter on object\n",
+    "nf = nf.query(\"ra > 10.0\")\n",
+    "\n",
+    "# Count number of observations per photometric band and add it as a column\n",
+    "nf = count_nested(nf, \"ztf_sources\", by=\"band\", join=True)  # use an existing utility\n",
+    "\n",
+    "# Filter on our nobs\n",
+    "nf = nf.query(\"n_ztf_sources_g > 520\")\n",
+    "\n",
+    "# Calculate Amplitude\n",
+    "amplitude = licu.Amplitude()\n",
+    "nf.reduce(amplitude, \"ztf_sources.mjd\", \"ztf_sources.flux\")"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "lsdb",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.11"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}

docs/requirements.txt

Lines changed: 2 additions & 1 deletion
@@ -10,4 +10,5 @@ sphinx-copybutton
 sphinx-book-theme
 astroquery
 astropy
-matplotlib
+matplotlib
+light-curve
