Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prototypes: data.py, data.r, analysis.py, analysis.r #94

Closed
wants to merge 18 commits into from

Conversation

magland
Copy link
Collaborator

@magland magland commented Jun 26, 2024

This is a redo of #74

In addition to being based on the new main branch, the difference is that it uses an unobtrusive method for triggering the prototypes view.

See this comment

// We want to be as unobtrusive as possible, so we're not going to hook into the
// routing system for the application. We just have this test for
// prototypes, which is a query parameter that can be set to 1 to enable the
// prototypes. Then below we just render the Prototypes Window on this
// condition.
const query = new URLSearchParams(window.location.search)
const prototypesMode = query.get('prototypes') === '1'

So you would do http://localhost:3000?prototypes=1

The idea here is to be able to develop the functionality of generating data.json using data.py or data.r (and also create plots using analysis.py) and then later decide how to integrate it into the UI. It think it would be a mistake to try to do both at the same time.

Copy link
Collaborator

@WardBrian WardBrian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At first look the functionality here looks really good! Have you had a go at plot outputs from webr yet?

A few simplifying suggestions and a question:

const [status, setStatus] = useState<'idle' | 'loading' | 'running' | 'completed' | 'failed'>('idle')

const loadPyodideInstance = useMemo(() => {
let pyodide: PyodideInterface | null = null
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any hope of sharing code between this and the data file? Not just for code organization reasons, but I also noticed that this seems to re-download pyodide if you run data.py and then switch and run analysis.py

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to make a decision on whether to use the same pyodide instance for data.py and analysis.py. Since they are going to be depending on a different set of packages (analysis.py will potentially be a lot heavier), I suggest that we use two separate instances even if it means we download twice.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having the variables available from the data script (or at the very least, the data variable) has some value, I think

height: number
}

const exampleScript = `import sys
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to note, the stanio package will probably also be helpful here in terms of preparing a useful output object from the draws. You can see what I'm doing in the normal python tinystan here:

https://github.com/WardBrian/tinystan/blob/main/clients/python/tinystan/output.py

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering whether we should use a pandas dataframe (long form, all chains in one). We want something that will stand the test of time so that old SP projects won't stop working.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want compatibility with existing packages, your best bet will either be something like the tinystan object (which is basically just a dict with ndarray entries), or for maximum interoperability something like ArViz's inferencedata object

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think simpler is better. And we should also try to make it as consistent as possible between R and Python.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

R is a slightly different story, but it has a similar sort of lock-in by a package called posterior. If whatever we provide is not at least easily convertible to something posterior can understand, it won't be used

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One approach could be to provide the data to the script in as raw a format as possible. Then have a standard first few of lines that load it into posterior, or arviz, or pandas, etc. Those conversion lines would be part of the analysis.py/analysis.r.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems fine to me

@magland
Copy link
Collaborator Author

magland commented Jun 27, 2024

At first look the functionality here looks really good! Have you had a go at plot outputs from webr yet?

I did look into it, but it's a lot less straightforward than for pyodide (a lot of ugly message passing). I suggest we hold off on that until after we are able to incorporate the analysis.py into the UI.

@WardBrian
Copy link
Collaborator

I did look into it, but it's a lot less straightforward than for pyodide (a lot of ugly message passing). I suggest we hold off on that until after we are able to incorporate the analysis.py into the UI.

Sounds good. We may want to look at how the Quarto plugin for webr does things: https://github.com/coatless/quarto-webr/blob/8795c3d75fdc8348734d6c05f1ba6350fb225f8e/_extensions/webr/qwebr-compute-engine.js

@WardBrian
Copy link
Collaborator

Quarto plugin for webr

Here is a very minimal example showing a plot loaded in the same way they do it: https://stackblitz.com/edit/vitejs-vite-6wuedv?file=src%2FApp.tsx

@magland magland changed the title prototypes: data.py, data.r, analysis.py prototypes: data.py, data.r, analysis.py, analysis.r Jun 28, 2024
@magland
Copy link
Collaborator Author

magland commented Jun 28, 2024

@WardBrian I have added support for analysis.r based on your hints.

@WardBrian
Copy link
Collaborator

Nice! Trying it out now

@WardBrian
Copy link
Collaborator

A few notes before this can move to the non-prototype stage:

  1. I noticed yarn build is raising a bunch of warnings on this branch, possibly related to the pyodide instructions here: https://pyodide.org/en/stable/usage/working-with-bundlers.html#vite

  2. Similarly, I think it would make sense to tell vite to separately chunk webr and pyodide outside of index.js

  3. Pyodide is currently blocking the main thread. It has web worker support, but it seems to be something you need to do manually: https://pyodide.org/en/stable/usage/webworker.html -- WebR seems to be in a worker already/by default.

@magland
Copy link
Collaborator Author

magland commented Jul 1, 2024

A few notes before this can move to the non-prototype stage:

  1. I noticed yarn build is raising a bunch of warnings on this branch, possibly related to the pyodide instructions here: https://pyodide.org/en/stable/usage/working-with-bundlers.html#vite

If I follow the link in the warning it goes to
https://vitejs.dev/guide/troubleshooting.html#module-externalized-for-browser-compatibility
which explains that the problem is with the pyodide code. I tried following those bundling instructions that you provided, but it didn't help. I could be wrong, but I think we just need to ignore those warnings.

  1. Similarly, I think it would make sense to tell vite to separately chunk webr and pyodide outside of index.js

I tried lazy importing which resulted in code splitting. However, the split off chunks were just around 16 kb for pyodide and 60 kb for webr... so not worth it. The bulk of the source code seems to be dynamically loaded at runtime.

  1. Pyodide is currently blocking the main thread. It has web worker support, but it seems to be something you need to do manually: https://pyodide.org/en/stable/usage/webworker.html -- WebR seems to be in a worker already/by default.

I agree it would be better to do pyodide in a web worker. But I don't think this should "block" us right now since I wouldn't expect user scripts to be very computationally intensive. You can correct me on that.

@WardBrian
Copy link
Collaborator

Computationally intensive in terms of number-crunching no, but I certainly think a common use case for data.py will be "fetch this existing dataset from the internet and munge it", so freezing during that fetch could be quite annoying

@magland
Copy link
Collaborator Author

magland commented Jul 1, 2024

Computationally intensive in terms of number-crunching no, but I certainly think a common use case for data.py will be "fetch this existing dataset from the internet and munge it", so freezing during that fetch could be quite annoying

If you feel this is important, I can take a crack at it for data.py. But for analysis.py I think this will prove difficult (if not very difficult) because of the need for this magic:

(document as any).pyodideMplTarget = outputDiv;

@WardBrian
Copy link
Collaborator

I think it is at least worth figuring out if it is possible during this prototyping phase. The fact that jupyterlite doesn't freeze its UI means there must be something there, but it definitely seems possible that if we see what the worker solution requires we may decide that the cure is worse than the disease

@magland
Copy link
Collaborator Author

magland commented Jul 1, 2024

I think it is at least worth figuring out if it is possible during this prototyping phase. The fact that jupyterlite doesn't freeze its UI means there must be something there, but it definitely seems possible that if we see what the worker solution requires we may decide that the cure is worse than the disease

Okay, I'll do it for data.py. (I'm almost positive that jupyterlite use a much more complex system to produce graphical outputs than simply a document.pyodideMplTarget = div ... so I'll hold off on the analysis.py part)

@WardBrian
Copy link
Collaborator

It seems like there are a few different solutions out there, see pyodide/matplotlib-pyodide#6 (comment)

Most of them look like something not too dissimilar to the webR code where you can end up with a list of pictures and separately a list of text outputs from a given program, and then you need to draw them back on the main thread.

@magland
Copy link
Collaborator Author

magland commented Jul 1, 2024

@WardBrian I've got data.py running with a worker.

Do you imagine a need for worker on analysis.py at this point?

@WardBrian
Copy link
Collaborator

Here's a working (but quite messy) example of pyodide + matplotlib in a worker, in a similar style to how we do webR now: https://gist.github.com/WardBrian/d89e939c43134ea405fd68084e702ddf

I think it is possible that analysis scripts could take a long time to run (in particular, big plots take a long time sometimes), and this also lets us add a 'cancel' button which seems valuable. This should also work for any package which is built on top of matplotlib like seaborne, though it will be worth testing for that later.

@magland
Copy link
Collaborator Author

magland commented Jul 2, 2024

@WardBrian @jsoules wonder if you could try to track down the build error here. I'm stuck.

There seems to be a problem importing loadPyodide from the web worker dataPyWorker.ts

[vite:worker] Invalid value "iife" for option "output.format" - UMD and IIFE output formats are not supported for code-splitting builds.

This works in vite dev, but fails on yarn build.

@WardBrian
Copy link
Collaborator

WardBrian commented Jul 2, 2024

Google suggests adding

  worker: {
    format: 'es',
  },

to the vite config, but I'd be at least a little nervous about this causing issues with the other workers we already have. It's also possible to have the worker be a non-module worker and use importScripts() for pyodide, but that loses some other niceties

@magland
Copy link
Collaborator Author

magland commented Jul 2, 2024

worker: {
format: 'es',
},

This seems to have worked! You definitely have a knack.

@WardBrian
Copy link
Collaborator

This seems to have worked! You definitely have a knack.

I don't think I'm that far away from having Tetris Dreams for the MDN web worker documentation at this point

@magland
Copy link
Collaborator Author

magland commented Jul 2, 2024

Okay I've implemented analysis.py in a worker... and I've also left in the non-worker case for comparison, and in case we want to fall back to that.

Copy link
Collaborator

@WardBrian WardBrian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some notes on the worker -- overall a much cleaner implementation than my hacky thing, and I think it's definitely the way to go. Even just playing around with the small example program we have loaded, it's a much smoother experience.

With an eye toward this becoming a non-prototype, two questions:

  1. Should data.py and analysis.py be using the same Worker code? (I lean toward yes, it seems like a reasonable thing to do)
  2. Should data.py and analysis.py use the same Worker instance? (I lean toward no for an initial implementation)


// here's where we can pass in globals
const globals = pyodide.toPy({ _sp_example_global: 5 });
const script = MPLPreample + '\n' + code + '\nprint("-----x", len(SP_IMAGES))';
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we currently imagining that the worker will be destroyed after every run? The preamble could be run as part of initialization if not

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it hurts to have it here, and in fact it makes things more robust in my mind.

succeeded = false;
}
catch (e) {
console.error(e);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we do something other than just log this? Like post an error message?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now posted to the main thread, and overall stdout, stderr posting has been improved. The display in the output window is also formatted better.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does look much better!

The images need to be constrained in size in some way I think -- the arviz example I posted below leads to a very large image which scrolls off the screen

@magland
Copy link
Collaborator Author

magland commented Jul 2, 2024

  1. Should data.py and analysis.py be using the same Worker code? (I lean toward yes, it seems like a reasonable thing to do)

I agree. But I think we should make that merge only when the time is right.

  1. Should data.py and analysis.py use the same Worker instance? (I lean toward no for an initial implementation)

I agree. Different dependencies. And I would guess that the user will not usually use both in a single session.

@WardBrian
Copy link
Collaborator

Here's a good proof-of-concept for analysis.R, straight out of bayesplot's examples: https://mc-stan.org/bayesplot/reference/PPC-scatterplots.html#examples

webr::shim_install()
install.packages("bayesplot")

library(bayesplot)
y <- example_y_data()
yrep <- example_yrep_draws()
p1 <- ppc_scatter_avg(y, yrep)
p1


# don't draw line x=y
ppc_scatter_avg(y, yrep, ref_line = FALSE)


p2 <- ppc_scatter(y, yrep[20:23, ], alpha = 0.5, size = 1.5)
p2


# give x and y axes the same limits
lims <- ggplot2::lims(x = c(0, 160), y = c(0, 160))
p1 + lims

p2 + lims


# for ppc_scatter_avg_grouped the default is to allow the facets
# to have different x and y axes
group <- example_group_data()
ppc_scatter_avg_grouped(y, yrep, group)


# let x-axis vary but force y-axis to be the same
ppc_scatter_avg_grouped(y, yrep, group, facet_args = list(scales = "free_x"))

This works as is with no modifications to the code. We might want to call webr::shim_install() automatically? https://docs.r-wasm.org/webr/latest/packages.html#shimming-install.packages-for-webr

@WardBrian
Copy link
Collaborator

@magland -- How do we proceed from here? I personally feel like the functionality is good enough at this point that it is worth splitting the PR into two that actually integrate into the rest of the site (assuming you didn't intend for this branch to ever be merged directly -- correct me if I'm wrong there).

I suspect some more things will jump out to us once it is placed in context, and it is probably a good opportunity to think about code organization etc now that it seems we are in a pretty good place in terms of feature exploration. Thoughts?

@magland
Copy link
Collaborator Author

magland commented Jul 2, 2024

@magland -- How do we proceed from here? I personally feel like the functionality is good enough at this point that it is worth splitting the PR into two that actually integrate into the rest of the site (assuming you didn't intend for this branch to ever be merged directly -- correct me if I'm wrong there).

I suspect some more things will jump out to us once it is placed in context, and it is probably a good opportunity to think about code organization etc now that it seems we are in a pretty good place in terms of feature exploration. Thoughts?

I didn't have a specific plan. My inclination is to just start new PRs and copy the code over, being sure to reference this discussion in the new PRs.

In terms of code organization, I don't have strong opinions, and I don't think too much thought needs to be put into it. Just tidy things up.

@WardBrian
Copy link
Collaborator

My inclination is to just start new PRs and copy the code over, being sure to reference this discussion in the new PRs.

That sounds good to me!

Copy link
Collaborator

@jsoules jsoules left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of notes. I didn't get too into the weeds since it seems like we won't be merging this version directly; I'll be more detailed with comments for the subsequent PRs.

outputDiv={outputDiv}
/>
)}
<AnalysisPyOutputWindow key={`analysis-py-output-window-${useWorker}`} width={0} height={0} onOutputDiv={setOutputDiv} />
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand the onOutputDiv structure here. Why the callback vs. just a simple ref from the beginning?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me, passing a ref to the div and then utilizing .current on that ref is susceptible to race conditions etc. With the callback, it is much more clear whether the div has been loaded yet, and the state changes on the load. Maybe there's a more standard way to do it, but I have found that this callback mechanism is robust and is more in line with state management that I am familiar with for react.

Comment on lines 57 to 59
} else {
return pyodide;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the absence of a useState or useRef here, I'm not sure that this could ever be the case?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. Fixed.

// worker creation
useEffect(() => {
const worker = new Worker(analysisPyWorkerURL, {
name: "dataPyWorker",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit confusing--is it the data worker or the analysis worker?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

readOnly: boolean;
width: number;
height: number;
outputDiv?: HTMLDivElement | null;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Realistically, is it ever going to be okay for this to be null/undefined?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's null until the div is loaded and the callback is triggered.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, a better answer: I think we want to be able to use this component without specifying an output div.

Comment on lines 78 to 82
const divElement = document.createElement("div");
divElement.style.color = "blue";
const preElement = document.createElement("pre");
divElement.appendChild(preElement);
preElement.textContent = dd.data;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd wrap all of this in a function which could be shared with the stderr version below.

Probably preferable to handle the style via a named css class rather than manual styling.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest this can be cleaned up later.

Comment on lines 45 to 47
// the runPython call is going to be blocking, so we want to give
// react a chance to update the status in the UI.
await new Promise((resolve) => setTimeout(resolve, 100));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still relevant?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

Comment on lines 70 to 104
const resultToData = (result: any): any => {
if (result === null || result === undefined) {
return result;
}
if (typeof result !== "object") {
return result;
}
if (result instanceof Map) {
const ret: { [key: string]: any } = {};
for (const k of result.keys()) {
ret[k] = resultToData(result.get(k));
}
return ret;
} else if (
result instanceof Int16Array ||
result instanceof Int32Array ||
result instanceof Int8Array ||
result instanceof Uint16Array ||
result instanceof Uint32Array ||
result instanceof Uint8Array ||
result instanceof Uint8ClampedArray ||
result instanceof Float32Array ||
result instanceof Float64Array
) {
return Array.from(result);
} else if (result instanceof Array) {
return result.map(resultToData);
} else {
const ret: { [key: string]: any } = {};
for (const k of Object.keys(result)) {
ret[k] = resultToData(result[k]);
}
return ret;
}
};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we're ready to make this production code, we should probably document the expectations of this format & the internal representation (whatever it is).

import { PyodideInterface, loadPyodide } from "pyodide";

let pyodide: PyodideInterface | null = null;
const loadPyodideInstance = async () => {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this may have been discussed already, but the code here seems very similar to the analysis worker, and we would likely benefit from combining them (while using separate instances as appropriate).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@magland
Copy link
Collaborator Author

magland commented Jul 3, 2024

In preparation for incorporating into the UI, this latest commit I did the following.

Combined worker code for data.py and analysis.py so there is now one source file, but still two separate worker instances.

Created pyodidedWorker.ts (worker thread), pyodideWorkerInterface (main thread), and pyodideWorkerTypes (types for messages, status, etc)

Overall I think this implementation is much more solid.

@magland
Copy link
Collaborator Author

magland commented Jul 3, 2024

With this last commit I have separated out the console and image outputs for analysis.py, analysis.r and included console output windows for data.py and data.r

image

image

@magland magland mentioned this pull request Jul 3, 2024
@WardBrian
Copy link
Collaborator

Note: We should probably add a call to https://pyodide.org/en/stable/usage/api/js-api.html#pyodide.loadPackagesFromImports to the pyodide worker to allow users to use other built-in packages that aren't loaded by default. This can actually replace us manually loading numpy, I think, but the special logic is still needed for arviz since that needs to be installed from pip

@magland
Copy link
Collaborator Author

magland commented Jul 25, 2024

@WardBrian should I make a draft of the analysis.R from the main branch and then hand it over to you for creating "draws" object? I didn't want to do this simultaneously if you were planning on working on it.

@WardBrian
Copy link
Collaborator

WardBrian commented Jul 25, 2024

Oh, I wasn’t working on it because I thought you wanted to!

I don’t mind either way, if you want to do it that way it sounds good to me

@magland magland closed this Jul 25, 2024
@WardBrian WardBrian deleted the python-r-prototypes branch July 26, 2024 19:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants