Description
In the past we've raised some issues about peppy performance (see #388, #387). Peppy is fine for small projects (hundreds or even thousands of sample rows), but it gets slow when we are dealing with huge projects, like tens to hundreds of thousands of samples.
It would be nice if peppy could handle these very large projects.
One of the problems is that peppy stores sample information in two forms: as a table (a pandas DataFrame) and as a list of Sample objects. This duplicates the information in memory.
One idea for improving performance would be to switch to a single in-memory model. But we really want to be able to access the metadata in both ways for different use cases... so what about using the pandas DataFrame as the main data structure, and then providing some kind of generator that goes through it and creates Sample objects on the fly, in case someone wants the list-based approach?
This could be one way to increase performance.
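
To make the idea concrete, here is a rough sketch of what the generator approach might look like. The `Sample` class below is a simplified stand-in, not peppy's actual class, and `iter_samples` is a hypothetical name rather than an existing peppy function. The point is only that the DataFrame stays the single source of truth and objects are built one row at a time.

```python
from typing import Iterator

import pandas as pd


class Sample:
    """Simplified stand-in for peppy's Sample class (an assumption, not the real one)."""

    def __init__(self, attributes: dict):
        # Expose each column value as an attribute, e.g. sample.sample_name
        self.__dict__.update(attributes)

    def __repr__(self) -> str:
        return f"Sample({self.__dict__})"


def iter_samples(sample_table: pd.DataFrame) -> Iterator[Sample]:
    """Lazily yield one Sample per row; no list of objects is ever stored."""
    columns = list(sample_table.columns)
    # name=None makes itertuples return plain tuples instead of namedtuples
    for row in sample_table.itertuples(index=False, name=None):
        yield Sample(dict(zip(columns, row)))


# Toy table standing in for a project's sample table
sample_table = pd.DataFrame(
    {
        "sample_name": ["s1", "s2", "s3"],
        "protocol": ["ATAC", "RNA", "ATAC"],
    }
)

# Object-style access still works, but only one Sample needs to exist at a time
atac_names = [s.sample_name for s in iter_samples(sample_table) if s.protocol == "ATAC"]
```

With something like this, anyone who wants the table keeps using the DataFrame directly, and anyone who wants objects iterates over the generator. The tradeoff is that objects are rebuilt on every pass, so changes made to a yielded object would not persist unless they are written back to the table.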