Merged
56 changes: 52 additions & 4 deletions README.md
@@ -1,11 +1,11 @@
[![PyPI-Server](https://img.shields.io/pypi/v/compressed-lists.svg)](https://pypi.org/project/compressed-lists/)
![Unit tests](https://github.com/BiocPy/compressed-lists/actions/workflows/pypi-test.yml/badge.svg)
![Unit tests](https://github.com/BiocPy/compressed-lists/actions/workflows/run-tests.yml/badge.svg)

# compressed-lists
# CompressedList Implementation in Python

> Add a short description here!
A Python implementation of the `CompressedList` class from R/Bioconductor for memory-efficient list-like objects.

A longer description of your project goes here...
`CompressedList` is a memory-efficient container for list-like objects. Instead of storing each list element separately, it concatenates all elements into a single vector-like object and maintains information about where each original element begins and ends. This approach is significantly more memory-efficient than standard lists, especially when dealing with many list elements.
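
To make the layout concrete, here is a minimal standard-library sketch of the idea, not the package's actual implementation. The names `nested`, `flat`, `ends` and `get_element` are hypothetical, and 0-based half-open ranges are an assumption. It stores all values in one flat buffer plus one end offset per element, then compares the container overhead with that of a plain nested list.

```python
import sys
from array import array

# Hypothetical data: 1,000 short sublists of three integers each.
nested = [[i, i + 1, i + 2] for i in range(1000)]

# Compressed layout: one flat buffer plus one cumulative end offset per element.
flat = array("q", (value for sublist in nested for value in sublist))
ends = array("q")
total = 0
for sublist in nested:
    total += len(sublist)
    ends.append(total)


def get_element(i):
    """Return element i by slicing the flat buffer (0-based, half-open range)."""
    start = ends[i - 1] if i > 0 else 0
    return list(flat[start:ends[i]])


# Container overhead only: one list object per element vs. two contiguous buffers.
nested_bytes = sys.getsizeof(nested) + sum(sys.getsizeof(s) for s in nested)
compressed_bytes = sys.getsizeof(flat) + sys.getsizeof(ends)
print(get_element(1), nested_bytes > compressed_bytes)
```

The saving comes from paying the per-container overhead twice in total rather than once per list element, which is why it grows with the number of elements.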

## Install

@@ -15,6 +15,54 @@ To get started, install the package from [PyPI](https://pypi.org/project/compres
pip install compressed-lists
```

## Usage


```py
from compressed_lists import CompressedIntegerList, CompressedStringList

# Create a CompressedIntegerList
int_data = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
names = ["A", "B", "C"]
int_list = CompressedIntegerList.from_list(int_data, names)

# Access elements
print(int_list[0]) # [1, 2, 3]
print(int_list["B"]) # [4, 5]
print(int_list[1:3]) # Slice of elements

# Apply a function to each element
squared = int_list.lapply(lambda x: [i**2 for i in x])
print(squared[0]) # [1, 4, 9]

# Convert to a regular Python list
regular_list = int_list.to_list()

# Create a CompressedStringList
char_data = [["apple", "banana"], ["cherry", "date", "elderberry"], ["fig"]]
char_list = CompressedStringList.from_list(char_data)
```

### Partitioning

The `Partitioning` class handles the information about where each element begins and ends in the concatenated data. It allows for efficient extraction of elements without storing each element separately.

```python
from compressed_lists import Partitioning

# Create partitioning from end positions
ends = [3, 5, 10]
names = ["A", "B", "C"]
part = Partitioning(ends, names)

# Get partition range for an element
start, end = part[1] # Returns (3, 5)
```

> [!NOTE]
>
> Check out the [documentation](https://biocpy.github.io/compressed-lists) for extending CompressedLists to custom data types.

<!-- biocsetup-notes -->

## Note
1 change: 1 addition & 0 deletions docs/conf.py
@@ -299,6 +299,7 @@
"scipy": ("https://docs.scipy.org/doc/scipy/reference", None),
"setuptools": ("https://setuptools.pypa.io/en/stable/", None),
"pyscaffold": ("https://pyscaffold.org/en/stable", None),
"biocutils": ("https://biocpy.github.io/BiocUtils", None),
}

print(f"loading configurations for {project} {version} ...", file=sys.stderr)
18 changes: 9 additions & 9 deletions docs/index.md
@@ -1,17 +1,16 @@
# compressed-lists

Add a short description here!
A Python implementation of the `CompressedList` class from R/Bioconductor for memory-efficient list-like objects.

`CompressedList` is a memory-efficient container for list-like objects. Instead of storing each list element separately, it concatenates all elements into a single vector-like object and maintains information about where each original element begins and ends. This approach is significantly more memory-efficient than standard lists, especially when dealing with many list elements.

## Note
## Install

> This is the main page of your project's [Sphinx] documentation. It is
> formatted in [Markdown]. Add additional pages by creating md-files in
> `docs` or rst-files (formatted in [reStructuredText]) and adding links to
> them in the `Contents` section below.
>
> Please check [Sphinx] and [MyST] for more information
> about how to document your project and how to configure your preferences.
To get started, install the package from [PyPI](https://pypi.org/project/compressed-lists/)

```bash
pip install compressed-lists
```


## Contents
@@ -20,6 +19,7 @@ Add a short description here!
:maxdepth: 2

Overview <readme>
Tutorial <tutorial>
Contributions & Help <contributing>
License <license>
Authors <authors>
189 changes: 189 additions & 0 deletions docs/tutorial.md
@@ -0,0 +1,189 @@
---
file_format: mystnb
kernelspec:
name: python
---

# Basic Usage

```{code-cell}
from compressed_lists import CompressedIntegerList, CompressedStringList

# Create a CompressedIntegerList
int_data = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
names = ["A", "B", "C"]
int_list = CompressedIntegerList.from_list(int_data, names)

# Access elements
print(int_list[0]) # [1, 2, 3]
print(int_list["B"]) # [4, 5]
print(int_list[1:3]) # Slice of elements

# Apply a function to each element
squared = int_list.lapply(lambda x: [i**2 for i in x])
print(squared[0]) # [1, 4, 9]

# Convert to a regular Python list
regular_list = int_list.to_list()

# Create a CompressedStringList
char_data = [["apple", "banana"], ["cherry", "date", "elderberry"], ["fig"]]
char_list = CompressedStringList.from_list(char_data)
```

## Partitioning

The `Partitioning` class handles the information about where each element begins and ends in the concatenated data. It allows for efficient extraction of elements without storing each element separately.

```{code-cell}
from compressed_lists import Partitioning

# Create partitioning from end positions
ends = [3, 5, 10]
names = ["A", "B", "C"]
part = Partitioning(ends, names)

# Get partition range for an element
start, end = part[1]
print(start, end)
```
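
As a rough illustration of what the end positions encode, the sketch below converts them into `(start, end)` pairs in plain Python. It assumes 0-based, half-open ranges, which matches the README comment that `part[1]` returns `(3, 5)`. The helper `ranges_from_ends` is hypothetical and is not part of the `compressed_lists` API.

```python
def ranges_from_ends(ends):
    """Convert cumulative end positions into (start, end) pairs."""
    ranges = []
    start = 0
    for end in ends:
        ranges.append((start, end))
        start = end  # the next element begins where this one ends
    return ranges


ends = [3, 5, 10]
ranges = ranges_from_ends(ends)
print(ranges)  # [(0, 3), (3, 5), (5, 10)]

# Slicing a flat buffer with one of these ranges recovers that element.
flat = list(range(10))
start, end = ranges[1]
print(flat[start:end])  # [3, 4]
```

Because each start is just the previous end, only the end positions need to be stored.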

# Creating Custom CompressedList Subclasses

`CompressedList` can easily be extended to support custom data types. Here's a step-by-step guide to creating your own `CompressedList` subclass:

## 1. Subclass CompressedList

Create a new class that inherits from `CompressedList` with appropriate type annotations:

```python
from typing import Any, List, TypeVar, Generic
from compressed_lists import CompressedList, Partitioning
import numpy as np

# Placeholder element type used in the annotations of the following steps
T = TypeVar("T")

class CustomCompressedList(CompressedList):
    """A custom CompressedList for your data type."""
    pass
```

## 2. Implement the Constructor

The constructor should initialize the superclass with the appropriate data:

```python
def __init__(self,
             unlist_data: Any,  # Replace with your data type
             partitioning: Partitioning,
             element_metadata: dict = None,
             metadata: dict = None):
    super().__init__(unlist_data, partitioning,
                     element_type="custom_type",  # Set your element type
                     element_metadata=element_metadata,
                     metadata=metadata)
```

## 3. Implement _extract_range Method

This method defines how to extract a range of elements from your unlisted data:

```python
def _extract_range(self, start: int, end: int) -> List[T]:
    """Extract a range from unlisted data."""
    # For example, with numpy arrays:
    return self.unlist_data[start:end].tolist()

    # Or for other data types:
    # return self.unlist_data[start:end]
```

## 4. Implement from_list Class Method

This factory method creates a new instance from a list:

```python
@classmethod
def from_list(cls, lst: List[List[T]], names: list = None,
              metadata: dict = None) -> 'CustomCompressedList':
    """Create a new CustomCompressedList from a list."""
    # Flatten the list
    flat_data = []
    for sublist in lst:
        flat_data.extend(sublist)

    # Create partitioning
    partitioning = Partitioning.from_list(lst, names)

    # Create unlisted data in your preferred format
    # For example, with numpy:
    unlist_data = np.array(flat_data, dtype=np.float64)

    return cls(unlist_data, partitioning, metadata=metadata)
```
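
The flattening step can also be sketched on its own, without the library. The standalone helper below is hypothetical (`flatten_with_ends` is not a `compressed_lists` function). It uses `itertools` to produce the flat data and the cumulative end positions in one pass each. `Partitioning.from_list` may compute its boundaries differently, so treat this only as an illustration of the bookkeeping involved.

```python
from itertools import accumulate, chain


def flatten_with_ends(lst):
    """Return (flat_data, ends) for a list of lists."""
    flat = list(chain.from_iterable(lst))
    # Running total of sublist lengths gives each element's end position.
    ends = list(accumulate(len(sublist) for sublist in lst))
    return flat, ends


flat, ends = flatten_with_ends([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
print(flat)  # [1.1, 2.2, 3.3, 4.4, 5.5]
print(ends)  # [3, 3, 5]
```

An empty sublist simply repeats the previous end position, so it occupies no space in the flat data.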

## Complete Example: CompressedFloatList

Here's a complete example of a custom CompressedList for floating-point numbers:

```{code-cell}
import numpy as np
from compressed_lists import CompressedList, Partitioning
from typing import List

class CompressedFloatList(CompressedList):
    def __init__(self,
                 unlist_data: np.ndarray,
                 partitioning: Partitioning,
                 element_metadata: dict = None,
                 metadata: dict = None):
        super().__init__(unlist_data, partitioning,
                         element_type="float",
                         element_metadata=element_metadata,
                         metadata=metadata)

    def _extract_range(self, start: int, end: int) -> List[float]:
        return self.unlist_data[start:end].tolist()

    @classmethod
    def from_list(cls, lst: List[List[float]], names: list = None,
                  metadata: dict = None) -> 'CompressedFloatList':
        # Flatten the list
        flat_data = []
        for sublist in lst:
            flat_data.extend(sublist)

        # Create partitioning
        partitioning = Partitioning.from_list(lst, names)

        # Create unlist_data
        unlist_data = np.array(flat_data, dtype=np.float64)

        return cls(unlist_data, partitioning, metadata=metadata)

# Usage
float_data = [[1.1, 2.2, 3.3], [4.4, 5.5], [6.6, 7.7, 8.8, 9.9]]
float_list = CompressedFloatList.from_list(float_data, names=["X", "Y", "Z"])
print(float_list["Y"])
```

## For More Complex Data Types

For more complex data types, you would follow the same pattern but customize the storage and extraction methods to suit your data.

For example, with a custom object:

```python
class MyObject:
    def __init__(self, value):
        self.value = value

class CompressedMyObjectList(CompressedList[List[MyObject]]):
    # Implementation details...

    def _extract_range(self, start: int, end: int) -> List[MyObject]:
        return self.unlist_data[start:end]

    @classmethod
    def from_list(cls, lst: List[List[MyObject]], ...):
        # Custom flattening and storage logic
        # ...
```
6 changes: 3 additions & 3 deletions pyproject.toml
@@ -12,14 +12,14 @@ version_scheme = "no-guess-dev"
line-length = 120
src = ["src"]
exclude = ["tests"]
extend-ignore = ["F821"]
lint.extend-ignore = ["F821"]

[tool.ruff.pydocstyle]
[tool.ruff.lint.pydocstyle]
convention = "google"

[tool.ruff.format]
docstring-code-format = true
docstring-code-line-length = 20

[tool.ruff.per-file-ignores]
[tool.ruff.lint.per-file-ignores]
"__init__.py" = ["E402", "F401"]
7 changes: 4 additions & 3 deletions setup.cfg
@@ -12,10 +12,10 @@ license = MIT
license_files = LICENSE.txt
long_description = file: README.md
long_description_content_type = text/markdown; charset=UTF-8; variant=GFM
url = https://github.com/pyscaffold/pyscaffold/
url = https://github.com/biocpy/compressed-lists
# Add here related links, for example:
project_urls =
Documentation = https://pyscaffold.org/
Documentation = https://github.com/biocpy/compressed-lists
# Source = https://github.com/pyscaffold/pyscaffold/
# Changelog = https://pyscaffold.org/en/latest/changelog.html
# Tracker = https://github.com/pyscaffold/pyscaffold/issues
@@ -41,14 +41,15 @@ package_dir =
=src

# Require a min/specific Python version (comma-separated conditions)
# python_requires = >=3.8
python_requires = >=3.9

# Add here dependencies of your project (line-separated), e.g. requests>=2.2,<3.0.
# Version specifiers like >=2.2,<3.0 avoid problems due to API changes in
# new major versions. This works if the required packages follow Semantic Versioning.
# For more information, check out https://semver.org/.
install_requires =
importlib-metadata; python_version<"3.8"
biocutils


[options.packages.find]