ecpfs

An implementation of the extended Cluster Pruning (eCP) index.

The index is stored in a versatile directory format that enables flexible loading from any language.

Format

"collection" -> "name": str
"collection" -> "embeddings": np.ndarray, shape=(N, dim)

"index" -> "target_cluster_size": number of items to aim for in each cluster
"index" -> "total_clusters": N / target_cluster_size
"index" -> "node_size": total_clusters ** (1. / L), where L is the number of levels
"index" -> "levels": height of tree
"index" -> "representative_embeddings": Embeddings of the cluster leaders, np.ndarray
"index" -> "representative_ids": Item ids of the cluster leaders, List[int]
"index" -> "representative_option": How the representatives were selected, ["offset", "random", "dissimilar"]
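
As a rough illustration of how the size parameters above relate to each other, the following sketch derives "total_clusters" and "node_size" from N, "target_cluster_size" and the number of levels L. The helper name `derive_index_params` is hypothetical, and rounding both values up to whole numbers is an assumption; the format description only gives the unrounded formulas.

```python
import math


def derive_index_params(n_items: int, target_cluster_size: int, levels: int) -> dict:
    """Hypothetical helper: derive the size parameters listed in the format.

    Assumption: both values are rounded up to integers. The format
    description only states N / target_cluster_size and
    total_clusters ** (1. / L).
    """
    if n_items <= 0 or target_cluster_size <= 0 or levels <= 0:
        raise ValueError("all arguments must be positive")

    total_clusters = math.ceil(n_items / target_cluster_size)

    # L-th root of total_clusters. Rounding to 9 decimals first guards
    # against float noise (e.g. 3.0000000000000004) before taking ceil.
    node_size = math.ceil(round(total_clusters ** (1.0 / levels), 9))

    return {
        "target_cluster_size": target_cluster_size,
        "total_clusters": total_clusters,
        "node_size": node_size,
        "levels": levels,
    }


params = derive_index_params(n_items=1_000_000, target_cluster_size=100, levels=2)
print(params)
```

With one million items, a target cluster size of 100 and two levels, this gives 10000 clusters and a node size of 100.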

"index" -> "root": Node for the top level, containing total_clusters ** (1. / L) embeddings
"index" -> "lvl_[0..L-1]" -> "node_id"
"index" -> "leafs": Clusters; similar to a Node, but without node_ids

"node_id": {
	"embeddings": np.ndarray, shape=(node_size, dim),
	"item_ids": List[int],
	"node_ids": List[int],
	"border": Tuple[int, dtype]
}
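
A minimal sketch of how a node with these fields might be represented and used in memory is shown below. It is not the library's API: the `Node` dataclass and the `closest_children` helper are hypothetical, plain Python lists stand in for np.ndarray to keep the example dependency-free, Euclidean distance is an assumed metric, and "border" is left out because its meaning is not described here. The sketch also assumes that row i of "embeddings" corresponds to entry i of "item_ids" and "node_ids".

```python
import math
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class Node:
    """Hypothetical in-memory mirror of the "node_id" entry in the format."""
    embeddings: List[Sequence[float]]                    # one row per representative
    item_ids: List[int] = field(default_factory=list)    # item id of each row
    node_ids: List[int] = field(default_factory=list)    # child node id of each row


def closest_children(node: Node, query: Sequence[float], b: int = 1) -> List[int]:
    """Return the node_ids of the b rows closest to the query.

    Assumption: Euclidean distance; the format does not fix a metric.
    """
    ranked = sorted(
        range(len(node.embeddings)),
        key=lambda i: math.dist(node.embeddings[i], query),
    )
    return [node.node_ids[i] for i in ranked[:b]]


root = Node(
    embeddings=[[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]],
    item_ids=[10, 42, 77],
    node_ids=[0, 1, 2],
)
nearest = closest_children(root, query=[0.9, 1.2], b=2)
print(nearest)
```

Repeating this step level by level, starting at "root" and following the returned ids through "lvl_0" onward, is one way a search could descend to the leaf clusters.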

The format can be further extended and modified.
