Commit 1265f6d

Merge pull request #51 from phenology/development
Development
2 parents 470625f + 4fd711f commit 1265f6d

67 files changed, +1566 -1476 lines changed

CHANGELOG.rst

Lines changed: 12 additions & 0 deletions
@@ -8,6 +8,18 @@ This project adheres to `Semantic Versioning <http://semver.org/>`_.
 [Unreleased]
 ************
 
+[0.5.0] - 2021-09-23
+********************
+
+Added
+-----
+* k-means implementation for tri-clustering
+* utility functions to calculate cluster-based averages for tri-clustering
+
+Changed
+-------
+* Best k value in k-means is now selected automatically using the Silhouette score
+
 [0.4.0] - 2021-07-29
 ********************
 
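The changelog entry on automatic k selection can be illustrated with a minimal, standard-library-only sketch. It is not the CGC implementation: the helper names (`init_centroids`, `kmeans`, `silhouette_score`, `select_k`), the farthest-point initialization, and the toy data are all assumptions made for illustration. The idea is to run k-means for each candidate k and keep the k with the highest mean Silhouette score.

```python
import math


def init_centroids(points, k):
    """Deterministic farthest-point initialization (an assumption, chosen for reproducibility)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    return centroids


def kmeans(points, k, n_iter=50):
    """Plain Lloyd iterations; returns one cluster label per point."""
    centroids = init_centroids(points, k)
    labels = []
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the previous centroid if a cluster empties
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def silhouette_score(points, labels):
    """Mean Silhouette coefficient; singleton clusters contribute 0."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the Silhouette score needs at least two clusters")
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def select_k(points, k_values):
    """Return the k with the highest Silhouette score, plus all scores."""
    scores = {k: silhouette_score(points, kmeans(points, k)) for k in k_values}
    return max(scores, key=scores.get), scores


# Toy data: three well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
best_k, scores = select_k(points, range(2, 6))
print(best_k, scores)
```

For this toy data the highest score is obtained for three clusters. Splitting a compact group (k = 4, 5) or merging two groups (k = 2) lowers the mean score, which is why the Silhouette score is a common criterion for choosing k.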

CITATION.cff

Lines changed: 2 additions & 2 deletions
@@ -31,7 +31,7 @@ authors:
     family-names: Zurita-Milla
     given-names: Raul
 cff-version: "1.0.3"
-date-released: 2021-07-27
+date-released: 2021-09-23
 doi: "10.5281/zenodo.3979172"
 keywords:
   - "clustering"
@@ -42,5 +42,5 @@ keywords:
 license: "Apache-2.0"
 message: "If you use this software, please cite it using these metadata."
 title: "Clustering Geo-Data Cubes (CGC): A Clustering Tool for Geospatial Applications"
-version: "0.4.0"
+version: "0.5.0"
 ...

README.rst

Lines changed: 20 additions & 15 deletions
@@ -2,7 +2,7 @@
    :widths: 25 25
    :header-rows: 1
 
-   * - fair-software.nl recommendations
+   * - `fair-software.nl <https://fair-software.nl>`_ recommendations
      - Badges
    * - \1. Code repository
      - |GitHub Badge|
@@ -53,18 +53,16 @@
    :target: https://cgc.readthedocs.io/en/latest/?badge=latest
    :alt: Documentation Status
 
-################################################################################
 CGC: Clustering Geo-Data Cubes
-################################################################################
+==============================
 
 Clustering Geo-Data Cubes (CGC) is a Python package to perform clustering analysis for multidimensional geospatial data.
 The included tools allow the user to efficiently run tasks in parallel on local and distributed systems.
 
-
 Installation
 ------------
 
-To install cgc, do:
+To install CGC, do:
 
 .. code-block:: console
 
@@ -85,22 +83,22 @@ Run tests (including coverage) with:
 
   python setup.py test
 
-
 Documentation
-*************
+-------------
 
 The project's full documentation can be found `here <https://cgc.readthedocs.io/en/latest/>`_.
 
 Contributing
-************
+------------
 
-If you want to contribute to the development of cgc,
-have a look at the `contribution guidelines <CONTRIBUTING.rst>`_.
+If you want to contribute to the development of cgc, have a look at the `contribution guidelines`_.
+
+.. _contribution guidelines: https://github.com/phenology/cgc/tree/master/CONTRIBUTING.md
 
 License
-*******
+-------
 
-Copyright (c) 2020,
+Copyright (c) 2020-2021,
 
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
@@ -114,9 +112,16 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
 
+Credits
+-------
 
+The code has been developed as a collaborative effort between the `ITC, University of Twente`_ and
+`the Netherlands eScience Center`_ within the generalization of the project
+`High spatial resolution phenological modelling at continental scales`_.
 
-Credits
-*******
+.. _ITC, University of Twente: https://www.itc.nl
+.. _High spatial resolution phenological modelling at continental scales: https://www.esciencecenter.nl/projects/high-spatial-resolution-phenological-modelling-at-continental-scales/
+.. _the Netherlands eScience Center: https://www.esciencecenter.nl
 
-This package was created with `Cookiecutter <https://github.com/audreyr/cookiecutter>`_ and the `NLeSC/python-template <https://github.com/NLeSC/python-template>`_.
+This package was created with `Cookiecutter <https://github.com/audreyr/cookiecutter>`_ and the
+`NLeSC/python-template <https://github.com/NLeSC/python-template>`_.

cgc/__version__.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = '0.4.0'
+__version__ = '0.5.0'

cgc/coclustering.py

Lines changed: 65 additions & 37 deletions
@@ -15,19 +15,51 @@
 
 class CoclusteringResults(Results):
     """
-    Contains results and metadata of a co-clustering calculation
+    Contains results and metadata of a co-clustering calculation.
+
+    :var row_clusters: Final row cluster assignment.
+    :type row_clusters: numpy.ndarray
+    :var col_clusters: Final column cluster assignment.
+    :type col_clusters: numpy.ndarray
+    :var error: Approximation error of the co-clustering.
+    :type error: float
+    :var nruns_completed: Number of successfully completed runs.
+    :type nruns_completed: int
+    :var nruns_converged: Number of converged runs.
+    :type nruns_converged: int
     """
-    def reset(self):
-        self.row_clusters = None
-        self.col_clusters = None
-        self.error = None
-        self.nruns_completed = 0
-        self.nruns_converged = 0
+    row_clusters = None
+    col_clusters = None
+    error = None
+    nruns_completed = 0
+    nruns_converged = 0
 
 
 class Coclustering(object):
     """
-    Perform the co-clustering analysis of a 2D array
+    Perform a co-clustering analysis for a two-dimensional array.
+
+    :param Z: Data matrix for which to run the co-clustering analysis
+    :type Z: numpy.ndarray or dask.array.Array
+    :param nclusters_row: Number of row clusters.
+    :type nclusters_row: int
+    :param nclusters_col: Number of column clusters.
+    :type nclusters_col: int
+    :param conv_threshold: Convergence threshold for the objective function.
+    :type conv_threshold: float, optional
+    :param max_iterations: Maximum number of iterations.
+    :type max_iterations: int, optional
+    :param nruns: Number of differently-initialized runs.
+    :type nruns: int, optional
+    :param epsilon: Numerical parameter, avoids zero arguments in the
+        logarithm that appears in the expression of the objective function.
+    :type epsilon: float, optional
+    :param output_filename: Name of the JSON file where to write the results.
+    :type output_filename: string, optional
+    :param row_clusters_init: Initial row cluster assignment.
+    :type row_clusters_init: numpy.ndarray or array_like, optional
+    :param col_clusters_init: Initial column cluster assignment.
+    :type col_clusters_init: numpy.ndarray or array_like, optional
     """
     def __init__(self,
                  Z,
@@ -40,20 +72,6 @@ def __init__(self,
                  output_filename='',
                  row_clusters_init=None,
                  col_clusters_init=None):
-        """
-        Initialize the object
-
-        :param Z: m x n data matrix
-        :param nclusters_row: number of row clusters
-        :param nclusters_col: number of column clusters
-        :param conv_threshold: convergence threshold for the objective function
-        :param max_iterations: maximum number of iterations
-        :param nruns: number of differently-initialized runs
-        :param epsilon: numerical parameter, avoids zero arguments in log
-        :param output_filename: name of the file where to write the clusters
-        :param row_clusters_init: initial row clusters
-        :param col_clusters_init: initial column clusters
-        """
         # Input parameters -----------------
         self.nclusters_row = nclusters_row
         self.nclusters_col = nclusters_col
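The `CoclusteringResults` change in this file replaces the `reset()` method with class-level defaults. The sketch below shows how such defaults behave in plain Python. The `Results` base class here is an empty, hypothetical stand-in, since the real base class is not shown in this diff.

```python
class Results:
    """Hypothetical stand-in for the package's base results class."""


class CoclusteringResults(Results):
    # Class-level defaults: every new instance sees these values
    # until it assigns its own.
    row_clusters = None
    col_clusters = None
    error = None
    nruns_completed = 0
    nruns_converged = 0


first = CoclusteringResults()
second = CoclusteringResults()

# Assigning through an instance creates an instance attribute and
# leaves the class default (and other instances) untouched.
first.nruns_completed += 1
first.error = 0.25

print(first.nruns_completed, second.nruns_completed)
print(vars(first), vars(second))
```

This pattern is safe here because the defaults (`None` and `0`) are immutable. A mutable default such as a list would be shared between instances and would need to be created per instance instead.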
@@ -80,11 +98,16 @@ def __init__(self,
 
     def run_with_dask(self, client=None, low_memory=True):
         """
-        Run the co-clustering with Dask
-
-        :param client: Dask client
-        :param low_memory: if false, all runs are submitted to the Dask cluster
-        :return: co-clustering results
+        Run the co-clustering analysis using Dask.
+
+        :param client: Dask client. If not specified, the default
+            `LocalCluster` is employed.
+        :type client: dask.distributed.Client, optional
+        :param low_memory: If False, all runs are submitted to the Dask cluster
+            (experimental feature, discouraged).
+        :type low_memory: bool, optional
+        :return: Co-clustering results.
+        :type: cgc.coclustering.CoclusteringResults
         """
         self.client = client if client is not None else Client()
 
@@ -101,14 +124,19 @@ def run_with_threads(self,
                          low_memory=False,
                          numba_jit=False):
         """
-        Run the co-clustering using an algorithm based on numpy + threading
-        (only suitable for local runs)
-
-        :param nthreads: number of threads
-        :param low_memory: if true, use a memory-conservative algorithm
-        :param numba_jit: if true, and low_memory is true, then use Numba
-            just-in-time compilation to improve performance
-        :return: co-clustering results
+        Run the co-clustering using an algorithm based on Numpy plus threading
+        (only suitable for local runs).
+
+        :param nthreads: Number of threads employed to simultaneously run
+            differently-initialized co-clustering analysis.
+        :type nthreads: int, optional
+        :param low_memory: If True, use a memory-conservative algorithm.
+        :type low_memory: bool, optional
+        :param numba_jit: If True, and low_memory is True, then use Numba
+            just-in-time compilation to improve the performance.
+        :type numba_jit: bool, optional
+        :return: Co-clustering results.
+        :type: cgc.coclustering.CoclusteringResults
         """
         with ThreadPoolExecutor(max_workers=nthreads) as executor:
             futures = [
@@ -143,7 +171,7 @@ def run_with_threads(self,
         return self.results
 
     def _dask_runs_memory(self):
-        """ Memory efficient Dask implementation: sequential runs """
+        """ Memory efficient Dask implementation: sequential runs. """
         for r in range(self.nruns):
             logger.info(f'Run {self.results.nruns_completed}')
             converged, niters, row, col, e = coclustering_dask.coclustering(
@@ -171,7 +199,7 @@ def _dask_runs_memory(self):
     def _dask_runs_performance(self):
         """
         Faster but memory-intensive Dask implementation: all runs are
-        simultaneosly submitted to the scheduler
+        simultaneously submitted to the scheduler (experimental, discouraged).
         """
         Z = self.client.scatter(self.Z)
         futures = [self.client.submit(
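The `run_with_threads` docstring describes running several differently-initialized analyses at once on a thread pool. The general pattern can be sketched with the standard library alone. This is not the CGC code: `toy_run` is a hypothetical stand-in for a single run, and keeping the run with the lowest error is an assumption about how results are aggregated.

```python
import random
from concurrent.futures import ThreadPoolExecutor, as_completed


def toy_run(seed):
    """Hypothetical stand-in for one run: returns (converged, error)."""
    rng = random.Random(seed)  # each run gets its own initialization
    return rng.random() < 0.8, rng.uniform(0.0, 1.0)


def best_of_runs(nruns, nthreads):
    """Submit all runs to a thread pool and keep the lowest error."""
    best_error = None
    completed = converged = 0
    with ThreadPoolExecutor(max_workers=nthreads) as executor:
        futures = [executor.submit(toy_run, seed) for seed in range(nruns)]
        for future in as_completed(futures):
            has_converged, error = future.result()
            completed += 1
            converged += int(has_converged)
            if best_error is None or error < best_error:
                best_error = error
    return best_error, completed, converged


best_error, completed, converged = best_of_runs(nruns=8, nthreads=4)
print(best_error, completed, converged)
```

Threads help in this setting when the heavy numerical work releases the interpreter lock, as many NumPy operations do; the toy function above only demonstrates the bookkeeping.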

cgc/coclustering_dask.py

Lines changed: 24 additions & 13 deletions
@@ -31,19 +31,30 @@ def coclustering(Z, nclusters_row, nclusters_col, errobj, niters, epsilon,
                  col_clusters_init=None, row_clusters_init=None,
                  run_on_worker=False):
     """
-    Run the co-clustering, Dask implementation
-
-    :param Z: m x n data matrix
-    :param nclusters_row: num row clusters
-    :param nclusters_col: number of column clusters
-    :param errobj: convergence threshold for the objective function
-    :param niters: maximum number of iterations
-    :param epsilon: numerical parameter, avoids zero arguments in log
-    :param row_clusters_init: initial row cluster assignment
-    :param col_clusters_init: initial column cluster assignment
-    :param run_on_worker: whether the function is submitted to a Dask worker
-    :return: has converged, number of iterations performed. final row and
-        column clustering, error value
+    Run the co-clustering analysis, Dask implementation.
+
+    :param Z: Data matrix for which to run the co-clustering analysis
+    :type Z: dask.array.Array or array_like
+    :param nclusters_row: Number of row clusters.
+    :type nclusters_row: int
+    :param nclusters_col: Number of column clusters.
+    :type nclusters_col: int
+    :param errobj: Convergence threshold for the objective function.
+    :type errobj: float, optional
+    :param niters: Maximum number of iterations.
+    :type niters: int, optional
+    :param epsilon: Numerical parameter, avoids zero arguments in the
+        logarithm that appears in the expression of the objective function.
+    :type epsilon: float, optional
+    :param row_clusters_init: Initial row cluster assignment.
+    :type row_clusters_init: numpy.ndarray or array_like, optional
+    :param col_clusters_init: Initial column cluster assignment.
+    :type col_clusters_init: numpy.ndarray or array_like, optional
+    :param run_on_worker: Whether the function is submitted to a Dask worker
+    :type run_on_worker: bool, optional
+    :return: Has converged, number of iterations performed, final row and
+        column clustering, approximation error of the co-clustering.
+    :type: tuple
     """
     client = get_client()
 
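The `epsilon` parameter is documented as avoiding zero arguments in the logarithm of the objective function. The exact objective is not shown in this diff, so the sketch below uses the generalized Kullback-Leibler (I-)divergence as an assumed example of a log-based objective. The function name `i_divergence` and the placement of `epsilon` in both numerator and denominator are illustrative choices, not the CGC implementation.

```python
import math


def i_divergence(Z, Z_hat, epsilon=1e-8):
    """Sum of z * log(z / z_hat) - z + z_hat over all matrix entries.

    epsilon is added to both arguments of the ratio so that zero entries
    never reach math.log (assumed placement, for illustration only).
    """
    total = 0.0
    for z_row, zh_row in zip(Z, Z_hat):
        for z, zh in zip(z_row, zh_row):
            total += z * math.log((z + epsilon) / (zh + epsilon)) - z + zh
    return total


Z = [[0.0, 1.0]]        # contains an exact zero
Z_hat = [[0.5, 0.5]]    # approximation of Z

# Without epsilon, the zero entry would require math.log(0.0), which raises
# ValueError; with epsilon the term is finite and is multiplied by z = 0.
print(i_divergence(Z, Z_hat))
print(i_divergence(Z, Z))
```

Because the zero entry is multiplied by the value of its logarithm, a small `epsilon` changes the result only marginally while keeping every term well defined.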

cgc/coclustering_numpy.py

Lines changed: 27 additions & 14 deletions
@@ -113,20 +113,33 @@ def coclustering(Z,
                  row_clusters_init=None,
                  col_clusters_init=None):
     """
-    Run the co-clustering, Numpy-based implementation
-
-    :param Z: m x n data matrix
-    :param nclusters_row: number of row clusters
-    :param nclusters_col: number of column clusters
-    :param errobj: convergence threshold for the objective function
-    :param niters: maximum number of iterations
-    :param epsilon: numerical parameter, avoids zero arguments in log
-    :param low_memory: boolean, set low memory usage version
-    :param numba_jit: boolean, set numba optimized single node version
-    :param row_clusters_init: initial row cluster assignment
-    :param col_clusters_init: initial column cluster assignment
-    :return: has converged, number of iterations performed, final row and
-        column clustering, error value
+    Run the co-clustering analysis, Numpy-based implementation.
+
+    :param Z: Data matrix for which to run the co-clustering analysis
+    :type Z: numpy.ndarray
+    :param nclusters_row: Number of row clusters.
+    :type nclusters_row: int
+    :param nclusters_col: Number of column clusters.
+    :type nclusters_col: int
+    :param errobj: Convergence threshold for the objective function.
+    :type errobj: float, optional
+    :param niters: Maximum number of iterations.
+    :type niters: int, optional
+    :param epsilon: Numerical parameter, avoids zero arguments in the
+        logarithm that appears in the expression of the objective function.
+    :type epsilon: float, optional
+    :param low_memory: Make use of a low-memory version of the algorithm.
+    :type low_memory: bool, optional
+    :param numba_jit: Make use of Numba JIT acceleration (only if low_memory
+        is True).
+    :type numba_jit: bool, optional
+    :param row_clusters_init: Initial row cluster assignment.
+    :type row_clusters_init: numpy.ndarray or array_like, optional
+    :param col_clusters_init: Initial column cluster assignment.
+    :type col_clusters_init: numpy.ndarray or array_like, optional
+    :return: Has converged, number of iterations performed, final row and
+        column clustering, approximation error of the co-clustering.
+    :type: tuple
     """
     [m, n] = Z.shape
 