Commit c8c2f87

finish the API
1 parent bccd615 commit c8c2f87

2 files changed

+41
-26
lines changed

README.md

Lines changed: 9 additions & 2 deletions
@@ -11,10 +11,17 @@ which interplay with the functions:
1111
- `cluster`
1212
- `cluster_number`
1313
- `cluster_labels`
14+
- `cluster_probs`
1415

1516
To create new clustering algorithms simply create a new
1617
subtype of `ClusteringAlgorithm` that extends `cluster`
17-
so that it returns a new subtype of `ClusteringResult`
18-
which itself extends `cluster_labels`.
18+
so that it returns a new subtype of `ClusteringResults`.
19+
The result must extend `cluster_number` and `cluster_labels`,
20+
and optionally `cluster_probs`.
21+
22+
Note that the input data type must always be an `AbstractVector` of vectors
23+
(anything on which a distance can be defined).
24+
Two helper functions, `each_data_point` and `input_data_size`, can help
25+
make this harmonious with matrix inputs.
1926

2027
For more, see the docstring of `cluster`.
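
The README text above describes how to add a new algorithm. A minimal sketch of that workflow follows. It re-declares stand-ins for the abstract types and the two query functions so that it runs without the package, and `ThresholdClustering`, `ThresholdResults`, and their fields are hypothetical names invented for illustration, not part of this commit.

```julia
# Stand-ins mirroring the API in this commit, so the sketch is self-contained.
abstract type ClusteringAlgorithm end
abstract type ClusteringResults end

cluster_labels(cr::ClusteringResults) = cr.labels
cluster_number(cr::ClusteringResults) = count(>(0), Set(cluster_labels(cr)))

# Hypothetical toy algorithm: split points by their first coordinate and
# leave points with a very large coordinate unclustered.
struct ThresholdClustering <: ClusteringAlgorithm
    cutoff::Float64   # first coordinate below this goes to cluster 1, otherwise cluster 2
    outlier::Float64  # any |coordinate| above this marks the point as noise
end

# Hypothetical result type; it stores the labels in a `labels` field.
struct ThresholdResults <: ClusteringResults
    algorithm::ThresholdClustering
    labels::Vector{Int}
end

# Extend `cluster` for the new algorithm; input is a vector of vectors.
# Toy limitation: if no point falls below `cutoff`, the labels are not `1:n`.
function cluster(ca::ThresholdClustering, data::AbstractVector)
    labels = map(data) do point
        if any(x -> abs(x) > ca.outlier, point)
            -1  # unclustered point (noise)
        elseif first(point) < ca.cutoff
            1
        else
            2
        end
    end
    return ThresholdResults(ca, labels)
end

data = [[0.1, 0.2], [0.9, 0.8], [50.0, 0.0], [0.2, 0.1]]
cr = cluster(ThresholdClustering(0.5, 10.0), data)
cluster_labels(cr)  # [1, 2, -1, 1]
cluster_number(cr)  # 2, because the negative label is not counted
```

Because the result type keeps its labels in a `labels` field, the generic fallbacks for `cluster_labels` and `cluster_number` work without further method definitions.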

src/ClusteringAPI.jl

Lines changed: 32 additions & 24 deletions
@@ -1,14 +1,7 @@
11
module ClusteringAPI
22

3-
# Use the README as the module docs
4-
@doc let
5-
path = joinpath(dirname(@__DIR__), "README.md")
6-
include_dependency(path)
7-
read(path, String)
8-
end ClusteringAPI
9-
103
export ClusteringAlgorithm, ClusteringResults
11-
export cluster, cluster_number, cluster_labels
4+
export cluster, cluster_number, cluster_labels, cluster_probs
125

136
abstract type ClusteringAlgorithm end
147
abstract type ClusteringResults end
@@ -18,29 +11,33 @@ abstract type ClusteringResults end
1811
1912
Cluster input `data` according to the algorithm specified by `ca`.
2013
All options related to the algorithm are given as keyword arguments when
21-
constructing `ca`. The input data can be specified two ways:
14+
constructing `ca`.
2215
23-
- as a (d, m) matrix, with d the dimension of the data points and m the amount of
24-
data points (i.e., each column is a data point).
25-
- as a length-m vector of length-d vectors (i.e., each inner vector is a data point).
16+
The input data are a length-m vector of length-d vectors.
17+
"Vector" here is considered in the generalized sense, i.e., any objects on which
18+
a distance can be defined. Some clustering algorithms may allow alternative
19+
data input types for performance acceleration.
2620
21+
The output is always a subtype of `ClusteringResults` that can be further queried.
2722
The cluster labels are always the
28-
positive integers `1:n` with `n::Int` the number of created clusters.
23+
positive integers `1:n` with `n::Int` the number of created clusters.
24+
Data points that couldn't get clustered (e.g., outliers or noise)
25+
get assigned negative integers, typically just `-1`.
2926
30-
The output is always a subtype of `ClusteringResults`,
31-
which always extends the following two methods:
27+
`ClusteringResults` subtypes always implement the following functions:
3228
29+
- `cluster_labels(cr)` returns a length-m vector `labels::Vector{Int}` containing
30+
the clustering labels (most of which are in `1:n`, while some may be negative integers).
31+
- `cluster_probs(cr)` returns `probs`, a length-m vector of length-`n` vectors
32+
containing the "probabilities" or "score" of each point belonging to one of
33+
the created clusters (used with fuzzy clustering algorithms).
3334
- `cluster_number(cr)` returns `n`.
34-
- `cluster_labels(cr)` returns `labels::Vector{Int}` a length-m vector of labels
35-
mapping each data point to each cluster (`1:n`).
36-
37-
and always includes `ca` in the field `algorithm`.
3835
3936
Other algorithm-related output can be obtained as a field of the result type,
40-
or other specific functions of the result type.
41-
This is described in the individual algorithm implementations.
37+
or by using other specific functions of the result type.
38+
This is described in the docstrings of the individual algorithm implementations.
4239
"""
43-
function cluster(ca::ClusteringAlgorithm, data::AbstractMatrix)
40+
function cluster(ca::ClusteringAlgorithm, data)
4441
throw(ArgumentError("No implementation for `cluster` for $(typeof(ca))."))
4542
end
4643

@@ -50,18 +47,29 @@ end
5047
Return the number of created clusters in the output of [`cluster`](@ref).
5148
"""
5249
function cluster_number(cr::ClusteringResults)
53-
return length(Set(cluster_labels(cr))) # fastest way to count unique elements
50+
return count(>(0), Set(cluster_labels(cr))) # fastest way to count positive labels
5451
end
5552

5653
"""
57-
cluster_labels(cr::ClusteringResults) → labels::Vector{Int}
54+
cluster_labels(cr::ClusteringResults) → labels::Vector{Int}
5855
5956
Return the cluster labels of the data points used in [`cluster`](@ref).
6057
"""
6158
function cluster_labels(cr::ClusteringResults)
6259
return cr.labels # typically there
6360
end
6461

62+
"""
63+
cluster_probs(cr::ClusteringResults) → probs::Vector{Vector{Real}}
64+
65+
Return the cluster probabilities of the data points used in [`cluster`](@ref).
66+
They are length-`n` vectors containing the "probabilities" or "score" of each point
67+
belonging to one of the created clusters (used with fuzzy clustering algorithms).
68+
"""
69+
function cluster_probs(cr::ClusteringResults)
70+
return cr.probs # assumed field name, mirroring `cr.labels` above
71+
end
72+
6573
# two helper functions for agnostic input data type
6674
"""
6775
input_data_size(data) → (d, m)
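
The two helpers are meant to let algorithms accept both matrices and vectors of vectors. Their implementations are not shown in full in this diff, so the following is only a sketch of how such helpers could be written. The names `sketch_data_size` and `sketch_each_point` are hypothetical and chosen so they are not mistaken for the commit's own `input_data_size` and `each_data_point`. It assumes that matrix columns are the data points, as in the `(d, m)` convention above.

```julia
# Hypothetical stand-ins, not the commit's implementations.

# Return (d, m): the dimension of each point and the number of points.
sketch_data_size(data::AbstractMatrix) = size(data)  # columns are data points
sketch_data_size(data::AbstractVector) = (length(first(data)), length(data))

# Return an iterator over the data points.
sketch_each_point(data::AbstractMatrix) = eachcol(data)  # column views, no copies
sketch_each_point(data::AbstractVector) = data

M = [1.0 2.0 3.0;
     4.0 5.0 6.0]                           # d = 2, m = 3
V = [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]    # the same points as a vector of vectors

sketch_data_size(M)  # (2, 3)
sketch_data_size(V)  # (2, 3)
```

With helpers of this shape, an algorithm can loop over `sketch_each_point(data)` without caring which of the two input layouts it received.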
