Commit

add obsdim argument to obsview (#202)
CarloLucibello authored Feb 8, 2025
1 parent 7c29e64 commit 97460d7
Showing 9 changed files with 258 additions and 13 deletions.
2 changes: 1 addition & 1 deletion docs/make.jl
@@ -18,7 +18,7 @@ makedocs(;
modules=[MLUtils, MLCore],
sitename = "MLUtils.jl",
pages = ["Home" => "index.md",
"Guides" => "guides.md",
"Guide" => "guide.md",
"API" => "api.md"],
)

1 change: 1 addition & 0 deletions docs/src/api.md
@@ -36,6 +36,7 @@ BatchView
eachobs
DataLoader
obsview
ObsDim
ObsView
randobs
slidingwindow
35 changes: 34 additions & 1 deletion docs/src/guides.md → docs/src/guide.md
@@ -4,7 +4,7 @@ DocTestSetup = quote
end
```

# Guides
# Guide

## Datasets

@@ -109,10 +109,43 @@ In order to avoid unnecessary memory allocations, MLUtils.jl provides the [`obsview`](@ref)
it returns a wrapper type [`ObsView`](@ref), which behaves like a dataset and can be used with any function that accepts datasets. Users can also specify the behavior of `obsview` on their custom types by implementing the `obsview` method for their type. As an example, for array data, `obsview(data, indices)` will return a subarray:

```jldoctest
julia> data = [1 2 3; 4 5 6]
2×3 Matrix{Int64}:
 1  2  3
 4  5  6

julia> obsview([1 2 3; 4 5 6], 1:2)
2×2 view(::Matrix{Int64}, :, 1:2) with eltype Int64:
 1  2
 4  5
```
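
For a custom type, a minimal `obsview` specialization might look like the sketch below. The `FileDataset` container and its methods are hypothetical, shown here only to illustrate the interface and not part of this diff:

```julia
using MLUtils

# Hypothetical dataset holding one observation per file path.
struct FileDataset
    paths::Vector{String}
end

MLUtils.numobs(d::FileDataset) = length(d.paths)
MLUtils.getobs(d::FileDataset, i::Integer) = read(d.paths[i], String)  # load only on access

# Custom obsview: subset the cheap path vector instead of wrapping in ObsView.
MLUtils.obsview(d::FileDataset, indices::AbstractVector) = FileDataset(d.paths[indices])
```

With these definitions, `obsview(d, 1:100)` only keeps a smaller vector of paths around, and files are read from disk only when `getobs` is called.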

When working with arrays, it is also possible to use an [`ObsDim`](@ref) object as input to [`obsview`](@ref) to specify the dimension along which the observations are stored. This is useful when the last dimension is not the observation dimension.

An example is the 3D arrays used as inputs to recurrent neural networks and transformers,
which usually have size `(n_features, n_timesteps, n_samples)`. When we want to treat the timesteps as observations, we can proceed as follows:

```jldoctest
julia> data = reshape([1:24;], 3, 4, 2)
3×4×2 Array{Int64, 3}:
[:, :, 1] =
 1  4  7  10
 2  5  8  11
 3  6  9  12

[:, :, 2] =
 13  16  19  22
 14  17  20  23
 15  18  21  24

julia> ov = obsview(data, ObsDim(2));

julia> getobs(ov, 1)
3×2 Matrix{Int64}:
 1  13
 2  14
 3  15
```
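
Downstream utilities can consume such a view like any other dataset. A minimal sketch, not part of this diff and assuming `eachobs` composes with the view as with any dataset:

```julia
using MLUtils

data = reshape([1:24;], 3, 4, 2)   # (n_features, n_timesteps, n_samples)
ov = obsview(data, ObsDim(2))      # treat dimension 2 (timesteps) as observations

@assert numobs(ov) == 4            # 4 timesteps, not 2 samples

# Each observation is an (n_features, n_samples) slice at a fixed timestep.
for x in eachobs(ov)
    @assert size(x) == (3, 2)
end
```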



81 changes: 81 additions & 0 deletions docs/src/index.md
@@ -0,0 +1,81 @@
# MLUtils.jl

[![](https://img.shields.io/badge/docs-stable-blue.svg)](https://JuliaML.github.io/MLUtils.jl/stable)
[![](https://img.shields.io/badge/docs-dev-blue.svg)](https://JuliaML.github.io/MLUtils.jl/dev)
[![](https://github.com/JuliaML/MLUtils.jl/actions/workflows/CI.yml/badge.svg?branch=main)](https://github.com/JuliaML/MLUtils.jl/actions/workflows/CI.yml?query=branch%3Amain)
[![](https://codecov.io/gh/JuliaML/MLUtils.jl/branch/main/graph/badge.svg)](https://codecov.io/gh/JuliaML/MLUtils.jl)

*MLUtils.jl* defines interfaces and implements common utilities for Machine Learning pipelines.

## Features

- An extensible dataset interface (`numobs` and `getobs`).
- Data iteration and dataloaders (`eachobs` and `DataLoader`).
- Lazy data views (`obsview`).
- Resampling procedures (`undersample` and `oversample`).
- Train/test splits (`splitobs`, stratified split).
- Data partitioning and aggregation tools (`batch`, `batch_sequence`, `unbatch`, `chunk`, `group_counts`, `group_indices`).
- Folds for cross-validation (`kfolds`, `leavepout`).
- Lazy dataset transformations (`mapobs`, `filterobs`, `groupobs`, `joinobs`, `shuffleobs`).
- Toy datasets for demonstration purposes.
- Other data handling utilities (`flatten`, `normalise`, `unsqueeze`, `stack`, `unstack`, `slidingwindow`).


## Examples

Let us take a look at a hello world example to get a feeling for
how to use this package in a typical ML scenario.

```julia
using MLUtils

# X is a matrix of floats
# Y is a vector of strings
X, Y = load_iris()

# The iris dataset is ordered according to its labels,
# which means that we should shuffle the dataset before
# partitioning it into training and test sets.
Xs, Ys = shuffleobs((X, Y))

# We leave out 15 % of the data for testing
cv_data, test_data = splitobs((Xs, Ys); at=0.85)

# Next we partition the data using a 10-fold scheme.
for (train_data, val_data) in kfolds(cv_data; k=10)

# We apply a lazy transform for data augmentation
train_data = mapobs(xy -> (xy[1] .+ 0.1 .* randn.(), xy[2]), train_data)

for epoch = 1:10
# Iterate over the data using mini-batches of 5 observations each
for (x, y) in eachobs(train_data, batchsize=5)
# ... train supervised model on minibatches here
end
end
end
```

In the above code snippet, the inner loop for `eachobs` is the
only place where data other than indices is actually being
copied. In fact, while `x` and `y` are materialized arrays,
all the rest are data views.
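
A quick way to check this claim (a hedged sketch, assuming the array data returned by `load_iris` as above):

```julia
using MLUtils

X, Y = load_iris()
Xs, Ys = shuffleobs((X, Y))
@assert Xs isa SubArray              # shuffling only permutes indices, nothing is copied

cv_data, test_data = splitobs((Xs, Ys); at=0.85)
@assert first(cv_data) isa SubArray  # splitting also returns views

x, y = first(eachobs(cv_data, batchsize=5))
@assert x isa Array                  # only batching materializes the data
```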


## Historical Notes

*MLUtils.jl* brings together functionalities previously found in [LearnBase.jl](https://github.com/JuliaML/LearnBase.jl), [MLDataPattern.jl](https://github.com/JuliaML/MLDataPattern.jl) and [MLLabelUtils.jl](https://github.com/JuliaML/MLLabelUtils.jl). These packages are now discontinued.

Other features were ported from the deep learning library [Flux.jl](https://github.com/FluxML/Flux.jl), as they are of general use.


## Alternatives and Related Packages

- [MLJ.jl](https://alan-turing-institute.github.io/MLJ.jl/dev/) is a more complete package for managing the whole machine learning pipeline; consider it if you are looking for a scikit-learn replacement.

- [NNlib.jl](https://github.com/FluxML/NNlib.jl) provides utility functions for neural networks.

- [TableTransforms.jl](https://github.com/JuliaML/TableTransforms.jl) contains transformations for tabular datasets.

- [DataAugmentation.jl](https://github.com/FluxML/DataAugmentation.jl) provides efficient, composable data augmentation for machine and deep learning, with support for n-dimensional images, keypoints and categorical masks.

2 changes: 1 addition & 1 deletion src/MLUtils.jl
@@ -43,7 +43,7 @@ include("batchview.jl")
export batchsize, BatchView

include("obsview.jl")
export obsview, ObsView
export obsview, ObsView, ObsDim

include("dataloader.jl")
export eachobs, DataLoader
79 changes: 79 additions & 0 deletions src/obsview.jl
@@ -229,14 +229,93 @@ obsview(data) = obsview(data, 1:numobs(data))

obsview(A::SubArray) = A

"""
obsview(data::AbstractArray, [obsdim])
obsview(data::AbstractArray, idxs, [obsdim])
Return a view of the array `data` that correspond to the given
indices `idxs`. If `obsdim` of type [`ObsDim`] is provided, the observation
dimension of the array is assumed to be along that dimension, otherwise
it is assumed to be the last dimension.
If `idxs` is not provided, it will be assumed to be `1:numobs(data)`.
# Examples
```jldoctest
julia> x = rand(4, 5, 2);
julia> v = obsview(x, 2:3, ObsDim(2));
julia> numobs(v)
2
julia> getobs(v, 1) == x[:, 2, :]
true
julia> getobs(v, 1:2) == x[:, 2:3, :]
true
```
"""
obsview(data::AbstractArray) = obsview(data, 1:numobs(data))

function obsview(A::AbstractArray{T,N}, idx) where {T,N}
I = ntuple(_ -> :, N-1)
return view(A, I..., idx)
end

getobs(a::SubArray) = getobs(a.parent, last(a.indices))

### Arrays + ObsDim

"""
ObsDim(d::Int)
Type to specify the observation dimension of an array.
It can be used in combination with [`obsview`](@ref).
"""
struct ObsDim{D} end

ObsDim(d::Int) = ObsDim{d}()
ObsDim(obsdim::ObsDim) = obsdim

function obsview(data::A, ::ObsDim{D}) where {A<:AbstractArray,D}
idx = 1:size(data, D)
return ArrayObsView{D,A,typeof(idx)}(data, idx)
end

function obsview(data::A, idx, ::ObsDim{D}) where {A<:AbstractArray,D}
return ArrayObsView{D,A,typeof(idx)}(data, idx)
end

struct ArrayObsView{ObsDim,A<:AbstractArray,I<:AbstractVector} <: AbstractDataContainer
data::A
indices::I
end

Base.length(x::ArrayObsView) = length(x.indices)

function Base.getindex(x::ArrayObsView{ObsDim}, idx) where {ObsDim}
# return a view, consistently with ObsView behaviour
return selectdim(x.data, ObsDim, x.indices[idx])
end

function getobs(x::ArrayObsView{ObsDim,A}, idx) where {ObsDim, T, N, A<:AbstractArray{T,N}}
Ipre = ntuple(_ -> :, ObsDim-1)
Ipost = ntuple(_ -> :, N-ObsDim)
return x.data[Ipre..., x.indices[idx], Ipost...]
end

getobs(x::ArrayObsView) = getobs(x, 1:length(x))

function Base.show(io::IO, x::ArrayObsView{ObsDim}) where {ObsDim}
print(io, "ArrayObsView($(summary(x.data)), obsdim=$(ObsDim), numobs=$(length(x)))")
end

##### Tuples / NamedTuples
function obsview(tup::Union{Tuple, NamedTuple}, indices)
return map(data -> obsview(data, indices), tup)
end
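
Taken together, these methods give the wrapper lazy indexing (`getindex` returns a `selectdim` view) and an eager `getobs` that copies. A short usage sketch of the intended behaviour, based only on the code in this diff:

```julia
using MLUtils

x = rand(4, 5, 2)
v = obsview(x, ObsDim(2))       # ArrayObsView over dimension 2

@assert v[1] == x[:, 1, :]      # getindex -> lazy slice (a view via selectdim)

b = getobs(v, 1:3)              # getobs -> materialized copy
@assert b == x[:, 1:3, :]
@assert b isa Array{Float64,3}
```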


37 changes: 27 additions & 10 deletions src/slidingwindow.jl
@@ -1,8 +1,9 @@
struct SlidingWindow{T}
struct SlidingWindow{T} <: AbstractDataContainer
data::T
size::Int
stride::Int
count::Int
obsdim::Union{Nothing,Int}
end

Base.length(A::SlidingWindow) = A.count
@@ -19,26 +20,39 @@ function getrange(A::SlidingWindow, i::Int)
end

function Base.show(io::IO, A::SlidingWindow)
return print(io, "slidingwindow($(A.data), size=$(A.size), stride=$(A.stride))")
print(io, "slidingwindow($(summary(A.data)), size=$(A.size), stride=$(A.stride)")
if A.obsdim !== nothing
print(io, ", obsdim=$(A.obsdim)")
end
print(io, ")")
end

Base.iterate(A::SlidingWindow, i::Int=1) = i > length(A) ? nothing : (A[i], i+1)

"""
slidingwindow(data; size, stride=1) -> SlidingWindow
slidingwindow(data; size, stride=1, obsdim=nothing) -> SlidingWindow
Return a vector-like view of the `data` for which each element is
a fixed size "window" of `size` adjacent observations. Note that only complete
a fixed size "window" of `size` adjacent observations.
`stride` specifies the distance between the start elements of each
adjacent window. The default value is 1. Note that only complete
windows are included in the output, which implies that it is
possible for excess observations to be omitted from the view.
`obsdim` specifies the dimension along which the observations are
indexed for the data types that support it (e.g. arrays).
By default, the observations are indexed along the last
dimension of the data. If `obsdim` is specified, it will be
passed to `obsview` to get a view of the data along that dimension.

Note that the windows are not materialized at construction time.
To actually get a copy of the data at some window, use indexing or [`getobs`](@ref).
When indexing, the data is accessed as `getobs(data, idxs)`, with `idxs` an appropriate range of indices.
# Examples
```jldoctest
julia> s = slidingwindow(11:30, size=6)
slidingwindow(11:30, size=6, stride=1)
slidingwindow(20-element UnitRange{Int64}, size=6, stride=1)
julia> s[1] # == getobs(data, 1:6)
11:16
@@ -53,7 +67,7 @@ By default the stride is equal to 1.
```jldoctest
julia> s = slidingwindow(11:30, size=6, stride=3)
slidingwindow(11:30, size=6, stride=3)
slidingwindow(20-element UnitRange{Int64}, size=6, stride=3)
julia> for w in s; println(w); end
11:16
@@ -63,11 +77,14 @@ julia> for w in s; println(w); end
23:28
```
"""
function slidingwindow(data; size::Int, stride::Int=1)
function slidingwindow(data; size::Int, stride::Int=1, obsdim=nothing)
size > 0 || throw(ArgumentError("Specified window size must be strictly greater than 0. Actual: $size"))
if obsdim !== nothing
data = obsview(data, ObsDim(obsdim))
end
size <= numobs(data) || throw(ArgumentError("Specified window size is too large for the given number of observations"))
stride > 0 || throw(ArgumentError("Specified stride must be strictly greater than 0. Actual: $stride"))
count = floor(Int, (numobs(data) - size + stride) / stride)
return SlidingWindow(data, size, stride, count)
return SlidingWindow(data, size, stride, count, obsdim)
end
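
For reference, a short usage sketch of the new `obsdim` keyword, mirroring the test added below:

```julia
using MLUtils

x = rand(2, 6, 4)                    # (features, timesteps, samples)
s = slidingwindow(x, size=2, obsdim=2)

@assert length(s) == 5               # 5 complete windows over 6 timesteps with stride 1
@assert s[1] == x[:, 1:2, :]         # each window keeps the other dimensions intact
@assert s[2] == x[:, 2:3, :]
```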

23 changes: 23 additions & 0 deletions test/obsview.jl
@@ -176,6 +176,7 @@ end
@test @inferred(size(A)) == (15,)
@test @inferred(A[2:3]) == obsview(var, 2:3)
@test @inferred(A[[1,3]]) == obsview(var, [1,3])

@test @inferred(A[1]) == obsview(var, 1)
@test @inferred(A[11]) == obsview(var, 11)
@test @inferred(A[15]) == obsview(var, 15)
@@ -233,3 +234,25 @@
@test count == 15
end
end

@testset "obsview(array, obsdim)" begin
x = rand(2, 3, 4)

v0 = @inferred(obsview(x))
@test @inferred(getobs(v0, 1)) == x[:,:,1]
@test @inferred(getobs(v0, 2)) == x[:,:,2]
@test getobs(v0, 1) isa Matrix{Float64}
@test numobs(v0) == 4

v2 = @inferred(obsview(x, ObsDim(2)))
@test @inferred(getobs(v2, 1)) == x[:,1,:]
@test @inferred(getobs(v2, 2)) == x[:,2,:]
@test getobs(v2, 1) isa Matrix{Float64}
@test numobs(v2) == 3

v1 = @inferred(obsview(x, ObsDim(1)))
@test @inferred(getobs(v1, 1)) == x[1,:,:]
@test @inferred(getobs(v1, 2)) == x[2,:,:]
@test getobs(v1, 1) isa Matrix{Float64}
@test numobs(v1) == 2
end
11 changes: 11 additions & 0 deletions test/slidingwindow.jl
@@ -28,4 +28,15 @@
c += 1
end
@test c == 5

@testset "obsdim" begin
x = rand(2, 6, 4)
v = slidingwindow(x, size=2, obsdim=2)
@test length(v) == 5
@test v[1] isa Array{Float64,3}
@test v[1] == x[:, 1:2, :]
@test v[2] == x[:, 2:3, :]
@test v[3] == x[:, 3:4, :]
@test v[4] == x[:, 4:5, :]
end
end
