Commit

add obsdim argument to obsview (#202)
CarloLucibello authored Feb 8, 2025
1 parent 7c29e64 commit 97460d7
Showing 9 changed files with 258 additions and 13 deletions.
2 changes: 1 addition & 1 deletion docs/make.jl
@@ -18,7 +18,7 @@ makedocs(;
modules=[MLUtils, MLCore],
sitename = "MLUtils.jl",
pages = ["Home" => "index.md",
"Guides" => "guides.md",
"Guide" => "guide.md",
"API" => "api.md"],
)

1 change: 1 addition & 0 deletions docs/src/api.md
@@ -36,6 +36,7 @@ BatchView
eachobs
DataLoader
obsview
ObsDim
ObsView
randobs
slidingwindow
35 changes: 34 additions & 1 deletion docs/src/guides.md → docs/src/guide.md
@@ -4,7 +4,7 @@ DocTestSetup = quote
end
```

# Guides
# Guide

## Datasets

@@ -109,10 +109,43 @@ In order to avoid unnecessary memory allocations, MLUtils.jl provides the [`obsview`](@ref)
it returns a wrapper type [`ObsView`](@ref), which behaves like a dataset and can be used with any function that accepts datasets. Users can also specify the behavior of `obsview` on their custom types by implementing the `obsview` method for their type. As an example, for array data, `obsview(data, indices)` will return a subarray:

```jldoctest
julia> data = [1 2 3; 4 5 6]
2×3 Matrix{Int64}:
 1  2  3
 4  5  6

julia> obsview([1 2 3; 4 5 6], 1:2)
2×2 view(::Matrix{Int64}, :, 1:2) with eltype Int64:
 1  2
 4  5
```
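
For a custom type, a minimal `obsview` specialization might look like the sketch below. The `FileDataset` container and its methods are hypothetical, shown here only to illustrate the interface and not part of this diff:

```julia
using MLUtils

# Hypothetical dataset holding one observation per file path.
struct FileDataset
    paths::Vector{String}
end

MLUtils.numobs(d::FileDataset) = length(d.paths)
MLUtils.getobs(d::FileDataset, i::Integer) = read(d.paths[i], String)  # load only on access

# Custom obsview: subset the cheap path vector instead of wrapping in ObsView.
MLUtils.obsview(d::FileDataset, indices::AbstractVector) = FileDataset(d.paths[indices])
```

With these definitions, `obsview(d, 1:100)` only keeps a smaller vector of paths around, and files are read from disk only when `getobs` is called.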

When working with arrays, it is also possible to use an [`ObsDim`](@ref) object as input to [`obsview`](@ref) to specify the dimension along which the observations are stored. This is useful when the last dimension is not the observation dimension.

An example is the 3D arrays used as inputs to recurrent neural networks and transformers,
which usually have size `(n_features, n_timesteps, n_samples)`. When we want to treat the timesteps as observations, we can proceed as follows:

```jldoctest
julia> data = reshape([1:24;], 3, 4, 2)
3×4×2 Array{Int64, 3}:
[:, :, 1] =
 1  4  7  10
 2  5  8  11
 3  6  9  12

[:, :, 2] =
 13  16  19  22
 14  17  20  23
 15  18  21  24

julia> ov = obsview(data, ObsDim(2));

julia> getobs(ov, 1)
3×2 Matrix{Int64}:
 1  13
 2  14
 3  15
```
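
Downstream utilities can consume such a view like any other dataset. A minimal sketch, not part of this diff and assuming `eachobs` composes with the view as with any dataset:

```julia
using MLUtils

data = reshape([1:24;], 3, 4, 2)   # (n_features, n_timesteps, n_samples)
ov = obsview(data, ObsDim(2))      # treat dimension 2 (timesteps) as observations

@assert numobs(ov) == 4            # 4 timesteps, not 2 samples

# Each observation is an (n_features, n_samples) slice at a fixed timestep.
for x in eachobs(ov)
    @assert size(x) == (3, 2)
end
```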



81 changes: 81 additions & 0 deletions docs/src/index.md
@@ -0,0 +1,81 @@
# MLUtils.jl

[![](https://img.shields.io/badge/docs-stable-blue.svg)](https://JuliaML.github.io/MLUtils.jl/stable)
[![](https://img.shields.io/badge/docs-dev-blue.svg)](https://JuliaML.github.io/MLUtils.jl/dev)
[![](https://github.com/JuliaML/MLUtils.jl/actions/workflows/CI.yml/badge.svg?branch=main)](https://github.com/JuliaML/MLUtils.jl/actions/workflows/CI.yml?query=branch%3Amain)
[![](https://codecov.io/gh/JuliaML/MLUtils.jl/branch/main/graph/badge.svg)](https://codecov.io/gh/JuliaML/MLUtils.jl)

*MLUtils.jl* defines interfaces and implements common utilities for Machine Learning pipelines.

## Features

- An extensible dataset interface (`numobs` and `getobs`).
- Data iteration and dataloaders (`eachobs` and `DataLoader`).
- Lazy data views (`obsview`).
- Resampling procedures (`undersample` and `oversample`).
- Train/test splits (`splitobs`, stratified split).
- Data partitioning and aggregation tools (`batch`, `batch_sequence`, `unbatch`, `chunk`, `group_counts`, `group_indices`).
- Folds for cross-validation (`kfolds`, `leavepout`).
- Lazy dataset transformations (`mapobs`, `filterobs`, `groupobs`, `joinobs`, `shuffleobs`).
- Toy datasets for demonstration purposes.
- Other data handling utilities (`flatten`, `normalise`, `unsqueeze`, `stack`, `unstack`, `slidingwindow`).


## Examples

Let us take a look at a hello world example to get a feeling for
how to use this package in a typical ML scenario.

```julia
using MLUtils

# X is a matrix of floats
# Y is a vector of strings
X, Y = load_iris()

# The iris dataset is ordered according to its labels,
# which means that we should shuffle the dataset before
# partitioning it into training and test sets.
Xs, Ys = shuffleobs((X, Y))

# We leave out 15 % of the data for testing
cv_data, test_data = splitobs((Xs, Ys); at=0.85)

# Next we partition the data using a 10-fold scheme.
for (train_data, val_data) in kfolds(cv_data; k=10)

# We apply a lazy transform for data augmentation
train_data = mapobs(xy -> (xy[1] .+ 0.1 .* randn.(), xy[2]), train_data)

for epoch = 1:10
# Iterate over the data using mini-batches of 5 observations each
for (x, y) in eachobs(train_data, batchsize=5)
# ... train supervised model on minibatches here
end
end
end
```

In the above code snippet, the inner loop for `eachobs` is the
only place where data other than indices is actually being
copied. In fact, while `x` and `y` are materialized arrays,
all the rest are data views.
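
A quick way to check this claim (a hedged sketch, assuming the array data returned by `load_iris` as above):

```julia
using MLUtils

X, Y = load_iris()
Xs, Ys = shuffleobs((X, Y))
@assert Xs isa SubArray              # shuffling only permutes indices, nothing is copied

cv_data, test_data = splitobs((Xs, Ys); at=0.85)
@assert first(cv_data) isa SubArray  # splitting also returns views

x, y = first(eachobs(cv_data, batchsize=5))
@assert x isa Array                  # only batching materializes the data
```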


## Historical Notes

*MLUtils.jl* brings together functionalities previously found in [LearnBase.jl](https://github.com/JuliaML/LearnBase.jl), [MLDataPattern.jl](https://github.com/JuliaML/MLDataPattern.jl) and [MLLabelUtils.jl](https://github.com/JuliaML/MLLabelUtils.jl). These packages are now discontinued.

Other features were ported from the deep learning library [Flux.jl](https://github.com/FluxML/Flux.jl), as they are of general use.


## Alternatives and Related Packages

- [MLJ.jl](https://alan-turing-institute.github.io/MLJ.jl/dev/) is a more complete package for managing the whole machine learning pipeline; consider it if you are looking for a scikit-learn replacement.

- [NNlib.jl](https://github.com/FluxML/NNlib.jl) provides utility functions for neural networks.

- [TableTransforms.jl](https://github.com/JuliaML/TableTransforms.jl) contains transformations for tabular datasets.

- [DataAugmentation.jl](https://github.com/FluxML/DataAugmentation.jl) provides efficient, composable data augmentation for machine and deep learning, with support for n-dimensional images, keypoints and categorical masks.

2 changes: 1 addition & 1 deletion src/MLUtils.jl
@@ -43,7 +43,7 @@ include("batchview.jl")
export batchsize, BatchView

include("obsview.jl")
export obsview, ObsView
export obsview, ObsView, ObsDim

include("dataloader.jl")
export eachobs, DataLoader
79 changes: 79 additions & 0 deletions src/obsview.jl
@@ -229,14 +229,93 @@ obsview(data) = obsview(data, 1:numobs(data))

obsview(A::SubArray) = A

"""
obsview(data::AbstractArray, [obsdim])
obsview(data::AbstractArray, idxs, [obsdim])
Return a view of the array `data` that correspond to the given
indices `idxs`. If `obsdim` of type [`ObsDim`] is provided, the observation
dimension of the array is assumed to be along that dimension, otherwise
it is assumed to be the last dimension.
If `idxs` is not provided, it will be assumed to be `1:numobs(data)`.
# Examples
```jldoctest
julia> x = rand(4, 5, 2);
julia> v = obsview(x, 2:3, ObsDim(2));
julia> numobs(v)
2
julia> getobs(v, 1) == x[:, 2, :]
true
julia> getobs(v, 1:2) == x[:, 2:3, :]
true
```
"""
obsview(data::AbstractArray) = obsview(data, 1:numobs(data))

function obsview(A::AbstractArray{T,N}, idx) where {T,N}
I = ntuple(_ -> :, N-1)
return view(A, I..., idx)
end

getobs(a::SubArray) = getobs(a.parent, last(a.indices))

### Arrays + ObsDim

"""
ObsDim(d::Int)
Type to specify the observation dimension of an array.
It can be used in combination with [`obsview`](@ref).
"""
struct ObsDim{D} end

ObsDim(d::Int) = ObsDim{d}()
ObsDim(obsdim::ObsDim) = obsdim

function obsview(data::A, ::ObsDim{D}) where {A<:AbstractArray,D}
idx = 1:size(data, D)
return ArrayObsView{D,A,typeof(idx)}(data, idx)
end

function obsview(data::A, idx, ::ObsDim{D}) where {A<:AbstractArray,D}
return ArrayObsView{D,A,typeof(idx)}(data, idx)
end

struct ArrayObsView{ObsDim,A<:AbstractArray,I<:AbstractVector} <: AbstractDataContainer
data::A
indices::I
end

Base.length(x::ArrayObsView) = length(x.indices)

function Base.getindex(x::ArrayObsView{ObsDim}, idx) where {ObsDim}
# return a view, consistently with ObsView behaviour
return selectdim(x.data, ObsDim, x.indices[idx])
end

function getobs(x::ArrayObsView{ObsDim,A}, idx) where {ObsDim, T, N, A<:AbstractArray{T,N}}
Ipre = ntuple(_ -> :, ObsDim-1)
Ipost = ntuple(_ -> :, N-ObsDim)
return x.data[Ipre..., x.indices[idx], Ipost...]
end

getobs(x::ArrayObsView) = getobs(x, 1:length(x))

function Base.show(io::IO, x::ArrayObsView{ObsDim}) where {ObsDim}
print(io, "ArrayObsView($(summary(x.data)), obsdim=$(ObsDim), numobs=$(length(x)))")
end

##### Tuples / NamedTuples
function obsview(tup::Union{Tuple, NamedTuple}, indices)
return map(data -> obsview(data, indices), tup)
end
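
Taken together, these methods give the wrapper lazy indexing (`getindex` returns a `selectdim` view) and an eager `getobs` that copies. A short usage sketch of the intended behaviour, based only on the code in this diff:

```julia
using MLUtils

x = rand(4, 5, 2)
v = obsview(x, ObsDim(2))       # ArrayObsView over dimension 2

@assert v[1] == x[:, 1, :]      # getindex -> lazy slice (a view via selectdim)

b = getobs(v, 1:3)              # getobs -> materialized copy
@assert b == x[:, 1:3, :]
@assert b isa Array{Float64,3}
```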


37 changes: 27 additions & 10 deletions src/slidingwindow.jl
@@ -1,8 +1,9 @@
struct SlidingWindow{T}
struct SlidingWindow{T} <: AbstractDataContainer
data::T
size::Int
stride::Int
count::Int
obsdim::Union{Nothing,Int}
end

Base.length(A::SlidingWindow) = A.count
@@ -19,26 +20,39 @@ function getrange(A::SlidingWindow, i::Int)
end

function Base.show(io::IO, A::SlidingWindow)
return print(io, "slidingwindow($(A.data), size=$(A.size), stride=$(A.stride))")
print(io, "slidingwindow($(summary(A.data)), size=$(A.size), stride=$(A.stride)")
if A.obsdim !== nothing
print(io, ", obsdim=$(A.obsdim)")
end
print(io, ")")
end

Base.iterate(A::SlidingWindow, i::Int=1) = i > length(A) ? nothing : (A[i], i+1)

"""
slidingwindow(data; size, stride=1) -> SlidingWindow
slidingwindow(data; size, stride=1, obsdim=nothing) -> SlidingWindow
Return a vector-like view of the `data` for which each element is
a fixed size "window" of `size` adjacent observations. Note that only complete
a fixed size "window" of `size` adjacent observations.
`stride` specifies the distance between the start elements of each
adjacent window. The default value is 1. Note that only complete
windows are included in the output, which implies that it is
possible for excess observations to be omitted from the view.
`obsdim` specifies the dimension along which the observations are
indexed for the data types that support it (e.g. arrays).
By default, the observations are indexed along the last
dimension of the data. If `obsdim` is specified, it will be
passed to `obsview` to get a view of the data along that dimension.

Note that the windows are not materialized at construction time.
To actually get a copy of the data at some window, use indexing or [`getobs`](@ref).
When indexing, the data is accessed as `getobs(data, idxs)`, with `idxs` an appropriate range of indices.
# Examples
```jldoctest
julia> s = slidingwindow(11:30, size=6)
slidingwindow(11:30, size=6, stride=1)
slidingwindow(20-element UnitRange{Int64}, size=6, stride=1)
julia> s[1] # == getobs(data, 1:6)
11:16
@@ -53,7 +67,7 @@ By default the stride is equal to 1.
```jldoctest
julia> s = slidingwindow(11:30, size=6, stride=3)
slidingwindow(11:30, size=6, stride=3)
slidingwindow(20-element UnitRange{Int64}, size=6, stride=3)
julia> for w in s; println(w); end
11:16
@@ -63,11 +77,14 @@ julia> for w in s; println(w); end
23:28
```
"""
function slidingwindow(data; size::Int, stride::Int=1)
function slidingwindow(data; size::Int, stride::Int=1, obsdim=nothing)
size > 0 || throw(ArgumentError("Specified window size must be strictly greater than 0. Actual: $size"))
if obsdim !== nothing
data = obsview(data, ObsDim(obsdim))
end
size <= numobs(data) || throw(ArgumentError("Specified window size is too large for the given number of observations"))
stride > 0 || throw(ArgumentError("Specified stride must be strictly greater than 0. Actual: $stride"))
count = floor(Int, (numobs(data) - size + stride) / stride)
return SlidingWindow(data, size, stride, count)
return SlidingWindow(data, size, stride, count, obsdim)
end
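
For reference, a short usage sketch of the new `obsdim` keyword, mirroring the test added below:

```julia
using MLUtils

x = rand(2, 6, 4)                    # (features, timesteps, samples)
s = slidingwindow(x, size=2, obsdim=2)

@assert length(s) == 5               # 5 complete windows over 6 timesteps with stride 1
@assert s[1] == x[:, 1:2, :]         # each window keeps the other dimensions intact
@assert s[2] == x[:, 2:3, :]
```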

23 changes: 23 additions & 0 deletions test/obsview.jl
@@ -176,6 +176,7 @@ end
@test @inferred(size(A)) == (15,)
@test @inferred(A[2:3]) == obsview(var, 2:3)
@test @inferred(A[[1,3]]) == obsview(var, [1,3])

@test @inferred(A[1]) == obsview(var, 1)
@test @inferred(A[11]) == obsview(var, 11)
@test @inferred(A[15]) == obsview(var, 15)
@@ -233,3 +234,25 @@
@test count == 15
end
end

@testset "obsview(array, obsdim)" begin
x = rand(2, 3, 4)

v0 = @inferred(obsview(x))
@test @inferred(getobs(v0, 1)) == x[:,:,1]
@test @inferred(getobs(v0, 2)) == x[:,:,2]
@test getobs(v0, 1) isa Matrix{Float64}
@test numobs(v0) == 4

v2 = @inferred(obsview(x, ObsDim(2)))
@test @inferred(getobs(v2, 1)) == x[:,1,:]
@test @inferred(getobs(v2, 2)) == x[:,2,:]
@test getobs(v2, 1) isa Matrix{Float64}
@test numobs(v2) == 3

v1 = @inferred(obsview(x, ObsDim(1)))
@test @inferred(getobs(v1, 1)) == x[1,:,:]
@test @inferred(getobs(v1, 2)) == x[2,:,:]
@test getobs(v1, 1) isa Matrix{Float64}
@test numobs(v1) == 2
end
11 changes: 11 additions & 0 deletions test/slidingwindow.jl
@@ -28,4 +28,15 @@
c += 1
end
@test c == 5

@testset "obsdim" begin
x = rand(2, 6, 4)
v = slidingwindow(x, size=2, obsdim=2)
@test length(v) == 5
@test v[1] isa Array{Float64,3}
@test v[1] == x[:, 1:2, :]
@test v[2] == x[:, 2:3, :]
@test v[3] == x[:, 3:4, :]
@test v[4] == x[:, 4:5, :]
end
end
