The In-Storage Compute (a.k.a. Function Shipping) feature enables client nodes to offload computations (or "ship" them) to the Motr-controlled storage nodes. In traditional distributed file systems, data is moved to the computation nodes. In-Storage Compute moves the computation closer to the data (provided the computation can be split into parts and run in parallel). This gives several advantages:
- Reducing the networking overhead of data movement within the cluster.
- Maximizing the total storage bandwidth and the usage of the server nodes' computational power.
It may become a breakthrough for some workloads in large systems, where horizontal scaling is crippled by the networking.
Computations from an external library cannot be linked directly with a Motr instance. The library is supposed to have an entry function named:

```c
void motr_lib_init(void);
```

All the computations in the library must have the following signature:

```c
int comp(struct m0_buf *args, struct m0_buf *out,
         struct m0_isc_comp_private *comp_data, int *rc);
```

and be registered with the `m0_isc_comp_register()` function.
The demo/libdemo.c example shows how the `comp_min()` and `comp_max()` computations are registered. The `args` buffer contains the input information about where the data is located on the server side; `out` is the output buffer with the computation result.
The computation function may be called several times by the server, depending on the value of `*rc` (the result code). For example, on the first call the computation function needs to fetch the data from the disk. This operation is done asynchronously, so after launching the read operation we return from the function with `*rc` set to `-EAGAIN`. When the data is ready, the computation function is called again, and the read data is available via the `comp_data` argument.
Let's see how it is done in demo/libdemo.c for min/max:

```c
int do_minmax(enum op op, struct m0_buf *in, struct m0_buf *out,
              struct m0_isc_comp_private *data, int *rc)
{
        int                res;
        struct m0_stob_io *stio = (struct m0_stob_io *)data->icp_data;

        if (stio == NULL) { /* 1st call */
                M0_ALLOC_PTR(stio);
                if (stio == NULL) {
                        *rc = -ENOMEM;
                        return M0_FSO_AGAIN;
                }
                data->icp_data = stio;
                res = launch_io(data, in, rc);
                if (*rc != -EAGAIN)
                        m0_free(stio);
        } else {
                res = compute_minmax(op, data, out, rc);
                m0_isc_io_fini(stio);
                m0_free(stio);
        }

        return res;
}
```
Two phases are clearly visible here: on the first call we start the I/O with `launch_io()`; on the second call, when the data is ready, we do the actual computation on it by calling `compute_minmax()`.
The computation library must be compiled into a dynamically loadable .so library. With the `spiel` command (see spiel/spiel.h and demo/util.h), the library can be loaded into any running Motr instance. The helper function `m0_isc_lib_register()` takes the library path, which (IMPORTANT!) is expected to be the same across all the nodes running Motr. The `m0iscreg` utility takes the path as input and loads the library into all the remote Motr instances.
On successful loading of the library, the output will look like this:

```
$ m0iscreg -e 192.168.180.171@tcp:12345:4:1 \
           -x 192.168.180.171@tcp:12345:1:1 \
           -f 0x7200000000000001:0x2c \
           -p 0x7000000000000001:0x50 $PWD/libdemo.so
m0iscreg success
```
The four options are standard ones to connect to Motr:
```
$ m0iscreg -h
Usage: m0iscreg OPTIONS libpath

 -e <addr>  endpoint address
 -x <addr>  ha-agent (hax) endpoint address
 -f <fid>   process fid
 -p <fid>   profile fid
```
The values for them can be taken from the output of the `hctl status` command. (We'll refer to them as `<motr-opts>` below.)
Note: the `m0iscreg` utility can be used to load any future ISC library without modifications.
Let's look at three simple demo computations: `ping`, `min` and `max`.
The `m0iscdemo` utility can be used to invoke the computations and see the result:

```
$ m0iscdemo -h
Usage: m0iscdemo OPTIONS COMP OBJ_ID LEN

 Supported COMPutations: ping, min, max

 OBJ_ID is two uint64 numbers in hi:lo format (dec or hex)
 LEN is the length of object (in KiB)
```
Following are the steps to run the demo.
The `ping` computation pings all the ISC services spanned by the object units. A separate ping request is sent for each unit, and the utility prints a "Hello-world@<service-fid>" reply for every one of these requests.
Here is an example for a 4MB object with 1MB units:

```
$ m0iscdemo <motr-opts> ping 123:12371 4096
Hello-world @192.168.180.171@tcp:12345:2:2
Hello-world @192.168.180.171@tcp:12345:2:2
Hello-world @192.168.180.171@tcp:12345:2:2
Hello-world @192.168.180.171@tcp:12345:2:2
```
Note: the object length (or the amount to read) must be specified, as Motr does not store the objects' lengths in their metadata. In the example above, a 4MB length was specified for an object with 1MB units, so 4 ping requests were sent and 4 replies were received.
The cluster configuration in the above example consisted of a single node only, so all the units were located on the same node. That's why the endpoints' addresses in the replies are identical.
In this demo we write an object with real numbers represented as newline-delimited strings. We can find the minimum or maximum value among these numbers with In-Storage Compute like this:
```
$ m0iscdemo <motr-opts> max 123:12371 4096
idx=132151 val=32767.627900
$ m0iscdemo <motr-opts> min 123:12371 4096
idx=180959 val=0.134330
```
`idx=` shows the ordinal number of the found min/max value in the object; `val=` shows the value itself.
On the server side, the min/max computation is performed on each unit of the object in parallel. The per-unit results are sent to the client, which does the final computation over all the min/max values received from the servers.
This benchmark was conducted on the SAGE Prototype Cluster (located in Jülich Computing Centre). An SSD pool with an 8+2 erasure-coding configuration was used, shared among the 3 server nodes (with at most 5 SSDs per node).
1GB object:
```
$ \time m0iscdemo <motr-opts> min 0x3456023:0x87002803 $((1024*1024))
idx=2845139 val=0.100200
2.37user 0.75system 0:15.66elapsed 19%CPU (0avgtext+0avgdata 234728maxresident)k
0inputs+231016outputs (0major+99487minor)pagefaults 0swaps
$
$ # Compare with the client computation performance on the same object:
$
$ mcp <motr-opts> -v -osz $((1024*1024)) 0x3456023:0x87002803 - | \time ~/minmax min
2021/10/18 15:49:50 mio.go:614: R: off=0 len=33554432 bs=33554432 gs=33554432 speed=500 (Mbytes/sec)
...
2021/10/18 15:50:15 mio.go:614: R: off=1040187392 len=33554432 bs=33554432 gs=33554432 speed=711 (Mbytes/sec)
idx=2845139 val=0.100200
23.36user 0.59system 0:31.45elapsed 76%CPU (0avgtext+0avgdata 588maxresident)k
0inputs+0outputs (0major+224minor)pagefaults 0swaps
```
2GB object:
```
$ \time m0iscdemo <motr-opts> min 0x3456023:0x87002805 $((2*1024*1024))
idx=2845139 val=0.100200
4.37user 1.01system 0:24.27elapsed 22%CPU (0avgtext+0avgdata 236728maxresident)k
0inputs+262288outputs (0major+164358minor)pagefaults 0swaps
$
$ # Client computation:
$
$ mcp <motr-opts> -v -osz $((2*1024*1024)) 0x3456023:0x87002805 - | \time ~/minmax min
2021/10/18 16:08:04 mio.go:614: R: off=0 len=33554432 bs=33554432 gs=33554432 speed=492 (Mbytes/sec)
...
2021/10/18 16:08:54 mio.go:614: R: off=2113929216 len=33554432 bs=33554432 gs=33554432 speed=653 (Mbytes/sec)
idx=2845139 val=0.100200
46.35user 1.30system 0:56.97elapsed 83%CPU (0avgtext+0avgdata 588maxresident)k
0inputs+0outputs (0major+225minor)pagefaults 0swaps
```
4GB object:
```
$ \time m0iscdemo <motr-opts> min 0x3456023:0x87002806 $((4*1024*1024))
idx=2845139 val=0.100200
7.50user 1.05system 0:40.85elapsed 20%CPU (0avgtext+0avgdata 246840maxresident)k
0inputs+362736outputs (0major+173574minor)pagefaults 0swaps
$
$ # Client computation:
$
$ mcp <motr-opts> -v -osz $((4*1024*1024)) 0x3456023:0x87002806 - | \time ~/minmax min
2021/10/18 16:17:45 mio.go:614: R: off=0 len=33554432 bs=33554432 gs=33554432 speed=516 (Mbytes/sec)
...
2021/10/18 16:19:27 mio.go:614: R: off=4261412864 len=33554432 bs=33554432 gs=33554432 speed=592 (Mbytes/sec)
idx=2845139 val=0.100200
93.48user 2.48system 1:48.59elapsed 88%CPU (0avgtext+0avgdata 584maxresident)k
0inputs+0outputs (0major+231minor)pagefaults 0swaps
```
8GB object:
```
$ \time m0iscdemo <motr-opts> min 0x3456023:0x87002807 $((8*1024*1024))
idx=2845139 val=0.100200
14.48user 1.57system 1:15.78elapsed 21%CPU (0avgtext+0avgdata 272176maxresident)k
0inputs+1424720outputs (0major+360575minor)pagefaults 0swaps
$
$ # Client computation:
$
$ mcp <motr-opts> -v -osz $((8*1024*1024)) 0x3456023:0x87002807 - | \time ~/minmax min
2021/10/18 17:33:54 mio.go:614: R: off=0 len=33554432 bs=33554432 gs=33554432 speed=500 (Mbytes/sec)
...
2021/10/18 17:37:17 mio.go:614: R: off=8556380160 len=33554432 bs=33554432 gs=33554432 speed=615 (Mbytes/sec)
idx=2845139 val=0.100200
185.60user 4.82system 3:29.11elapsed 91%CPU (0avgtext+0avgdata 588maxresident)k
0inputs+0outputs (0major+265minor)pagefaults 0swaps
```
We can clearly see that the ISC computation runs more than 2 times faster (on this cluster and pool configuration) than the same computation on the client node (the client utility runs exactly the same min/max logic as the ISC library). And the bigger the object, the bigger the speedup; see the table below.
ISC performance comparison:

| Object size (GB) | ISC elapsed time (m:ss) | Client elapsed time (m:ss) | Speedup |
|------------------|-------------------------|----------------------------|---------|
| 1                | 0:15.66                 | 0:31.45                    | 2.0     |
| 2                | 0:24.27                 | 0:56.97                    | 2.34    |
| 4                | 0:40.85                 | 1:48.59                    | 2.65    |
| 8                | 1:15.78                 | 3:29.11                    | 2.75    |