
Client I/O and IOS selection

Overview

This document describes IOS load balancing, that is, how an IOS is selected for an intended read/write operation.

Two conflicting phenomena must be managed:

  • Exploiting server-side caches
  • Not flooding a single server with requests

To complicate matters, we don't have a method for sharing performance information between clients, so some other means may be necessary, such as storing real-time performance state.

Also, a method may be needed to make load-balancing/throughput decisions which take several I/O systems into account (this applies only to reads). One example would be "3/4 of the servers in the preferred IOS are down; should we start forwarding requests to the secondary or tertiary IOS's?" Another could be "fileA is stored on an archiver at the local site and on a parallel FS at the remote site; which site is chosen?"
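As an illustration of the first question, here is a minimal selection sketch in C; struct ios, its fields, and the "at least half the members reachable" threshold are assumptions made for illustration and are not taken from the SLASH2 sources.

#include <stddef.h>

struct ios {
	const char	*ios_name;
	int		 ios_nmembers;		/* servers in this IOS */
	int		 ios_nmembers_up;	/* currently reachable */
};

/* Assume an IOS is usable if at least half of its members are reachable. */
static int
ios_is_usable(const struct ios *i)
{
	return (i->ios_nmembers_up > 0 &&
	    i->ios_nmembers_up * 2 >= i->ios_nmembers);
}

/*
 * pref[] is ordered preferred, secondary, tertiary, ...; return the
 * first IOS that looks healthy enough, falling back to the preferred
 * IOS when everything is degraded.
 */
static struct ios *
ios_select_for_read(struct ios **pref, int npref)
{
	int i;

	for (i = 0; i < npref; i++)
		if (ios_is_usable(pref[i]))
			return (pref[i]);
	return (npref > 0 ? pref[0] : NULL);
}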

Perhaps the IOS's themselves could maintain statistics (long term or just run-time) which summarize their ability to process different types and sizes of IO requests?

  • Size range (very small to very large)

Summaries would exist on a per-site (or even per-client) basis and the statistics should separate disk I/O from network latency (so that poorly performing or remote clients do not interfere with disk I/O readings).

On the client, the summaries would be attached to their respective IOS structures and consulted when a read request is being issued. Note that writes are different because the client is required to bind a write operation (on a per-bmap basis) to a single IOS. A sketch of such a summary structure follows.
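This is a minimal sketch, in C, of what such a client-side summary might look like; the structure, field names, and bucket count are illustrative assumptions only. Disk and network latency are tracked separately, per request-size bucket, so that slow or remote clients do not skew the disk readings.

#include <stdint.h>

#define IOS_NSIZE_BUCKETS	8	/* e.g. powers of two from 4 KiB up */

struct ios_perf_bucket {
	uint64_t	nsamples;
	uint64_t	disk_lat_usecs;		/* running total, disk I/O only */
	uint64_t	net_lat_usecs;		/* running total, wire time only */
};

struct ios_perf {
	struct ios_perf_bucket	rd[IOS_NSIZE_BUCKETS];	/* reads */
	struct ios_perf_bucket	wr[IOS_NSIZE_BUCKETS];	/* writes */
};

/* Average service time (usecs) this IOS has shown for reads in a bucket. */
static uint64_t
ios_perf_read_estimate(const struct ios_perf *p, int bucket)
{
	const struct ios_perf_bucket *b = &p->rd[bucket];

	if (b->nsamples == 0)
		return (0);	/* no data yet; treat as unknown */
	return ((b->disk_lat_usecs + b->net_lat_usecs) / b->nsamples);
}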

Read-before-write: Client-side buffering mechanism

Dealing with parallel I/O servers: What are the issues involved?

The parallel_fs IOS type describes a system which has a set of symmetric clients, where each of the members contained therein has a consistent/coherent view of the backing file system.

For example:

resource bessemer {
	desc	= "DDN9550 Lustre Parallel FS";
	type	= parallel_fs;
	id	= 0;
	nids	= 128.182.112.110, 128.182.112.111,
		  128.182.112.112, 128.182.112.113;
}

An example case would be where a client issues a write to IOS_a0 and then to IOS_a1 for regions which fall into the same bmap. Upon the first write into the bmap, the IOS notifies the MDS that the generation number must be bumped, denoting that the bmap chunk has been modified (and therefore outdating the other replicas). The client, being fickle, issues a subsequent write to a peer IOS (IOS_a1) which falls into the same bmap. Since the backing file system is coherent, there is no need to bump the generation number again. Because the IOS's do not explicitly communicate, the client must inform IOS_a1 that the bmap it is accessing has already been bumped and that the IOS should therefore process the write without contacting the MDS for a generation bump (CRC-related communications will still persist).

Therefore, after the first write, the client must present a token denoting that first-access operations have already been handled. Of course, this only applies when peer systems are connected via a coherent backing file system. In addition, IOS's in coherent environments must always flush their buffers to the file system. A sketch of what such a token might carry follows.
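The structure and field names below are assumptions for illustration, not the actual SLASH2 wire format. The client would attach such a token to subsequent writes within the same bmap to peer members of the parallel_fs IOS, which could then skip their own generation-bump RPC to the MDS.

#include <stdint.h>

struct bmap_first_write_token {
	uint64_t	fwt_fid;	/* file ID */
	uint32_t	fwt_bmapno;	/* bmap index within the file */
	uint64_t	fwt_bgen;	/* generation the MDS bumped to */
	uint64_t	fwt_expire;	/* lease expiration time */
	uint8_t		fwt_hmac[32];	/* keyed digest over the fields above,
					 * computed by the MDS with the secret
					 * it shares with the IOS */
};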

In the case where we have a set of stand-alone nodes, such as the LCN cluster_noshare type, these must be treated differently since they are not backed by a shared file system. The result is that clients cannot update bmaps on different IOS's without bumping the generation number and, in essence, canceling out writes to peer IOS's.
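For comparison with the parallel_fs example above, a hypothetical resource block for such a system might look like the following (the name, id, and nids are made up):

resource lcn {
	desc	= "LCN stand-alone I/O nodes";
	type	= cluster_noshare;
	id	= 1;
	nids	= 128.182.112.120, 128.182.112.121;
}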

Design Fallouts

  • Synchronous writes: before returning success on a write to the client, the IOS must ensure its buffers are written to stable storage.

  • Bmap first-write token: this token should originate at the MDS (via the shared-secret mechanism) and be verifiable by the IOS. This way the IOS does not have to trust the client. A verification sketch appears at the end of this section.

  • After thinking about CRC management a bit more, I've concluded that parallel IOS writes will do more harm than good. This is because there is no way to serialize/synchronize the CRC state on disk with that on the MDS. Therefore the approach described in CRC management will be taken. The basic idea is that the MDS binds a bmap to an IOS and redirects all other nodes to that IOS. Hence, that IOS is the only one that may update the CRC tables for the given bmap.

The design fallouts described above still apply, the difference being that the write token is no longer shareable.

Note that this does not apply to read(2).
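Returning to the first-write token fallout above, here is a minimal sketch of how an IOS might verify the token with the MDS/IOS shared secret, using OpenSSL's HMAC. The token layout is repeated from the earlier sketch, and serialization details (struct padding, endianness) are glossed over for brevity.

#include <openssl/evp.h>
#include <openssl/hmac.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Token layout repeated from the sketch above. */
struct bmap_first_write_token {
	uint64_t	fwt_fid;
	uint32_t	fwt_bmapno;
	uint64_t	fwt_bgen;
	uint64_t	fwt_expire;
	uint8_t		fwt_hmac[32];
};

/* Return nonzero if the token's digest matches the shared secret. */
static int
fwt_verify(const struct bmap_first_write_token *t,
    const unsigned char *secret, int secretlen)
{
	unsigned char md[EVP_MAX_MD_SIZE];
	unsigned int mdlen;

	/* The digest covers every field preceding fwt_hmac. */
	HMAC(EVP_sha256(), secret, secretlen,
	    (const unsigned char *)t,
	    offsetof(struct bmap_first_write_token, fwt_hmac), md, &mdlen);

	return (mdlen == sizeof(t->fwt_hmac) &&
	    memcmp(md, t->fwt_hmac, mdlen) == 0);
}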
