DocIO
This document describes IOS load balancing, i.e. how an IOS is selected for an intended read or write operation.
Two conflicting phenomena must be managed:
- Exploiting server-side caches
- Not flooding a single server with requests
To complicate matters, we have no mechanism for sharing performance information between clients, so some other means may be necessary, such as storing real-time performance state.
Also, a method may be needed to make load-balancing/throughput decisions which take several I/O systems into account (this applies only to reads). One example would be "3/4 servers are down on the preferred IOS. Should we start to forward requests to secondary or tertiary IOS's?" Another could be "fileA is stored on an archiver at the local site and a parallel-fs at the remote site - which site is chosen?"
Perhaps the IOS's themselves could maintain statistics (long term or just run-time) which summarize their ability to process different types and sizes of IO requests?
Summaries would exist on a per-site (or even per-client) basis and the statistics should separate disk I/O from network latency (so that poorly performing or remote clients do not interfere with disk I/O readings).
On the client, the summaries would be attached to their respective IOS structures and consulted when a read request is being issued. Note that writes are different because the client is required to bind a write operation (on a per-bmap basis) to a single IOS.
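A minimal sketch of what such per-IOS summaries might look like on the client side, in C; the structure and function names (struct ios_stats, ios_pick_read_target()) and the cost formula are purely illustrative assumptions, not an existing interface.

/* Hypothetical per-IOS performance summary kept on the client.
 * Disk and network figures are tracked separately so that slow or
 * remote clients do not skew the disk I/O readings. */
struct ios_stats {
        double  disk_lat_ms;    /* smoothed disk service latency */
        double  net_lat_ms;     /* smoothed network round-trip latency */
        int     nsrv_up;        /* servers currently reachable in this IOS */
        int     nsrv_total;     /* servers configured in this IOS */
};

struct ios {
        const char       *ios_name;
        struct ios_stats  ios_stats;    /* consulted when issuing reads */
};

/*
 * Pick an IOS for a read.  Writes are excluded: the client must bind a
 * write (per bmap) to a single IOS, so no choice is made here.
 */
static struct ios *
ios_pick_read_target(struct ios *iosv, int nios)
{
        struct ios *best = NULL;
        double cost, best_cost = 0;
        int i;

        for (i = 0; i < nios; i++) {
                struct ios_stats *st = &iosv[i].ios_stats;

                if (st->nsrv_up == 0)
                        continue;       /* IOS entirely down; skip */

                /* Crude cost: latency scaled by how degraded the IOS is. */
                cost = (st->disk_lat_ms + st->net_lat_ms) *
                    ((double)st->nsrv_total / st->nsrv_up);
                if (best == NULL || cost < best_cost) {
                        best = &iosv[i];
                        best_cost = cost;
                }
        }
        return (best);  /* NULL if every IOS is unreachable */
}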
The parallel_fs IOS type describes a system which has a set of
symmetric clients, where each of the members contained therein has a
consistent/coherent view of the backing file system.
For example:
resource bessemer {
        desc = "DDN9550 Lustre Parallel FS";
        type = parallel_fs;
        id = 0;
        nids = 128.182.112.110, 128.182.112.111,
               128.182.112.112, 128.182.112.113;
}
An example case would be where a client issues a write to IOS_a0 and
then to IOS_a1 for regions which fall into the same bmap.
Upon the first write into the bmap, the IOS notifies the MDS that the
generation number must be bumped, denoting that the bmap chunk has been
modified (and therefore outdating the other replicas).
The client, being fickle, issues a subsequent write to a peer IOS
(IOS_a1) which falls into the same bmap.
Now since the backing filesystem is coherent, there is no need to bump
the generation number again.
Since the IOS's do not explicitly communicate, the client must inform
IOS_a1 that the bmap it is accessing has already been bumped, and that
the IOS should therefore process the write without communicating with
the MDS (for the purpose of a generation bump - CRC-related
communications will still persist).
Therefore, after the first write, the client must present a token denoting that the first-access operations have already been handled. Of course, this only applies when peer systems are connected via a coherent backing file system. In addition, IOS's in coherent environments must always flush their buffers to the file system.
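A rough client-side sketch of this first-write flow, in C. The names here (BMAP_GEN_BUMPED, bcm_fwtok, mds_bump_generation(), ios_write()) are hypothetical and only illustrate when the token would be requested and attached; the real protocol and structures are not specified in this document.

#include <sys/types.h>

/* Hypothetical client-side bmap state for the first-write token. */
struct bmap {
        int             bcm_flags;
#define BMAP_GEN_BUMPED 0x01            /* MDS has already bumped the generation */
        unsigned char   bcm_fwtok[32];  /* first-write token issued by the MDS */
};

struct ios;

/* Assumed RPC helpers; their real counterparts are not defined here. */
int     mds_bump_generation(struct bmap *, unsigned char *);
int     ios_write(struct ios *, struct bmap *, const unsigned char *,
            const void *, size_t, off_t);

/*
 * Issue a write to an IOS belonging to a coherent parallel_fs resource.
 * On the first write into the bmap, the MDS is asked to bump the
 * generation number (outdating the other replicas) and hands back a
 * token; subsequent writes to peer IOS's (e.g. IOS_a1) carry the token
 * so they can skip the MDS round trip.
 */
int
client_write_parallel_fs(struct bmap *b, struct ios *ios,
    const void *buf, size_t len, off_t off)
{
        if ((b->bcm_flags & BMAP_GEN_BUMPED) == 0) {
                if (mds_bump_generation(b, b->bcm_fwtok) == -1)
                        return (-1);
                b->bcm_flags |= BMAP_GEN_BUMPED;
        }

        /* The token rides along with the write so the IOS can verify,
         * without contacting the MDS, that the bump already happened. */
        return (ios_write(ios, b, b->bcm_fwtok, buf, len, off));
}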
In the case where we have a set of stand-alone nodes, such as the LCN
cluster_noshare type, these must be treated differently since they are
not backed by a shared file system.
The result is that clients cannot update bmaps on different IOS's
without bumping the generation number and, in essence, canceling out
writes to peer IOS's.
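For comparison, a hypothetical resource definition for such a cluster, mirroring the parallel_fs example above; the name, id, and nids below are illustrative only.

resource lcn {
        desc = "Stand-alone cluster (no shared backing FS)";
        type = cluster_noshare;
        id = 1;
        nids = 128.182.112.120, 128.182.112.121;
}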
- Synchronous writes - before returning success on a write to the client, the IOS must ensure its buffers are written to stable storage.
- Bmap first write token - this token should originate at the MDS (via the shared-secret mechanism) and be verifiable by the IOS. This way the IOS does not trust the client (see the sketch after this list).
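A minimal IOS-side sketch combining both points, assuming the shared secret is used as an HMAC key over the bmap identity; hmac_sha256() and the message layout are placeholders for whatever the shared-secret mechanism actually uses.

#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define TOKLEN  32

/* Assumed helper: HMAC-SHA256(key, msg) -> 32-byte digest. */
void    hmac_sha256(const void *key, size_t keylen,
            const void *msg, size_t msglen, unsigned char out[TOKLEN]);

/*
 * Handle a write carrying a first-write token.  The token is an HMAC
 * computed by the MDS over the bmap identity with a secret shared
 * between MDS and IOS, so the IOS can verify it without trusting the
 * client and without contacting the MDS.
 */
int
ios_handle_write(const unsigned char secret[TOKLEN], uint64_t fid,
    uint32_t bmapno, const unsigned char tok[TOKLEN], int fd,
    const void *buf, size_t len, off_t off)
{
        unsigned char expect[TOKLEN];
        struct { uint64_t fid; uint32_t bmapno; } id = { fid, bmapno };

        /* Bmap first-write token: recompute and compare. */
        hmac_sha256(secret, TOKLEN, &id, sizeof(id), expect);
        if (memcmp(expect, tok, TOKLEN) != 0)
                return (-1);            /* reject the untrusted client */

        /* Synchronous write: data must be on stable storage before
         * success is returned to the client. */
        if (pwrite(fd, buf, len, off) != (ssize_t)len)
                return (-1);
        if (fsync(fd) == -1)
                return (-1);
        return (0);
}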
After thinking about CRC management a bit more, I've concluded that parallel IOS writes will do more harm than good. This is because there is no way to serialize/synchronize the CRC state on disk with that on the MDS. Therefore, the approach described in CRC management will be taken. The basic idea is that the MDS binds a bmap to an IOS and redirects all other nodes to that IOS. Hence, that IOS is the only one that may update the CRC tables for the given bmap.
The design fallouts described above still apply, the difference being that the write token is no longer shareable.
Note that this does not apply to read(2).
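A small sketch of what this MDS-side binding might look like; the names (struct bmap_lease, mds_assign_write_ios()) are hypothetical and the redirect is reduced to returning the already-bound IOS.

#include <stdint.h>

/* Hypothetical MDS-side record binding a bmap to a single write IOS. */
struct bmap_lease {
        uint64_t        fid;
        uint32_t        bmapno;
        int             write_ios;      /* -1 until the first writer arrives */
};

/*
 * Called when a client asks to write a bmap.  The first writer binds
 * the bmap to its requested IOS; later writers are redirected to that
 * same IOS, so only one IOS ever updates the CRC tables for the bmap.
 * read(2) is not routed through this path.
 */
int
mds_assign_write_ios(struct bmap_lease *bl, int requested_ios)
{
        if (bl->write_ios == -1)
                bl->write_ios = requested_ios;  /* bind on first write */
        return (bl->write_ios);                 /* may differ from requested_ios */
}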
