<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=utf-8">
<TITLE>smallfile distributed I/O benchmark | Red Hat Intranet</TITLE>
<META NAME="GENERATOR" CONTENT="LibreOffice 4.1.3.2 (Linux)">
<META NAME="AUTHOR" CONTENT="anonymous">
<META NAME="CREATED" CONTENT="0;0">
<META NAME="CHANGEDBY" CONTENT="Ben England">
<META NAME="CHANGED" CONTENT="20140709;95907050185250">
<META NAME="category-departments" CONTENT="Engineering">
<META NAME="category-keywords" CONTENT="filesystem">
<META NAME="category-offices" CONTENT="Westford, MA">
<META NAME="category-wiki-page-type" CONTENT="Misc">
<META NAME="modified" CONTENT="2012-06-08 11:49:32">
<META NAME="status" CONTENT="1">
<META NAME="type" CONTENT="wiki_page">
<STYLE TYPE="text/css">
<!--
@page { margin: 0.79in }
P { color: #000000; font-family: "Liberation Serif", "Times New Roman", serif; font-size: 11pt; line-height: 138% }
H1 { margin-top: 0.1in; margin-bottom: 0in; border: none; padding: 0in; color: #000000 }
H1.western { font-family: "Liberation Sans", "Lucida Grande", "Helvetica", sans-serif; font-size: 22pt }
H1.cjk { font-family: "Liberation Sans", "Lucida Grande", "Helvetica", sans-serif }
H1.ctl { font-family: "Liberation Sans", "Lucida Grande", "Helvetica", sans-serif }
H2 { margin-top: 0.1in; margin-bottom: 0in; border: none; padding: 0in; color: #000000; font-family: "Liberation Sans", "Lucida Grande", "Helvetica", sans-serif; font-size: 10pt; font-weight: normal; line-height: 130% }
PRE { color: #000000 }
PRE.cjk { font-family: "WenQuanYi Zen Hei", monospace }
PRE.ctl { font-family: "Lohit Devanagari", monospace }
H3 { color: #000000 }
H3.western { font-family: "Albany", sans-serif }
H3.cjk { font-family: "WenQuanYi Zen Hei Sharp" }
H3.ctl { font-family: "Lohit Devanagari" }
P.sdfootnote { margin-left: 0.24in; text-indent: -0.24in; margin-bottom: 0in; font-size: 10pt; line-height: 100% }
A:link { color: #003399; text-decoration: none }
A:visited { color: #000000 }
A.sdfootnoteanc { font-size: 57% }
-->
</STYLE>
</HEAD>
<BODY LANG="en-US" TEXT="#000000" LINK="#003399" VLINK="#000000" DIR="LTR">
<FORM ACTION="/wiki/smallfile-distributed-io-benchmark" METHOD="POST" ENCTYPE="multipart/form-data">
<INPUT TYPE=HIDDEN NAME="form_build_id" VALUE="form-5c8f33b57ec17d3fc622343ca04cc920">
<INPUT TYPE=HIDDEN NAME="form_token" VALUE="79eae801c0b52735704736049baaa583">
<INPUT TYPE=HIDDEN NAME="form_id" VALUE="subscriptions_ui_node_form">
</FORM>
<FORM ACTION="/comment/reply/71422" METHOD="POST">
<INPUT TYPE=HIDDEN NAME="form_build_id" VALUE="form-5d38f8d7377558dd0b9728158589d8f4">
<INPUT TYPE=HIDDEN NAME="form_token" VALUE="d88d096c160d9b2ea61f976935dd016c">
<INPUT TYPE=HIDDEN NAME="form_id" VALUE="comment_form">
</FORM>
<DIV ID="content-wrapper" DIR="LTR" STYLE="background: #dfe1e4">
<P><BR><BR>
</P>
<DIV ID="inner-wrap" DIR="LTR">
<P><BR><BR>
</P>
<DIV ID="center" DIR="LTR">
<P><BR><BR>
</P>
<DIV ID="tabs-wrapper" DIR="LTR">
<H1 CLASS="western" STYLE="margin-top: 0in; margin-bottom: 0.2in">
smallfile distributed I/O benchmark</H1>
</DIV>
<DIV ID="node-71422" DIR="LTR">
<P STYLE="margin-bottom: 0in; border: none; padding: 0in"> This
page describes the <STRONG>smallfile</STRONG> benchmark program.
It is a python-based small-file distributed POSIX workload
generator which can be used to quickly measure performance for a
variety of metadata-intensive workloads across an entire
cluster. It has no dependencies on any specific
filesystem or implementation. It is intended to
complement use of the iozone benchmark for measuring performance of
large-file workloads, and borrows certain concepts from iozone
and Ric Wheeler's fs_mark. It was developed by
Ben England starting in March 2009, and is now open-source.
Here's an example of the kind of data that can be generated with
it:<IMG SRC="default_files/glusterfs-smallfile-2.jpg" NAME="graphics2" ALIGN=BOTTOM WIDTH=669 HEIGHT=541 BORDER=0></P>
<DIV ID="Table of Contents1" DIR="LTR">
<P><BR><BR>
</P>
<DIV ID="Table of Contents1_Head" DIR="LTR">
<P STYLE="margin-top: 0.17in; line-height: 100%; page-break-after: avoid">
<FONT FACE="Albany, sans-serif"><FONT SIZE=4 STYLE="font-size: 16pt"><B>Table
of Contents</B></FONT></FONT></P>
</DIV>
<P STYLE="margin-left: 0.2in; margin-bottom: 0in; line-height: 100%">
<A HREF="#__RefHeading__123_1677170542"><FONT FACE="Liberation Serif, serif"><FONT SIZE=3>Use
with distributed filesystems</FONT></FONT></A></P>
<P STYLE="margin-left: 0.2in; margin-bottom: 0in; line-height: 100%">
<A HREF="#__RefHeading__125_1677170542"><FONT FACE="Liberation Serif, serif"><FONT SIZE=3>Use
with non-networked filesystems</FONT></FONT></A></P>
<P STYLE="margin-left: 0.2in; margin-bottom: 0in; line-height: 100%">
<A HREF="#__RefHeading__127_1677170542"><FONT FACE="Liberation Serif, serif"><FONT SIZE=3>Use
of subdirectories</FONT></FONT></A></P>
<P STYLE="margin-left: 0.2in; margin-bottom: 0in; line-height: 100%">
<A HREF="#__RefHeading__129_1677170542"><FONT FACE="Liberation Serif, serif"><FONT SIZE=3>Sharing
directories across threads</FONT></FONT></A></P>
<P STYLE="margin-left: 0.2in; margin-bottom: 0in; line-height: 100%">
<A HREF="#__RefHeading__131_1677170542"><FONT FACE="Liberation Serif, serif"><FONT SIZE=3>Hashing
files into directory tree</FONT></FONT></A></P>
<P STYLE="margin-left: 0.2in; margin-bottom: 0in; line-height: 100%">
<A HREF="#__RefHeading__133_1677170542"><FONT FACE="Liberation Serif, serif"><FONT SIZE=3>Random
file size distribution option</FONT></FONT></A></P>
<P STYLE="margin-left: 0.2in; margin-bottom: 0in; line-height: 100%">
<A HREF="#__RefHeading__135_1677170542"><FONT FACE="Liberation Serif, serif"><FONT SIZE=3>Avoiding
caching effects</FONT></FONT></A></P>
<P STYLE="margin-left: 0.2in; margin-bottom: 0in; line-height: 100%">
<A HREF="#__RefHeading__137_1677170542"><FONT FACE="Liberation Serif, serif"><FONT SIZE=3>Use
of --pause in multi-thread tests</FONT></FONT></A></P>
<P STYLE="margin-left: 0.2in; margin-bottom: 0in; line-height: 100%">
<A HREF="#__RefHeading__139_1677170542"><FONT FACE="Liberation Serif, serif"><FONT SIZE=3>How
to measure asynchronous file copy performance</FONT></FONT></A></P>
<P STYLE="margin-left: 0.2in; margin-bottom: 0in; line-height: 100%">
<A HREF="#__RefHeading__141_1677170542"><FONT FACE="Liberation Serif, serif"><FONT SIZE=3>Response
time collection</FONT></FONT></A></P>
<P STYLE="margin-left: 0.2in; margin-bottom: 0in; line-height: 100%">
<A HREF="#__RefHeading__143_1677170542"><FONT FACE="Liberation Serif, serif"><FONT SIZE=3>Synchronization</FONT></FONT></A></P>
<P STYLE="margin-left: 0.39in; margin-bottom: 0in; line-height: 100%">
<A HREF="#__RefHeading__145_1677170542"><FONT FACE="Liberation Serif, serif"><FONT SIZE=3>How
test parameters are transmitted to worker threads</FONT></FONT></A></P>
<P STYLE="margin-left: 0.39in; margin-bottom: 0in; line-height: 100%">
<A HREF="#__RefHeading__147_1677170542"><FONT FACE="Liberation Serif, serif"><FONT SIZE=3>How
remote worker threads are launched</FONT></FONT></A></P>
<P STYLE="margin-left: 0.39in; margin-bottom: 0in; line-height: 100%">
<A HREF="#__RefHeading__149_1677170542"><FONT FACE="Liberation Serif, serif"><FONT SIZE=3>How
results are returned to master process</FONT></FONT></A></P>
</DIV>
<P STYLE="margin-bottom: 0in; border: none; padding: 0in"><BR>
</P>
<P STYLE="margin-bottom: 0in; border: none; padding: 0in"><BR>
</P>
<H1 CLASS="western" STYLE="margin-top: 0in; margin-bottom: 0.2in">
What it can do</H1>
<P STYLE="margin-bottom: 0in; border: none; padding: 0in"><BR>
</P>
<P STYLE="margin-bottom: 0in; border: none; padding: 0in">Capabilities
include:</P>
<UL>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in">can
manage workload generator processes on multiple hosts</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in">calculates
aggregate throughput for the entire set of hosts</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in">can
start and stop all workload generator processes at approximately
the same time (necessary for accurate aggregate throughput
measurement)</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in">useful
for generating "pure" workloads (for example,
just creates, deletes, or setattrs)</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in">easy
to extend to new workload types</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in">provides
a CLI for scripted use, but the workload generator is separate
from the CLI, so it is possible to develop a GUI for it</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in">supports
either fixed or random-exponential file sizes</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in">can
capture response-time data in .csv format, and provides a utility to
reduce this data to statistics</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in">supports
Windows (different launching method, see below)</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in">writes
a unique data pattern in all files, and verifies data read
against this pattern</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in">can
write a random data pattern that is incompressible</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in">can
measure the time required for files to appear in a directory tree
(useful for asynchronous replication tests)</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in">in
multi-host tests, can force all clients to read files written by
a different client</P>
</UL>
<P STYLE="margin-bottom: 0in; border: none; padding: 0in"><BR>
</P>
<P STYLE="margin-bottom: 0in; border: none; padding: 0in">Both
python 2.7 and python 3 are supported. Limited support is
available for pypy (JIT compilation).</P>
<P STYLE="margin-bottom: 0in; border: none; padding: 0in"><BR>
</P>
<H1 CLASS="western" STYLE="margin-top: 0in; margin-bottom: 0.2in">
Restrictions</H1>
<UL>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in">For
a multi-host test, ALL hosts <EM>must provide access to the same
shared directory</EM></P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in"><EM><SPAN STYLE="font-variant: normal"><SPAN STYLE="font-style: normal">does
not support mixed workloads (mixture of different operation
types)<A CLASS="sdfootnoteanc" NAME="sdfootnote1anc" HREF="#sdfootnote1sym"><SUP>1</SUP></A></SPAN></SPAN></EM></P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in">is
not accurate with a memory-resident filesystem on a single host
</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in">requires
all hosts to have the same DNS domain name (we plan to remove this
restriction)
</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in">does
not support HTTP access (can use ssbench or cosbench for Swift
testing)</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in">does
not support mixture of Windows and non-Windows clients at
present</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in">For
POSIX-like operating systems, we have only tested with Linux,
but there is a high probability that it would work with Apple OS
and most other UNIX-like operating systems – we just don't
have the time to test them.</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in">We
have only tested on Windows XP and Windows 7 so far, and cannot
guarantee that other Windows versions will work, although any
release after Windows XP is likely to be ok.</P>
</UL>
<H1 CLASS="western" STYLE="margin-top: 0in; margin-bottom: 0.2in">
How to run</H1>
<P STYLE="margin-bottom: 0in; border: none; padding: 0in">You
must have password-less ssh access between the test driver node
and the workload generator hosts if you want to run a distributed
(multi-host) test.</P>
<P STYLE="margin-bottom: 0in; border: none; padding: 0in">You
must use a directory visible to all participating hosts to run a
distributed test.</P>
<P STYLE="margin-bottom: 0in; border: none; padding: 0in">To see
what parameters are supported by smallfile_cli.py, run "python
smallfile_cli.py -h". Boolean true/false parameters can be
set to either Y (true) or N (false). Every command consists of a
sequence of parameter name-value pairs with the format <B>--name
value</B>.</P>
<P STYLE="margin-bottom: 0in; border: none; padding: 0in">The
parameters are:</P>
<UL>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in"><B>--operation</B>
-- operation name, one of the following:
</P>
<UL>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in">create
-- create a file and write data to it
</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in">append
-- open an existing file and append data to it
</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in">delete
-- delete a file
</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in">rename
-- rename a file
</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in">delete_renamed
-- delete a file that had previously been renamed
</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in">read
-- read an existing file
</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in">stat
-- just read metadata from an existing file
</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in">chmod
-- change protection mask for file
</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in">setxattr
-- set extended attribute values in each file
</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in">getxattr
-- read extended attribute values in each file
</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in">symlink
-- create a symlink pointing to each file (create must be run
beforehand)
</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in">mkdir
-- create a subdirectory with 1 file in it
</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in">rmdir
-- remove a subdirectory and its 1 file
</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in">readdir
-- scan directories only, don't read files or their metadata</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in">ls-l
-- scan directories and read basic file metadata</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in">cleanup
-- delete any pre-existing files from a previous run
</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in">swift-put
-- simulates OpenStack Swift behavior when doing a PUT operation</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in">swift-get
-- simulates OpenStack Swift behavior for each GET operation.
</P>
</UL>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in"><B>--top
-- </B><SPAN STYLE="font-weight: normal">top-level directory;
all file accesses are done inside this directory tree. If you
wish to use multiple mountpoints, provide a list of top-level
directories separated by commas (no whitespace).</SPAN></P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in"><B>--host-set</B>
<SPAN STYLE="font-weight: normal">-- comma-separated set of
hosts used for this test, no domain names allowed. Default:
non-distributed test.</SPAN></P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in"><B>--files</B>
-- how many files should each thread process? </P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in"><B>--threads</B>
-- how many workload generator threads should each
smallfile_cli.py process create?
</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in"><B>--file-size</B>
-- total amount of data accessed per file, in KB. If zero then
no reads or writes are performed.
</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in"><B>--file-size-distribution</B>
-- the only supported value today is <B>exponential</B>. <SPAN STYLE="font-weight: normal">Default:
</SPAN>fixed file size.</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in"><B>--record-size</B>
-- record size in KB, how much data is transferred in a single
read or write system call. If 0 then it is set to the
minimum of the file size and a 1-MB record-size limit. Default: 0</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in"><B>--files-per-dir</B>
-- maximum number of files contained in any one directory.
Default: 200</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in"><B>--dirs-per-dir</B>
-- maximum number of subdirectories contained in any one
directory. Default: 20</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in"><B>--hash-into-dirs</B>
-- if Y then assign the next file to a directory using a hash
function, otherwise assign the next --files-per-dir files to the next
directory. Default: N</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in"><B>--permute-host-dirs</B>
-- if Y then have each host process a different subdirectory
tree than it otherwise would (see below for directory tree
structure). Default: N</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in"><B>--same-dir</B>
-- if Y then threads will share a single directory. Default: N</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in"><B>--network-sync-dir</B>
-- no need to specify this unless you run a multi-host test and
the <B>--top</B> parameter points to a non-shared directory
(see discussion below). Default: <B>network_shared</B>
subdirectory under the <B>--top</B> dir.</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in"><B>--xattr-size</B>
-- size of extended attribute value in bytes (names begin with
'user.smallfile-')
</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in"><B>--xattr-count</B>
-- number of extended attributes per file
</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in"><B>--prefix</B>
-- a string prefix to prepend to file names (so they don't collide
with previous runs, for example)
</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in"><B>--suffix</B>
-- a string suffix to append to file names (so they don't collide
with previous runs, for example)
</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in"><B>--incompressible</B>
-- (default N) if Y then generate a pure-random file that will
not be compressible (useful for tests where an intermediate network
or file copy utility attempts to compress data)</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in"><B>--record-ctime-size</B>
-- default N; if Y then label each created file with an xattr
containing its creation time and file size. This will be used
by the <B>--await-create</B> operation to compute performance of
asynchronous file replication/copy.</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in"><B>--finish</B>
-- if Y, thread will complete all requested file operations even
if measurement has finished. Default: Y</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in"><B>--stonewall</B>
-- if Y then thread will measure throughput as soon as it
detects that another thread has finished. Default: N</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in"><B>--verify-read</B>
-- <SPAN STYLE="font-weight: normal">if Y then smallfile
will verify that read data is correct. Default: Y</SPAN></P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in"><B>--response-times</B>
-- <SPAN STYLE="font-weight: normal">if Y then save the response
time for each file operation in an rsptimes*.csv file in the
shared network directory. The record format is </SPAN><FONT FACE="Courier 10 Pitch"><SPAN STYLE="font-weight: normal">operation-type,
start-time, response-time</SPAN></FONT><SPAN STYLE="font-weight: normal">.
The operation type is included so that you can run different
workloads at the same time and easily merge the data from these
runs. The start-time field is the time the file operation
started, with microsecond resolution. The response-time field
is the file operation's duration, also with microsecond resolution.</SPAN></P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in"><B>--remote-pgm-dir</B>
-- no need to specify this unless the smallfile software
lives in a different directory on the target hosts than on the
test-driver host.
</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in"><B>--pause</B>
-- integer (microseconds) each thread will wait before
starting the next file. Default: 0</P>
</UL>
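<P STYLE="margin-bottom: 0in; border: none; padding: 0in">The
--response-times records described above can be reduced to summary
statistics with a few lines of python. The snippet below is an
illustrative sketch, not the bundled reduction utility; the sample
records are made up, but follow the documented operation-type,
start-time, response-time format.</P>

```python
import csv
import io
import statistics

# Made-up rsptimes-style records (operation-type, start-time, response-time);
# in a real run these would come from the rsptimes*.csv files in the
# shared network directory.
sample = """\
create,0.000001,0.004210
create,0.004300,0.003950
create,0.008400,0.005125
"""

rows = list(csv.reader(io.StringIO(sample)))
rsp = [float(r[2]) for r in rows]          # response-time is the third field
print("samples:", len(rsp))
print("mean response time: %.6f sec" % statistics.mean(rsp))
print("max response time:  %.6f sec" % max(rsp))
```

<P STYLE="margin-bottom: 0in; border: none; padding: 0in">Because each
record carries its operation type, records from concurrent runs of
different workloads can be concatenated and then grouped by the first
field before computing statistics.</P>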
<P STYLE="margin-bottom: 0in; border: none; padding: 0in"><BR>
</P>
<P STYLE="margin-bottom: 0in; border: none; padding: 0in">So for
example, if you want to run <STRONG>smallfile_cli.py</STRONG> on
1 host with 8 threads each creating 2 GB of 1-MB files, you can
use these options:</P>
<PRE CLASS="western" STYLE="margin-bottom: 0.2in; border: none; padding: 0in"><STRONG> </STRONG><STRONG><FONT SIZE=3># python smallfile_cli.py --operation create --threads 8 --file-size 1024 --files 2048 --top /mnt/gfs/smf</FONT></STRONG></PRE><P STYLE="margin-bottom: 0in; border: none; padding: 0in">
To run a 4-host test doing same thing:</P>
<PRE CLASS="western" STYLE="border: none; padding: 0in"><STRONG> </STRONG><STRONG><FONT SIZE=3># python smallfile_cli.py --operation create --threads 8 --file-size 1024 --files 2048 --top /mnt/gfs/smf \</FONT></STRONG>
<STRONG> </STRONG><STRONG><FONT SIZE=3>--host-set host1,host2,host3,host4</FONT></STRONG> </PRE><P STYLE="margin-bottom: 0in; border: none; padding: 0in">
Errors encountered by worker threads will be saved in
<B>/var/tmp/invoke-N.log</B> where <B>N</B> is the thread number.
After each test, a summary of thread results is displayed, and
overall test results are aggregated for you, in three ways:</P>
<UL>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in"><I>files/sec</I>
-- the only metric relevant to all tests</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in"><I>IOPS</I>
-- application I/O operations per second, the rate at which the
benchmark performed reads/writes</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in"><I>MB/s</I>
-- megabytes/sec, the rate at which the application transferred data</P>
</UL>
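<P STYLE="margin-bottom: 0in; border: none; padding: 0in">The
relationship between these three aggregated metrics can be illustrated
with simple arithmetic. The numbers below are hypothetical: they reuse
the 8-thread example above with a made-up elapsed time, and assume one
I/O per file because the record size equals the file size.</P>

```python
# Hypothetical run: 8 threads x 2048 files of 1024 KB each,
# completing in a made-up elapsed time of 100 seconds.
total_files = 8 * 2048
file_size_kb = 1024
record_size_kb = 1024                      # one read/write per file
elapsed_sec = 100.0

files_per_sec = total_files / elapsed_sec
iops = total_files * (file_size_kb // record_size_kb) / elapsed_sec
mb_per_sec = total_files * file_size_kb / 1024.0 / elapsed_sec

print(files_per_sec, iops, mb_per_sec)     # 163.84 163.84 163.84
```

<P STYLE="margin-bottom: 0in; border: none; padding: 0in">With a
smaller record size the IOPS figure rises while MB/s stays the same,
since each file then requires more system calls to transfer the same
amount of data.</P>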
<P STYLE="margin-bottom: 0in; border: none; padding: 0in">Users
should never need to run <STRONG>smallfile.py</STRONG> -- this is
the python class which implements the workload generator.
However, developers can run this module to invoke its unit tests:</P>
<PRE CLASS="western" STYLE="margin-bottom: 0.2in; border: none; padding: 0in"><STRONG> </STRONG><STRONG><FONT SIZE=3># python smallfile.py </FONT></STRONG></PRE><P STYLE="margin-bottom: 0in; border: none; padding: 0in">
To run just one unit test, run:
<PRE CLASS="western" STYLE="margin-bottom: 0.2in; border: none; padding: 0in"><STRONG> </STRONG><STRONG><FONT SIZE=3># python -m unittest smallfile.Test.test_c3_Symlink</FONT></STRONG></PRE><P STYLE="margin-bottom: 0in; line-height: 100%">
<BR>
</P>
<P STYLE="margin-bottom: 0in; line-height: 100%"><BR>
</P>
<H2 STYLE="margin-top: 0in; margin-bottom: 0.2in; font-weight: normal"><A NAME="__RefHeading__123_1677170542"></A><A NAME="__RefHeading__236_244684570"></A>
<FONT SIZE=4 STYLE="font-size: 16pt">Use with distributed
filesystems</FONT></H2>
<P STYLE="margin-bottom: 0.2in">To measure the performance of a
distributed filesystem, it is necessary to have multiple hosts
simultaneously applying workload. The --host-set parameter lets
you specify a comma-separated list of hosts to use.
</P>
<P STYLE="margin-bottom: 0.2in">For any distributed filesystem
test, there must be a single directory, shared across all
hosts (both test driver and worker hosts), that can be used to
pass test parameters, pass back results, and coordinate activity
across the hosts. This is referred to below as the “shared
directory”. By default this is the
<B>network_shared/</B> subdirectory of the --top directory, but
you can override this default by specifying the <B>--network-sync-dir</B>
directory parameter; see the next section for why this is useful.</P>
<P STYLE="margin-bottom: 0.2in">Some distributed filesystems,
such as NFS and Gluster, have relaxed, eventual-consistency
caching of directories; this causes problems for the shared
directory. To work around this, you can use a separate
NFS mountpoint exported from a Linux NFS server, mounted with the
option <FONT FACE="Courier 10 Pitch">actimeo=1</FONT> (to limit
how long NFS will cache directory entries and metadata).
You then reference this mountpoint using the <B>--network-sync-dir</B>
option of <B>smallfile</B>. For example:</P>
<P STYLE="margin-bottom: 0.2in"><BR><BR>
</P>
<P STYLE="margin-left: 0.49in; margin-bottom: 0in"><FONT SIZE=3>#
<B>mount -t nfs -o actimeo=1</B>
<I>your-linux-server</I>:/<I>your/nfs/export</I> <B>/mnt/nfs</B></FONT></P>
<P STYLE="margin-left: 0.49in; margin-bottom: 0in"><FONT SIZE=3>#
<B>./smallfile_cli.py --top</B> <I>/your/distributed/filesystem</I>
<B>--network-sync-dir /mnt/nfs/smf-shared</B></FONT>
</P>
<P STYLE="margin-bottom: 0in; border: none; padding: 0in"><SPAN STYLE="font-weight: normal">For
non-Windows tests, the user must set up password-less ssh between
the test driver and each worker host. If security is an issue, a non-root
username can be used throughout, since smallfile requires no
special privileges. Edit the </SPAN><B>$HOME/.ssh/authorized_keys</B>
<SPAN STYLE="font-weight: normal">file to contain the public key
of the account on the test driver. The test driver will bypass
the .ssh/known_hosts file by using the </SPAN><B>-o
StrictHostKeyChecking=no</B> <SPAN STYLE="font-weight: normal">option
of the </SPAN><B>ssh</B> <SPAN STYLE="font-weight: normal">command.</SPAN></P>
<P STYLE="margin-bottom: 0in; border: none; padding: 0in"><BR>
</P>
<P STYLE="margin-bottom: 0in; border: none; padding: 0in"><SPAN STYLE="font-weight: normal">For
Windows tests, each worker host must be running the</SPAN>
<B>launch_smf_host.py</B> <SPAN STYLE="font-weight: normal">program
that polls the shared network directory for a file that contains
the command to launch </SPAN><B>smallfile_remote.py</B> <SPAN STYLE="font-weight: normal">in
the same way that would happen with ssh on non-Windows tests. The
command-line parameters are:</SPAN></P>
<UL>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in"><B>--shared
shared-directory</B> -- <SPAN STYLE="font-weight: normal">this
must point at the directory shared by all smallfile hosts.
Normally this is the </SPAN><B>network_shared</B> <SPAN STYLE="font-weight: normal">subdirectory
of the </SPAN><B>--top</B> <SPAN STYLE="font-weight: normal">directory,
but it could be the </SPAN><B>--network-sync-dir</B> <SPAN STYLE="font-weight: normal">directory
if that is specified.</SPAN></P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in"><B>--as-host
host-name</B> -- <SPAN STYLE="font-weight: normal">specify what
hostname identifier will be used for this host. Why not just ask
the host what name to use? Hosts can have multiple network
interfaces, and therefore can have multiple host names; in some
cases we want to use IP addresses instead.</SPAN></P>
</UL>
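<P STYLE="margin-bottom: 0in; border: none; padding: 0in">The polling
mechanism described above can be sketched as follows. This is a
simplified illustration of the idea, not the actual
launch_smf_host.py program, and the launch-file naming here is
hypothetical.</P>

```python
import os
import time

def poll_for_launch(shared_dir, as_host, timeout=2.0, interval=0.1):
    """Poll the shared directory for a file containing a command to run.

    Sketch only: the real launch_smf_host.py has its own file naming
    and protocol; 'launch.<as-host>' here is a hypothetical name.
    """
    launch_file = os.path.join(shared_dir, "launch.%s" % as_host)
    deadline = time.time() + timeout
    while time.time() < deadline:
        if os.path.exists(launch_file):
            with open(launch_file) as f:
                cmd = f.read().strip()
            os.unlink(launch_file)          # consume the launch request
            return cmd                      # caller would then execute this
        time.sleep(interval)
    return None                             # timed out with no request
```

<P STYLE="margin-bottom: 0in; border: none; padding: 0in">Polling a
shared directory avoids any need for an ssh daemon on the Windows
hosts: the only requirement is that every host can read and write the
same shared directory.</P>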
<P STYLE="margin-bottom: 0in; border: none; padding: 0in; font-weight: normal">
An example of how to start a Windows test with this method
follows, using actual DOS prompt syntax. Something like the first
command must be run on every host participating in the test
before the test is actually started.</P>
<P STYLE="margin-bottom: 0in; border: none; padding: 0in"><BR>
</P>
<P STYLE="margin-bottom: 0in; border: none; padding: 0in"><FONT FACE="Courier 10 Pitch"><FONT SIZE=2 STYLE="font-size: 9pt"><SPAN STYLE="font-weight: normal">>
start python launch_smf_host.py --shared z:\smf\network_shared
--as-host gprfc023</SPAN></FONT></FONT></P>
<P STYLE="margin-bottom: 0in; border: none; padding: 0in"><FONT FACE="Courier 10 Pitch"><FONT SIZE=2 STYLE="font-size: 9pt"><SPAN STYLE="font-weight: normal">>
python smallfile_cli.py --top z:\smf
--host-set gprfc023</SPAN></FONT></FONT></P>
<P STYLE="margin-bottom: 0in; border: none; padding: 0in"><BR>
</P>
<H2 STYLE="margin-top: 0in; margin-bottom: 0.2in; font-weight: normal"><A NAME="__RefHeading__125_1677170542"></A><A NAME="__RefHeading__238_244684570"></A>
<FONT SIZE=4 STYLE="font-size: 16pt">Use with non-networked
filesystems</FONT></H2>
<P STYLE="margin-bottom: 0in; border: none; padding: 0in"><BR>
</P>
<P STYLE="margin-bottom: 0in; border: none; padding: 0in">There
are cases where you want to use a distributed filesystem test on
host-local filesystems. One such example is virtualization, where
the “local” filesystem is really layered on a virtual disk
image which may be stored in a network filesystem. The benchmark
needs to share certain files across hosts to return results and
synchronize threads. In such a case, you specify the
--<B>network-sync-dir</B> <I>directory-pathname</I> parameter
to have the benchmark use a directory in some shared filesystem
external to the test directory (specified with the --<B>top</B>
parameter). If this parameter is not specified, the shared
directory defaults to the <B>network_shared</B> subdirectory
underneath the directory specified with the --<B>top</B>
parameter.</P>
<H2 STYLE="margin-top: 0in; margin-bottom: 0.2in; border: none; padding: 0in">
<BR><BR>
</H2>
<H2 STYLE="margin-top: 0in; margin-bottom: 0.2in"><A NAME="__RefHeading__127_1677170542"></A><A NAME="__RefHeading__112_244684570"></A>
<FONT SIZE=4 STYLE="font-size: 16pt">Use of subdirectories</FONT></H2>
<P>Before a test even starts, the smallfile benchmark ensures
that the directories needed by that test already exist (there is
a specific operation type for testing performance of subdirectory
creation and deletion). If the top directory (specified by the
--top parameter) is <B>T</B>, then the top per-thread directory is
<B>T</B>/host/d<B>TT</B> where <B>TT</B> is a 2-digit thread
number and “host” is the hostname. If the test is not a
distributed test, the host is whatever host the benchmark
command was issued on; otherwise it is each of the hosts
specified by the --host-set parameter. The first F files (where
F is the value of the --files-per-dir parameter) are placed in
this top per-thread directory. If the test uses more than F
files/thread, then at least one subdirectory from the first level
of subdirectories must be used; these subdirectories have
paths of the form T/host/dTT/dNNN where NNN is the subdirectory
number. Suppose the value of the --subdirs-per-dir parameter is
D. Then there are at most D subdirectories of the top per-thread
directory. If the test requires more than F(D+1) files per
thread, then a second level of subdirectories must be
created, with pathnames like T/host/dTT/dNNN/dMMM. This process
of adding subdirectories continues until there
are enough subdirectories to hold all the files. The purpose
of this approach is to simulate a mixture of directories and
files without requiring the user to specify how many levels of
directories are needed.</P>
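<P STYLE="margin-bottom: 0in; border: none; padding: 0in">The
placement rule above can be sketched in a few lines of Python.
This is an illustrative model only, not smallfile's actual code:
the function name and the base-D encoding of deeper directory
levels are our own simplification.</P>

```python
# Illustrative sketch of the placement rule described above, NOT
# smallfile's actual implementation.  F = --files-per-dir value,
# D = --subdirs-per-dir value; deeper levels are encoded here as
# base-D "digits", one per directory level.
def dir_for_file(n, files_per_dir, subdirs_per_dir):
    """Return the directory (relative to the top per-thread
    directory) that holds file number n."""
    if n < files_per_dir:
        return "."                   # first F files live at the top
    slot = (n - files_per_dir) // files_per_dir
    parts = []
    while True:
        parts.append("d%03d" % (slot % subdirs_per_dir))
        slot //= subdirs_per_dir
        if slot == 0:
            break
    return "/".join(reversed(parts))
```

<P STYLE="margin-bottom: 0in; border: none; padding: 0in">With
--files-per-dir 100 and --subdirs-per-dir 10, files 0-99 land in
the top per-thread directory, file 100 lands in d000, and file
1100 lands in a second-level directory, matching the dNNN/dMMM
pattern above.</P>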
<P>The use of multiple mountpoints is supported. This feature is
useful for testing NFS and other network filesystems.</P>
<P>Note that the test harness does not have to scan the
directories to figure out which files to read or write – it
simply generates the filename sequence itself. If you want to
test directory scanning speed, use <B>readdir</B> or <B>ls-l</B>
operations.
</P>
<H2 STYLE="margin-top: 0in; margin-bottom: 0.2in"><A NAME="__RefHeading__129_1677170542"></A><A NAME="__RefHeading__114_244684570"></A>
<FONT SIZE=4 STYLE="font-size: 16pt">Sharing directories across
threads</FONT></H2>
<P>Some applications require that many threads, possibly spread
across many host machines, need to share a set of directories.
The <B>--same-dir</B> parameter makes it possible for the
benchmark to test this situation. By default this parameter is
set to N, which means each thread has its own non-overlapping
directory tree. This setting provides the best performance and
scalability. However, if the user sets this parameter to Y, then
the top per-thread directory for all threads will be <B>T</B> instead
of <B>T/host/dTT</B> as described in the preceding section.</P>
<H2 STYLE="margin-top: 0in; margin-bottom: 0.2in"><A NAME="__RefHeading__131_1677170542"></A><A NAME="__RefHeading__116_244684570"></A>
<FONT SIZE=4 STYLE="font-size: 16pt">Hashing files into directory
tree</FONT></H2>
<P><FONT SIZE=2 STYLE="font-size: 11pt">For applications which
create very large numbers of small files (millions for example),
it is impossible or at the very least impractical to place them
all in the same directory, whether or not the filesystem supports
so many files in a single directory. There are two ways which
applications can use to solve this problem:</FONT></P>
<UL>
<LI><P><FONT SIZE=2 STYLE="font-size: 11pt">insert files into 1
directory at a time – can create I/O and lock contention for
the directory metadata</FONT></P>
<LI><P><FONT SIZE=2 STYLE="font-size: 11pt">insert files into
many directories at the same time – relieves I/O and lock
contention for directory metadata, but increases the amount of
metadata caching needed to avoid cache misses</FONT></P>
</UL>
<P><FONT SIZE=2 STYLE="font-size: 11pt">The --<B>hash-into-dirs</B>
parameter is intended to enable simulation of this latter mode of
operation. By default, the value of this parameter is N, and in
this case a smallfile thread will access directories sequentially,
one at a time. In other words, the first F (where F = value of the
--<B>files-per-dir</B> parameter) files will be assigned to the
top per-thread directory, then the next F files will be assigned
to the next per-thread directory, and so on. However, if the
--<B>hash-into-dirs</B> parameter is set to Y, then the number
of the file being accessed by the thread will be hashed into the
set of directories that are being used by this thread. </FONT>
</P>
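<P>The contrast between the two modes can be sketched as follows.
This is an illustrative model; the function names and the use of an
MD5 hash are our own choices, not smallfile's internals.</P>

```python
import hashlib

# Illustrative contrast of the two placement modes described above
# (names are ours, not smallfile's).  F = --files-per-dir value.
def sequential_dir_index(file_num, files_per_dir):
    # --hash-into-dirs N (default): fill one directory at a time
    return file_num // files_per_dir

def hashed_dir_index(file_num, num_dirs):
    # --hash-into-dirs Y: hash the file number into the whole set of
    # directories, so many directories receive inserts concurrently
    digest = hashlib.md5(str(file_num).encode("ascii")).hexdigest()
    return int(digest, 16) % num_dirs
```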
<H2 STYLE="margin-top: 0in; margin-bottom: 0.2in"><A NAME="__RefHeading__133_1677170542"></A><A NAME="__RefHeading__118_244684570"></A>
<FONT SIZE=4 STYLE="font-size: 16pt">Random file size
distribution option</FONT></H2>
<P STYLE="margin-bottom: 0.2in"><FONT SIZE=2 STYLE="font-size: 11pt">In
real life, users don't create files that all have the same size.
Typically there is a file size distribution with a majority of
small files and a lesser number of larger files. This benchmark
supports use of the random exponential distribution to
approximate that behavior. If you specify </FONT>
</P>
<P STYLE="margin-left: 0.79in; margin-bottom: 0.2in"><FONT FACE="Courier 10 Pitch"><FONT SIZE=2 STYLE="font-size: 11pt">--file-size-distribution
<B>exponential</B> --file-size <B>S</B></FONT></FONT>
</P>
<P STYLE="margin-bottom: 0.2in"><FONT SIZE=2 STYLE="font-size: 11pt">The
meaning of the --<B>file-size</B> parameter changes to the
<I>maximum</I> file size (<B>S</B> KB), and the mean file size
becomes <B>S</B>/8. All file sizes are rounded down to the
nearest kilobyte boundary, and the smallest allowed file size is
1 KB. When this option is used, the smallfile benchmark saves the
seed for each thread's random number generator object in a <B>.seed</B>
file stored in the <B>TMPDIR</B> directory (typically <B>/var/tmp</B>).
This allows the file reader to recreate the sequence of random
numbers used by the file writer to generate file sizes, so that
the reader knows exactly how big each file should be without
asking the file system for this information. The append operation
works in the same way. All other operations are metadata
operations and do not require that the file size be known in
advance.</FONT></P>
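<P STYLE="margin-bottom: 0.2in"><FONT SIZE=2 STYLE="font-size: 11pt">The
size rule above can be sketched as follows. This is our own
simplification of the behavior described, not smallfile's actual
code; the function name is illustrative.</FONT></P>

```python
import random

# Sketch of the size rule described above (a simplification, not
# smallfile's actual code): exponential distribution with mean S/8,
# capped at the maximum S, rounded down to a whole KB, minimum 1 KB.
# Reusing the saved seed lets the reader regenerate the exact size
# sequence the writer produced.
def file_sizes_kb(max_size_kb, count, seed):
    rng = random.Random(seed)     # seed would come from the .seed file
    mean = max_size_kb / 8.0
    sizes = []
    for _ in range(count):
        size_kb = int(min(rng.expovariate(1.0 / mean), max_size_kb))
        sizes.append(max(size_kb, 1))
    return sizes
```

<P STYLE="margin-bottom: 0.2in"><FONT SIZE=2 STYLE="font-size: 11pt">Because
the generator is seeded identically, a reader calling this with the
writer's saved seed gets the same size for every file without
consulting the filesystem.</FONT></P>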
<P STYLE="margin-bottom: 0.2in"><BR><BR>
</P>
<H2 STYLE="margin-top: 0in; margin-bottom: 0.2in"><A NAME="__RefHeading__135_1677170542"></A><A NAME="__RefHeading__120_244684570"></A>
<FONT SIZE=4 STYLE="font-size: 16pt">Avoiding caching effects</FONT></H2>
<P STYLE="margin-bottom: 0in; border: none; padding: 0in"><A NAME="Avoiding_caching_effects"></A>
There are two types of caching effects that we wish to avoid:
data caching and metadata caching. If the average object
size is sufficiently large, we need only be concerned about data
caching effects. In order to avoid data caching effects
during a large-object read test, the Linux buffer cache on all
servers must be cleared. In part this is done using the command
"echo 1 &gt; /proc/sys/vm/drop_caches" on all hosts.
However, gluster has its own internal caches. To
evict all prior data from the cache, the simplest method is to
just use iozone to write a large amount of data into some files
in the gluster filesystem, then delete them. For example,
if the gluster 3.2 server caches 1 GB of data then the amount of
data written should be roughly 2 GB/server and the number of
files used should be roughly 8 times the number of servers.
Use of many separate files ensures that this cache eviction data
is spread across all servers approximately equally.</P>
<P STYLE="margin-bottom: 0in; border: none; padding: 0in"><BR>
</P>
<H2 STYLE="margin-top: 0in; margin-bottom: 0.2in; border: none; padding: 0in"><A NAME="__RefHeading__137_1677170542"></A><A NAME="__RefHeading__240_244684570"></A>
<FONT SIZE=4 STYLE="font-size: 16pt">Use of --pause in
multi-thread tests</FONT></H2>
<P STYLE="margin-bottom: 0in; border: none; padding: 0in">In some
filesystems, the first thread that starts running will be
operating at memory speed (example: NFS writes) and can easily
finish before other threads have a chance to get started.
This immediately invalidates the test. To make this less
likely, it is possible to insert a per-file delay into each
thread with the --pause option so that the other threads have a
chance to participate in the test during the measurement
interval. It is preferable to run a longer test
instead, because in some cases you might otherwise restrict
throughput unintentionally. But if you know that your
throughput upper bound is X files/sec and you have N threads
running, then your per-thread throughput should be no more than
X/N files/sec, meaning each thread needs at least N/X seconds per
file; a reasonable pause would be something like 3N/X seconds.
For example, if you know that you cannot do better than 100000
files/sec and you have 20 threads running, try a
3 &times; 20/100000 sec = 600 microsecond pause. Verify
that this isn't affecting throughput by reducing the pause and
running a longer test.</P>
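<P STYLE="margin-bottom: 0in; border: none; padding: 0in">The rule
of thumb above can be written as a one-line helper (the function
name is ours, for illustration only):</P>

```python
# Sketch of the rule of thumb above: with an estimated aggregate
# throughput ceiling X (files/sec) and N threads, a per-file pause
# of roughly 3N/X seconds keeps the fastest thread busy long enough
# for the other threads to join the measurement interval.
def suggested_pause_microsec(max_files_per_sec, num_threads):
    return 3.0 * num_threads / max_files_per_sec * 1_000_000
```

<P STYLE="margin-bottom: 0in; border: none; padding: 0in">For the
example above, suggested_pause_microsec(100000, 20) gives 600
microseconds.</P>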
<H2 STYLE="margin-bottom: 0.2in; font-weight: normal"><A NAME="__RefHeading__139_1677170542"></A>
<FONT SIZE=4>How to measure asynchronous file copy performance</FONT></H2>
<P STYLE="margin-bottom: 0in; font-weight: normal; line-height: 100%">
<FONT FACE="Liberation Serif, serif"><FONT SIZE=3>When we want to
measure performance of an asynchronous file copy (example:
Gluster geo-replication), we can use smallfile to create the
original directory tree, but then we can use the new await-create
operation type to wait for files to appear at the file copy
destination. To do this, we need to specify a separate network
sync directory. So for example, to create the original directory
tree, we could use a command like:</FONT></FONT></P>
<P STYLE="margin-bottom: 0in; line-height: 100%"><BR>
</P>
<P STYLE="margin-bottom: 0in; line-height: 100%"><FONT FACE="Courier 10 Pitch"><FONT SIZE=3><SPAN STYLE="font-weight: normal">./smallfile_cli.py
--top /mnt/glusterfs-master/smf \</SPAN></FONT></FONT></P>
<P STYLE="margin-bottom: 0in; line-height: 100%"><FONT FACE="Courier 10 Pitch"><FONT SIZE=3><SPAN STYLE="font-weight: normal">--threads
16 --files 2000 --file-size 1024 \</SPAN></FONT></FONT></P>
<P STYLE="margin-bottom: 0in; font-weight: normal; line-height: 100%">
<FONT FACE="Courier 10 Pitch"><FONT SIZE=3>--operation create
--incompressible Y --record-ctime-size Y --response-times Y</FONT></FONT></P>
<P STYLE="margin-bottom: 0in; line-height: 100%"><BR>
</P>
<P STYLE="margin-bottom: 0in; line-height: 100%"><FONT FACE="Liberation Serif, serif"><FONT SIZE=3><SPAN STYLE="font-weight: normal">Suppose
that this mountpoint is connected to a Gluster “master”
volume which is being geo-replicated to a “slave” volume in a
remote site asynchronously. We can measure the performance of
this process using a command like this, where
/mnt/glusterfs-slave is a read-only mountpoint accessing the
slave volume.</SPAN></FONT></FONT></P>
<P STYLE="margin-bottom: 0in; line-height: 100%"><BR>
</P>
<P STYLE="margin-bottom: 0in; font-weight: normal; line-height: 100%">
<FONT FACE="Courier 10 Pitch"><FONT SIZE=3>./smallfile_cli.py
--top /mnt/glusterfs-slave/smf \</FONT></FONT></P>
<P STYLE="margin-bottom: 0in; font-weight: normal; line-height: 100%">
<FONT FACE="Courier 10 Pitch"><FONT SIZE=3>--threads 16 --files
2000 --file-size 1024 \</FONT></FONT></P>
<P STYLE="margin-bottom: 0in; font-weight: normal; line-height: 100%">
<FONT FACE="Courier 10 Pitch"><FONT SIZE=3>--operation
await-create --incompressible Y --response-times Y \</FONT></FONT></P>
<P STYLE="margin-bottom: 0in; font-weight: normal; line-height: 100%">
<FONT FACE="Courier 10 Pitch"><FONT SIZE=3>--network-sync-dir
/tmp/other</FONT></FONT></P>
<P STYLE="margin-bottom: 0in; line-height: 100%"><BR>
</P>
<P STYLE="margin-bottom: 0in; font-weight: normal; line-height: 100%">
<FONT FACE="Liberation Serif, serif"><FONT SIZE=3>Requirements:</FONT></FONT></P>
<UL>
<LI><P STYLE="margin-bottom: 0in; font-weight: normal; line-height: 100%">
<FONT FACE="Liberation Serif, serif"><FONT SIZE=3>The parameters
controlling file sizes, directory tree, and number of files must
match in the two commands.</FONT></FONT></P>
<LI><P STYLE="margin-bottom: 0in; font-weight: normal; line-height: 100%">
<FONT FACE="Liberation Serif, serif"><FONT SIZE=3>The
--<B>incompressible</B> option must be set if you want to avoid the
situation where the async copy software compresses the data and
thereby appears to exceed the network bandwidth.</FONT></FONT></P>
<LI><P STYLE="margin-bottom: 0in; line-height: 100%"><FONT FACE="Liberation Serif, serif"><FONT SIZE=3><SPAN STYLE="font-weight: normal">The
first command must use the </SPAN>--<B>record-ctime-size Y</B>
<SPAN STYLE="font-weight: normal">option so that the
await-create operation knows when the original file was created
and how big it was. </SPAN></FONT></FONT>
</P>
</UL>
<P STYLE="margin-bottom: 0in; line-height: 100%"><BR>
</P>
<P STYLE="margin-bottom: 0in; border: none; padding: 0in; line-height: 100%">
<FONT FACE="Liberation Serif, serif"><FONT SIZE=3><SPAN STYLE="font-weight: normal">How
does this work? The first command records information in a
user-defined xattr for each file so that the second command's
</SPAN><B>await-create</B> <SPAN STYLE="font-weight: normal">operation
can calculate the time required to copy the file (recorded
as a “response time”) and can confirm that the entire
file reached the destination.</SPAN></FONT></FONT></P>
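<P STYLE="margin-bottom: 0in; border: none; padding: 0in; line-height: 100%">
A rough model of this handshake is sketched below. smallfile keeps
the record in a user-defined xattr; for portability this sketch
stores it in a sidecar file instead, and the ".meta" suffix and
function names are made up for illustration.</P>

```python
import os
import time

# Rough model of the await-create handshake described above.
# smallfile records the data in a user-defined xattr; this sketch
# uses a sidecar file so it runs on any filesystem (the ".meta"
# suffix and the function names are illustrative, not smallfile's).
def record_ctime_size(path, expected_size):
    # writer side: remember when the file was created and how big it is
    with open(path + ".meta", "w") as f:
        f.write("%f,%d" % (time.time(), expected_size))

def await_create(path, poll_interval=0.1):
    # reader side: poll until the whole file has arrived, then report
    # elapsed time since creation as the copy "response time"
    while True:
        with open(path + ".meta") as f:
            ctime_s, size_s = f.read().split(",")
        if os.path.exists(path) and os.path.getsize(path) == int(size_s):
            return time.time() - float(ctime_s)
        time.sleep(poll_interval)
```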
<P STYLE="margin-bottom: 0in; border: none; padding: 0in; line-height: 100%">
<BR>
</P>
<P STYLE="margin-bottom: 0in; border: none; padding: 0in; line-height: 100%">
<FONT FACE="Liberation Serif, serif"><FONT SIZE=3><SPAN STYLE="font-weight: normal">WARNING:
the --verify-read option is not supported with the --await-create
operation, so smallfile is not yet able to verify that the
contents of the files are correct, only that the file size is
correct.</SPAN></FONT></FONT></P>
<P STYLE="margin-bottom: 0in; border: none; padding: 0in; line-height: 100%">
<BR>
</P>
<H1 CLASS="western" STYLE="margin-top: 0in; margin-bottom: 0.2in">
Results</H1>
<P STYLE="margin-bottom: 0in; border: none; padding: 0in">All
tests display a "files/sec" result. If the test
performs reads or writes, then a "MB/sec" data transfer
rate and an "IOPS" result (i.e. total read or write
calls/sec) are also displayed. Each thread participating in
the test keeps track of total number of files and I/O requests
that it processes during the test measurement interval.
These results are rolled up per host if it is a single-host
test. For a multi-host test, the per-thread results for
each host are saved in a file within the --top directory, and the
test master then reads in all of the saved results from its
slaves to compute the aggregate result across all client hosts.
The percentage of requested files which were processed in the
measurement interval is also displayed, and if the number is
lower than a threshold (default 70%) then an error is raised.</P>
<H2 STYLE="margin-top: 0in; margin-bottom: 0.2in; border: none; padding: 0in"><A NAME="__RefHeading__141_1677170542"></A><A NAME="__RefHeading__122_244684570"></A>
Response time collection</H2>
<P STYLE="margin-bottom: 0in; border: none; padding: 0in">Response
times for operations on each file are saved per thread in .csv
form. For example, you can turn these into an X-Y
scatterplot to see how response time varies over
time:<BR> </P>
<P STYLE="margin-bottom: 0in; border: none; padding: 0in"><STRONG><FONT SIZE=4>#
python smallfile_cli.py --response-times Y</FONT></STRONG><BR><STRONG><FONT SIZE=4>#
ls -ltr /var/tmp/rsptimes*.csv</FONT></STRONG></P>
<P STYLE="margin-bottom: 0in; border: none; padding: 0in">You
should see one .csv file per thread. These files are in a format
that can be loaded into any spreadsheet application, such as
Excel, and graphed. An X-Y scatterplot can be useful for seeing
how response time changes over the course of the test.</P>
<H1 CLASS="western" STYLE="margin-top: 0in; margin-bottom: 0.2in">
Comparable Benchmarks</H1>
<P STYLE="margin-bottom: 0in; border: none; padding: 0in">There
are many existing performance test benchmarks, and I have tried
just about all the ones that I've heard of. Here are the ones I
have looked at; I'm sure there are many more that I failed to
include here.</P>
<UL>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in"><B>Bonnie++</B>
-- works well for a single host, but you cannot generate load
from multiple hosts because the benchmark will not synchronize
its activities, so different phases of the benchmark will be
running at the same time, whether you want them to or not.
</P>
</UL>
<UL>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in"><B>iozone</B>
-- this is a great tool for large-file testing, but it can only
do 1 file/thread in its current form.
</P>
</UL>
<UL>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in"><B>postmark</B>
-- works fine for a single client, not as useful for
multi-client tests
</P>
</UL>
<UL>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in"><B>grinder</B>
-- has not to date been useful for filesystem testing, though it
works well for web services testing.
</P>
</UL>
<UL>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in"><B>JMeter</B>
– has been used successfully by others in the past.
</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in"><B>fs_mark</B>
-- Ric Wheeler's filesystem benchmark, is very good at creating
files
</P>
</UL>
<UL>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in"><B>fio</B>
-- Linux test tool -- broader coverage of Linux system calls
particularly around async. and direct I/O</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in"><B>diskperf</B>
– open-source tool that generates limited small-file workloads
for a single host.</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in"><B>dbench</B>
– developed by samba team</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in"><B>SPECsfs</B>
– not open-source, but <B>netmist</B> workload generator is
another distributed workload generator (configured similarly to
iozone) but with a wider range of workloads.
</P>
</UL>
<P STYLE="margin-bottom: 0in; border: none; padding: 0in"> </P>
<H1 CLASS="western" STYLE="margin-top: 0in; margin-bottom: 0.2in">
Design principles</H1>
<P STYLE="margin-bottom: 0in; border: none; padding: 0in">A
cluster-aware test tool ideally should:</P>
<UL>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in">start
threads on all hosts at same time
</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in">stop
measurement of throughput for all threads at the same time
</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in">be
easy to use in all file system environments
</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in">be
highly portable and be trivial to install
</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in">have
very low overhead
</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in">not
require threads to synchronize (be embarrassingly parallel)
</P>
</UL>
<P STYLE="margin-bottom: 0in; border: none; padding: 0in">Although
there may be some useful tests that involve thread
synchronization or contention, we don't want the tool to
<I>force</I> thread synchronization or contention for resources.
In order to run prolonged small-file tests (which is a
requirement for scalability to very large clusters), each thread
has to be able to use more than one directory. Since
some filesystems perform very differently as the files/directory
ratio increases, and most applications and users do not rely on
having huge file/directory ratios, this is also important for
testing the filesystem with a realistic use case. This
benchmark does something similar to Ric Wheeler's fs_mark
benchmark with multiple directory levels. It
imposes no hard limit on how many directories can be
used or how deep the directory tree can go. Instead, it
creates directories according to these constraints:</P>
<OL>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in">files
(and directories) are placed as close to the root of the
directory hierarchy as possible
</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in">no
directory contains more than the number of files specified in
the --files-per-dir test parameter
</P>
<LI><P STYLE="margin-bottom: 0in; border: none; padding: 0in">no
directory contains more than the number of subdirectories
specified in the --dirs-per-dir test parameter
</P>
</OL>
<H2 STYLE="margin-top: 0in; margin-bottom: 0.2in"><BR><BR>
</H2>
<H2 STYLE="margin-top: 0in; margin-bottom: 0.2in"><A NAME="__RefHeading__143_1677170542"></A><A NAME="__RefHeading__124_244684570"></A><A NAME="Synchronization"></A>
<FONT SIZE=4 STYLE="font-size: 16pt">Synchronization</FONT></H2>
<P STYLE="margin-bottom: 0in; border: none; padding: 0in">A
single directory is used to synchronize the threads and hosts.
This may seem problematic, but we assume here that the file
system is not very busy when the test is run (otherwise why would
you run a load test on it?), so a file created by one
thread will quickly become visible to the others.
</P>
<P STYLE="margin-bottom: 0in; border: none; padding: 0in"><BR>
</P>
<P STYLE="margin-bottom: 0in; border: none; padding: 0in">If it's
a single-host test, any directory is sharable amongst threads,
but in a multi-host test only a directory shared by all
participating hosts can be used. If the --<B>top</B> test
directory is in a network-accessible file system (NFS or
Gluster, for example), then the synchronization directory
defaults to the network_shared subdirectory and need
not be specified. If the --<B>top</B> directory is in a
host-local filesystem, then the --<B>network-sync-dir</B> option
must be used to specify the synchronization directory. When a
network directory is used, change propagation between hosts
cannot be assumed to occur in under two seconds.
</P>
<P STYLE="margin-bottom: 0in; border: none; padding: 0in"><BR>
</P>
<P STYLE="margin-bottom: 0in; border: none; padding: 0in">We use
the concept of a "starting gate" -- each thread does
all preparation for the test, then waits for a special file, the
"starting gate", to appear in the shared area. When a
thread arrives at the starting gate, it announces its arrival by
creating a filename with the host and thread ID embedded in it.
When all threads have arrived, the controlling process will see
all the expected "thread ready" files, and will then
create the <B>starting gate</B> file. When the starting gate is
seen, the thread pauses for a couple of seconds, then commences
generating workload. This initial pause reduces time required for
all threads to see the starting gate, thereby minimizing chance
of some threads being unable to start on time. Synchronous thread
startup reduces the "warmup time" of the system
significantly.</P>
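<P STYLE="margin-bottom: 0in; border: none; padding: 0in">The
starting-gate protocol above can be sketched as follows. The file
names and function names here are illustrative, not smallfile's
actual ones.</P>

```python
import glob
import os
import time

# Sketch of the starting-gate protocol described above; file names
# are illustrative, not smallfile's actual ones.
def worker_arrive(sync_dir, host, thread_id):
    # announce arrival with a per-thread "ready" file
    name = "ready.%s.%02d" % (host, thread_id)
    open(os.path.join(sync_dir, name), "w").close()

def master_release(sync_dir, expected_threads):
    # wait until every thread has checked in, then open the gate
    while len(glob.glob(os.path.join(sync_dir, "ready.*"))) < expected_threads:
        time.sleep(0.1)
    open(os.path.join(sync_dir, "starting_gate"), "w").close()

def worker_wait_for_gate(sync_dir, settle_seconds=2):
    # spin until the gate appears, then pause briefly so slower
    # threads also see the gate before any workload starts
    while not os.path.exists(os.path.join(sync_dir, "starting_gate")):
        time.sleep(0.1)
    time.sleep(settle_seconds)
```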
<P STYLE="margin-bottom: 0in; border: none; padding: 0in"><BR>
</P>
<P STYLE="margin-bottom: 0in; border: none; padding: 0in">We also
need a checkered flag (borrowing from the car racing metaphor).
Once the test starts, each thread looks for a <B>stonewall</B>
file in the synchronization directory. If this file exists, the
thread stops measuring throughput at that point (though by
default it continues to perform the requested number of
operations). Consequently, throughput measurements for each
thread may be added to obtain an accurate aggregate throughput
number. This practice is sometimes called "stonewalling" in the
performance testing world.</P>
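<P STYLE="margin-bottom: 0in; border: none; padding: 0in">A
minimal sketch of stonewalling follows; the function names and the
structure of the loop are our own illustration, not smallfile's
actual code.</P>

```python
import os

# Sketch of stonewalling as described above (names illustrative):
# the first thread to finish raises the stonewall; a thread that
# sees it stops *counting* files but keeps doing the work so it
# does not distort the load seen by the remaining threads.
def run_thread(sync_dir, do_one_file, total_files):
    stonewall = os.path.join(sync_dir, "stonewall")
    measured = 0
    measuring = True
    for _ in range(total_files):
        do_one_file()
        if measuring and os.path.exists(stonewall):
            measuring = False            # stop measuring, keep running
        elif measuring:
            measured += 1
    open(stonewall, "w").close()         # finished: signal the others
    return measured
```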
<P STYLE="margin-bottom: 0in; border: none; padding: 0in"><BR>
</P>
<P STYLE="margin-bottom: 0in; border: none; padding: 0in">Synchronization
operations <I>in theory</I> do not require the worker threads to
read the synchronization directory. For distributed tests, the
test driver host has to check whether the various per-host
synchronization files exist, but this does not require a readdir
operation. The test driver does this check in such a way that the
number of file lookups is only slightly more than the number of
hosts, and this does not require reading the entire directory,
only doing a set of lookup operations on individual files, so
it's <I>O(n)</I> scalable as well.</P>
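<P STYLE="margin-bottom: 0in; border: none; padding: 0in">The
per-host check can be sketched like this; the file-name pattern
and function name are illustrative, not smallfile's actual ones.</P>

```python
import os

# Sketch of the O(n) check described above: one existence lookup
# per expected host file, with no directory listing (the file-name
# pattern is illustrative, not smallfile's actual one).
def all_hosts_ready(sync_dir, hosts, phase):
    for host in hosts:
        token = os.path.join(sync_dir, "%s.%s.ready" % (phase, host))
        if not os.path.exists(token):    # a lookup, not a readdir
            return False
    return True
```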
<P STYLE="margin-bottom: 0in; border: none; padding: 0in"><BR>
</P>
<P STYLE="margin-bottom: 0in; border: none; padding: 0in">The bad
news is that some filesystems do not synchronize directories
quickly without an explicit readdir() operation, so we are at
present doing os.listdir() as a workaround -- this may have to be
revisited for very large tests.</P>
<P STYLE="margin-bottom: 0in; border: none; padding: 0in"><BR>
</P>
<H3 CLASS="western" STYLE="font-weight: normal"><A NAME="__RefHeading__145_1677170542"></A><A NAME="__RefHeading__126_244684570"></A>
How test parameters are transmitted to worker threads</H3>
<P STYLE="margin-bottom: 0in; border: none; padding: 0in">The
results of the command line parse are saved in a <B>smf_test_params</B>
object and stored in a python pickle file, which is a
representation independent of CPU architecture or operating
system. The file is placed in the shared network directory.
Remote worker processes are invoked via the <B>smallfile_remote.py</B>