DocAdmin
This document describes the various administrative operations involved in a SLASH2 deployment.
This section outlines very basic instructions on getting familiar with ZFS snapshotting capabilities. This is important for a SuperCell deployment since the metadata file system (MDFS) is actually a zpool, so backing up, restoring, etc. are all necessary components of properly administering it.
While slashd is running, zfs and zpool commands will work the same way as one would expect under any other ZFS installation.
The notion of a snapshot in ZFS is a near-instantaneous copy-on-write clone of the entire zpool at a given point in time. The snapshot has a name. The snapshot can be exported into a raw stream that can be used for recovery purposes later on. A snapshot stream is the raw file system data that has been exported via the zfs send command.
There are two types of snapshot streams: full and incremental (or “partial”). A full snapshot stream contains the entire contents of the zpool at a certain point in time, serialized into a single file, which may be compressed to save space if desired. (SLASH2 metadata is highly compressible, so it is advised to do so using a parallel compression tool such as pbzip2 or pxz.) An incremental snapshot stream is made relative to an existing snapshot, similar to a logical diff of file system changes.
A snapshot can be made of the entire MDFS reflecting the current moment in time:
# zfs snapshot s2_mdfs@snaptest0
Again, snapshots in ZFS apply to the entire file system i.e. they cannot be made on a single subdirectory tree or individual file. The state for a snapshot is held internally inside the zpool. The zfs send and zfs recv commands can be used to externalize or re-internalize a snapshot (i.e. import and fully recover state in the live file system). More details on how snapshots work are available in a variety of ZFS documentation/guides.
All snapshots held in a zpool can be listed:
# zfs list -t snapshot
NAME USED AVAIL REFER MOUNTPOINT
s2_mdfs@snaptest0 0 - 64.5K -
To externalize a snapshot into a stream, an action necessary for proper backup maintenance, run:
# zfs send s2_mdfs@snaptest0 > \
/local/snapshots/full/[email protected]
# ls -lFah /local/snapshots/full/[email protected]
-rw-r--r-- 1 root wheel 80k Jun 8 19:42 [email protected]
As the metadata constitutes all data except for actual user data, the footprint is several orders of magnitude less than the SuperCell data volume. Still, the space can grow, so it is advisable to compress the streams. Snappy, bzip2, xz, etc. and the parallel versions of these tools are great things to have in the toolchain.
The authors advise creating a periodic (cron) process to automatically create these streams and back them up on other hosts for file system analysis and disaster recovery.
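A minimal sketch of such a job follows; the pool name, snapshot naming scheme, backup host, and paths are illustrative assumptions to be adapted for the deployment:
#!/bin/sh
# Create a dated snapshot, export it as a compressed full stream,
# and copy it to a backup host (illustrative sketch only).
pool=s2_mdfs
snap=$pool@$(date +%Y%m%d)
snapdir=/local/snapshots/full
zfs snapshot "$snap"
zfs send "$snap" | pbzip2 -c > "$snapdir/$snap.zsnap.bz2"
scp "$snapdir/$snap.zsnap.bz2" backuphost:"$snapdir/"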
When synchronizing clones of ZFS file systems, instead of serializing full snapshots, the ZFS incremental snapshot feature can be leveraged, reducing export, transfer, and import time and resources. An incremental snapshot stream is made relative to another snapshot. This stream can then be copied to other machines and imported to bring the clones up to date:
# snapfn=$snapdir/partial/$pool@$CURR_SNAP.zsnap
# zfs send -i $PREV_SNAP $pool@$CURR_SNAP > $snapfn
# scp $snapfn destination:$snapfn
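On the destination, the incremental stream can then be received on top of the previously imported snapshot (a sketch; it assumes the destination pool uses the same name and already contains $PREV_SNAP):
destination# zfs recv $pool < $snapfn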
The zstreamdump(8) command may be used to verify that a ZFS snapshot stream was properly generated and can later be used to rebuild a file system from backup (in ZFS terminology, “received”).
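For example, to inspect the stream exported earlier (zstreamdump reads the stream on standard input and prints a summary of its contents):
# zstreamdump < /local/snapshots/full/[email protected]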
For disaster recovery, the MDFS zpool needs to be rebuilt and the data imported back in from a recent full snapshot stream:
# slmctl stop # kill SLASH2 slashd
# zfs-fuse
# zpool create s2_mdfs raidz3 /dev/disk/by-id/ata-...
# zfs set compression=… s2_mdfs # any FS options
# decompress /local/snapshots/full/[email protected] | \
zfs recv -F s2_mdfs
Once this is complete, unmount /s2_mdfs, kill zfs-fuse, and relaunch slashd.
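For example (assuming the zpool was mounted at /s2_mdfs and zfs-fuse was started by hand as shown above; relaunch slashd however the deployment normally starts it):
# umount /s2_mdfs
# pkill zfs-fuse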
All machines participating in the SuperCell deployment (MDS, IOS, and clients) should send all of their syslog messages to “syslog servers” to make aggregate log processing easier.
Non-dedicated client machines also intending to mount the SuperCell should send at least the daemon syslog messages, which should include all activity logged by the SLASH2 mount_slash client software.
Note: the @ denotes UDP for the transport whereas @@ denotes TCP. TCP is recommended but may not be supported by all systems.
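As a minimal sketch for hosts running rsyslog, forwarding the daemon facility to the two log servers described below (the port is an assumption; adjust the syntax for other syslog daemons):
daemon.* @@arclog0.psc.edu:514
daemon.* @@arclog1.psc.edu:514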
Having a periodic test to ensure healthy service, e.g. through a daily file access (read/write) cronjob, is helpful, as each component will register baseline activity that shows up in the log processing system.
The SLASH2 source code tree contains utils/slash2_check.py, which performs a basic system health check and sends reports to a Nagios server.
We have two log servers: arclog0.psc.edu and arclog1.psc.edu.
Log messages should be sent to BOTH servers so that we don't lose them (e.g. in case one server is down).
Logs for the archiver are kept in /var/log/arc on the log servers:
- messages from mount_slash client daemons are sorted into /var/log/arc/mount_slash
- logs from the metadata server(s) are stored in /var/log/arc/mdserver
- logs from the I/O servers are stored in /var/log/arc/ioserver
- logs from all firehoses are stored in /var/log/arc/client
Log Rotation
- ioserver and mount_slash logs are rotated daily (since weekly logs grow too large)
- other logs are rotated weekly
- to rotate any other logs daily, add the log filename to /etc/logrotate.d/arclog (see the sketch below)
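A hypothetical logrotate stanza for /etc/logrotate.d/arclog might look like the following (the log path glob and retention count are assumptions):
/var/log/arc/mount_slash/*.log {
    daily
    rotate 14
    compress
    missingok
    notifempty
}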
Logs are copied to the archiver directory /arc/logs/server/arclog(0,1). Changes can be made in /SYSTEM.CONF to set the copies to DAILY instead of WEEKLY.
You may find that you want to track and report various aspects of your file system including the current state of the system as well as the daily change (via syslog).
The dumpfid tool included in the SLASH2 distribution traverses the file system and reports the attributes of each file encountered. It can return the uid, gid, file size, timestamp and permissions of each file in the file system.
As an example, one user directory is scanned with three worker threads to exploit parallelism provided by the underlying storage hardware (triple mirrored SSDs):
# dumpfid -t 3 -O /local/reports/${user}_output_files.%n -RF \
'%g %u %s %Tm<%Y-%m-%dT%H:%M:%S> %m %f'$'\n' /s2_mdfs_mirror/$user
You can use syslogs with the Simple Event Correlator (SEC) to track what the file system is doing in real time: reads, writes, deletes, and so on.
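A hypothetical SEC rule along these lines could flag client errors as they arrive; the log prefix and keywords matched here are assumptions about the local log format:
# slash2.sec: report mount_slash log lines that mention an error
type=Single
ptype=RegExp
pattern=mount_slash.*(error|fail)
desc=SLASH2 client problem
action=write - SLASH2 client problem: $0
It can be run against the aggregated client logs with, for example, sec --conf=slash2.sec --input=$logfile.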
Sample Reports (see Appendix A for a sample PSC report):
- Top 10 readers and writers by day
- All Hosts, Data/File Count Read and Written
- Total Daily Growth
- Current Size/Space Available
- MDS Disk Capacity
- Compression Ratio
- Current state of file system: usage by group, user, file count, GB used
Nagios is an extensible monitoring framework. A Nagios deployment consists of two components:
- A Nagios server: a collector of service reports and a service health monitor. Configuration of each node and service is required. The default Web-based monitoring interface may be used, but many alternative packages are available, such as Thruk, which provide versatility in what is deemed important to monitor.
- Nagios clients: each service reports details on its health to the server.
Each component in a SLASH2 deployment may use the slash2_check.py script included in the SLASH2 source distribution. The script is intended to run periodically out of a cronjob, such as every 5 minutes, and requires only Python and the send_nsca Nagios client utility.
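For example, a crontab entry along these lines (the install path is an assumption):
*/5 * * * * /usr/local/slash2/utils/slash2_check.py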
SEC – Simple Event Correlator for reacting to log events - http://simple-evcorr.sourceforge.net/SEC-tutorial/article.html
Ganglia is a third-party, fairly feature-complete monitoring package that is easy to install and can provide a number of useful performance observability statistics: http://ganglia.sourceforge.net/
On the constituent I/O servers, periodic scrubbing of the zpools is recommended. Scrubbing in ZFS is a process where the file system itself scans and reads each piece of data stored for integrity. During this process, any consistency errors discovered are fixed.
The command zpool scrub $poolname initiates this process.
When errors are encountered on the underlying hardware comprising zpools, ZFS reports such conditions via zpool status. It is essential to have a monitoring/notification framework in place that alerts administrators to take action on failing disks before further degradation and potential data loss occur.
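For example, on an I/O server (the pool name is illustrative; the scrub is typically scheduled from cron, e.g. weekly):
ios# zpool scrub s2_pool
ios# zpool status -x s2_pool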
To bring the system fully offline, all services must be stopped; bringing down only parts of the deployment will cripple the service. Generally, the client service (mount_slash) on all hosts is stopped first, then the I/O servers, and finally the MDS servers.
To stop the SLASH2 client service mount_slash:
client# pkill mount_slash
If a client host is mounting multiple SLASH2 file systems, be careful as this command will indiscriminately stop all SLASH2 client services.
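If only one of several mounts must be stopped, one approach is to match on the mount point with pkill -f, assuming the mount point appears on the daemon's command line (the mount point shown is illustrative):
client# pkill -f 'mount_slash.*/s2'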
To stop the SLASH2 I/O service sliod:
ios# slictl stop
If an IOS is running multiple instances of the sliod SLASH2 I/O service, be careful to select which daemon to stop. The utils/slictlN script is included in the SLASH2 distribution to direct control requests to a specific daemon instance.
It is generally recommended to perform a snapshot and backup of the metadata before stopping the MDS service, in case a system change renders the service unavailable.
mds# slmctl stop
To disable write I/O lease assignments from being made to a particular I/O system:
mds# slmctl -p resources.$SITE.$IOS_RESOURCE.disable_bia=1
"bia" here stands for "bmap <-> IOS assignment".
"IOS" here means I/O system (i.e. a sliod instance).
In SLASH2, such an assignment means that write I/O is bound to a particular I/O system. Note that read I/O leases are not bound to any particular IOS; the client may choose from which IOS to access the desired data.
To determine whether write lease assignments are enabled to any sliod instance
on the ios0 server, query each sliod instance:
mds# slmctl -p resources.$SITE.ios0s0.disable_bia
mds# slmctl -p resources.$SITE.ios0s1.disable_bia
mds# slmctl -p resources.$SITE.ios0s2.disable_bia
mds# slmctl -p resources.$SITE.ios0s3.disable_bia
mds# slmctl -p resources.$SITE.ios0s4.disable_bia
mds# slmctl -p resources.$SITE.ios0s5.disable_bia
mds# slmctl -p resources.$SITE.ios0s6.disable_bia
Or simply:
mds# slmctl -p resources.$SITE | grep 'ios0.*disable_bia'
To make the setting persistent, modify the SLASH2 configuration file, normally stored in projects/inf/$DEPLOYMENT/slcfg, and add the following to the desired sliod entry:
flags = "disable_bia";
Be sure to commit any changes to the deployment configuration after such a change and to push the updated configuration out to all relevant hosts.
