
Administering a SLASH2 Deployment

This document describes the various administrative operations involved in a SLASH2 deployment.

2. Metadata FS Snapshots and Backups

This section outlines basic instructions for getting familiar with ZFS snapshotting capabilities. This is important for a SuperCell deployment because the metadata file system (MDFS) is actually a zpool, so backing up, restoring, and related operations are all necessary parts of properly administering it.

While slashd is running, zfs and zpool commands will work the same way as one would expect under any other ZFS installation.

2.1. Terminology

A snapshot in ZFS is a near-instantaneous, copy-on-write clone of the entire zpool at a given point in time. Each snapshot has a name and can be exported into a raw stream for later recovery. A snapshot stream is the raw file system data that has been exported via the zfs send command.

There are two types of snapshot streams: full and incremental (or “partial”). A full snapshot stream contains the entire contents of the zpool at a certain point in time, serialized into a single file, which may be compressed to save space if desired. (SLASH2 metadata is highly compressible, so compression with a parallel tool such as pbzip2 or pxz is advised.) An incremental snapshot stream is made relative to an existing snapshot, similar to a logical diff of file system changes.

2.2. Snapshotting a ZFS file system

A snapshot can be made of the entire MDFS reflecting the current moment in time:

# zfs snapshot s2_mdfs@snaptest0

Again, snapshots in ZFS apply to the entire file system, i.e. they cannot be made on a single subdirectory tree or individual file. The state for a snapshot is held internally inside the zpool. The zfs send and zfs recv commands can be used to externalize a snapshot or re-internalize it (i.e. import it and fully recover its state into the live file system). More details on how snapshots work are available in a variety of ZFS documentation and guides.

2.3. Listing ZFS snapshots

All snapshots held in a zpool can be listed:

# zfs list -t snapshot

NAME                 USED  AVAIL  REFER  MOUNTPOINT
s2_mdfs@snaptest0    0     -      64.5K  -

2.4. Exporting a ZFS snapshot to a stream file

To externalize a snapshot into a stream, an action necessary for proper backup maintenance, run:

# zfs send s2_mdfs@snaptest0 > \
  /local/snapshots/full/s2_mdfs@snaptest0.zsnap

# ls -lFah /local/snapshots/full/s2_mdfs@snaptest0.zsnap
-rw-r--r--  1 root  wheel    80k Jun  8 19:42 s2_mdfs@snaptest0.zsnap

Since the metadata comprises everything except the actual user data, its footprint is several orders of magnitude smaller than the SuperCell data volume. Still, the space can grow, so it is advisable to compress the streams. Snappy, bzip2, xz, etc. and the parallel versions of these tools are good additions to the toolchain.
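
As a sketch, the send can be piped straight through a parallel compressor so the uncompressed stream never touches disk (the output path and the choice of pbzip2 here are merely examples):

# zfs send s2_mdfs@snaptest0 | pbzip2 -c > \
  /local/snapshots/full/s2_mdfs@snaptest0.zsnap.bz2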

The authors advise creating a periodic (cron) process to automatically create these streams and back them up on other hosts for file system analysis and disaster recovery.
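
A minimal sketch of such a cron-driven script follows; the snapshot naming scheme, paths, compressor, and backup host are all assumptions to be adapted to the local deployment:

#!/bin/sh
# snapshot the MDFS, export a compressed full stream, and copy it off-host
pool=s2_mdfs
snap=$(date +%Y%m%d)
zfs snapshot $pool@$snap
zfs send $pool@$snap | pbzip2 -c > /local/snapshots/full/$pool@$snap.zsnap.bz2
scp /local/snapshots/full/$pool@$snap.zsnap.bz2 backuphost:/backups/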

2.5. Incremental snapshot stream for MDFS mirroring nodes

When synchronizing clones of ZFS file systems, the ZFS incremental snapshot feature can be leveraged instead of serializing full snapshots, reducing export, transfer, and import time and resources. An incremental snapshot stream is made relative to another snapshot. This stream can then be copied to other machines and imported to bring them up to date:

# snapfn=$snapdir/partial/$pool@$CURR_SNAP.zsnap
# zfs send -i $PREV_SNAP $pool@$CURR_SNAP > $snapfn
# scp $snapfn destination:$snapfn
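
On the destination machine, the copied stream is applied on top of the matching previous snapshot; a sketch of the receiving side (same variable names as above) is:

destination# zfs recv -F $pool < $snapfn      # -F rolls back changes made since $PREV_SNAP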

2.6. Verifying a snapshot stream

The zstreamdump(8) command may be used to verify that a ZFS snapshot stream was generated properly and can be used to rebuild a file system from backup (in ZFS terminology: “received”).
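
For example, the stream file generated in section 2.4 can be checked without actually receiving it (decompress it first if it was compressed):

# zstreamdump -v < /local/snapshots/full/s2_mdfs@snaptest0.zsnap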

2.7. Recovering a zpool from a snapshot stream

For disaster recovery, the MDFS zpool needs to be rebuilt and the data imported back in from a recent full snapshot stream:

# slmctl stop                                        # kill SLASH2 slashd
# zfs-fuse
# zpool create s2_mdfs raidz3 /dev/disk/by-id/ata-...
# zpool set compression…                             # any FS options
# decompress /local/snapshots/full/s2_mdfs@snaptest0.zsnap | \
  zfs recv -F s2_mdfs

Once this is complete, unmount /s2_mdfs, kill zfs-fuse, and relaunch slashd.

3. System Activity Logging

3.1. Log sources

All machines participating in the SuperCell deployment (MDS, IOS, and clients) should send all of their syslog messages to “syslog servers” to make aggregate log processing easier.

Non-dedicated client machines also intending to mount the SuperCell should send at least the daemon syslog messages, which should include all activity logged by the SLASH2 mount_slash client software.
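
As a sketch, assuming rsyslog is used on the host, forwarding can be configured with a couple of rules (the configuration file name is arbitrary; arclog0/arclog1 are the PSC log servers described in section 3.3):

client# cat >> /etc/rsyslog.d/50-slash2.conf <<'EOF'
daemon.*    @@arclog0.psc.edu
daemon.*    @@arclog1.psc.edu
EOF

Restart the syslog daemon afterwards for the change to take effect.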

Note: the @ denotes UDP for the transport whereas @@ denotes TCP. TCP is recommended but may not be supported by all systems.

3.2. Log activity

A periodic test to ensure healthy service, e.g. a daily file access (read/write) cronjob, is helpful, as each component will then register baseline activity that shows up in the log processing system.
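
A sketch of such a daily cron check on a client is shown below; the mountpoint and file locations are assumptions:

#!/bin/sh
# exercise the SLASH2 mount with a small write and read-back, logging the result
mnt=/s2                          # SLASH2 client mountpoint (assumption)
f=$mnt/.health/$(hostname)
mkdir -p $mnt/.health
echo "ok $(date)" > $f && grep -q ok $f \
    && logger -p daemon.info "slash2 check: ok" \
    || logger -p daemon.err "slash2 check: FAILED"
rm -f $f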

The SLASH2 source code tree contains utils/slash2_check.py, which performs a basic system health check and sends reports to a Nagios server.

3.3. Log servers

We have two log servers: arclog0.psc.edu and arclog1.psc.edu. Log messages should be sent to BOTH servers so that we don't lose them (e.g. in case one server is down).

Logs for the archiver are kept in /var/log/arc on the log servers:

  • messages from mount_slash client daemons are sorted into /var/log/arc/mount_slash
  • logs from the metadata server(s) are stored in /var/log/arc/mdserver
  • logs from the I/O servers are stored in /var/log/arc/ioserver
  • logs from all firehoses are stored in /var/log/arc/client

Log rotation:

  • ioserver and mount_slash logs are rotated daily (since weekly logs get too large)
  • other logs are rotated weekly
  • to rotate any log daily, add the log filename to /etc/logrotate.d/arclog

Log archiving: logs are copied to the archiver directory /arc/logs/server/arclog(0,1). Changes can be made in /SYSTEM.CONF to set the copies to DAILY from WEEKLY.

4. Reporting

4.1. File System Usage Reporting

You may find that you want to track and report various aspects of your file system, including the current state of the system as well as the daily change (via syslog).

4.2. Current state reporting

The dumpfid tool included in the SLASH2 distribution traverses the file system and reports the attributes of each file encountered. It can return the uid, gid, file size, timestamp and permissions of each file in the file system.

As an example, one user directory is scanned with three worker threads to exploit parallelism provided by the underlying storage hardware (triple mirrored SSDs):

# dumpfid -t 3 -O /local/reports/${user}_output_files.%n -RF \
  '%g %u %s %Tm<%Y-%m-%dT%H:%M:%S> %m %f'$'\n' /s2_mdfs_mirror/$user

4.3. Log Reporting

You can use syslog data with the Simple Event Correlator (SEC) to track what the file system is doing in real time: reads, writes, deletes, etc.
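
For example, SEC (see section 5.2) can be pointed at the aggregated client logs with a site-specific ruleset; the paths and rule file below are assumptions:

# sec --conf=/etc/sec/slash2.rules --input=/var/log/arc/mount_slash/current.log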

Sample Reports (see Appendix A for a sample PSC report):

  • Top 10 readers and writers by day
  • All Hosts, Data/File Count Read and Written
  • Total Daily Growth
  • Current Size/Space Available
  • MDS Disk Capacity
  • Compression Ratio
  • Current state of file system: usage by group, user, file count, GB used

5. Monitoring

5.1. Nagios

Nagios is an extensible monitoring framework. A Nagios deployment consists of two components:

  • A Nagios server: a collector of service reports and a service health monitor. Configuration of each node and service is required. The default Web-based monitoring interface may be used, but many alternative packages are available, such as Thruk, which provide versatility in what is deemed important to monitor.
  • Nagios clients: each service reports details on its health to the server.

Each component in a SLASH2 deployment may use the slash2_check.py script included in the SLASH2 source distribution. The script is intended to be run periodically from a cronjob, such as every 5 minutes, and requires only Python and the send_nsca Nagios client utility.
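
A sketch of the corresponding crontab entry (the installed path of the script is an assumption):

# crontab -l | grep slash2_check
*/5 * * * * /usr/local/slash2/utils/slash2_check.py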

5.2. Other Frameworks

SEC – Simple Event Correlator for reacting to log events - http://simple-evcorr.sourceforge.net/SEC-tutorial/article.html

Ganglia is a third-party, fairly feature-complete monitoring package that is easy to install and can provide a number of useful performance observability statistics: http://ganglia.sourceforge.net/

6. System Maintenance

6.1. IOS scrubbing

On the constituent I/O servers, periodic scrubbing of the zpools is recommended. Scrubbing in ZFS is a process in which the file system itself scans and reads back each piece of stored data to verify its integrity; any consistency errors discovered during this process are repaired.

The command zpool scrub $poolname initiates this process.
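
For example, a monthly scrub can be scheduled from cron on each I/O server (the pool name and schedule are assumptions):

ios# crontab -l | grep scrub
0 3 1 * * /sbin/zpool scrub s2_ios_pool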

6.2. IOS/MDFS backend file system degradation detection

When errors are encountered on the hardware underlying the zpools, ZFS reports such conditions via zpool status. It is essential to have a monitoring/notification framework in place to alert administrators to take action on failing disks, to avoid further degradation and potential data loss.
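
A minimal sketch of such a check, suitable for cron or for wiring into the Nagios setup described in section 5 (the notification address is a placeholder):

#!/bin/sh
# alert if any pool is unhealthy; `zpool status -x` prints
# "all pools are healthy" when there is nothing to report
st=$(zpool status -x)
[ "$st" = "all pools are healthy" ] || \
    echo "$st" | mail -s "zpool degraded on $(hostname)" admin@example.com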

6.3. Downtime

To bring the system fully offline, all services must be stopped. Bringing down only part of the deployment at a time will merely cripple the service. Generally, the client service (mount_slash) on all hosts is stopped first, then the I/O servers, then finally the MDS servers.

6.3.1. Stopping the Client Service

To stop the SLASH2 client service mount_slash:

client# pkill mount_slash

If a client host is mounting multiple SLASH2 file systems, be careful as this command will indiscriminately stop all SLASH2 client services.
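
If only one client instance should be stopped, the process can instead be matched by its mountpoint, assuming the mountpoint appears on the mount_slash command line (the path below is hypothetical):

client# pkill -f 'mount_slash.*/s2mnt'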

6.3.2. Stopping the I/O Service

To stop the SLASH2 I/O service sliod:

ios# slictl stop

If an IOS is running multiple instances of the sliod SLASH2 I/O service, be careful to select which daemon to stop.

The utils/slictlN script is included in the SLASH2 distribution to direct control requests to a specific daemon instance.

6.3.3. Stopping the MDS Service

It is generally recommended to perform a snapshot and backup of the metadata before stopping the MDS service, in case a system change renders the service unavailable.

mds# slmctl stop

6.3.4. Administratively disabling certain I/O systems

Short term/temporary (for the lifetime of one daemon instance)

To disable write I/O lease assignments from being made to a particular I/O system:

mds# slmctl -p resources.$SITE.$IOS_RESOURCE.disable_bia=1

"bia" here stands for "bmap <-> IOS assignment". "IOS" here means I/O system (i.e. a sliod instance). In SLASH2, an assignment means write I/O is bound to a particular I/O system. Note that read I/O leases are not bound to any particular IOS and the client may choose which IOS to access the desired data.

To determine whether write lease assignments are enabled for any sliod instance on the ios0 server, query each sliod instance:

mds# slmctl -p resources.$SITE.ios0s0.disable_bia
mds# slmctl -p resources.$SITE.ios0s1.disable_bia
mds# slmctl -p resources.$SITE.ios0s2.disable_bia
mds# slmctl -p resources.$SITE.ios0s3.disable_bia
mds# slmctl -p resources.$SITE.ios0s4.disable_bia
mds# slmctl -p resources.$SITE.ios0s5.disable_bia
mds# slmctl -p resources.$SITE.ios0s6.disable_bia

Or simply:

mds# slmctl -p resources.$SITE | grep 'ios0.*disable_bia'

Long term/permanent (persistent across daemon instances)

Modify the SLASH2 configuration file, normally stored in projects/inf/$DEPLOYMENT/slcfg, and add the following to the desired sliod entry:

flags = "disable_bia";

Remember to commit any changes to the deployment configuration and to push them out to all relevant hosts.
