FlashCache System Administration Guide
--------------------------------------
Introduction :
============
Flashcache is a block cache for Linux, built as a kernel module,
using the Device Mapper. Flashcache supports writeback, writethrough
and writearound caching modes. This document is a quick administration
guide to flashcache.
Requirements :
============
Flashcache has been tested on a variety of kernels between 2.6.18 and 2.6.38.
If you'd like to build and use it on a newer kernel, please send me an email
and I can help. Kernels older than 2.6.18 are not supported.
Choice of Caching Modes :
=========================
Writethrough - safest, all writes are cached to ssd but also written to disk
immediately. If your ssd has slower write performance than your disk (likely
for early generation SSDs purchased in 2008-2010), this may limit your system
write performance. All disk reads are cached (tunable).
Writearound - again, very safe, writes are not written to ssd but directly to
disk. Disk blocks will only be cached after they are read. All disk reads
are cached (tunable).
Writeback - fastest but less safe. Writes only go to the ssd initially, and
based on various policies are written to disk later. All disk reads are
cached (tunable).
Writeonly - variant of writeback caching. In this mode, only incoming writes
are cached. No reads are ever cached.
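The mode is selected at cache creation time via the -p option of
flashcache_create (documented below). As a quick sketch, using the same
hypothetical devices as the examples later in this guide
(/dev/sdc = ssd, /dev/sdb = disk; writeonly is inferred here as
writeback plus the -w flag described below):
flashcache_create -p thru cachedev /dev/sdc /dev/sdb      (writethrough)
flashcache_create -p around cachedev /dev/sdc /dev/sdb    (writearound)
flashcache_create -p back cachedev /dev/sdc /dev/sdb      (writeback)
flashcache_create -p back -w cachedev /dev/sdc /dev/sdb   (writeonly)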
Cache Persistence :
=================
Writethrough and Writearound caches are not persistent across a device removal
or a reboot. Only Writeback caches are persistent across device removals
and reboots. This reinforces 'writeback is fastest', 'writethrough is safest'.
Known Bugs :
============
See https://github.com/facebook/flashcache/issues and please report new issues there.
Data corruption has been reported when using a loopback device for the cache device.
See also the 'Futures and Features' section of the design document, flashcache-doc.txt.
Cache creation and loading using the flashcache utilities :
=========================================================
Included are 3 utilities - flashcache_create, flashcache_load and
flashcache_destroy. These utilities use dmsetup internally, presenting
a simpler interface to create, load and destroy flashcache volumes.
It is expected that the majority of users can use these utilities
instead of using dmsetup.
flashcache_create : Create a new flashcache volume.
flashcache_create [-v] [-f] -p back|around|thru [-w] [-s cache size] [-b block size] [-m md block size] [-d disk associativity] cachedevname ssd_devname disk_devname
-v : verbose.
-p : cache mode (writeback/writethrough/writearound).
-s : cache size. Optional. If this is not specified, the entire ssd device
is used as cache. The default unit is sectors, but you can specify
k/m/g as units as well.
-b : block size. Optional. Defaults to 4KB. Must be a power of 2.
The default unit is sectors, but you can specify k as units as well.
(A 4KB blocksize is the correct choice for the vast majority of
applications. But see the section "Cache Blocksize selection" below.)
-f : force create. Bypass checks (e.g. for ssd sector size).
-w : write cache mode. Only writes are cached, not reads.
-d : disk associativity. Within each cache set, we store several contiguous
disk extents. Defaults to off.
Examples :
flashcache_create -p back -s 1g -b 4k cachedev /dev/sdc /dev/sdb
Creates a 1GB writeback cache volume with a 4KB block size on ssd
device /dev/sdc to cache the disk volume /dev/sdb. The name of the device
created is "cachedev".
flashcache_create -p thru -s 2097152 -b 8 cachedev /dev/sdc /dev/sdb
Same as above but creates a write through cache with units specified in
sectors instead. The name of the device created is "cachedev".
flashcache_load : Load an existing writeback cache volume.
flashcache_load ssd_devname [cachedev_name]
Example :
flashcache_load /dev/sdc
Load the existing writeback cache on /dev/sdc, using the virtual
cachedev_name from when the device was created. If you're upgrading from
an older flashcache device format that didn't store the cachedev name
internally, or you want to change the cachedev name in use, you can specify
it as an optional second argument to flashcache_load.
For writethrough and writearound caches flashcache_load is not needed; flashcache_create
should be used each time.
flashcache_destroy : Destroy an existing writeback flashcache. All data will be lost !!!
flashcache_destroy ssd_devname
Example :
flashcache_destroy /dev/sdc
Destroy the existing cache on /dev/sdc. All data is lost !!!
For writethrough and writearound caches this is not necessary.
Removing a flashcache volume :
============================
Use dmsetup remove to remove a flashcache volume. For writeback
cache mode, the default behavior on a remove is to clean all dirty
cache blocks to disk. The remove will not return until all blocks
are cleaned. Progress on disk cleaning is reported on the console
(also see the "fast_remove" flashcache sysctl).
A reboot of the node will also result in all dirty cache blocks being
cleaned synchronously (again see the note about "fast_remove" in the
sysctls section).
For writethrough and writearound caches, the device removal or reboot
results in the cache being destroyed. However, there is no harm in
doing a 'dmsetup remove' to tidy up before a reboot, and indeed this
will be needed if you ever want to unload the flashcache kernel
module (for example, to load a new version into a running system).
Example:
dmsetup remove cachedev
This removes the flashcache volume named cachedev, cleaning
all dirty blocks to disk prior to removal.
Cache Stats :
===========
Use 'dmsetup status' for cache statistics.
'dmsetup table' also dumps a number of cache related statistics.
Examples :
dmsetup status cachedev
dmsetup table cachedev
Flashcache errors are reported in
/proc/flashcache/<cache name>/flashcache_errors
Flashcache stats are also reported in
/proc/flashcache/<cache name>/flashcache_stats
for easier parsing.
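For example, assuming a cache whose name follows the <ssd>+<disk>
convention used in the sysctl examples later in this guide (the exact
name depends on your devices):
cat /proc/flashcache/sdc+sdb/flashcache_stats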
Using Flashcache sysVinit script (Redhat based systems):
=======================================================
Note that this section only applies to Red Hat based systems. Use
'utils/flashcache' from the repository as the sysvinit script.
This script loads, unloads and reports statistics for an existing
flashcache writeback cache volume. It loads the already created cachedev
during system boot and removes the flashcache volume before the system
halts. The script is necessary because a kernel panic occurs if a
flashcache volume is not removed before the system halts.
Configuring the script using chkconfig:
1. Copy 'utils/flashcache' from the repo to '/etc/init.d/flashcache'
2. Make sure this file has execute permissions,
'sudo chmod +x /etc/init.d/flashcache'.
3. Edit this file and specify values for the following variables
(an illustrative set of values is sketched after this list):
SSD_DISK, BACKEND_DISK, CACHEDEV_NAME, MOUNTPOINT, FLASHCACHE_NAME
4. Modify the headers in the file if necessary.
By default, it starts in runlevel 3, with start-stop priority 90-10
5. Register this file using chkconfig
'chkconfig --add /etc/init.d/flashcache'
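For illustration, a hypothetical set of values for step 3 (edit these
to match your own devices and mountpoint; the flashcache name is assumed
here to follow the <ssd>+<disk> convention used by the sysctls below):
SSD_DISK=/dev/sdc
BACKEND_DISK=/dev/sdb
CACHEDEV_NAME=cachedev
MOUNTPOINT=/data
FLASHCACHE_NAME=sdc+sdb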
Cache Blocksize selection :
=========================
Cache blocksize selection is critical for good cache utilization and
performance.
A 4KB cache blocksize is the right choice for the vast majority of
workloads (and filesystems).
Cache Metadata Blocksize selection :
==================================
This section only applies to the writeback cache mode. Writethrough and
writearound modes store no cache metadata at all.
In Flashcache version 1, the metadata blocksize was fixed at 1 (512b) sector.
Flashcache version 2 removes this limitation. In version 2, we can configure
a larger flashcache metadata blocksize. Version 2 maintains backwards compatibility
for caches created with Version 1. For these cases, a metadata blocksize of 512
will continue to be used.
flashcache_create -m can be used to optionally configure the metadata blocksize.
Defaults to 4KB.
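For example, a sketch of creating a writeback cache with an 8KB
metadata blocksize (device names hypothetical, as in the earlier
examples):
flashcache_create -p back -s 1g -b 4k -m 8k cachedev /dev/sdc /dev/sdb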
Ideal choices for the metadata blocksize are 4KB (default) or 8KB. There is
little benefit to choosing a metadata blocksize greater than 8KB. The choice
of metadata blocksize is subject to the following rules :
1) Metadata blocksize must be a power of 2.
2) Metadata blocksize cannot be smaller than sector size configured on the
ssd device.
3) A single metadata block cannot contain metadata for 2 cache sets. In other
words, with the default associativity of 512 (with each cache metadata slot
sizing at 16 bytes), the entire metadata for a given set fits in 8KB (512*16b).
For an associativity of 512, we cannot configure a metadata blocksize greater
than 8KB.
Advantages of choosing a larger (than 512b) metadata blocksize :
- Allows the ssd to be configured to larger sectors. For example, some ssds
allow choosing a 4KB sector, often a more performant choice.
- Allows flashcache to batch metadata updates more effectively,
potentially reducing the number of small ssd writes. This lowers
write amplification and improves ssd lifetime.
Thanks are due to Earle Philhower of Virident for this feature!
FlashCache Sysctls :
==================
Flashcache sysctls operate on a per-cache device basis. A couple of examples
first.
Sysctls for a writearound or writethrough mode cache :
cache device /dev/ram3, disk device /dev/ram4
dev.flashcache.ram3+ram4.cache_all = 1
dev.flashcache.ram3+ram4.zero_stats = 0
dev.flashcache.ram3+ram4.reclaim_policy = 0
dev.flashcache.ram3+ram4.pid_expiry_secs = 60
dev.flashcache.ram3+ram4.max_pids = 100
dev.flashcache.ram3+ram4.do_pid_expiry = 0
dev.flashcache.ram3+ram4.io_latency_hist = 0
dev.flashcache.ram3+ram4.skip_seq_thresh_kb = 0
Sysctls for a writeback mode cache :
cache device /dev/sdb, disk device /dev/cciss/c0d2
dev.flashcache.sdb+c0d2.fallow_delay = 900
dev.flashcache.sdb+c0d2.fallow_clean_speed = 2
dev.flashcache.sdb+c0d2.cache_all = 1
dev.flashcache.sdb+c0d2.fast_remove = 0
dev.flashcache.sdb+c0d2.zero_stats = 0
dev.flashcache.sdb+c0d2.reclaim_policy = 0
dev.flashcache.sdb+c0d2.pid_expiry_secs = 60
dev.flashcache.sdb+c0d2.max_pids = 100
dev.flashcache.sdb+c0d2.do_pid_expiry = 0
dev.flashcache.sdb+c0d2.max_clean_ios_set = 2
dev.flashcache.sdb+c0d2.max_clean_ios_total = 4
dev.flashcache.sdb+c0d2.dirty_thresh_pct = 20
dev.flashcache.sdb+c0d2.stop_sync = 0
dev.flashcache.sdb+c0d2.do_sync = 0
dev.flashcache.sdb+c0d2.io_latency_hist = 0
dev.flashcache.sdb+c0d2.skip_seq_thresh_kb = 0
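These are regular sysctls and can be read and set with the sysctl
command. For example, to enable the IO latency histogram on the
writeback cache above (device pair hypothetical):
sysctl -w dev.flashcache.sdb+c0d2.io_latency_hist=1
sysctl dev.flashcache.sdb+c0d2.io_latency_hist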
Sysctls common to all cache modes :
dev.flashcache.<cachedev>.cache_all:
Global caching mode to cache everything or cache nothing.
See section on Caching Controls. Defaults to "cache everything".
dev.flashcache.<cachedev>.zero_stats:
Zero stats (once).
dev.flashcache.<cachedev>.reclaim_policy:
FIFO (0) vs LRU (1). Defaults to FIFO. Can be switched at
runtime.
dev.flashcache.<cachedev>.io_latency_hist:
Compute IO latencies and plot these out on a histogram.
The scale is 250 usecs. This is disabled by default since
internally flashcache uses gettimeofday() to compute latency
and this can get expensive depending on the clocksource used.
Setting this to 1 enables computation of IO latencies.
The IO latency histogram is appended to 'dmsetup status'.
(There is little reason to tune these)
dev.flashcache.<cachedev>.max_pids:
Maximum number of pids in the white/black lists.
dev.flashcache.<cachedev>.do_pid_expiry:
Enable expiry on the list of pids in the white/black lists.
dev.flashcache.<cachedev>.pid_expiry_secs:
Set the expiry on the pid white/black lists.
dev.flashcache.<cachedev>.skip_seq_thresh_kb:
Skip (don't cache) sequential IO larger than this number (in kb).
0 (default) means cache all IO, both sequential and random.
Sequential IO can only be determined 'after the fact', so
this much of each sequential I/O will be cached before we skip
the rest. Does not affect searching for IO in an existing cache.
Sysctls for writeback mode only :
dev.flashcache.<cachedev>.fallow_delay = 900
In seconds. Clean dirty blocks that have been "idle" (not
read or written) for fallow_delay seconds. Default is 15
minutes.
Setting this to 0 disables idle cleaning completely.
dev.flashcache.<cachedev>.fallow_clean_speed = 2
The maximum number of "fallow clean" disk writes per set
per second. Defaults to 2.
dev.flashcache.<cachedev>.fast_remove = 0
Don't sync dirty blocks when removing cache. On a reload
both DIRTY and CLEAN blocks persist in the cache. This
option can be used to do a quick cache remove.
CAUTION: The cache still has uncommitted (to disk) dirty
blocks after a fast_remove.
dev.flashcache.<cachedev>.dirty_thresh_pct = 20
Flashcache will attempt to keep the dirty blocks in each set
under this %. A lower dirty threshold increases disk writes,
and reduces block overwrites, but increases the blocks
available for read caching.
dev.flashcache.<cachedev>.stop_sync = 0
Stop the sync in progress.
dev.flashcache.<cachedev>.do_sync = 0
Schedule cleaning of all dirty blocks in the cache.
(There is little reason to tune these)
dev.flashcache.<cachedev>.max_clean_ios_set = 2
Maximum writes that can be issued per set when cleaning
blocks.
dev.flashcache.<cachedev>.max_clean_ios_total = 4
Maximum writes that can be issued when syncing all blocks.
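As a sketch of how the one-shot writeback sysctls above are used
(device pair hypothetical): writing 1 to do_sync schedules cleaning of
all dirty blocks, and writing 1 to stop_sync halts a sync in progress:
sysctl -w dev.flashcache.sdb+c0d2.do_sync=1
sysctl -w dev.flashcache.sdb+c0d2.stop_sync=1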
Using dmsetup to create and load flashcache volumes :
===================================================
Few users will need to use dmsetup natively to create and load
flashcache volumes. This section covers that.
dmsetup create device_name table_file
where
device_name: name of the flashcache device being created or loaded.
table_file : other cache args (format below). If this is omitted, dmsetup
attempts to read this from stdin.
table_file format :
0 <disk dev sz in sectors> flashcache <disk dev> <ssd dev> <dm virtual name> <cache mode> <flashcache cmd> <blksize in sectors> [size of cache in sectors] [cache set size]
cache mode:
1: Write Back
2: Write Through
3: Write Around
flashcache cmd:
1: load existing cache
2: create cache
3: force create cache (overwriting existing cache). USE WITH CAUTION
blksize in sectors:
4KB (8 sectors, PAGE_SIZE) is the right choice for most applications.
See note on block size selection below.
Unused (can be omitted) for cache loads.
size of cache in sectors:
Optional. If the size is not specified, the entire ssd device is used
as cache. Needs to be a power of 2.
Unused (can be omitted) for cache loads.
cache set size:
Optional. The default set size is 512, which works well for most
applications. Little reason to change this. Needs to be a
power of 2.
Unused (can be omitted) for cache loads.
Example :
echo 0 `blockdev --getsize /dev/cciss/c0d1p2` flashcache /dev/cciss/c0d1p2 /dev/fioa2 cachedev 1 2 8 522000000 | dmsetup create cachedev
This creates a writeback cache device called "cachedev" (/dev/mapper/cachedev)
with a 4KB blocksize to cache /dev/cciss/c0d1p2 on /dev/fioa2.
The size of the cache is 522000000 sectors.
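For comparison, a sketch of loading that same writeback cache at a
later time (flashcache cmd 1); per the format above, the blocksize and
cache size parameters can be omitted for loads:
echo 0 `blockdev --getsize /dev/cciss/c0d1p2` flashcache /dev/cciss/c0d1p2 /dev/fioa2 cachedev 1 1 | dmsetup create cachedev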
(TODO : Change loading of the cache to happen via "dmsetup load" instead
of "dmsetup create".)
Caching Controls
================
Flashcache can be put in one of 2 modes - Cache Everything or
Cache Nothing (dev.flashcache.<cachedev>.cache_all). The default is to
"cache everything".
These 2 modes have a blacklist and a whitelist.
The tgid (thread group id) for a group of pthreads can be used as a
shorthand to tag all threads in an application. The tgid for a pthread
is returned by getpid() and the pid of the individual thread is
returned by gettid().
The algorithm works as follows :
In "cache everything" mode,
1) If the pid of the process issuing the IO is in the blacklist, do
not cache the IO. ELSE,
2) If the tgid is in the blacklist, don't cache this IO. UNLESS
3) The particular pid is marked as an exception (and entered in the
whitelist, which makes the IO cacheable).
4) Finally, even if IO is cacheable up to this point, skip sequential IO
if configured by the sysctl.
Conversely, in "cache nothing" mode,
1) If the pid of the process issuing the IO is in the whitelist,
cache the IO. ELSE,
2) If the tgid is in the whitelist, cache this IO. UNLESS
3) The particular pid is marked as an exception (and entered in the
blacklist, which makes the IO non-cacheable).
4) Anything whitelisted is cached, regardless of sequential or random
IO.
Examples :
--------
1) You can make the global cache setting "cache nothing", and add the
tgid of your pthreaded application to the whitelist, which makes only
IOs issued by your application cacheable by Flashcache.
2) You can make the global cache setting "cache everything" and add
tgids (or pids) of other applications that may issue IOs on this
volume to the blacklist, which will make those un-interesting IOs not
cacheable.
Note that this only works for O_DIRECT IOs. For buffered IOs, pdflush
and kswapd would also issue writes, and flashcache would cache those.
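As a partial sketch of example (1) above (device pair hypothetical),
the global setting is switched to "cache nothing" with the cache_all
sysctl; the application's tgid must then be added to the whitelist via
the FLASHCACHEADDWHITELIST ioctl listed below:
sysctl -w dev.flashcache.sdb+c0d2.cache_all=0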
The following cacheability ioctls are supported on /dev/mapper/<cachedev>
FLASHCACHEADDBLACKLIST: add the pid (or tgid) to the blacklist.
FLASHCACHEDELBLACKLIST: Remove the pid (or tgid) from the blacklist.
FLASHCACHEDELALLBLACKLIST: Clear the blacklist. This can be used to
clean up if a process dies.
FLASHCACHEADDWHITELIST: add the pid (or tgid) to the whitelist.
FLASHCACHEDELWHITELIST: Remove the pid (or tgid) from the whitelist.
FLASHCACHEDELALLWHITELIST: Clear the whitelist. This can be used to
clean up if a process dies.
/proc/flashcache_pidlists shows the list of pids on the whitelist
and the blacklist.
Security Note :
=============
With Flashcache, it is possible for a malicious user process to
corrupt data in files with only read access. In a future revision
of flashcache, this will be addressed (with an extra data copy).
We will not document the mechanics of how a malicious process could
corrupt data here.
You can work around this by setting file permissions on files in
the flashcache volume appropriately.
Why is my cache only (<< 100%) utilized ?
=======================================
(Answer contributed by Will Smith)
- There is essentially a 1:many mapping between SSD blocks and HDD blocks.
- In more detail, a HDD block gets hashed to a set on SSD which contains by
default 512 blocks. It can only be stored in that set on SSD, nowhere else.
So with a simplified SSD containing only 3 sets:
SSD = 1 2 3 , and a HDD with 9 sets worth of data, the HDD sets would map to the SSD
sets like this:
HDD: 1 2 3 4 5 6 7 8 9
SSD: 1 2 3 1 2 3 1 2 3
So if your data only happens to live in HDD sets 1 and 4, they will compete for
SSD set 1 and your SSD will at most become 33% utilized.
If you use XFS you can tune the XFS agsize/agcount to try and mitigate this
(described next section).
Tuning XFS for better flashcache performance :
============================================
If you run XFS/Flashcache, it is worth tuning XFS' allocation group
parameters (agsize/agcount) to achieve better flashcache performance.
XFS allocates blocks for files in a given directory in a new
allocation group. By tuning agsize and agcount (mkfs.xfs parameters),
we can achieve much better distribution of blocks across
flashcache. Better distribution of blocks across flashcache will
decrease collisions on flashcache sets considerably, increase cache
hit rates significantly and result in lower IO latencies.
We can achieve this by computing agsize (and implicitly agcount) using
these equations,
C = Cache size,
V = Size of filesystem Volume.
agsize % C = (1/agcount)*C
agsize * agcount ~= V
where agsize <= 1000g (XFS limits on agsize).
A couple of examples that illustrate the formula,
For agcount = 4, let's divide up the cache into 4 equal parts (each
part is size C/agcount). Let's call the parts C1, C2, C3, C4. One
ideal way to map the allocation groups onto the cache is as follows.
Ag1 Ag2 Ag3 Ag4
-- -- -- --
C1 C2 C3 C4 (stripe 1)
C2 C3 C4 C1 (stripe 2)
C3 C4 C1 C2 (stripe 3)
C4 C1 C2 C3 (stripe 4)
C1 C2 C3 C4 (stripe 5)
In this simple example, note that each "stripe" has 2 properties
1) Each element of the stripe is a unique part of the cache.
2) The union of all the parts for a stripe gives us the entire cache.
Clearly, this is an ideal mapping, from a distribution across the
cache point of view.
Another example, this time with agcount = 5, the cache is divided into
5 equal parts C1, .. C5.
Ag1 Ag2 Ag3 Ag4 Ag5
-- -- -- -- --
C1 C2 C3 C4 C5 (stripe 1)
C2 C3 C4 C5 C1 (stripe 2)
C3 C4 C5 C1 C2 (stripe 3)
C4 C5 C1 C2 C3 (stripe 4)
C5 C1 C2 C3 C4 (stripe 5)
C1 C2 C3 C4 C5 (stripe 6)
A couple of examples that compute the optimal agsize for a given
Cachesize and Filesystem volume size.
a) C = 600g, V = 3.5TB
Consider agcount = 5
agsize % 600 = (1/5)*600
agsize % 600 = 120
So an agsize of 720g would work well, and 720*5 = 3.6TB (~ 3.5TB)
b) C = 150g, V = 3.5TB
Consider agcount=4
agsize % 150 = (1/4)*150
agsize % 150 = 37.5
So an agsize of 937g would work well, and 937*4 = 3.7TB (~ 3.5TB)
As an alternative,
agsize % C = (1 - (1/agcount))*C
agsize * agcount ~= V
This works just as well as the formula above.
This computation has been implemented in the utils/get_agsize utility.
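Once an agsize has been chosen, it is passed to mkfs.xfs when the
filesystem is created. A minimal sketch, using the agsize from example
(a) above and a hypothetical flashcache device:
mkfs.xfs -d agsize=720g /dev/mapper/cachedev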
Tuning Sequential IO Skipping for better flashcache performance
===============================================================
Skipping sequential IO makes sense in two cases:
1) The sequential write speed of your SSD is slower than
the sequential write speed or read speed of your disk. In
particular, for implementations with RAID disks (especially
modes 0, 10 or 5) sequential reads may be very fast. If
'cache_all' mode is used, every disk read miss must also be
written to SSD. If you notice slower sequential reads and writes
after enabling flashcache, this is likely your problem.
2) Your 'resident set' of disk blocks that you want cached, i.e.
those that you would hope to keep in cache, is smaller
than the size of your SSD. You can check this by monitoring
how quickly your cache fills up ('dmsetup table'). If this
is the case, it makes sense to prioritize caching of random IO,
since SSD performance vastly exceeds disk performance for
random IO, but is typically not much better for sequential IO.
In the above cases, start with a high value (say 1024k) for the
sysctl dev.flashcache.<device>.skip_seq_thresh_kb, so only the
largest sequential IOs are skipped, and gradually reduce it
if benchmarks show it's helping. If it does not help, don't leave it
set to a very high value; return it to 0 (the default), since there is
some overhead in categorizing IO as random or sequential.
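For example, to skip sequential IOs larger than 1024kb (device pair
hypothetical):
sysctl -w dev.flashcache.sdb+c0d2.skip_seq_thresh_kb=1024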
If neither of the above holds, continue to cache all IO
(the default); you will likely benefit from it.
Further Information
===================
Git repository : https://github.com/facebook/flashcache/
Developer mailing list : http://groups.google.com/group/flashcache-dev/