4/20/2014

Disk Caching with SSDs on Linux and Windows

I recently built a desktop PC for both work and gaming. It had been a while since I last put one together, and I came across some interesting performance-related advances in storage technology, for both Linux and Windows, that I was previously unaware of.

The machine dual boots Debian GNU/Linux and Windows 8. I don't normally use Windows for anything, but it's a necessity for gaming. Most of my hardware choices for the build were fairly straightforward once I had made sure they all worked under both OSes: an Intel Core i5 CPU, 16GB RAM, an Nvidia GTX 760 GPU and an MSI Z87-G41 PC Mate motherboard. I wasn't sure how to handle storage though. I've been using SSDs in my laptop, but this machine needs a lot of room (modern games eat disk space) and I didn't want to go back to using slow spinning rust. The trouble is, 2TB of SSD (split across 4 disks) currently comes to about £1,000, which was way over budget for me.


It turns out that both Linux and Windows now have the ability to use an SSD as a cache for a slower disk. The Linux version is called BCache and has been available since kernel 3.10 (available in Debian Jessie, which is currently Testing). The Windows version is called Intel Smart Response and, unlike the Linux version, has some hardware dependencies as well (make sure your CPU and motherboard support it).

Both versions allow you to specify a slow backing disk and a fast cache disk. On a read, the caching layer fetches the data from the fast SSD if it has been cached previously; otherwise it reads from the slow backing disk. It then "intelligently" decides which data to add to the cache disk. Large sequential reads are not cached, because the point of the SSD cache is to make small reads fast, which is where seeking on spinning disks is slow. Both also offer an optional write cache. If you turn it on, writes to the device go to the SSD first for speed and are synced to the slow backing disk soon after.
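The read path described above can be sketched as a toy decision function. This is only an illustration of the idea, not how BCache or Intel Smart Response is actually implemented; the function name read_path and its arguments are made up for this example.

```shell
# Toy model of an SSD read cache decision; hypothetical, not bcache's real logic.
# Usage: read_path <cached: yes|no> <sequential_bytes> <cutoff_bytes>
read_path() {
    cached=$1
    seq_bytes=$2
    cutoff=$3
    if [ "$cached" = yes ]; then
        # Cache hit: the SSD answers and the spinning disk is not touched.
        echo "hit: read from SSD"
    elif [ "$seq_bytes" -ge "$cutoff" ]; then
        # Large sequential read: skip the cache, as described in the text.
        echo "miss: read from backing disk, bypass cache"
    else
        # Small or random read: the seek-heavy case the cache exists for.
        echo "miss: read from backing disk, then cache it"
    fi
}

read_path no 4096 4194304      # small read, not cached yet
read_path no 8388608 4194304   # 8MB sequential read, 4MB cutoff
read_path yes 4096 4194304     # same small read once it is cached
```

The three calls walk through a cold small read, a large sequential read and a warm small read, in that order.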

For my solution, I opted for four drives: two 1TB Seagate Barracuda spinning disks at £49 each and two 120GB (112GB usable) Kingston SSDNow V300 SSDs at £62 each, totalling £222. It's worth noting at this point that Intel Smart Response will use at most 64GB for its caching partition, so don't buy more space than you need. It will still let you use the rest of the SSD for other data if needed (Linux BCache in my case). Linux BCache has no such size restriction.

* SSD 1 is split into 22GB for the Linux OS and 90GB for the Windows C drive
* SSD 2 is split into 56GB for the Linux BCache cache device and 56GB for the Windows Intel Smart Response cache
* HDDs 3 and 4 are combined into a single 2TB RAID 0 striped device (fakeRAID), which is then split into two separate 1TB partitions: one for my Linux /data/ BTRFS partition and one for my Windows D: drive.
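
As a quick sanity check of the layout and budget above, the arithmetic can be written out in shell. The variable names are only for illustration; all figures come from the text.

```shell
# Sizes in GB and prices in GBP, taken from the text above.
ssd_usable=112
ssd1_used=$((22 + 90))        # Linux OS + Windows C: drive
ssd2_used=$((56 + 56))        # BCache cache + Smart Response cache
stripe_total=$((2 * 1000))    # two 1TB disks in a RAID 0 stripe
per_os_data=$((stripe_total / 2))
total_cost=$((2 * 49 + 2 * 62))

echo "SSD 1: ${ssd1_used}GB of ${ssd_usable}GB usable"
echo "SSD 2: ${ssd2_used}GB of ${ssd_usable}GB usable"
echo "Bulk data per OS: ${per_os_data}GB"
echo "Total cost: GBP ${total_cost}"
```

Both SSDs are fully allocated, and each 56GB cache partition stays under the 64GB Smart Response limit.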

So whichever OS I boot into I have one SSD dedicated to the OS and a 1TB double-speed (RAID) device fronted by another 56GB SSD cache for bulk data. Of course, if I lose one of the spinning disks in this setup, I lose all of my data, but I care more about speed than redundancy for this machine. It will contain no important data that isn't backed up elsewhere. The cache is good for speeding things up, but I also wanted to put some effort into speeding up the backing disk too for when cache misses happen.

It was a surprise to me to find out that I could use a single fakeRAID device under both Windows and Linux. Just do an "apt-get install dmraid" and then sit back and watch your fakeRAID devices magically appear in /dev/mapper/. Initially I created a single 2TB NTFS partition on it with the intention of mounting the same partition under both OSes and sharing the bulk data space that way. This actually worked at first. It was only when I turned on Intel Smart Response on Windows that things started to break. If you use "Maximized" caching mode under Windows (the one with write caching enabled), any subsequent attempt to mount the partition under Linux gives you the following error message:
The disk contains an unclean file system (0, 0).
Metadata kept in Windows cache, refused to mount.
Failed to mount '/dev/***': Operation not permitted
The NTFS partition is in an unsafe state. Please resume and shutdown
Windows fully (no hibernation or fast restarting), or mount the volume
read-only with the 'ro' mount option.
If you switch to "Enhanced" mode (read caching only), things *appear* to work, but I found that files I created under Linux would mysteriously disappear after I booted into Windows and then back into Linux later on, so I gave up on sharing a single partition. Splitting the data device into two separate 1TB partitions is less than ideal, but I can always resize the partitions at a later date if I find one filling up a lot faster than the other. I could just use one disk for Linux and one for Windows, but then I would lose both the resizing capability and half my read/write performance.

I used these simple instructions to set up Intel Smart Response on the Windows side. For the Linux side, I used these instructions; although it's on the Arch Linux website, they're fairly distro agnostic.

The BCache technology is still very young. Although it is built into the kernel, you need to download the user-space tools for managing it from a git repository. To give you an idea of how easy it is to set up once you've done that though, here's what I did:
root@blob:~# make-bcache -B /dev/mapper/isw_ciifcibhae_DataRaid2
UUID:           fa6e31be-0f36-493e-8569-cd4f12e8b899
Set UUID:       8c89607c-5b7b-41b3-9800-f3a0218a5115
version:        1
block_size:     1
data_offset:    16

root@blob:~# make-bcache -C /dev/sdb1
UUID:           cea25b63-5f81-4326-aaec-daba7ab6febe
Set UUID:       b0d9552f-ef1f-4858-bd78-18e18fa4efcc
version:        0
nbuckets:       112204
block_size:     1
bucket_size:    1024
nr_in_set:      1
nr_this_dev:    0
first_bucket:   1

root@blob:~# echo /dev/mapper/isw_ciifcibhae_DataRaid2 > /sys/fs/bcache/register
root@blob:~# echo /dev/sdb1 > /sys/fs/bcache/register
root@blob:~# echo b0d9552f-ef1f-4858-bd78-18e18fa4efcc > /sys/block/bcache0/bcache/attach
root@blob:~# echo writeback > /sys/block/bcache0/bcache/cache_mode
/dev/sdb1 is the cache device. /dev/mapper/isw_ciifcibhae_DataRaid2 is the second partition of the fake raid stripe that I created.
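
To confirm that the cache mode change took effect, the same sysfs file can be read back. The sketch below assumes the file lists every mode on one line with the active one in square brackets (for example "writethrough [writeback] writearound none"); the helper name active_cache_mode is made up for this example.

```shell
# Print the active bcache cache mode, i.e. the entry shown in [brackets].
# $1 is the bcache sysfs directory; the default matches the device used above.
active_cache_mode() {
    dir=${1:-/sys/block/bcache0/bcache}
    sed -n 's/.*\[\([a-z]*\)\].*/\1/p' "$dir/cache_mode"
}
```

Taking the directory as an argument means the function can be tried against a copy of the file without touching a real device.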

After running those commands, /dev/bcache0 appeared on my system, which can be partitioned and formatted just like any other block device. You can even add encryption to it with LUKS/cryptsetup and/or use LVM on it. The caching works at the block level so it is filesystem agnostic.

I then added the following to my /etc/rc.local to bring the partition up during boot:
echo /dev/sdb1 > /sys/fs/bcache/register
mount /data
Mounting the partition this late in the boot sequence is fine for my purposes, as it will only contain non-boot-related content such as my Steam folder. There is probably a nicer way of doing it though.
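
One slightly more defensive variant of the rc.local lines above, sketched here as an assumption rather than a tested replacement, is to register devices only when the bcache block device is not already present (for example because something else, such as a udev rule, has already registered them). The function name register_bcache and its argument order are hypothetical.

```shell
# Register bcache devices unless the bcache block device already exists.
# $1: bcache device to look for, $2: register file, remaining args: devices.
register_bcache() {
    bdev=$1
    register=$2
    shift 2
    if [ -e "$bdev" ]; then
        echo "$bdev already present, nothing to do"
        return 0
    fi
    for dev in "$@"; do
        echo "$dev" > "$register"
    done
}

# Intended use from /etc/rc.local (needs root):
# register_bcache /dev/bcache0 /sys/fs/bcache/register /dev/sdb1 && mount /data
```

Passing the register file as an argument keeps the function testable against an ordinary file before pointing it at /sys.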

Once you have /dev/bcache0, go and have a look in "/sys/block/bcache0/bcache/". There are a bunch of files in there which show useful information and statistics, so you can see how effective your cache currently is. For example, here's the percentage of reads that hit my cache instead of the backing disk over the past hour:
root@blob:~# cat /sys/block/bcache0/bcache/stats_hour/cache_hit_ratio
96
root@blob:~# 
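The same percentage can be recomputed from raw counters. This sketch assumes the stats directory also contains cache_hits and cache_misses files next to cache_hit_ratio, and that the ratio is an integer percentage of hits over hits plus misses; the function name hit_ratio is made up for this example.

```shell
# Recompute the hit percentage from the raw counters in a bcache stats directory.
# $1 defaults to the hourly stats directory used above.
hit_ratio() {
    dir=${1:-/sys/block/bcache0/bcache/stats_hour}
    hits=$(cat "$dir/cache_hits")
    misses=$(cat "$dir/cache_misses")
    total=$((hits + misses))
    if [ "$total" -eq 0 ]; then
        echo 0    # no reads recorded yet; avoid dividing by zero
    else
        echo $((hits * 100 / total))
    fi
}
```

Having the raw counts is handy when the percentage alone hides how few reads it is based on.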
You can also tweak the way bcache works. For example, you can read and change the cutoff at which BCache considers a read "sequential" and thus avoids caching it:
root@blob:~# echo 4194304 > /sys/block/bcache0/bcache/sequential_cutoff
root@blob:~# cat /sys/block/bcache0/bcache/sequential_cutoff
4.0M
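The value written above is in bytes, and the "4.0M" read back is the same number in binary megabytes: 4 x 1024 x 1024 = 4194304. A small helper for that conversion might look like the sketch below; the name to_bytes is made up for this example and it handles whole numbers only.

```shell
# Convert a size with an optional binary suffix (k, M, G) to bytes.
# Integers only: "4M" works, "4.0M" does not.
to_bytes() {
    case $1 in
        *[kK]) echo $(( ${1%?} * 1024 )) ;;
        *M)    echo $(( ${1%?} * 1024 * 1024 )) ;;
        *G)    echo $(( ${1%?} * 1024 * 1024 * 1024 )) ;;
        *)     echo "$1" ;;
    esac
}

to_bytes 4M
```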
The first thing I did was test each of the devices using "hdparm -t $DEVICE" to see how they were performing for reads:
1 SSD          : Timing buffered disk reads: 1024 MB in  3.00 seconds = 340.98 MB/sec
1 Spinning disk: Timing buffered disk reads: 346 MB in  3.00 seconds = 115.30 MB/sec
RAID stripe    : Timing buffered disk reads: 678 MB in  3.01 seconds = 225.47 MB/sec
BCache device  : Timing buffered disk reads: 712 MB in  3.00 seconds = 237.30 MB/sec
The BCache device doesn't appear much different from the RAID stripe device, presumably because hdparm is not generating any cache hits, so the reads fall through the SSD cache to the slower RAID stripe device.
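
To put the hdparm figures above in relative terms, the ratios can be worked out with awk. The helper name speedup is made up for this example; the inputs are the MB/sec values from the table.

```shell
# Ratio of two throughput figures in MB/sec, to two decimal places.
speedup() {
    LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

speedup 225.47 115.30   # RAID stripe vs. single spinning disk
speedup 237.30 225.47   # BCache device vs. RAID stripe
```

The stripe comes out at roughly 1.96 times a single spinning disk, which is close to the doubling you would hope for from RAID 0, while the BCache device is only about 1.05 times the bare stripe in this uncached test.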

I wanted to see bcache in action though. The best way I found to do this was to first put a load of data on the device and then read it all back whilst watching the disk I/O using iostat. I did a quick "apt-get install steam" to install Steam (who needs Ubuntu), then I created a symlink from /home/mike/.steam/ to /data/mike/.steam/, fired up Steam and installed 18GB of Left 4 Dead 2 and Portal. I then installed iostat and set it running:
apt-get install sysstat
iostat -d 1 -x /dev/sdc /dev/sdd /dev/sdb /dev/bcache0
In a second terminal I then ran:
root@blob:/data/mike# echo 3 > /proc/sys/vm/drop_caches
root@blob:/data/mike# find . -type f -exec cat {} > /dev/null \;
The first command clears the in-memory disk cache so it doesn't interfere with the results. The second command reads all of the data on the disk. Watching the output of iostat allowed me to see how much each device was being accessed. Once the cache device is "warmed up", you can clearly see it handling the reads that would previously have hit the slow backend disk.
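
As a rough alternative to watching iostat by eye, the per-device read counters can be sampled before and after the read. The sketch below assumes the usual /proc/diskstats layout, where the third field is the device name and the sixth is the cumulative number of 512-byte sectors read; the function name sectors_read is made up for this example.

```shell
# Print the cumulative number of 512-byte sectors read from a device.
# $1: device name without /dev/ (e.g. sdb), $2: stats file (defaults to /proc/diskstats).
sectors_read() {
    awk -v dev="$1" '$3 == dev { print $6 }' "${2:-/proc/diskstats}"
}

# Example: MB read from the cache SSD while the find command runs.
# before=$(sectors_read sdb)
# find . -type f -exec cat {} > /dev/null \;
# after=$(sectors_read sdb)
# echo $(( (after - before) * 512 / 1024 / 1024 ))
```

Comparing the figure for the cache SSD with the figures for the two spinning disks on a second pass gives a crude measure of how much of the read was served from cache.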

Unfortunately, it's difficult to quantify how much of a difference it makes during normal day-to-day usage. The theory behind it is sound though. Hopefully by the time I build my next machine, SSDs will have come down in price to the point where I can use them for bulk data too.

source : https://grepular.com/Disk_Caching_with_SSDs_Linux_and_Windows
