High performance SCST iSCSI Target on Linux software Raid
This material is work-in-progress ... Do not follow instructions here until this notice is removed.
Introduction
This HOW-TO focuses on performance. That is why I made some decisions that favor performance over security, unlike what you are used to in Alpine Linux. This means we are not using grsec and not using a firewall. I presume you will take security measures at another level.
To get started, you can download a boot cd here:
http://alpine.nethq.org/distfiles/alpine-scst-110207-x86_64.iso
Vanilla Linux kernel with SCST patches
The default Linux kernel provides support for iSCSI. The problem with this implementation is that it operates in user-space. SCST iSCSI runs in kernel-space, and this is one of the reasons why it performs much better. SCST performance depends on specific patches which need to be applied to the kernel. This is why we created a separate kernel just for SCST usage. The SCST modules are already included by default, so there is no need to install a separate module package.
P.S. We only provide an x86_64 kernel for SCST because it will perform better on 64-bit systems.
Linux software raid
Hardware choice
In my setup I have four WD RE4 1TB drives connected to an mpt2sas based controller. I'm using a Dell PowerEdge R510, which includes their most basic PERC H200 raid controller. I have tried using its RAID10 technology, but rebuilding the raid takes more than 3 days. You can read on mailing lists that this is common for this adapter, so I chose not to use this kind of "hardware" raid. Performance in degraded mode is really poor, and my users would suffer from it for too long. There is also little to no way to control this array from Alpine Linux: all LSI tools are based on glibc, which will not work on uclibc hosts.
Raid level
I started looking for the raid level with the best performance that still offers redundancy. According to many mailing lists and the opinion of the Linux raid author, RAID10 with layout f2 (far) seems to perform best while still having redundancy. Please remember that with RAID10 50% of your hard disk space goes to redundancy, but performance is almost the same as RAID0 (stripe). For the most up-to-date information regarding Linux software raid, see: https://raid.wiki.kernel.org/index.php/Overview
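To make the 50% figure concrete, here is a small sketch of the capacity arithmetic. The function name usable_raid10_gb is made up for this example, and it assumes equally sized disks and the two copies of each block that layout f2 keeps.

```shell
# usable_raid10_gb DISKS SIZE_GB [COPIES]
# Hypothetical helper: usable capacity of an md RAID10 array is the raw
# capacity divided by the number of copies kept of each block
# (assumed to be 2, as with the f2 layout).
usable_raid10_gb() {
    disks=$1
    size_gb=$2
    copies=${3:-2}
    echo $(( disks * size_gb / copies ))
}
```

For the four 1TB drives used in this setup, `usable_raid10_gb 4 1000` prints 2000, so roughly 2TB stays usable.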
Raid setup
Make sure we have the mdadm raid configuration tool installed
apk add mdadm
I am not using partitions on my disks, although there are reasons to use partitions, see here:
https://raid.wiki.kernel.org/index.php/Partition_Types
mdadm -v --create /dev/md0 --level=raid10 --layout=f2 --raid-devices=4 /dev/sda /dev/sdb /dev/sdc /dev/sdd
You can now monitor your raid (re)building: cat /proc/mdstat
The kernel applies default limits to the rebuild speed, which can be checked and changed here:
cat /proc/sys/dev/raid/speed_limit_max
cat /proc/sys/dev/raid/speed_limit_min
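The limits can be changed by writing new values to the same two files. The following is a minimal sketch, assuming you run it as root. The function name and the example values are my own and are not tuning recommendations. The kernel interprets the values as KiB/s per device.

```shell
# set_raid_speed_limits MIN_KB MAX_KB [DIR]
# Hypothetical helper: write new md resync speed limits (KiB/s per device).
# DIR defaults to /proc/sys/dev/raid; the optional argument only exists
# so the function can be tried out against a scratch directory first.
set_raid_speed_limits() {
    min_kb=$1
    max_kb=$2
    raid_dir=${3:-/proc/sys/dev/raid}
    if [ "$min_kb" -gt "$max_kb" ]; then
        echo "minimum must not exceed maximum" >&2
        return 1
    fi
    echo "$min_kb" > "$raid_dir/speed_limit_min" &&
    echo "$max_kb" > "$raid_dir/speed_limit_max"
}
```

For example, `set_raid_speed_limits 50000 200000` raises the guaranteed minimum so a rebuild is not starved by regular I/O. The values are lost at reboot unless you set them again from a startup script.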
Make sure the raid10 module is loaded at boot
echo raid10 >> /etc/modules
When you are happy with your raid configuration, save its information to the mdadm.conf file
mdadm --detail --scan >> /etc/mdadm.conf
The line appended to the file should look something like
ARRAY /dev/md0 metadata=1.2 name=scst:0 UUID=71fc93b8:3fef0057:9f5feec1:7d9e57e8
When your raid setup is finished and functioning, you will need to make sure it starts at boot time
rc-update add mdadm-raid default
Monitor software raid
Linux software raid can be monitored with mdadm's daemon option. Alpine Linux includes an init.d script which can invoke the daemon
/etc/init.d/mdadm
By default it will monitor the arrays defined in mdadm.conf. To receive email notifications about array issues, we need to provide our email address inside mdadm.conf:
MAILADDR me@inter.net
Because mdadm cannot send email itself, we need to set up a sendmail (replacement) program.
apk add ssmtp
Ssmtp can be configured by editing /etc/ssmtp/ssmtp.conf
You can monitor messages (syslog) for actions invoked by mdadm.
If you have your own monitoring system active, you can also let mdadm run a script to notify it.
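As a rough sketch of such a script: mdadm can run a program named on a PROGRAM line in mdadm.conf, and passes it the event name, the md device and, for some events, the affected component device. The function name and the severity mapping below are my own assumptions and should be adapted to whatever your monitoring system expects.

```shell
# mdadm_event_message EVENT MD_DEVICE [COMPONENT]
# Hypothetical helper: turn the arguments mdadm passes to its PROGRAM
# into a single line with a severity prefix (the mapping is an assumption).
mdadm_event_message() {
    event=$1
    md_dev=$2
    component=${3:-}
    case "$event" in
        Fail|FailSpare|DegradedArray|SparesMissing) level=critical ;;
        RebuildStarted|RebuildFinished|Rebuild??)   level=warning ;;
        *)                                          level=info ;;
    esac
    echo "$level: mdadm $event on $md_dev${component:+ ($component)}"
}
```

A handler script would then end with something like `mdadm_event_message "$@" | logger -t md-monitor`, or hand the line to your monitoring system instead of syslog.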
Monitor hard disk with smartmontools
To keep our array healthy we can monitor our hard disks through their SMART interface. SMART can tell you at an early stage when hard disks begin to fail. Smartmontools includes a daemon called smartd which runs in the background and notifies you by email or script about any issues your hard disks have.
apk add smartmontools
By default your smartd.conf will include the DEVICESCAN line, which will automatically scan your system and start monitoring all SMART capable devices. Because we want to make some changes to its default configuration, we can comment it out and include lines like these:
/dev/sda -a -d sat -m user@domain.tld -M test
/dev/sdb -a -d sat -m user@domain.tld -M test
/dev/sdc -a -d sat -m user@domain.tld -M test
/dev/sdd -a -d sat -m user@domain.tld -M test
The -a option will do most of the basic monitoring for you, the -d specifies the device type (in my case a SATA disk) and the -m is to tell smartd to email me any issues it may find regarding this disk. I have also included the -M test switch to let smartmontools email me a test email at startup to make sure my mail subsystem is working. If you want smartd to email, you will need to have a minimal smtp setup like we used above with ssmtp but also the mail client which is included in mailx.
apk add mailx
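Since the smartd.conf lines only differ in the device name, they can also be generated. This is just a sketch: the function name smartd_lines is made up, and it simply repeats the options explained above for every device you pass.

```shell
# smartd_lines EMAIL DEVICE...
# Hypothetical helper: print one smartd.conf line per device, using the
# same options as in this HOW-TO (-a, -d sat, -m EMAIL, -M test).
smartd_lines() {
    email=$1
    shift
    for dev in "$@"; do
        echo "$dev -a -d sat -m $email -M test"
    done
}
```

For example, `smartd_lines user@domain.tld /dev/sda /dev/sdb /dev/sdc /dev/sdd` prints four lines which you can review and then append to smartd.conf.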
When everything is set up we can start it and make sure it starts at boot.
rc-update add smartd default
/etc/init.d/smartd start
Don't forget to lbu_commit if needed.
Disk & Volume management with SCST
Volume management can be an interesting addition to your block device(s), but it does add an extra layer between your block devices and SCST. If you need any of the features provided by LVM, then go ahead and use it.
Logical Volume Manager
LVM brings some very interesting features to Linux block devices:
- Online resizing of volumes
- Can be used on any software array (unlike partitions)
- Snapshots (make it easier to create backups of block devices)
- Many more great features
To set up LVM, please follow Setting_up_Logical_Volumes_with_LVM
BlockIO vs FileIO
BLOCKIO has direct access to the block device md0 (without extra layers), while FILEIO uses a filesystem on top of the block device and works with regular files. Although the extra layer would seem to be a disadvantage, it actually isn't. The filesystem provides a caching layer, and in some situations you will see improved performance. According to the SCST developer, it can lead to a performance increase of up to 10%.
Some speed tests [1]
Filesystem for FileIO
XFS has some very interesting features which match the use in our iSCSI environment like:
- Automatically adjusts to the destination block device
- Live growing (while the filesystem is mounted; no shrink support)
- Very good performance with large files
When using FILEIO we need to create a filesystem. To create an XFS filesystem we need xfsprogs installed
apk add xfsprogs
We also need XFS support in our kernel
modprobe xfs
And we need it next time we boot
echo xfs >> /etc/modules
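With the tools and the module in place, the filesystem still has to be created and mounted. A minimal sketch, assuming the array is /dev/md0 and the mount point is /mnt/array1 (the path used for the disk files later in this HOW-TO). The helper names run and make_fileio_fs are made up. Be aware that mkfs.xfs destroys whatever is on the device.

```shell
# run CMD...
# Hypothetical helper: print the command, and execute it unless DRY_RUN=1.
run() {
    echo "+ $*"
    [ "${DRY_RUN:-0}" = "1" ] || "$@"
}

# make_fileio_fs DEVICE MOUNTPOINT
# Create an XFS filesystem on DEVICE and mount it on MOUNTPOINT.
# Stops at the first step that fails.
make_fileio_fs() {
    run mkfs.xfs "$1" &&
    run mkdir -p "$2" &&
    run mount -t xfs "$1" "$2"
}
```

Running `DRY_RUN=1 make_fileio_fs /dev/md0 /mnt/array1` first only prints the three commands, which is a cheap way to double check the device name before anything is written.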
File creation
On top of the filesystem we need to create files which will be used as disks. These files can be regular or sparse files [2]. Creating regular files with dd on large volumes takes a long time; sparse files are created instantly, but they can become heavily fragmented over time. Another solution for this issue is fallocate, which allocates the space up front without writing it. fallocate is included in util-linux-ng.
apk add util-linux-ng
Now you can create your files like this
fallocate -l 10G disk1
And verify that it is correct with:
du -h disk1
This command should show that all of the file's disk space is allocated.
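If you want a number instead of reading du output, the apparent size can be compared with the allocated blocks. This is a sketch with made-up function names; it relies on stat -c with the %s (size in bytes), %b (allocated blocks) and %B (bytes per block) format sequences.

```shell
# allocation_percent SIZE_BYTES BLOCKS BLOCK_BYTES
# Hypothetical helper: percentage of the apparent size that is backed
# by allocated blocks, capped at 100.
allocation_percent() {
    size=$1
    blocks=$2
    block_bytes=$3
    if [ "$size" -le 0 ]; then
        echo 100
        return 0
    fi
    pct=$(( blocks * block_bytes * 100 / size ))
    if [ "$pct" -gt 100 ]; then
        pct=100
    fi
    echo "$pct"
}

# file_allocation_percent FILE
# Same calculation, with the three numbers read from stat.
file_allocation_percent() {
    info=$(stat -c '%s %b %B' "$1") || return 1
    set -- $info
    allocation_percent "$1" "$2" "$3"
}
```

A file created with fallocate should report 100 from `file_allocation_percent disk1`, while a freshly created sparse file reports a value close to 0.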
SCST and iSCSI management
SCST is managed by the sysfs filesystem. You can create your own scripts to control it, or use the included scstadmin like I will show here.
For a detailed overview of the sysfs filesystem please check here:
http://lwn.net/Articles/378658/
When starting with SCST management we need to have the SCST framework (kernel module) and the iSCSI kernel module loaded
modprobe scst scst_vdisk iscsi_scst
After these are loaded we can start the iSCSI daemon
/etc/init.d/iscsi-scst start
This command will return some information based on the current configuration located in: /etc/scst.conf
The basic config when only having iSCSI loaded is:
TARGET_DRIVER iscsi {
    enabled 0
}
Adding a target
We start by adding a target to the correct target driver
scstadmin -add_target iqn.2010-12.org.alpinelinux:tgt -driver iscsi
The config at this point should be:
TARGET_DRIVER iscsi {
    enabled 0
    TARGET iqn.2010-12.org.alpinelinux:tgt {
        enabled 0
    }
}
Adding a device
Now that we have a target in our configuration we need to add a device:
scstadmin -open_dev disk01 -handler vdisk_fileio -attributes filename=/mnt/array1/disk01,nv_cache=1
The config at this point should be:
HANDLER vdisk_fileio {
    DEVICE disk01 {
        t10_dev_id "disk01 b8ceed65"
        usn b8ceed65
        filename /mnt/array1/disk01
        nv_cache 1
    }
}
TARGET_DRIVER iscsi {
    enabled 0
    TARGET iqn.2010-12.org.alpinelinux:tgt {
        enabled 0
    }
}
Adding a LUN
To add the device to the target we need to specify which LUN it will be (numbering always needs to start with 0).
scstadmin -add_lun 0 -driver iscsi -target iqn.2010-12.org.alpinelinux:tgt -device disk01
The config at this point should be:
HANDLER vdisk_fileio {
    DEVICE disk01 {
        t10_dev_id "disk01 b8ceed65"
        usn b8ceed65
        filename /mnt/array1/disk01
        nv_cache 1
    }
}
TARGET_DRIVER iscsi {
    enabled 0
    TARGET iqn.2010-12.org.alpinelinux:tgt {
        enabled 0
        LUN 0 disk01
    }
}
Enable the target
This is the default minimum configuration for a working iSCSI setup. We now need to activate it:
scstadmin -enable_target iqn.2010-12.org.alpinelinux:tgt -driver iscsi
The config at this point should be:
HANDLER vdisk_fileio {
    DEVICE disk01 {
        t10_dev_id "disk01 b8ceed65"
        usn b8ceed65
        filename /mnt/array1/disk01
        nv_cache 1
    }
}
TARGET_DRIVER iscsi {
    enabled 0
    TARGET iqn.2010-12.org.alpinelinux:tgt {
        rel_tgt_id 1
        enabled 1
        LUN 0 disk01
    }
}
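The four scstadmin steps so far (add target, open device, add LUN, enable target) can be collected in one small function, so that a failing step stops the sequence instead of leaving a half-built target. This is only a sketch: the function name scst_export_file is made up, the scstadmin options are the ones already shown in this HOW-TO, and the LUN is fixed at 0.

```shell
# scst_export_file TARGET DEVICE FILE
# Hypothetical helper: export FILE as LUN 0 of TARGET through the
# vdisk_fileio handler. Each step only runs if the previous one succeeded.
scst_export_file() {
    target=$1
    device=$2
    file=$3
    scstadmin -add_target "$target" -driver iscsi &&
    scstadmin -open_dev "$device" -handler vdisk_fileio \
        -attributes "filename=$file,nv_cache=1" &&
    scstadmin -add_lun 0 -driver iscsi -target "$target" -device "$device" &&
    scstadmin -enable_target "$target" -driver iscsi
}
```

With the names used in this HOW-TO the call would be `scst_export_file iqn.2010-12.org.alpinelinux:tgt disk01 /mnt/array1/disk01`. The iSCSI driver itself is still disabled afterwards.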
SCST security settings
The latest version of SCST has changed its security configuration. Previously you had to edit files inside your etc directory, like initiators.allow, but since the move to sysfs it is controlled through sysfs. The iSCSI daemon can be configured via init.d to listen on only one interface, but in the case of MPIO you would have multiple interfaces to listen on. To make iSCSI listen only on the interfaces which you reserve for iSCSI traffic, you can add them as allowed portals like this:
scstadmin -add_tgt_attr iqn.2010-12.org.alpinelinux:tgt -driver iscsi -attributes allowed_portal=192.168.1.1
IP wildcards are specified with a question mark "?"
HANDLER vdisk_fileio {
    DEVICE disk01 {
        t10_dev_id "disk01 b8ceed65"
        usn b8ceed65
        filename /mnt/array1/disk01
        nv_cache 1
    }
}
TARGET_DRIVER iscsi {
    enabled 0
    TARGET iqn.2010-12.org.alpinelinux:tgt {
        rel_tgt_id 1
        enabled 1
        allowed_portal 192.168.1.1
        LUN 0 disk01
    }
}
Enable the iSCSI driver
Our configuration should be ok and the target is enabled. If you need to set any security groups or other settings please do them now. When you are ready we can activate the iSCSI driver:
scstadmin -set_drv_attr iscsi -attributes enabled=1
Reverting your changes
After the enable command above, your iSCSI disk should be available to your initiator. You can use the same command with enabled=0 to disable the iSCSI driver and take it offline:
scstadmin -set_drv_attr iscsi -attributes enabled=0
Some other commands to break down the SCST config piece by piece:
To disable the target
scstadmin -disable_target iqn.2010-12.org.alpinelinux:tgt -driver iscsi
To remove the lun
scstadmin -rem_lun 0 -driver iscsi -target iqn.2010-12.org.alpinelinux:tgt -device disk01
To close a device
scstadmin -close_dev disk01 -handler vdisk_fileio
To remove a target
scstadmin -rem_target iqn.2010-12.org.alpinelinux:tgt -driver iscsi
iSCSI MPIO
MPIO (Multipath I/O) is a way to transfer IO data over multiple links. iSCSI MPIO gives you two advantages:
- Failover in case one of the paths goes down (a cable is loose or defective, a NIC is defective, ...)
- Increased throughput by sharing the load over multiple paths
Currently 10G network interfaces and switches are still relatively expensive compared to 1G networks. Depending on your setup you can configure your network interfaces accordingly. I will provide an example which uses ESXi 4.1 as initiator. The ESXi round-robin initiator can spread the SCSI commands evenly over each interface by configuring the iops parameter. For details on such a setup I suggest you read the following article [3]; for reference you can also look here [4] and [5].
Interfaces
In my current setup I have configured 4 interfaces for my iSCSI traffic. These 4 interfaces consist of 2 dual-port PCI Express add-on network cards. I am not including my onboard dual controller because most of these onboard controllers do not allow simultaneous data transfers. You can easily verify this by using bwm-ng while pushing lots of data over it. I can confirm this for my onboard e1000 and bnx2 interfaces, so I use these interfaces for management traffic.
Bonding
I have tried to bond the iSCSI interfaces on the target. Although this worked and did increase performance, ESXi achieved much (+/- 50%) better performance with plain, unbonded IP networking. My advice is not to use bonding on the target side.
IP / Subnets
ESXi supports multiple network adapters in the same subnet. This kind of setup seems more logical to me and would be easier to set up, and you would end up with more paths to your storage because each interface can reach all remote interfaces, but I was not able to make this work in Linux. I have tried playing with arp settings and route tables, but I never succeeded in sending traffic to a particular interface and receiving it back from the same one. You can find some information on this topic here [6]. Somebody with better knowledge about routing/arp issues, please correct me. While a single subnet setup would provide you with more paths, the actual hardware paths would still total 4 Gbit/s, so I set up each of my NICs in its own subnet on both target and initiator.
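To illustrate the one-subnet-per-NIC layout, here is a sketch that prints /etc/network/interfaces stanzas for a row of interfaces. Everything specific in it is an assumption made for the example: the function name, the eth2 to eth5 interface names and the 192.168.10.0/24 to 192.168.13.0/24 subnets. Use whatever names and ranges match your own network.

```shell
# iscsi_interface_stanzas FIRST_INDEX COUNT
# Hypothetical helper: print a static stanza for COUNT interfaces starting
# at eth<FIRST_INDEX>, each in its own /24 subnet (192.168.10.x, 192.168.11.x, ...).
iscsi_interface_stanzas() {
    first=$1
    count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "auto eth$(( first + i ))"
        echo "iface eth$(( first + i )) inet static"
        echo "    address 192.168.$(( 10 + i )).1"
        echo "    netmask 255.255.255.0"
        echo
        i=$(( i + 1 ))
    done
}
```

`iscsi_interface_stanzas 2 4` prints stanzas for eth2 up to eth5. On the initiator side each matching NIC would get an address in the same subnet as its counterpart on the target.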
Performance
If you configure this right you should be able to reach approx. 90 MB/s per channel according to my speed tests. These tests are based on:
- ESXi 4.1
- 4 hardware paths
- Using Round-Robin disk with IOPS=1
- Windows 2008 server as guest
- Windows Raid0 disk setup (using 2 ESXi disks)
- CrystalDiskMark [7]
- Without jumbo frames
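To put the 90 MB/s per channel into perspective, the arithmetic can be sketched as follows. The function names are made up, and the calculation assumes 1 MB = 1,000,000 bytes and a raw link speed of 1000 Mbit/s, ignoring all protocol overhead.

```shell
# aggregate_mb_s PATHS PER_PATH_MB_S
# Hypothetical helper: combined throughput when all paths are used evenly.
aggregate_mb_s() {
    echo $(( $1 * $2 ))
}

# link_utilisation_percent MEASURED_MB_S LINK_MBIT_S
# Hypothetical helper: share of the raw link speed that the measured
# throughput represents (MB/s times 8 gives Mbit/s).
link_utilisation_percent() {
    echo $(( $1 * 8 * 100 / $2 ))
}
```

`aggregate_mb_s 4 90` prints 360, the combined figure for four paths, and `link_utilisation_percent 90 1000` prints 72, so each 1G link is used at roughly 72% of its raw speed.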
Caution
If you are going to use iSCSI as one of your main storage solutions, you should seriously think about using redundant hardware.
- Use multiple network cards (physical cards)
- Use multiple switches (Assign your network cards to switches)
- Even better: Heartbeat with DRBD for iSCSI failover