High performance SCST iSCSI Target on Linux software Raid
This material is work-in-progress ... Do not follow instructions here until this notice is removed.
Introduction
This HOW-TO focuses on performance. This is why I made some decisions that favour performance over security (which you may be used to in Alpine Linux). This means we are not using grsec and not using a firewall. I presume you will take security measures at another level.
To get started, you can download a boot cd here:
http://alpine.nethq.org/distfiles/alpine-scst-110207-x86_64.iso
Vanilla Linux kernel with SCST patches
The default Linux kernel provides support for iSCSI. The problem with this implementation is that it operates in user-space. SCST iSCSI runs in kernel-space, and this is one of the reasons why it performs much better. SCST performance depends on specific patches which need to be applied to the kernel. This is why we created a separate kernel just for SCST usage. SCST modules are already included by default, so there is no need to install a separate module package.
P.S. We only provide an x86_64 kernel for SCST because it will perform better on 64bit systems.
Linux software raid
In my personal setup I have 4 WD RE4 1TB drives which I want to use in the best performing raid level that still has redundancy. According to many mailing lists and the opinion of the Linux raid author, RAID10 with layout f2 (far) seems to perform best while still having redundancy. Please remember that with RAID10 50% of your hard disk space will go to redundancy, but performance is almost the same as RAID0 (stripe).
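The 50% figure can be checked with a little arithmetic. This is a minimal sketch; the function name raid10_usable is made up for this example and is not part of mdadm. It assumes all member disks have the same size.

```shell
# Usable capacity of an md RAID10 array: every block is stored COPIES times,
# so usable space = disks * size_per_disk / copies.
raid10_usable() (
    disks=$1          # number of member disks
    size=$2           # size of one disk (any unit, e.g. GB)
    copies=${3:-2}    # copies per block; 2 for the f2 layout
    echo $(( disks * size / copies ))
)

raid10_usable 4 1000 2   # 4 x 1000 GB drives, two copies: prints 2000
```

With four 1TB drives this gives the roughly 2TB block device used in the rest of this HOW-TO.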
For most up-to-date information regarding Linux software raid: https://raid.wiki.kernel.org/index.php/Overview
RAID10 has multiple layout types. In tests, f2 (far) seems to perform the best. Please see the link above for references.
mdadm -v --create /dev/md0 --level=raid10 --layout=f2 --raid-devices=4 /dev/sda /dev/sdb /dev/sdc /dev/sdd
I am not using partitions on my disks, although there are reasons to use partitions, see here:
https://raid.wiki.kernel.org/index.php/Partition_Types
You can now monitor your raid (re)building: cat /proc/mdstat
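If you only care about the progress line, a small filter helps. This is a sketch under the assumption that the kernel reports progress on lines containing resync, recovery, reshape or check; md_progress is a name invented for this example.

```shell
# Print only the progress lines (resync, recovery, reshape or check) from an
# mdstat-style file. The file argument defaults to /proc/mdstat.
# Returns non-zero when no such line is present.
md_progress() {
    grep -E 'resync|recovery|reshape|check' "${1:-/proc/mdstat}"
}
```

For example, `while md_progress; do sleep 10; done` keeps printing the progress line until the array is idle again.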
The rebuild speed limits are set to default values, which can be checked and changed here:
cat /proc/sys/dev/raid/speed_limit_max
cat /proc/sys/dev/raid/speed_limit_min
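Changing the limits is a matter of writing new values to the same files. The sketch below wraps that in a function; set_raid_speed and the example values are my own choices, not recommendations. The values are in KiB/s per device.

```shell
# Set the md rebuild speed limits (KiB/s per device).
# The third argument is the sysctl directory; it defaults to the real one
# and is only a parameter so the function can be tried on a scratch directory.
set_raid_speed() (
    min=$1
    max=$2
    dir=${3:-/proc/sys/dev/raid}
    echo "$min" > "$dir/speed_limit_min" &&
    echo "$max" > "$dir/speed_limit_max"
)
```

Usage (as root): `set_raid_speed 50000 500000`. The setting does not survive a reboot.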
Make sure the raid10 module is loaded at boot
echo raid10 >> /etc/modules
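Running the echo above more than once leaves duplicate lines in /etc/modules. A possible way around that, sketched with a helper name (add_module_once) invented for this example:

```shell
# Append a module name to a modules file only if it is not listed yet.
# The file defaults to /etc/modules.
add_module_once() (
    module=$1
    file=${2:-/etc/modules}
    # -q: quiet, -x: match the whole line, -F: treat the name as a fixed string
    grep -qxF "$module" "$file" 2>/dev/null || echo "$module" >> "$file"
)
```

The same helper can be reused for other modules, for example `add_module_once xfs`.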
When you are happy with your raid configuration, save its information to the mdadm.conf file
mdadm --detail --scan >> /etc/mdadm.conf
It should display something like
ARRAY /dev/md0 metadata=1.2 name=scst:0 UUID=71fc93b8:3fef0057:9f5feec1:7d9e57e8
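Because the scan output is appended with >>, running it twice leaves two ARRAY lines for the same device. The sketch below lists the arrays found in the file so duplicates are easy to spot; list_arrays is a name made up for this example, and it assumes one ARRAY entry per line as shown above.

```shell
# Print "device uuid" for every ARRAY line in an mdadm.conf-style file.
# The file defaults to /etc/mdadm.conf; "-" is printed when no UUID is given.
list_arrays() {
    awk '$1 == "ARRAY" {
        uuid = "-"
        for (i = 3; i <= NF; i++)
            if ($i ~ /^UUID=/) uuid = substr($i, 6)
        print $2, uuid
    }' "${1:-/etc/mdadm.conf}"
}
```

If the same device shows up more than once, remove the extra line from the file by hand.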
When your raid setup is finished and functioning, you will need to make sure it is started at boot time
rc-update add mdadm-raid default
Monitor software raid
Linux software raid can be monitored with mdadm running as a daemon in monitor mode. Alpine Linux includes an init.d script which can start the daemon
/etc/init.d/mdadm
By default it will monitor the arrays defined in mdadm.conf. To receive email notifications about array issues, we need to provide our email address inside mdadm.conf:
MAILADDR me@inter.net
Because mdadm cannot send email itself, we need to set up a sendmail (replacement) program.
apk add ssmtp
Ssmtp can be configured by editing /etc/ssmtp/ssmtp.conf
You can monitor messages (syslog) for actions invoked by mdadm.
If you have your own monitoring system active, you can also let mdadm issue a script and notify it.
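mdadm runs such a script for every event when a PROGRAM line in mdadm.conf points to it, passing the event name, the md device and, where relevant, the component device as arguments. A minimal sketch of what the script could do; the function name md_event_message and the message format are my own invention.

```shell
# Format one mdadm monitor event as a single log line.
# mdadm passes: $1 = event name, $2 = md device, $3 = component device (optional).
md_event_message() {
    printf 'md event %s on %s%s\n' "$1" "$2" "${3:+ (component $3)}"
}
```

Inside the script you could then call `md_event_message "$@" | logger -t mdadm-event`, or hand the message to your own monitoring system instead of syslog.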
Monitor hard disk with smartmontools
To keep your array healthy we can monitor our hard disks through their SMART interface. SMART can tell you at an early stage when hard disks begin to fail. Smartmontools includes a daemon called smartd which runs in the background and notifies you by email or script about any issues with your hard disks.
apk add smartmontools
By default your smartd.conf will include the DEVICESCAN line, which automatically scans your system and starts monitoring all SMART-capable devices. Because we want to make some changes to the default configuration, we can comment it out and include lines like these:
/dev/sda -a -d sat -m user@domain.tld -M test
/dev/sdb -a -d sat -m user@domain.tld -M test
/dev/sdc -a -d sat -m user@domain.tld -M test
/dev/sdd -a -d sat -m user@domain.tld -M test
The -a option will do most of the basic monitoring for you, the -d specifies the device type (in my case a SATA disk) and the -m is to tell smartd to email me any issues it may find regarding this disk. I have also included the -M test switch to let smartmontools email me a test email at startup to make sure my mail subsystem is working. If you want smartd to email, you will need to have a minimal smtp setup like we used above with ssmtp but also the mail client which is included in mailx.
apk add mailx
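With more than a handful of disks, the smartd.conf lines can be generated instead of typed. A small sketch, using the same options as in this HOW-TO; smartd_lines is a name invented for this example.

```shell
# Print one smartd.conf line per device.
# $1 = email address, remaining arguments = devices to monitor.
smartd_lines() (
    mailto=$1
    shift
    for dev in "$@"; do
        echo "$dev -a -d sat -m $mailto -M test"
    done
)
```

For example, `smartd_lines user@domain.tld /dev/sda /dev/sdb /dev/sdc /dev/sdd` prints the four lines for my disks; review the output before adding it to smartd.conf.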
When everything is set up we can start it and make sure it starts at boot.
rc-update add smartd default
/etc/init.d/smartd start
Don't forget to lbu_commit if needed.
Disk & Volume management with SCST
Volume management can be an interesting addition to your block device(s), but it does add an extra layer between your block devices and SCST. If you need any of the features provided by LVM then go ahead and use it; I didn't add it to my array because I don't think I will need it. In my RAID10 setup above, md provides a 2TB block device, md0. SCST provides 2 ways to access your disk subsystem: BLOCKIO and FILEIO.
BLOCKIO accesses the block device md0 directly (without extra layers), while FILEIO uses a filesystem on top of the block device and works with regular files. Although the extra layer might seem bad, it actually isn't: the filesystem provides a caching layer and in some situations you will see improved performance.
Some speed tests:
http://scst.sourceforge.net/vl_res.txt
When using FILEIO we need to create a filesystem. To create an XFS filesystem we need xfsprogs installed
apk add xfsprogs
We also need XFS support in our kernel
modprobe xfs
And we need it next time we boot
echo xfs >> /etc/modules
XFS will automatically choose the correct settings for the target block device
mkfs.xfs /dev/md0
Our filesystem should be ready to be mounted
mkdir -p /mnt/array1 && mount /dev/md0 /mnt/array1
Because of FILEIO we need files on our filesystem which will act as iSCSI disks. We will create them with dd
dd if=/dev/zero of=/mnt/array1/disk01 bs=512k count=100000
You can also use sparse files, which are created instantly (no need to wait for every bit to be written to the filesystem).
More info about sparse files here
http://en.wikipedia.org/wiki/Sparse_file
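One way to create a sparse file is to let dd seek to the desired size without writing any data. A minimal sketch; the function name make_sparse is made up for this example.

```shell
# Create a sparse file of the given size in MiB.
# count=0 writes no data; seek moves to the end position, and dd sets the
# file length to that offset.
make_sparse() (
    file=$1
    mib=$2
    dd if=/dev/zero of="$file" bs=1048576 count=0 seek="$mib" 2>/dev/null
)
```

For example, `make_sparse /mnt/array1/disk02 51200` would create a 50 GiB file (disk02 is just an example name). Keep in mind that the blocks are only allocated when they are first written, so the filesystem can run out of space later if you hand out more than it can hold.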
SCST and iSCSI management
SCST is managed by the sysfs filesystem. You can create your own scripts to control it, or use the included scstadmin like I will show here.
For a detailed overview of the sysfs filesystem please check here:
http://lwn.net/Articles/378658/
When starting with SCST management we need to have the SCST framework (kernel module) and the iSCSI kernel module loaded
modprobe -a scst scst_vdisk iscsi_scst
After these are loaded we can start the iSCSI daemon
/etc/init.d/iscsi-scst start
This command will return some information based on the current configuration located in: /etc/scst.conf
The basic config when only having iSCSI loaded is:
TARGET_DRIVER iscsi {
    enabled 0
}
Adding a target
We start by adding a target to the correct target driver
scstadmin -add_target iqn.2010-12.org.alpinelinux:tgt -driver iscsi
The config at this point should be:
TARGET_DRIVER iscsi {
    enabled 0
    TARGET iqn.2010-12.org.alpinelinux:tgt {
        enabled 0
    }
}
Adding a device
Now that we have a target in our configuration we need to add a device:
scstadmin -open_dev disk01 -handler vdisk_fileio -attributes filename=/mnt/array1/disk01,nv_cache=1
The config at this point should be:
HANDLER vdisk_fileio {
    DEVICE disk01 {
        t10_dev_id "disk01 b8ceed65"
        usn b8ceed65
        filename /mnt/array1/disk01
        nv_cache 1
    }
}
TARGET_DRIVER iscsi {
    enabled 0
    TARGET iqn.2010-12.org.alpinelinux:tgt {
        enabled 0
    }
}
Adding a LUN
To add the device to the target we need to specify which LUN it will be. (we always need to start with 0)
scstadmin -add_lun 0 -driver iscsi -target iqn.2010-12.org.alpinelinux:tgt -device disk01
The config at this point should be:
HANDLER vdisk_fileio {
    DEVICE disk01 {
        t10_dev_id "disk01 b8ceed65"
        usn b8ceed65
        filename /mnt/array1/disk01
        nv_cache 1
    }
}
TARGET_DRIVER iscsi {
    enabled 0
    TARGET iqn.2010-12.org.alpinelinux:tgt {
        enabled 0
        LUN 0 disk01
    }
}
Enable the target
This is the default minimum configuration for a working iSCSI setup. We now need to activate it:
scstadmin -enable_target iqn.2010-12.org.alpinelinux:tgt -driver iscsi
The config at this point should be:
HANDLER vdisk_fileio {
    DEVICE disk01 {
        t10_dev_id "disk01 b8ceed65"
        usn b8ceed65
        filename /mnt/array1/disk01
        nv_cache 1
    }
}
TARGET_DRIVER iscsi {
    enabled 0
    TARGET iqn.2010-12.org.alpinelinux:tgt {
        rel_tgt_id 1
        enabled 1
        LUN 0 disk01
    }
}
SCST security settings
The latest version of SCST has changed its security configuration. Previously you would edit files inside your etc directory, such as initiators.allow, but since the move to sysfs it is controlled through sysfs. The iSCSI daemon can be configured via init.d to listen on only one interface, but in the case of MPIO you will have multiple interfaces to listen on. To make iSCSI listen only on the interfaces which you reserve for iSCSI traffic, you can add them as allowed portals like this:
scstadmin -add_tgt_attr iqn.2010-12.org.alpinelinux:tgt -driver iscsi -attributes allowed_portal=192.168.1.1
IP wildcards are specified with a question mark "?"
HANDLER vdisk_fileio {
    DEVICE disk01 {
        t10_dev_id "disk01 b8ceed65"
        usn b8ceed65
        filename /mnt/array1/disk01
        nv_cache 1
    }
}
TARGET_DRIVER iscsi {
    enabled 0
    TARGET iqn.2010-12.org.alpinelinux:tgt {
        rel_tgt_id 1
        enabled 1
        allowed_portal 192.168.1.1
        LUN 0 disk01
    }
}
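To get a feeling for what a "?" wildcard covers, you can try a pattern in the shell first. This sketch uses the shell's own pattern matching, where "?" stands for exactly one character; I assume SCST treats "?" the same way, but the matching is done by SCST itself and may differ in details. portal_matches is a name invented for this example.

```shell
# Succeed when address $2 matches pattern $1.
# The pattern is deliberately left unquoted in the case arm so that
# ? (and *) act as wildcards.
portal_matches() {
    case $2 in
        $1) return 0 ;;
        *)  return 1 ;;
    esac
}
```

For example, `portal_matches '192.168.1.?' 192.168.1.5` succeeds, while the same pattern does not match 192.168.1.25 because "?" covers only a single character.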
Enable the iSCSI driver
Our configuration should be ok and the target is enabled. If you need to set any security groups or other settings please do them now. When you are ready we can activate the iSCSI driver:
scstadmin -set_drv_attr iscsi -attributes enabled=1
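The steps from adding the target up to enabling the driver can be collected into one function. This is only a sketch built from the commands shown in this HOW-TO: the function name setup_target and the SCSTADMIN variable are my own additions. The variable makes a dry run possible, so you can look at the commands before they touch a live system.

```shell
# Command used to run scstadmin. Set SCSTADMIN='echo scstadmin' for a dry run
# that only prints the commands.
SCSTADMIN=${SCSTADMIN:-scstadmin}

# $1 = target IQN, $2 = device name, $3 = backing file for vdisk_fileio.
# Stops at the first command that fails.
setup_target() (
    iqn=$1
    dev=$2
    file=$3
    $SCSTADMIN -add_target "$iqn" -driver iscsi &&
    $SCSTADMIN -open_dev "$dev" -handler vdisk_fileio -attributes "filename=$file,nv_cache=1" &&
    $SCSTADMIN -add_lun 0 -driver iscsi -target "$iqn" -device "$dev" &&
    $SCSTADMIN -enable_target "$iqn" -driver iscsi &&
    $SCSTADMIN -set_drv_attr iscsi -attributes enabled=1
)
```

Usage: `setup_target iqn.2010-12.org.alpinelinux:tgt disk01 /mnt/array1/disk01`. Security settings such as allowed_portal are not part of the sketch and would have to be added before the driver is enabled.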
Reverting your changes
After enabling the driver, your iSCSI disk should be available to your initiator. You can use the same command with enabled=0 to disable the driver and take the target offline:
scstadmin -set_drv_attr iscsi -attributes enabled=0
Some other commands to break down the SCST config piece by piece:
To disable the target
scstadmin -disable_target iqn.2010-12.org.alpinelinux:tgt -driver iscsi
To remove the lun
scstadmin -rem_lun 0 -driver iscsi -target iqn.2010-12.org.alpinelinux:tgt -device disk01
To close a device
scstadmin -close_dev disk01 -handler vdisk_fileio
To remove a target
scstadmin -rem_target iqn.2010-12.org.alpinelinux:tgt -driver iscsi
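The teardown commands can be chained in the same order as listed above. Again a sketch only: teardown_target and the SCSTADMIN variable are names I made up, and the variable exists so the sequence can be printed in a dry run first.

```shell
# Command used to run scstadmin. Set SCSTADMIN='echo scstadmin' for a dry run
# that only prints the commands.
SCSTADMIN=${SCSTADMIN:-scstadmin}

# $1 = target IQN, $2 = device name.
# Takes the configuration apart in the reverse order of the setup and stops
# at the first command that fails.
teardown_target() (
    iqn=$1
    dev=$2
    $SCSTADMIN -set_drv_attr iscsi -attributes enabled=0 &&
    $SCSTADMIN -disable_target "$iqn" -driver iscsi &&
    $SCSTADMIN -rem_lun 0 -driver iscsi -target "$iqn" -device "$dev" &&
    $SCSTADMIN -close_dev "$dev" -handler vdisk_fileio &&
    $SCSTADMIN -rem_target "$iqn" -driver iscsi
)
```

Usage: `teardown_target iqn.2010-12.org.alpinelinux:tgt disk01`. Make sure no initiator is still using the disk before you run it for real.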
iSCSI MPIO
MPIO (Multi Path IO) is a way to transfer IO data over multiple links. iSCSI MPIO provides two advantages:
1. Failover in case one of the paths goes down (a cable is loose or defective, a NIC is defective, ...)
2. Increased throughput by sharing load over multiple paths.
Currently 10G network interfaces and switches are still relatively expensive compared to 1G networks. Depending on your setup you can configure your network interfaces accordingly. I will provide an example which uses ESXi 4.1 as initiator. The ESXi round-robin initiator can spread the SCSI commands evenly over each interface by configuring the iops parameter. For details on such a setup I suggest you read the following article [1]; for reference you can also look here [2] and [3].
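To illustrate what the iops parameter does, here is a toy model of round-robin path selection: the initiator sends a fixed number of commands down one path and then moves to the next. This is my own simplification and not ESXi's actual scheduler; rr_path and rr_sequence are names invented for this example.

```shell
# Path index used for command number $1 (counting from 0),
# with $2 commands per path before switching and $3 paths in total.
rr_path() {
    echo $(( ($1 / $2) % $3 ))
}

# Print the path index for each of the first $1 commands.
# $2 = commands per path (iops), $3 = number of paths.
rr_sequence() (
    commands=$1
    iops=$2
    paths=$3
    n=0
    out=
    while [ "$n" -lt "$commands" ]; do
        out="${out:+$out }$(rr_path "$n" "$iops" "$paths")"
        n=$(( n + 1 ))
    done
    echo "$out"
)

rr_sequence 8 1 4   # prints: 0 1 2 3 0 1 2 3
rr_sequence 8 2 4   # prints: 0 0 1 1 2 2 3 3
```

With iops set to 1 every consecutive command goes to a different path, which is how even a single stream of commands ends up spread over all links.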
Interfaces
In my current setup I have configured 4 interfaces to use for my iSCSI traffic. These 4 interfaces consist of 2 dual-port PCI Express add-on network cards. I am not including my onboard dual controller because most of these onboard controllers do not allow simultaneous data transfers. You can easily verify this by using bwm-ng while pushing lots of data over it. I can confirm this for my onboard e1000 and bnx2 interfaces, so I will use these interfaces for management traffic.
Bonding
I have tried bonding the iSCSI interfaces on the target. It worked and did increase performance, but with ESXi I got much (+/- 50%) better performance using plain IP networking. My advice is not to use bonding on the target side.
IP / Subnets
ESXi supports multiple network adapters in the same subnet. This kind of setup seems more logical to me and would be easier to set up, and you would end up with more paths to your storage because each interface can reach every remote interface, but I was not able to make this work in Linux. I have tried playing with ARP settings and route tables, but I never succeeded in sending traffic to a particular interface and receiving the reply from the same one. You can find some information on this topic here [4]. Somebody with better knowledge of routing/ARP issues, please correct me. While a single-subnet setup would provide more paths, the actual hardware paths would still total 4 Gbit, so I set up each of my NICs in its own subnet on both target and initiator.
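A one-subnet-per-NIC layout could look like the output of the sketch below, which prints stanzas for /etc/network/interfaces. Everything specific in it is an assumption for illustration: the function name iscsi_stanzas, the 10.0.N.0/24 subnets and the interface names are not taken from my actual setup.

```shell
# Print one static stanza per interface, each in its own /24 subnet
# (10.0.1.0/24 for the first interface, 10.0.2.0/24 for the second, ...).
# $1 = last octet of the address (e.g. 1 on the target, 2 on the initiator),
# remaining arguments = interface names.
iscsi_stanzas() (
    host=$1
    shift
    n=1
    for ifname in "$@"; do
        printf 'auto %s\niface %s inet static\n    address 10.0.%s.%s\n    netmask 255.255.255.0\n\n' \
            "$ifname" "$ifname" "$n" "$host"
        n=$(( n + 1 ))
    done
)
```

For example, `iscsi_stanzas 1 eth2 eth3 eth4 eth5` prints four stanzas for the target. The initiator side then gets an address in each of the same four subnets, so every path stays on its own pair of interfaces.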
Performance
If you configure this correctly you should be able to reach approximately 90MB/s per channel according to my speed tests. These tests are based on:
- ESXi 4.1
- 4 hardware paths
- Using Round-Robin disk with IOPS=1
- Windows 2008 server as guest
- Windows Raid0 disk setup (using 2 ESXi disks)
- CrystalDiskMark [5]
- Without jumboframes
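To put the per-channel figure in perspective, a bit of arithmetic helps. This is a rough sketch that assumes the paths add up linearly and that a 1 Gbit/s link carries at most 125 MB/s raw; the function names are made up for this example.

```shell
# Upper bound on aggregate throughput: paths * MB/s per path.
aggregate_mbs() {
    echo $(( $1 * $2 ))
}

# Share of a 1 Gbit/s link (125 MB/s raw) used by the given MB/s, in percent.
link_utilisation_pct() {
    echo $(( $1 * 100 / 125 ))
}

aggregate_mbs 4 90        # prints 360
link_utilisation_pct 90   # prints 72
```

So 4 paths at 90MB/s give at most about 360MB/s in total, and each link runs at roughly 72% of its raw line rate.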
Caution
If you are going to use iSCSI as one of your main storage solutions, you should seriously think about using redundant hardware.
- Use multiple network cards (physical cards)
- Use multiple switches (Assign your network cards to switches)
- Even better: Heartbeat with iSCSI failover DRBD