High performance SCST iSCSI Target on Linux software Raid

From Alpine Linux

Introduction

This HOW-TO focuses on performance. For that reason I made some decisions that favor performance over security (which is not what you are used to in Alpine Linux): we are not using grsec and we are not using a firewall. I presume you will take care of security at another level.

To get started, you can download a boot cd here:

http://alpine.nethq.org/distfiles/alpine-scst-110210-x86_64.iso

Vanilla Linux kernel with SCST patches

The default Linux kernel provides support for iSCSI. The problem with that implementation is that it operates in user-space. SCST iSCSI runs in kernel-space, and this is one of the reasons why it performs much better. SCST performance depends on specific patches which need to be applied to the kernel. This is why we created a separate kernel just for SCST usage. The SCST modules are included by default, so there is no need to install a separate module package.

P.S. We only provide an x86_64 kernel for SCST because it will perform better on 64bit systems.

Linux software raid

Hardware choice

In my setup I have four WD RE4 1TB drives connected to an mpt2sas-based controller. I'm using a Dell PowerEdge R510, which includes their most basic PERC H200 raid controller. I tried its RAID10 mode, but rebuilding the array takes more than 3 days. According to the mailing lists this is common for this adapter, so I chose not to use this kind of "hardware" raid. Performance in degraded mode is really poor, and my users would suffer from it for too long. There are also few or no ways to control this array from Alpine Linux: all LSI tools are built against glibc and will not work on uclibc hosts.

Raid level

I started looking for the raid level with the best performance that still offers redundancy. According to many mailing lists and the opinion of the Linux raid author, RAID10 with layout f2 (far) seems to perform best while still providing redundancy. Please remember that with RAID10 50% of your hard disk space goes to redundancy, but performance is almost the same as RAID0 (stripe). For the most up-to-date information regarding Linux software raid see: https://raid.wiki.kernel.org/index.php/Overview
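
To make the capacity trade-off concrete, here is a small sketch of the arithmetic, assuming four 1000 GB disks (as in the hardware section) and two copies of every block, which is what layout f2 keeps. The function name raid10_usable_gb is only illustrative and is not part of mdadm.

```shell
# Usable capacity of a RAID10 array: disks * size / copies.
# raid10_usable_gb is a hypothetical helper name, not part of mdadm.
raid10_usable_gb() {
    disks=$1      # number of member disks
    size_gb=$2    # size of each disk in GB
    copies=$3     # copies kept of every block (2 for layout f2)
    echo $(( disks * size_gb / copies ))
}

# Four 1000 GB disks with two copies of each block:
raid10_usable_gb 4 1000 2    # prints 2000
```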

Raid setup

Make sure we have the mdadm raid configuration tool installed

apk add mdadm

I am not using partitions on my disks, although there are reasons to use partitions, see here:

https://raid.wiki.kernel.org/index.php/Partition_Types

mdadm -v --create /dev/md0 --level=raid10 --layout=f2 --raid-devices=4 /dev/sda /dev/sdb /dev/sdc /dev/sdd

You can now monitor your raid (re)building: cat /proc/mdstat
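
If you would rather not read /proc/mdstat by eye, the following is a minimal sketch that lists arrays with a missing member. It assumes the usual mdstat layout, where the status map looks like [UUUU] and a missing member is shown as "_". The name md_degraded is made up for this example.

```shell
# List md arrays whose status map contains a missing member ("_").
# md_degraded is a hypothetical helper; it only reads the file it is given.
md_degraded() {
    awk '
        /^md[0-9]+ :/      { name = $1 }     # remember the current array
        /\[[U_]*_[U_]*\]/  { print name }    # e.g. [UU_U] means one member is down
    ' "${1:-/proc/mdstat}"
}

# Usage: md_degraded              (reads /proc/mdstat)
#        md_degraded /tmp/mdstat  (reads a saved copy)
```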

The rebuild speed is limited by default. The limits can be checked and changed here:

cat /proc/sys/dev/raid/speed_limit_max
cat /proc/sys/dev/raid/speed_limit_min
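
Changing a limit is done by writing a new value to the same file. Below is a minimal sketch; the value 50000 in the usage line is only an example, and RAID_SYSCTL_DIR is an assumed override added here so the function can be tried against a copy instead of the live kernel interface.

```shell
# Show and optionally raise the minimum md rebuild speed.
# RAID_SYSCTL_DIR is an assumed override for experimenting;
# by default the real kernel interface is used.
md_rebuild_speed() {
    dir=${RAID_SYSCTL_DIR:-/proc/sys/dev/raid}
    if [ -n "$1" ]; then
        echo "$1" > "$dir/speed_limit_min"    # needs root on the real path
    fi
    echo "min=$(cat "$dir/speed_limit_min") max=$(cat "$dir/speed_limit_max")"
}

# Usage (as root): md_rebuild_speed 50000
```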

Make sure the raid10 module is loaded at boot

echo raid10 >> /etc/modules

When you are happy with your raid configuration, save its information to the mdadm.conf file

mdadm --detail --scan >> /etc/mdadm.conf

This should add a line like the following to the file

ARRAY /dev/md0 metadata=1.2 name=scst:0 UUID=71fc93b8:3fef0057:9f5feec1:7d9e57e8

When your raid setup is ready and functioning, make sure it is started at boot time

rc-update add mdadm-raid default

Monitor software raid

Linux software raid can be monitored with mdadm's daemon (monitor) mode. Alpine Linux includes an init.d script which can invoke the daemon

/etc/init.d/mdadm

By default it will monitor the arrays defined in mdadm.conf. To receive email notifications about array issues, we need to provide our email address inside mdadm.conf:

MAILADDR me@inter.net

Because mdadm cannot send email itself, we need to set up a sendmail (replacement) program.

apk add ssmtp

Ssmtp can be configured by editing /etc/ssmtp/ssmtp.conf

You can monitor messages (syslog) for actions invoked by mdadm.

If you have your own monitoring system active, you can also let mdadm issue a script and notify it.
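
As a rough sketch of such a script: mdadm.conf accepts a PROGRAM line, and mdadm calls that program with the event name, the md device and, for some events, a component device. The script path /usr/local/bin/md-notify and the function name md_event_message are assumptions made up for this example.

```shell
# Format the arguments mdadm passes to a PROGRAM handler into one line.
# mdadm calls the program with: event name, md device, and sometimes a component device.
md_event_message() {
    event=$1
    array=$2
    component=${3:-none}
    echo "mdadm event=$event array=$array component=$component"
}

# In a handler script (hypothetical path /usr/local/bin/md-notify) this could be sent to syslog:
#   logger -t md-notify "$(md_event_message "$@")"
# and the script referenced from mdadm.conf with:
#   PROGRAM /usr/local/bin/md-notify
```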

Monitor hard disk with smartmontools

To keep your array healthy we can monitor our hard disks through their SMART interface. SMART can tell you at an early stage when hard disks begin to fail. Smartmontools includes a daemon called smartd which runs in the background and notifies you by email or script about any issues your hard disks have.

apk add smartmontools

By default your smartd.conf will include the DEVICESCAN line which will automatically scan your system and start monitoring all SMART capable devices. Because we want to make some changes to its default configuration we can comment it and include lines like these:

/dev/sda -a -d sat -m user@domain.tld -M test
/dev/sdb -a -d sat -m user@domain.tld -M test
/dev/sdc -a -d sat -m user@domain.tld -M test
/dev/sdd -a -d sat -m user@domain.tld -M test

The -a option will do most of the basic monitoring for you, the -d specifies the device type (in my case a SATA disk) and the -m is to tell smartd to email me any issues it may find regarding this disk. I have also included the -M test switch to let smartmontools email me a test email at startup to make sure my mail subsystem is working. If you want smartd to email, you will need to have a minimal smtp setup like we used above with ssmtp but also the mail client which is included in mailx.

apk add mailx
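
The smartd.conf lines shown earlier differ only in the device name, so they can be generated instead of typed. This is a minimal sketch using exactly the options from above; smartd_lines is a made-up helper name, and the output still has to be placed in smartd.conf by hand.

```shell
# Print one smartd.conf line per disk, using the same options as in this section.
# smartd_lines is a hypothetical helper; the first argument is the mail address.
smartd_lines() {
    mail=$1
    shift
    for dev in "$@"; do
        echo "$dev -a -d sat -m $mail -M test"
    done
}

smartd_lines user@domain.tld /dev/sda /dev/sdb /dev/sdc /dev/sdd
```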

When everything is set up we can start it and make sure it starts at boot.

rc-update add smartd default
/etc/init.d/smartd start

Disk & Volume management with SCST

Volume management can be an interesting addition to your block device(s), but it does add an extra layer between your block devices and SCST. If you need any of the features provided by LVM, then go ahead and use it.

Logical Volume Manager

LVM brings some very interesting features to Linux block devices:

  • Online resizing of volumes
  • Can be used on any software array (in contrast to partitions)
  • Snapshots (make it easier to create backups of block devices)
  • Many more great features

To setup LVM please follow Setting_up_Logical_Volumes_with_LVM

BlockIO vs FileIO

BLOCKIO accesses the block device md0 directly (without extra layers), while FILEIO uses regular files on a filesystem on top of the block device. The extra layer may seem like a disadvantage, but it actually isn't: the filesystem provides a caching layer, and in some situations you will see improved performance. According to the SCST developer it can lead to a performance increase of up to 10%.

Some speed tests [1]

Filesystem for FileIO

XFS has some very interesting features which match the use in our iSCSI environment like:

  • Automatically adjust to destination block device
  • Live growing (when filesystem is mounted. no shrink support)
  • Very good performance with large files

When using FILEIO we need to create a filesystem. To create an XFS filesystem we need xfsprogs installed

apk add xfsprogs

We also need XFS support in our kernel

modprobe xfs

And we need it next time we boot

echo xfs >> /etc/modules

File creation

On top of the filesystem we need to create the files which will be used as disks. These files can be regular or sparse files [2]. Creating large regular files with dd takes a long time. Sparse files are created instantly, but they can get heavily fragmented over time. Another solution is fallocate, which allocates the disk space without writing it out. fallocate is included in util-linux-ng.

apk add util-linux-ng

Now you can create your files like this

fallocate -l 10G disk1

And verify that it is correct with:

du -h disk1

This command should show the file is allocating all disk space.
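
The same check can be scripted by comparing the apparent size of the file with the space actually allocated for it. This is a minimal sketch, assuming a stat that supports -c (GNU coreutils or busybox); is_fully_allocated is a made-up helper name.

```shell
# Succeed only if the file has at least as many bytes allocated as its apparent size.
# is_fully_allocated is a hypothetical helper; it relies on stat -c.
is_fully_allocated() {
    info=$(stat -c '%s %b %B' "$1") || return 1
    set -- $info    # $1 apparent size, $2 block count, $3 bytes per block
    [ $(( $2 * $3 )) -ge "$1" ]
}

# Usage: is_fully_allocated disk1 && echo "disk1 is fully allocated"
```

A sparse file fails this check because its block count is far below its apparent size.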

SCST and iSCSI management

SCST is managed through the sysfs filesystem. You can create your own scripts to control it, or use the included scstadmin like I will show here.

For a detailed overview of the sysfs filesystem please check here:

http://lwn.net/Articles/378658/
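
As an illustration of what such a script could look like, here is a minimal sketch that reads the enabled state of the iSCSI driver and of one target directly from sysfs. It assumes the SCST sysfs root is /sys/kernel/scst_tgt; SCST_SYSFS is an override added only for this example, and scst_iscsi_state is a made-up name.

```shell
# Report whether the iSCSI driver and one target are enabled, read straight from sysfs.
# SCST_SYSFS is an assumed override for experimenting; the default is the SCST sysfs root.
scst_iscsi_state() {
    root=${SCST_SYSFS:-/sys/kernel/scst_tgt}/targets/iscsi
    target=$1
    # Attribute files may carry extra lines, so only the first one is used.
    echo "driver=$(head -n 1 "$root/enabled") target=$(head -n 1 "$root/$target/enabled")"
}

# Usage: scst_iscsi_state iqn.2010-12.org.alpinelinux:tgt
```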

When starting with SCST management we need to have the SCST framework (kernel module) and the iSCSI kernel module loaded

modprobe scst
modprobe scst_vdisk
modprobe iscsi_scst

After these are loaded we can start the iSCSI daemon

/etc/init.d/iscsi-scst start

This command will return some information based on the current configuration located in: /etc/scst.conf

The basic config when only having iSCSI loaded is:

TARGET_DRIVER iscsi {
        enabled 0
}

Adding a target

We start by adding a target to the correct target driver

scstadmin -add_target iqn.2010-12.org.alpinelinux:tgt -driver iscsi

The config at this point should be:

TARGET_DRIVER iscsi {
        enabled 0

        TARGET iqn.2010-12.org.alpinelinux:tgt {
                enabled 0
        }
}

Adding a device

Now that we have a target in our configuration we need to add a device:

scstadmin -open_dev disk01 -handler vdisk_fileio -attributes filename=/mnt/array1/disk01,nv_cache=1

The config at this point should be:

HANDLER vdisk_fileio {
       DEVICE disk01 {
               t10_dev_id "disk01 b8ceed65"
               usn b8ceed65

               filename /mnt/array1/disk01
               nv_cache 1
       }
}

TARGET_DRIVER iscsi {
       enabled 0

       TARGET iqn.2010-12.org.alpinelinux:tgt {
               enabled 0
       }
}

Adding a LUN

To add the device to the target we need to specify which LUN it will be. (we always need to start with 0)

scstadmin -add_lun 0 -driver iscsi -target iqn.2010-12.org.alpinelinux:tgt -device disk01

The config at this point should be:

HANDLER vdisk_fileio {
       DEVICE disk01 {
               t10_dev_id "disk01 b8ceed65"
               usn b8ceed65

               filename /mnt/array1/disk01
               nv_cache 1
       }
}

TARGET_DRIVER iscsi {
       enabled 0

       TARGET iqn.2010-12.org.alpinelinux:tgt {
               enabled 0

               LUN 0 disk01
       }
}

Enable the target

This is the default minimum configuration for a working iSCSI setup. We now need to activate it:

scstadmin -enable_target iqn.2010-12.org.alpinelinux:tgt -driver iscsi

The config at this point should be:

HANDLER vdisk_fileio {
       DEVICE disk01 {
               t10_dev_id "disk01 b8ceed65"
               usn b8ceed65

               filename /mnt/array1/disk01
               nv_cache 1
       }
}

TARGET_DRIVER iscsi {
       enabled 0

       TARGET iqn.2010-12.org.alpinelinux:tgt {
               rel_tgt_id 1
               enabled 1

               LUN 0 disk01
       }
}

SCST security settings

The latest version of SCST has changed its security configuration. Previously you had to edit files inside your etc directory, like initiators.allow, but since the move to sysfs it is controlled through sysfs. The iSCSI daemon can be configured via init.d to listen on only one interface, but in the case of MPIO you will have multiple interfaces to listen on. To make iSCSI listen only on the interfaces which you reserve for iSCSI traffic, you can add them as allowed portals like this:

scstadmin -add_tgt_attr iqn.2010-12.org.alpinelinux:tgt -driver iscsi -attributes allowed_portal=192.168.1.1

IP wildcards are specified with a question mark "?"

HANDLER vdisk_fileio {
       DEVICE disk01 {
               t10_dev_id "disk01 b8ceed65"
               usn b8ceed65

               filename /mnt/array1/disk01
               nv_cache 1
       }
}

TARGET_DRIVER iscsi {
       enabled 0

       TARGET iqn.2010-12.org.alpinelinux:tgt {
               rel_tgt_id 1
               enabled 1
               allowed_portal 192.168.1.1

               LUN 0 disk01
       }
}

Enable the iSCSI driver

Our configuration should be ok and the target is enabled. If you need to set any security groups or other settings please do them now. When you are ready we can activate the iSCSI driver:

scstadmin -set_drv_attr iscsi -attributes enabled=1

Reverting your changes

After this command, your iSCSI disk should be available in your initiator. You can use the same command with enabled=0 to disable the driver and take the target offline:

scstadmin -set_drv_attr iscsi -attributes enabled=0

Some other commands to break down the SCST config piece by piece:

To disable the target

scstadmin -disable_target iqn.2010-12.org.alpinelinux:tgt -driver iscsi

To remove the lun

scstadmin -rem_lun 0 -driver iscsi -target iqn.2010-12.org.alpinelinux:tgt -device disk01

To close a device

scstadmin -close_dev disk01 -handler vdisk_fileio

To remove a target

scstadmin -rem_target iqn.2010-12.org.alpinelinux:tgt -driver iscsi
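
The four teardown steps above can be collected into one helper. The sketch below only prints the commands, in the order they are listed in this section, so they can be reviewed before running them; scst_teardown_cmds is a made-up name, and LUN 0 and the vdisk_fileio handler are taken from the example configuration.

```shell
# Print the teardown commands from this section in the order they are listed.
# scst_teardown_cmds is a hypothetical helper; it only prints, it does not run anything.
scst_teardown_cmds() {
    target=$1
    device=$2
    echo "scstadmin -disable_target $target -driver iscsi"
    echo "scstadmin -rem_lun 0 -driver iscsi -target $target -device $device"
    echo "scstadmin -close_dev $device -handler vdisk_fileio"
    echo "scstadmin -rem_target $target -driver iscsi"
}

scst_teardown_cmds iqn.2010-12.org.alpinelinux:tgt disk01
```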

iSCSI MPIO

MPIO (Multi Path IO) is a way to transfer IO data over multiple links. iSCSI MPIO provides two advantages: 1. Failover in case one of the paths goes down (a cable is loose or defective, a NIC is defective...). 2. Increased throughput by sharing the load over multiple paths. Currently 10G network interfaces and switches are still relatively expensive compared to 1G networks. Depending on your setup you can configure your network interfaces accordingly. I will provide an example which uses ESXi 4.1 as initiator. The ESXi round-robin initiator can spread the SCSI commands evenly over the interfaces by configuring the iops parameter. For details on such a setup I suggest you read the following article [3], and for reference you can also look here [4] and [5].

Interfaces

In my current setup I have configured 4 interfaces for my iSCSI traffic. These 4 interfaces are provided by 2 dual-port PCI Express add-on network cards. I am not including my onboard dual-port controller because most of these onboard controllers do not allow simultaneous data transfers. You can easily verify this by pushing lots of data over them while watching bwm-ng. I can confirm this for my onboard e1000 and bnx2 interfaces, so I use those interfaces for management traffic.

Bonding

I have tried to bond the iSCSI interfaces on the target. Although it worked and did increase performance, ESXi achieved much better performance (roughly 50%) with plain IP networking (no bonding). My advice is not to use bonding on the target side.

IP / Subnets

ESXi supports multiple network adapters in the same subnet. This kind of setup seems more logical to me and would be easier to set up, and you would end up with more paths to your storage because each interface can reach every remote interface, but I was not able to make this work in Linux. I have tried playing with ARP settings and route tables, but I never managed to send traffic to a particular interface and receive the reply from the same one. You can find some information on this topic here [6]. Somebody with better knowledge of routing/ARP issues, please correct me. While a single-subnet setup would provide more paths, the total hardware bandwidth would still be 4 Gbit/s, so I set up each of my NICs in its own subnet on both target and initiator.

Performance

If you configure this right you should be able to reach approx. 90MB/s per channel according to my speed tests. These tests are based on:

  • ESXi 4.1
  • 4 hardware paths
  • Using Round-Robin disk with IOPS=1
  • Windows 2008 server as guest
  • Windows Raid0 disk setup (using 2 ESXi disks)
  • CrystalDiskMark [7]
  • Without jumboframes
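
To put the per-channel figure in perspective, here is a small sketch of the arithmetic. It assumes 1 Gbit/s links, so a raw line rate of 125 MB/s per path (1000 Mbit/s divided by 8). The aggregate is an estimate derived from the per-channel number, not a separate measurement, and both function names are made up.

```shell
# Aggregate throughput over several paths, and per-path efficiency against line rate.
# Assumes 1 Gbit/s links, i.e. 125 MB/s raw line rate per path.
aggregate_mb_s() { echo $(( $1 * $2 )); }                # paths * MB/s per path
link_efficiency_pct() { echo $(( $1 * 100 / 125 )); }    # measured MB/s as % of 125 MB/s

aggregate_mb_s 4 90        # prints 360
link_efficiency_pct 90     # prints 72
```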

Caution

If you are going to use iSCSI as one of your main storage solutions, you should seriously think about using redundant hardware.

  • Use multiple network cards (physical cards)
  • Use multiple switches (Assign your network cards to switches)
  • Even better: Heartbeat with iSCSI failover DRBD