Short stroking disk drives to improve performance

I was reading a post from Hans DeLeenheer (VEEAM) which ramped up quite a bit, including responses from Calvin Zito (HP), Alex McDonald (NetApp) and Nigel Poulton. The discussion started with a comment that X-IO had "special" firmware which improved IO performance. Immediately the term "short-stroking" came up, which leads one to believe X-IO is cheap-skating on their technologies. I was under the same impression at first, right until the moment I saw that Richard Lary is (more or less) the head of tech at X-IO, together with Clark Lubbers and Bill Pagano, who also come out of the same DEC stable. For those of you who don't know Richie, he's the one who ramped up Digital StorageWorks back in the late 70's/early 80's and also stood at the cradle of VAX-VMS. (Yeah yeah, I'm getting old, google it if you don't know what I'm talking about.)

Anyway, "short stroking" is often mentioned as the cause of high numbers in numerous performance reports, but what exactly is it?

To answer that we need to have a look at the layout of a disk drive (HDD). An HDD consists of one or more platters, an actuator with a few arms attached, a drive motor, cache, a transmission bus and some other smarts. The drive motor determines the rotational speed (RPM) and the actuator performance determines the speed of the heads moving back and forth over the platters.

The things that determine the performance of a drive are the following:

Rotational Delay
Seek time
Settling time
Cache
Bus speed

The bus technology is either SAS, SATA or Fibre Channel these days, of which the first two are the most common, especially in newer arrays. SAS does 12Gb/s and SATA 6Gb/s. The bus speed is hardly ever a factor in performance cases unless a very intensive sequential workload with very large blocks is required. Only in such corner cases will you see a performance bottleneck in that area. The same goes for the drive cache. The drive cache is not there to serve you as a user of that drive. In other words, you cannot influence the amount or behaviour of the cache. The cache is just there to let the drive accumulate enough data to make clever decisions on when and where to physically write the data on the respective platters and sectors. Each drive has its sweet spot and there are many mathematicians working on algorithms to determine that sweet spot. In almost all cases cache can be eliminated as a bottleneck in performance cases.

That leaves us with rotational delay, seek time, settling time, latency, head switch time and cylinder skew time.

Rotational Delay (Latency)

This is the easiest one to calculate. It is measured in milliseconds and is basically the time of one revolution: 60/RPM seconds, or 60,000/RPM milliseconds. So on a 10,000 RPM drive the head needs to wait a maximum of 6ms for a sector to arrive under the head before it can read that sector. For a 15K RPM drive that drops to 4ms. These are maximum numbers, so the average is half of these.
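The arithmetic above can be sketched in a few lines of Python. This is just a napkin calculator. The function name is my own and is not taken from any drive vendor tool.

```python
def rotational_delay_ms(rpm):
    """Return (maximum, average) rotational delay in milliseconds."""
    full_rev_ms = 60_000.0 / rpm      # one minute is 60,000 ms
    return full_rev_ms, full_rev_ms / 2.0

for rpm in (7200, 10_000, 15_000):
    worst, avg = rotational_delay_ms(rpm)
    print(f"{rpm:>6} RPM: max {worst:.2f} ms, avg {avg:.2f} ms")
```

The worst case is one full revolution (the sector has just passed the head). On average you wait half a revolution.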

Seek time

The seek time is the time required for the actuator to position the head over the right track on the platters. The numbers are often given as separate average seek times for reads and writes. You may ask why, but there is actually a difference between a read head and a write head.

Settling time

Settling time is the time it takes before the read or write from or to that sector starts. The difference between read and write heads is size. A read head is much smaller than a write head, so it can start reading data from the platter even before it has settled 100% over the track. You cannot do this with writes. The write head is bigger, since it needs to transfer more energy in order to change the magnetic order of the recording surface, and it needs to settle exactly above the track before it can start. This is not only a measure to guarantee that the data is actually written to that track but also to prevent overwriting adjacent tracks, which would be disastrous.

Latency

Latency is the sum of seek time and rotational delay, i.e. the time it takes before a read or write operation on a drive can start. This is the number that is mentioned on most spec sheets.
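To put a rough number on this, here is a minimal sketch that adds an average seek time to the average rotational delay and turns the result into a ballpark random IOPS figure. The 4ms seek time is a made-up example value, not a spec of any particular drive, and the IOPS estimate ignores transfer time, queuing and caching, so treat it as an upper-bound napkin number only.

```python
def avg_latency_ms(avg_seek_ms, rpm):
    """Average latency = average seek time + average rotational delay."""
    avg_rotational_ms = 60_000.0 / rpm / 2.0   # half a revolution on average
    return avg_seek_ms + avg_rotational_ms

def rough_random_iops(avg_seek_ms, rpm):
    """Ballpark random IOPS for one spindle, one outstanding IO at a time."""
    return 1000.0 / avg_latency_ms(avg_seek_ms, rpm)

# Hypothetical 10K RPM drive with a 4 ms average seek time.
print(avg_latency_ms(4.0, 10_000))             # 7.0 ms
print(round(rough_random_iops(4.0, 10_000)))   # roughly 143 IOPS
```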

Then we have two numbers that you will likely never see on spec sheets but are very important for mechanical engineers, algorithm number crunchers and firmware developers.

Head switch time

The head-switch time is the time needed to switch between the heads on the top and bottom side of a platter on the same track. If you thought that the tracks on the two recording surfaces were 100% aligned, you are correct in principle. However, due to external factors like temperature and vibration, the lack of dedicated servo-control tracks (remember the good old days??) and the enormous increase in areal density, the head needs to re-settle on the respective track. This requires far more time than a track-to-track seek, since the latter largely maintains a servo lock on track position.

Cylinder skew time

Cylinder skew time is the length of time between the index markers on adjacent tracks in the same zone (more on zones later). Each track on a disk has an index marker which, more or less, marks the start of the track. If these markers were at the same position on every track, a read or write that needs to move to the next track would have to wait nearly a full rev in order to determine over which sector on that track the head is actually located. If the index marker is skewed by such a distance that the head lands right on top of it after leaving the last sector on the adjacent track, you do not need to wait for the full revolution. The distance between the last sector of a track and the index marker on the next track needs to be long enough to accommodate a full track-to-track seek plus settling time.
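As an illustration of that last sentence, the skew can be expressed as the number of sectors that pass under the head while it seeks to the next track and settles. The sketch below assumes evenly sized sectors on a track. The function name and the example figures (1200 sectors per track, 0.5ms seek plus settle) are hypothetical and not taken from a real drive.

```python
import math

def cylinder_skew_sectors(rpm, sectors_per_track, seek_plus_settle_ms):
    """Minimum skew, in whole sectors, so the head arrives before the index marker."""
    rev_ms = 60_000.0 / rpm
    # Sectors passing under the head during the seek, rounded up to a whole sector.
    return math.ceil(seek_plus_settle_ms * sectors_per_track / rev_ms)

# Hypothetical 10K RPM drive: 1200 sectors per track, 0.5 ms to seek and settle.
print(cylinder_skew_sectors(10_000, 1200, 0.5))   # 100 sectors
```

Anything less than this and the head just misses the start of the next track, which costs almost a full revolution.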

As mentioned these last two numbers are hardly ever seen on spec sheets.

Zones

Since the outer tracks on disk drives are physically longer than the inner tracks, the surface area available for recording is also larger. Given that the RPM is fixed, this also means that the linear velocity (LV) on the outer tracks is much higher than on the inner tracks. Doing some napkin calculations: on a 3.5" disk drive (where the outer track has a diameter of around 3") the length of a track is around 24cm (3" * pi * 2.54cm/inch = ~24cm). When this track spins at 10,000 RPM you get 0.24m * 10,000 / 60 seconds = 40 meters/s, or ~144 km/h. (Taking into account that the head floats at around 3 nanometers above the surface, that is comparable with flying an Airbus A380 from San Francisco to New York at cruise speed at an altitude of ~1 meter above the ground. Pretty impressive.) On the inner tracks the track length is shorter (1" * pi * 2.54 = ~8cm), which results in a lower linear velocity (0.08m * 10,000 / 60 seconds = ~13 meters/s, or ~48 km/h).

This allows for a staged increase in recording density from the outer to the inner tracks. If we used the same areal density on the outer tracks as on the inner tracks, the read and write heads would not have enough time to read or write the data to the sectors. If we used the same sector size on the inner tracks as on the outer tracks, we would lose a fair amount of recording capacity. This leads us to divide the tracks into zones, where we can effectively use the same sector size on a certain number of tracks before changing the sector length to accommodate the lower or higher LV.
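The napkin calculation can be checked with a small Python sketch. The 3" and 1" track diameters are the same rough figures used above, not measurements of a specific drive.

```python
import math

def linear_velocity(track_diameter_inch, rpm):
    """Return the linear velocity of a track as (metres per second, km per hour)."""
    circumference_m = track_diameter_inch * 2.54 / 100.0 * math.pi
    metres_per_second = circumference_m * rpm / 60.0   # revolutions per second
    return metres_per_second, metres_per_second * 3.6

for name, diameter in (("outer", 3.0), ("inner", 1.0)):
    mps, kmh = linear_velocity(diameter, 10_000)
    print(f"{name} track: {mps:.1f} m/s, {kmh:.0f} km/h")
```

With these figures the outer track moves about three times as fast under the head as the inner track.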

The recording density in these zones differs in bits per inch (BPI). Due to a number of factors the tracks towards the centre of the disk have the highest recording density (and not the outer tracks), and thus you have a trade-off in deciding which tracks are used most effectively for what.

A last technical thing

Disk drives use serpentine recording in each zone. This means that a drive will use an entire zone on one side of the platter before it does a head-switch to the other side of the platter in the same zone. As mentioned, this is to avoid the delay of track seek time plus settling time as much as possible. The goal is to have a continuous LBA range with the least amount of latency due to seek and settling time.

Short stroking

Going back to the subject. As you've seen, using the most effective zone w.r.t. recording density and linear velocity is very beneficial from a performance perspective. The concept of short-stroking is to artificially restrict the use of the drive to just a few zones in order to limit head movement (and therefore seek time) as much as possible. Obviously this also has the nasty side-effect that the capacity is restricted to the capacity of the zones configured.

Even in your home PC you can create a similar scenario by creating a partition restricted to the first few zones of the disk. The size of the partition is first and foremost determined by your capacity requirements. No use in creating a 100MB partition when you need 200GB. The easiest method, however, is to use a tool called "hdtune" which can be downloaded here. Basically it runs a set of tests to try and determine what the optimal partition size and layout will be in order to achieve the best performance. When you start the short-stroke test it will show you at which point the performance drops off significantly. You can then create a partition within these boundaries and there you have your free performance boost. Don't think of creating other partitions on that same drive and using them, because that would result in the disk head still needing to move to the other edges of the disk, rendering the entire setup useless.
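To get a feel for the trade-off, here is a small simulation sketch of random accesses on a full drive versus a short-stroked one. It assumes 5000 tracks in total and uniformly random access, and it counts head travel in tracks rather than milliseconds, since the real seek curve of a drive is not linear. All names and numbers are illustrative only.

```python
import random

def mean_seek_distance(total_tracks, used_fraction, samples=100_000, seed=1):
    """Average head travel (in tracks) between successive random accesses."""
    rng = random.Random(seed)                 # fixed seed for repeatable runs
    usable = max(1, int(total_tracks * used_fraction))
    travelled = 0
    position = rng.randrange(usable)
    for _ in range(samples):
        target = rng.randrange(usable)
        travelled += abs(target - position)
        position = target
    return travelled / samples

full = mean_seek_distance(5000, 1.0)    # whole drive in use
short = mean_seek_distance(5000, 0.2)   # only the first 20% of the tracks in use
print(f"full stroke: ~{full:.0f} tracks, short stroke: ~{short:.0f} tracks")
```

The average travel comes out at roughly a third of the range in use, so giving up 80% of the tracks cuts the average head movement by about the same factor.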

"So what causes this drop-off to occur?" you might ask. Well, basically it is the flip point where the track seek time exceeds the rotational delay of one revolution, so the head needs to stay on that same track for another revolution before the correct sector passes under it, plus the head-switch time when it needs to change platter surface.

As an example, take a disk with 20 zones which each contain 250 tracks. The seek time for the head to move from track 0 to track 1000 and back to 0 is just short of one rotational delay. This means that even if you needed to move from track 0 to track 998 and then back to 0, you would still be in the situation where you do not need to wait for an entire revolution for the correct sector to land under the head. If you need to go from track 0 to track 1050 and back, you would most likely just miss the correct sector and need to wait almost an entire rev which, on a 10K drive, is ~6ms and on SATA (7200 RPM) easily exceeds 8ms.
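The cliff in this example can be modelled with a very simple sketch: if the seek finishes before the target sector comes around you lose nothing, and if it finishes just after, you pay a whole extra revolution. The 5.4ms, 5.5ms and 5.6ms figures are invented to illustrate the effect and do not come from a real drive.

```python
import math

def time_to_target_ms(seek_ms, target_offset_ms, rpm):
    """Time until the target sector can be read, counted from the start of the seek.

    target_offset_ms is when the target sector would first pass under the head.
    """
    rev_ms = 60_000.0 / rpm
    if seek_ms <= target_offset_ms:
        return target_offset_ms               # head arrived in time
    # Head arrived late: wait for the sector to come around again.
    missed_revs = math.ceil((seek_ms - target_offset_ms) / rev_ms)
    return target_offset_ms + missed_revs * rev_ms

# 10K RPM drive, target sector passes under the head after 5.5 ms.
print(time_to_target_ms(5.4, 5.5, 10_000))   # 5.5  (seek just made it)
print(time_to_target_ms(5.6, 5.5, 10_000))   # 11.5 (missed by 0.1 ms, costs a full rev)
```

A seek that is only 0.2ms longer more than doubles the time to data, which is exactly the drop-off the short-stroke test shows.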

Obviously there are some additional smarts which is handled by the drive firmware and buffer-space which can adjust and re-order the read- and write commands to optimize some of this. Sooner or later you will hit the edges of these capabilities and performance will drop.

There are numerous great resources on the web which explain disk technology. One of them is from Artem Rubtsov.

Some of the "must-have" books when it comes to storage performance are

[amazon template=quick link&asin=0133390098]
[amazon template=quick link&asin=076451685X]
[amazon template=quick link&asin=0971786186]

Have a look.

Regards,

Erwin

alpharob

28/10/2014 at 00:04

“Obviously there are some additional smarts which is handled by the drive
firmware and buffer-space which can adjust and re-order the read- and
write commands to optimize some of this.”

Some or all of this? TCQ and analogous NCQ changes all of this. Tracks will switch and get that nearby read, never mind it came on to the queue much later.. as you know. Maybe a guess of the firmware that Richie and crew came up with is to move the hotter blocks to faster tracks. I seriously doubt they improved on TCQ/NCQ algorithms. But who knows.

Erwin van Londen
03/11/2014 at 11:00

Seems Gavin’s response below gives some clues on the usage. It doesn’t seem to be NCQ/TCQ but I’m not 100% sure how the calculation is done.


Erwin van Londen

03/11/2014 at 11:08

Thanks Gavin for the response. As I mentioned above, I’m not 100% sure how the calculation is done. I think it’s a bit too over-simplified to say you get 16000 IOPS out of a subsystem and divide that number by the number of spindles in the system. I also mentioned in my post that utilizing every surface of a spindle does not improve performance per se. It will most likely decrease if not handled correctly, due to the increased delay caused by the head-switch time, which incurs an additional settling time.

If you can elaborate a bit more that would be much appreciated. As far as I can determine now from your comments, the proclaimed number of 400 IOPS per spindle is only substantiated in an overall 40 spindle subsystem.