Redundant on a 457

In the storage world the word “redundant” means that your data is safe: it is copied multiple times, backed up, and kept in more than one place, so if you make a mistake or something happens to your data it can be retrieved from one of those other copies. That more or less lets you sleep at night when you have worked on a large project for a couple of months, because losing the primary copy is no big deal.


However, in the Human Resources world the word “redundant” has a totally different meaning. It more or less means that if somebody from HR approaches you and asks if you have a couple of minutes, it is time to get worried. In HR terms the word “redundant” carries a 99% chance that you'll be out the door before you can have your next cup of coffee.

So here you are, living in a foreign country (Melbourne, Australia) as a highly skilled storage architect, and at a certain moment you're told that your position has been made redundant. Now if you're an Australian citizen and/or on a permanent visa you take your severance pay, register with a lot of recruitment agencies, and probably within one or two months you're back in the game.

Things start to change when you're on a temporary business sponsorship visa (a so-called 457), because that one is cancelled right away as well and you have about 28 days to pack your things and leave the country (whereto? that is your problem, as long as you're not within Australian borders). So when you are in my position, with a wife and kids who go to school here and have a flourishing social life, you can imagine this has quite some impact. You have to sell your house, get rid of the lease car and unwind all the other commitments you've taken on in the period you've worked here. In my case, when I sum it all up, it's around AUD$70,000 that I lost in one day, and that is a lot of money.

To go back to the Australian policies: your profile matches at least two of the IT skill sets Australia says it wants most (have a look at the Department of Immigration website and search for MODL), yet there is no arrangement for you to get some sort of bridging visa to stay in the country and apply for a permanent one.

The only option is to get a new job with a company who is willing to sponsor you for a new 457 visa (remember this all has to take place within the notice period plus the 28 days the government grants you for packing up your stuff). Since this option is not really appealing to companies, given the responsibilities they have to take on (buy you a couple of plane tickets back home), most of them are not keen on doing it. The main reason being the “hassle” that is involved (sign two documents and pay around $500.00 in admin fees). The exact wording is on the Department of Immigration website as well. Click here

This all happened back in 2009, so for me it was a very exciting time, but I hope it never repeats itself. Anyway, I managed to circumvent the entire situation because other departments in my company did want/need my skills, and as such my visa didn't need to be cancelled after all. In the meantime I was able to get a permanent visa, so besides removing the stress of visa termination when your position is made redundant, a permanent visa gives you some more options w.r.t. social security and medical care here in Oz.

So why is this 457 so dangerous? This temporary visa was introduced by the Australian government after many complaints from large international companies who had massive problems getting workers from overseas to do a temporary job in Australia. Think of companies who need temporary expertise from elsewhere in the world to complete a project in a relatively short timeframe. Previously they had to apply for a permanent visa which, by regulation, requires a lot more scrutiny and therefore takes a lot longer to get. This became an economic burden since many projects in Australia took far longer than necessary. To overcome this situation the Australian government introduced a short-term, very restrictive visa with the well-known number 457. This one is valid for 4 years, it allows you to work and live in Australia, and that's about it.

You're not eligible for any social security arrangements, government-funded medical arrangements (Medicare) or anything else besides basic human rights. (So yes, you are allowed to call the police when there are burglars in the house. :-))

The intention of this visa was, as I described above, to let companies complete their projects, after which the workers would return to their original country and rejoin their old position with that company. The problem now is that almost every company, and even recruiting agencies, offer this 457 to lure people from overseas to come and work in Australia. They do not tell you all the things I just wrote up. So many folks believe they are granted at least a 4-year stay, but this may not be the case; it can be as short as 2 months if you're really unlucky.

To give some advice:

1. If you're single and willing to take the risk of working for a maximum of 4 years in Australia, you're good to go on a 457.
2. If you do enter the country on a 457 visa you have to take out private medical insurance, unless this is taken care of by your company.
3. Do not enter into long-term financial commitments like buying a house, leasing a car for an extended period or other financial transactions which might get you into serious problems. (Believe me, it will almost kill you when you get “the message”.)
4. Live on “inflatable” furniture, because an international move back costs a lot of money. The less you have to take back home, the cheaper it will be of course.
5. If you have a family which you want to take with you, wait for a permanent visa. It's just not worth the risk, stress and hassle that will follow when your employer cancels your sponsorship (another word for being made redundant) and your visa stops in 28 days.
6. Also remember that in addition to your employer having to inform the Department of Immigration of your status change, you are obliged to do the same. If you don't, you're treated as an illegal immigrant, which will most likely kill any chance of obtaining a permanent visa in the future.
7. If you're being sent here by your overseas employer, make sure they provide you with written confirmation that your job is still available when you return.
8. If you've been on the job for 2 years on a 457 visa and your employer still wishes to keep you, ask them immediately to assist you in applying for a permanent visa. They might even be willing to pay for it; if your skills are good, most of the time they will cooperate. The cost for my family was around AUD$6,500, which included all paperwork, levies, and the fees for the immigration agent. In addition you are required to do an English language test if you're not from a native English-speaking country, which is around $350 per adult. A medical examination is also required for ALL applicants, including children, which added another $1,200 to the bill.
9. You can apply for the visa yourself, however it's much more effective to hire an agent to take care of the paperwork. The problem is that many questions might be incorrectly interpreted by you. If your application is received by the Department of Immigration and it has errors or irregularities in it, your application will be sent back and you'll end up at the bottom of the pile again. On average the waiting time is between 6 months and a year, so you're better off with an agent.

If you work in Australia your employer is obliged to deduct a so-called Medicare levy. This is the premium you have to pay for medical coverage under the public medical system; however, when you're on a 457 you're not eligible for this coverage, and as such you can claim an exemption for the full 1.5% of your taxable income. Secondly, the Australian government more or less encourages private health insurance and allows you to deduct 30% to 40% (depending on age) of your premiums from your taxable income.
If you have children who need school necessities like books, uniforms, stationery etc., you are NOT allowed to deduct those from your taxable income. As I said before, you don't get any social benefits on a temporary visa.

Don't let the above stop you from the experience of living and working in Australia. It's a great country with lots to do and see, but make sure you are fully covered for the possibility that your 457 gets cancelled for whatever reason.

Kind regards,
Erwin

SCSI UNMAP and performance implications

When listening to Greg Knieriemen's podcast on Nekkid Tech there was some debate on VMware's decision to disable the SCSI UNMAP command in vSphere 5.something. Chris Evans (www.thestoragearchitect.com) had some questions about why this happened, so I'll try to give a short description.

Be aware that, although I work for Hitachi, I have no insight into the internal algorithms of any vendor, but the T10 (INCITS) specifications are public and every vendor has to adhere to these specs, so here we go.

With the introduction of thin provisioning in the SBC-3 specs a whole new can of options, features and functions came out of the T10 (SCSI) committee, which enabled applications and operating systems to do all sorts of nifty stuff on storage arrays. Basically it meant you could give a host a 2TB volume whilst in the background you only had 1TB physically available. The assumption with thin provisioning (TP) is that a host or application won't use that 2TB in one go anyway, so why pre-allocate it.

So what happens is that the storage array provides the host with a range of addressable LBAs (Logical Block Addresses) which the host can use to store data. In the back-end on the array these LBAs are only allocated upon actual use. The array has one or more so-called disk pools where it can physically store the data. The mapping between the “virtual addressable LBAs” which the host sees and the back-end physical storage is done by mapping tables. Depending on the vendor's implementation, certain “chunks” out of these pools are reserved as soon as one LBA is allocated. This prevents performance bottlenecks from a housekeeping perspective since the array doesn't need to manage each single LBA mapping. Each vendor has different page/chunk/segment sizes and different algorithms to manage these, but the overall method of TP stays the same.
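
If you're curious what your own LUNs report about all this, a Linux host with the sg3_utils package installed can query the relevant SBC-3 pages directly. Treat the following as a rough sketch only: /dev/sdX is a placeholder for whatever device your array presents, and the exact output fields depend on your sg3_utils version and the array's implementation.

## Sketch: inspect what a LUN reports about thin provisioning (requires sg3_utils).
## /dev/sdX is a placeholder for the device your array presents to the host.

# READ CAPACITY(16): the lbpme bit tells you whether logical block provisioning
# (thin provisioning) is enabled on this LUN.
sg_readcap -l /dev/sdX

# Logical Block Provisioning VPD page: shows whether UNMAP and WRITE SAME with
# the unmap bit are supported, and what the provisioning type is.
sg_vpd --page=lbpv /dev/sdX

# Block Limits VPD page: shows the maximum unmap LBA count and the optimal
# unmap granularity, which usually reflects the vendor's internal
# page/chunk/segment size discussed above.
sg_vpd --page=bl /dev/sdX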

So let's say the segment size on an array is 42MB (:-)) and an application writes to an LBA which falls into this chunk. The array updates the mapping tables, allocates cache slots and does all the other housekeeping that comes with an incoming write IO. From that moment on the entire 42MB is allocated to that particular LUN which is presented to that host. Any subsequent write to any LBA which falls into this 42MB segment is just a regular IO from an array perspective; no additional overhead is needed or generated w.r.t. TP maintenance. As you can see this is a very effective way of maintaining an optimum capacity usage ratio, but as with everything there are some things you have to consider as well, like over-provisioning and its ramifications when things go wrong.

Let's assume that is all under control and move on.

Now what happens when data is deleted or no longer needed? Let's assume a user deletes a file which is 200MB big (a video for example). In theory this file occupied at least 5 TP segments of 42MB. But since many filesystems are very IO-savvy, they do not scrub the entire file back to zero; they just delete the FS entry pointer and remove the inodes from the inode table. This means that only a couple of bytes have effectively been changed on the physical disk and in array cache.
The array has no way of knowing that these couple of bytes, which have been returned to 0, represent an entire 200MB file, and as such the segments remain allocated in cache, on disk and in the TP mapping table. This also means that these TP segments can never be re-mapped to other LUNs for more effective use if needed. There have been some workarounds for this, like host-based scrubbing (putting all bits back to 0), defragmentation to re-align all used LBAs and scrub the rest, and some array-based solutions which check whether segments contain only zeroes and, if so, remove them from the mapping table and make them available for re-use.

As you can imagine this is not a very effective way of using TP. You can be busy clearing things up on a fairly regular basis so there had to be another solution.

So the T10 friends came up with two new things, namely “write same” and “unmap”. WRITE SAME does exactly what it says: it issues a single write command with a block of data and tells the array to replicate that pattern across a whole range of LBAs. The array executes this internally, offloading the host from issuing every individual write so it can do more useful stuff than pushing identical bits back and forth between itself and the array. This can be very useful if you need to deploy a lot of VMs, which by definition have a very similar (if not exactly the same) on-disk pattern. The other way around it has a similar benefit: if you need to delete VMs (or just one), the hypervisor can instruct the array to clear all LBAs associated with that particular VM, and if the UNMAP command is used in conjunction with WRITE SAME you basically end up with the situation you want. The UNMAP command tells the array that certain LBAs are no longer in use by this host and can therefore be returned to the free pool.
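
To make this a bit more tangible, here is roughly what these commands look like when issued by hand from a Linux host with sg3_utils and util-linux installed. This is a sketch only: /dev/sdX and /mnt/datastore are placeholders, and these commands destroy data on whatever range you point them at, so don't run them against anything you care about.

## Sketch: issuing WRITE SAME and UNMAP manually from a Linux host.
## /dev/sdX and /mnt/datastore are placeholders. Destructive, handle with care.

# WRITE SAME: repeat one block's pattern across 2048 LBAs. With --unmap set
# the array may deallocate the range instead of physically writing it.
sg_write_same --unmap --lba=0 --num=2048 /dev/sdX

# UNMAP: tell the array these 2048 LBAs are no longer in use and can be
# returned to the free pool.
sg_unmap --lba=0 --num=2048 /dev/sdX

# The higher-level equivalents most admins would use: discard a byte range on
# a block device, or let the filesystem send UNMAP for all of its free space.
blkdiscard --offset 0 --length 1G /dev/sdX
fstrim -v /mnt/datastore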

As you can imagine, if you just use the UNMAP command this is very fast from a host perspective and the array can handle it very quickly, but here comes the catch. When the host instructs the array to UNMAP the association between the LBAs and the LUN, it is basically only a pointer in the mapping table that is removed; the actual data still exists, either in cache or on disk. If that same segment is then re-allocated to another host, in theory that host can issue a read command to any given LBA in the segment and retrieve the data that was previously written by the other system. Not only can this confuse the operating system, it also implies a huge security risk.

In order to prevent this, the array has one or more background threads that clear out these segments before they are effectively returned to the pool for re-use. These tasks normally run at a pretty low priority so as not to interfere with normal host IO. (Remember that it is still the same CPU(s) that have to take care of this.) If the CPUs are fast and the background threads are smart enough, under normal circumstances you will hardly see any difference in performance.

As with all instruction-based processing, the work has to be done either way, be it by the array or by the host. So if there is a huge amount of demand, where hypervisors move a lot of VMs around between LUNs and/or arrays, there will be a lot of deallocation (UNMAP), clearance (WRITE SAME) and re-allocation of these segments going on. It depends on the scheduling algorithm at which point the array decides to reprioritise between the background and front-end processes, and at that point there will be a delay in the status response to the host. On the host it looks like a performance issue, but in essence what you have done is overload the array with commands whose work would normally (without thin provisioning) have to be done by the host itself.

You can debate whether a larger or smaller segment size would be beneficial, but it doesn't really change the picture: with a smaller segment size the CPU has much more overhead managing mapping tables, whereas with bigger segment sizes the array needs to scrub more space on deallocation.

So this is the reason why VMware disabled the UNMAP command in this patch: a lot of “performance problems” were seen across the world when this feature was enabled. Given that it was VMware that disabled it, you can imagine that arrays from multiple vendors were impacted in some sense; otherwise they would have been more specific about array vendors and types, which they haven't been.

OpenDNS with DNS-O-Matic

A while ago I wrote a short article about a nice way to “secure”, or at least monitor, my children's web behaviour, called OpenDNS. I soon found out that you have at least one problem, and that is the dynamic IP address your ISP hands you when you link up your router. The problem is these are never the same and the DHCP lease lifetime is 0 seconds, so even on a small link bounce of 2 or 3 seconds you get a new IP address on your WAN side.

This renders the security features of OpenDNS (DNS domain blocking) more or less useless, since the DNS queries made from one of your PCs on the LAN side now reach OpenDNS from a different public IP address, and OpenDNS can therefore not link that address to your profile.

So let's take an example:
Your internal LAN is using 10.1.2.0/24 and is NAT-ed on your router to the outside world. Your ISP provides you with an address of, let's say, 152.43.50.2.

On the OpenDNS website you create a profile called “My Home network” and you link this address to the profile. The profile also allows you to block certain websites manually, or entire categories like Adult, Weapons, Gambling etc., so all in all important to keep this away from your children.
Now when one of your computers does a DNS query, OpenDNS takes the source address (i.e. your public IP address 152.43.50.2), links it to your profile to check whether the requested page/domain matches one of the criteria you configured, and if the action for that site is “block” it redirects you to a page which simply explains why the site is blocked. You can customise this page as well.

The problem, however, is that if your ISP-provided address changes, OpenDNS can no longer link this WAN address (152.43.50.2) to your profile and will just return the IP address of that site, after which your computer connects to it and shows the page.

This so-called dynamic IP address problem is also acknowledged by OpenDNS, and their recommendation in these cases is to install a little tool which at regular intervals checks whether this address has changed and, if it has, updates your OpenDNS profile with the new address. “Problem solved” you might say. Well, not exactly. The problem is that this little tool has to be installed on a PC which runs either Windows or MacOS. Secondly, this PC has to be secured from tampering, since kids become smarter as well and it gives them the option to just remove the tool or fumble around as they see fit, which in essence renders it useless.

I also don't want too many of these tools installed on PCs; since I'm seen as the household admin I want to do as little as possible. Admins should be lazy, it improves effectiveness. 🙂 I decided not to use this agent, which put me in some sort of catch-22 situation. Again, I should be lazy from an admin standpoint, so I don't have the time nor the urge to check the OpenDNS website every 10 minutes to see if my address has changed. So I worked something out with another service from OpenDNS called DNS-O-Matic (DOM). This service allowed me to write a simple script which enabled me to automate the entire process.

So in my case I've done the following.
I have an OpenDNS account with a network profile which blocks certain categories of websites.
Next to that I created a DOM account and linked the OpenDNS service to the DOM account. This basically means that if I update DOM with my new, ISP-provided IP address, it will propagate this to my OpenDNS account. (DNS-O-Matic provides many more services to link to, but I leave it up to you to check this out.)

Now you might say “How does this fix things?”. Well, the solution is easy. DOM provides a simple API which you can write a script or program against. This allows you to update DOM automatically via this API, which in turn updates your OpenDNS profile with your new IP address. So the first thing you need to do is obtain your current IP address. If you query the OpenDNS servers for the name myip.opendns.com, they will always return your actual (ISP-provided) IP address. (This is basically the source address to which the OpenDNS service returns its answers.)
Next thing you need to do is to verify if this address is the same as your “old” address and if not, update DOM with this new address.

I made a little script which I hooked up to cron so it does this for me automatically every 5 minutes.

#!/bin/bash
## Script to update OpenDNS and DNS-O-Matic
## Check www.dnsomatic.com. opendns is linked to this.
##
## Documentation
## https://www.dnsomatic.com/wiki/api
##
##
## This script runs in cron every 5 minutes.

## First get your public IP address
ip=$(dig @208.67.222.222 myip.opendns.com +short)
## Get my IP I know I use to have from a hidden file
oldip=$(cat /home/erwin/.oldip)

## If needed update the IP address on the web. If not do nothing.
if [ "$ip" != "$oldip" ]
then

## Quote the URL so the shell doesn't treat the &'s as background operators.
## Your DNS-O-Matic username and password go in front of the @ sign.
curl "https://:@updates.dnsomatic.com/nic/update?hostname=all.dnsomatic.com&myip=$ip&wildcard=NOCHG&mx=NOCHG&backmx=NOCHG"

## Write the new IP address to the hidden file again.
echo "$ip" > /home/erwin/.oldip

fi

That's it. I'm sure this can be achieved on Windows as well with batch files, cmdlets or VBScript, but I just had bash at hand.

My crontab entry looks like this:

*/5 * * * * /home/erwin/Desktop/scripts/DNS-O-Matic/update.sh

And it works perfectly I must say.

Now there are two “Gotchas”:

  1. How do you prevent the kids from simply choosing another DNS service, like the default ones that come from your ISP?
  2. This still requires you to have a computer online.

The answer to 1 is to create a redirect rule in your router firewall so that every DNS query (UDP port 53) is directed to OpenDNS. And the answer to 2 is “You are correct :-)”. Since I work from home my Linux box is always on (at least during the time I'm working and during the time my kids are allowed on the net).
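
If your router or firewall happens to be a Linux box, that redirect rule can be as simple as a NAT rule which rewrites every outbound DNS query to the OpenDNS resolvers. The following is only an illustration; eth1 as the LAN-facing interface is an assumption, and other routers will have their own syntax for the same thing.

## Example only, assuming a Linux gateway where eth1 faces the LAN.
## Rewrite all DNS queries (UDP and TCP port 53) so they end up at OpenDNS,
## no matter which DNS server the client thinks it is talking to.
iptables -t nat -A PREROUTING -i eth1 -p udp --dport 53 -j DNAT --to-destination 208.67.222.222:53
iptables -t nat -A PREROUTING -i eth1 -p tcp --dport 53 -j DNAT --to-destination 208.67.222.222:53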

Some newer-generation routers have this functionality built in, so it's a one-time setup on your router and you don't have to worry about it anymore.

Hope this helps in one of your situations.

Regards,
Erwin

Beyond the Hypervisor as we know it

And here we are again. I've been busy doing some internal stuff for my company, so the tweets and blogs were put on low maintenance.

Anyway, VMware launched its new version of vSphere, and the amount of attention and noise it received is overwhelming, both positive and negative. Many customers feel they are being ripped off by the new licensing scheme, whereas from a technical perspective all admins seem to agree the enhancements being made are fabulous. Being a techie myself I must say the new and updated stuff is extremely appealing, and I can see why many admins would like to upgrade right away. I assume that's only possible after the financial hurdles have been cleared.

So why this subject? “VMware is not going to disappear and neither is MS or Xen”, I hear you say. Well, probably not; however, let's take a step back and look at why these hypervisors were developed in the first place. Basically what they wanted to achieve was the option to run multiple applications on one server without any library dependencies which might conflict and disturb or corrupt another application. VMware wasn't the initiator of this concept; the birthplace of it all was IBM's mainframe platform. Even back in the 60's and 70's they had the same problem: two or more applications had to run on the same physical box, but due to conflicts in libraries and functions that didn't work, so IBM found a way to isolate them and came up with the concept of virtual instances running on a common platform operating system, MVS, which later became OS/390 and is now z/OS.

When the open systems world, spearheaded by Microsoft in the 80's and 90's, took off, they more or less created the same mess IBM had seen before. (IBM did actually learn something and pushed that into OS/2, however that OS never really took off.)
When Microsoft came up with so-called Dynamic Link Libraries this was heaven for application developers. They could now dynamically load a DLL and use its functions. However, they did not take into account that only one DLL with a certain function could be loaded at any particular point. And thus, when DLLs got new functionality and therefore new revision levels, they were sometimes not backward compatible and very nasty conflicts would surface. So we were back to square one.

And along came VMware. They did for the Windows world what IBM had done many years before and created a hypervisor which lets you run multiple virtual machines, each isolated from the others, with no possibility of binary conflicts. And they still make good money off it.

The application developers, however, have not been sitting still either. They have also seen that they can no longer use the development model they relied on for years. Every self-respecting developer now programs with massive scalability and distributed systems in mind, based on cloud principles. Basically this means that applications are almost solely built on web technologies with JavaScript (via node.js), HTML5 or other high-level languages. These applications are then deployed on distributed platforms like OpenStack, Hadoop and one or two others. These platforms create application containers where the application is isolated and has to abide by the functionality of the underlying platform. This is exactly what I wrote almost two years ago: the application itself should be virtualised instead of the operating system. (See here)

When you take this into account, you can imagine that the hypervisors as we know them now will at some point render themselves useless. The operating system itself is no longer important, and it doesn't matter what these cloud platforms run on; the only things that matter are scalability and reliability. Companies like VMware, Microsoft, HP and others are not stupid and see this coming. This is also the reason why they are building these massive data centres: to accommodate the customers who adopt this technology and start hosting these applications.

Now here come the problems with this concept: SLAs. Who is going to guarantee you availability when everything is out of your control? Examples like the outages of Amazon EC2, Microsoft's cloud email service BPOS, VMware's Cloud Foundry or Google's Gmail service show that even these extremely well-designed systems run into Murphy at some point, and the question is whether you want to depend on these providers for business continuity. Be aware that you have no say in how and where your application is hosted; that is totally at the discretion of the hosting provider. Again, it's all about risk assessment versus costs versus flexibility and whatever other arguments you can think of, so I leave that up to you.

So where does this take you? Well, you should start thinking about your requirements. Does my business need this cloud-based flexibility, or should I adopt a more hybrid model where some applications are built and managed by myself and my staff?

Either way, you will see more and more applications being developed for internal, external and hybrid cloud models. This brings us back to the subject line: the hypervisors as we know them today will cease to exist. It might take a while, but the software world is like a diesel train; it starts slowly, but once it's on a roll it's almost impossible to stop, so be prepared.

Kind regards,
Erwin van Londen

SoE, SCSI over Ethernet.

It may come as no surprise that I'm not a fan of FCoE. Although I have nothing against the underlying thought of converged networking, I do feel that the method of encapsulating multiple protocols in yet another frame is overkill, adds complexity, requires additional skills, training and operating methods, and introduces risk, so as far as I'm concerned it shouldn't be needed. The main reason FCoE was invented is to have the ability to pass traffic from Fibre Channel environments through gateways (called FCFs) to an Ethernet-connected Converged Network Adapter in order to save on some cabling. Yeah, yeah, I know many say you'll save a lot more, but I'm not convinced.
After staring at some ads from numerous vendors I still wonder why they never came up with the ability to map the SCSI protocol directly onto Ethernet in the same way they do with IP. After all, with the introduction of 10G Ethernet all issues of reliability appear to have gone (have they??), so it shouldn't be such a problem to address this directly; reliability was the main reason Fibre Channel was invented in the first place. I think from a development perspective it should take a comparable amount of effort to transport SCSI directly over Ethernet as over Fibre Channel. From an interface perspective it shouldn't be a problem either: I think storage vendors would be just as happy to put an Ethernet port in next to FC, and you wouldn't need any of the convoluted FCoE or iSCSI mechanisms.

Since all, or at least a lot of, development effort these days seems to have shifted to Ethernet, why still invest in Fibre Channel? Ethernet still has a 7-layer OSI stack, but you should be able to use just three: the physical, data link and network layers. This should be enough to shove frames back and forth in a flat Ethernet network (or Ethernet fabric, as Brocade calls it). For other protocols like TCP/IP this is no problem since they already use the same stack; they just travel a bit higher up. This would then allow you to have a routable iSCSI environment (over IP) as well as a native SCSI protocol running on the same network. The biggest problem is then security. If SCSI runs on a flat Ethernet network there is no way (yet) to secure SCSI packets arriving at all ports in that particular network segment. This would be the same as having no zoning active and disabling all LUN masking on the arrays. The only way to circumvent this is to invent some sort of “Ethernet firewall” mechanism (I'm not aware of a product or vendor that provides this). It's pretty easy to spoof a MAC address, so that's no good as a security precaution.

As usual, this should then also have all the other security features like authentication, authorisation etc. Fibre Channel already provides authentication based on DH-CHAP, which is specified in the FC-SP standard. Although DH-CHAP exists in the Ethernet world, it is strictly tied to higher layers like TCP. It would be good to see this functionality on the lower layers as well.

I’m not an expert on Ethernet so I would welcome comments that would provide some more insight of the options and possibilities.

Food for thought.

Regards,
Erwin

Why disk drives have become slower over the years

What is the first question vendors get (or at least used to get) when a non-technical customer calls?
I'll spare you the guesswork: “What does a TB of disk cost at your place?”. Usually I'll google around for the cheapest disk at a local PC store and say “Well sir, that would be about 80 dollars”. I then hear somebody falling off their chair, trying to get up again, reaching for the phone and with a resonating voice asking “Why are your competitors so expensive then?”. “They most likely did not give a direct answer to your question,” I reply.
The thing is, an HDD should be evaluated on multiple factors, and when you spend 80 bucks on a 1TB disk you get capacity and that's about it. Don't expect performance or extended MTBF figures, let alone all the stuff that comes with enterprise arrays like large caches, redundancy in every sense and a lot more. That is what makes up the price per GB.


“OK, so why have disk drives become so much slower in the past couple of years?”. Well, they haven't. The RPM, seek time and latency have stayed the same over the last couple of years. The problem is that the capacity has increased so much that the so-called “access density” has gone through the roof: each disk has to service a massive amount of data with the same nominal IOPS capability.


I did some simple calculations which show the decrease in performance on larger disks. I didn't assume any RAID or cache acceleration.


I first calculated a baseline based on a 100GB disk drive (I know, they don't exist, but it's just for the calculation) with 500GB of data that I need to read or write.


The assumption is a 100% random read profile. Although the host can theoretically read or write with an IO size in increments of 512 bytes, this doesn't mean the disk will handle that IO in one sequential stroke: an 8K host IO can be split up into the smallest supported sector size on disk, which is currently around 512 bytes. (Don't worry, every disk and array will optimise this, but again this is just to show the nominal differences.)


So a 100GB disk drive translates to a little over 190 million sectors. In order to read 500GB of data this would take a theoretical 21.7 minutes. The number of disks is calculated based on the capacity required for that 500GB. (Also remember that disks use base-10 capacity values whereas operating systems, memory chips and other electronics use base-2 values, so that's 10^3 vs 2^10.)

Baseline (100GB drive):

  Sectors             : 190,734,863
  RPM                 : 10000
  Average delay (ms)  : 8
  Max IOPS per disk   : 125
  Disks required      : 6
  Total IOPS          : 750
  Time required (sec) : 1302
  Time required (min) : 21.7

If you now take this baseline and map it to some previous and current disk types and capacities, you can see the differences.

  GB    Sectors          RPM     Disks   Total IOPS   Time (sec)   Time (min)   % of baseline   Speed vs baseline (x)
  9     17,166,138       7200    57      4731         206          3.44         15.83           6.32
  18    34,332,275       7200    29      2407         406          6.77         31.19           3.21
  36    68,664,551       10000   15      1875         521          8.69         40.02           2.50
  72    137,329,102      10000   8       1000         977          16.29        75.04           1.33
  146   278,472,900      10000   4       500          1953         32.55        150             0.67
  300   572,204,590      10000   2       250          3906         65.1         300             0.33
  450   858,306,885      10000   2       250          3906         65.1         300             0.33
  600   1,144,409,180    10000   1       125          7813         130.22       600.08          0.17

You can see here that, capacity-wise, to store the same 500GB on 146GB disks you need fewer disks, but you also get fewer total IOPS, and that translates into slower performance. As an example, using 300GB 10000RPM drives triples the time needed to read the 500GB compared to the baseline.
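
If you want to play with these numbers yourself, the access-density effect is easy to reproduce with a few lines of shell. This is a deliberately crude model: it assumes a nominal 125 random IOPS per 10k RPM spindle, ignores RAID and cache just like the table above, and skips the base-10/base-2 conversion, so the drive counts can differ by one from the table.

#!/bin/bash
## Crude access-density model: store the same 500GB dataset on ever larger
## drives and see how many IOPS per GB of data are left.
dataset=500
for size in 72 146 300 450 600; do
    awk -v gb="$size" -v data="$dataset" 'BEGIN {
        disks = int((data + gb - 1) / gb)   # drives needed for capacity alone
        iops  = disks * 125                 # nominal random IOPS those drives deliver
        printf "%4d GB drives: %2d spindles, %5d IOPS, %.2f IOPS per GB of data\n",
               gb, disks, iops, iops / data
    }'
done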


Now, these are relatively simple calculations, but they do apply to all disks, including the ones in your disk array.


I hope this also makes you start thinking about performance as well as capacity. I’m pretty sure your business finds it most annoying when your users need to get a cup of coffee after every database query. 🙂

Why not FCoE?

You may have read my previous articles on FCoE as well as some comments I’ve posted on Brocade’s and Cisco’s blog sites. It won’t surprise you that I’m no fan of FCoE. Not for the technology itself but for the enormous complexity and organisational overhead involved.

So let's take a step back and try to figure out why this has become such a buzz in the storage and networking world.

First, let's make it clear that FCoE is driven by the networking folks, most notably Cisco. The reason for this is that Cisco has around 90% market share on the data centre networking side, but only around 10 to 15% on the storage side. (I don't have the actual numbers at hand but I'm sure it's not far off.) Brocade, with their FC offerings, have that storage part pretty well covered. Cisco hasn't been able to eat more out of that pie for quite some time, so they had to come up with something else, and so FCoE was born. This allowed Cisco to slowly but steadily get a foot in the storage door by offering a so-called “new” way of doing business in the data centre and convincing customers to go “converged”.

I already explained that there is no, or only a negligible, benefit from an infrastructural and power/cooling perspective, so cost-effectiveness from a capex perspective is nil and maybe even negative. I also showed that the organisational overhaul that has to be accomplished is tremendous. Remember, you're trying to glue two different technologies together by adding a new one. The June 2009 FC-BB-5 document (where FCoE is described) is around 1.9 MB and 180 pages, give or take a few. FC-BB-6 is 208 pages and 2.4 MB thick. How does this decrease complexity?
Another part you have to look at is backward compatibility. The Fibre Channel standard went up to 16Gb/s a while ago and most vendors have already released products for it. The FC standard specifies backward compatibility down to 2Gb/s, so I'm perfectly safe linking a 16Gb/s SFP to an 8Gb/s or 4Gb/s SFP, and the speed will be negotiated to the highest possible. This means I don't have to throw away older, not yet depreciated, equipment. How does Ethernet play in this game? Well, it doesn't: 10G Ethernet is incompatible with 1G, so they don't marry up. You have to forklift your equipment out of the data centre and get new gear from top to bottom. How's that for investment protection? The network providers will tell you this migration process comes naturally with equipment refresh, but how is it a natural process if you have to refresh one or two director-class switches that your other equipment can't connect to? It means you have to buy additional gear that bridges between the old and the new, resulting in you paying even more. That is probably what is meant by “naturally”: “Naturally you have to pay more.”

So it's pretty obvious that Cisco needs to pursue this path if it is ever to get more traction in the data centre storage networking club. They've also proven this with UCS, which looks like it is falling off a cliff as well if you believe the publications in the blogosphere. Brocade is not pushing FCoE at all. The only reason they are in the FCoE game is to be risk-averse: if for some reason FCoE does take off, they can say they have products to support it. Brocade has no intention of giving up an 80 to 85% market share in Fibre Channel only to risk handing it over to the other side, being Cisco networking. Brocade's strategy is somewhat different from Cisco's. Both companies have outlined their ideas and plans on numerous occasions, so I'll leave that for you to read on their websites.

“What about the other vendors?” you'll say. Well, that's pretty simple. All array vendors couldn't care less. For them it's just another transport mechanism like FC and iSCSI, and there is no gain or loss whether FCoE makes it or not. They won't tell you this to your face, of course. The connectivity vendors like Emulex and Qlogic have to be on the train with Cisco as well as Brocade; however, their main revenue comes from the server vendors who build products with Emulex or Qlogic chips in them. If the server vendors demand an FCoE chip, either party builds one and is happy to sell it to any server vendor. For connectivity vendors like these it's just another revenue stream to tap into, and they cannot afford to sit outside a certain technology if the competition is picking it up. Given that significant R&D is required for chip development, these vendors also have to market their kit to get some ROI. This is normal market dynamics.

“So what alternative do you have for a converged network?” was a question I was asked a while ago. My response was “Do you have a Fibre Channel infrastructure? If so, then you already have a converged network.” Fibre Channel was designed from the bottom up to transparently move data back and forth irrespective of the upper-layer protocol used, including TCP/IP. Unfortunately SCSI has become the most common, but there is absolutely no reason why you couldn't add a networking driver and the IP protocol stack as well. I've done this many times and never had any trouble with it.

The question now is: “Who do you believe?” and “How much risk am I willing to take to adopt FCoE?”. I'm not on the sales side of the fence, nor am I in marketing. I work in a support role and have many of you on the phone when something goes wrong. My background is not in the academic world; I worked my way up and have been in many roles where I've seen technology evolve, and I know how to spot the bad ones. FCoE is one of them.

Comments are welcome.

Regards,
Erwin

HP ends Hitachi relationship

Well, this may be a bit premature, and I don't have any insight into Leo's agenda, but when you apply some common sense and logic you cannot draw any other conclusion than that this will happen within the foreseeable future. “And why would that be?” you say. “They (HP) have a fairly solid XP installed base, they seem to sell enough to make it profitable, and they have also embarked on the P9500 train.”

Yes, indeed. However, take a look at it from the other side. HP currently has four lines of storage products: the MSA, inherited through the Compaq merger, which comes out of Houston and is specifically targeted at the SMB market; the EVA, from the Digital/Compaq StorageWorks stable, which has been the only HP-owned modular array and has done well in the SME space; the XP/P9500, obviously through their Hitachi OEM relationship; and, since last year, the 3-Par kit. When you compare these products they have a lot of overlap in many areas, especially in the open systems space. The R&D budgets for all four products therefore eat up a fair amount of dollars. Besides that, HP also has to set aside a huge amount of money for sales, pre-sales, services and customer support in training, marketing etc. to be able to offer a set of solutions of which a customer will only ever choose the one that fits their needs. So just from a product perspective there is a 1:4 sales ratio, and I haven't even mentioned the choices customers have from the competition. For the lower part of the pie (MSA and small EVA) HP relies heavily on its channel, but from a support and marketing perspective this still requires a significant investment to keep those product lines alive. HP has just released the latest generation of the EVA but, as far as I know, has not commented on future generations. It is to be expected that as long as the EVA sells like it always has, its development will continue.

With the acquisition of 3-Par last year, HP dived very deep into its money pit and paid 2.3 billion dollars for them. You don't make such an investment just to keep a certain product out of the hands of a competitor (Dell in this case); you want this product to sell like hotcakes to shorten your ROI as much as possible, and Leo has quite a few shareholders to answer to. It then comes down to where you get the most margin from, and it is very clear that when you combine 3-Par's ROI needs with the margins HP will obviously make on that product, HP will most likely prefer to sell 3-Par over the XP/P9500, even if the latter would be a better fit for the solution the customer needs. When you put it all together you'll notice that even within HP's storage division there is a fair amount of competition between the product lines, and none of their R&D departments wants to lose. So who needs to give?

There are two reasons why HP would not end their relationship with Hitachi: mainframe and customer demand. Neither of the native HP products has mainframe support, so if HP decides to end the Hitachi relationship they will certainly lose that piece, with the added risk that the same customers choose the competition for the rest of the stack as well. Also, XP/P9500 customers who have already made significant investments in Hitachi-based products will most certainly not like such a decision. HP, however, is not reluctant to make these harsh decisions; history proves they've done it before (abruptly ending an OEM relationship with EMC, for example).

So, if you are an HP customer who has just invested in Hitachi technology, rest assured that you will always have a fallback scenario, and that of course is to deal with Hitachi itself. Just broaden your vision and give HDS a call to see what they have to offer. You'll be very pleasantly surprised.

Regards,
Erwin

(post-note 18-05-2011) Some HP customers have already been told that 3-Par equipment is now indeed the solution HP will prefer to offer unless mainframe is involved.

(post-note 10-07-2011) Again, more proof is surfacing. See Chris Mellor's post on El Reg over here

Will FCoE bring you more headaches?

Yes it will!

Bit of a blunt statement but here’s why.

When you look at the presentations all connectivity vendors (Brocade, Cisco, Emulex etc.) will give you, they pitch FCoE as the best thing since sliced bread. Reduction in costs, cooling, cabling and complexity will solve all of today's problems! But is this really true?

Let's start with costs. Are the cost savings really as big as they promise? These days a 1G Ethernet port sits on the server motherboard and is more or less a freebie. The expectation is that the additional cost of 10GbE will initially be added to a server's cost of goods but, as usual, will decline over time. Most servers come with multiple of these ports. On average a CNA is twice as expensive as 2 GbE ports plus 2 HBAs, so that's not a reason to jump to FCoE. Every vendor has different price lists, so that's something you need to figure out yourself. The CAPEX is the easy part.

An FCoE-capable switch (CEE or FCF) is significantly more expensive than an Ethernet switch plus an FC switch. Be aware that these are data centre switches, and the current port count on an FCoE switch is not sufficient to deploy large-scale infrastructures.

Then there is the so-called power and cooling benefit (?!?!?). I searched my butt off to find the power requirements of HBAs and CNAs, but no vendor publishes these. I can't imagine an FC HBA chip eats more than 5 watts; a CNA will probably use more given that it runs at a higher clock speed, and for redundancy reasons you need two of them anyway, so in general I think they will end up with roughly the same power requirements, or an Ethernet+HBA combination may even be more efficient than CNAs. Now let's compare a Brocade 5000 (a 32-port FC switch) with a Brocade 8000 FCoE switch from a BTU and power-rating perspective. I used their own specs according to their data sheets, so if I made a mistake don't blame me.

A Brocade 5000 uses a maximum of 56 watts and has a BTU rating of 239 at 80% efficiency. An 8000 FCoE switch uses 206 watts when idle and 306 watts when in use; its heat dissipation is 1044.11 BTU per hour. I struggled to find any benefit here. Now you can say that you also need an Ethernet switch, but even if that has the same ratings as a 5000 you still save a hell of a lot of power and cooling by staying with separate switches. I haven't checked the Cisco, Emulex and Qlogic equipment, but I assume I'm not far off there either.

Now, hang on, all vendors say there is a “huge benefit” in FCoE-based infrastructures. Yes, there is: you can reduce your cabling plant. But even there is a snag. You need very high quality cables, so an OM1 or OM2 cabling plant will not do; as a minimum you need OM3, but OM4 is preferred. Do you have this already? If so, good, you need less cabling; if not, buy a completely new plant.

Then there is complexity, also an FCoE sales pitch: “Everything is much easier and simpler to configure if you go with FCoE.” Is it??? Where is the reduction in complexity when the only benefit is that you can get rid of cabling? Once a cabling plant is in place you only need to administer the changes, and there is some extremely good and free software to do that. So even if you consider this a huge benefit, what do you get in return? A famous Dutch football player once said “Elk voordeel heb z'n nadeel” (that's Dutch with an Amsterdam dialect spelling :-)), which more or less means that every benefit has its disadvantage, i.e. there is a snag with each benefit.

The snag here is that you get all the nice features like CEE, DCBX, LLDP, ETS, PFC, FIP, FPMA and a lot more new terminology introduced into your storage and network environment (say what???). This more or less means that each of these abbreviations needs to be learned by your storage administrators as well as your network administrators, which means additional training requirements (and associated costs). This is not a replacement for your current training and knowledge; it comes on top of it.
Also, these settings are not a one-time setup which can be configured centrally on a switch; they need to be configured and managed per interface.

In my previous article I also mentioned the complete organisational overhaul you need to do between the storage and networking departments. From a technology standpoint these two “cultures” have a different mindset. Storage people need to know exactly what is going to hit their arrays from an application perspective as well as operating systems, firmware, drivers etc. Network people don't care; they have a horizontal view and transport IP packets from A to B irrespective of the content of those packets. If the pipe from A to B is not big enough they create a bigger pipe and off we go. In the storage world it doesn't work like that, as described before.

Then there is the support side of the fence. Let's assume you've adopted FCoE in your environment. Do you have everything in place to solve a problem when it occurs (mind the term “when”, not “if”)? Do you know exactly what it takes to troubleshoot a problem? Do you know how to collect logs the correct way? Have you ever seen a Fibre Channel trace captured by an analyzer? If so, were you able to make sense of it, actually pinpoint an issue if there was one and, more importantly, work out how to solve it? Did you ever look at fabric/switch/port statistics on a switch to verify whether something is wrong? For SNIA I wrote a tutorial (over here) in which I describe the overall issues support organisations face when a customer calls in for support, and also what to do about it. The thing is that network and storage environments are very complex. By combining them and adding all the three- and four-letter acronyms mentioned above, the complexity increases five-fold if not more. It therefore takes much, much longer to be able to pinpoint an issue and advise on how to solve it.

I work in one of those support centres for a particular vendor and I see FC problems every day. Very often they are due to administrator errors, but far more often they are caused by a problem with software or hardware. These can be very obvious, like a cable problem, but in most cases the issue is not so clear and it takes a lot of skills, knowledge, technical information AND TIME to sort it out. By adding complexity it just takes more time to collect and analyse the information and advise on resolution paths. I'm not saying it becomes undoable, it just takes more time. Are you prepared, and are you willing, to give your vendor that time to sort out issues?

Now, you probably think I must hold a major grudge against FCoE. On the contrary: I think FCoE is a great piece of technology, but it has been created for technology's sake and not to help you as a customer and administrator to really solve a problem. The entire storage industry is stacking protocols upon protocols to work around a very hard issue that was screwed up a long time ago. (Huh, why's that?)

Be reminded that today's storage infrastructure still runs on a three-decade-old protocol called SCSI (or SBCCS for z/OS, which is even older). Nothing wrong with that, but it implies that the shortcomings of this protocol need to be worked around. SCSI originally ran on a parallel bus which was 8 bits wide and hit performance limitations pretty quickly, so they created “wide SCSI”, which ran on a 16-bit-wide bus. With increasing clock frequencies they pumped up the speed, but the problem of distance limitations became more imminent, and so Fibre Channel was invented. By disassociating the SCSI command set from the physical layer, the T10 committee came up with SCSI-3, which allowed the SCSI protocol to be transported over a serialised interface like FC with a multitude of benefits in speed, distance and connectivity. The same thing happened with ESCON in the mainframe world. Both the ESCON command set (SBCCS, now known as FICON) and SCSI (on FC known as FCP) are now able to run on the FC-4 layer. Since Ethernet back then was extremely lossy, it was no option for a strictly lossless channel protocol with low latency requirements. Now that they have fixed up Ethernet a bit to allow for lossless transport over a relatively fast interface, the FCP SCSI command and data sit in an FC-encapsulated frame which in turn sits in an Ethernet mini-jumbo frame. (I still can't find the reduction in complexity; if you can, please let me know.)

What should have been done instead of introducing a fixer-upper like FCoE is that the industry should have come up with an entirely new concept of managing, transporting and storing data. This should have been based on today's requirements, which include security (like authentication and authorization), retention, (de-)duplication, removal of locality awareness etc. Your data should reside in a container which is a unique entity at all levels, from the application down to the storage and every mechanism in between. This container should be treated according to the policy requirements encapsulated in it, and those policies are based on the content residing in there. This then allows a multitude of properties to be applied to the container, as described above, and allows for far more effective transport.

Now this may sound like trying to boil the ocean, but try to think 10 years ahead. What will be beyond FCoE? Are we creating FCoEoXYZ? Five years ago I wrote a little piece called “The Future of Storage” which more or less introduced this concept. Since then nothing has happened in the industry to really solve the data growth issue. Instead the industry is stacking patch upon patch to work around current limitations (if any), or trying to generate a new revenue stream with something like the introduction of FCoE.

Again, I don't hold anything against FCoE from a technology perspective, and I respect and admire what Silvano Gai and the others at T11 have accomplished in little over three years, but I think it's a major step in the wrong direction. It had the wrong starting point, and it tries to answer a question nobody asked.

For all the above reasons I still do not advise adopting FCoE, and I urge you to push your vendors and their engineering teams to come up with something that will really help you run your business instead of patching up “issues” you might not even have.

Constructive comments are welcome.

Kind regards,
Erwin van Londen

The end of spinning disks (part 2)

Maybe you found the previous article a bit hypothetical, not substantiated by facts but merely some guesstimates?

To put some beef into the equation I’ll try to substantiate it with some simple calculations. Read on.


As shown in Cornell Uni's report, the expected amount of data generated will reach 1700 exabytes in 2011, with an additional 2500 in 2012. 1700 exabytes equates to 1 trillion, 700 billion gigabytes in EU notation (say what…, look here)

So number-wise it looks like this: 1.700.000.000.000 GB

The average capacity of a disk drive in 2011 is around 1400 GB (the average of high-RPM enterprise drives at 600GB and the largest capacity-wise commercially available enterprise HDD at 2TB). In consumer land WD has a 6TB drive, but these will not become mainstream until the end of 2011 or the beginning of 2012. Maybe storage vendors will use the 3 and 4 TB versions, but I do not have visibility of that currently.

1700EB / 1400GB = 1.214.285.714 disk drives are needed to store this amount of information. (Oh, and in 2012 we need another 1.785.714.286 units :-))

This leads us to have a look at production capabilities and HDD vendors. Currently there are two major vendors in the HDD market: Seagate (which shipped 50 million HDDs in FQ3 2011) and WD, shipping 49 million. (WD is acquiring HGST and Seagate is taking over the HDD division of Samsung.) Those four companies combined have a production capacity of around 150 million disk drives per quarter. This means, on an annual basis, a shortage of: 1.214.285.714 – 600.000.000 = 614.285.714 HDDs.
So who says the HDD business isn't a healthy one? 🙂
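
For what it's worth, here is the same back-of-the-envelope arithmetic in shell form, so you can plug in your own assumptions about average drive capacity and production volumes; all the input values are simply the estimates from the text above.

#!/bin/bash
## Back-of-the-envelope: drives needed for the projected 2011 data growth
## versus estimated annual HDD production.
data_gb=1700000000000           # 1700 exabytes expressed in gigabytes
avg_drive_gb=1400               # assumed average drive capacity
production_per_year=600000000   # roughly 150 million drives per quarter

drives_needed=$(( data_gb / avg_drive_gb ))
shortage=$(( drives_needed - production_per_year ))

echo "Drives needed : $drives_needed"
echo "Annual output : $production_per_year"
echo "Shortfall     : $shortage"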

OK, I agree, not everything is stored on HDD, and the offload to secondary media like DVD, Blu-ray, tape etc. will cut a significant piece out of this pie; however, the instantiation of new data will primarily be done on HDDs. Adoption of newer, larger-capacity HDDs is restricted for enterprise use because the access density gets too high, which equates to higher latency and lower performance, and that is not acceptable in these kinds of environments.

This means new techniques will need to be adopted in all areas. From a performance perspective a lot can be gained with SSDs (Solid State Drives), which have extremely good read performance but still lag somewhat in write performance as well as long-term reliability. I'm sure this will be resolved over time. SSDs will, however, not fill the capacity gap needed to accommodate the data growth.

As mentioned before, my view is that this gap can and will be filled by advanced 3D optical media, which provide new levels of capacity, performance, reliability and cost savings.

I'm open to constructive comments.

Cheers,
Erwin