Category Archives: Storage Networking

SoE, SCSI over Ethernet.

It may come as no surprise that I’m not a fan of FCoE. Although I have nothing against the underlying idea of converged networking, I do feel that encapsulating multiple protocols in yet another frame is overkill: it adds complexity, requires additional skills, training and operating methods, and introduces risk, so as far as I’m concerned it shouldn’t be needed. The main reason FCoE was invented is to be able to carry traffic from Fibre Channel environments through gateways (called FCFs) to an Ethernet-connected Converged Network Adapter in order to save on some cabling. Yeah, yeah, I know many say you’ll save a lot more, but I’m not convinced.
After staring at some ads from numerous vendors I still wonder why they never came up with the ability to map the SCSI protocol directly onto Ethernet in the same way they do with IP. After all, with the introduction of 10G Ethernet all issues of reliability appear to have gone (have they??), so it shouldn’t be such a problem to address this directly. Reliable transport was the main reason Fibre Channel was invented in the first place. From a development perspective it should take a comparable amount of effort to transport SCSI directly on Ethernet as on Fibre Channel, and from an interface perspective it shouldn’t be much of a problem either. I think storage vendors would be just as happy to add an Ethernet port alongside FC; they wouldn’t need any of the more involved FCoE or iSCSI mechanisms.

Since all, or at least a lot of, development effort these days seems to have shifted to Ethernet, why still invest in Fibre Channel? Ethernet still sits under the 7-layer OSI stack, but you should be able to use just three layers: physical, datalink and network. That should be enough to shove frames back and forth in a flat Ethernet network (or Ethernet Fabric, as Brocade calls it). For other protocols like TCP/IP this is no problem since they already use the same stack, they just travel a bit higher up. This would allow you to have a routable iSCSI environment (over IP) as well as a native SCSI protocol running on the same network. The biggest problem then is security. If SCSI runs on a flat Ethernet network there is no way (yet) to stop SCSI packets from arriving at every port in that particular network segment. That would be the same as having no zoning active and disabling all LUN masking on the arrays. The only way to circumvent this is to invent some sort of “Ethernet firewall” mechanism. (I’m not aware of a product or vendor that provides this, but then again I may just not have heard of one.) It’s pretty easy to spoof a MAC address, so that’s no good as a security precaution.
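To make the idea a bit more concrete, here is a minimal sketch (mine, not an actual proposal or product) of what an “SoE” frame could look like on the wire: a raw SCSI CDB dropped straight into an Ethernet II frame, with no IP, TCP, FC or FCoE wrapping in between. The EtherType is a placeholder, since no such assignment exists, and a real implementation would obviously still need addressing, flow control and the security discussed above.

```python
import struct

# Hypothetical sketch only: there is no standardised "SCSI over Ethernet" EtherType,
# so the IEEE "local experimental" value 0x88B5 is used as a stand-in here.
SOE_ETHERTYPE = 0x88B5

def build_soe_frame(dst_mac: bytes, src_mac: bytes, scsi_cdb: bytes) -> bytes:
    """Drop a raw SCSI CDB straight into an Ethernet II frame: no IP, TCP, FC or FCoE."""
    eth_header = struct.pack("!6s6sH", dst_mac, src_mac, SOE_ETHERTYPE)
    return eth_header + scsi_cdb

# READ(10) CDB: opcode 0x28, LBA 0, transfer length 8 blocks -- purely illustrative
cdb = bytes([0x28, 0x00, 0, 0, 0, 0, 0x00, 0x00, 0x08, 0x00])
frame = build_soe_frame(b"\xff" * 6, b"\x02\x00\x00\x00\x00\x01", cdb)
print(f"{len(frame)} bytes on the wire, of which {len(cdb)} are the SCSI command")
```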

As usual this should then also come with all the other security features like authentication, authorisation, etc. Fibre Channel already provides authentication based on DH-CHAP, which is specified in the FC-SP standard. Although CHAP-style authentication exists in the Ethernet world, it is strictly tied to higher layers such as TCP-based protocols. It would be good to see this functionality on the lower layers as well.
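As an illustration of the kind of lower-layer authentication I mean, here is a simplified CHAP-style challenge/response in a few lines of Python. This shows only the challenge/response half; the Diffie-Hellman exchange that puts the “DH” in DH-CHAP, and the FC-SP framing, are deliberately left out, and SHA-256 is just my stand-in hash.

```python
import hashlib
import hmac
import os

def make_challenge() -> bytes:
    return os.urandom(16)                                  # switch sends a random challenge

def make_response(secret: bytes, challenge: bytes) -> bytes:
    return hashlib.sha256(secret + challenge).digest()     # node proves it knows the shared secret

def verify(secret: bytes, challenge: bytes, response: bytes) -> bool:
    expected = make_response(secret, challenge)
    return hmac.compare_digest(expected, response)         # constant-time comparison

secret = b"shared-fabric-secret"
challenge = make_challenge()
print(verify(secret, challenge, make_response(secret, challenge)))   # True
```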

I’m not an expert on Ethernet so I would welcome comments that would provide some more insight of the options and possibilities.

Food for thought.

Regards,
Erwin

Why disk drives have become slower over the years

What is the first question vendors get (or at least used to get) when a (non-technical) customer calls?
I’ll spare you the guesswork: “What does a TB of disk cost at your place??”. Usually I’ll google around for the cheapest disk at a local PC store and say “Well sir, that would be about 80 dollars”. I then hear somebody falling off their chair, getting up again, reaching for the phone and asking with a resonating voice “Why are your competitors so expensive then?”. “They most likely did not give a direct answer to your question,” I reply.
The thing is, an HDD should be evaluated on multiple factors, and when you spend 80 bucks on a 1TB disk you get capacity and that’s about it. Don’t expect performance or extended MTBF figures, let alone all the stuff that comes with enterprise arrays like large caches, redundancy in every sense and a lot more. That is what makes up the price per GB.


“Ok, so why have disk drives become so much slower in the past couple of years?”. Well, they haven’t. The RPM, seek time and latency have remained the same over the last couple of years. The problem is that capacity has increased so much that the so-called “access density” has grown almost linearly with it: the disk has to service a massive amount of bytes with the same nominal IOPS capability.


I did some simple calculations which show the decrease in performance on larger disks. I didn’t assume any RAID or cache accelerators.


I first calculated a baseline based on a 100GB disk drive (I know, they don’t exist, but it’s just for the calculations) with 500GB of data that I need to read or write.


The assumption was a 100% random read profile. Although the host can theoretically read or write in increments of 512 bytes, this doesn’t mean the disk will handle that I/O in one sequential stroke. An 8K host I/O can be split up into the smallest supported sector size on disk, which is currently around 512 bytes. (Don’t worry, every disk and array will optimise this, but again this is just to show the nominal differences.)


So a 100GB disk drive translates to a little over 190 million sectors. Reading 500GB of data would take a theoretical 21.7 minutes. The number of disks is calculated based on the capacity required for that 500GB. (Also remember that disks use base-10 capacity values whereas operating systems, memory chips and other electronics use base-2 values, so that’s 10^3 versus 2^10 per “kilo”.)

Baseline (100GB drives, 500GB of data)
GB    Sectors        RPM     Avg delay (ms)   Max IOPS per disk
100   190,734,863    10000   8                125

Disks required: 6    Total IOPS: 750    Time required: 1302 sec (21.7 min)

 If you now take this baseline and map this to some previous and current disk types and capacities you can see the differences.

GB   Sectors   RPM   # Disks   Total IOPS   Time (sec)   Time (min)   % of baseline time   x baseline speed
9 17,166,138 7200 57 4731 206 3.44 15.83 6.32
18 34,332,275 7200 29 2407 406 6.77 31.19 3.21
36 68,664,551 10000 15 1875 521 8.69 40.02 2.50
72 137,329,102 10000 8 1000 977 16.29 75.04 1.33
146 278,472,900 10000 4 500 1953 32.55 150 0.67
300 572,204,590 10000 2 250 3906 65.1 300 0.33
450 858,306,885 10000 2 250 3906 65.1 300 0.33
600 1,144,409,180 10000 1 125 7813 130.22 600.08 0.17

You can see here that, capacity-wise, you need fewer disks to store the same 500GB on 146GB drives, but you also get fewer total IOPS. This translates into slower performance. As an example, a 300GB drive at 10,000RPM triples the time needed to read this 500 gigabytes compared to the baseline disk.
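For those who want to play with the numbers themselves, below is a small Python sketch that roughly reproduces the table. The constants are my assumptions reverse-engineered from the figures above (the spreadsheet behind the table isn’t shown), so treat the exact values as illustrative.

```python
import math

# Reverse-engineered assumptions, not the original spreadsheet:
# - the 500GB random read works out to roughly 976,562 disk I/Os in the table
# - per-disk IOPS = 1000 / (average seek + rotational delay in ms)
TOTAL_IOS = 976_562
DATA_GB   = 512          # 500GB expressed against base-10 drive capacities

def time_to_read(capacity_gb: int, avg_delay_ms: float):
    disks      = math.ceil(DATA_GB / capacity_gb)      # spindles needed to hold 500GB
    total_iops = disks * (1000 / avg_delay_ms)         # aggregate nominal random IOPS
    seconds    = TOTAL_IOS / total_iops                # time to read the whole data set
    return disks, total_iops, seconds

for cap, delay in [(100, 8), (9, 12), (72, 8), (146, 8), (300, 8), (600, 8)]:
    disks, iops, sec = time_to_read(cap, delay)
    print(f"{cap:4d}GB: {disks:2d} disks, {iops:6.0f} IOPS, {sec/60:6.1f} minutes")
```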


Now these are relatively simple calculations, however they do apply to all disks, including the ones in your disk array.


I hope this also makes you start thinking about performance as well as capacity. I’m pretty sure your business finds it most annoying when your users need to get a cup of coffee after every database query. 🙂

Why not FCoE?

You may have read my previous articles on FCoE as well as some comments I’ve posted on Brocade’s and Cisco’s blog sites. It won’t surprise you that I’m no fan of FCoE. Not for the technology itself but for the enormous complexity and organisational overhead involved.

So let’s take a step back and try to figure out why this has become such a buzz in the storage and networking world.

First let’s make it clear that FCoE is driven by the networking folks, most notably Cisco. The reason is that Cisco has around 90% market share on the data centre networking side but only around 10 to 15% on the storage side. (I don’t have the actual numbers at hand but I’m sure it’s not far off.) Brocade, with its FC offerings, has the storage part pretty well covered. Cisco hasn’t been able to eat more out of that pie for quite some time, so they had to come up with something else, and so FCoE was born. This allowed Cisco to slowly but steadily get a foot in the storage door by offering a so-called “new” way of doing business in the data centre and convincing customers to go “converged”.

I already explained that there is no or negligible benefit from an infrastructure and power/cooling perspective, so the cost-effectiveness from a capex perspective is nil and maybe even negative. I also showed that the organisational overhaul that has to be accomplished is tremendous. Remember, you’re trying to glue two different technologies together by adding a new one. The June 2009 FC-BB-5 document (where FCoE is described) is around 1.9 MB and 180 pages, give or take a few. FC-BB-6 is 208 pages and 2.4 MB thick. How does this decrease complexity?
Another part you have to look at is backward compatibility. The Fibre Channel standard went up to 16Gb/s a while ago and most vendors have already released products for it. The FC standard specifies backward compatibility across at least two speed generations, so I’m perfectly safe linking up a 16G SFP with an 8Gb/s or 4Gb/s SFP and the speed will be negotiated to the highest possible. This means I don’t have to throw away older, not yet depreciated, equipment. How does Ethernet play in this game? Well, it doesn’t: 10G Ethernet is incompatible with 1G, so they don’t marry up. You have to forklift your equipment out of the data centre and get new gear from top to bottom. How’s that for investment protection? The network providers will tell you this migration process comes naturally with equipment refresh, but if you have to refresh one or two director-class switches that your other equipment can’t connect to, how is that a natural process? It means you have to buy additional gear that bridges between the old and the new, so you end up paying even more. This is probably what is meant by “naturally”: naturally you have to pay more.

So it’s pretty obvious that Cisco needs to pursue this path if it is ever to get more traction in the data centre storage networking club. They’ve also proven this with UCS, which looks like it is falling off a cliff as well, if you believe the publications in the blogosphere. Brocade is not pushing FCoE at all. The only reason they are in the FCoE game is to be risk averse: if for some reason FCoE does take off, they can say they have products to support it. Brocade has no intention of giving up an 80 to 85% market share in Fibre Channel just to risk handing it over to the other side, being Cisco networking. Brocade’s strategy is somewhat different from Cisco’s. Both companies have outlined their ideas and plans on numerous occasions, so I’ll leave that for you to read on their websites.

“What about the other vendors?” you’ll say. Well, that’s pretty simple. All array vendors couldn’t care less. For them it’s just another transport mechanism like FC and iSCSI, and there is no gain or loss whether FCoE makes it or not. They won’t tell you this to your face, of course. The connectivity vendors like Emulex and QLogic have to be on the train with Cisco as well as Brocade, but their main revenue comes from the server vendors who build products with Emulex or QLogic chips in them. If the server vendors demand an FCoE chip, either party builds one and is happy to sell it to any server vendor. For connectivity vendors like these it’s just another revenue stream they tap into, and they cannot afford to sit outside a certain technology if the competition is picking it up. Given that significant R&D is required for chip development, these vendors also have to market their kit to get some ROI. This is normal market dynamics.

“So what alternative do you have for a converged network?” was a question I was asked a while ago. My response was “Do you have a Fibre Channel infrastructure? If so, then you already have a converged network.” Fibre Channel was designed from the ground up to transparently move data back and forth irrespective of the upper-layer protocol used, including TCP/IP. Unfortunately SCSI has become the most common, but there is absolutely no reason why you couldn’t add a networking driver and the IP protocol stack as well. I’ve done this many times and never had any trouble with it.

The questions now are: “Who do you believe?” and “How much risk am I willing to take to adopt FCoE?”. I’m not on the sales side of the fence, nor am I in marketing. I work in a support role and have many of you on the phone when something goes wrong. My background is not in the academic world; I worked my way up and have been in many roles where I’ve seen technology evolve, and I know how to spot the bad ones. FCoE is one of them.

Comments are welcome.

Regards,
Erwin

Will FCoE bring you more headaches?

Yes, it will!

Bit of a blunt statement but here’s why.

When you look at the presentations all the connectivity vendors (Brocade, Cisco, Emulex, etc.) will give you, they pitch FCoE as the best thing since sliced bread. Reduction in cost, cooling, cabling and complexity will solve all of today’s problems! But is this really true?

Let’s start with cost. Are the cost savings really as big as they promise? These days a 1G Ethernet port sits on the server motherboard and is more or less a freebie. The expectation is that the additional cost of 10GE will initially be added to a server’s cost of goods but, as usual, will decline over time. Most servers come with multiple of these ports. On average a CNA is about twice as expensive as 2 GE ports + 2 HBAs, so that’s not a reason to jump to FCoE. Each vendor has different price lists, so that’s something you need to figure out yourself. The CAPEX is the easy part.

An FCoE-capable switch (CEE or FCF) is significantly more expensive than an Ethernet switch plus an FC switch. Be aware that these are data centre switches and the current port count on an FCoE switch is not sufficient to deploy large-scale infrastructures.

Then there is the so-called power and cooling benefit (?!?!?). I searched my butt off to find the power requirements of HBAs and CNAs but no vendor publishes these. I can’t imagine an FC HBA chip eats more than 5 watts; a CNA will probably use more given that it runs at a higher clock speed, and for redundancy reasons you need two of them anyway, so in general I think these will equate to roughly the same power requirements, or an Ethernet+HBA combination may even be more efficient than CNAs. Now let’s compare a Brocade 5000 (32-port FC switch) with a Brocade 8000 FCoE switch from a BTU and power rating perspective. I used their own specs according to their data sheets, so if I made a mistake don’t blame me.

A Brocade 5000 uses a maximum of 56 watts and has a BTU rating of 239 at 80% efficiency. An 8000 FCoE switch uses 206 watts when idle and 306 watts when in use, with a heat dissipation of 1044.11 BTU per hour. I struggled to find any benefit here. You can argue that you also need an Ethernet switch, but even if that has the same ratings as a 5000 you still save a hell of a lot of power and cooling by running separate switches. I haven’t checked the Cisco, Emulex and QLogic equipment but I assume I’m not far off there either.
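A quick back-of-the-envelope check of that argument, using the data-sheet figures quoted above and assuming the separate Ethernet switch draws roughly the same as the Brocade 5000 (that part is my assumption):

```python
# Data-sheet figures quoted above; the Ethernet switch wattage is an assumption.
FC_SWITCH_W   = 56      # Brocade 5000, maximum
ETH_SWITCH_W  = 56      # assumed comparable fixed-port Ethernet switch
FCOE_SWITCH_W = 306     # Brocade 8000 under load

separate = FC_SWITCH_W + ETH_SWITCH_W
extra_w  = FCOE_SWITCH_W - separate
print(f"separate FC + Ethernet: {separate} W, single FCoE switch: {FCOE_SWITCH_W} W")
print(f"difference: {extra_w} W, or about {extra_w * 24 * 365 / 1000:.0f} kWh per year")
```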

Now, hang on, all vendors say there is a “huge benefit” in FCoE-based infrastructures. Yes, there is: you can reduce your cabling plant. But even there is a snag: you need very high quality cables, so an OM1 or OM2 cabling plant will not do. As a minimum you need OM3, but OM4 is preferred. Do you have this already? If so, good, you need less cabling; if not, buy a completely new plant.

Then there is complexity, also an FCoE sales pitch: “Everything is much easier and simpler to configure if you go with FCoE”. Is it??? Where is the reduction in complexity when the only benefit is that you can get rid of some cabling? Once a cabling plant is in place you only need to administer the changes, and there is some extremely good and free software to do that. So even if you consider this a huge benefit, what do you get in return? A famous Dutch football player once said “Elk voordeel heb z’n nadeel” (that’s Dutch with an Amsterdam dialect spelling :-)), which more or less means that every advantage has its disadvantage, i.e. there is a snag with each benefit.

The snag here is that you get all the nice features like CEE, DCBX, LLDP, ETS, PFC, FIP, FPMA and a lot more new terminology introduced into your storage and network environment (say what???). This more or less means that each of these abbreviations needs to be learned by your storage administrators as well as your network administrators, which means additional training requirements (and associated costs). This is not a replacement for your current training and knowledge; it comes on top of it.
Also, these settings are not a one-time setup that can be configured centrally on a switch; they need to be configured and managed per interface.

In my previous article I also mentioned the complete organisational overhaul you need to do between the storage and networking departments. From a technology standpoint these two “cultures” have a different mindset. Storage people need to know exactly what is going to hit their arrays from an application perspective as well as operating systems, firmware, drivers, etc. Network people don’t care: they have a horizontal view and transport IP packets from A to B irrespective of the content of those packets. If the pipe from A to B is not big enough they create a bigger pipe and off we go. In the storage world it doesn’t work like this, as described before.

Then there is the support side of the fence. Let’s assume you’ve adopted FCoE in your environment. Do you have everything in place to solve a problem when it occurs (mind the term “when”, not “if”)? Do you know exactly what it takes to troubleshoot a problem? Do you know how to collect logs the correct way? Have you ever seen a Fibre Channel trace captured by an analyzer? If so, were you able to make sense of it and actually pinpoint an issue if there was one, and more importantly, work out how to solve it? Did you ever look at fabric/switch/port statistics on a switch to verify whether something is wrong? For SNIA I wrote a tutorial (over here) in which I describe the overall issues support organisations face when a customer calls in for support, and what to do about them. The thing is that network and storage environments are very complex. By combining them and adding all the three- and four-letter acronyms mentioned above, the complexity will increase five-fold if not more. It therefore takes much, much longer to pinpoint an issue and advise on how to solve it.

I work in one of those support centres of a particular vendor and I see FC problems every day. Very often they are due to administrator errors, but far more often they are caused by a problem with software or hardware. These can be very obvious, like a cable problem, but in most cases the issue is not so clear and it takes a lot of skills, knowledge, technical information AND TIME to sort it out. By adding complexity it just takes more time to collect and analyse the information and advise on resolution paths. I’m not saying it becomes undoable, but it just takes more time. Are you prepared, and are you willing, to give your vendor this time to sort out issues?

Now, you probably think I must hold a major grudge against FCoE. On the contrary: I think FCoE is a great technology, but it’s been created for technology’s sake and not to help you as a customer and administrator to really solve a problem. The entire storage industry is stacking protocols upon protocols to circumvent the very hard issue that it screwed up a long time ago. (Huh, why’s that?)

Be reminded that today’s storage infrastructure is still running on a three-decade-old protocol called SCSI (or SBCCS for z/OS, which is even older). Nothing wrong with that, but it implies that the shortcomings of this protocol need to be circumvented. SCSI originally ran on a parallel bus which was 8 bits wide and hit performance limitations pretty quickly, so “wide SCSI” was created, running on a 16-bit bus. By increasing the clock frequencies they pumped up the speed, but the problem of distance limitations became more imminent, and so Fibre Channel was invented. By disassociating the SCSI command set from the physical layer, the T10 committee came up with SCSI-3, which allowed the SCSI protocol to be transported over a serialised interface like FC. That brought a multitude of benefits in speed, distance and connectivity. The same thing happened with ESCON in the mainframe world. Both the ESCON command set (SBCCS, now known as FICON) and SCSI (on FC known as FCP) are now able to run on the FC-4 layer. Since Ethernet back then was extremely lossy, it was no option for a strictly lossless channel protocol with low latency requirements. Now that Ethernet has been fixed up a bit to allow for lossless transport over a relatively fast interface, the entire stack is mapped into a mini-jumbo frame: the FCP-4 SCSI command and data sit in an FC-encapsulated frame which in turn sits in an Ethernet frame. (I still can’t find the reduction in complexity; if you can, please let me know.)
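To picture that wrapping, here is a rough per-frame byte count. The header sizes are approximations from memory of FC-FS and FC-BB-5 (treat them as assumptions and check the standards for exact values); the point is simply how many layers a single SCSI payload now travels through.

```python
# Approximate sizes; treat as assumptions -- the point is the nesting, not the exact bytes.
FC_DATA        = 2048        # a typical FC data field carrying the SCSI payload
FC_HEADER      = 24          # FC frame header
FC_CRC         = 4
FC_SOF_EOF     = 4 + 4       # ordered sets on native FC

ETH_HDR_VLAN   = 18          # Ethernet II header plus the 802.1Q tag FCoE relies on for priority
FCOE_HDR       = 14          # FCoE version/reserved/SOF
FCOE_TRAILER   = 4           # EOF plus reserved bytes
ETH_FCS        = 4

native_fc = FC_SOF_EOF + FC_HEADER + FC_DATA + FC_CRC
fcoe      = ETH_HDR_VLAN + FCOE_HDR + FC_HEADER + FC_DATA + FC_CRC + FCOE_TRAILER + ETH_FCS

print(f"native FC frame : {native_fc} bytes for {FC_DATA} bytes of data")
print(f"FCoE frame      : {fcoe} bytes, i.e. {fcoe - native_fc} extra bytes of wrapping")
```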

What should have been done, instead of introducing a fixer-upper like FCoE, is for the industry to come up with an entirely new concept of managing, transporting and storing data. This should have been created based on today’s requirements, which include security (like authentication and authorisation), retention, (de-)duplication, removal of locality awareness, etc. Your data should reside in a container which is a unique entity at all levels, from the application to the storage and every mechanism in between. This container should be treated as per the policy requirements encapsulated in that container, and those policies are based on the content residing in it. This then allows a multitude of properties to be applied to this container as described above, and allows for far more effective transport.

Now this may sound like trying to boil the ocean, but try to think 10 years ahead. What will be beyond FCoE? Are we creating FCoEoXYZ? Five years ago I wrote a little piece called “The Future of Storage” which more or less introduced this concept. Since then nothing has happened in the industry to really solve the data growth issue. Instead the industry is stacking patch upon patch to circumvent current limitations (if any), or trying to generate a new revenue stream with something like the introduction of FCoE.

Again, I don’t hold anything against FCoE from a technology perspective, and I respect and admire Silvano Gai and the others at T11 for what they’ve accomplished in little over three years, but I think it’s a major step in the wrong direction. It had the wrong starting point and it tries to answer a question nobody asked.

For all the above reasons I still do not advise adopting FCoE, and I urge you to push your vendors and their engineering teams to come up with something that will really help you run your business, rather than patching up “issues” you might not even have.

Constructive comments are welcome.

Kind regards,
Erwin van Londen

The end of spinning disks (part 2)

Maybe you found the previous article a bit hypothetical, not substantiated by facts but merely by some guesstimates?

To put some beef into the equation I’ll try to substantiate it with some simple calculations. Read on.


As shown in the Cornell University report, the expected amount of data generated will reach 1700 exabytes in 2011, with an additional 2500 in 2012. 1700 exabytes equates to 1.7 trillion gigabytes, or 1.7 billion in EU long-scale notation (say what…., look here).

So number-wise it looks like this: 1.700.000.000.000 GB

The average capacity of a disk drive in 2011 is around 1400 GB (roughly the average of the high-RPM enterprise drives of 600GB and the largest commercially available enterprise HDD of 2TB). In consumer land WD has a 6TB drive, but these will not become mainstream until the end of 2011 or the beginning of 2012. Maybe storage vendors will use the 3 and 4 TB versions, but I do not have visibility of that currently.

1700EB / 1400GB = 1.214.285.714 disk drives needed to store this amount of information. (Oh, and in 2012 we’d need 1.785.714.286 units :-))

This leads us to have a look at production capabilities and HDD vendors. Currently there are two major vendors in the HDD market: Seagate (which shipped 50 million HDDs in FQ3 2011) and WD, shipping 49 million. (WD is acquiring HGST and Seagate is taking over the HDD division of Samsung.) Those four companies combined have a production capacity of around 150 million disk drives per quarter. This means, on an annual basis, a shortage of 1.214.285.714 – 600.000.000 = 614.285.714 HDDs.
So who says the HDD business isn’t a healthy one? 🙂
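The same back-of-the-envelope sums in a few lines of Python, using the 2011 figures quoted above:

```python
# 2011 figures as quoted in this post; all of this is rough napkin maths.
data_2011_gb      = 1_700_000_000_000     # 1700 EB expressed in gigabytes
avg_drive_gb      = 1400                  # assumed average drive capacity
annual_production = 150_000_000 * 4       # ~150 million drives per quarter, industry-wide

drives_needed = data_2011_gb / avg_drive_gb
print(f"drives needed   : {drives_needed:,.0f}")
print(f"drives produced : {annual_production:,}")
print(f"shortfall       : {drives_needed - annual_production:,.0f}")
```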

OK, I agree, not everything is stored on HDD, and the offload to secondary media like DVD, Blu-ray, tape, etc. will cut a significant piece out of this pie. However, the creation of new data will primarily be done on HDDs. Adoption of newer, larger-capacity HDDs is restricted for enterprise use because the access density gets too high, which equates to higher latency and lower performance, and that is not acceptable in these kinds of environments.

This means new techniques will need to be adopted in all areas. From a performance perspective a lot can be gained with SSDs (Solid State Drives), which have extremely good read performance but still lag somewhat in write performance as well as long-term reliability. I’m sure this will be resolved over time. SSDs will, however, not fill the capacity gap needed to accommodate the data growth.

As mentioned before my view is that this gap can and will be filled by advanced 3D optical media which provides new levels of capacity, performance, reliability and cost savings.

I’m open for constructive comments.

Cheers,
Erwin

The end of spinning disks

Did you ever wonder how long this industry will rely on spinning disks? I do, and I think that within 5 to 10 or 15 years we’ll have reached the end of the ability of disks to keep up with demand and data growth rates. A report by Andrey V Makarenko of Cornell University estimates that around 1700 exabytes (yes, EXA-bytes) will be generated in 2011 alone, with growth to over 2500 exabytes next year.

With new technologies invented and implemented in science, space exploration, health care and, last but not least, consumer electronics, this growth rate will increase exponentially. Although disk drive technology has pretty much kept pace with Moore’s law, you can see that the advances in this technology are slowing down. Rotational speed has been steady for years and the edges of perpendicular recording have almost been reached. This means that within the foreseeable future there will be a tipping point where demand outgrows capacity. Even if production facilities were increased to keep up with demand, do we as a society want these massive infrastructures, which are very expensive to build and maintain and place a huge burden on our environment? So where does this leave us? Do we have to stop generating data, generate it in a far more efficient way, or should we also combine this with aggressive data lifecycle management? I wrote an article earlier on this blog which shows how this could be achieved, and it doesn’t take a scientist to understand it.
To go back to the subject: there is talk that SSD will take over a significant share of magnetic drives, and maybe it will, but SSD still lacks reliability in one form or another. I’m sure this will be resolved in the not so distant future, but will the technology be as cost effective as spinning disks have been over the last decades? I think it will take a significant amount of time to reach that point. So where do we go from here? It is my take that, in addition to the uptake of SSD-based drives, significant advances will be made in 3D optical storage. This will not only allow a massive increase in capacity per cubic inch, but also a reduction in cost and energy, as well as a massive increase in performance.
Advancements in laser technology, photonic behaviour and optical media will clear the pathway for adoption into data centres the moment this becomes commercially attractive.

There are numerous scientific studies as well as commercial entities working on this type of technology, and market demand adds significant pressure to its development. Check out this Wikipedia article on 3D optical storage for some more information on the technicalities.

Let me know your opinion.

Regards,
Erwin van Londen

Fibre Channel improvements.

So what is the problem with storage networking these days? Some of you might argue that it’s the best thing since sliced bread and the most stable way to shove data back and forth, and maybe it is, but that is not always the case. The problem is that some gaps still exist which have never been addressed, and one of them is resiliency. A lot has been done to detect errors and to try to recover from them, but nobody ever thought about how to prevent errors from occurring. (Until now, that is.) Read on.

So how does a standard like Fibre Channel evolve? It is normally born out of a need that isn’t addressed by current technologies. The primary reason FC was created is that the parallel SCSI stack had a huge problem with distance. It did not scale beyond a couple of metres and was very sensitive to electrical noise, which could disturb the reliable transmission needed for a data-intensive channel protocol like SCSI. So somebody came up with the idea to serialise the data stream, and FC was born. A lot of very smart people got together and cooked up the nifty things we now take for granted, like a massive address space, zoning, huge increases in speed and a lot of other goodies which could never have been achieved with a parallel interface.

The problem is that these goodies are all created in the dark dungeons of R&D labs. These guys don’t speak much (if at all) to end-user customers, so the stuff coming out of these labs is very often extremely geeky.
If you follow the path from the creation of a new thing (whether technology or anything else) you see something like this:

  1. Market demand
  2. R&D
  3. Product
  4. Sales
  5. Customers
  6. Post sales support

The problem is that very often there is no link between #5/#6 and #2. Very often for good reason, but this also creates some serious challenges. Since I’m not smart enough to work in #2, I’m at the bottom of the food chain working in #6. 🙂 But I do see the issues that arise along this path, so I cooked something up. Read on.

Going back to Fibre Channel, there is one huge gap and that is fault tolerance and acting upon failures in an FC fabric. The protocol defines how to detect errors and how to try to recover from them, but it does not define anything about how to prevent errors from reoccurring. This means that if an error has been detected and frames get lost, we just say “OK, let’s try it again and see if it succeeds now”. It doesn’t take a genius to see that if something is broken, this will fail again.

On the practical side there are a couple of things that most often go wrong, and those are the physical parts like SFPs and cables. These result in errors like encoding/decoding failures, CRC errors, and signal and synchronisation errors. If these occur, the entire frame including your data payload gets dropped and we ask the initiator of that frame to try and resend it. If, however, the initiator no longer has this frame in its buffers, we rely on the upper-layer protocol to recover. Most of the time it succeeds, but, as mentioned, if things are really broken this will fail again. From an operating system perspective you will see this as SCSI check conditions and/or read/write failures. In a tape environment this will often result in failed backup/restore jobs.

Now, you’re gonna say “Hold on buddy, that’s why we have dual redundant fabrics, multiple paths to our LUNs, multipathing, etc.”, i.e. redundancy. True, BUT, what if something is only partially broken? A dodgy SFP or HBA might send out good signals, but there could also be a certain amount of not-so-good signals. This results in intermittent failures causing the errors mentioned above, and if they happen often enough you will run into these problems. So, although you have every piece of the storage puzzle redundant, you might still hit problems which, if severe enough, can affect your entire storage infrastructure. (And it does happen, believe me.)

The underlying problem is that there is no communication between N-Ports and F-Ports, and no end-to-end path error verification, to check whether these errors occur in the fabric and, if so, how to mitigate or circumvent them. If an N-Port sends out a signal to an F-Port which gets corrupted along the way, there is no way for the F-Port to notify the N-Port and say “Hey dude, you’re sending out crap, do something about it”. A similar issue exists in meshed fabrics. We all grew up since 1998 with FSPF (Fabric Shortest Path First), an FC protocol extension that determines the shortest path from A to B in an FC fabric based on a least-cost routing algorithm. Nothing wrong with that, but what if this path is very error prone? Does the fabric have any means to make a decision and say “OK, I don’t trust this path, I’ll direct that traffic via another route”? No, there is nothing in the FC protocol which provides this option. The only way routes are recalculated is when there are changes in the fabric, such as an N-Port coming online/offline and registering/de-registering itself with the fabric name server, upon which RSCNs (Registered State Change Notifications) are sent out.
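To illustrate the idea of routing around an error-prone path (this is purely my own illustration, not what the T11 proposal specifies), here is a small sketch of a least-cost path search in which an FSPF-style static link cost is inflated by an observed error rate, so a flaky ISL loses out even when it is nominally the cheapest route:

```python
import heapq

def best_path(links, src, dst):
    """links: {node: [(neighbour, static_cost, error_rate), ...]} -- Dijkstra with an error penalty."""
    queue, visited = [(0.0, src, [src])], set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, static_cost, error_rate in links.get(node, []):
            penalty = 1.0 + 100.0 * error_rate          # arbitrary weighting of observed link errors
            heapq.heappush(queue, (cost + static_cost * penalty, neighbour, path + [neighbour]))
    return float("inf"), []

fabric = {
    "HostEdge": [("CoreA", 500, 0.05), ("CoreB", 1000, 0.0)],   # CoreA link is cheaper but flaky
    "CoreA":    [("StorEdge", 500, 0.0)],
    "CoreB":    [("StorEdge", 1000, 0.0)],
    "StorEdge": [],
}
print(best_path(fabric, "HostEdge", "StorEdge"))     # picks the clean path via CoreB
```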

For this reason I submitted a proposal to the T11 committee, via my teacher and the father of Fibre Channel, Horst Truestedt, to extend the FC-GS services with new ways to solve these problems. (The proposal can be downloaded here.)

The underlying thought is to have port-to-port communication so one side of a link can notify the other that the link is not stable, as well as an end-to-end error verification and notification algorithm so that hosts, HBAs and fabrics can act upon errors seen in the path to their end devices. This allows active redirection of traffic to prevent frames from passing via that route, as well as the option to extend management capabilities so that storage administrators can act upon these failures and replace or update hardware and/or software before the problem becomes imminent and affects the overall stability of the storage infrastructure. This will in the end result in far greater storage availability and application uptime, and prevent all the other nasty stuff like data corruption.

The proposal was positively received with an 8:0 voting ratio, so now I’m waiting for a company to take this further and actually start developing this extension.

Let me know what you think.

Regards
Erwin

SNW 2011 Thursday

Today was the last day of the spring edition of SNW. I think it’s been a good conference, although the customer attendee numbers could have been better. I think these days the vendors are all keeping their gigs to themselves to prevent customers from wandering around and seeing other solutions as well; after all, SNW is an event from Computerworld and SNIA.

This conference centred on three major subjects: cloud, virtualisation and convergence. The funny thing is that I have the impression people left the conference with more questions than answers. A lot of technologies are not fleshed out and nobody seems to have an accurate description of what cloud actually is. There is clear evidence that customers are not ready to adopt new technologies; see here. Earlier in the week Greg Schulz and I had a chat, and I mentioned that to me cloud computing is the total abstraction of business applications from the infrastructure, yet when I put my ear to the ground everyone comes up with another definition that suits their own needs.

Anyway, today was also my turn for a presso, and I had a pretty good attendee turn-up considering it was the last day. Let’s wait for the evaluations. I tried to keep it tangible, with issues people are facing today in their data centres, and gave some tips and hints on how to overcome them. I had another presso lined up for submission but it was not scheduled for a slot. Both of them are downloadable from the SNIA tutorial website, here and here. I do appreciate feedback, so don’t hesitate to comment below.

At noon we went for lunch at a little Mexican restaurant just on the east side of Santa Clara. Had a great time with Marc Farley, J Michel Metz and Greg Knieriemen. I hope to see them again soon. Great guys to hang around with.

Tomorrow is travel-time for me again back to Down Under after which I wind down to relax a bit with my family on Fiji.

So far the reporting from my side covering SNW Spring 2011.

Keep in touch.

Cheers
Erwin

SNW 2011 Wednesday

Wednesday started off with four keynotes and the Best Practices awards, and some of them gave useful insight into data centre transformation and how to approach it. It also, again, touched on the subject of the organisational challenges that the inevitable change will bring if you want, or need, to adopt new strategies.

After the break I had a good conversation with Greg Schulz of StorageIO (very nice guy, if you have the chance go look him up some time). We shared some insights and primarily discussed the ongoing question of where cloud computing in general can or should reside and how to attack the challenges. After that we both joined the session of J Michel Metz, known in short as J. J is the product manager for FCoE at Cisco and he elaborated on the options you now have with this new technology. I’ve attended a couple of the FCoE sessions, and in the end everyone seems to agree that although the technology is there, it will be extremely hard to adopt it since it requires an immense amount of organisational change to overcome the fear, the risk and the mindset of the people who have worked on these very different technologies. Well, let me rephrase that: the technologies themselves are not that different with respect to packaging frames and switching or routing them through a network, but the purposes these technologies serve are totally different. I wrote about this before in the article “Why FCoE will die a silent death”. I very much like the technology and admire Silvano for what he has accomplished, but it will be up to the customer to decide whether or not to adopt it. The surprising thing I got out of these sessions is that although most vendors acknowledge Fibre Channel will be around for quite some time, they also hint that its development will slow down significantly. All development effort these days seems to go into Ethernet, so at some point in time customers won’t have this choice anymore. Only time will tell whether customers will accept that. I think it still requires a lot of engineering effort to fill the gaps that remain in reliability and stability.
After lunch I joined a Wikibon panel session with FCoE advocates from the different vendors. It’s quite obvious they do want you to adopt this, and although they don’t admit it, they can’t afford FCoE to fail. It has been on the market for two years now and the adoption rate still looks extremely low, or at least not as good as they wanted it to be. They also have to be very careful not to step on toes, so the general message was: start now, but do it in a controlled way to protect your current assets, and grow over time with natural technology refreshes.

The last breakout session of the day for me was the one with @YoClaus Claus Mikkelsen, @virtualheff Mike Heffernan and Jerry Sutton from TSYS, who elaborated on end-to-end server-to-storage virtualisation and the advantages HDS has to offer. Very good session with good questions and feedback. I also, finally, met Michael Jaslowski, who works for Northern Trust. I have had the pleasure of analysing and fixing some of the issues they’ve had in the past on their SAN infrastructure. Always good to meet people who you normally only speak to on the phone.

To close the day, Miki Sandorfi had a general session about taking your own pace and looking for good solutions before adopting cloud technologies. Keep an open mind and explore what’s right for your business.

See ya tomorrow.

Cheers
E

SNW 2011 Tuesday

The second day at SNW USA kicked off with three keynotes. One was from Randy Mott, CIO of HP, who had a very good session on how they changed their internal infrastructure and processes to increase efficiency and the ability to innovate. The second was from my boss, or actually my boss’s boss’s boss, so that gives you some perspective on where I fit in the organisation :-), Jack Domme, who took us into the world of massive data storage, preservation, and the ability to harvest and analyse these enormous amounts of data.


After that I went to the expo to check on the guys in the HDS booth and the setup of the demos. Always very nice to see all the software features and functions we provide.
My first breakout session was from John Webster, who provided an overview of “Big Data”. Now what the heck is big data? John went through some definitions, products and some freaky stuff that’s on the horizon. Then we’re talking about brain implants and things you really don’t want to talk or think about, since they might touch the real essence of life. Anyway, when we got our feet back on the ground again, I went over to Kevin Leahy, who runs the cloud strategy at IBM’s CTO office. I think this was the best-visited breakout session; some people actually had no seat. As Kevin pointed out, there are a lot of points that need to be looked at when either adopting or providing a cloud service. But on the other side it also gives you a lot of flexibility and options to provide IT services to your business and customers.

During lunch I received a message from Horst Truestedt, who presented my FC-GS proposal on Fibre Channel resiliency at the T11 committee meeting in Philadelphia. He mentioned it was extremely well received and was unanimously accepted. This was the best news of the day. I’ve been waiting a long time for this and it seems it’s coming now. Yeah!

After lunch I had the opportunity to meet a lot of people who were here long before I got into this business, which pulled some pretty interesting discussions into the equation. Because of this I missed a session, so I’ll have to dig up the presso from the SNIA website.

The last two keynotes were from Krishna Nathan, who runs IBM’s storage development. He had some interesting points, but also some things that are already in the marketplace and have been done and dusted by HDS for a long time. I wonder if IBM is trying to play catch-up here. The last session was from Dave Smoley, who runs internal IT at Flextronics. His message was that even with a very tight budget and well-defined standards, innovation doesn’t have to suffer. Instead he encourages companies to push their employees to be creative and come up with ideas that might have high value to the business but don’t necessarily cost an arm and a leg.

During dinner I met up with Robin Harris, who runs the StorageMojo blog and a column on ZDNet, and we had some interesting discussions around SSDs and the life expectancy of disks. Not how long a single disk or SSD keeps running, but the question of when disks will become obsolete. Robin thinks he’ll be retired long before disks go out of existence, whereas my view is that disks will cease to exist within 10 to 15 years. We’ll see, time will tell. 🙂

A lot of buzz has been going around regarding #storagebeers. It’s a social gathering of people of all shapes, sizes, backgrounds and mindsets, all working in the storage industry. It’s a great event with a fantastic turnout. Word goes around that this was the best #storagebeers ever. I can’t tell, I haven’t been to one before, but it was indeed a great evening.

 THE Marc Farley aka StorageRap at #storagebeers.

 The best part is there is absolutely no competitive mindset, everyone talks to everyone about everything but storage. (well a little bit but all geek stuff. :-))

As you can see on the right, a great turnout.

That more-or-less sums up my day.

CU tomorrow.
Cheers, E