Tag Archives: resilient

Appalling state of Storage Networks

In the IT world a panic is most often related to a operating systems kernel to run out of resources or some unknown state where it cannot recover and stops. Depending on the configuration it might dump its memory contents to a file or a remote system. This allows developers to check this file and look for the reason why it happened.

A fibre-channel switch running Linux as OS is no different but the consequences can be far more severe.

Continue reading

Cross-fabric collateral damage

Ever since the dawn of time the storage administrators have been indoctrinated with redundancy. You have to have everything at least twice in order to maintain uptime and be able to achieve this to a level of around 99,999%. This is true in many occasions however there are exceptions when even dual fabrics (ie physically separated) share components like hosts, arrays or tapes.. If a physical issue in one fabric is impacting that shared component the performance and host IO may even impact totally unrelated equipment on another fabric and spin out of control.

Continue reading

Closing the Fibre-Channel resiliency gap – 2

So this morning I uploaded my proposal to T11 (13-348v0)  in order to get the ball rolling to get it adopted in the FC-LS-3 standard. (That would be awesome). Obviously the task to get things done in a very stable protocol which is known for some serious backward compatibility is not an easy undertaking. I’ve tried (and I think I succeeded) in leaving all existing behaviour intact. This way any transition towards an environment that supports this new architecture is seamless.

The document should be download-able for everyone so any feedback is highly appreciated.



Closing the Fibre-Channel resiliency gap

Fibre-Channel is still the predominant transport protocol for storage related data transmission. And rightfully so. Over the past +-two decades it has proven to be very efficient and extremely reliable in moving channel based data transmissions between initiators and targets. The reliability is due to the fact the underlying infrastructure is almost bulletproof. Fibre-Channel requires very high quality hardware as per FC-PI standard and a BER (Bit Error Rate) of less than 10^12 is not tolerated. What Fibre-Channel lacks though is an method of detection and notification to and from hosts if a path to a device is below the required tolerance levels which can cause frames to be dropped without a host to be able to adjust its behaviour on that path. Fibre-Channel relies on upper level protocols (like SCSI) to re-submit the command and that’s about it. When FC was introduced to the market back in the late 90’s, many vendors already had multipath software which could correlate multiple paths to the same LUN into one and in case one path failed it could switch over to the other. Numerous iterations further down the road nothing really exciting has been developed in that area. As per the nature of the chosen class-of-service (3) for the majority of todays FC implementations there is no error recovery done in an FC environment. As per my previous post you’ve seen that MPIO software is also NOT designed to act on and recover failed IO’s. Only in certain circumstances it will fail a path in a way that all new IO’s will be directed to one or more of the remaining paths to that LUN. The crux of the problem is that if any part of the infrastructure is less than what is required from a quality perspective and there is nothing on the host level that actively reacts on these kind of errors you will end up with a very poor performing storage infrastructure. Not only on the host that is active on that path but a fair chance exists other hosts will have the same or similar problems. (see the series of Rotten Apples in previous posts.)

So what is my proposal. Hosts should become more aware of the end-to-end error ratio of paths from their initiators to the targets and back. This results in a view where hosts (or applications) can make a judgement call of which path it can send an IO so the chances of an error are most slim. So how does this need to work. I is all about creating an inventory of least-error-paths. For this to be accomplished we need a way of getting this info to the respective hosts. Basically there are two ways of doing this. 1. Either create a central database which receives updates from each individual switch in the fabric and the host needs to query that database and correlate that with the FSPF (Fabric Shortest Path First) info in order to be able to sort things out or, and this would be my preferred way, we introduce a new ELS frame which can hold all counters currently specified in the LESB (Link Error Status Block) plus some more for future use or vendor specific info. I call this the Error Reporting with Integrated Notification frame. This frame is sent by the initiator to the target over all paths it has at its disposal. Each ingress port (RX port on each switch) which this frame traverses and increment the counters with its own values. When the target receives the frames it flips the SID and DID and send it back to the host. Given the fact this frame is still part of the same FC exchange it will also traverse the same path back so an accurate total error count of that path can be created.

Both options would enable each host of analyzing the overall error count on each path from HBA to target for each lun it has. The problem with option 1 is that the size of the database will increase exponentially proportional with the number of ports and paths and this might become such a huge size that it cannot live inside the fabric anymore and thus needs to be updated in an external management tool. This then has the disadvantage that host are depending on OOB network restrictions and security implications in addition to interop problematic issues. It also has the problem that path errors can be bursty depending on load and/or fabric behaviour. This management application will need to poll these switches for each individual port which will cause an additional load on the processors on each switch even while not necessary. Furthermore it is highly unreliable when the fabric is seeing a fair amount of changes which by default causes re-routing to occur and thus renders a calculation done by the host one minute ago, totally useless.

Option two has the advantage that there is one uniform methodology which is distributed on each initiator, target and path. It therefore has no impact on any switch and/or external management application and is also not relying on network (TCP/IP) related security restrictions, Ethernet boundaries caused by VLANS etc or any other external factors that could influence the operation.

The challenge is however that today there is no ASIC that supports this logic and even if I could get the proposal accepted by T11 it’ll take a while before this is enabled in hardware. In the meantime the ELS frame could be sent to the processor of the switch which in turn does the error count modification in the frame payload, CRC recalculation and other things required. Once more the bottleneck of this method will become the capability of the CPU of that particular switch especially when you have many high port-count blades installed . Until the ASICs are able to do this on the fly in hardware there will be less granularity from a timing perspective since each ELS frame will need to be sent to the CPU. To prevent the CPU from being flogged by all these update and pull requests in the transitional period there is an option to either extend the PLOGI to check if all ports in the path are able to support this frame in hardware or use this new ELS with a special “inventory bit” to determine the capabilities. If any port in the part does not not support the ELS frame it will flick it to 0. This allows the timing interval of each ELS frame to be inline with the capabilities of the end-to-end path. You can imagine that if all ports are able to do this in hardware you can achieve a much finer granularity on timing and hosts can respond much quicker on errors. If any port does not support the new ELS frame the timing can be adjusted to fall in between the E_D_TOV and R_A_TOV values (in general 2 and 10 seconds). The CPU’s on the switches are fairly capable to handle this. This is still much better than any external method of collecting fibre-channel port errors and having an out-of-band query and policy method. Another benefit is that there is a standard method of collecting and distributing the end-to-end path errors so even in multi-vendor environments it is not tied to a single management platform.


So lets look at an example.

erwin_els_example _1This shows a very simplistic SAN infrastructure where one host has 4 paths over two initiators to 2 storage ports each. All ports seem to be in tip-top shape.

erwin_els_example _2If one link (in this case between the switch and the storage controller) is observing errors (in any direction) the new ELS frame would increment that particular LESB counter value and the result would enable the host to detect the increase in any of the counters on that particular path. Depending on the policies of the operating system or application it could direct the MPIO software to mark that path failed and to remove it from the IO path list.

Similarly if a link shows errors between an initiator and switch it will mark 2 paths as bad and has the option to mark these both as failed.

erwin_els_example _3If you have a meshed fabric the number of paths exponentially grow with each switch and ISL you add. The below show the same structure but because an ISL has been added between the switches the number of potential paths between the host and LUN now grows to 8

erwin_els_example _4This means that if one of the links is bad the number of potential bad paths also duplicates.

erwin_els_example _5In this case the paths from both initiators on this bad link are marked faulty and can be removed from the target list by the MPIOsoftware.

erwin_els_example _7The fun really starts with the unexpected. Lets say you add an additional ISL and for some reason this one is bad. The additional ISL does not add new paths to the path-list on the host since this is transparent and is hidden by the fabric. Frames will just traverse one or the other irrespective of which patch is chosen by the host software. Obviously, since the ELS is just a normal frame, the error counters in the ELS might be skewed based on which of the two ISL’s it has been sent. Depending on the architecture of the switch you’ll have two options, either the ASIC accumulates all counters for both ports into one and add these onto the already existing counters, or you can use a divisional factor where the ASIC sums up all counters of the ISL’s and divides them by the number of ISL’s. The same can be done for trunks(brocade) / portchannels(Cisco). Given the fact that currently most the counters are used in 32bit transmission words the first option is likely to cause the counters to wrap very quickly. The second advantage of a divisional factor is that there will be a consistent averaging across all paths in case you have a larger meshed fabric and thus it will provide a more accurate feedback to the host.

I’m working out the details w.r.t. the internals of the ELS frame itself and which bits to use in which position.

This all should make Fibre-Channel in combination with intelligent host-based software an even more robust protocol for storage based data-transmissions.

Let me know what you think. Any comments, suggestions and remarks are highly appreciated.



Why convergence still doesn’t work and how you put your business at risk

I browsed through some of the great TechField Day videos and came across the discussion “What is an Ethernet Fabric?” which covered the topic of Brocade’s version of a flat layer 2 Ethernet network based on their proprietary “ether-fabric protocol”. At a certain point the discussion led to the usual “Storage vs. Network” and it still seems there is a lot of mistrust between the two camps. (As rightfully they should. :-))

For the video of the “EtherFabric” discussion you can have a look >>here<<

Convergence between storage en networking has been a wishful thinking ever since parallel SCSI became in it 3rd phase where the command set was separated from the physical infrastructure and became serialised over an “network” protocol called Fibre-Channel.

The biggest problem is not the technical side of the conversion. Numerous options have already been provided which allow multiple protocols being transmitted via other protocols. The SCSI protocol is able to be transmitted via FibreChannelC, TCPIP, iSCSI and even the less advanced protocol ATA can be transferred directly via Ethernet.

One thing that is always forgotten is the intention of which these different networks were created for. Ethernet was developed somewhere in the 70’s by Robert Metcalf at Xerox (yes, the same company who also invented the GUI as we know it today) to be able to have two computers “talk” to each other and exchange information. Along that path the DARPA developed TCP/IP protocol was bolted on top of that to make sure there was some reliability and a more broader spectrum of services including routing etc was made possible. Still the intention has always been to have two computer systems exchange information along a serialised signal.

The storage side of the story is that this has always been developed to be able to talk to peripheral devices and these days the dominant two are SCSI and Ficon (SBCCS over FibreChannel). So lets take SCSI now. Just the acronym already tells you its intent:  Small Computer Systems Interface. It was designed for a parallel bus, 8-bits wide, had a 6 meter distance limitation and could shove data back and forth at 5MB/s. By the nature of the interfaces it was a half-duplex protocol and thus a fair chunk of time was spent on arbitration, select, attention and other phases. At some point in time (parallel) SCSI just ran into brick wall w.r.t. speed, flexibility, performance, distance etc. So the industry came up with the idea to serialise the dataflow of SCSI. In order to do this all protocol standards had to be unlinked from the physical requirements SCSI had always had. This was achieved with SCSI 3. In itself it was nothing new however as of that moment it was possible to bolt SCSI onto a serialised protocol. The only protocols available at that time were Ethernet, token ring, FDDI and some other niche ones. These ware all considered inferior and not fit for the purpose of transporting a channel protocol like SCSI. A reliable, high speed interface was needed and as such FibreChannel was born. Some folks at IBM were working on this new serial transport protocol which had all the characteristics anyone would want in a datacentre. High speed (1Gbit/s, remember Ethernet at that time was stuck at 10Mb/s and token ring at 16Mb/s), both optical and copper interfaces , long distance, reliable (ie no frame drop) and very flexible towards other protocols. This meant that FibreChannel was able to carry other protocols, both channel and network including IP, HIPPI, IPI, SCSI, ATM etc. The FC4 layer was made in such a flexible way that almost any other protocol could easily be mapped onto this layer and have the same functionality and characteristics that made FC the rock solid solution for storage.

So instead of using FC for IP transportation in the datacentre some very influential vendors went the other way around and started to bolt FC on top of Ethernet which resulted in the FCoE standard. So we now have a 3 decade old protocol (SCSI) bolted on top of a 2 decade old protocol (FC) bolted on top of a 4 decade old protocol (Ethernet).

This in al increases the complexity of datacentre design, operations, and troubleshooting time in case something goes wrong. Although you can argue that costs will be reduced due to the fact you only need single CNA’s, switchports etc instead of a combination of HBA’s and NIC’s, but think about the fact you lose that single link. This means you will lose both (storage and network) at the same time. This also means that manageability is reduced to zero and you will to be physically behind the system in order resuscitate it again. (Don’t start you have to have a separate management interface and network because that will totally negate the argument of any financial saving)

Although it might seem that from a topology perspective and the famous “Visio” drawings the design seems more simplified however when you start drawing the logical connections in addition to the configurable steps that are possible with a converged platform you will notice that there is a significant increase in connectivity. 

I’m a support engineer with one of the major storage vendors and I see on a day to day basis the enormous amount of information that comes out of a FibreChannel fabric. Whether it’s related to configuration errors, design issues causing congestion and over-subscription, bugs, network errors on FCIP links and problems with the physical infrastructure. See this in a vertical  way were applications, operating systems, volume managers, file-systems, drivers etc. all the way to the individual array spindle can be of influence of the behaviour of an entire storage network and you’ll see why you do not want to duplicate that by introducing Ethernet networks in the same path as the storage traffic.
I’m also extremely surprised that during the RFE/RFP phase for a new converged infrastructure almost no emphasis is placed on troubleshooting capabilities and knowledge. Companies hardly question themselves if they have enough expertise to manage and troubleshoot such kind of infrastructures. Storage networks are around for over over 15 years now and still I get a huge amount of questions which touch on the most basic knowledge of these networks. Some call themselves SAN engineers however they’ve dealt with this kind of equipment less than 6 months and the only thing that is “engineered” is the day-to-day operations of provisioning LUNs and zones. As soon a zone commit doesn’t work for whatever reason many of them are absolutely clueless and immediate support-cases are opened. Now extrapolate this and include Ethernet networks and converged infrastructures with numerous teams who manage their piece of the pie in a different manner and you will, sooner or later, come to the conclusion that convergence might seem great on paper however there is an enormous amount of effort that goes into a multitude of things spanning many technologies, groups, operational procedures and others I haven’t even touched on. (Security is one of them. Who determines which security policies will be applied on what part of the infrastructure. How will this work on shared and converged networks?)

Does this mean I’m against convergence? No, I think it’s the way to go as was virtualization of storage and OS’es. The problem is that convergence is still in its infancy and many companies who often have a CAPEX driven purchase policy are blind to the operational issues and risks. Many things need to be fleshed out before this becomes real “production ready” and the employees who keep your business-data on a knifes-edge are knowledgeable and confident they master this to the full extent.

My advice for now:

1. Keep networks and storage isolated. This improves spreading of risk, isolates problems and recoverability in case of disasters.
2. Familiarise yourself with these new technologies. Obtain knowledge through training and provide your employees with a lab where they can do stuff. Books and webinars have never been a good replacement for one-on-one instructor led training.
3. Grow towards an organisational model where operations are standardised and each team follows the same principles.
4. Do NOT expect you suppliers to adopt or know these operational procedures. The vendors have thousands of customers and a hospital requires far different methods of operations than an oil company. You are responsible for your infrastructure and nobody else. The support-organisation of you supplier deals with technical problems and they cannot fix your work methods. 
5. Keep in touch with where the market is going. What looks to become mainstream might be obsolete next week. Don’t put your eggs in one basket.

Once more, I’m geek enough to adopt new technologies but some should be avoided. FCoE is one of them at this stage.

Hope this helps a bit in making you decisions.

Comments are welcome.

Erwin van Londen

One rotten apple spoils the bunch – 3

In the previous 2 blog-posts we looked at some areas why a fibre-channel fabric still might have problems even with all redundancy options available and MPIO checking for link failures etc etc.
The challenge is to identify any problematic port and act upon indications that certain problems might be apparent on a link.

So how do we do this in Brocade environments? Brocade has some features build into it’s FOS firmware which allows you to identify certain characteristics of your switches. One of them (Fabric-Watch) I briefly touched upon previously. Two other command which utilize Fabric_Watch are bottleneckmon and portfencing. Lets start with bottleneckmon.

Bottleneckmon was introduced in the FOS code stream to be able to identify 2 different kinds of bottlenecks: latency and congestion.

Latency is caused by a very high load to a device where the device cannot cope with the offered load however it does not exceed the capabilities of a link. As an example lets say that a link has a synchronized speeds of 4G however the load on that link reached no higher than 20MB/s and already the switch is unable to send more frames due to credit shortages. A situation like this will most certainly cause the sort of credit issues we’ve talked about before.

Congestion is when a link is overloaded with frames beyond the capabilities of the physical link. This often occurs on ISL and target ports when too many initiators are mapped on those links. This is often referred to as an oversubscribed fan-in ratio.

A congestion bottleneck is easily identified by looking at the offered load compared to the capability of the link. Very often extending the connection with additional links (ISL, trunk ports, HBA’s)  and spreading the load over other links or localizing/confining the load on the same switch or ASIC will most often help. Latency however is a very different ballgame. You might argue that Brocade also has a portcounter called tim_txcrd_zero  and when that reaches 0 pretty often you also have a latency device but that’s not entirely true. It may also mean that this link is very well utilized and is using all its credits. You should also see a fair link utilization w.r.t. throughput but be aware this also depends on frame size.

So how do we define a link as a high latency bottleneck? The bottleneckmon configuration utility provide a vast amount of parameters which you can use however I would advise to use the default settings as a start by just enabling bottleneck monitoring with the “bottleneckmon –enable” command. Also make sure you configure the alerting with the same command otherwise the monitoring will be passive and you’ll have to check each switch manually.

If a high latency device is caused by physical issues like encoding/decoding errors you will get notified by the bottleneckmon feature however when this happens in the middle of the night you most likely will not be able to act upon the alert in a timely fashion. As I mentioned earlier it is important to isolate this badly behaving device as soon as possible to prevent it from having an adverse effect on the rest of the fabric. The portfencing utility will help with that. You can configure certain thresholds on port-types and errors and if such a threshold has been reached the firmware will disable this port and alert you of it.

I know many administrators are very reluctant to have a switch take these kind of actions on its own and for a long time I agreed with that however seeing the massive devastation and havoc a single device can cause I would STRONGLY advise to turn this feature on. It will save you long hours of troubleshooting with elongated conference calls whilst your storage network is causing your application to come to a halt. I’ve seen it many times and even after pointing to a problem port very often the decision to disable such a port subject to change management politics. I would strongly suggest that if you have such guidelines in your policies NOW is the time to revise those policies and enable the intelligence of the switches to prevent these problem from occurring.

For some comprehensive overview, options and configuration examples I suggest you first take a look at the FOS admins guide of the latest FOS release versions. Brocade have also published some white-papers with more background information.



Fibre Channel improvements.

So what is the problem with storage networking these days? some of you might argue that it’s the best thing since sliced bread and it’s the most stable way to shove data back and forth and maybe it is however this is not always the case. The problem is some gaps still exist which have never been addressed and one of them is resiliency. There is a lot that has been done to detect errors and to try to recover from them but nobody ever thought of how to prevent errors from occurring. (Until now that is). Read on.

So what is the evolution of a standard like Fibre Channel.It normally is born out of a need that isn’t addressed with current technologies. The primary reason FC was erected is that the parallel SCSI stack had a huge problem with distance. It did not scale beyond a couple of meters and was very sensitive to electrical noise which could disturb the reliable transmission that was needed for a data intensive channel protocol like SCSI. So somebody came up with the idea to serialise the data stream and FC was born. A lot of very smart people got together and cooked up the nifty things we now take for granted like massive address-space, zoning, huge increase in speed and lot of other goodies which could have never been achieved with a parallel interface.

The problem is that these goodies are all created in the dark dungeons of R&D labs. These guys don’t speak much (if at all) to end-user customers so the stuff coming out of these labs is very often extremely geeky.
If you follow a path from the creation of a new thing (whether technology or anything else) you see something like this:

  1. Market demand
  2. R&D
  3. Product
  4. Sales
  5. Customers
  6. Post sales support

The problem is that very often there is no link between #5/#6 and #2. Very often for good reason but this also inflicts some serious challenges. Since I’m not smart enough to work in #2 I’m on the bottom of the food chain working in #6. 🙂 But I do see the issues that arise in this path so I cooked something up. Read on.

Going back to fibre channel there is one huge gap and that is fault tolerance and the acting upon failures in a FC fabric. The protocol defines how to detect errors and how to try to recover from these but is does not have anything which defines how to prevent errors from reoccurring. This means that if an error has been detected and frames get lost we just say “OK, lets try it again and see if it succeeds now”. It doesn’t take a genius to see that if something is broke this will fail again.

So on the practical side there are a couple of things that most often go wrong and that is the physical side of things like SFP’s and cables. These result in errors like encoding/decoding failures, CRC errors, signal and synchronization errors. If these occur the entire frame including your data payload will get dropped and we’re asking to the initiator of that frame to try and resend it. If however the initiator does not have this frame in it’s buffers anymore we rely on the upper layer protocol to recover from this. Most of the time it succeeds, however, as previously mentioned, if things are really broke this will fail again. From an operating system perspective you will see this as SCSI check conditions and/or read/write failures. On a tape environment this will often result in failed back/restore jobs.

Now, you’re gonna say “Hold on buddy, that why we have dual redundant fabrics, multiple entries to our LUNS, multipathing etc etc” i.e. redundancy. True, BUT, what if it is just partially broken? An dodgy SFP or HBA might send out good signals but there could also be a certain amount of not so good signals. This will result in intermittent failures resulting in the above mentioned errors and if these happen often enough you might get these problems. So, although you have every piece of the storage puzzle redundant, you might still run into problems which, if severe enough, might affect your entire storage infrastructure. (and it does happen, believe me)

The underlying problem is that there is no communication between N-Ports and F-ports as well as lack of end-to-end path error verification to check if these errors occur in the fabric and if so how to mitigate or circumvent these. If an N-port sends out a signal to an F-port which gets corrupted underway there is no way the F-port is notifying the N-port and saying “He, dude you’re sending out crap, do something about it”. Similar issue is in meshed fabrics. We all grew up since 1998 with FSPF (Fabric Shortest Path First) which is a FC protocol extension to determine the shortest path from A to B in a FC fabric based on a least cost routing algorithm. Nothing wrong with that however what if this path is very error prone? Does the fabric have any means to make a decision and say “OK, I don’t trust this path, I’ll direct that traffic via another route”? No, there is nothing in the FC protocol which provides this option. The only way routes are redefined is if there are changes in the fabric like an N-Port coming online/offline and registers/de-registers itself with the fabric nameserver and RSCN (Registered Name Change Notifications) are sent out.

For this reason I submitted a proposal to the T11 committee via my, teacher and father of Fibre Channel Horst Truestedt, to extend the FC-GS services with new ways to solve these problems. (proposal can be downloaded here )

The underlying thoughts are to have port-to-port communication to be able to notify the other side of the link it is not stable as well as have and end-to-end error verification and notification algorithm so that hosts, hba’s and fabrics can act upon errors seen in the path to their end devices. This allows active redirection of frames to circumvent frames of passing via that route as well as the option to extend management capabilities so that storage administrators can act upon these failures and replace/update hardware and/or software before the problem becomes imminent and affects the overall stability of the storage infrastructure. This will in the end result in far greater storage availability and application uptime as well as prevent all the other nasty stuff like data corruption etc.

The proposal was positively received with an 8:0 voting ratio so now I’m waiting for a company to pull this further and actually starting to develop this extension.

Let me know what you think.