
Errors and other nastiness on a serial transmission link

In some of my previous posts I have outlined the reasons for and consequences of errors on a link. There are multiple locations where a bit may flip from a 0 to a 1 or the reverse. CRC errors, encoding errors, synchronisation and link failures are the usual symptoms and may lead to corrupted frames, primitive signals and primitive sequences. In the end they will lead to some sort of IO error or performance problem.

So where do these errors come from? The easiest ones to solve are those of a physical nature: broken cables, wrongly fitted connectors, unclean connection points due to dust or incorrect handling of the cables. The ones that do not fall into these categories are much more difficult to detect. If you have checked and fixed the physical part of the puzzle we may need to dive somewhat deeper into the “not-so-obvious” side of the fence, and this is where it gets most interesting as well as complicated.

As you most likely know, all current transmission protocols use a serial transmission link. This goes from 100Gb Ethernet to the SATA disk channel in your PC. Obviously Fibre Channel uses serial transmission as well. As with many protocols the FC protocol is built upon layers, and the serial transmission characteristics reside on the FC0 and FC1 layers.

A somewhat older picture of the FC stack

The above picture more or less outlines the stack. Of course the speeds and feeds have increased but the general operations have remained the same.

So if the physical side is out of the way, what could then cause a bit to flip? Well, pretty often a degraded SFP or incorrect settings on an ASIC may cause interference. When you look at the defect lists from Brocade, Cisco and other vendors you’ll see defects popping up every now and then where they have adjusted so called “SERDES settings”. SERDES stands for SERializer/DESerializer, which implies there must be something on the chip that serialises and de-serialises the data stream. To go further into this we need to have a look at how a basic flow of bits is sent from the ASIC through the SFP onto the wire.

First you start off with 8 bits which run through the encoder. Here the byte is split into two blocks of 5 and 3 bits, after which the encoding logic turns them into sub-blocks of 6 and 4 bits respectively. These are then glued back together so you end up with 10 bits. These still reside in parallel form in the chip and are thus sent to the serializer. Depending on the hardware implementation a FIFO buffer might sit in between these two. After the data has passed the serializer it is handed to the SFP driver, which converts it from an electrical to an optical signal and puts it on the wire.

On the receiving side things are a bit more complicated. First of all the receiver needs to align the incoming bitstream to a meaningful piece of information, and therefore it has to align its clock to that bit-stream. Given that a 10-bit transmission character can hold more information than an 8-bit byte (duhh), we end up with some additional characters which we can utilize. The most important one is the K28.5 character (also called a comma), which is sent as the very first character of each frame as well as of all primitive signals and ordered sets (IDLE, R_RDY, NOS, OLS, LR etc.). Remember that Fibre Channel is WORD aligned and each word consists of 4 transmission characters, ie 40 bits. A primitive signal is thus 40 bits, of which the first character is K28.5, followed by 3 “data” characters Dxx.x.

Depending on the current running disparity the K28.5 looks like 0011111010 on a negative RD or like 1100000101 on a positive RD. These two characters never show up anywhere else in a bit-stream, irrespective of payload. So when the very distinctive K28.5 arrives at the ingress port the receiver knows the exact point at which to align the synchronisation of the stream. This is needed because there will always be a slight discrepancy in clock frequency between the remote and local side. Aligning the clock via the bit-stream is a very effective way of achieving very high data-rates with accurate bit and word synchronisation. This would never have been possible with a parallel link such as in the “old” SCSI days.
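
To make the alignment step a bit more tangible, here is a minimal Python sketch. It is not how a SERDES does it in silicon; it only shows the idea of hunting for the comma and cutting the stream into 10-bit characters from there. The assumption is that the comma is the 7-bit pattern the two K28.5 forms start with (0011111 or 1100000). The function names are made up for this example, and the three data characters are the R_RDY bit values used elsewhere in this series.

```python
# The comma is the 7-bit pattern that K28.5 starts with (one form per running disparity).
COMMAS = ("0011111", "1100000")

def find_comma(bits):
    """Return the index of the first comma pattern in a bit string, or -1."""
    hits = [i for i in (bits.find(c) for c in COMMAS) if i >= 0]
    return min(hits) if hits else -1

def align(bits):
    """Cut the stream into 10-bit transmission characters, starting at the first comma."""
    start = find_comma(bits)
    if start < 0:
        return []
    usable = bits[start:]
    return [usable[i:i + 10] for i in range(0, len(usable) - 9, 10)]

# Three stray bits followed by K28.5 (negative RD form) and three data characters.
stream = "101" + "0011111010" + "1010100010" + "0101010101" + "0101010101"
characters = align(stream)
print(find_comma(stream), characters)
```

The receiver in this toy throws away the three stray bits and ends up with four properly aligned transmission characters, ie one word.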

The de-serialized bitstream is thus sent to a chip which effectively does the comma realignment, after which it can be pushed into the decoder. There the 10-bit transmission character is converted back to an 8-bit byte by applying the same algorithm that was used to encode the byte in the first place, in an inverse manner. After that is done the byte is put into a, so called, elastic buffer. This nifty piece of the chip is required to balance the rate at which the bitstream arrives and the rate at which the ASIC or FPGA is able to pull the data from this interface. If the ASIC pulls the data out of the interface quicker than the interface can deliver the bit stream you end up in a so called underrun. If, on the other hand, the interface is delivering the data quicker than the ASIC can pull it out you may end up in an overrun. This is where the flexibility of the FC protocol with regards to fill-words comes into play. According to the standard each transmitting interface shall send at least 6 fill-words (either IDLE or ARB(FF)) between two frames. The remote side however only needs to detect 3 to recognise an inter-frame gap, so it has 3 words of leeway to either add fill-words into the elastic buffer to prevent underrun or remove fill-words in the event of overrun.
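
The fill-word trick can be sketched in a few lines of Python. This is only an illustration of the decision described above, not a vendor implementation: the watermarks `low_mark` and `high_mark` and the function name are hypothetical, and a real elastic buffer works on hardware clocks rather than on counters like these.

```python
MIN_FILL_WORDS = 3   # what the receiver needs to see between two frames (per the text above)

def adjust_fill_words(received_fill_words, buffer_level, low_mark, high_mark):
    """Return how many fill words to keep in the inter-frame gap.

    low_mark and high_mark are hypothetical watermarks on the elastic buffer.
    """
    if buffer_level <= low_mark:
        # Reader drains faster than data arrives: pad with one extra fill word.
        return received_fill_words + 1
    if buffer_level >= high_mark and received_fill_words > MIN_FILL_WORDS:
        # Data arrives faster than it is read: drop one fill word to catch up.
        return received_fill_words - 1
    return received_fill_words

print(adjust_fill_words(6, buffer_level=1, low_mark=2, high_mark=14))   # near underrun
print(adjust_fill_words(6, buffer_level=15, low_mark=2, high_mark=14))  # near overrun
```

Note that the sketch never shrinks the gap below 3 fill-words, since the frames themselves can of course not be touched.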

On 10GFC and 16GFC the 64b/66b encoding/decoding algorithm is used, so the stream will pass a descrambler and block-sync stage before it can be decoded. Depending on vendor implementation there is some variation in the point at which the actual decoding is done. In most cases the decoder will pull the 66 bits out of the elastic buffer and decode them on the fly when the ASIC does a read request to that interface.

As you can see there are many points where things might go wrong, and it is thanks to the brilliance of these hardware engineers that a BER of 10^-15 (one bit error in every 10^15 bits) is fairly normal.

So going back to these SERDES defects Brocade and others have published: what do they mean? These defects adjust the pre-emphasis settings of the serdes chip in such a way that it will over-drive the first bit after a transition and gradually under-drive each subsequent bit of the same polarity. Huhhhh, what does that mean???

The 8b/10b encoding/decoding scheme makes sure you will never have more than 5 consecutive ones or 5 consecutive zeros in a bit stream. If you run into a long string of consecutive ones or zeros you will run into a phenomenon called ISI (Inter Symbol Interference). This means the charge built up in the capacitance of the transmission path may be more than the circuit can discharge within a single bit time. In plain English: the circuit charges itself to a higher level than it can discharge at the next transition.

So let’s visualise this.
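
The run-length property is easy to check for yourself. Below is a small Python sketch (the helper name `longest_run` is mine) that measures the longest run of identical bits in the two K28.5 forms and compares that with two unencoded 0xFF bytes sent back to back.

```python
from itertools import groupby

def longest_run(bits):
    """Length of the longest run of identical bits in a bit string."""
    return max((len(list(group)) for _, group in groupby(bits)), default=0)

k28_5_negative_rd = "0011111010"
k28_5_positive_rd = "1100000101"
raw_bytes = "11111111" * 2          # two unencoded 0xFF bytes back to back

print(longest_run(k28_5_negative_rd))  # 5
print(longest_run(k28_5_positive_rd))  # 5
print(longest_run(raw_bytes))          # 16
```

Without the encoding the line could sit at the same level for as long as the payload dictates, which is exactly the situation the receiver cannot cope with.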

In the above picture the green lines represent the upper and lower boundaries of the maximum voltage levels of the circuit, and the red lines represent the upper and lower values where the end-point distinguishes between a 1 and a 0. (These are determined by the standards bodies of the respective interfaces. For Fibre Channel it is the FC-PI-xx standard of the T11 committee.) As you can see, due to the over-charge on the circuit after 5 ones the discharge on the transition is not big enough to fall outside the detection zone boundaries. The end-point is therefore unable to detect a transition from 1 to 0, which results in an encoding error.

There is a second problem with this and that is clock synchronization. There are two options for a serdes chip to synchronise the clock rate: SS (Source Synchronous) and CDR (Clock Data Recovery). Source Synchronous requires a separate clock signal on the wire for the remote side to hook into, whilst CDR uses a phase lock on the incoming bitstream and as such can adjust the clock rate accordingly. (Remember the K28.5??) If however the transmission characteristics of this signal deteriorate due to the ISI problem outlined above, or any other issue, the receiving side will lose synchronization and will first need to re-align itself before it can successfully decode the bit stream again. This will cause in-flight frames to be lost, and if the problem is persistent enough it will result in a severe performance problem or other nastiness.
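
The over-charge effect can be imitated with a very crude model. The Python sketch below treats the link as a first-order (RC-like) circuit in which every bit moves the level a fixed fraction of the way towards the driven value. Everything in it is an assumption chosen for illustration: the levels of +1 and -1, the factor `alpha = 0.4` and a single detection threshold at 0 instead of the two detection boundaries in the picture. It has nothing to do with real FC-PI electrical parameters.

```python
def receive_levels(bits, alpha=0.4, start=-1.0):
    """Toy first-order channel: each bit moves the level a fraction alpha
    of the way towards the driven target (+1 for a 1, -1 for a 0)."""
    level = start
    levels = []
    for bit in bits:
        target = 1.0 if bit == "1" else -1.0
        level += alpha * (target - level)
        levels.append(level)
    return levels

# Level seen by the receiver on the 0 that follows one 1, and on the 0 that follows five 1s.
after_single_one = receive_levels("10")[-1]
after_five_ones = receive_levels("111110")[-1]
print(round(after_single_one, 3), round(after_five_ones, 3))
```

In this toy the 0 after a single 1 lands well below the threshold, but the 0 after five 1s is still above it, so the receiver would read it as a 1. That is the kind of misdetection pre-emphasis is meant to prevent.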

With pre-emphasis the sending side makes sure there is a stepwise reduced power level on each subsequent bit of the same polarity. This has the effect that the charge on the circuit stays flat, so the transition to the opposite polarity falls beyond the boundaries of the detection zone. An example of the effect of different pre-emphasis values on the eye-pattern is shown below.

So when a vendor shows “defects” or “bugs” in relation to SERDES it does not mean there is really a hardware problem or software bug but more or less a way to track different settings on different ports to adjust the pre-emphasis values on those ports.

As you can imagine, on a chassis with over 350 user ports and even more ports on the back-end side of the fence there is a lot to track, and a change on one port may have a negative effect on other ports.

If we’re dealing with Brocade switches or directors the actual values are stored in the ASIC registers, which can be collected via a supportsave. If you don’t know what you’re looking for then don’t even try. 🙂 There is also the option to adjust these values in real time, however the previous comment is even more applicable. Unless you are a Brocade ASIC hardware engineer don’t even try. I did modify some values in a controlled lab environment and YES, things do go haywire when you enter values you pulled out of a hat.

Testing, testing… one-two-three…. For some advanced features.

Testing for a faulty port is always a bit tricky, since you first need to know what you’re looking for and secondly you need to know how to interpret the outcome. Brocade has a utility in FOS called “portloopbacktest”. This utility allows you to test with a range of different options. The drawback is that the switch needs to be in a non-operational (ie disabled) state.

If you’ve read the above you’ll see there are three distinct locations you need to test.
1. External to the SFP (ie the entire link including cabling, connectors, patch panels etc.)
2. External to the SERDES chip but before the SFP (ie internal ASIC/FPGA to SFP circuit)
3. Internal (parallel) to check for correct operation of encoding/decoding, elastic buffer, scrambler etc.

The below picture shows these three options. This is a snippet of a Virtex-II Pro RocketIO FPGA. You can see the layout of the TX and RX side with the SERDES chip, clock manager, encoder etc. Obviously depending on the vendor and usage you might see additional functionality being put into this like FEC calculators etc.

For option one you can use a loopback plug directly on the SFP if you suspect a faulty laser, or at the end of the link.
Normally you would start with the parallel test (Blue) and work your way up from there to the serial loopback (Red) and then the external one (Green). Also be aware that portloopbacktest uses a different port numbering scheme than you might be accustomed to. This scheme is based on the actual blade port number according to the ASIC layout of the specific blade, so it does not work with the front-end port numbers. This means that each blade type (16, 32 or 48 port) will have a different numbering scheme. Check the troubleshooting and command reference manuals for these options.

On the 16G Condor3 hardware you can also use the new, so called, “D-Port” functionality. This however only works if you have two of these types of switches, or 16G Brocade HBAs connected to these 16G switches. If you implement these switches and/or HBAs I would strongly suggest you test the link with this functionality as it can prevent future problems.

I hope this explains a bit about the link errors you may find on fibre-channel ports, plus their causes and resolution.

Let me know if you are interested in other topics and I’ll see what I can cook up. 🙂

Cheers,
Erwin

One rotten apple spoils the bunch – 4

Credit – who doesn’t want to have lots of it at the bank

According to Wikipedia the short definition in finance terms is :

Credit is the trust which allows one party to provide resources to another party where that second party does not reimburse the first party immediately (thereby generating a debt), but instead arranges either to repay or return those resources (or other materials of equal value) at a later date.


In Fibre Channel it’s not much different. During initialization of a link both ports provide the remote party with a certain amount of resources (buffer credits), which tells the remote party it is allowed to send that many frames. If this credit is used up the sending port is no longer allowed to send any more frames. The receiving port gets the frame, and when the first couple of bytes have been read it will send a so called R_RDY to the sending port, telling it to increase its credit by one. This part I described before.

A short side note is that this is normal operation in a class 3 FC network. A class 1 network uses End-to-End credits but this is hardly ever used. 

Taking the above into account, it shows that all links in an FC network use this method for flow control. It is thus not restricted to HBA, disk, tape and switch ports; the method is also used internally within switches. The clearest way of showing this is a blade based chassis where ports on one blade need to transfer frames to a port on another blade. Basically this means that the frame will traverse two more hops before it reaches that port. If you have a small switch with only one ASIC all frames will be short routed in that same ASIC.

As an example.
A Brocade 5100 has a single 40-port Condor2 ASIC which means all 40 front-end ports are connected to this same chip.



 

 

Any frame traversing from port 1 to port 9 will be switched inside that same chip. Sounds obvious. This also means there are only two points of B2B flow control and that is between the devices connected to both port 1 and 9.

When looking at a Brocade 5300 there is a completely different architecture.

 

 

 

This switch has 5 GoldenEye2 ASICs which serve 80 front-end ports (16 each) and 4 back-end ASICs which serve as interconnect between those 5 front-end ASICs. Each GoldenEye2 chip has 32 8G ports, and each front-end to back-end link is a single 4-port trunk, which allows for an any-to-any 1:1 subscription ratio.
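
The 1:1 claim can be verified with some quick arithmetic. The Python snippet below assumes that every front-end ASIC has one 4-port trunk to each of the 4 back-end ASICs; that is my reading of the description above, not something taken from a Brocade hardware manual.

```python
PORT_SPEED_G = 8       # every GoldenEye2 port runs at 8G
FRONT_END_PORTS = 16   # user-facing ports per front-end ASIC
BACK_END_ASICS = 4     # back-end ASICs each front-end ASIC connects to
TRUNK_WIDTH = 4        # ports in each front-end to back-end trunk

front_end_bandwidth = FRONT_END_PORTS * PORT_SPEED_G
back_end_bandwidth = BACK_END_ASICS * TRUNK_WIDTH * PORT_SPEED_G
ports_used_per_asic = FRONT_END_PORTS + BACK_END_ASICS * TRUNK_WIDTH

print(front_end_bandwidth, back_end_bandwidth, ports_used_per_asic)
print(front_end_bandwidth / back_end_bandwidth)  # subscription ratio
```

Under that assumption all 32 ports of a front-end ASIC are accounted for, and the bandwidth towards the back-end equals the bandwidth of the 16 user ports.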

If we look at this picture and have an HBA connected to port 1 and a storage device connected to port 45 you’ll see that the frame has to traverse 2 additional hops: from ASIC 1 to a back-end ASIC and from there onward to ASIC 5. (Internal links are not counted as an official “hop-count”, so they are not included in fabric parameters.) As you have seen in my previous post, when a link error between an end-device and a switch port causes credit depletion, the switch or device will reset the link and bring the credit count back to the login values, and the upper layer protocol (ie SCSI) has to retry the IO. There is no such mechanism on back-end ports. You might argue that since this is all inside the switch, with no vulnerable components like optical cables that can corrupt primitives, lost R_RDYs are impossible. Unfortunately this is not entirely true. There are circumstances where front-end problems propagate to the back-end. This is often seen during very high traffic times in combination with problematic front-end ports. The result is that one or more of those back-end links have a credit count stuck at zero, which basically means the front-end port is no longer allowed to send frames to the back-end, causing similar problems with high latency and elongated IO recovery times. The REALLY bad news is that there is (now: “was”) no recovery possible besides bouncing the entire switch. (By bouncing I mean a reboot, not throwing it on the floor hoping it will return to your hands. Believe me, it won’t. At least not in one piece.)

All the Brocade OEMs have run into situations like this with their customers, and especially on larger fabrics with multiple blade based chassis and hundreds of connected ports you can imagine this was not a good position to be in.

In order to fix this Brocade has implemented some new logic, albeit proprietary and not part of the FC-FS standard, which allows you to turn on back-end credit checking. In short, it monitors the number of credits on each of these back-end links. If the credit counter stays below the number of credits negotiated during login for the E_D_TOV timeframe, and no frames have been sent during that timeframe, the credit recovery feature will reset the link for you in a similar fashion to the front-end ports, after which normal operation resumes.

The way to turn on this feature is also part of the bottleneckmon command:
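
The check itself boils down to three conditions that all have to be true. The Python sketch below only restates the description above; it is not Brocade’s code. The class, the field names and the 2 second E_D_TOV value are assumptions for the sake of the example.

```python
from dataclasses import dataclass

E_D_TOV_SECONDS = 2.0  # assumption: a commonly used E_D_TOV value, check your own fabric settings

@dataclass
class BackEndLink:
    negotiated_credits: int           # credits agreed at login
    current_credits: int              # credits the sender currently holds
    seconds_below_negotiated: float   # how long the counter has been below the login value
    frames_sent_in_window: int        # frames transmitted during that same period

def needs_link_reset(link, e_d_tov=E_D_TOV_SECONDS):
    """True when the link looks stuck: credits missing, for at least E_D_TOV, with no traffic."""
    credits_missing = link.current_credits < link.negotiated_credits
    stuck_long_enough = link.seconds_below_negotiated >= e_d_tov
    idle = link.frames_sent_in_window == 0
    return credits_missing and stuck_long_enough and idle

stuck = BackEndLink(negotiated_credits=8, current_credits=0,
                    seconds_below_negotiated=2.5, frames_sent_in_window=0)
busy = BackEndLink(negotiated_credits=8, current_credits=3,
                   seconds_below_negotiated=2.5, frames_sent_in_window=1200)
print(needs_link_reset(stuck), needs_link_reset(busy))
```

The “no frames sent” condition is what separates a really stuck link from a busy link that simply has credits outstanding.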

bottleneckmon --cfgcredittools -intport -recover onLrOnly

To be able to use this command you have to be at least at FOS level 6.3.2d, 6.4.2a or higher.
The latest FOS releases also have a manual check in case you might suspect a stuck credit condition.

As soon as you’ve enabled this feature (which I suggest you do immediately) you might run into some new error codes in the event log. When a lost credit condition is observed you will see a C2-1012 which looks something like this:

Message , [C2-1012], ,, WARNING, ,S,P(): Link Timeout on internal portftx= tov= (>)vc_no= crd(s)lost= complete_loss:.  (s)>

If this happens because of a physical issue on the back-end, the cause is most likely an increased bit error rate causing the same encoding/decoding errors. As shown before this will also corrupt R_RDY primitives. In addition to the C2-1012 you will also see a C2-1006, sometimes followed by a C2-1010.

Message , [C2-1006], ,, WARNING, ,S,C: Internal link errors reported, no hardware faultsidentified, continuing monitoring: fault1:, fault2:thresh1:0x.  

Message , [C2-1010], ,, CRITICAL, ,S, C: Internal monitoring has identified suspecthardware, blade may need to be reset or replaced: fault1:,fault2: th2:0x.  

There are some more errorcodes which outline specific conditions but I refer to the Brocade Message Reference guide for more info.

We talked a bit about frame flow and latency in the previous articles. Besides corruption due to a high bit-error rate you might be running into a high latency device, which basically is a device that is slow in returning credits. Sometimes this is by design, as firmware engineers might use it as a way to throttle incoming traffic which they are unable to offload onto the PCI bus. In other cases there is some other problem inside the system itself which slows down the handling of data, so that the HBA cannot send the data quickly enough to memory via RDMA or to the CPU for further handling.
Although it is a choice of the firmware and design engineers, normally the response time for sending back buffer credits should be virtually instantaneous.

To give you a feeling for the timing, an average R_RDY response time is around 1 to 5us. If you have an older network with low performance links this might increase to around 50us or sometimes higher.

[Figure: FC trace snippet]

As you can see (I hope you have good eyesight) the R_RDY is returned within 2us of the data frame being sent. This includes the reception of the first 24 bytes of the frame and the time the switch port needed to make its routing decision. So all in all pretty quick.

On somewhat slower equipment it looks a bit like this:

(The time between the frame and R_RDY shows a delta of 10.5us)

When you have a latency problem where the device is really slow in returning credits this time is far greater than shown above.

The above pictures come from a pretty expensive FC analyser and I don’t expect you to buy one (although it’s a pretty cool toy to get to the nitty gritty of things). If you run into a performance problem where you don’t see any obvious physical issue you might have run into a slow drain device. With the later FOS versions from Brocade there is a counter called er_tx_c3_timeout. This counter shows how many times a frame has been discarded due to upstream credit shortages. If this is an F-port then the device connected to this port is a prime suspect for mucking up your storage network. If this is an ISL then you will need to look at devices that are further upstream of this port, connected to other switches, routers or other equipment.

As always be aware that all these counters are cumulative and will just wrap after a while. You will have to establish a new baseline and then monitor for any counter to increase. The counter I mentioned above is also part of the porterrshow output since FOS version 7 which makes it very easy to determine such a condition.
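
Working with a baseline is simple enough, but the wrap deserves some care. The Python sketch below shows one way of calculating the increase between two readings. The 32-bit counter width is an assumption on my side; check the width of the counter you are actually looking at before relying on it.

```python
def counter_delta(baseline, current, width_bits=32):
    """Increase of a cumulative counter since the baseline, allowing for one wrap.

    width_bits is an assumed counter width, not a documented FOS value.
    """
    return (current - baseline) % (1 << width_bits)

# No wrap: counter went from 100 to 150.
print(counter_delta(100, 150))
# One wrap: counter was 10 short of the 32-bit maximum and now reads 5.
print(counter_delta(2**32 - 10, 5))
```

A delta of zero over your monitoring interval is what you want to see for a counter like er_tx_c3_timeout. Anything else is a reason to look at the port.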

I hope the blog posts in this series have helped a bit to explain how FC flow control works, how it operates in normal environments, what might happen if it doesn’t go as planned, and the ways to prevent and solve such conditions.

Let me know if you want to know about this or any other topic and I’ll see what I can produce.

Regards,
Erwin

One rotten apple spoils the bunch – 3

In the previous 2 blog-posts we looked at some reasons why a fibre-channel fabric still might have problems, even with all redundancy options available and MPIO checking for link failures etc.
The challenge is to identify any problematic port and act upon indications that certain problems might be apparent on a link.

So how do we do this in Brocade environments? Brocade has some features built into its FOS firmware which allow you to identify certain characteristics of your switches. One of them (Fabric Watch) I briefly touched upon previously. Two other commands which utilize Fabric Watch are bottleneckmon and portfencing. Let’s start with bottleneckmon.

Bottleneckmon was introduced in the FOS code stream to be able to identify 2 different kinds of bottlenecks: latency and congestion.

Latency is caused by a device which cannot cope with the offered load even though that load does not exceed the capabilities of the link. As an example, let’s say that a link has a synchronized speed of 4G, however the load on that link reaches no higher than 20MB/s and already the switch is unable to send more frames due to credit shortages. A situation like this will most certainly cause the sort of credit issues we’ve talked about before.

Congestion is when a link is overloaded with frames beyond the capabilities of the physical link. This often occurs on ISL and target ports when too many initiators are mapped on those links. This is often referred to as an oversubscribed fan-in ratio.

A congestion bottleneck is easily identified by looking at the offered load compared to the capability of the link. Extending the connection with additional links (ISLs, trunk ports, HBAs) and spreading the load over those links, or localizing/confining the load to the same switch or ASIC, will most often help. Latency however is a very different ballgame. You might argue that Brocade also has a port counter called tim_txcrd_zero, and that when it shows the transmit credit sitting at 0 pretty often you also have a latency device, but that’s not entirely true. It may also mean that this link is very well utilized and is using all its credits. In that case you should also see a fair link utilization w.r.t. throughput, but be aware this also depends on frame size.

So how do we define a link as a high latency bottleneck? The bottleneckmon configuration utility provides a vast number of parameters which you can use, however I would advise you to use the default settings as a start by just enabling bottleneck monitoring with the “bottleneckmon --enable” command. Also make sure you configure the alerting with the same command, otherwise the monitoring will be passive and you’ll have to check each switch manually.
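
The reasoning above can be put into a small Python sketch. To be very clear: this is not the algorithm bottleneckmon uses. The function name and both thresholds (80% utilization and 10% of the time at zero credit) are numbers I made up purely to show how the two symptoms are told apart.

```python
def classify_link(link_mb_s, observed_mb_s, zero_credit_fraction,
                  busy_threshold=0.8, starved_threshold=0.1):
    """Rough classification of a port; all thresholds are hypothetical.

    link_mb_s            -- what the link can carry
    observed_mb_s        -- what it actually carries
    zero_credit_fraction -- share of time the transmit credit sat at zero
    """
    utilisation = observed_mb_s / link_mb_s
    if utilisation >= busy_threshold:
        return "congestion"          # link itself is full
    if zero_credit_fraction >= starved_threshold:
        return "latency"             # out of credits while the link is far from full
    return "ok"

# A 4G link (roughly 400MB/s) doing only 20MB/s but starved of credits half the time.
print(classify_link(400, 20, 0.5))
# The same link running at 380MB/s, also often at zero credit.
print(classify_link(400, 380, 0.5))
```

The first case is the latency example from earlier in this post; the second is a well utilized link that simply uses all of its credits.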

If a high latency device is caused by physical issues like encoding/decoding errors you will get notified by the bottleneckmon feature however when this happens in the middle of the night you most likely will not be able to act upon the alert in a timely fashion. As I mentioned earlier it is important to isolate this badly behaving device as soon as possible to prevent it from having an adverse effect on the rest of the fabric. The portfencing utility will help with that. You can configure certain thresholds on port-types and errors and if such a threshold has been reached the firmware will disable this port and alert you of it.

I know many administrators are very reluctant to have a switch take these kinds of actions on its own and for a long time I agreed with that. However, seeing the massive devastation and havoc a single device can cause I would STRONGLY advise you to turn this feature on. It will save you long hours of troubleshooting and elongated conference calls whilst your storage network is causing your application to come to a halt. I’ve seen it many times, and even after pointing to a problem port very often the decision to disable such a port is subject to change management politics. I would strongly suggest that if you have such guidelines in your policies NOW is the time to revise those policies and enable the intelligence of the switches to prevent these problems from occurring.

For a comprehensive overview, options and configuration examples I suggest you first take a look at the FOS administrator’s guide of the latest FOS release. Brocade has also published some white-papers with more background information.

Regards,
Erwin

 

One rotten apple spoils the bunch – 1

Last week I had another one. A rotten apple that spoiled the bunch or, in storage terms, a slow drain device causing havoc in a fabric.

This time it was a blade-center server with a dubious HBA connection to the blade-center switch which caused link errors and thus corrupt frames, encoding errors and credit depletion. This, being a blade connected to a blade-switch, also propagated the credit depletion back into the overall SAN fabric and thus the entire fabric suffered significantly from this single problem device.

“Now how does this work” you’ll say. Well, it has everything to do with the flow-control methodology used in FC fabrics. In contrast to the Ethernet and TCP/IP world we, the storage guys, expect a device to behave correctly, as gentlemen usually do. That being said, as with everything in life, there are always moments when nasty things happen, and in the case of the “rotten apple” one storage device, being an HBA, tape drive or storage array, may be doing nasty things.

Let’s take a look how this normally should work.

FC devices run on a buffer-to-buffer credit model. This means the device reserves a certain number of buffers on the FC port itself. This number of buffers is then communicated to the remote device as credits. So basically device A gives the remote device permission to use X credits. Each credit is good for one frame with a data field of up to 2112 bytes (a full 2K data payload with room for optional headers), plus the frame header and trailer.

The number of credits each device can handle is “negotiated” during fabric login (FLOGI). On the left is a snippet from a FLOGI frame where you see the number of credits in hex.

So what happens after the FLOGI? As an example we use a connection that has negotiated 8 credits either way. If the HBA sends a frame (eg. a SCSI read request) it knows it only has 7 credits left. As soon as the switch port receives the frame it has to make a decision where to send this frame to. It does this based on routing tables, zoning configuration and some other rules, and if everything is correct it will route the frame to the next destination. Meanwhile it simultaneously sends back a, so called, R_RDY primitive. This R_RDY tells the HBA that it can increase the credit counter by one. So if the current credit counter was 5 it can now bump it back up to 6. (A “primitive” lives only between two directly connected ports and as such it will never traverse a switch or router. A frame can, and will, be switched/routed over one or more links.)
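
In code the whole credit mechanism is nothing more than a counter. Below is a minimal Python sketch of one side of a link; the class and method names are mine and it ignores everything else a real port does.

```python
class BBCreditPort:
    """Transmit side of one FC port, tracking buffer-to-buffer credits only."""

    def __init__(self, negotiated_credits):
        self.negotiated = negotiated_credits   # value agreed during FLOGI
        self.available = negotiated_credits    # credits we may still spend

    def send_frame(self):
        """Spend one credit; return False if we are not allowed to send."""
        if self.available == 0:
            return False
        self.available -= 1
        return True

    def receive_r_rdy(self):
        """The remote port freed a buffer: get one credit back."""
        if self.available < self.negotiated:
            self.available += 1

hba = BBCreditPort(negotiated_credits=8)
hba.send_frame()
print(hba.available)   # 7 credits left after the first frame
hba.send_frame()
hba.send_frame()
print(hba.available)   # 5
hba.receive_r_rdy()
print(hba.available)   # back up to 6
```

Once `available` hits zero `send_frame` refuses to send, which is exactly the situation the rest of this post is about.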

Below is a very simplistic overview of two ports on a FC link. On the left we have an HBA and on the right we have a switch port. The blue lines represent the data frames and the red lines the R_RDY primitives.

As I said, it’s pretty simplistic. In theory the HBA on the left could send up to 8 frames before it has to wait for an R_RDY to be returned.

So far all looks good, but what if the path from the switch back to the device is broken, due to a crack in the cable, unclean connectors, broken lasers etc.? The first problem we often see is that bits get flipped on a link, which in turn causes encoding errors. FC up to 8G uses an 8b/10b encoding/decoding mechanism. According to this algorithm the normal 8 data bits are converted to a, so called, 10-bit transmission character. These 10 bits are the actual ones that travel over the wire. The remote side uses this same algorithm to revert the 10 bits back into the original 8 data bits. This assures bit level integrity and DC balance on a link. However when a link has a problem as described above, chances are that one or more of these 10 bits flip from a 0 to a 1 or vice-versa. The recipient detects this problem, however since it is unaware of which bit got corrupted it will discard the entire transmission character. This means that if such a corruption is detected it will discard an entire primitive or, if the corrupted piece was part of a data frame, this entire frame will be dropped.

A primitive (including the R_RDY) consists of 4 transmission characters (4 * 10 bits), which together form one word. The first character is always a control character (K28.5) and it is followed by three data characters (Dxx.x).

0011111010 1010100010 0101010101 0101010101 (-K28.5 +D21.4  D10.2  D10.2 )
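
As a small aside, the DC balance of this R_RDY can be checked by counting ones and zeros per transmission character. The Python sketch below is a simplified version of the running disparity bookkeeping: a real decoder tracks disparity per 6-bit and 4-bit sub-block and also checks the character against the code tables, which this toy does not do. The function names are mine.

```python
def disparity(char):
    """Number of ones minus number of zeros in a 10-bit transmission character."""
    return char.count("1") - char.count("0")

def track_running_disparity(chars, rd=-1):
    """Return the running disparity (+1 or -1) after each character.

    Simplified per-character check; raises ValueError on an impossible sequence.
    """
    history = []
    for char in chars:
        d = disparity(char)
        if d not in (-2, 0, 2):
            raise ValueError("invalid character " + char)
        if d != 0:
            if (d > 0) == (rd > 0):
                raise ValueError("disparity error at " + char)
            rd = -rd   # an unbalanced character flips the running disparity
        history.append(rd)
    return history

r_rdy = ["0011111010", "1010100010", "0101010101", "0101010101"]
print(track_running_disparity(r_rdy))
```

The K28.5 has two more ones than zeros and the D21.4 has two more zeros than ones, so after these two characters the line is back in balance. Flip a single bit in any character and either the count or the sequence no longer adds up, which is one of the ways a receiver notices something is wrong.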

I will not go further into this since it’s beyond the scope of the article.

So if this R_RDY is discarded the HBA does not know that the switch port has indeed freed up the buffer and still thinks it can only send N-1 frames. The below depicts such a scenario:

As you can see, when R_RDYs keep getting lost the credit count will at some point in time become 0, meaning the HBA is unable to send any frames. When this happens an error recovery mechanism kicks in which basically resets the link, clearing all buffers on both sides of that link, and starts from scratch. The upper layers of the FC protocol stack (SCSI-FCP, IPFC etc) have to make sure that any outstanding frames are either re-transmitted or that the entire IO is aborted, in which case this IO needs to be re-executed in its entirety. As you can see this will cause a problem on this link, since a lot of things are going on except actually making sure your data frames are transmitted. If you think this will not have much of an impact, be aware that the above sequence might run in less than one tenth of a second and thus credit depletion can be reached in less than a second. So how does this influence the rest of the fabric, since this all seems to be pretty confined within the space of this particular link?

Let’s broaden the scope a bit from an architectural perspective. Below you see a relatively simple, though architecturally often implemented, core-edge fabric.
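
How quickly the credits run dry depends on how many R_RDYs get lost. The Python sketch below is a deliberately simple model: one R_RDY comes back for every frame sent, except that every Nth one is lost. The function name and the fixed loss pattern are assumptions for illustration only; on a real link the losses are random.

```python
def frames_until_stall(negotiated_credits, lose_every):
    """Count the frames sent before the credit counter hits zero,
    when every lose_every-th R_RDY never makes it back."""
    credits = negotiated_credits
    sent = 0
    while credits > 0:
        credits -= 1                 # frame leaves, one credit consumed
        sent += 1
        if sent % lose_every != 0:
            credits += 1             # R_RDY arrived intact, credit restored
    return sent

print(frames_until_stall(8, 10))   # 1 in 10 R_RDYs lost
print(frames_until_stall(8, 1))    # every R_RDY lost
```

With 8 credits and one in ten R_RDYs lost the sender stalls after 80 frames, and on a busy link that many frames go by very quickly.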

Each HBA stands for one server (Green, Blue, Red and Orange), each mapped to a port on a storage array.
Now let’s say server Red is a slow drain device or has a problem with its direct link to the switch. It is returning credits very intermittently due to the encoding errors explained above, or it is very slow in returning credits due to a driver/firmware timing issue. The HBA sends a read request for an IO of 64K data. This means that 32 data frames (normally FC uses a 2K frame size) will be sent back from the array to the Red server. Meanwhile the other 3 servers and the two storage arrays are also sending and receiving data. If the number of credits negotiated between the HBAs and the switches is 8, you can see that the first 16K of that 64K request will be sent to the Red server, however the remaining 48K is still either in transit from the array to the HBA or still in some outbound queue in the array. Since the edge switch (on the left) is unable to send frames to the Red server the remaining data frames (from ALL SERVERS) will stack up on the incoming ISL port (bright red). This in turn causes the outbound ISL port on the core switch (the one on the right) to deplete its credits, which means that at some point in time no frames are able to traverse the ISL, causing most traffic to come to a standstill.

You’ll probably ask “So how do we recover from this?”. Well, basically the port on the edge switch to the Red server will send an LR (Link Reset) after the agreed “hold-time”. The hold time is a calculated period in which the switch will hold frames in its buffers. In most fabrics this is 500ms. So if the switch has had zero credits available during the entire hold period and it has had at least 1 frame in its output buffer it will send an LR to the HBA. This causes both the switch and HBA buffers to clear and the number of credits will return to the value that was negotiated during FLOGI.
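
The numbers in this example are easy to retrace. The few lines of Python below do nothing more than the arithmetic from the paragraph above, using the 2K frame size and the 8 credits mentioned there.

```python
IO_SIZE = 64 * 1024        # the 64K read request
FRAME_PAYLOAD = 2 * 1024   # 2K of data per frame
CREDITS = 8                # credits negotiated on the link to the Red server

frames_needed = IO_SIZE // FRAME_PAYLOAD          # data frames for the whole IO
bytes_covered = CREDITS * FRAME_PAYLOAD           # what fits in the available credits
bytes_waiting = IO_SIZE - bytes_covered           # what has to queue up behind it

print(frames_needed, bytes_covered // 1024, bytes_waiting // 1024)
```

So three quarters of this single IO has to wait somewhere in the fabric until Red hands back credits, and that is for one IO of one server only.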

If you don’t fix the underlying problem this process will go on forever and, as you’ve seen, will severely impact your entire storage environment.

“OK, so the problem is clear, how do I fix it?”

There are two ways to tackle the problem, the good and the bad way.

The good way is to monitor and manage your fabrics and links for such behavior. If you see any error counter increasing, verify all connections, cables, SFPs, patch-panels and other hardware sitting in between the two devices. Clean connectors, replace cables and make sure these hardware problems do not re-surface. My advice is: if you see any link behaving like this DISABLE IT IMMEDIATELY!!!! No questions asked.

The bad way is to stick your head in the sand and hope for it to go away. I’ve seen many such issues crippling entire fabrics, and due to strictly enforced change control severe outages occurred and elongated recovery (very often multiple days) was needed to get things back to normal again. Make sure you implement emergency procedures which allow you to bypass these operational guidelines. It will save you a lot of problems.

Regards,
Erwin van Londen