Tag Archives: fillwords

Signal quality and link stability

I really think I should stop with the fillword discussions, but here is one more. What happens if you have set the correct fillword and made sure all hardware is in tip-top shape, yet the encoding errors still fly around like a swarm of hornets? Then ISI (inter-symbol interference) may be the real problem.

The main issue is still that the receiving side is unable to distinguish between a 0 and a 1. The so-called eye pattern is too narrow, or distorted in such a way that the receiver just sees gibberish.


Fillwords IDLE vs ARBff (one last time)

I’ve written about fillwords a lot (see here, here, and here) but I haven’t shown you much about the different symptoms an incorrect fillword setting may cause.

As you’ve seen, fillwords are a very nifty way of maintaining bit and word sync on a serial transmission link when no actual frames are being sent. Furthermore, they can be replaced with other primitive signals (like R_RDY, VC_RDY, etc.) to provide a very simple signaling method between two ports without interfering with actual frames. That means that fillwords are ALWAYS squeezed in between frames.



One rotten apple spoils the bunch – 2

As mentioned in my previous post, it only takes a single device to cause some serious havoc in a storage environment. Now, "Why," you may ask, "do we have all this redundant kit in our environment (dual fabrics, redundant controllers, dual HBAs, MPIO software, etc.) whilst this 'slow drain device' remains the absolute Achilles heel of the entire storage infrastructure?"

Well, let's take a step back and see why it has come to this point. As with most hardware and software, things develop over time, so when we started doing network-based storage in the early-to-mid 90s we started out with a brand new protocol called Fibre Channel. (I'm sure you've heard of it.) The first iteration was based on arbitrated loop, which basically means connecting the TX port of an HBA to the RX port of a disk or tape device and vice versa, effectively creating a loop out of a point-to-point topology. When more HBAs and/or storage devices were inserted, you would get a ring topology. This was OK when you had around 3 or 4 devices in a ring (126 were possible), but from a manageability perspective you can imagine this was a nightmare.

So a new device called an FC hub was invented. This at least provided a single connectivity platform, so you could run all your cables to the same box. Internally, however, this was still a loop topology: each hub port just forwarded frames to the next port, which in turn sent them to its device, which, if the frame was not addressed to it, sent it back to the hub, and so on until the frame reached its destination. Now, this wasn't a very effective way of doing things, so at first the hub got a bit more intelligent by becoming a so-called loop switch. This meant the hub port looked at the destination address, and if the frame wasn't destined for a device attached to its own port, it just sent it on to the next port. This continued until the destination port was reached, which then opened up and sent the frame to the device.

As you can imagine, in some larger loop topologies every single device in the loop had to be made aware whenever a device came online or went offline, and for that the LIP (Loop Initialization Protocol) was invented. This protocol made sure that each device got a sort of "update" about the appeared or disappeared device. Later on, the loop methodology was almost entirely abandoned in favour of switched fabrics, which are far more intelligent in shoving frames in the right direction.

Now remember that Fibre Channel was developed with one thing in mind, and that was to get the maximum possible speed out of very reliable networks. This also meant that no error correction is done at the protocol layer; every possible recovery option was handled by the upper layer protocols like IP or SCSI.
The problem was that you always had a single point of failure, irrespective of which topology you chose. If you had a server in a loop and the HBA had a problem, the entire loop could potentially be mucked up. The same applied when an AL hub or FC switch had a problem: all your connections to your disks would be lost, and at best you were lucky enough to be using journaled filesystems, which were relatively fast at recovering. How many of you have waited 5 or more hours for a Windows chkdsk to finish, just to find out the entire disk was corrupted and you had to restore from tape?

So to circumvent that, the storage folk more or less determined that you would need at least two of everything, physically separated, so no component could affect the availability of another. This is where MPIO comes in: when you have multiple paths to a device over separate channels, the operating system just sees each path as a different device, so potentially you end up with two disks (or tapes, or whatever) which physically are the same volume. MPIO software fixed that by building in logic to present just one volume to the OS. The other thing built into MPIO was link error detection. If a link dropped light or lost sync for whatever reason, the HBA would go into a non-active state and send a signal to the upper layer that it had lost the link; MPIO could then redirect all IOs to the other paths, and everything would live happily ever after. If that link came back again, MPIO would pick this up, provide the option to use that path again, and we were on our way.

This shows that MPIO relies on HBA state signals upon which it can act. The problem, however, is that a link might drop somewhere else in the fabric. In that case the HBA sees no problem, since its own link is still up, in sync, and showing no other issues. The only way for MPIO to observe such a problem is to detect an IO failure and react to one or more of these failures by putting the logical path in an offline state. (The physical link from the HBA to the switch is still online.)
This imposes another problem: what if there is no IO going over that path? Many storage networks are designed in an active/passive configuration, so only one logical path is sending and receiving IOs. If there is a problem on the passive side of the path, but further downstream in the fabric, the HBA will not notice it and, as such, there will be no notification to the MPIO layer, so MPIO will never put this path offline. In case of a real problem on the active side, MPIO tries to fail over, but it runs into the same downstream problem, and both paths to the device fail. Many MPIO software vendors, like Hitachi with HDLM, have built in logic to test for such conditions. In HDLM you configure so-called IEM (Intermittent Error Monitoring): HDLM polls the target device by sending a sector-0 read request every once in a while, and if that succeeds it waits for the next polling cycle. If an error has been observed more times than the configured threshold, it puts the path offline.
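The polling-and-threshold logic described above can be sketched in a few lines of Python. This is a hypothetical illustration of the idea, not the actual HDLM implementation; the path structure and the `read_sector0` callback are made up for the example.

```python
# Hypothetical sketch of intermittent error monitoring: poll each path
# with a sector-0 read and take it offline once the error count exceeds
# a threshold. Names and the read_sector0() callback are illustrative.

ERROR_THRESHOLD = 3  # errors tolerated before the path is taken offline

def poll_paths(paths, read_sector0):
    """Poll every path once; read_sector0(path) returns True on success."""
    for path in paths:
        if read_sector0(path):
            path["errors"] = 0            # healthy poll resets the counter
        else:
            path["errors"] += 1
            if path["errors"] > ERROR_THRESHOLD:
                path["state"] = "offline"  # stop routing IO down this path

# Example: a passive path that fails every poll gets fenced off
# once the threshold is exceeded.
paths = [{"name": "passive", "errors": 0, "state": "online"}]
for _ in range(5):
    poll_paths(paths, lambda p: False)
print(paths[0]["state"])  # → offline
```

The point of the periodic probe is exactly what the text describes: it generates IO on an otherwise idle passive path, so a downstream failure is noticed before a real failover depends on that path.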

You might think we've covered everything now, and I wish that were true. MPIO only acts upon frames going AWOL, but as you've seen in my previous article, the major problem often lies beyond the data frames: a vast majority of issues these days are due to problems in flow control. These in turn create slow drain devices, which have the effect of depleting credits further downstream.

Only FC layer 2 has any notion of buffer credits, and this is never propagated to the upper level protocol stack. This is true for any HBA, firmware, driver, MPIO software, and OS. If any problems occur downstream of the initiator or upstream of the target, all devices in that particular path will incur a performance impact, and at some point in time an availability problem. MPIO will NOT help in this case, as I explained above.

The only way to prevent this from happening is active monitoring and management of your entire fabric, and if any link issues do surface, fix them immediately.

What do you look for in these cases? Basically, all errors that might affect an FC frame or FC traffic flow.
In Brocade FOS there is a command called "porterrshow", of which the output looks like this:

The 7 columns outlined show whether any issues with frames and/or primitives have occurred at some point in time. (Use the "help porterrshow" command for an explanation of each of the columns.) Use subsequent porterrshow commands to see if any of the counters are increasing. The other option is to create a new baseline with the "statsclear" command so all counters are reset to 0.
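Comparing successive snapshots is easy to automate. Here is a small Python sketch that diffs two sets of per-port counters and reports only the ones that went up; the dictionary layout is my own assumption, since in practice you would first parse the porterrshow table into something like it.

```python
# Sketch of baselining switch error counters: take two snapshots of
# per-port counters (e.g. parsed from successive porterrshow outputs)
# and report only the counters that increased. The dict layout is an
# assumption for illustration; FOS prints a table you would parse first.

def increasing_counters(baseline, current):
    """Return {port: {counter: delta}} for counters that went up."""
    deltas = {}
    for port, counters in current.items():
        before = baseline.get(port, {})
        changed = {name: value - before.get(name, 0)
                   for name, value in counters.items()
                   if value > before.get(name, 0)}
        if changed:
            deltas[port] = changed
    return deltas

baseline = {"0": {"enc_out": 10, "crc_err": 0}}
current  = {"0": {"enc_out": 25, "crc_err": 0}}
print(increasing_counters(baseline, current))  # → {'0': {'enc_out': 15}}
```

A steadily growing delta between polls, rather than the absolute counter value, is what tells you a link is actively misbehaving.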

Cisco has a similar output, albeit in a non-table format, with "show interface detailed-counters".

The next article outlines an option in Brocade FOS to detect a slow drain device with the bottleneckmon feature, and shows how to automatically disable a port if too many errors on one of the above counters have occurred within a certain time frame. If you have a Brocade FOS admin manual, look at the port-fencing feature.

Kind regards,

One rotten apple spoils the bunch – 1

Last week I had another one. A rotten apple that spoiled the bunch or, in storage terms, a slow drain device causing havoc in a fabric.

This time it was a blade-center server with a dubious HBA connection to the blade-center switch which caused link errors and thus corrupt frames, encoding errors and credit depletion. This, being a blade connected to a blade-switch, also propagated the credit depletion back into the overall SAN fabric and thus the entire fabric suffered significantly from this single problem device.

“Now how does this work?” you’ll say. Well, it has everything to do with the flow control methodology used in FC fabrics. In contrast to the Ethernet and TCP/IP world, we, the storage guys, expect a device to behave correctly, as gentlemen usually do. That being said, as with everything in life, there are always moments when nasty things happen, and in the case of the “rotten apple” a single storage device (an HBA, tape drive, or storage array port) may be doing nasty things.

Let’s take a look how this normally should work.

FC devices run on a buffer-to-buffer credit model. This means a device reserves a certain number of buffers on the FC port itself. This number of buffers is then communicated to the remote device as credits: basically, device A gives device B permission to use X amount of credits. Each credit represents one maximum-size frame, around 2112 bytes (a full 2K data payload plus frame header and trailer).

The number of credits each device can handle is “negotiated” during fabric login (FLOGI). On the left is a snippet from a FLOGI frame where you can see the number of credits in hex.

So what happens after the FLOGI? As an example, take a connection that has negotiated 8 credits either way. If the HBA sends a frame (e.g. a SCSI read request), it knows it only has 7 credits left. As soon as the switch port receives the frame, it has to decide where to send it. It does this based on routing tables, the zoning configuration, and some other rules, and if everything is correct it will route the frame to the next destination. Meanwhile, it simultaneously sends back a so-called R_RDY primitive. This R_RDY tells the HBA that it can increase its credit counter by one again. So if the current credit counter was 5, it can now bump it back up to 6. (A “primitive” lives only between two directly connected ports and as such will never traverse a switch or router. A frame can, and will, be switched/routed over one or more links.)
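The bookkeeping described above fits in a few lines of Python. This is purely an illustration of the mechanism, not any real HBA or switch firmware:

```python
# Minimal sketch of buffer-to-buffer credit accounting: the sender
# decrements its credit counter for every frame it sends and increments
# it again for every R_RDY it receives back from the adjacent port.

class BBCreditPort:
    def __init__(self, credits):
        self.max_credits = credits   # negotiated at FLOGI, e.g. 8
        self.credits = credits

    def send_frame(self):
        if self.credits == 0:
            return False             # must wait for an R_RDY first
        self.credits -= 1
        return True

    def receive_r_rdy(self):
        # the remote port freed a buffer; one credit comes back
        self.credits = min(self.credits + 1, self.max_credits)

hba = BBCreditPort(8)
hba.send_frame()                     # credits: 8 -> 7
hba.receive_r_rdy()                  # credits: 7 -> 8
print(hba.credits)                   # → 8
```

Note that the counter lives entirely on the sending side; the switch never tells the HBA how many buffers it has free, it only signals "one more" with each R_RDY.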

Below is a very simplistic overview of two ports on a FC link. On the left we have an HBA and on the right we have a switch port. The blue lines represent the data frames and the red lines the R_RDY primitives.

As I said, it’s pretty simplistic. In theory the HBA on the left could send up to 8 frames before it has to wait for an R_RDY to be returned.

So far all looks good, but what if the path from the switch back to the device is broken, either due to a crack in the cable, unclean connectors, a broken laser, etc.? The first problem we often see is that bits get flipped on a link, which in turn causes encoding errors. FC up to 8G uses an 8b/10b encoding/decoding mechanism. According to this algorithm, the normal 8 data bits are converted to a so-called 10-bit word, or transmission character. These 10 bits are the ones that actually travel over the wire. The remote side uses the same algorithm to revert the 10 bits back into the original 8 data bits. This ensures bit-level integrity and DC balance on the link. However, when a link has a problem as described above, chances are that one or more of these 10 bits flip from a 0 to a 1 or vice versa. The recipient detects this problem, but since it is unaware of which bit got corrupted, it will discard the entire transmission character. This means that if such a corruption is detected, an entire primitive is discarded, or, if the corrupted piece was part of a data frame, that entire frame is dropped.

A primitive (including the R_RDY) consists of 4 transmission characters (4 × 10 bits). The first is always a control character (K28.5), and it is followed by three data characters (Dxx.y).

0011111010 1010100010 0101010101 0101010101 (-K28.5 +D21.4  D10.2  D10.2 )

I will not go further into this since it's beyond the scope of this article.
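One small part is worth making concrete, though: the Dxx.y and Kxx.y names follow directly from the byte value being encoded, with xx taken from the low five bits and y from the high three. A tiny helper shows the mapping (this only derives the name; the actual 10-bit encoding tables are omitted):

```python
# Derive the 8b/10b character name from a byte value: xx is the low
# five bits, y the high three bits. K28.5 (0xBC) is the control
# character that starts every ordered set.

def name_8b10b(byte, control=False):
    xx = byte & 0x1F          # low 5 bits -> the "xx" part
    y = (byte >> 5) & 0x07    # high 3 bits -> the "y" part
    return f"{'K' if control else 'D'}{xx}.{y}"

print(name_8b10b(0xBC, control=True))  # → K28.5 (ordered-set start)
print(name_8b10b(0x95))                # → D21.4
print(name_8b10b(0x4A))                # → D10.2
```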

So if this R_RDY is discarded, the HBA does not know that the switch port has indeed freed up the buffer, and still thinks it can only send N-1 frames. The below depicts such a scenario:

As you can see, when R_RDYs are lost, the credit counter will at some point reach 0, meaning the HBA is unable to send any frames at all. When this happens, an error recovery mechanism kicks in which basically resets the link, clearing all buffers on both sides of that link, and starts from scratch. The upper layers of the FC protocol stack (SCSI-FCP, IPFC, etc.) have to make sure that any outstanding frames are either retransmitted, or the entire IO is aborted, in which case the IO in its entirety needs to be re-executed. As you can see, this causes a problem on this link, since a lot of things are going on except actually getting your data frames transmitted. If you think this will not have much of an impact, be aware that the above sequence might run in less than a tenth of a second, and thus credit depletion can be reached within less than a second. So how does this influence the rest of the fabric, since this all seems pretty confined to this particular link?
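To make the depletion mechanics concrete, here is a toy simulation of the scenario above: every lost R_RDY permanently removes one credit from the sender's counter, so even a modest loss rate walks a healthy link down to a stall. The loss rate and frame counts are arbitrary illustration values.

```python
# Toy simulation of credit depletion through lost R_RDYs: each loss
# permanently removes one credit from the sender's view, even though
# the receiver's buffers are actually free.

def run_link(negotiated_credits, rrdy_loss_every):
    credits = negotiated_credits
    for frame_no in range(1, 1000):
        if credits == 0:
            return frame_no          # link stalls; LR recovery kicks in
        credits -= 1                 # frame sent
        if frame_no % rrdy_loss_every != 0:
            credits += 1             # R_RDY received, credit restored
        # else: R_RDY corrupted and discarded -> credit permanently lost

# With 8 credits and 1 in 10 R_RDYs lost, the link stalls after
# only 80 frames -- a fraction of a second at line rate.
print(run_link(8, 10))  # → 81
```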

Let's broaden the scope a bit from an architectural perspective. Below you see a relatively simple, though architecturally often implemented, core-edge fabric.

Each HBA stands for one server (Green, Blue, Red, and Orange), each mapped to a port on a storage array.
Now let's say server Red is a slow drain device or has a problem with its direct link to the switch: it is returning credits very intermittently due to the encoding errors explained above, or it is very slow in returning credits due to a driver/firmware timing issue. The HBA sends a read request for an IO of 64K of data. This means that 32 data frames (FC normally uses a 2K frame size) will be sent back from the array to the Red server. Meanwhile, the other three servers and the two storage arrays are also sending and receiving data. If the number of credits negotiated between the HBAs and the switches is 8, you can see that only the first 16K of that 64K request will be sent to the Red server; the remaining 48K is either in transit from the array to the HBA or still in some outbound queue in the array. Since the edge switch (on the left) is unable to send frames to the Red server, the remaining data frames (from ALL servers) will stack up on the incoming ISL port (bright red). This in turn causes the outbound ISL port on the core switch (the one on the right) to deplete its credits, which means that at some point no frames are able to traverse the ISL at all, causing most traffic to come to a standstill.
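The numbers in that example follow from simple arithmetic, which a few lines make explicit:

```python
# Quick arithmetic behind the example above: a 64K read at a 2K frame
# size is 32 frames, but with 8 credits only 8 frames (16K) can be in
# flight before the array must wait for R_RDYs from the slow device.

io_size_kb = 64
frame_payload_kb = 2
credits = 8

frames_total = io_size_kb // frame_payload_kb   # 32 frames in the IO
in_flight_kb = credits * frame_payload_kb       # 16K deliverable at once
stuck_kb = io_size_kb - in_flight_kb            # 48K queued upstream

print(frames_total, in_flight_kb, stuck_kb)     # → 32 16 48
```

The 48K "stuck" portion is what ends up occupying buffers on the ISL and the array port, which is exactly how one slow edge device starves traffic that has nothing to do with it.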

You’ll probably ask, “So how do we recover from this?” Well, basically the port on the edge switch to the Red server will send an LR (Link Reset) after the agreed “hold time” has expired. The hold time is a calculated period during which the switch will hold frames in its buffers; in most fabrics this is 500ms. So if the switch has had zero credits available during the entire hold period, and it has had at least 1 frame in its output buffer, it will send an LR to the HBA. This causes both the switch and HBA buffers to be cleared, and the number of credits returns to the value that was negotiated during FLOGI.
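The hold-time rule reduces to a two-condition check, sketched below. The 500ms figure comes from the text; the function shape is a simplification of my own, since real switches evaluate this continuously in hardware.

```python
# Sketch of the hold-time rule described above: if a port has had zero
# transmit credits for the whole hold period (commonly 500 ms) while at
# least one frame sat in its output buffer, it sends a Link Reset (LR).

HOLD_TIME_MS = 500

def should_send_lr(ms_at_zero_credit, frames_queued):
    return ms_at_zero_credit >= HOLD_TIME_MS and frames_queued > 0

print(should_send_lr(500, 3))   # → True  (stuck port: reset the link)
print(should_send_lr(200, 3))   # → False (still within hold time)
print(should_send_lr(600, 0))   # → False (nothing queued, no LR needed)
```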

If you don’t fix the underlying problem this process will go on forever and, as you’ve seen, will severely impact your entire storage environment.

“OK, so the problem is clear, how do I fix it?”

There are two ways to tackle the problem, the good and the bad way.

The good way is to monitor and manage your fabrics and links for such behavior. If you see any error counter increasing, verify all connections, cables, SFPs, patch panels, and other hardware sitting between the two devices. Clean connectors, replace cables, and make sure these hardware problems do not resurface. My advice: if you see any link behaving like this, DISABLE IT IMMEDIATELY !!!! No questions asked.

The bad way is to stick your head in the sand and hope it goes away. I've seen many such issues crippling entire fabrics, and due to strictly enforced change control, severe outages occurred and elongated recovery (very often multiple days) was needed to get things back to normal again. Make sure you implement emergency procedures which allow you to bypass these operational guidelines. It will save you a lot of problems.

Erwin van Londen

Fill Words. What are they, what do they do, and why are they needed?

There has been quite some confusion around the use of fill words since the adoption of the 8G Fibre Channel standard. Some admins have reported problems connecting devices at this speed, as well as numerous headaches in long-distance replication, especially when DWDM/CWDM equipment is involved.

An ordered set is a transmission word used to perform control and signaling functions. There are 3 types of ordered sets defined:

1. Frame delimiters. These identify the start and end of frames.
2. Primitive signals. These are normally used to indicate events or actions (like IDLE).
3. Primitive sequences. These are used to indicate state or condition changes and are normally transmitted continuously until something causes the current state to change. Examples are NOS, OLS, LR, and LRR.

So what is a fill word? A fill word is a primitive signal which is needed to maintain bit and word synchronization between two adjacent ports. It doesn't matter what port type (F-port, E-port, N-port, etc.) it is. Fill words are not data frames in the sense that they transport user data; instead they communicate status between the two ports. If no user data is transmitted, the ports send so-called IDLE primitives. These are just ordered sets with a bit pattern on which the ports are able to keep their synchronization at both bit level and word level. Like any ordered set, the IDLE primitive starts on the wire with the K28.5 control character (a Fibre Channel notation from 8b/10b encoding), followed by three data characters of which the last 20 bits are 1010101010... etc. Depending on the content of these transmission characters, it's either a fill word or a non-fill word.

Examples of fill words are IDLE, ARB(F0), and ARB(FF); non-fill words include R_RDY, VC_RDY, etc.

So what happened recently with the introduction of the 8G standard?

In the 1, 2, and 4G standards, the IDLE primitive signal was used to keep bit and word synchronization. This bit pattern was OK at those speeds, but it has been observed that, when increasing the clock speed, this pattern causes high emissions, which in turn can cause problems on adjacent ports and links. In order to reduce that, the standard now requires links running at 8G speed to use the ARB(FF) fill word. This is a different bit pattern which doesn't have this high-emission characteristic.

You might wonder what this has to do with your connection problem. If links negotiate 8G speed, they both have to use the ARB(FF) fill word. If that doesn't happen for some reason, the ports cannot maintain word synchronization and therefore cannot change into the active state. This leaves both ports in a sort of deadlock: although you may see a green status light on your HBA and switch port, the link is still not able to transfer data.

The standard defines that ports which connect at 8G speed first have to initialize with IDLE fill words, and as soon as the port changes to the active state it should change the fill word to ARB(FF).
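The selection rule above can be summed up in a tiny decision function. This is a simplified sketch of the rule as described here, not an exhaustive rendering of the standard:

```python
# Sketch of the fill word rule: 1/2/4G links use IDLE throughout, while
# 8G links initialize with IDLE and switch to ARB(FF) once the port
# goes active. A mismatch here is the deadlock described in the text.

def fillword(speed_gbit, port_active):
    if speed_gbit < 8:
        return "IDLE"
    return "ARB(FF)" if port_active else "IDLE"

print(fillword(4, True))    # → IDLE
print(fillword(8, False))   # → IDLE     (8G link still initializing)
print(fillword(8, True))    # → ARB(FF)  (8G link in the active state)
```

If one side makes the switch to ARB(FF) and the other keeps sending IDLE, neither can hold word sync, which is why both ports must follow the same rule at the same state transition.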

It becomes even more complicated with DWDM and CWDM equipment, particularly when multiplexers are used. These TDM devices normally crack open the Fibre Channel link at the frame boundary level and are then able to multiplex it at a higher clock rate, so they can send data from multiple links over one wavelength. If, however, these TDM devices cannot open the Fibre Channel link because they only look for IDLE fill words, the end-to-end link will fail.

Verify with your manufacturer whether you use TDM devices, and if so, whether they support ARB(FF) fill words. If not, you may have to force the link speed to a lower level like 4G.