I've written about fillwords a lot (see here, here, and here), but I haven't shown you much about the different symptoms an incorrect fillword setting may cause.
As you've seen, fillwords are a very nifty way of maintaining bit and word sync on a serial transmission link when no actual frames are being sent. They can also be replaced with other primitive signals (like R_RDY, VC_RDY etc.) to provide a very simple signalling method between two ports without interfering with actual frames. That means fillwords are ALWAYS squeezed in between frames.
In the FC world a transmitter shall ALWAYS send at least 6 fillwords before sending the next frame. This allows for that shifting ability on the RX side. The receiving side only needs to see 3 fillwords before it is able to process an arriving SOF. If only two fillwords are seen and then a new SOF comes in, the frame will be invalidated. There is no maximum on the number of fillwords: there may be no IO coming from the host at all, in which case there simply is no FC frame active on the link.
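To make the counting rule above concrete, here is a minimal Python sketch. It is only an illustration: the token names ("IDLE", "ARBff", "SOF", "DATA", "EOF") and the function `check_sofs` are made up for this example, and it deliberately ignores primitive signals such as R_RDY that may take the place of a fillword.

```python
MIN_TX_FILLWORDS = 6  # what a transmitter has to send between frames
MIN_RX_FILLWORDS = 3  # what a receiver has to see before it accepts an SOF

FILLWORDS = {"IDLE", "ARBff"}  # hypothetical token names for this sketch


def check_sofs(words):
    """Return one boolean per SOF: True if enough fillwords preceded it."""
    results = []
    gap = 0  # consecutive fillwords seen since the last non-fillword
    for word in words:
        if word in FILLWORDS:
            gap += 1
        elif word == "SOF":
            results.append(gap >= MIN_RX_FILLWORDS)
            gap = 0
        else:
            # frame content or EOF: the fillword count starts over
            gap = 0
    return results


stream = (
    ["IDLE"] * MIN_TX_FILLWORDS + ["SOF", "DATA", "EOF"]
    + ["IDLE"] * 2 + ["SOF", "DATA", "EOF"]
)
print(check_sofs(stream))
```

The first frame follows a full gap of 6 fillwords and is accepted; the second one follows only 2 and would be invalidated by the receiver.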
As I mentioned in my previous posts, fillwords come in two flavours these days: IDLE and ARBff. (There are more, like ARBfe, but these are rarely used and you cannot configure them on any switch or other mainstream equipment.)
When a device (N-Port) is connected to a switch (F-port) it first negotiates the actual speed. This is done via the timed stepping model, where each port keeps a list of supported speeds and the highest speed common to both endpoints is picked. Once that speed has been determined the port needs to obtain bit and word synchronisation. The first one is obvious: it needs to be able to distinguish between a 0 and a 1. The second one is a bit more difficult. Since FC is a so-called "word-aligned" protocol, every frame or primitive needs to be aligned on a word boundary. In FC a word is 4 bytes, or 40 bits on the wire (we're talking 10-bit encoded transmission characters here). To determine the word boundaries of primitives and frames we always start with a special character called the "comma", or K28.5 in short FC notation. The 3 characters after that determine the type of the primitive or start of frame (SoF).
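The word-alignment step can be sketched in a few lines of Python. This is a simplified model, not how a real SerDes does it: the stream is a list of already decoded character names instead of raw 10-bit symbols, the helpers `align_words` and `classify` are made up for this example, and the two ordered-set encodings in `KNOWN` are quoted from memory, so check them against the FC-FS standard before relying on them.

```python
COMMA = "K28.5"
WORD_LEN = 4  # one FC word = 4 transmission characters

# Assumed encodings of the three characters following the comma.
KNOWN = {
    ("D21.4", "D21.5", "D21.5"): "IDLE",
    ("D20.4", "D31.7", "D31.7"): "ARB(ff)",
}


def align_words(chars):
    """Drop everything before the first comma, then cut into 4-character words."""
    try:
        start = chars.index(COMMA)
    except ValueError:
        return []  # no comma seen, so no word sync is possible
    aligned = chars[start:]
    usable = len(aligned) - len(aligned) % WORD_LEN  # ignore a trailing partial word
    return [tuple(aligned[i:i + WORD_LEN]) for i in range(0, usable, WORD_LEN)]


def classify(word):
    """Name the word if it is one of the ordered sets listed in KNOWN."""
    if word[0] != COMMA:
        return "data"
    return KNOWN.get(word[1:], "other ordered set")


stream = [
    "D21.5", "D21.5",                    # tail of a word we joined halfway
    "K28.5", "D21.4", "D21.5", "D21.5",  # IDLE
    "K28.5", "D20.4", "D31.7", "D31.7",  # ARB(ff)
    "K28.5",                             # start of a word that is not complete yet
]
names = [classify(w) for w in align_words(stream)]
print(names)
```

Without a comma in the stream `align_words` returns nothing, which is exactly why a link without fillwords gives the receiver nothing to align on.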
In between frames.
If there is no IO being sent by the host there will obviously be no frames. You cannot just stop transmitting bits over the link, because minute differences in clock speed would let the two ends drift out of alignment. When you then send a new SoF, the receiving side would first need to re-align itself but has no means to do so since it has no reference points at that time. Since FC uses CDR (Clock Data Recovery) to align itself on the incoming bit stream (see my article here), you need to have an active bit stream on the link. This is where fillwords come in. With the transition to 8Gb and higher transmission speeds a distinct problem appeared: IDLE fillwords caused RFI (Radio Frequency Interference), which shows up as noise on the link. This can cause many issues, as described in the article referenced above. Using a different fillword called arb(ff), which seems to inherently reduce the RFI due to its different bit pattern, makes it much less likely that anything goes wrong on the link.
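A bit of back-of-the-envelope arithmetic shows how quickly those minute clock differences add up. The sketch below assumes a combined offset of 200 ppm between the two clocks (each end off by 100 ppm in opposite directions), which is an illustrative figure chosen for this example and not taken from the text above. The 8.5 Gbaud line rate is that of an 8G FC link; the function names are made up.

```python
def bits_until_slip(ppm_offset):
    """Bit times after which two free-running clocks differ by one full bit."""
    return 1_000_000 / ppm_offset


def seconds_until_slip(line_rate_baud, ppm_offset):
    """Same drift expressed in seconds for a given line rate."""
    return bits_until_slip(ppm_offset) / line_rate_baud


ASSUMED_PPM = 200     # assumption for this example
LINE_RATE_8G = 8.5e9  # baud

print(bits_until_slip(ASSUMED_PPM))
print(seconds_until_slip(LINE_RATE_8G, ASSUMED_PPM))
```

Under these assumptions the two ends are a full bit apart after 5000 bit times, well under a microsecond at 8G, so a silent link loses its alignment almost immediately.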
What can go wrong?
Well, here's an example of a Brocade switch which has ports fixed to 8G and the fillword mode set to 3. One thing I need to explain first is that there are two stages where fillwords can change. The first stage is when the link comes up, after the speed has been negotiated and bit synchronisation has been obtained. The second one is when the port state machine (PSM) transitions from the LF2 state to the Active state. If one end of the link starts sending IDLE and the other one ARB(FF), they obviously cannot sync up on word boundaries and hence the port will not come online. Another issue is the RFI problem mentioned above. Circuitry inside the equipment may be set up differently based upon the state of the port. If a port expects IDLEs but receives ARBs, or the other way around, this may cause massive port errors.
So here is the example. Some switch ports have been set to mode 3 while their speed is fixed to 8G. This means the switch port will first try to use ARB during link initialisation and ARB when the PSM moves to the Active state. If that fails, what normally happens is that the switch port sends a NOS (Not Operational) primitive, after which the remote device is expected to send an OLS (Offline) primitive (never mind the terminology). As soon as the switch port sees the OLS it is triggered to change the initialisation sequence and will use IDLE during link initialisation and ARB when the PSM shifts to the Active state. The issue is that when RFI is causing havoc on the link, because the device expects IDLEs but the switch sends ARB, there is a very high chance the NOS primitive sequence is either not received at all or is corrupted in such a way that the receiving side is unable to identify it as a NOS, and hence it will never send an OLS. Obviously, when the switch does not receive an OLS it does not switch to IDLE/ARB and the port stays in this "limbo" state.
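The mode 3 behaviour described above boils down to a small decision tree. The Python sketch below is my own simplification and not switch firmware: the names `Outcome` and `mode3_bring_up` and the two boolean inputs are invented for illustration.

```python
from enum import Enum


class Outcome(Enum):
    ONLINE_ARB_ARB = "online, ARB during init and ARB when Active"
    ONLINE_IDLE_ARB = "online, IDLE during init and ARB when Active"
    LIMBO = "stuck, the switch never saw an OLS"


def mode3_bring_up(device_syncs_on_arb, nos_recognised):
    """Model one link bring-up attempt of a switch port in fillword mode 3.

    device_syncs_on_arb: the device can word-sync on ARB during initialisation.
    nos_recognised: the device receives the NOS intact and identifies it.
    """
    # First attempt: ARB during link initialisation, ARB in the Active state.
    if device_syncs_on_arb:
        return Outcome.ONLINE_ARB_ARB
    # The attempt failed: the switch sends NOS and waits for the device's OLS.
    ols_received = nos_recognised  # the device only answers a NOS it can identify
    if ols_received:
        # The OLS triggers the fallback sequence: IDLE first, ARB once Active.
        return Outcome.ONLINE_IDLE_ARB
    return Outcome.LIMBO


print(mode3_bring_up(device_syncs_on_arb=False, nos_recognised=False).value)
```

The last branch is the case from the example: the device wants IDLEs, the NOS gets mangled on the way, no OLS comes back and the port never falls back to IDLE/ARB.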
Ports of Slot 0          0   1   2   3     4   5   6   7     8   9  10  11    12  13  14  15
----------------------+---+---+---+---+-----+---+---+---+-----+---+---+---+-----+---+---+---
Speed                   8G  8G  8G  8G    AN  AN  AN  AN    8G  8G  8G  8G    AN  AN  AN  AN
Fill Word(On Active)     3   3   3   3     3   3   3   3     3   3   3   3     3   3   3   3
Fill Word(Current)       3   3   3   3     3   3   3   3     3   3   3   3     3   3   3   3

Ports of Slot 0         16  17  18  19    20  21  22  23    24  25  26  27    28  29  30  31
----------------------+---+---+---+---+-----+---+---+---+-----+---+---+---+-----+---+---+---
Speed                   8G  8G  8G  AN    AN  AN  AN  AN    AN  AN  AN  AN    AN  AN  AN  AN
Fill Word(On Active)     3   3   3   3     3   3   3   3     3   3   3   3     3   3   3   3
Fill Word(Current)       3   3   3   3     3   3   3   3     3   3   3   3     3   3   3   3
As you can see many ports observed massive encoding errors. (I removed some columns to fit into the page)
porterrshow:
        frames         enc    crc    crc    too    too    bad    enc   disc   link   loss   loss
         tx     rx     in    err  g_eof   shrt   long    eof    out     c3   fail   sync    sig
  0:    858   1.3k      0      0      0      0      0      0   3.2m      0    222      0      0
  1:  10.5k  10.8k      0      0      0      0      0      0      0      0    250      0      0
  2:   6.9k   7.5k      0      0      0      0      0      0   1.5m      0    498      0      1
  3:    233    214      0      0      0      0      0      0      0      0    630      0      0
  4:   1.3k    834      0      0      0      0      0      0   4.9m      1      0      0      0
  5:  10.8k  10.5k      0      0      0      0      0      0 996.7k      1      0      0      0
  6:   7.5k   6.9k      0      0      0      0      0      0   3.6m      1      0      0      0
  7:    194    151      0      0      0      0      0      0   1.3m      1      0      0      0
  8:   7.3k   5.8k      0      0      0      0      0      0   3.2m      0    756      0      0
  9: 304.1k 573.8k      0      0      0      0      0      0      0      0    250      0      0
 10:  15.8k  16.5k      0      0      0      0      0      0 537.4k      0    466      0      0
 11:   1.0m   8.0m      0      0      0      0      0      0   3.3m      0    680      0      0
 12:   5.8k   7.2k      0      0      0      0      0      0   3.0m     18      0      0      0
 13: 573.8k 304.1k      0      0      0      0      0      0   1.5m      1      0      0      0
 14:  16.5k  15.7k      0      0      0      0      0      0   2.9m      1      0      0      0
 15:   8.0m   1.0m      0      0      0      0      0      0 601.5k      3      0      0      0
 16: 435.5k 337.4k      0      0      0      0      0      0   3.6m      0    590      0      0
This then results in the massive port-error messages in the eventlog.
2014/06/17-16:58:34, [PORT-1003], 241, FID 128, WARNING, DR_SW1, Port 2 Faulted because of many Link Failures.
2014/06/17-16:58:43, [PORT-1003], 242, FID 128, WARNING, DR_SW1, Port 11 Faulted because of many Link Failures.
2014/06/17-16:58:43, [FW-1424], 243, FID 128, WARNING, DR_SW1, Switch status changed from HEALTHY to MARGINAL.
2014/06/17-16:58:43, [FW-1437], 244, FID 128, WARNING, DR_SW1, Switch status change contributing factor Faulty ports: 8 faulty out of 80 ports:config(10.00 percent,8). (Port(s) 2(0x2),3(0x3),8(0x8),10(0xa),11(0xb),16(0x10),17(0x11),18(0x12)).
2014/06/17-16:59:37, [FW-1425], 245, FID 128, INFO, DR_SW1, Switch status changed from MARGINAL to HEALTHY.
2014/06/17-16:59:39, [PORT-1003], 246, FID 128, WARNING, DR_SW1, Port 3 Faulted because of many Link Failures.
2014/06/17-16:59:40, [PORT-1003], 247, FID 128, WARNING, DR_SW1, Port 17 Faulted because of many Link Failures.
2014/06/17-16:59:55, [PORT-1003], 248, FID 128, WARNING, DR_SW1, Port 16 Faulted because of many Link Failures.
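If you want to sift through porterrshow output like the one above with a script, the suffixed counters need to be turned into plain numbers first. Below is a minimal Python sketch. It assumes the trimmed 13-column layout I showed (the real command prints more columns), the thresholds are arbitrary values picked for this example, and the "g" suffix is an assumption since only "k" and "m" appear in my sample.

```python
SUFFIXES = {"k": 1_000, "m": 1_000_000, "g": 1_000_000_000}  # "g" is assumed

# Column names for the trimmed table in this post, not the full CLI output.
COLUMNS = [
    "frames_tx", "frames_rx", "enc_in", "crc_err", "crc_g_eof", "too_shrt",
    "too_long", "bad_eof", "enc_out", "disc_c3", "link_fail", "loss_sync",
    "loss_sig",
]


def parse_counter(text):
    """Convert '858', '1.3k' or '3.2m' into an integer."""
    text = text.strip().lower()
    if text and text[-1] in SUFFIXES:
        return round(float(text[:-1]) * SUFFIXES[text[-1]])
    return int(text)


def parse_row(line):
    """Split '0: 858 1.3k ...' into a port number and a dict of counters."""
    port, rest = line.split(":", 1)
    values = [parse_counter(v) for v in rest.split()]
    return int(port), dict(zip(COLUMNS, values))


def suspect_ports(lines, enc_out_limit=1_000_000, link_fail_limit=100):
    """Return the ports whose enc_out or link_fail counter reaches a limit."""
    suspects = []
    for line in lines:
        port, counters = parse_row(line)
        if (counters["enc_out"] >= enc_out_limit
                or counters["link_fail"] >= link_fail_limit):
            suspects.append(port)
    return suspects


sample = [
    "0: 858 1.3k 0 0 0 0 0 0 3.2m 0 222 0 0",
    "1: 10.5k 10.8k 0 0 0 0 0 0 0 0 250 0 0",
    "5: 10.8k 10.5k 0 0 0 0 0 0 996.7k 1 0 0 0",
]
print(suspect_ports(sample))
```

With these example limits ports 0 and 1 are flagged and port 5 is not. Pick limits that make sense for the time since the counters were last cleared.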
You might ask "Why do I only see encoding errors and not a single CRC error?". Well, remember that the CRC is checked on a frame and not on a primitive. Since the port is not Online yet, the attached device is not sending frames yet.
The solution to the above problem was to modify the fillword mode to 2 (which is the official way in the FC standard) which forces the switchport to use IDLE during initialisation and ARB during Active state.
Older equipment.
Switches and devices that do not support 8G have no idea of the arb(ff) fillword and basically look for IDLEs to align on the bitstream. If, however, a 4G device is connected to an 8G port and the fillword mode is fixed to mode 1 (which forces the port to use ARB during initialisation and ARB during the Active state), you'll end up with the same problem: the ports remain in this limbo state since they cannot align on word boundaries, simply because they look for something different.
If you have a 4G device connected to an 8G switch port, either set the speed fixed to 4G (which bypasses the entire fillword issue) or make sure the fillword mode is set to 0, so the port uses IDLE in both stages.
Below is a table of fillword settings that should work according to official standards.

Fillword | Port speed (Gb/s)
Mode     |  1 |  2 |  4 |  8
       0 |  X |  X |  X |  N
       1 |  N |  N |  N |  Z
       2 |  N |  N |  N |  X
       3 |  N |  N |  N |  Z

N = Not supported
X = Supported
Z = Only supported when vendor approves.
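The table is small enough to keep as a lookup in a script, for example to sanity-check a planned port configuration. A minimal Python sketch follows; the dictionary simply mirrors the table above and the mode descriptions come from this post, while the function names are made up.

```python
# What each mode sends: (during link initialisation, in the Active state).
MODES = {
    0: ("IDLE", "IDLE"),
    1: ("ARB", "ARB"),
    2: ("IDLE", "ARB"),
    3: ("ARB, falling back to IDLE after NOS/OLS", "ARB"),
}

# Fillword mode -> port speed in Gb/s -> code from the table above.
SUPPORT = {
    0: {1: "X", 2: "X", 4: "X", 8: "N"},
    1: {1: "N", 2: "N", 4: "N", 8: "Z"},
    2: {1: "N", 2: "N", 4: "N", 8: "X"},
    3: {1: "N", 2: "N", 4: "N", 8: "Z"},
}

MEANING = {
    "N": "Not supported",
    "X": "Supported",
    "Z": "Only supported when vendor approves",
}


def fillword_support(mode, speed_gb):
    """Spell out the table entry for a mode and port speed."""
    return MEANING[SUPPORT[mode][speed_gb]]


def standard_modes(speed_gb):
    """Modes marked as plainly supported (X) at the given speed."""
    return [mode for mode in sorted(SUPPORT) if SUPPORT[mode][speed_gb] == "X"]


print(standard_modes(8), MODES[2])
print(fillword_support(3, 8))
```

For an 8G port this returns mode 2 as the only plainly supported choice, and mode 0 for the lower speeds.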
I hope this shows a bit of the issues these fillwords can cause and how to solve them.
Kind regards,
Erwin van Londen
Hi Erwin,
Where did u get the official standards for the chart above regarding the fill modes? Do you have any links?
Hi,
These are based upon the official FC standards. You may see vendors advising differently especially when it comes to mode 3 and 2.
Hi,
I’ve got 3 HDS G1000 with 8gb ports .. what is the best practice for fillword please ?
Stef.
Hi Stef,
As per the T11 fibre-channel standard, the switch ports on an 8G switch (DCX/DCX-4s or any Goldeneye2/Condor2 based switch) connected to any HDS array should be set to mode 2.
Hope this helps.
I'm seeing mass errors across my Brocade encryption switches; currently all ports are set to 0 (idle/idle). The problem we are seeing is that when a cfgenable is run some of the ESX hosts see their storage connections become unstable. To recover, the port is disabled and the device recovers on the other connections to the SAN. Currently the storage is all HDS (VSP and HUS); the major issue appears to be with the HUS, as connections stay alive to the VSP.
We are planning to change the HDS devices to portcfgfillword 2 and the ESX hosts to 3. Is this a red herring or are we following the right path? Any advice appreciated.
Thanks
Hi Tony,
In order to rule out incorrect fillwords having an effect you should set them according to the manufacturer's guidelines. You didn't really specify WHICH errors you see. A bad ordered set is not necessarily a hard error, but it can have a negative effect which then results in these hard errors.
Let me know how it goes.
Regards,
Erwin
hey Erwin the only error we are seeing is “er_bad_os 730762736 Invalid ordered set” the stat was cleared on Thursday last week and is happening on 20 ports on the switch and seem to be about 1 billion per port/day. We are working to fix them and should have them done by the end of the week.
Yes, that is the counter you would expect to ramp up when an invalid fillword is set. If you set the correct fillword as I described, it will stop accumulating.