
How to Identify SFP or Cable Failure in a Brocade Fibre Channel Switch

In a SAN you may encounter loss of link, i.e. link errors. A bad hardware component in the SAN starts to corrupt frames, and because of this the receiver asks the sender to resend them, which causes latency and other related issues. Most of the time the culprit is a cable or an SFP (small form-factor pluggable), but more analysis is required to narrow it down to a single hardware component.

The purpose of this post is to help you identify whether a cable or an SFP is causing the issue. In my example I have taken a Brocade SAN switch. The switch reports an error saying that its health has changed from healthy to marginal, and it also reports what caused the change of state; in our case it is a port. Below are the events:

2013/08/05-XXXXX, [XXXX], 9378, SLOT 5 | XXX, WARNING, XXXXXXX, Switch status changed from HEALTHY to MARGINAL.

2013/08/05-XXXXX, [XXXX], 9379, SLOT 5 | XXX, WARNING, XXXXXXX, Switch status change contributing factor Marginal ports: 1 marginal ports. (Port(s) 9(0x9)
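If you collect these events from several switches, pulling the port number out of the message by hand gets tedious. Below is a minimal Python sketch of one way to do it. The function name marginal_ports is hypothetical, and the pattern assumes every port is listed as a decimal number followed by its hex form in parentheses, as in the event above; I have not checked how the message looks when several ports are marginal.

```python
import re

def marginal_ports(event_line):
    """Return the decimal port numbers listed after 'Port(s)' in an event line."""
    # Everything after the literal 'Port(s)' marker; empty if the marker is absent.
    _, _, tail = event_line.partition("Port(s)")
    # Assumed layout: <decimal>(0x<hex>), e.g. 9(0x9)
    return [int(dec) for dec in re.findall(r"(\d+)\(0x[0-9a-fA-F]+\)", tail)]

event = ("2013/08/05-XXXXX, [XXXX], 9379, SLOT 5 | XXX, WARNING, XXXXXXX, "
         "Switch status change contributing factor Marginal ports: "
         "1 marginal ports. (Port(s) 9(0x9)")
print(marginal_ports(event))
```

The returned port number is the one to pass to porterrshow and portshow in the next steps.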

The first thing to look at after seeing this event is the porterrshow output on that switch. A high enc out counter on its own often implies a cable fault, while a combination of enc out and crc err implies an SFP fault, but not always. On my switch the porterrshow values for the port were:

crc err: 0
enc out: 252.5k
disc c3: 13.9k
link fail: 558
loss sync: 1.1k
loss sig: 1.1k

All the other values were zero.

Since the enc out counter is high while crc err is zero, we can suspect that the cable is faulty. Other counters have also incremented to high values, so we now need to check the portshow output of that particular port.
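The enc out / crc err rule of thumb can be written down as a small Python sketch. This is only an illustration of the reasoning in this post, not a Brocade tool: parse_counter and first_guess are hypothetical names, and the handling of the k, m and g suffixes is an assumption about how porterrshow abbreviates large values.

```python
# Multipliers for abbreviated counter values (assumed suffixes: k, m, g).
SUFFIXES = {"k": 1_000, "m": 1_000_000, "g": 1_000_000_000}

def parse_counter(text):
    """Convert an abbreviated counter such as '252.5k' into an integer."""
    text = text.strip().lower()
    if text and text[-1] in SUFFIXES:
        return int(round(float(text[:-1]) * SUFFIXES[text[-1]]))
    return int(text)

def first_guess(enc_out, crc_err):
    """Apply the rule of thumb from this post; it is a hint, not a verdict."""
    if enc_out > 0 and crc_err == 0:
        return "cable suspected"
    if enc_out > 0 and crc_err > 0:
        return "SFP suspected"
    return "no clear indication"

# Values copied from the porterrshow output in this post.
raw = {"crc err": "0", "enc out": "252.5k", "disc c3": "13.9k",
       "link fail": "558", "loss sync": "1.1k", "loss sig": "1.1k"}
values = {name: parse_counter(text) for name, text in raw.items()}
print(first_guess(values["enc out"], values["crc err"]))
```

Because the rule does not always hold, treat the result as a starting point and confirm it with portshow.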

portshow 9

The above command shows more details. We should look at this section of the output:

Lr_in: 555 Lr_out: 556 Ols_in: 0 Ols_out: 555

The above parameters tell us whether the destination or the source SFP is causing the error. If Lr_in and Ols_out are equal, that is a normal condition; the same applies to Lr_out and Ols_in.

If one counter is higher than its counterpart, then the frame was either already corrupted when it was received or was corrupted by the switch:
"in" > "out" - frames were bad when received
"out" > "in" - corruption caused by switch

So in our case Lr_out (556) is greater than Ols_in (0), hence the corruption is caused by a faulty GBIC (SFP) in the switch port.
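The comparison of the two counter pairs can also be sketched in Python. Again this is a minimal sketch under the assumptions of this post: parse_link_counters and compare_pair are hypothetical names, and the verdict strings simply repeat the "in" versus "out" rule given above.

```python
import re

def parse_link_counters(text):
    """Pick the four link counters out of portshow text."""
    pairs = re.findall(r"(Lr_in|Lr_out|Ols_in|Ols_out):\s*(\d+)", text)
    return {name: int(value) for name, value in pairs}

def compare_pair(in_count, out_count):
    """Compare an 'in' counter with its 'out' counterpart."""
    if in_count == out_count:
        return "normal"
    if in_count > out_count:
        return "frames were bad when received"
    return "corruption caused by switch"

# Counter line copied from the portshow output in this post.
counters = parse_link_counters("Lr_in: 555 Lr_out: 556 Ols_in: 0 Ols_out: 555")

# Lr_in pairs with Ols_out, and Ols_in pairs with Lr_out.
print(compare_pair(counters["Lr_in"], counters["Ols_out"]))
print(compare_pair(counters["Ols_in"], counters["Lr_out"]))
```

With the values from this switch, the first pair is equal and only the Lr_out / Ols_in pair differs, which matches the conclusion that the SFP in the switch port is at fault.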