Part 4 The Saga Continues.

Everything was nice and quiet for us on this issue for a few months. Then classes started, things got a little busy, and the problem is back with a vengeance. We have been getting paged once per night, and it is a random switch in the stack, not always the same one. We logged a support case with HP and got the standard tier 1 response: update to the most current firmware. So we will be upgrading to W.15.12.0011 during our next maintenance window. If this does not address the issue, we will continue to poke away at the problem!

Part 1 | Part 1 follow up | Part 2 | Part 3 | Part 4

** Emergency Maintenance Announcement – No service interruption anticipated **

We replaced a pre-failing (3:26pm) drive in the Development cluster today. The array will be rebuilding over the next few hours.

Start: 10/27/2013 7:48 PM

End: 10/27/2013 7:48 PM

If you have questions or concerns about this maintenance, please contact the Shared Infrastructure Group at osu-sig (at) oregonstate.edu or call 737-7SIG.

** Maintenance Announcement – No service interruption anticipated **

We will be performing minor maintenance.

We do not anticipate any service interruption.

Start: 10/5/2013 01:00 AM

End: 10/5/2013 01:05 AM

If you have questions or concerns about this maintenance, please contact the Shared Infrastructure Group at osu-sig (at) oregonstate.edu or call 737-7SIG.

** Emergency Maintenance Announcement – No service interruption anticipated **

We will be replacing a failed RAID/ROM battery in a fast disk node in our LeftHand SAN. This is the storage network that backs our VMware infrastructure.

We do not anticipate any service interruption. Our nodes are redundant for each other and we will only be working on a single node, so the change should not interrupt service.

Start: 07/14/2013 11:00 PM

End: 07/14/2013 12:00 PM

If you have questions or concerns about this maintenance, please contact the Shared Infrastructure Group at osu-sig (at) oregonstate.edu or call 737-7SIG.

** Maintenance Announcement – No service interruption anticipated **

We will be applying a configuration change to our iSCSI switches that support our StoreVirtual SAN. This is the storage network that backs our VMware infrastructure. (Configuring the timezone offset, take 2, and raising the syslog level to debug.)

We do not anticipate any service interruption. Our switching is redundant and we will change the switches one at a time, so the changes should not interrupt service.

Start: 06/08/2013 10:00 PM

End: 06/08/2013 10:15 PM

If you have questions or concerns about this maintenance, please contact the Shared Infrastructure Group at osu-sig (at) oregonstate.edu or call 737-7SIG.

** Maintenance Announcement – No service interruption anticipated **

We will be applying a configuration change to our iSCSI switches that support our StoreVirtual SAN. This is the storage network that backs our VMware infrastructure. (Configuring syslog.)

We do not anticipate any service interruption. Our switching is redundant and we will change the switches one at a time, so the changes should not interrupt service.

Start: 06/04/2013 10:00 PM

End: 06/04/2013 10:15 PM

If you have questions or concerns about this maintenance, please contact the Shared Infrastructure Group at osu-sig (at) oregonstate.edu or call 737-7SIG.

We made the changes requested of us in Part 2. However, we are still experiencing the occasional ping failure on one of the switches. We have not seen a loop detected. It is interesting to have loop protection turned on, and it doesn’t hurt anything, but the blade centers and Flex-10s do not seem to be the problem.

So where do we go now?! Another call in to HP support, and they have directed us to perform the following steps:

  1. Re-seat all components in the switch that keeps paging us.
  2. Enable syslog on all switches and see if they say anything.
  3. Add a timezone offset to the ntp configuration.

Hopefully this will show us what is happening, as we still do not have a resolution. Or is it time to start replacing hardware? Is the SFP unit in the switch bad? Is the 10G module bad? Is the switch bad? Lots of questions and no firm answers as to why we get woken up in the middle of the night when all seems fine except for a quick down/up event.
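For the syslog step, the switches need somewhere to send their messages. Below is a minimal sketch of a receiver in Python, assuming plain UDP syslog. The port 5514 and the names `format_line` and `listen` are made up for illustration and are not part of our actual setup. It stamps each message with the receiver's own clock, so the timestamps stay usable even while the switch timezone offset is still wrong.

```python
import socket
from datetime import datetime


def format_line(sender: str, payload: bytes, now: datetime) -> str:
    """Prefix a raw syslog datagram with the receive time and sender address."""
    text = payload.decode("utf-8", errors="replace").strip()
    return f"{now.isoformat(timespec='seconds')} {sender} {text}"


def listen(host: str = "0.0.0.0", port: int = 5514) -> None:
    """Print every datagram received on host:port; runs until interrupted.

    Port 5514 is an assumption chosen to avoid needing root; standard
    syslog uses UDP 514.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((host, port))
        while True:
            payload, (sender, _sender_port) = sock.recvfrom(4096)
            print(format_line(sender, payload, datetime.now()), flush=True)
```

Calling `listen()` and pointing a switch at that host and port should be enough to see whether the switch reports anything around the time of a page. A real syslog daemon would be the better long-term home for these messages.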

Part 1 | Part 1 follow up | Part 2 | Part 3 | Part 4

** Maintenance Announcement – No service interruption anticipated **

We will be applying a configuration change to our iSCSI switches that support our StoreVirtual SAN. This is the storage network that backs our VMware infrastructure.

We do not anticipate any service interruption. Our switching is redundant and we will change the switches one at a time, so the changes should not interrupt service.

Start: 06/01/2013 10:00 PM

End: 06/01/2013 11:00 PM

If you have questions or concerns about this maintenance, please contact the Shared Infrastructure Group at osu-sig (at) oregonstate.edu or call 737-7SIG.

This last Monday we were paged twice more for down events on one of our switches, once at 8am and once at 5pm. This did not impact service, but it is troubling, as we want everything in our environment to be healthy all the time. For a description of the problem we are seeing, take a look at my earlier blog post and its follow up.

I put in a support call to HP and referenced the older ticket and the repeat of the problem. Support requested a copy of the output of the command show tech all from each switch. I dumped the output and sent it off to the helpful support person. Later that day the support person called back and asked why we were on such a new version of the firmware! So I pointed out that it was their support who gave us the copy of the firmware and told us to run it. At the end of this support call, HP came back with two changes. They would like us to add loop protect on the ports that feed our blade centers. They would also like us to reconfigure both switch2s in each site so that their trunk ports are statically defined instead of auto-detected/dynamic.

So our next maintenance window is this Saturday and we will perform the following changes:

All switches will get:

config
loop-protect mode port
loop-protect a2
write mem
exit

Each switch2 will get (where ? is 1 for site1 and 2 for site2):

config
no interface <port list> lacp
trunk <port list> trk? lacp
write mem
exit
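Since the switch2 commands differ only in the trunk name, the per-site command lists can be rendered from one template. The following is a rough Python sketch of that substitution. The function name `switch2_commands` is made up for illustration, and the port list is left as the same placeholder used above because the real ports are not listed here.

```python
# Commands every switch gets, copied from the list above.
COMMON = [
    "config",
    "loop-protect mode port",
    "loop-protect a2",
    "write mem",
    "exit",
]


def switch2_commands(site: int, port_list: str) -> list:
    """Return the switch2 commands for a site; the trunk is named trk<site>."""
    if site not in (1, 2):
        raise ValueError("site must be 1 or 2")
    return [
        "config",
        f"no interface {port_list} lacp",
        f"trunk {port_list} trk{site} lacp",
        "write mem",
        "exit",
    ]


def render_all(port_list: str = "<port list>") -> str:
    """Render the switch2 command blocks for both sites as plain text."""
    blocks = []
    for site in (1, 2):
        lines = [f"# site{site} switch2"] + switch2_commands(site, port_list)
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)
```

Printing `render_all()` gives one block per site that can be reviewed before the window and pasted in one switch at a time.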

Part 1 | Part 1 follow up | Part 2 | Part 3 | Part 4

** Maintenance Announcement – No service interruption anticipated **

We will be applying a configuration change to our iSCSI switches that support our StoreVirtual SAN. This is the storage network that backs our VMware infrastructure.

We do not anticipate any service interruption. Our switching is redundant and we will change the switches one at a time, so the changes should not interrupt service.

Start: 05/18/2013 10:00 PM

End: 05/18/2013 11:00 PM

If you have questions or concerns about this maintenance, please contact the Shared Infrastructure Group at osu-sig (at) oregonstate.edu or call 737-7SIG.