News and Announcements: Kayako news and announcements

  (#1) Old
Mohit Sharma Offline
Staff
 
 
Posts: 39
Join Date: Aug 2006

Owned License
Hosted Servers Problem - 09-01-2008, 10:57 AM

Hi,

There is sporadic connectivity to all our hosted servers due to a major power failure at Gnax. Please refer to the following URL: http://www.webhostingtalk.com/showthread.php?t=662073

Regards,

Mohit Sharma


Mohit Sharma (mohit.sharma ]at[ kayako.com)
----------------------------------------------------------------
---
   
  (#2) Old
Jamie Edwards Offline
Operations Manager
 
 
Posts: 4,316
Join Date: Jan 2006
Location: UK

SupportSuite
Owned License

09-01-2008, 11:33 AM

Although we have yet to hear from the data centre, connectivity has now resumed and the servers are coming back up. We will investigate the cause as soon as we can get in touch with the data centre.


Jamie Edwards (jamie.edwards ]at[ kayako.com)
----------------------------------------------------------------
---
  • New to the forum? New user's guide here.
  • Submit bug reports here.
  • Submit support tickets via the members area.
  • Submit sales queries either via live chat or via e-mail.
  • There is no official ETA on Version 4.
   
  (#3) Old
Jamie Edwards Offline
Operations Manager
 
 
Posts: 4,316
Join Date: Jan 2006
Location: UK

SupportSuite
Owned License

09-01-2008, 06:02 PM

We are still experiencing problems with servers 12 and 17 - we are waiting to hear from the data centre and won't be leaving until they are back up and functioning.

We apologise for the inconvenience.


Jamie Edwards (jamie.edwards ]at[ kayako.com)
----------------------------------------------------------------
---
  (#4) Old
Mohit Sharma Offline
Staff
 
 
Posts: 39
Join Date: Aug 2006

Owned License
10-01-2008, 07:33 AM

Hi,

All the servers are working fine now. The explanation that we received from the data center is as follows:

-------------SNIP----------------------

At approximately 4:45 am EST, the NAP suffered a power outage lasting
approximately 10 seconds from Georgia Power.

The generators fired and came online 15 seconds after the initial outage and
the load was transferred to generators which ran for 30 minutes while
monitoring the incoming power quality from GA Power at which time the load
was transferred back to utility.

One of the UPSs that serves part of the facility suffered a battery outage
on 2 different redundant strings, which caused it to drop the load.
We installed a second redundant string approximately 9 months ago to
minimize the possibility of this type of situation. The batteries in the 2
strings are set up in parallel, meaning each string is capable of carrying the
full load for up to 5 minutes.

All it takes is 1 battery in a string to fail for the entire string to fail.
This is the same in all UPS systems and is the reason we installed the
second string on advice from the manufacturer.

The original string batteries are 1.5 years old and were installed new. The
second string is 9 months old and was installed new.

A single battery in the second string failed after 3 batteries in the first
string failed.

We turned the generators back on to avoid an interruption during
troubleshooting and maintenance, and MGE sent a tech onsite within an hour to
troubleshoot, at which time we discovered the battery issue. We replaced the
batteries within an hour of diagnosis and brought the system back online
and out of maintenance bypass.

The load is currently protected and all batteries have been tested again.

Both sets of batteries have been maintained and tested by MGE direct service
every 6 months under a pm plan that they recommended for proper maintenance
and operation.

It was extremely rare and unforeseen for something like this to happen.

We are purchasing our own battery tester and will set up a monthly pm on the
batteries that we will conduct ourselves in addition to the 6 month pm that
MGE does on the UPS as well as the batteries. We are also researching a real
time battery monitoring system that can predict battery failure.

Batteries are the weakest link in the system and we feel that we properly
followed recommended engineering and maintenance on these systems. However,
that will not assure 100%, as we found out today in a very rare incident.




Extemporaneous events that continued to affect service during the outage:
One of the main metro e switches that runs the links of our backbone went
offline during the outage, and during that power-induced reboot we lost
connectivity to half our backbones. We have our backbones split in half,
with half going out the east and half out the west side of the building,
taking diverse paths across redundant switches to the final interconnect
points.
The switch was unstable when it came back online due to a GBIC that died, and
for some odd reason it rebooted itself several times, about every 10 minutes.
We replaced the GBIC with a spare we keep onsite.

This caused half the backbones to go up and down and placed a large CPU load
on the different core routers we have due to BGP table loads going on. This
is very CPU intensive, and when you have a lot of up and down it can appear
that the network is completely down (it is if you are on a link that is
flapping), but the fact is that the entire network was not down but was
impacted. This settled down when the switch was stabilized.

We split our backbones up over several different redundant backbone routers.

Once this switch was brought back online and stabilized, the network
stabilized as well.

An access switch that serves 16 servers also died and we replaced it with a
spare once we found the issue. We keep spares on site for every piece of
network gear we have.

An APC that was only 6 months old and is dual fed from 2 different
power sources (including the newer UPS) failed and did not come back. We
replaced it with an onsite spare. It was bizarre to say the least, and of
course it powered one of our 3 main DNS clusters, so we lost DNS capacity for
an hour.

Most of the issues currently going on are related to server hardware that
did not do well in a power reboot situation or needs an fsck. We are actively
working on them and will not rest until all is well.

Many customers in the facility do have A and B feeds from our power. We
offer this through different UPS systems, different power panels and
different transformers. Some very early customers that purchased A and B
feeds when we only had one UPS system at the NAP are on the same UPS and as
such lost power. Those customers will be offered a free move of their B feed
to the newer UPS to increase their power diversity; they simply need to
open a ticket.

What are we doing on power in the future?

We have another UPS from MGE on order as of 4 weeks ago that is due to
deliver in mid Feb, which will increase the diversity of the power in the
facility. We plan on having 2 battery strings on it as well.

We are in the process of installing another set of 5 Cummins generators and
another 3000 amp transformer, which will further diversify our generator and
transformer plant. This will be completed in mid February; construction
is going on currently, and we took delivery of the switchgear and generators
2 weeks ago. 4 UPSs will be moved to the new power feed and generators to
diversify the power source to the UPS. This will give us 100% redundancy on
the A / B feeds at that point.

We installed a redundant B feed to our metro e gear and 2 dual fed APCs at
our TELX cabinet after TELX suffered a complete UPS failure at 56 Marietta 4
months ago. This turned out to be good because there was another complete
failure of the B UPS 4 weeks ago, but we were not affected since we had a
redundant feed from them. The outage affected all customers on the second
floor. We would have lost more than 50% of our network had we not been on dual
fed APCs and dual power feeds at the building, which would have been bad.

We are increasing the battery pm schedule to monthly from biannual.

We are researching a battery monitoring system for the strings.

We will be taking a fuel delivery this week to restock our main fuel supply.

We are examining in depth the abnormalities of one of our 4 core metro
switches this morning, and if we do not find an RFO from the manufacturer we
will be examining replacing it or upgrading to a different, more robust
solution, which has been in our long term plan but may get moved up.

We will be doing another power examination of our core switching routers
(currently 6 of them, all with dual fed power) and our core metro e switches
(currently 4 of them) to make sure that our power feeds are truly redundant
and no legacy circuits are there to affect them.

We will be examining our on site spares inventory to make sure we are still
at correct levels since we used some items this morning.


An access switch died which was replaced. There was a power outage on one
grid. Those servers were powered on manually after the UPS was replaced,
however some servers may have had problems getting booted back up, which
required a tech to go out and go through the boot process manually.


We apologize for the outage caused by the failure of the primary and backup
batteries and will continue to provide the best service at an excellent
price.
The MGE tech that has all the major accounts in Atlanta, including Coca-Cola
and several others, told us that this was a very freak occurrence with
negligible odds of happening, and in his opinion we have done everything
right on our maintenance and pm and redundancy of the batteries, and he would
have done the same thing, and that there was really nothing he would have
recommended differently at that point.

We are still going to make the changes above that I mentioned though.

-------------------------------
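
For anyone who wants to put rough numbers on the battery-string reasoning in the explanation above (one failed battery takes out its whole string, and the two strings sit in parallel), here is a minimal Python sketch. It is only an illustration: the per-battery failure probability and the number of batteries per string are made-up placeholder values, not figures from the data center, and treating battery failures as independent is an assumption made purely for simplicity. The function names are hypothetical as well.

```python
def string_failure_probability(p_battery: float, batteries_per_string: int) -> float:
    """Probability that one string fails.

    A string fails as soon as any single battery in it fails, so it only
    survives if every battery survives (independence is ASSUMED here).
    """
    return 1.0 - (1.0 - p_battery) ** batteries_per_string


def load_drop_probability(p_battery: float, batteries_per_string: int,
                          strings: int = 2) -> float:
    """Probability that the UPS drops the load.

    The strings are in parallel and each can carry the full load, so the
    load is dropped only if every string fails (again ASSUMING independence).
    """
    return string_failure_probability(p_battery, batteries_per_string) ** strings


def bridges_gap(gap_seconds: float, string_runtime_seconds: float) -> bool:
    """True if a healthy string can carry the load until generators are online."""
    return gap_seconds <= string_runtime_seconds


if __name__ == "__main__":
    # Placeholder inputs, NOT taken from the incident report.
    p = 0.001   # hypothetical chance a given battery is bad at the moment of need
    n = 40      # hypothetical number of batteries in one string

    print("one string fails:  ", string_failure_probability(p, n))
    print("both strings fail: ", load_drop_probability(p, n))

    # From the report: generators came online 15 seconds after the outage,
    # and each string can carry the full load for up to 5 minutes.
    print("healthy string bridges the gap:", bridges_gap(15, 5 * 60))
```

The point of the second string shows up in the exponent: under these assumptions the chance of dropping the load is the single-string figure multiplied by itself, which is why a double failure like this one is expected to be rare.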

Regards,

Mohit Sharma


Mohit Sharma (mohit.sharma ]at[ kayako.com)
----------------------------------------------------------------
---
   