Quote from bali_survivor:
To me it proves that there is something fundamentally wrong with your acceptance testing procedures.
( and the rate that you are pushing out "upgrades" was already indicative that this was more than likely )
While downtime is clearly not acceptable, it's not reasonable to draw conclusions about IB's quality assurance in this way. Looking from the outside, it's just not possible to know the source of the problem. It could lie in any number of areas.
Running high availability systems is a lot more difficult than people realise. In particular, automatic failovers are difficult because there can be failure modes not foreseen in the system design. No matter how much testing happens in a test environment, there can always be something different in a production environment that screws things up.
As an example, I worked on a system for a large telco that was prepared to spend the money to have 24x7 availability. Everything was duplicated with automatic failover: striped and mirrored disk arrays, hot swappable disks, dual LANs, dual redundant power supplies for everything. Failover was well tested. It went down for a day because one of the disk power supplies caught fire - there was *smoke and flames*. Of course the whole thing had to be shut down. It could not reasonably be said to have been anyone's fault or due to incompetence or negligence.
So while IB's customers have every right to demand reliable operation, drawing conclusions about IB's development and operational practices is not really justified because none of us know the facts.
I think in the interest of better customer relations, IB should be providing a more detailed account of what went wrong, and what actions are in place to ensure it doesn't happen again.
Incidentally, I have had no problems with IB today, though last Friday there were some issues.