History

  1. In 2006 Engine Yard built what has become known as a cloud
  2. We built that system on a very tight budget
  3. We chose a SAN system that advertised fully redundant capabilities
  4. We ordered said fully redundant units
  5. We were told by the vendor that the redundant units were unavailable, but coming soon
  6. We updated our commercial offering and technical documentation to show that clusters were *not* redundant across SAN shelves
  7. We achieved high levels of operational excellence
  8. We upgraded to new SAN units from vendor as of our 3rd cluster, EY02
  9. We experienced rapid growth through 2008, adding ey03-ey05
  10. We experienced our first ever data loss on September 3, on one of those new SAN units
  11. We experienced our second ever data loss on October 8, on one of those new SAN units, on the GitHub cluster
  12. We experienced our third ever data loss on October 10, on one of those new SAN units
  13. We have never experienced operational downtime nor data loss on the original SAN units

 

The Original Plan

  1. The performance and reliability of the original units lulled us into a false sense of security about single points of failure, and we assumed that the new units would perform as well as the old ones; past performance is not a reliable indicator of future performance
  2. We’re not holding the vendor responsible for these issues; rather, EY has outgrown their product line
  3. Therefore, we must replace the SAN with one from a major vendor that has no single point of failure
  4. Immediately after the first data loss, we decided to bring in a high end SAN vendor

 

Where We Are Today

We received an email from our current SAN vendor. We posted about receiving this email, but didn’t want to say more until the vendor had demonstrated to us that a fix was in hand. A snippet of this email is included below.

When reading this email, it’s important to note that our SAN vendor does not manufacture its own equipment, but instead uses standard SuperMicro equipment. SuperMicro has become a major OEM and supplier of data center servers by being a reliable manufacturer.

“We have discovered the cause of the multiple disk failures (domino failures) in the SR2461 chassis.”

snip…

“We have confirmed that many of the Engine Yard units that you have are the early backplane version.  Our laboratory tests have demonstrated that domino failures can occur with the early backplane design.  We have also demonstrated that replacing the backplane with the new design rectifies the problem.”

 

Our Decision

We are choosing to retain our current SAN vendor. As noted above, their original units have never experienced the type of reliability problems that the new units have caused. The new units will be upgraded to eliminate the failures that we have experienced.

We do not make this decision lightly. Here are several data points to clarify our concerns and explain how we reached this decision:

  1. The cost of a new SAN system was enormous.
  2. Reconfiguring the current clusters to access the new SAN would require a change of protocol from AoE to iSCSI. This would require substantial changes to the network environment of existing clusters, and iSCSI is known to cost more CPU per I/O. To be clear, the I/O performance of the new system would likely be better than that of our current system; however, higher I/O throughput simply exacerbates the additional CPU-per-I/O overhead of iSCSI.
  3. The new SAN system, while boasting extremely high uptime potential, would be a single point of failure at the scale of a data center outage. Any disruption to the SAN service would take *all* customers in a data center offline, whereas our current system limits SAN-related outages to approximately 20 customers at a time.
  4. Switching to a new SAN vendor would result in a substantial disruption of service for each and every one of our customers, whereas we believe we can implement the fix with minimal disruption of service.

 

    – Tom Mornini, CTO and Founder
