Dear Engine Yard customer,

We originally planned a large SAN system upgrade this weekend on our shared clusters, but we’ve heard from many of you in the past few days and we’ve decided to change our plans.

We’ve heard that:

1) We did not give enough notice for the scheduled SAN backplane replacement.

2) The scheduled maintenance was going to occur too early in the day.

3) The next few weeks of December represent an important time for many of your businesses.

4) You’d prefer to continue as-is through December with a small risk of SAN problems, instead of having a long maintenance occur right now.

So, we are going to wait until January to perform the maintenance on all shared clusters.

In the meantime, our staff has enacted a plan, with the assistance of our SAN vendor, to reduce the risk of catastrophic disk failure. In the event of a disk failure, we will take the unit with the failed disk offline and replace it individually. This should result in less than 30 minutes of downtime per incident.

We want to convey that although the additional steps we have taken achieve a reasonable risk reduction, no step short of total replacement eliminates the risk entirely, and total replacement is what we need to do in January.

Once we confirm a new maintenance period with our SAN vendor, our staff will alert you at least 30 days before the date of the rescheduled maintenance. At this time we expect the date to be in late January. (Engine Yard partners will receive additional advance notice.)

I want to personally thank everyone who communicated concerns with our team over the last few days.  We hear you loud and clear!

Thanks again for your support on this issue — we are anxious to see this SAN problem behind us forever and it’s now just a matter of scheduling the right time to replace the backplanes on the affected SAN units.

Sincerely,

–Taylor Weibley

Dir. of Support

We’re going to be doing the big CoRAID backplane swap this coming weekend. This is a big maintenance because we’ll be swapping around a lot of hardware.

I expect all customers will be notified about the maintenance windows before the end of the day, if they haven’t been already.

Time zone for the hours below is PST.

Sacramento 12/12 through 12/14

xc88 – 12/12 @ 18:00 to 21:00

ey02 – 12/12 @ 21:00 to 0:00

sa00 – 12/13 @ 0:00 to 2:00

ey01 – 12/13 @ 18:00 to 21:00

nr00 – 12/13 @ 21:00 to 23:30

ey05 – 12/14 @ 18:00 to 21:00

hc00 – 12/14 @ 21:00 to 23:30

New Jersey 12/16 through 12/17

ey03 – 12/16 @ 18:00 to 21:00

ey04 – 12/16 @ 21:00 to 0:00
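For anyone who wants these windows in another time zone, here is a minimal sketch that converts the schedule above to UTC and computes each window's length. It assumes that "PST" means a fixed UTC-8 offset, that the year is 2008, and that an end time of 0:00 means midnight at the end of the listed date. The names `WINDOWS`, `parse_window`, and `schedule_utc` are made up for this illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumptions (not stated in the schedule itself): PST is a fixed UTC-8
# offset and the year is 2008.
PST = timezone(timedelta(hours=-8), "PST")
YEAR = 2008

# (cluster, date, start, end) copied from the schedule above.
WINDOWS = [
    ("xc88", "12/12", "18:00", "21:00"),
    ("ey02", "12/12", "21:00", "0:00"),
    ("sa00", "12/13", "0:00", "2:00"),
    ("ey01", "12/13", "18:00", "21:00"),
    ("nr00", "12/13", "21:00", "23:30"),
    ("ey05", "12/14", "18:00", "21:00"),
    ("hc00", "12/14", "21:00", "23:30"),
    ("ey03", "12/16", "18:00", "21:00"),
    ("ey04", "12/16", "21:00", "0:00"),
]


def parse_window(date, start, end):
    """Return (begin, finish) as timezone-aware datetimes in PST."""
    month, day = (int(part) for part in date.split("/"))

    def at(hhmm):
        hour, minute = (int(part) for part in hhmm.split(":"))
        return datetime(YEAR, month, day, hour, minute, tzinfo=PST)

    begin, finish = at(start), at(end)
    # An end time that is not after the start (e.g. "0:00") rolls over
    # to the next calendar day.
    if finish <= begin:
        finish += timedelta(days=1)
    return begin, finish


def schedule_utc():
    """Return (cluster, begin_utc, finish_utc, duration) for each window."""
    rows = []
    for cluster, date, start, end in WINDOWS:
        begin, finish = parse_window(date, start, end)
        rows.append((cluster,
                     begin.astimezone(timezone.utc),
                     finish.astimezone(timezone.utc),
                     finish - begin))
    return rows


if __name__ == "__main__":
    for cluster, begin, finish, duration in schedule_utc():
        print(f"{cluster}: {begin:%Y-%m-%d %H:%M} to {finish:%Y-%m-%d %H:%M} UTC ({duration})")
```

Swapping `timezone.utc` for another `tzinfo` object gives the same table in a different local time.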

We’ve tested and validated the CoRAID SR2461 backplane update on our dev cluster. Everything went well. The procedure is well documented & works. We’ve also completed an initial pass at our internal documentation for implementing the update.

We’re working on the logistics for the deployment of this update to all affected clusters over the next few days and should have a solid replacement plan soon.

Thanks!

-Edward Muller

History

  1. In 2006 Engine Yard built what has become known as a cloud
  2. We built that system on a very tight budget
  3. We chose a SAN system that advertised fully redundant capabilities
  4. We ordered said fully redundant units
  5. We were told by the vendor that the redundant units were unavailable, but coming soon
  6. We updated our commercial offering and technical documentation to show that clusters were *not* redundant across SAN shelves
  7. We achieved high levels of operational excellence
  8. We upgraded to new SAN units from the vendor as of our 3rd cluster, EY02
  9. We experienced rapid growth through 2008, adding ey03-ey05
  10. We experienced our first ever data loss on September 3, on one of those new SAN units
  11. We experienced our second ever data loss on October 8, on one of those new SAN units, on the GitHub cluster
  12. We experienced our third ever data loss on October 10, on one of those new SAN units
  13. We have never experienced operational downtime nor data loss on the original SAN units

 

The Original Plan

  1. It’s obvious that the performance and reliability of the original units lulled us into a false sense of security regarding single points of failure, and into assuming that new units would perform as well as old units; past performance is not a reliable indicator of future performance
  2. We’re not implicating the vendor as responsible for these issues; instead, EY has outgrown their product line
  3. Therefore, we must replace the SAN with one from a major vendor that has no single point of failure
  4. Immediately after the first data loss, we decided to bring in a high end SAN vendor

 

Where We Are Today

We received an email from our current SAN vendor. We posted about the receipt of this email, but didn’t want to say more until the vendor had demonstrated to us that a fix was in hand. A snippet of this email is included below.

When reading this email, it’s important to note that our SAN vendor does not manufacture its own equipment, but instead uses standard SuperMicro equipment. SuperMicro has become a major OEM and supplier of data center servers by being a reliable manufacturer.

“We have discovered the cause of the multiple disk failures (domino failures) in the SR2461 chassis.”

snip…

“We have confirmed that many of the Engine Yard units that you have are the early backplane version.  Our laboratory tests have demonstrated that domino failures can occur with the early backplane design.  We have also demonstrated that replacing the backplane with the new design rectifies the problem.”

 

Our Decision

We are choosing to retain our current SAN vendor. As noted above, their original units have never experienced the type of reliability problems that the new units have caused. The new units will be upgraded to eliminate the failures that we have experienced.

We do not make this decision lightly. Here are several data points to clarify our concerns and explain how we reached this decision:

  1. The cost of a new SAN system was enormous.
  2. Reconfiguring the current clusters to access the new SAN would require a change of protocol from AoE to iSCSI. This would require substantial changes to the network environment of existing clusters, and would carry a known CPU performance penalty per I/O. To be clear, the I/O performance of the new system would likely be better than that of our current system; however, more I/O simply multiplies the additional CPU per I/O overhead of iSCSI.
  3. The new SAN system, while boasting extremely high uptime potential, would be a single point of failure at the scale of a data center outage. Any disruption to the SAN service would take *all* customers in a data center offline, whereas our current system limits SAN-related outages to approximately 20 customers at a time.
  4. Switching to a new SAN vendor would result in a substantial disruption of service for each and every one of our customers, whereas we believe we can implement the fix with minimal disruption of service.

 

    – Tom Mornini, CTO and Founder

    We’re waiting for a set of backplanes to ship from CoRAID so we can test them in our dev cluster and finalize the rollout plan. We’re expecting the test backplanes to arrive on Monday, November 24, 2008.

    For more insight into the issues related to the CoRAID backplanes, here is an email from CoRAID describing the discovery and the issues related to the fix.

    Hello Lee and all,
    
    Recreating and identifying the cause of the multi-disk failure was not an
    easy task.  Our initial testing of the chassis when it was first introduced
    did not expose any problems.  Units were shipped to Engine Yard and other
    customers without reported problems.  Our disk failure testing did not
    expose any problems.  But it should be noted that the disk used by Coraid in
    our tests were not the same model and make as Engine Yard is using.
    
    In our search to locate the source of multi-disk failures in the SR2461
    chassis, we tested many different theories; including firmware code bugs,
    disk temperature, disk vibration and backplane power.  All of these tests
    were unsuccessful in exposing the problem until we discovered that early
    SR2461 chassis had been shipped with a different disk backplane.  We then
    discovered that our chassis supplier had updated the backplane without
    notifying us of the change.  Our purchasing agreement with our vendor
    requires them to share ECO changes when they are made, but for this
    particular change a mistake was made and it was not reported to Coraid.
    
    After discovering the backplane change had occurred, we surveyed our test
    systems and found that they all had the newer versions of backplane.  We
    located an early backplane version and installed it in our lab test
    environment.  When power stress testing was done on a full complement of
    disks, we were able to repeatedly replicate the multi disk failure case that
    Engine Yard has experienced.  When the backplane was replaced with the new
    version, our power stress testing works without problem.
    
    To insure we are satisfied with the test results, we purchased 24 of the
    drives you use and repeated our stress tests without seeing any problems with the new version backplane.
    
    Disk drive power consumption varies based upon operating load, manufacture
    and model.  The power problem associated with the early backplane was caused
    by marginal power distribution on the backplane.  The new design backplane
    has corrected this power distribution problem, at least with all the disk
    drives we have tested.
    
    Coraid has obtained replacement backplanes for all SR2461's with the early
    backplane version.  We are ready to start the backplane change out as soon
    as Engine Yard has completed its survey to locate the affected chassis.  We
    will do our best to work with Engine Yard's operational schedules to
    expedite this field change.
    
    Coraid will do all we can to remedy this problem.  Please advise how you
    would like to proceed.
    
    My apologies,
    
    Jim

    -Edward Muller

    CoRAID, our current SAN vendor, has been working internally to determine the reasons for our multiple disk failures. After a discussion with one of their suppliers, they determined that about 50 of our units had their backplanes recalled by the supplier, but CoRAID wasn’t notified.

    We’re preparing a plan to replace those units and repair the backplanes. I’ll have more information about that plan during the coming week.

    On the replacement SAN front, we’ve finalized the need for Fiber Channel to the db class machines. We’re working with vendors to rectify configurations, and we’re adjusting our TCO calculations. We have more discussions planned for the coming week on how that will affect our customers.

    I’ll also be on the Weekly Customer Conference call to field questions.

    Thank You,

    Edward Muller

    We’ve been working on migration strategies and TCO numbers for the new SAN purchase.

    I’ve also generated some base TCO numbers that I can build a more complete model from. Those TCO numbers are based on some theoretical growth numbers. It’s interesting to see how each vendor’s options stack up.

    The technical team has spent more time with the vendors discussing solutions. We’ve tentatively decided that we like Fiber Channel, at least for the new Db Tier machines. I’m waiting for quotes from the vendors so I know how that affects the costs. 

    So why do we like FC?

    • We can’t get the throughput with iSCSI that we want, at least at a sane, and same, physical connection count.
    • We’ll have a ton of extra cables in the rack, leading to reduced air flow and additional management overhead, more switch ports, etc.
    • If we go to denser servers, which we’re looking at, we’ll want even more ports, or 10GbE.
    • Power on 10GbE is something like 33 W per port. I’m still waiting to see what the per port power on 4 Gb/s FC is like.
    • 10GbE copper switches won’t be out for a few months yet. And, even though I trust Extreme Networks a LOT, I’m not sure I want a generation 1 product at the core of the network.
    • 10GbE will also cost a lot.
    • We can run fiber longer, with less delay.
    • FC is apparently more mature than iSCSI.
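To make the connection-count point concrete, here is a rough back-of-the-envelope sketch. It assumes iSCSI would run over 1 Gb/s Ethernet links, uses nominal line rates with no protocol or encoding overhead, and picks a purely hypothetical per-server throughput target of 8 Gb/s. Only the 33 W per 10GbE port figure comes from the list above; `links_needed` and `port_power_watts` are names invented for this example.

```python
import math


def links_needed(target_gbps, link_gbps):
    """Number of physical links needed to reach a target throughput.

    Uses nominal line rates only; real throughput is lower once protocol
    and encoding overhead are taken into account.
    """
    return math.ceil(target_gbps / link_gbps)


def port_power_watts(ports, watts_per_port=33):
    """Total port power, defaulting to the rough 33 W per 10GbE port figure."""
    return ports * watts_per_port


if __name__ == "__main__":
    TARGET_GBPS = 8  # hypothetical per-server target, not a real requirement

    # Assumed nominal link speeds in Gb/s.
    options = [("iSCSI over 1GbE", 1), ("4 Gb/s FC", 4), ("10GbE", 10)]
    for name, speed in options:
        count = links_needed(TARGET_GBPS, speed)
        print(f"{name}: {count} link(s) per server")

    ten_gbe_ports = links_needed(TARGET_GBPS, 10)
    print(f"10GbE port power: about {port_power_watts(ten_gbe_ports)} W per server")
```

Under those assumptions the 1 Gb/s option needs several times as many cables and switch ports per server, which is the air flow and management problem described above.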

    FC Downside?

    • It costs a bunch more.
    • We have a minimal set of FC skills in house, at least when compared to IP, SCSI & Ethernet.
    • Additional switching equipment.
    • Additional management overhead of those switches.

    We’re still determining if we need it everywhere. I think we need it in the Db Tier though.

    -Edward Muller

    I’ve eliminated one vendor, who I still need to call (where did the time go?). They offered a great product, but we just weren’t comfortable enough with it.

    Sunil Pareenja, our VP of Finance, has been working on the legal and financial aspects of the deal since Tuesday. Agreements are getting hammered out, and product features and TCOs are being normalized so we have a more apples-to-apples comparison of the remaining products.

    To incorporate a SAN and a new DB tier, I spent a lot of time on Wednesday discussing architectural changes to our existing cluster infrastructure with the rest of the technical team. The interesting news out of the conversation is that we didn’t need to change much. Our clusters are already designed to scale fairly large. We ended up changing a few things in the cluster design to make the clusters a little easier for the Cluster Support Engineers to support, but it’s not a radical re-design. I’ll be posting more soon.

    The technical team met with 2 of the 3 remaining SAN vendors today. We’re meeting with the third one on Friday.

    Special thanks to Extreme Networks for sending one of their Network Engineers up to Sacramento for 2 days to work with us. I think he’s acquired more SAN knowledge than he ever wanted to.

    I also had lunch with Hitachi Data Systems today. They claim to have a DMX killer. It looks compelling and I’m taking a further look. I’ve asked for a rough quote and some technical documentation to read over the weekend. I don’t know if we’ll include them this late in the game, though; it will have to be VERY compelling.

    And I’m also closing off a side conversation I’ve been having with LeftHand Networks, which was recently acquired by HP. They offer an interesting product and they have a downloadable VMware storage appliance to test out their software. It’s pretty slick, but my gut tells me they may be a little too far ahead of the curve for us atm. If this project were running slower, I’d love to have tested their software beyond my laptop.

    -Edward Muller

    P.S. A few pics once I sync up my camera.

    I received calls today from both Hitachi Data Systems and Fujitsu Computer Systems. They apparently read the blog. Hello HDS and FCS!

    EMC came back today and we disqualified the Clariion line. We’ve asked for a DMX, aka Symmetrix, quote which they will provide tomorrow.

    In the meantime we’re crunching through the other vendors’ proposals.

    We’re also in the process of flying Lee Jensen (Cluster Engineering), Dan Peterson (Systems Engineering) & Tyler Poland (DBA) into our CA data center to work through the implementation details along with me (Edward Muller), our CTO, Tom Mornini, & our Systems Architect, Jayson Vantuyl. We’ll also be pulling in Storage Architects from our chosen SAN vendor as well as a Network Engineer from our switch vendor (Extreme Networks). More on that as things come together.

    -Edward Muller