ClockGate 2017 – The Intel Atom C2000

The pieces are coming together in “ClockGate”, and it would appear that Intel, the world’s largest CPU manufacturer, is at the centre of the mess. According to The Register, and while not confirmed, Intel’s C2000 processor has a fault that will cause device bricking, but nobody is talking. My own investigation of a cross section of affected equipment from various manufacturers confirms that they all have this same Intel C2000 processor. After Intel’s comments to The Register, I think the culprit has been found.

Who is affected

The first to open up about it was Cisco, admitting to problems with everything from ISR 4Ks, NCS optical gear, some ASA 5500 series firewalls and a few Nexus 9K fabric modules to both the MS350 switch and MX84 firewall from Meraki. I was going to write about it then, but wanted to figure out what is actually going down here.

Cisco is not alone. Dell is also affected, and users of Synology storage devices have been talking about it. HP, NEC, NetGear, SuperMicro: the list goes on and on.

Other products named so far include the HP MoonShot M300/M350, Dell FX, Seagate home NAS products and Netgate pfSense appliances.

I applaud Cisco for being first out of the gate to say “We have a problem, and we are fixing it”. Many vendors would sit around and figure out how they can sweep this under the rug, but Cisco is getting out in front of it.

The list of who is affected is growing – hourly.
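
If you have shell access to a Linux-based box and want a first-pass check of whether it is built on this chip, something like the sketch below may help. It assumes that C2000-series parts show up in /proc/cpuinfo with a model name along the lines of “Intel(R) Atom(TM) CPU  C2758  @ 2.40GHz”. The function name and the pattern are my own illustration, not anything published by Intel or the vendors, and many of the affected appliances will not let you run a script at all.

```python
from __future__ import annotations

import re
from pathlib import Path

# Assumption: C2000-series parts report a model name such as
# "Intel(R) Atom(TM) CPU  C2758  @ 2.40GHz" in /proc/cpuinfo.
C2000_PATTERN = re.compile(r"Atom\(TM\)\s+CPU\s+(C2\d{3})")


def find_c2000_models(cpuinfo_text: str) -> set[str]:
    """Return the C2xxx model numbers found in /proc/cpuinfo-style text."""
    models = set()
    for line in cpuinfo_text.splitlines():
        # Only the "model name" lines carry the marketing name of the CPU.
        if line.lower().startswith("model name"):
            match = C2000_PATTERN.search(line)
            if match:
                models.add(match.group(1))
    return models


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    text = cpuinfo.read_text() if cpuinfo.exists() else ""
    found = find_c2000_models(text)
    if found:
        print("Atom C2000-series CPU detected:", ", ".join(sorted(found)))
    else:
        print("No Atom C2000-series CPU found in /proc/cpuinfo")
```

A match only tells you the box uses a C2000-series part. It says nothing about whether that particular unit will fail, so treat it as a prompt to check your vendor’s advisory rather than a diagnosis.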

The Cone of Silence

Nobody is talking. Cisco is refusing to name the vendor, and Intel is refusing to name the product manufacturers, but the writing is clearly on the wall. Dell also isn’t talking, and when we reached out to some of our contacts, several vendors (including Cisco) did not respond.

The silence is not that much of a surprise. Intel is a huge partner with everyone involved: without Intel, these companies have no products, and without products, Intel isn’t selling silicon. So everyone is protecting everyone.

Cisco is at the table with a plan for how to replace the affected devices; others are still quiet.

What caused this?

This little guy: the Intel Atom C2000. It was designed to bring power and scale to smaller footprints for intelligent system applications and systems on a chip, and as a processor for use with DPDK, the Data Plane Development Kit, which improves packet processing speeds.

For the C2000 series, Intel has issued an erratum, AVR54, that basically states that “System May Experience Inability to Boot or May Cease Operation,” because the clock outputs on the chip simply stop functioning. Apparently this is occurring because Intel didn’t think people would use this SoC constantly, and as a result the clock output is failing.

You need a clock. Without it, CPUs lose touch with the rest of the system, including things like the BIOS and bus-connected devices. So once this clock signal fails, your system will not even boot up.

The statement is not really acceptable. You sold it for DPDK and as a scalable IoT processor, and yet in your own words (via The Register) the problem is a “degradation of a circuit element under high use conditions at a rate higher than Intel’s quality goals after multiple years of service.”

How do we fix this?

Intel is issuing a new stepping for the Atom C2000 and has to fix this in silicon, which is a pretty expensive fix. Some kind of board-level repair might be possible, but we cannot find details right now.

If you have Cisco SmartNet with on-site support they will send someone to replace it, but that is not a magic bullet, because someone has to arrange and co-ordinate it all. Partners will have to be involved, so who will pay for all these services?

In a discussion with CRN Magazine, Jennifer Ho, Manager of Cisco’s Business Critical Communications, said: “Unfortunately, because our funding is focused on providing the products, we are unable to reimburse for on-site services to replace the affected devices. Customers may have field engineering service as an option for their services contract, in which case the field engineering support would be included with the replacement.”

Cisco is clear – they are only paying for product.

There is also a delay: with so many people asking for replacements, rationing of replacement hardware is already occurring.

Justin’s Thoughts….

This is one of the largest fiascos since CapacitorGate, when one guy stole a faulty capacitor formula and gave it to another company, which sold it to tons of motherboard manufacturers. I then found myself replacing caps on motherboards in my house, along with millions of others.

I’m pretty happy with Cisco on this one (yeah, bring on the “you’re a Cisco fan boy” comments), but the evidence is clear: they were first out in front of it and didn’t try to blame someone else. They are just out there to fix it.

The big problem is who is going to pay for all this work. Cisco has said they will not.

This is a pretty big hit, and these types of things need to stop. IoT devices with faulty ANYTHING can spell disaster and be potentially dangerous. Just think if an electric car was powered by this chip, and one day the computer didn’t start up, or failed while driving. Think of an oil rig with a drill being controlled by a chip like this.

Right now nobody is really being hurt with this one – but it makes me worry about things to come in the IoT market with failures like this.

5 thoughts on “ClockGate 2017 – The Intel Atom C2000”

  1. I appreciate your blog post about this Justin. I was having a discussion with my integration engineer about this very issue yesterday. At the time, I had not seen anything from Cisco indicating they would reimburse any customer for PS services associated with replacing an afflicted device. And we were questioning how that part of the issue will be handled. I’m not saying what you wrote is the gospel, but Cisco (despite doing the right thing and jumping out in front of this issue) will likely take a public relations hit from smaller customers with regard to not covering some or all of the PS efforts to replace the equipment. They already have a reputation for being the “Cadillac” when it comes to pricing of hw/sw…this won’t particularly help.

    That said, this won’t change my positioning of their products, it will just present a bit of a challenge with customer/product messaging when encountering a savvy IT person on the customer side. 😛

    -Eric

  2. I don’t think Cisco is doing as good a job as you think. I have 3 ISR 4331 that they told me should be affected by this.
    However, since they have not had a problem yet, they do not want to replace them, and yes, they do have 24x7x4 SmartNet.

  3. I just got off the phone with our Dell rep and was forwarded this information…
    Hope this helps some people out!

    Products possibly affected are as follows:

    a. S3048-ON
    b. S4048-ON
    c. S4048T-ON
    d. S6010-ON
    e. S6100-ON (Chassis only. No impact on modules)
    f. Z9100-ON
    g. C9010 (both RPM card and line card are impacted)
    h. N and X series switches do not currently fall under the scope of this failure.
