8 votes

Cores that don’t count

6 comments

  1. skybrian
    Link
    This is a paper about rare CPU malfunctions seen in data centers that cause silent data corruption. (When you do things at big enough scale, rare problems become common.)

    3 votes
  2. [5]
    teaearlgraycold
    Link
    Investigation fingers a surprising cause: an innocuous change to a low-level library. The change itself was correct, but it caused servers to make heavier use of otherwise rarely-used instructions.

    I'm surprised the "Hardware mitigations" section doesn't mention RISC CPUs as a possible solution. Maybe Google has had these problems in ARM data centers too, but if the problem is that there are untested pathways in a CPU, then a simpler ISA seems like the obvious solution.
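
    To make that concrete, here's a sketch of the kind of screening test the "untested pathways" idea implies (my own illustration in C, not anything from the paper): hammer one instruction path and cross-check it against a trivial reference implementation, so a core that miscomputes a rarely-used instruction gets flagged. The popcount builtin here is just a stand-in for whatever instruction is under suspicion.

    ```c
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Obviously-correct reference: count set bits one at a time. */
    static int popcount_reference(uint64_t x) {
        int n = 0;
        while (x) {
            n += (int)(x & 1);
            x >>= 1;
        }
        return n;
    }

    int main(void) {
        uint64_t x = 0x9e3779b97f4a7c15ULL;  /* arbitrary nonzero seed */
        for (long i = 0; i < 100000000L; i++) {
            /* xorshift64: cheap pseudo-random operands */
            x ^= x << 13;
            x ^= x >> 7;
            x ^= x << 17;
            /* GCC/Clang builtin, typically a single POPCNT instruction */
            if (__builtin_popcountll(x) != popcount_reference(x)) {
                fprintf(stderr, "mismatch at iteration %ld (x=%016llx)\n",
                        i, (unsigned long long)x);
                return EXIT_FAILURE;  /* flag this core as suspect */
            }
        }
        puts("no mismatches observed");
        return EXIT_SUCCESS;
    }
    ```

    A real screener would pin itself to one core at a time and rotate through the fleet, but that's the basic shape.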

    3 votes
    1. [4]
      skybrian
      Link Parent
      They hint that some CPU models are worse than others, without naming names:

      CEEs appear to be an industry-wide problem, not specific to any vendor, but the rate is not uniform across CPU products.

      So this seems to be a rather polite heads-up that something is wrong, worded to avoid offending Google's suppliers. Negotiations with CPU vendors over pricing and delivery of large orders for data centers tend to be politically sensitive, and that's going to be particularly true now with chip shortages.

      Also, while it's true that the Intel instruction set is really complex, instruction decoding is only one part of the chip, and even if the trigger is a rarely-used instruction, the fault itself could be somewhere else entirely.

      It seems like having an open-source design would allow for better outside research into the causes, though:

      But because we have limited knowledge of the detailed underlying hardware, and no access to the hardware-supported test structures available to chip makers, we cannot infer much about root causes. Even worse, we cannot always detect bad computations immediately.
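
      Software can still defend itself a little, though. One cheap pattern (a sketch of my own, not a mitigation the paper prescribes) is to verify a result against its defining identity at the call site, so a miscomputed value fails loudly instead of being silently stored:

      ```c
      #include <stdint.h>
      #include <stdio.h>
      #include <stdlib.h>

      /* Divide, then recompute the identity a == q*b + r with r < b.
       * On a healthy core the error branch is dead code, so the cost
       * is one multiply and two compares per division. */
      static uint64_t checked_div(uint64_t a, uint64_t b, uint64_t *rem) {
          if (b == 0) abort();  /* caller bug, not a hardware error */
          uint64_t q = a / b;
          uint64_t r = a % b;
          if (q * b + r != a || r >= b) {
              fprintf(stderr, "silent corruption? %llu / %llu gave q=%llu r=%llu\n",
                      (unsigned long long)a, (unsigned long long)b,
                      (unsigned long long)q, (unsigned long long)r);
              abort();  /* fail loudly rather than store a bad value */
          }
          *rem = r;
          return q;
      }

      int main(void) {
          uint64_t r;
          uint64_t q = checked_div(1000003, 97, &r);  /* 1000003 = 97*10309 + 30 */
          printf("q=%llu r=%llu\n", (unsigned long long)q, (unsigned long long)r);
          return 0;
      }
      ```

      The obvious caveat is that the check runs on the same possibly-bad core as the division, so it narrows the window rather than closing it.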

      4 votes
      1. [3]
        teaearlgraycold
        Link Parent
        I hope they go for an open-source CPU. It would be an M1-level event if Google spent big on RISC-V.

        2 votes
        1. [2]
          skybrian
          Link Parent
          Google has already made some processors for machine learning. (By which I mean designed them; they probably don't have a fab.) It's still early for RISC-V, so if they did a CPU at all, I'm guessing they would be more likely to make a non-open-source ARM chip, similar to Apple. For a tech giant there isn't all that much benefit to sharing a design publicly compared to being able to share it internally (including with any academics they want).

          This doesn't happen overnight. Apple made several mobile phone chips before they did the M1.

          1 vote