26 votes

Time is not a synchronization primitive

12 comments

  1. [2]
    JRandomHacker
    Link

    This is one of those things that I think every software dev learns the hard way, and it's nice to see Xe spell it out so cleanly. I think the best summary of "why you shouldn't use time-delays to synchronize things" is the line:

    Waiting one second does not make the service ready. The service being ready makes it ready.
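
    A minimal Go sketch of what that looks like in practice: instead of sleeping, probe for the condition you actually care about, with a deadline. (The address, port, and timeouts here are invented for illustration, not taken from the article.)

    package main

    import (
        "fmt"
        "net"
        "time"
    )

    // waitReady polls addr until it accepts a TCP connection or the
    // deadline passes, instead of sleeping for a fixed amount of time.
    func waitReady(addr string, timeout time.Duration) error {
        deadline := time.Now().Add(timeout)
        for time.Now().Before(deadline) {
            conn, err := net.DialTimeout("tcp", addr, 250*time.Millisecond)
            if err == nil {
                conn.Close()
                return nil // the service is actually ready
            }
            time.Sleep(50 * time.Millisecond) // short pause between probes
        }
        return fmt.Errorf("service at %s not ready within %s", addr, timeout)
    }

    func main() {
        if err := waitReady("127.0.0.1:8080", 5*time.Second); err != nil {
            fmt.Println(err)
        }
    }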

    12 votes
    1. automaton
      Link Parent

      Wait so you mean all those

      // It's probably not ready yet.
      window.setTimeout(() => work(), 1000);
      

      I have in my code are wrong?!

      1 vote
  2. [4]
    Shimmer
    Link

    Calling sleep() is almost always the wrong thing to do in any serious concurrent program.

    9 votes
    1. [3]
      VoidSage
      Link Parent

      I think it's okay in specific cases. An example would be polling: make an API call, wait 5 seconds, do it again, etc.

      To be clear, I don't think polling is always the right thing to do, but sometimes it is unavoidable.

      8 votes
      1. [2]
        Shimmer
        Link Parent

        Polling doesn't make a program concurrent. All the API calls are in a single thread, and the polling happens in a loop:

        do
            sleep 5 seconds
            poll server
        until server is done
        
        1. VoidSage
          Link Parent

          Agreed, but polling can exist in a concurrent program. In the context of your example, I could kick off a separate thread to poll for data updates.

          You could also make multiple API calls within the polling loop, each on a separate thread, as long as you block the polling thread until they all complete.
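
          A rough Go sketch of that shape (the endpoints, interval, and callAPI are all placeholders): one goroutine owns the polling loop, each round fans the calls out onto their own goroutines, and a WaitGroup blocks the polling goroutine until they all finish before it sleeps and goes around again.

          package main

          import (
              "fmt"
              "sync"
              "time"
          )

          // pollLoop runs on its own goroutine. Each round it fans the
          // API calls out onto separate goroutines, then blocks until
          // they all complete before sleeping and polling again.
          func pollLoop(endpoints []string, interval time.Duration, done <-chan struct{}) {
              for {
                  var wg sync.WaitGroup
                  for _, ep := range endpoints {
                      wg.Add(1)
                      go func(ep string) {
                          defer wg.Done()
                          callAPI(ep) // stand-in for the real request
                      }(ep)
                  }
                  wg.Wait() // block here until every call is done

                  select {
                  case <-done:
                      return
                  case <-time.After(interval):
                  }
              }
          }

          // callAPI is a placeholder for whatever each poll fetches.
          func callAPI(endpoint string) {
              fmt.Println("polling", endpoint)
          }

          func main() {
              done := make(chan struct{})
              go pollLoop([]string{"/updates", "/status"}, 5*time.Second, done)
              time.Sleep(12 * time.Second) // demo only: let a couple of rounds run
              close(done)
          }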

  3. spit-evil-olive-tips
    Link

    one of the worst codebases I've ever worked on in my career was the <decent chance you've heard of it> service at <company you've definitely heard of>

    it had no unit tests, only integration tests that depended on a local MySQL instance.

    those tests started out as a really cool way of doing integration tests. in production, the service was a distributed system made up of multiple microservices running on different hosts. for these integration tests, all those services were started up within a single JVM, with an in-memory message bus substituted for the network-based message bus used for real deployments.

    so in theory, you can run the entire system locally on your dev box, and write tests against its API as if it were the real thing.

    in practice...lots of operations run by this system were eventually consistent or asynchronous. over time, the tests became littered with sleep(30) calls. some operation takes maybe 5-10 seconds under normal circumstances, so someone would just toss in a 30-second sleep and assume it'd be done by then.

    unsurprisingly, this led to tests that were notoriously unreliable. if the operation took longer than expected, or failed entirely, the test would just continue running. rather than fail right away and say "timed out waiting for X to complete", it would fail at some later point, with an error message / assertion failure that may or may not make it obvious that X not being complete was the cause of the problem.

    the full test suite also took an annoyingly long time to run (2-3 hours) because of how much time was spent sleeping unnecessarily, with delays longer than they needed to be. with this sort of hardcoded sleep, you only have two kinds of delay: too short or too long. and you're going to err on the side of too-long sleeps because of how much trouble a too-short sleep will cause.

    this created a "worst of both worlds" sort of thing, where the test suite took long enough to run that you would kick it off and then context-switch to something else. but, a few portions of the test suite were still performance-sensitive, so that the other thing you were doing might cause test failures if it was too resource-intensive.

    oh, did I mention that having these tests pass was, in theory, a required step before posting a code review that would commit changes to the main branch?

    some actual advice on that team, that I overheard a senior dev giving a more junior dev, is that they should run the test suite twice in a row. they should note the failures in each test run, and as long as a different set of test cases failed in each run, their changes "passed" the test suite and could be posted for code review and then merged to the main branch.
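
    for what it's worth, a sketch of the usual alternative to those hardcoded sleeps: poll the condition with a deadline, so the wait ends the moment the operation finishes and a timeout fails immediately with a message naming what never happened. (the helper and names here are illustrative, not from that codebase.)

    package tests

    import (
        "testing"
        "time"
    )

    // waitFor polls cond until it returns true or timeout elapses. it
    // returns as soon as the condition holds, so no time is wasted, and
    // on timeout it fails right away with a message that says what it
    // was waiting for.
    func waitFor(t *testing.T, what string, timeout time.Duration, cond func() bool) {
        t.Helper()
        deadline := time.Now().Add(timeout)
        for time.Now().Before(deadline) {
            if cond() {
                return
            }
            time.Sleep(100 * time.Millisecond)
        }
        t.Fatalf("timed out waiting for %s to complete (after %s)", what, timeout)
    }

    a call site would look like waitFor(t, "X", 30*time.Second, func() bool { return operationDone() }), with operationDone standing in for whatever check the test actually needs.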

    6 votes
  4. [3]
    skybrian
    Link

    The tricky bit for testing is that you will always have performance concerns, and those are all about how much time passed compared to how much time you expected to pass. This is a fuzzy concern. If it often takes one second for a service to become ready, you might want to investigate that? This makes setting a low timeout seem like the right thing to do.

    But let's zoom out a bit. What are tests for? A test suite is a notification service that alerts you to potential problems. We can judge it by how good the notifications are.

    Flaky timeout failures in a test suite result in a terrible notification service that sends you random alerts that something might be wrong, without doing much to help you figure out what it is. Maybe it's a problem in production code, but more likely the server running the tests was busier than usual. Not now, test suite!

    Notifications about potential performance slowdowns are useful but the notification service needs to be designed differently.

    I think it would be useful to write in a test case that you expect a service to take less than a second to start up. When that regularly doesn't happen, some kind of warning makes sense, but there should probably be some other dashboard for it. This falls under performance monitoring for test suites.

    When a test fails entirely, seeing some warnings about how long things took versus how long they were expected to take might be helpful diagnostics. Also, if the test is being run interactively, a developer is waiting on it. If they see some warnings about blown performance expectations, then they might decide it's never going to finish.

    Sometimes that's not true. If you waited long enough, it would finish. I've sometimes gone to lunch and when I got back, the test actually did finish successfully, far later than expected, and that was an important clue that it's not a correctness problem, it's a performance problem. If you stop the test early then you don't get that clue.
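
    Here's a sketch of what that split might look like in a Go test (the address and both thresholds are invented): wait for readiness with a generous hard limit, and when the soft expectation is blown, log a warning for some other dashboard to pick up instead of failing.

    package tests

    import (
        "net"
        "testing"
        "time"
    )

    func TestServiceStartup(t *testing.T) {
        const (
            expected = 1 * time.Second  // soft expectation: worth a warning
            hardMax  = 30 * time.Second // hard limit: an actual failure
        )
        addr := "127.0.0.1:8080" // hypothetical service under test
        start := time.Now()
        for {
            conn, err := net.DialTimeout("tcp", addr, 250*time.Millisecond)
            if err == nil {
                conn.Close()
                break
            }
            if time.Since(start) > hardMax {
                t.Fatalf("timed out waiting for %s to accept connections", addr)
            }
            time.Sleep(50 * time.Millisecond)
        }
        if elapsed := time.Since(start); elapsed > expected {
            // Not a failure: log it so a dashboard or log scraper can
            // track startup-time regressions across runs.
            t.Logf("warning: startup took %s, expected under %s", elapsed, expected)
        }
    }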

    5 votes
    1. [2]
      blueshiftlabs
      Link Parent

      You need some sort of timeout for the overall test, though, to detect the case where things have gotten permanently wedged. A test can't tell the difference between "this is making progress, but very slowly" and "this isn't making progress at all."

      2 votes
      1. skybrian
        Link Parent

        Absolutely. The occasional vague hint you might get by waiting longer has a high cost and there are diminishing returns.

        This makes setting timeouts tricky.

        1 vote
  5. Moonchild
    Link

    As I said on the red website:

    Absolutely true as stated, but there is another sense in which time can be used as a synchronisation primitive. To wit, if a sufficiently high-precision, globally synchronised clock is available, then its timestamps can be used to latently order events. In effect, what’s needed is that, for events A and B, happens-before(A, B) implies timestamp(A) < timestamp(B); hence, the precision of the clock has to be greater than the communication delay. (Concurrent events might have the same timestamp, or different timestamps; it doesn’t matter, as semantics are preserved in either event, though you do need some deterministic tiebreaker in the former case.)

    Of course, actually realising such a clock is impossible, but that hasn’t stopped anybody. Many intel cpus feature an ‘invariant tsc’—a clock which is guaranteed to be synchronised across all cores in a system (including across sockets, in a multi-socket system), and which runs quite fast (3ghz, on my cpu), so it could in principle be used for this purpose. I believe there is an analogue in the distributed world which actually sees some use, though I forget the name. (Somebody says that google spanner works this way.)

    To formalise this a bit more: we can say that sampling the clock and receiving a timestamp t synchronises-with everybody else who sampled the clock and received a timestamp <t. Whereas in the example, the assumption underlying the faulty code was an assumption about the timing of future events, which no memory/execution model I know of accounts for at all (eg c++ only says that threads will make progress ‘eventually’).
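
    A toy Go illustration of that comparison rule (the types are invented; this is not how the TSC or Spanner are actually consumed): events carry a clock reading plus a deterministic tiebreaker, and ordering falls out of comparing the pair.

    package main

    import "fmt"

    // Event carries a reading from a (hypothetically) globally
    // synchronised clock, plus a node ID to break ties between
    // concurrent events that land on the same tick.
    type Event struct {
        Timestamp uint64 // e.g. an invariant-TSC sample
        NodeID    uint32
        Value     string
    }

    // Before reports whether a is ordered ahead of b. If the clock's
    // precision exceeds the communication delay, happens-before(a, b)
    // implies a smaller timestamp; equal timestamps mean the events
    // were concurrent, and the node ID breaks the tie deterministically.
    func (a Event) Before(b Event) bool {
        if a.Timestamp != b.Timestamp {
            return a.Timestamp < b.Timestamp
        }
        return a.NodeID < b.NodeID
    }

    func main() {
        x := Event{Timestamp: 100, NodeID: 2, Value: "write from node 2"}
        y := Event{Timestamp: 100, NodeID: 1, Value: "concurrent write from node 1"}
        if y.Before(x) {
            fmt.Println("ordered last, so it wins:", x.Value)
        }
    }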

    2 votes
  6. slambast
    Link

    As soon as I saw the first Go snippet I thought "oh, this needs a wait group". Then I saw the other snippet where net.Listen was simply lifted outside of the goroutine's scope... I'm not sure I like what this says about me.

    At work, our codebase is pretty good about avoiding time synchronization, but there's one situation where it keeps popping up: another process is running that needs to finish, will take some unspecified amount of time, and we don't have visibility into it. For example:

    • An integration test sends several messages through a mocked pubsub interface, then checks that they were sent. But pubsub is running in a different process, so we just wait a bit before checking.
    • We run docker commands to (a) start the local container with postgres, and (b) start another container that runs migrations on the newly created one. Here, the migration runner needs to wait until the postgres container is up and accepting connections, but we don't have an interface to that. So we essentially poll, with a time.Sleep(1 * time.Second) between attempts.

    I don't have good solutions for these cases, but what they seem to have in common is that they would require bidirectional communication: we would need the "other" service to tell us when something is ready or complete. Since it can't, we're basically stuck polling, and polling is time synchronization at its core.

    1 vote