23 votes

User-defined Order in SQL

24 comments

  1. [9]
    Deely
    Link

    I can't shake the feeling that this is a bit of an overkill. Personally I would just remove the uniqueness constraint and use 'update todos set pos = pos + 1 where id >= 3'.
    In the end sorting is not a critical attribute, and it can easily be fixed later...

    What am I missing?
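
    A minimal sketch of that shift-and-insert approach, assuming the post's todos table has an integer pos column with the uniqueness constraint dropped, that the shift is applied to pos rather than id, and that the task column and values are placeholders:

      BEGIN;
      -- make room at position 3 by shifting everything at or after it down one
      UPDATE todos SET pos = pos + 1 WHERE pos >= 3;
      -- drop the new row into the gap
      INSERT INTO todos (task, pos) VALUES ('new item', 3);
      COMMIT;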

    12 votes
    1. [6]
      rkcr
      Link Parent

      Besides performance, if you're in a multiple user situation, using floats allows all users to simultaneously reorder items without concurrency issues.

      (I've had a lot of experience working with float positions from my time working at Trello; that's a use case where integer positions would not work.)
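
      A rough sketch of that float-midpoint move, assuming a table with an id and a float8 pos column (the specific ids are placeholders); only the moved row is written, so concurrent reorders rarely contend:

        -- place item 42 between items 7 and 8 by averaging their positions
        UPDATE todos
        SET pos = ((SELECT pos FROM todos WHERE id = 7)
                 + (SELECT pos FROM todos WHERE id = 8)) / 2.0
        WHERE id = 42;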

      9 votes
      1. [4]
        Deely
        (edited)
        Link Parent

        Thanks for the info. Do you remember whether it's necessary to recalculate the stored float values (from time to time) to reduce the depth/precision?

        3 votes
        1. [3]
          whbboyd
          Link Parent

          Yes, this is necessary.

          (Not /u/rkcr, nor have I ever worked at/on Trello, but I've built similar systems using floats for "midpoint" insertion.)

          There are always pathological access patterns that will force you to increase precision at every step, and there is very limited headroom for that sort of thing before you either don't have any more precision, or things just get too slow to be usable.
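
          A quick PostgreSQL illustration of that ceiling: each insertion into the same gap halves the spacing, and a float8 runs out of mantissa bits after roughly fifty halvings.

            -- pos after i consecutive insertions just above 1.0, each halving the gap
            SELECT i, 1.0::float8 + (2.0::float8 ^ (-i)) AS pos
            FROM generate_series(50, 54) AS i;
            -- from i = 53 onward the result collapses to exactly 1.0, i.e. a duplicate position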

          (Now, Trello has relatively few items—they're drawing cards in a browser window, so it'll be of questionable utility past a few hundred, and absolutely will not work client-side past a few thousand—and so the need for this is lower, and they may just allow their system to fail at the limits rather than trying to dynamically recompute their order.)

          5 votes
          1. rkcr
            Link Parent

            /u/whbboyd has it right. Trello would detect when a degenerate case was approaching and redo the positions on all items in a collection if we were headed for disaster. (I don't know how often this code actually ran - pretty rarely - but we planned for it.)

            Also - while Trello only renders a few lists/cards at a time, it is possible to have thousands of cards in lists.

            That also reminds me - one other nice aspect of float-based positioning is that a card could be archived (essentially removed from the list), then un-archived later, and it would still fit into the same spot as before (as long as there wasn't a position redo, but like I said, that's rare).
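
            As a hedged guess at what that detection can look like (not Trello's actual code; cards, list_id, and pos are stand-ins), scan a list for its smallest gap and schedule a renumber when it approaches float precision:

              SELECT min(pos - prev_pos) AS smallest_gap
              FROM (SELECT pos, lag(pos) OVER (ORDER BY pos) AS prev_pos
                    FROM cards
                    WHERE list_id = 1) AS gaps
              WHERE prev_pos IS NOT NULL;
              -- if smallest_gap drops below some epsilon (say 1e-9), renumber the list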

            3 votes
          2. tibpoe
            Link Parent

            > There are always pathological access patterns that will force you to increase precision at every step

            Can you not use two strategies in this case? When it gets too pathological, then you take a lock, clean things up, and resume.
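
            One possible shape for that clean-up step, sketched with placeholder names (the advisory-lock key, table, and spacing of 1024 are all assumptions): take a lock for the list, renumber it with evenly spaced values in one transaction, then resume.

              BEGIN;
              -- serialize rebalances of list 1 against each other
              SELECT pg_advisory_xact_lock(1);
              UPDATE cards AS c
              SET pos = r.new_pos
              FROM (SELECT id, row_number() OVER (ORDER BY pos) * 1024 AS new_pos
                    FROM cards
                    WHERE list_id = 1) AS r
              WHERE c.id = r.id;
              COMMIT;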

            1 vote
      2. teaearlgraycold
        Link Parent

        You could just make the process transactional.

    2. Bwerf
      Link Parent

      The performance angle I think. Depends on your usecase of course.

      3 votes
    3. teaearlgraycold
      Link Parent

      That’s what I did before for lists that were less than 10 elements long.

      1 vote
  2. brogeroni
    Link

    Recently, for a project I'm working on, I needed to store a user-defined list of items. What's the best way to do it?

    Turns out someone smarter than me (Joe Nelson, author of PostgREST) already wrote a blog post.

    Hopefully it helps other people like it helped me.

    5 votes
  3. [6]
    Greg
    Link

    I really like this kind of blog post, it’s clear, concise, and does a good job of getting straight into the why of the problem. Ultimately most programming is about trade offs, so the thought-out reasoning (and concrete data) can be a lot more useful than the final answer that happened to work for someone else.

    I will say that in my experience, much as my nerd brain wishes it were otherwise, the beautiful and elegant solution can often end up causing headaches later that the dumb-as-a-brick one avoids. One I see right off the bat here is that it depends on a custom C extension, which rules out using it in a lot of cloud environments - and even if we real developers™ only ever work by writing to our own custom hardware with tiny magnets we forged by hand, the fact is that a huge number of deployments will be on RDS or equivalent.

    But that’s fine, because the post also gives numbers on how well the built-in arbitrary-precision data type works with simple halving, and that’s what I see as most useful here. That confirms you can still get single-row updates the vast majority of the time, but if you update an index and it’s grown over some arbitrary depth (say 32 bits), then you can fire off a separate query afterwards to transactionally lock and reindex the list as a whole. I/O overhead and concurrency issues are still reduced by several orders of magnitude compared to doing that every time with integer indices, but you can do it in a couple of lines of standard SQL.
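
    As a sketch of what that could look like, assuming a numeric pos column and placeholder names (scale() is PostgreSQL's numeric-scale function): insert by halving, and let the application fire the separate reindexing query only when the value has grown too deep.

      -- insert midway between two known neighbours using arbitrary-precision numeric
      -- (multiplying by 0.5 instead of dividing by 2 keeps the stored scale minimal)
      INSERT INTO todos (task, pos)
      VALUES ('new item',
              ((SELECT pos FROM todos WHERE id = 7)
             + (SELECT pos FROM todos WHERE id = 8)) * 0.5)
      RETURNING scale(pos) > 32 AS needs_rebalance;
      -- each halving adds one fractional decimal digit, so scale(pos) tracks the depth;
      -- when needs_rebalance is true, fire the separate lock-and-reindex query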

    4 votes
    1. [3]
      krellor
      Link Parent

      I didn't say it in my reply, but I thought the same thing about cloud compatibility. My instinct when designing for scale is to anticipate something like the following for growth:

      1. It runs on a box under the desk.
      2. It runs on a small cluster in an MDF.
      3. It runs on multiple clusters in a data center.
      4. Oh god, it keeps growing, put it in the cloud.
      5. We've finally gotten so big that private data centers make financial sense again.

      So to my lizard cloud brain, starting with something built to outlive a numeric type while precluding cloud or serverless compatibility feels unnatural.

      But not every project takes that path, so 🤷‍♂️.

      5 votes
      1. [2]
        vord
        Link Parent

        In between steps 2 and 5 there is a great forgotten in-between: Colocation.

        $50-$100 a month can buy a lot of different options to host a 1 or 2U server.

        1 vote
        1. krellor
          Link Parent

          For the Colo route, I'd say the MDF is equivalent to per-U Colo, and the data center is equivalent to a cage at a peering point like Hurricane Electric, the SIX, or another regional fiber exchange location.

          1 vote
    2. [2]
      vord
      Link Parent

      The author does appear to overlook something a lot of people do: no matter what, you need to run maintenance tasks on your data.

      No amount of trickery is going to avoid the problems that come from neglecting this forever.

      I was contemplating a different method as I was reading: an index position plus an offset. If an item is rearranged and its index would not be unique between the records above/below it, set its index equal to the row above's and increment the offset.

      Have a maintenance job that recomputes the index positions periodically.

      Ideally the uniqueness constraint is threefold: the joining FK (i.e. list_ID), the position, and the offset.
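
      A rough DDL sketch of that idea, with illustrative names ("offset" on its own is a reserved word, hence pos_offset):

        CREATE TABLE todo_items (
          id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
          list_id    bigint  NOT NULL,
          task       text    NOT NULL,
          pos        integer NOT NULL,
          pos_offset integer NOT NULL DEFAULT 0,
          UNIQUE (list_id, pos, pos_offset)
        );
        -- reads sort by (pos, pos_offset); the periodic maintenance job folds
        -- the offsets back into pos and resets pos_offset to 0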

      2 votes
      1. scherlock
        Link Parent

        Yup, the BASIC approach followed by a renumbering after a period of inactivity would be straightforward and portable.

        2 votes
  4. krellor
    Link

    I like the article, but I do feel like the author hand-waved the limitation of the Stern-Brocot tree method just a tad. It has an eventual limitation, just like the other numerics.

    The deeper observation, I think, is that every tree eventually needs rebalancing and boundary rule enforcement unless you want things to puke on the user.

    Personally, I'd avoid this method to avoid the reliance on bespoke RDBMS features. Nothing sucks more than outgrowing an RDBMS and needing to refactor more than you should have to in order to move.

    I think the integer spacing method, with boundary checking and a scheduled batch rebalance of the spread, works well for me. It does require writing the stored procedure, and a basic mutex check for the rolling rebalance. But if you are really worried about running out of numeric and float precision, I sort of think you should be factoring in other scaling issues as well.
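
    For reference, a sketch of that kind of spacing scheme with illustrative names (the spacing of 1024 and the ids are placeholders): new rows land at the midpoint of their neighbours' integer positions, and a shrinking gap is the boundary signal that the batch rebalance is due.

      -- insert between two neighbours at the midpoint of their integer positions
      INSERT INTO todos (task, pos)
      SELECT 'new item', (prev.pos + next.pos) / 2
      FROM todos AS prev, todos AS next
      WHERE prev.id = 7 AND next.id = 8
        AND next.pos - prev.pos >= 2;  -- boundary check: a free integer still exists
      -- if no row was inserted, the gap is exhausted: run the scheduled rebalance
      -- (guarded by pg_try_advisory_lock as the mutex check) and retry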

    That said, I like the Stern-Brocot method, and have implemented it in the past using my own in-house library.

    Thanks for sharing!

    3 votes
  5. [2]
    Rudism
    Link

    I would like to see some kind of performance benchmark comparison between these methods, and how much data and how often ordering updates would have to be coming in before you'd start experiencing real-world performance bottlenecks. I suspect most applications could use the more inefficient solutions to this problem (including the obvious simple integer position column) and not have to worry about it too much until they scaled to a size that the majority of applications will never reach.

    Sometimes you also have to weigh the complexity of a solution against the efficiency benefits, especially if you're working on a team of programmers with varying skill levels.

    3 votes
    1. Greg
      Link Parent

      I think you’re probably right, with the possible exception of the integer approach. Anything that locks multiple rows simultaneously on every request - especially an unbounded number of rows - sets off my spidey sense because the scaling there can look totally fine until it hits the wrong overlap with another process and then suddenly it tanks by a factor of 10,000 all at once.

      2 votes
  6. [5]
    zod000
    Link

    I'm sure OP has their reasons, but every time I've been in this situation, I've just added a sort order column to the table. Updating the ID column of the record just feels wrong and likely wouldn't work at all if you have the column as a foreign key in another table.

    2 votes
    1. vord
      Link Parent

      I mentioned this in another reply, but 100%.

      Your positioning is an attribute, not an identifier. Using it as both is asking for trouble.

      3 votes
    2. [3]
      Greg
      Link Parent

      Oh, I read the examples as just being illustrative, since there wasn’t a PK column at all. I was assuming there’d inherently be an ID column (and user ID, created at, last updated, etc.), but the post was specifically about how to structure the sort order column when you’re adding it.

      3 votes
      1. [2]
        zod000
        Link Parent

        If that is the case, it wasn't clear to me. Also, if that is the case, why does the order value need to be unique? How large are these user lists that you can't just have the user make their sorting adjustments and then do an update of the set of records and save the order position as integers?

        Maybe I'm not getting the issue here as this just doesn't seem like a difficult issue.

        1. Greg
          Link Parent

          Uniqueness is there to guarantee that it can be sorted consistently; the post doesn’t have a field to differentiate between multiple lists (one of the reasons I assumed it was a toy example, in fact), but you’d still want UNIQUE (list_id, sort_order) even if you did go for the integer approach, otherwise two rows could have the same sort index.

          And yeah, there are definitely situations where just using integers could work fine - but it potentially falls down if you need reliable concurrency, long lists, or large user volumes. I probably wouldn’t install a whole extension to deal with that, but adding one extra query to use the numeric type rather than integer and rebalance only when necessary seems easy enough to be a win in my book.
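
          A minimal schema for that compromise, as a sketch with assumed names: a numeric sort column, unique per list, rebalanced only when values grow too deep.

            CREATE TABLE todos (
              id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
              list_id    bigint  NOT NULL,
              task       text    NOT NULL,
              sort_order numeric NOT NULL,
              UNIQUE (list_id, sort_order)
            );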

          2 votes