34 votes

Life altering PostgreSQL patterns

35 comments

  1. [8]
    EsteeBestee
    (edited )
    Link

    There's some obvious-sounding, but very sound, advice in there. I have direct experience with being fucked over by bad database designs. You would think it's obvious to have created_at and updated_at everywhere you can, but please, for the love of god, make sure you actually have those columns and actually write to them. Our damn schema has those in most tables now, but there are still some using created_on (and some tables with both created_on and created_at, where different bits of code write to one or the other, no I'm not joking), and some with no indication of any sort of creation date at all, and it's a fucking nightmare to run queries against when generating reports.

    One that maybe isn't super obvious, but that I will say is super great advice, was "Represent statuses as a log". You never know whether, 5 years after you create a table, you'll need to audit the data in there for some reason and need to know past values instead of current values. For example, we have an integration with a vendor that sends us account data with statuses. The data is under a lot of scrutiny since it has to do with financial shit, and there are statuses linked to the accounts we see. Sometimes shit hits the fan with one of our customers and they need a data trail of the statuses on their accounts at certain points in the past, and you/they are just fucked sometimes if you only keep the current state of the data. At the very least, if you have an 'account' table or something, have an 'account_status' table, too, that can be joined on 'account.id' or whatever and tracks all status changes per account.
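
    A minimal sketch of that status-log table (hypothetical names, assuming an 'account' table with an integer 'id'):

    create table account_status (
        account_id  bigint      not null references account (id),
        status      text        not null,
        recorded_at timestamptz not null default now()
    );

    -- Current status per account: the newest row for each account_id.
    select distinct on (account_id) account_id, status, recorded_at
    from account_status
    order by account_id, recorded_at desc;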

    14 votes
    1. vord
      Link Parent

      but please for the love of god make sure you actually have those columns and actually write to them

      And that's why you always set up those two triggers: nobody ever has to think about writing to those columns, it happens automatically.

      For stuff that needs auditing, a trigger on update or delete that copies the prior row into a different table helps.
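
      One way to wire that up in Postgres (a minimal sketch; 'accounts' is a stand-in table name assumed to have created_at and updated_at columns):

      create or replace function touch_timestamps() returns trigger as $$
      begin
          -- Stamp created_at once, keep updated_at current on every write.
          if tg_op = 'INSERT' then
              new.created_at := now();
          end if;
          new.updated_at := now();
          return new;
      end;
      $$ language plpgsql;

      create trigger accounts_touch_timestamps
          before insert or update on accounts
          for each row execute function touch_timestamps();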

      9 votes
    2. [6]
      Eji1700
      (edited )
      Link Parent

      I am currently working on getting data out of a vendor database into something we can actually use. This is from a BILLION dollar company on one of their flagship products.

      Let's say you have table1. Table1 has

      numA, name, date, thingId

      Then you have zable1 (yes... zable... so it appears at the bottom of table lists). zable1 has

      numA, name, option, stuff

      Every. Single. Table. has a "numA" column. That's not the real name, but they all use the SAME name. In most cases it's a primary key, but not in all of them. In the example above, table1 has duplicate values in numA, and you join zable1 like this:

      select * 
      from table1 t
      left join zable1 z on t.thingId = z.numA
      

      Oh, and those names? They're shockingly accurate. The only way you know what zable1 does (and they go up into the zable90s) is if you have access to the data dictionary, which is woefully lacking in useful information.

      There are NO "updated at" columns of any real use.

      I could ramble more about it, but I'd just wind up sad.

      Please, for the love of god, whatever personal "I'll fix it later" project you're doing, FIX IT NOW.

      2 votes
      1. [3]
        Weldawadyathink
        Link Parent
        • Exemplary

        That reminds me of this wonderful database story. Ping /u/EsteeBestee, since you might be interested too.

        4 votes
        1. [2]
          EsteeBestee
          Link Parent

          That was an incredible read and now I suddenly don’t feel as bad about my situation. HOW DO YOU RUN OUT OF COLUMNS?!

          2 votes
          1. vord
            Link Parent

            One product that I know of relies on MS Dynamics and SQL Server. Due to the user-driven nature of the product ("You don't need programmers to do all this stuff!"), which autogenerated table structures, things got ugly, and we did actually hit a SQL Server column limit back in 2016 (which IIRC was something like 1023).

            Edit: as the linked story says, it was 1024 at the time. I think Dynamics had a reserved column or two which prevented you from hitting the database limit with your own stuff.

      2. EsteeBestee
        Link Parent

        Oh my god, holy shit. That also reminds me that we have 80 tables prefixed with “xx_” because someone wanted them grouped together in the schema and THAT’S NOT HOW THAT SHOULD WORK!!!

        I also checked this morning and we’re now up to 666 tables in our cursed platform DB. At the very least we have a data warehouse with good architecture so that I can use that for queries instead in most cases.

        3 votes
      3. vord
        Link Parent

        Old systems are the worst, because back in the day table names and sometimes column names were limited to 8 characters.

        A large ERP system I worked with has arcane table names like ASCABINC, where each letter is an abbreviation for something.

        3 votes
  2. [4]
    krellor
    Link

    The only one I disagree with is strictly adhering to singular name forms for tables. I disagree that the majority of queries will result in selecting individual records. In fact, my experience is that for major applications, I'm mostly working with collections of things, and uniformly referencing collections in plural form and individuals in singular makes for intuitive code.

    That said, if you find a situation where the reverse is true, you shouldn't be dogmatic either way. English grammar is just a guideline to aid in readable code; if you need to take a few liberties with spelling or names, do what makes sense, as long as it's documented.

    5 votes
    1. xk3
      Link Parent

      selecting individual records

      I think they meant that when writing the select statement you reference individual values, like "city.name" instead of "cities.name". I agree this shouldn't be dogmatic, but it should be consistent (whenever possible) to minimize confusion. I've always preferred plurals, and I never had second thoughts about "cities.name".

      But I agree, it's not a strong tip like the others are.

      7 votes
    2. [2]
      tanglisha
      Link Parent

      I don't think it matters whether they're singular or plural, so long as you pick one and stick with it. Some random combination is so much worse than whatever bad might come from either.

      1 vote
      1. krellor
        (edited )
        Link Parent

        Agreed. Like most things, consistency within the project is king.

        2 votes
  3. [2]
    Eji1700
    Link

    I think the only thing I really take issue with is:

    Name your tables singularly

    If you're writing any serious amount of SQL without using aliases in your queries, you're insane. Having descriptive table names, so that when you're looking at your table list you know exactly what's in there, is very helpful. There are arguments for singular names vs. plural either way.
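
    For instance, with aliases the full table name only appears once per query anyway (hypothetical tables):

    select c.name, o.placed_at
    from customers c
    join orders o on o.customer_id = c.id;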

    Everything else seems like sound enough advice or is coming from an area I know little about or might be at odds with my particular use cases.

    5 votes
    1. vord
      Link Parent

      We're not in 1999 anymore. Much like variables, make your table and column names as long as they need to be to convey what they do at a glance.

      Oh and put comments on your tables and columns. Future you will thank you.

      4 votes
  4. [11]
    vord
    Link

    I've worked as a DBA most of my adult life so far, most experienced with Oracle, then SQL Server, then Postgres. The article is pretty good, but I have two major thoughts on the author's points.

    UUIDs for PKs suck unless you have a very specific need for them. Indexes are the lifeblood of anything beyond a toy database, and using UUIDs to join many things together gets nasty fast. The only time UUIDs make sense in my mind is highly distributed systems with intermittent connectivity which rely on eventual consistency.

    Views are absolutely horrid. Don't use them. NEVER join views together. This is the understatement of the century:

    The query planner also seems to have trouble seeing through them in general.

    While Oracle has made progress on that front, it is still very much a crapshoot whether a view will help or hurt any given query. A view is by and large a black box to an optimizer: it must compute the entire view to make use of it in a join. If you have two views which leverage a table with 100,000,000 rows in it, and you try to join those views together, you'll end up querying that table twice. The hit can be masked if there is good indexing and the tables are smaller, but if you end up hitting a full table scan (and with UUIDs for PKs, it'll happen more) on a gigantic table, it's a nightmare.
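
    A hypothetical illustration of the double-scan problem, using a made-up 'orders' table:

    create view recent_orders as
        select * from orders where placed_at > now() - interval '30 days';

    create view large_orders as
        select * from orders where total > 1000;

    -- If the planner can't flatten the views, this may scan 'orders' twice:
    select r.id
    from recent_orders r
    join large_orders l on l.id = r.id;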

    3 votes
    1. [2]
      xk3
      (edited )
      Link Parent

      Huh... TIL. I always thought postgres views are mostly transparent to the query planner but that is not always the case:

      dbt (a common tool in the Analytics Engineering community) often favors using views:

      Generally start with views for your models, and only change to another materialization when you notice performance problems.

      But this is mostly for OLAP workflows which run a handful of times per week where the performance difference might not matter or might not immediately be obvious.

      The speed difference between INT4 and UUID index lookup can be over an order of magnitude in really long and wide tables. Luckily they aren't usually that bad: https://ardentperf.com/2024/02/03/uuid-benchmark-war/

      However, I absolutely agree that UUIDs are pretty awful, even as external IDs: they aren't great for debugging since they're hard to visually compare (you might think two are the same when your glance only compared the first and last 2 chars, etc.). You can't even double-click to select them because they have dashes. And users definitely prefer smaller IDs like "BOB123".

      3 votes
      1. vord
        Link Parent

        It does seem that the newer UUID schemes are significantly better than the ones I remember, but the memory pressure problem still applies when OLTP databases are sized in multiples of terabytes.

        UUIDs seem to be much better in distributed systems where the network will be your bottleneck, but that's not always viable.

        1 vote
    2. [3]
      TurtleCracker
      Link Parent

      Doesn’t UUIDv7 address most of the indexing and performance concerns?

      1 vote
      1. [2]
        vord
        Link Parent

        I mentioned elsewhere in this topic that UUIDv7 is much better, but the added 25% memory usage is still a killer in large-scale applications. I like @krellor's solution of having a regular integer PK, then a UUID for API usage.

        3 votes
        1. TurtleCracker
          Link Parent

          We are transitioning from bigint to uuid for primary keys. The flexibility for the frontend to set the identifier without checking the backend first is super helpful. We also have many applications in our ecosystem that communicate, and having certainty on ID uniqueness across them is worth it for us.

          4 votes
    3. [5]
      0xSim
      Link Parent

      using UUIDs to join many things together gets nasty fast

      How?

      We always use UUIDs in our tables in SQL Server, and it seems to be OK? We also manage a lot of data. Not financial transactions, but enough to have performance as the first concern.

      1. [3]
        zod000
        Link Parent

        UUID/GUIDs in general are just a lot slower to index and take up more space, so at scale they can become a performance issue.

        The only time I really care to use UUIDs for a PK is when the value in question will be publicly accessible as part of an API or URL. I really don't like having systems that are able to be trivially iterated over.

        2 votes
        1. [2]
          krellor
          Link Parent

          In those situations I typically create an auto-incrementing integer as the PK, and a separate UUID that can be exposed for public API invocations. That lets me handle foreign keys and other internal operations like joins with integers, which are more efficient at scale, and have UUIDs for public reference.
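
          A minimal sketch of that layout in Postgres (hypothetical table; gen_random_uuid() is built in since Postgres 13):

          create table account (
              id        bigint generated always as identity primary key,
              public_id uuid   not null default gen_random_uuid() unique
          );

          -- Joins and foreign keys use the compact integer id;
          -- the public API only ever sees public_id.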

          The largest public-facing platform I ever managed was a massive regulatory database that allowed the public to generate accounts and API tokens, allowed custom query creation through the API or a web interface, and allowed third parties to build apps on top of our APIs using rate-limited accounts. That was a ton of fun to optimize, and the work that went into it was quite the education. Ultimately, the relational database that the queries ran against used set operations on the internal keys to build a result set containing only the unique IDs of what to return, and then a separate select ran behind the scenes against a denormalized warehouse of the data, so that the request didn't have to materialize data across so many foreign key relationships.

          2 votes
          1. zod000
            Link Parent

            I've done the exact same thing in specific instances where I felt the UUIDs were likely to cause performance issues, but generally it wasn't worth the time unless the tables were really huge and/or had a constant flood of writes.

      2. vord
        (edited )
        Link Parent

        Had a vendor system that relied on UUIDs exclusively in an OLTP system. It had atomic row transactions on the order of several thousand per minute, often with a lot of updates or deletes.

        The indexes would get trashed every hour. And because it was highly normalized, and often wide (see @xk3's reply), performance would tank without constant rebuilds because the indexes would be wide near the top and sparse on the branches, which is the opposite of what you would want.

        Unless your index is quite sparse, inserting UUIDs into a sorted index (best for performance) requires re-sorting the block. Having too sparse of an index makes it less performant. Imagine you're trying to create a sorted list in minimal amount of memory (as a large database will almost always be memory-constrained). If you put everything in a sequential row, you merely allocate another bit of memory and tack it on the end. But if you need to insert random data, and you run out of room, you have to reallocate the memory to fit the new record, moving around the data to fit it in. This also applies cache pressure which can age things out while they're still hot. And since a good database will cache both hot indexes and hot data, any cache pressure isn't great, and you can only scale memory so high relative to size of data.

        Other systems we've had with similar levels of transactional workload didn't have this problem because the PKs weren't essentially randomized at the top. They didn't need additional sorting, just being tacked on the end and periodically compacted.

        Edit: I edited quite heavily to fix some early-morning wrong-word logic.

        1 vote
  5. [7]
    revivinglaziness
    Link

    Thank you for sharing this! I'd seen many of these but not all in one place.

    This is my first time seeing enum tables, but I like the idea of that pattern a lot. It matches well with the liberal (and justified, IMO) use of FK constraints they suggest. But I think it could also help avoid the need for some schema migrations, which are always a little bit nerve-wracking no matter how well you automate them.
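
    For reference, the pattern is roughly this (hypothetical names):

    create table order_status (
        value text primary key
    );

    insert into order_status (value)
    values ('pending'), ('shipped'), ('cancelled');

    create table orders (
        id     bigint generated always as identity primary key,
        status text   not null references order_status (value)
    );

    -- Adding a status is a plain insert rather than a schema migration:
    insert into order_status (value) values ('refunded');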

    1 vote
    1. [6]
      em-dash
      Link Parent

      I hate enum tables so much. They sound so tempting and yet they're so bad in practice.

      The thing about enum values is they usually correspond to enum values in your codebase (otherwise they wouldn't really be enums, just data). Now instead of your application just requiring a specific schema version to work, it requires a specific schema and data version. This isn't better, it's just marginally different to migrate and more tempting to do it the wrong way.

      At least this implementation uses the text representation as the key. Most people will autopilot add an integer or UUID key and now you have to make sure that mapping is consistent across environments.

      4 votes
      1. [5]
        vord
        Link Parent

        The main way enum tables make sense is if you'll have many disparate teams working against your data model while not working closely together. Having a strict enum table helps ensure data integrity and notification across teams ("yo, the enum table is adding/changing values, make sure you won't have breakage").

        That said, they can be a real PITA if you don't need them.

        3 votes
        1. [4]
          em-dash
          Link Parent

          helps insure data integrity and notification across teams

          Amusingly, with one particularly memorable pair of enum tables (in the same product) I had the opposite experience because of how badly they were handled: they had serial integer keys and people kept adding values to the end and using the same number as each other, and that caused Weird and Exciting Results.

          I finally "solved" this by writing the current maximum value for each in dry erase marker on the window next to my desk, and declaring that anyone wanting to add a new value must first physically walk over and increment the appropriate value and use the new value as the key. It's a mutex enforced by physics!

          3 votes
          1. [3]
            vord
            Link Parent

            Use an in-database sequence for declaring the enumID, and throw out anything they try to insert in its place.
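
            Something like this sketch (table and column names are made up):

            create sequence enum_values_id_seq;

            create or replace function force_enum_id() returns trigger as $$
            begin
                -- Ignore whatever ID the caller supplied; the sequence wins.
                new.enum_id := nextval('enum_values_id_seq');
                return new;
            end;
            $$ language plpgsql;

            create trigger enum_values_force_id
                before insert on enum_values
                for each row execute function force_enum_id();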

            1 vote
            1. [2]
              em-dash
              Link Parent

              The numbers had to match numbers in the corresponding code-enum, so no, you really did need to insert them with explicit IDs.

              (Epilogue: eventually I did just drop the tables, after inventing a mildly terrifying pipeline to move all the extra metadata they had accumulated into constants in the right places in the codebase. It involved SQL that generated vim commands which were then pasted into a terminal with the right file open in vim (this was before bracketed paste was widely supported). I do not remember why "just regex-replace a CSV dump into an array of structs" wasn't an option, but I'm pretty sure I did it with vim specifically because I thought it was funny.)

              1 vote
              1. vord
                Link Parent

                I'm sure I'm just missing key bits of context, but if they're choosing the numbers in a codebase, that would imply they could change them, wouldn't it?

                If it is technically possible but never done, it is a policy failure. Which is the worst kind of failure, because fixing it often involves so many meetings that coping with the bad pattern is often more cost-effective.

  6. Akir
    Link

    Man, this conversation is making me miss the days when DBA was one of the hats I was wearing. I miss how much control you have over the data, and spending time optimizing how it's stored, queried, and restricted. Sadly, I stopped around the time Postgres started putting really interesting data types in there, because as a one-man operation in an increasingly complex business, in-house development for the things that needed any sort of data management was untenable.

  7. [2]
    gco
    Link

    On UUIDs:

    You don't need to coordinate with the database to produce one.

    I thought you still had to check for uniqueness to avoid collisions; is there something preventing this issue?

    1. xk3
      (edited )
      Link Parent

      It's good practice to check but a lot of applications don't because the likelihood of collisions is almost zero from a probability perspective.

      But I would still do it, if only for mental hygiene; it helps me sleep better at night. But I rarely use UUIDs anyway. I would give a unique name to each database and then use a sequence generator for internal keys; then you can use db_id + row_id as a unique internal key. The db_id could be the hardware MAC address if that is guaranteed to be unique. This serves essentially the same purpose that UUIDs serve for distributed databases. This is a surrogate key: an artificially created key that has no inherent business meaning; a system-generated identifier, such as an auto-incrementing integer or UUID.
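
      In Postgres terms, that db_id + row_id idea might look like this (a hypothetical table; db_id is assigned once per database instance):

      create table widget (
          db_id  text   not null, -- unique per database, e.g. derived from a MAC address
          row_id bigint generated always as identity,
          primary key (db_id, row_id)
      );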

      These are in contrast to "business keys" (aka natural keys): columns which have business meaning and need to be unique, in which case you need to use that data anyway to check uniqueness. UUIDs do not help here.

      1 vote