11 votes

Dealing with databases, inserts, updates, etc. in Python

Current Library: built in sqlite
Current db: sqlite (but will have access to Snowflake soon for option 1 below)

Wondering if anyone here has some advice or a good place to learn about dealing with databases in Python. I know SQL fairly well for pulling data and simple updates, but I'm running into potential performance issues with the way I've been doing it. Here are 2 examples.

  1. Dealing with Pandas dataframes. I'm doing some reconciliation between a couple of different data sources. I do not have a primary key to work with. I have some very specific matching criteria to determine a match (5 columns specifically - customer, date, contract, product, quantity). The matching process is all built within Python. Is there a good way to do the database commits with updates/inserts en masse vs. line by line? I've looked into upsert (or inserts with a clause to update existing data), but pretty much all examples I've seen rely on primary keys (which I don't have, since the data has 5 columns I'm matching on).

  2. Dealing with JSON files which have multiple layers of related data. My database is built in such a way that I have a table for header information, a table for line-level detail, and a third table with specific references assigned to the line-level detail. As with a lot of transactional-type databases, there can be multiple references per line and multiple lines per header. I'm currently looping through the JSON file, starting with the header information to create the primary key, then going to the line-level detail to create a primary key for the line (which also includes the foreign key for the header), and doing the same again with the reference data. Before inserting I do a lookup to see if the data already exists, then update if it does or insert a new record if it doesn't. This works fine, but it's slow, taking several seconds for maybe 100 inserts in total. That's not a big deal since it's a low volume of sales, but I'd rather learn best practice and do this properly with commits/transactions, rather than inserting and updating each record individually, and with the ability to roll back should an error occur.

17 comments

  1. [7]
    unkz
    Link

    sqlite supports bulk insert via VALUES, eg.

    INSERT INTO table 
        (column1, column2) 
        VALUES 
            ('value1', 'value4'),
            ('value2', 'value5'),
            ('value3', 'value6')
    

    I would recommend moving to an ORM of some kind, like SQLAlchemy, so you don't end up hand-coding SQL (blech).
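
    For the bulk part specifically, here's a rough sketch of what that insert looks like with SQLAlchemy Core (a recent 1.4/2.0-style SQLAlchemy) - table and column names are placeholders, and it assumes the table already exists:

    from sqlalchemy import MetaData, Table, create_engine
    
    engine = create_engine('sqlite:///example.db')
    metadata = MetaData()
    # reflect the existing table definition from the database
    table = Table('my_table', metadata, autoload_with=engine)
    
    with engine.begin() as conn:  # one transaction, committed at the end
        conn.execute(
            table.insert(),
            [
                {'column1': 'value1', 'column2': 'value4'},
                {'column1': 'value2', 'column2': 'value5'},
                {'column1': 'value3', 'column2': 'value6'},
            ],
        )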

    6 votes
    1. [6]
      imperator
      Link Parent

      Yeah, I'll need to look into this. SQL was the first language I learned so I'm quite comfortable with it.

      1. [5]
        Minty
        Link Parent

        Hand-coding SQL is not so much about SQL itself being bleh, but sanitizing and otherwise properly handling all values to avoid security issues.

        2 votes
        1. [4]
          imperator
          Link Parent

          ah that makes sense. Definitely want best practice. I'm the only one running it, so not too concerned about security. There is no user input; it's just pulling data from eBay and putting it into a database.

          1. Minty
            Link Parent

            It's all nice and dandy until Bobby Tables tries to sell something ;)

            3 votes
          2. [2]
            wcedmisten
            Link Parent

            Personally I wouldn't consider using an ORM a best practice. It's an abstraction layer that many devs choose to avoid (myself included). I would say it's more about how much control you want over your DB queries.

            As for security, most SQL libraries warn you to use prepared statements or similar to prevent SQL injection. As long as you're following those guidelines and not inserting user input directly into your queries, you should be fine.
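
            For example, with the built-in sqlite3 driver a parameterized query looks roughly like this (table/column names are just placeholders) - the driver handles escaping, so the values never get spliced into the SQL string:

            import sqlite3
            
            con = sqlite3.connect('example.db')
            with con:  # commits on success, rolls back if an exception is raised
                con.execute(
                    'INSERT INTO sales (customer, quantity) VALUES (?, ?)',
                    ('Bobby Tables', 3),  # passed as parameters, not string-formatted
                )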

            3 votes
            1. imperator
              Link Parent

              Interesting, thank you.

              1 vote
  2. [2]
    unkz
    Link

    It would be helpful if you indicated what database library you are using (SQLAlchemy, Django ORM, raw drivers?) and what database you are using (MySQL, Postgres, MS-SQL, ?). Bulk insert is possible in most cases.

    5 votes
    1. imperator
      Link Parent

      Sure. I'll also update the post.

      I was using the raw sqlite driver, but I'm not opposed to changing. In the JSON instance the database is sqlite. The pandas one is currently sqlite, but I'm going to have access to Snowflake soon as well.

      1 vote
  3. [4]
    spit-evil-olive-tips
    Link
    • Exemplary

    primary keys (which I don't have since the data has 5 columns I'm matching on)

    if those 5 columns uniquely identify a row, you do have a primary key, but it's a composite key (made up of multiple columns) and a natural key (made up of columns that already exist in the data). most examples you're looking at will use an artificial primary key (such as a UUID).

    one simple option, you can always build a lookup table in your code, by using a dict where the keys are (customer, date, contract, product, quantity) tuples, and the values are the corresponding dataframes. this has the downside that it will be limited by the available memory you have, but that may not be an issue for your use case.
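
    a rough sketch of that, assuming a pandas dataframe named df with those five columns:

    # build an in-memory lookup keyed on the 5-column natural key
    lookup = {}
    for row in df.itertuples(index=False):
        key = (row.customer, row.date, row.contract, row.product, row.quantity)
        lookup[key] = row
    
    # later, while reconciling the other data source (values here are made up)
    match = lookup.get(('ACME', '2023-01-31', 'C-100', 'widget', 5))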

    you may also be able to define a unique constraint over those 5 columns, and then use upserts triggered by violations of that constraint.
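
    in sqlite (3.24 or newer) that looks roughly like this - table and column names are made up, and the extra status column is only there to have something to update on conflict:

    import sqlite3
    
    con = sqlite3.connect('example.db')
    con.execute("""
        CREATE TABLE IF NOT EXISTS matches (
            customer TEXT, date TEXT, contract TEXT, product TEXT,
            quantity INTEGER, status TEXT
        )
    """)
    # the unique constraint over the 5-column natural key
    con.execute("""
        CREATE UNIQUE INDEX IF NOT EXISTS ux_matches_key
        ON matches (customer, date, contract, product, quantity)
    """)
    
    rows = [('ACME', '2023-01-31', 'C-100', 'widget', 5, 'matched')]
    
    # upsert: insert new rows, update existing rows on a key collision
    with con:
        con.executemany("""
            INSERT INTO matches (customer, date, contract, product, quantity, status)
            VALUES (?, ?, ?, ?, ?, ?)
            ON CONFLICT (customer, date, contract, product, quantity)
            DO UPDATE SET status = excluded.status
        """, rows)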

    a bit more complicated, but a fun database trick, it's possible to generate hash-based artificial primary keys based on composite natural keys like you have. it relies on those 5 columns being indeed unique, and they must not require any fuzzy matching (such as the quantity sometimes being an integer and sometimes a float - if this were the case you'd need to normalize the values, but this would be a good practice anyway).

    for example:

    import hashlib
    
    h = hashlib.blake2b(digest_size=16)
    h.update(row.customer)
    h.update(row.date)
    h.update(row.contract)
    h.update(row.product)
    h.update(row.quantity)
    row_id = h.hexdigest()
    

    this computes row_id as a 16-byte (32 hex characters) ID, similar to a UUID, which you could use as your artificial primary key.

    I'm handwaving over some type-conversion in that code because update requires a bytes-like object, so you would need to convert each value to something the hash will accept. you can also do this by creating a dictionary with those 5 keys, serializing it to JSON (with sort_keys=True so that the output is deterministic) and then feeding that JSON into the hash function.
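
    roughly, that JSON variant looks like this (assume row is a plain dict with those 5 keys; default=str is only there so things like dates serialize):

    import hashlib
    import json
    
    def row_id(row):
        # pick out the 5 key fields and serialize them deterministically
        key = {k: row[k] for k in ('customer', 'date', 'contract', 'product', 'quantity')}
        payload = json.dumps(key, sort_keys=True, default=str).encode('utf-8')
        return hashlib.blake2b(payload, digest_size=16).hexdigest()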

    Dealing with JSON files which have multiple layers of related data.

    the way I'd structure this is by separating out the JSON file parsing from the database insertions.

    I'd write some Python dataclasses that represent the data you're parsing out of the file, and have a function that parses the JSON file and returns a bunch of those dataclass objects, without caring about whether they exist in the database or not. then a separate function that reconciles those objects with the database state. that second function can do a single transaction with all the inserts/updates.

    (this also has the side benefit that you can much more easily write unit tests of the two independent functions)
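
    a skeleton of that split, with made-up table and field names (the INSERT OR REPLACE is just a stand-in for whatever reconcile logic you actually need, and it assumes those tables have primary keys defined):

    import sqlite3
    from dataclasses import dataclass
    
    @dataclass
    class Line:
        line_no: int
        product: str
        quantity: int
        references: list[str]
    
    @dataclass
    class Order:
        order_id: str
        customer: str
        lines: list[Line]
    
    def parse_orders(path) -> list[Order]:
        # pure JSON parsing, no database access - easy to unit test
        ...
    
    def reconcile(con: sqlite3.Connection, orders: list[Order]) -> None:
        # one transaction for everything; rolls back if anything raises
        with con:
            for order in orders:
                con.execute(
                    'INSERT OR REPLACE INTO header (order_id, customer) VALUES (?, ?)',
                    (order.order_id, order.customer),
                )
                for line in order.lines:
                    con.execute(
                        'INSERT OR REPLACE INTO line (order_id, line_no, product, quantity) '
                        'VALUES (?, ?, ?, ?)',
                        (order.order_id, line.line_no, line.product, line.quantity),
                    )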

    3 votes
    1. [2]
      ebonGavia
      Link Parent

      This is interesting stuff. Is there any possibility of, for example, a hash collision on the output of h.hexdigest()?

      1 vote
      1. spit-evil-olive-tips
        Link Parent

        there's always a possibility - there are 2^128 possible hash outputs and more than that number of possible inputs, so by the pigeonhole principle collisions must exist.

        but with 128 bits of output, the likelihood of an accidental collision is essentially zero, especially with the number of rows that OP is talking about (less than a million, certainly). you can play around with the probabilities if you're curious (using a calculator like this one)

        with that calculator, you have to ask about 1 trillion items before it even shows the probability of a 128-bit collision. and even then, the chance of a collision among those 1 trillion items is approximated to be 0.00000000000014432%.

        with the blake2b hash function I used in that example, it supports variable output length, so you could use more bits if you were worried about collisions - but 128 is more than sufficient for this. even 64 bits would likely be enough for OP's use case - 100,000 items and a 64-bit hash function has a 0.000000027104% probability of collisions. and each additional bit halves the collision probability - so doubling the hash size from 64 to 128 makes collisions 18446744073709551616 (2^64) times less likely.

        this analysis also relies on blake2b being a cryptographic hash function, which has a more complicated definition but for these purposes means its output can be treated as effectively random (but of course it's not actually random, because it's deterministic for a given input). non-cryptographic hash functions might be more prone to collisions, but 128 bits provides enough of a safety margin that even a non-cryptographic hash function, as long as it's well-written, would be very unlikely to have any accidental collisions.
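
        the approximation those calculators use is the birthday bound, which is easy enough to poke at directly:

        def collision_probability(n_items, hash_bits):
            # birthday-bound approximation: p ≈ n^2 / 2^(bits + 1)
            return n_items ** 2 / 2 ** (hash_bits + 1)
        
        print(collision_probability(10 ** 12, 128))  # ~1.5e-15
        print(collision_probability(10 ** 5, 64))    # ~2.7e-10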

        3 votes
    2. imperator
      Link Parent

      Interesting. Lots of good information here. I'll be honest, it's a bit above my head, but probably not so much that I couldn't get there with some additional searching and maybe some follow-up questions.

      I've not used decorators or data classes (well, I've not defined them myself; I'm sure I've used them indirectly).

      I like the hashing idea for the composite/natural primary key. That way I can easily store reconciled items in a database instead of keeping them in memory and then dumping to an Excel file. Previously I had to very carefully extract new data so it didn't overlap data I had already pulled. This would help eliminate that... Very cool.

      1 vote
  4. [4]
    Comment deleted by author
    Link
    1. edoceo
      Link Parent
      PostgreSQL supports this functionality as well.

      PostgreSQL supports this functionality as well.

      1 vote
    2. [2]
      imperator
      Link Parent

      I did find this yesterday, so I'll definitely take a look. Are you aware of a best practice for committing all of this data with sub-tables more efficiently? I was thinking about this, and I could in theory break up the JSON file into 3 separate arrays while creating a temporary primary key between the 3, commit to a temp table, then migrate that over to the main table somehow.

      Right now, I insert the header record, then do a lookup to get its primary key, then insert the item (line level) into the line-level table with the primary key of the header. Again, this works, but it's slow and doesn't allow me to roll back on issues. It's very hacky/scripty.

      1. FluffyKittens
        Link Parent

        The most performant pattern is typically to throw the raw JSON into the database and manipulate it there. Toss the JSON in a staging table, begin a transaction, upsert into the derivative tables, then commit.

        If you find JSON queries too unwieldy though (which is perfectly natural - they're not easy), then start by batch processing all your records in Python, and use executemany to store the results in SQLite. You can wrap the inserts to all your tables in a single transaction so that they all get processed, or fully unwind. Batching with executemany will be on the order of ~100x faster than inserting single rows with autocommit on.
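
        Roughly what the batched version looks like with the stdlib driver - table/column names are made up, and the rows are whatever your batch-processing step produced:

        import sqlite3
        
        # lists of tuples produced by the batch-processing step
        header_rows = [('H1', 'ACME')]
        line_rows = [('H1', 1, 'widget', 5)]
        ref_rows = [('H1', 1, 'PO-123')]
        
        con = sqlite3.connect('sales.db')
        with con:  # one transaction: commits at the end, rolls back if anything raises
            con.executemany('INSERT INTO header (order_id, customer) VALUES (?, ?)', header_rows)
            con.executemany('INSERT INTO line (order_id, line_no, sku, qty) VALUES (?, ?, ?, ?)', line_rows)
            con.executemany('INSERT INTO line_ref (order_id, line_no, reference) VALUES (?, ?, ?)', ref_rows)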

        1 vote
  5. FluffyKittens
    Link

    Other users have covered the particulars of your examples pretty well, but to address your "metaknowledge" question of how to master this stuff, here's the best advice I can offer that's not self-evident:

    • Read PEP 249. It's the Python "DB-API" spec, and it defines the interface that all standard Python database drivers use.
    • Know the non-Python tricks of whichever database you want to master. For every DB I can think of, there's a better option for fast data import than whatever the Python driver will offer you. For sqlite, it's the CLI's .import command; for postgres, it's psql's \copy; for mssql, bulk copy; etc. Use them via the subprocess module instead of inserting via DB drivers when you need performance and where practical (see the sketch after this list).
    • Learn how to use INFORMATION_SCHEMA tables or the db-specific equivalent. Doing so helps you write idempotent DDL for your schema setup process, and, e.g., reuse INSERT queries with similar input patterns across multiple tables.
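
    As a rough sketch of that subprocess approach for sqlite (assuming the sqlite3 CLI is on your PATH; file and table names are placeholders):

    import subprocess
    
    # each argument after the db filename is run as if typed into the sqlite3 shell
    subprocess.run(
        ['sqlite3', 'example.db', '.mode csv', '.import data.csv my_table'],
        check=True,
    )
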
    2 votes