11 votes

Dealing with databases, inserts, updates, etc. in Python

Current Library: built in sqlite
Current db: sqlite (but will have access to Snowflake soon for option 1 below)

Wondering if anyone here has some advice or a good place to learn about dealing with databases in Python. I know SQL fairly well for pulling data and simple updates, but I'm running into potential performance issues with the way I've been doing it. Here are 2 examples.

  1. Dealing with Pandas dataframes. I'm doing some reconciliation between a couple of different data sources. I do not have a primary key to work with. I have some very specific matching criteria to determine a match (5 columns specifically - customer, date, contract, product, quantity). The matching process is all built within Python. Is there a good way to do the database commits with updates/inserts en masse vs. line by line? I've looked into upsert (or inserts with a clause to update existing data), but pretty much all examples I've seen rely on primary keys (which I don't have, since the data has 5 columns I'm matching on).

  2. Dealing with JSON files which have multiple layers of related data. My database is built in such a way that I have a table for header information, a table for line-level detail, and a third table with specific references assigned to the line-level detail. As with a lot of transactional-type databases, there can be multiple references per line and multiple lines per header. I'm currently looping through the JSON file, starting with the header information to create the primary key, then going to the line-level detail to create a primary key for the line (which also includes the foreign key for the header), and doing the same again with the reference data. Before inserting I do a lookup to see if the data already exists, then update if it does or insert a new record if it doesn't. This works fine, but it's slow, taking several seconds for maybe 100 inserts in total. That's not a big deal since it's a low volume of sales, but I'd rather learn best practice and do this properly with commits/transactions, rather than inserting and updating each record individually, and with the ability to roll back should an error occur.

17 comments

  1. [7]
    unkz
    Link

    sqlite supports bulk insert via VALUES, eg.

    INSERT INTO table 
        (column1, column2) 
        VALUES 
            ('value1', 'value4'),
            ('value2', 'value5'),
            ('value3', 'value6')
    

    I would recommend moving to an ORM of some kind, like SQLAlchemy, so you don't end up hand-coding SQL (blech).
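
    For the bulk part specifically, here's a rough sketch of what that insert looks like with SQLAlchemy Core (a recent 1.4/2.0-style SQLAlchemy) - table and column names are placeholders, and it assumes the table already exists:

    from sqlalchemy import MetaData, Table, create_engine
    
    engine = create_engine('sqlite:///example.db')
    metadata = MetaData()
    # reflect the existing table definition from the database
    table = Table('my_table', metadata, autoload_with=engine)
    
    with engine.begin() as conn:  # one transaction, committed at the end
        conn.execute(
            table.insert(),
            [
                {'column1': 'value1', 'column2': 'value4'},
                {'column1': 'value2', 'column2': 'value5'},
                {'column1': 'value3', 'column2': 'value6'},
            ],
        )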

    6 votes
    1. [6]
      imperator
      Link Parent

      Yeah, I'll need to look into this. SQL was the first language I learned so I'm quite comfortable with it.

      1. [5]
        Minty
        Link Parent

        Hand-coding SQL is not so much about SQL itself being bleh, but sanitizing and otherwise properly handling all values to avoid security issues.

        2 votes
        1. [4]
          imperator
          Link Parent

          ah that makes sense. Definitely want best practice. I'm the only one running it, so not too concerned about security. There is no user input; it's just pulling data from eBay and putting it into a database.

          1. Minty
            Link Parent

            It's all nice and dandy until Bobby Tables tries to sell something ;)

            3 votes
          2. [2]
            wcedmisten
            Link Parent

            Personally I wouldn't consider using an ORM a best practice. It's an abstraction layer that many devs choose to avoid (myself included). I would say it's more about how much control you want over your DB queries.

            As for security, most SQL libraries warn you to use prepared statements or similar to prevent SQL injection. As long as you're following those guidelines and not inserting user input directly into your queries, you should be fine.
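
            For example, with the built-in sqlite3 driver a parameterized query looks roughly like this (table/column names are just placeholders) - the driver handles escaping, so the values never get spliced into the SQL string:

            import sqlite3
            
            con = sqlite3.connect('example.db')
            with con:  # commits on success, rolls back if an exception is raised
                con.execute(
                    'INSERT INTO sales (customer, quantity) VALUES (?, ?)',
                    ('Bobby Tables', 3),  # passed as parameters, not string-formatted
                )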

            3 votes
            1. imperator
              Link Parent

              Interesting, thank you.

              1 vote
  2. [2]
    unkz
    Link

    It would be helpful if you indicated what database library you are using (SQLAlchemy, Django ORM, raw drivers?) and what database you are using (MySQL, Postgres, MS-SQL, ?). Bulk insert is possible in most cases.

    5 votes
    1. imperator
      Link Parent

      Sure. I'll also update the post.

      I was using the raw sqlite driver, but I'm not opposed to changing. In the JSON instance the database is sqlite. The pandas one is currently sqlite, but I'm going to have access to Snowflake soon as well.

      1 vote
  3. [4]
    spit-evil-olive-tips
    Link
    • Exemplary

    primary keys (which I don't have since the data has 5 columns I'm matching on)

    if those 5 columns uniquely identify a row, you do have a primary key, but it's a composite key (made up of multiple columns) and a natural key (made up of columns that already exist in the data). most examples you're looking at will use an artificial primary key (such as a UUID).

    one simple option, you can always build a lookup table in your code, by using a dict where the keys are (customer, date, contract, product, quantity) tuples, and the values are the corresponding dataframes. this has the downside that it will be limited by the available memory you have, but that may not be an issue for your use case.
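
    a rough sketch of that, assuming a pandas dataframe named df with those five columns:

    # build an in-memory lookup keyed on the 5-column natural key
    lookup = {}
    for row in df.itertuples(index=False):
        key = (row.customer, row.date, row.contract, row.product, row.quantity)
        lookup[key] = row
    
    # later, while reconciling the other data source (values here are made up)
    match = lookup.get(('ACME', '2023-01-31', 'C-100', 'widget', 5))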

    you may also be able to define a unique constraint over those 5 columns, and then use upserts triggered by violations of that constraint.
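
    in sqlite (3.24 or newer) that looks roughly like this - table and column names are made up, and the extra status column is only there to have something to update on conflict:

    import sqlite3
    
    con = sqlite3.connect('example.db')
    con.execute("""
        CREATE TABLE IF NOT EXISTS matches (
            customer TEXT, date TEXT, contract TEXT, product TEXT,
            quantity INTEGER, status TEXT
        )
    """)
    # the unique constraint over the 5-column natural key
    con.execute("""
        CREATE UNIQUE INDEX IF NOT EXISTS ux_matches_key
        ON matches (customer, date, contract, product, quantity)
    """)
    
    rows = [('ACME', '2023-01-31', 'C-100', 'widget', 5, 'matched')]
    
    # upsert: insert new rows, update existing rows on a key collision
    with con:
        con.executemany("""
            INSERT INTO matches (customer, date, contract, product, quantity, status)
            VALUES (?, ?, ?, ?, ?, ?)
            ON CONFLICT (customer, date, contract, product, quantity)
            DO UPDATE SET status = excluded.status
        """, rows)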

    a bit more complicated, but a fun database trick, it's possible to generate hash-based artificial primary keys based on composite natural keys like you have. it relies on those 5 columns being indeed unique, and they must not require any fuzzy matching (such as the quantity sometimes being an integer and sometimes a float - if this were the case you'd need to normalize the values, but this would be a good practice anyway).

    for example:

    import hashlib
    
    h = hashlib.blake2b(digest_size=16)
    h.update(row.customer)
    h.update(row.date)
    h.update(row.contract)
    h.update(row.product)
    h.update(row.quantity)
    row_id = h.hexdigest()
    

    this computes row_id as a 16-byte (32 hex characters) ID, similar to a UUID, which you could use as your artificial primary key.

    I'm handwaving over some type-conversion in that code because update requires a bytes-like object, so you would need to convert each value to something the hash will accept. you can also do this by creating a dictionary with those 5 keys, serializing it to JSON (with sort_keys=True so that the output is deterministic) and then feeding that JSON into the hash function.
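
    roughly, that JSON variant looks like this (assume row is a plain dict with those 5 keys; default=str is only there so things like dates serialize):

    import hashlib
    import json
    
    def row_id(row):
        # pick out the 5 key fields and serialize them deterministically
        key = {k: row[k] for k in ('customer', 'date', 'contract', 'product', 'quantity')}
        payload = json.dumps(key, sort_keys=True, default=str).encode('utf-8')
        return hashlib.blake2b(payload, digest_size=16).hexdigest()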

    Dealing with JSON files which have multiple layers of related data.

    the way I'd structure this is by separating out the JSON file parsing from the database insertions.

    I'd write some Python dataclasses that represent the data you're parsing out of the file, and have a function that parses the JSON file and returns a bunch of those dataclass objects, without caring about whether they exist in the database or not. then a separate function that reconciles those objects with the database state. that second function can do a single transaction with all the inserts/updates.

    (this also has the side benefit that you can much more easily write unit tests of the two independent functions)
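
    a skeleton of that split, with made-up table and field names (the INSERT OR REPLACE is just a stand-in for whatever reconcile logic you actually need, and it assumes those tables have primary keys defined):

    import sqlite3
    from dataclasses import dataclass
    
    @dataclass
    class Line:
        line_no: int
        product: str
        quantity: int
        references: list[str]
    
    @dataclass
    class Order:
        order_id: str
        customer: str
        lines: list[Line]
    
    def parse_orders(path) -> list[Order]:
        # pure JSON parsing, no database access - easy to unit test
        ...
    
    def reconcile(con: sqlite3.Connection, orders: list[Order]) -> None:
        # one transaction for everything; rolls back if anything raises
        with con:
            for order in orders:
                con.execute(
                    'INSERT OR REPLACE INTO header (order_id, customer) VALUES (?, ?)',
                    (order.order_id, order.customer),
                )
                for line in order.lines:
                    con.execute(
                        'INSERT OR REPLACE INTO line (order_id, line_no, product, quantity) '
                        'VALUES (?, ?, ?, ?)',
                        (order.order_id, line.line_no, line.product, line.quantity),
                    )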

    3 votes
    1. [2]
      ebonGavia
      Link Parent

      This is interesting stuff. Is there any possibility of, for example, a hash collision on the output of h.hexdigest()?

      1 vote
      1. spit-evil-olive-tips
        Link Parent

        there's always a possibility - there are 2^128 possible hash outputs and more than that number of possible inputs, so by the pigeonhole principle collisions must exist.

        but with 128 bits of output, the likelihood of an accidental collision is essentially zero, especially with the number of rows that OP is talking about (less than a million, certainly). you can play around with the probabilities if you're curious (using a calculator like this one)

        with that calculator, you have to ask about 1 trillion items before it even shows the probability of a 128-bit collision. and even then, the chance of a collision among those 1 trillion items is approximated to be 0.00000000000014432%.

        with the blake2b hash function I used in that example, it supports variable output length, so you could use more bits if you were worried about collisions - but 128 is more than sufficient for this. even 64 bits would likely be enough for OP's use case - 100,000 items and a 64-bit hash function has a 0.000000027104% probability of collisions. and each additional bit halves the collision probability - so doubling the hash size from 64 to 128 makes collisions 18446744073709551616 (2^64) times less likely.

        this analysis also relies on blake2b being a cryptographic hash function, which has a more complicated definition but for these purposes means its output can be treated as effectively random (but of course it's not actually random, because it's deterministic for a given input). non-cryptographic hash functions might be more prone to collisions, but 128 bits provides enough of a safety margin that even a non-cryptographic hash function, as long as it's well-written, would be very unlikely to have any accidental collisions.
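
        the approximation those calculators use is the birthday bound, which is easy enough to poke at directly:

        def collision_probability(n_items, hash_bits):
            # birthday-bound approximation: p ≈ n^2 / 2^(bits + 1)
            return n_items ** 2 / 2 ** (hash_bits + 1)
        
        print(collision_probability(10 ** 12, 128))  # ~1.5e-15
        print(collision_probability(10 ** 5, 64))    # ~2.7e-10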

        3 votes
    2. imperator
      Link Parent

      Interesting. Lots of good information here. I'll be honest, it's a bit above my head, but probably not so much that I couldn't get there with some additional searching and maybe some follow-up questions.

      I've not used decorators or data classes (well, I've not defined them myself; I'm sure I've used them indirectly).

      I like the hashing idea for the composite/natural primary key. That way I can easily store reconciled items in a database instead of keeping them in memory and then dumping to an Excel file. Previously I had to very carefully extract new data so it didn't overlap data I had already pulled. This would help eliminate that... Very cool.

      1 vote
  4. [4]
    Comment deleted by author
    Link
    1. edoceo
      Link Parent
      PostgreSQL supports this functionality as well.

      PostgreSQL supports this functionality as well.

      1 vote
    2. [2]
      imperator
      Link Parent

      I did find this yesterday, so I'll definitely take a look. Are you aware of a best practice for committing all of this data with sub-tables more efficiently? I was thinking about this, and I could in theory break up the JSON file into 3 separate arrays while creating a temporary primary key between the 3, commit to a temp table, then migrate that over to the main table somehow.

      Right now, I insert the header record, then do a lookup to get its primary key, then insert the item (line level) into the line-level table with the primary key of the header. Again, this works, but it's slow and doesn't allow me to roll back on issues. It's very hacky/scripty.

      1. FluffyKittens
        Link Parent

        The most performant pattern is typically to throw the raw JSON into the database and manipulate it there. Toss the JSON in a staging table, begin a transaction, upsert into the derivative tables, then commit.

        If you find JSON queries too unwieldy though (which is perfectly natural - they're not easy), then start by batch processing all your records in Python, and use executemany to store the results in SQLite. You can wrap the inserts to all your tables in a single transaction so that they all get processed, or fully unwind. Batching with executemany will be on the order of ~100x faster than inserting single rows with autocommit on.
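
        Roughly what the batched version looks like with the stdlib driver - table/column names are made up, and the rows are whatever your batch-processing step produced:

        import sqlite3
        
        # lists of tuples produced by the batch-processing step
        header_rows = [('H1', 'ACME')]
        line_rows = [('H1', 1, 'widget', 5)]
        ref_rows = [('H1', 1, 'PO-123')]
        
        con = sqlite3.connect('sales.db')
        with con:  # one transaction: commits at the end, rolls back if anything raises
            con.executemany('INSERT INTO header (order_id, customer) VALUES (?, ?)', header_rows)
            con.executemany('INSERT INTO line (order_id, line_no, sku, qty) VALUES (?, ?, ?, ?)', line_rows)
            con.executemany('INSERT INTO line_ref (order_id, line_no, reference) VALUES (?, ?, ?)', ref_rows)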

        1 vote
  5. FluffyKittens
    Link

    Other users have covered the particulars of your examples pretty well, but to address your "metaknowledge" question of how to master this stuff, here's the best advice I can offer that's not self-evident:

    • Read PEP 249. It's the Python "DB-API" spec, and it defines the interface that all standard Python database drivers use.
    • Know the non-Python tricks of whichever database you want to master. For every DB I can think of, there's a better option for fast data import than whatever the Python driver will offer you. For sqlite, it's the CLI's .import command; for postgres, it's psql's \copy; for mssql, bulk copy; etc. Use them via the subprocess module instead of inserting via DB drivers when you need performance and where practical (see the sketch after this list).
    • Learn how to use INFORMATION_SCHEMA tables or the db-specific equivalent. Doing so helps you write idempotent DDL for your schema setup process, and, e.g., reuse INSERT queries with similar input patterns across multiple tables.
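
    As a rough sketch of that subprocess approach for sqlite (assuming the sqlite3 CLI is on your PATH; file and table names are placeholders):

    import subprocess
    
    # each argument after the db filename is run as if typed into the sqlite3 shell
    subprocess.run(
        ['sqlite3', 'example.db', '.mode csv', '.import data.csv my_table'],
        check=True,
    )
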
    2 votes