Dealing with databases, inserts, updates, etc. in Python
Current library: built-in sqlite3
Current db: SQLite (but I will have access to Snowflake soon for example 1 below)
Wondering if anyone here has some advice or a good place to learn about dealing with databases with Python. I know SQL fairly well for pulling data and simple updates, but I'm running into potential performance issues with the way I've been doing it. Here are 2 examples.
-
Dealing with Pandas dataframes. I'm doing some reconciliation between a couple of different data sources. I don't have a primary key to work with, just some very specific matching criteria to determine a match (5 columns specifically: customer, date, contract, product, quantity). The matching process is all built within Python. Is there a good way to do the database commits with updates/inserts en masse vs. line by line? I've looked into upsert (or an INSERT with a clause to update existing data), but pretty much all the examples I've seen rely on primary keys, which I don't have since the data is matched on those 5 columns.
-
Dealing with JSON files which have multiple layers of related data. My database is built in such a way that I have a table for header information, one for line-level detail, then a third level with specific references assigned to the line-level detail. As with a lot of transactional databases, there can be multiple references per line and multiple lines per header. I'm currently looping through the JSON file starting with the header information to create the primary key, then going to the line-level detail to create a primary key for the line (which also includes the foreign key for the header), and doing the same with the reference data. Before inserting I do a lookup to see if the data already exists, then update if it does or insert a new record if it doesn't. This works fine, but it's slow, taking several seconds for maybe 100 inserts in total. While that's not a big deal since it's a low volume of sales, I'd rather learn best practice and do this properly with commits/transactions instead of inserting and updating each record individually, with the ability to roll back should an error occur.
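To illustrate, the current loop looks roughly like this (table, column, and JSON field names are simplified placeholders rather than the real ones):

```python
import json
import sqlite3

conn = sqlite3.connect("sales.db")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS header (id INTEGER PRIMARY KEY, order_id TEXT, status TEXT)")
cur.execute("CREATE TABLE IF NOT EXISTS line (id INTEGER PRIMARY KEY, header_id INTEGER, sku TEXT, qty REAL)")

with open("orders.json") as f:
    orders = json.load(f)

for order in orders:
    # lookup first, then update or insert -- one round trip per record
    cur.execute("SELECT id FROM header WHERE order_id = ?", (order["orderId"],))
    row = cur.fetchone()
    if row:
        header_id = row[0]
        cur.execute("UPDATE header SET status = ? WHERE id = ?", (order["status"], header_id))
    else:
        cur.execute("INSERT INTO header (order_id, status) VALUES (?, ?)",
                    (order["orderId"], order["status"]))
        header_id = cur.lastrowid
    for line in order["lineItems"]:
        cur.execute("INSERT INTO line (header_id, sku, qty) VALUES (?, ?, ?)",
                    (header_id, line["sku"], line["qty"]))
    conn.commit()  # committing per order like this is the slow part
```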
SQLite supports bulk insert via multi-row VALUES.
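For example, something along these lines (made-up table and columns; note that SQLite caps the number of bound parameters per statement, so very large batches need chunking):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recon (customer TEXT, date TEXT, contract TEXT, product TEXT, quantity REAL)"
)

rows = [
    ("ACME", "2023-01-05", "C-100", "widget", 4),
    ("ACME", "2023-01-06", "C-100", "gadget", 2),
]

# One INSERT, many rows: INSERT ... VALUES (...), (...), ...
placeholders = ", ".join(["(?, ?, ?, ?, ?)"] * len(rows))
params = [value for row in rows for value in row]
conn.execute(f"INSERT INTO recon VALUES {placeholders}", params)
conn.commit()
```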
I would recommend moving to an ORM of some kind, like SQLAlchemy, so you don't end up hand-coding SQL (blech).
Yeah, I'll need to look into this. SQL was the first language I learned so I'm quite comfortable with it.
Hand-coding SQL isn't so much about SQL itself being blech; it's about having to sanitize and otherwise properly handle all values to avoid security issues.
Ah, that makes sense. Definitely want best practice. I'm the only one running it, so I'm not too concerned about security. There's no user input; it's just pulling data from eBay and putting it into a database.
It's all nice and dandy until Bobby Tables tries to sell something ;)
Personally I wouldn't consider using an ORM a best practice. It's an abstraction layer that many devs choose to avoid (myself included). I would say it's more about how much control you want over your DB queries.
As for security, most SQL libraries warn you to use prepared statements or similar to prevent SQL injection. As long as you're following those guidelines and not inserting user input directly into your queries, you should be fine.
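With sqlite3 that just means passing values through ? placeholders instead of formatting them into the SQL string. A minimal sketch (the table and value are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listings (title TEXT)")

title = "Robert'); DROP TABLE listings;--"  # hostile-looking value

# Risky: string formatting makes the value part of the SQL text itself
# conn.execute(f"INSERT INTO listings (title) VALUES ('{title}')")

# Safe: the ? placeholder binds the value, so it is never parsed as SQL
conn.execute("INSERT INTO listings (title) VALUES (?)", (title,))
conn.commit()
```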
Interesting, thank you.
It would be helpful if you indicated what database library you are using (SQLAlchemy, Django ORM, raw drivers?) and what database you are using (MySQL, Postgres, MS-SQL, ?). Bulk insert is possible in most cases.
Sure. I'll also update the post.
I was using the raw sqlite3 driver, but I'm not opposed to changing. In the JSON instance the database is SQLite. The pandas one is currently SQLite too, but I'm going to have access to Snowflake soon as well.
This is interesting stuff. Is there any possibility of, for example, a hash collision on the output of h.hexdigest()?
Interesting. Lots of good information here. I'll be honest, it's a bit above my head, but probably not so much that with additional searches and maybe some follow-up questions I couldn't get there.
I've not used decorators or data classes (well, I've not defined them myself; I'm sure I've indirectly used them).
I like the hashing idea for the composite/natural primary key. That way I can easily store reconciled items in a database instead of keeping them in memory and then dumping to an Excel file. Previously I had to very carefully extract new data so it wouldn't overlap data I'd already pulled; this would help eliminate that... Very cool.
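Roughly what I have in mind (the exact normalization is a guess and would need to be applied identically to both data sources; from what I've read, the collision risk with SHA-256 is negligible at these row counts):

```python
import hashlib

def match_key(customer, date, contract, product, quantity):
    """Deterministic key built from the 5 matching columns."""
    # Normalization is a placeholder -- whatever it is, it has to be
    # applied the same way to both data sources before hashing.
    parts = (
        str(customer).strip().lower(),
        str(date),
        str(contract).strip(),
        str(product).strip(),
        str(quantity),
    )
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()

# The hash can then act as the key for an upsert, e.g. in SQLite 3.24+:
#   CREATE TABLE recon (match_key TEXT PRIMARY KEY, status TEXT);
#   INSERT INTO recon (match_key, status) VALUES (?, ?)
#       ON CONFLICT(match_key) DO UPDATE SET status = excluded.status;
```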
PostgreSQL supports this functionality as well.
I did find this yesterday, so I'll definitely take a look. Are you aware of a best practice for committing all of this data with sub-tables more efficiently? I was thinking about this, and in theory I could break up the JSON file into 3 separate arrays while creating a temporary primary key linking the 3, commit to a temp table, then migrate that over to the main tables somehow.
Right now I insert the header record, then do a lookup to get the primary key, then insert the item (line level) into the line-level table with the primary key of the header. Again, this works, but it's slow and doesn't allow me to roll back on issues. It's very hacky/scripty.
The most performant pattern is typically to throw the raw JSON into the database and manipulate it there. Toss the JSON in a staging table, begin a transaction, upsert into the derivative tables, then commit.
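A rough sketch of that staging-table approach with SQLite's JSON1 functions (json_each/json_extract); the JSON field names and columns are invented, and the ON CONFLICT upsert needs SQLite 3.24+:

```python
import sqlite3

conn = sqlite3.connect("sales.db")
conn.execute("CREATE TABLE IF NOT EXISTS staging_json (doc TEXT)")
conn.execute(
    "CREATE TABLE IF NOT EXISTS header ("
    " order_id TEXT PRIMARY KEY, customer TEXT, order_date TEXT)"
)

with open("orders.json") as f:
    raw = f.read()  # the whole JSON document: an array of orders

with conn:  # one transaction: commits on success, rolls back on any exception
    conn.execute("DELETE FROM staging_json")
    conn.execute("INSERT INTO staging_json (doc) VALUES (?)", (raw,))
    # Upsert headers straight out of the staged JSON via json_each/json_extract
    conn.execute(
        """
        INSERT INTO header (order_id, customer, order_date)
        SELECT json_extract(value, '$.orderId'),
               json_extract(value, '$.buyer'),
               json_extract(value, '$.creationDate')
        FROM staging_json, json_each(staging_json.doc)
        WHERE true  -- keeps the parser from reading ON CONFLICT as a join ON
        ON CONFLICT(order_id) DO UPDATE SET
            customer   = excluded.customer,
            order_date = excluded.order_date
        """
    )
    # The line-level and reference tables would follow the same pattern.
```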
If you find JSON queries too unwieldy though (which is perfectly natural - they're not easy), then start by batch processing all your records in Python and use executemany to store the results in SQLite. You can wrap the inserts to all your tables in a single transaction so that they either all get processed or fully unwind. Batching with executemany will be on the order of ~100x faster than inserting single rows with autocommit on.
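And a sketch of the batched route (again with invented names; existing-row handling is left out to keep the focus on executemany and the single transaction):

```python
import sqlite3

conn = sqlite3.connect("sales.db")
conn.execute("CREATE TABLE IF NOT EXISTS header (order_id TEXT PRIMARY KEY, customer TEXT)")
conn.execute("CREATE TABLE IF NOT EXISTS line (order_id TEXT, sku TEXT, qty REAL)")

# Rows batched up in Python beforehand. Keying lines on the natural order_id
# avoids looking up each header's generated id before inserting its lines.
headers = [("ORD-1", "alice"), ("ORD-2", "bob")]
lines = [("ORD-1", "SKU-9", 2), ("ORD-2", "SKU-4", 1)]

# One transaction for everything: commits if the block succeeds,
# rolls back automatically if any statement raises.
with conn:
    conn.executemany("INSERT INTO header (order_id, customer) VALUES (?, ?)", headers)
    conn.executemany("INSERT INTO line (order_id, sku, qty) VALUES (?, ?, ?)", lines)
```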
Other users have covered the particulars of your examples pretty well, but to address your "metaknowledge" question of how to master this stuff, here's the best advice I can offer that's not self-evident: