Tell me your worst experience with database performance (as a developer)
I’d like your help – and your stories!
I’m working on an article with a tentative title of “Tales of the Crypt: Horror stories (about your past) where database performance caused a real problem.” It’s meant to be schadenfreude nostalgia, about your late nights coping with a performance issue (with, hopefully, a happy ending of “…and this is what we did to fix it”).
So, what happened? Tell me about it.
I do want to quote you, but we can be oblique about the attribution – especially because sometimes these stories are from a previous employer and do not represent any current affiliation. But I do want the verisimilitude that demonstrates that these tales-of-woe come from real people. As a result, I’m fine with writing, “Kim’s first job was as a mainframe programmer at a hotel chain, where database transactions required tape changes. ‘Yada yada story,’ says Kim, who now is CIO of a Midwest insurance firm.” Real person, but you don’t need to worry about getting anyone to approve your words. (Though if you’re happy with full name, company, and role, I’m even happier; send in a private message if you prefer.)
I used an ancient example above, but I’m hoping for more recent database performance stories. Ideally some of the “here’s how we fixed it” become practical suggestions for developers who are enduring such a situation today.
My database story is likely so common as to be boring, but here goes:
(Hint: it involves indexes.)
I was renting a cloud server with dedicated CPUs for around $80/month. It was running a straightforward multi-step application.
Over time, as requirements evolved, the application code had to be updated. Of course this sometimes introduced performance regressions, including network bottlenecks and CPU bottlenecks. Adding caching logic in the application code often fixed these issues.
The story begins at one point after a large feature update. Even with the 2 dedicated CPUs, sometimes the server was overloaded with constant 100% CPU and the job would fall behind.
Taking prior history into account, I immediately assumed the CPU bottleneck was happening in step #2. So I jumped into the application code, spent a while adding caching, and swapped some dependencies for highly optimized (if uglier-API) C implementations. It helped a bit, reducing CPU load by roughly 15%, but that wasn't enough.
By now, I was already tired of this thing sidetracking me from more important work, and considered doubling the server's CPU and monthly cost and calling it a day.
But then, checking top (as I should have done initially, oops), I noticed the Postgres processes were pegging the CPU.
Doing some web research, I decided I should use Postgres' and pgAdmin's own tools to debug performance. IIRC there is a view that tracks recent SQL queries (likely pg_stat_statements) to quickly determine which ones are taking the most time.
I discovered that the SELECT queries in step #4 were really slow. But why? There were indexes on all the columns, I thought.
Double- and triple-checking the database, and related code/scripts to create the tables, I found that, yes, there were indexes covering all the columns. Compound indexes to be precise.
(Experienced devs can probably guess the answer by now.)
I then took an example query and ran it through the query analyzer in pgAdmin. To my horror, Postgres was ignoring the index and doing a sequential scan! The worst phrase you can say to a database developer is "sequential scan." That was absolutely why Postgres was struggling with the query.
What was the underlying problem though? Why was it ignoring the index? There were only 3 columns on the table, and the query itself was as simple as it gets! Why couldn't it see the index!?
The table was something as simple as:
id, category_name, another_id
The index was a compound index spanning the columns.
The query was something like:
SELECT another_id FROM the_table WHERE category_name = ?
After a while playing around, adding and removing indexes, and confirming with web searches, I learned that the order of fields in the compound index was the problem. Since the query was not using the first field of the index, Postgres was skipping the compound index entirely.
My solution was to add one more simple index on just the category_name field. I quickly verified that Postgres was now doing an index-based query, which dropped CPU by roughly 90%. Now, instead of doubling my server specs, this gave me leeway to downgrade the server to one with half the CPU and half the price!
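The leftmost-prefix rule described above can be reproduced in a few lines. This is a minimal sketch using SQLite rather than Postgres (the prefix behavior is the same for this case), and the table and index names are made up to mirror the story:

```python
import sqlite3

# In-memory database with a table shaped like the one in the story.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE the_table (id INTEGER, category_name TEXT, another_id INTEGER)")
conn.execute("CREATE INDEX idx_compound ON the_table (id, category_name)")

query = "SELECT another_id FROM the_table WHERE category_name = ?"

# The compound index starts with id, so a filter on category_name alone
# cannot use it: the planner falls back to a full table scan.
plan_before = conn.execute("EXPLAIN QUERY PLAN " + query, ("books",)).fetchone()[3]
print(plan_before)  # e.g. "SCAN the_table"

# A single-column index on category_name lets the planner do an index search.
conn.execute("CREATE INDEX idx_category ON the_table (category_name)")
plan_after = conn.execute("EXPLAIN QUERY PLAN " + query, ("books",)).fetchone()[3]
print(plan_after)  # e.g. "SEARCH the_table USING INDEX idx_category (category_name=?)"
```

Running EXPLAIN (or EXPLAIN QUERY PLAN here) before and after an index change is the quickest way to confirm the planner is actually using what you built.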
Probably the worst one I can think of was an application I used to be a developer for in the mid-2000s.
The application used a royalty-free clone of dBASE. To understand the rest of this story, you need to understand something about the state of that database technology: it was not relational. It couldn't do things we expect of modern databases, like joins. If you wanted to join two tables, you obtained cursors for each of them and rolled your own nested-loop join inside your application.
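The cursor-and-loop pattern just described looks roughly like this. A hypothetical Python stand-in (the original would have iterated dBASE cursors, and all names and data here are invented):

```python
# Two "tables" as plain row lists: (user_id, name) and (user_id, attribute).
users = [(1, "alice"), (2, "bob")]
attrs = [(1, "admin"), (1, "billing"), (2, "reader")]

def nested_loop_join(left, right):
    """Hand-rolled join: for each outer row, rescan the entire inner table."""
    results = []
    for user_id, name in left:            # outer cursor
        for attr_user_id, attr in right:  # inner cursor, rescanned every time
            if user_id == attr_user_id:
                results.append((name, attr))
    return results

print(nested_loop_join(users, attrs))
```

The cost is O(N*M): invisible on small tables, catastrophic once one "user" carries thousands of attributes, as the rest of the story shows.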
Our application, like most, had the concept of users. Each user could be assigned various attributes.
Our application was also very enterprisey, so it was typically sold through enterprise sales channels or indirectly by managed service providers. In this particular situation, the sale occurred through multiple middlemen, so we as the vendor did not get a very good contractual say as to how the product would be deployed.
Unfortunately for my employer, that contract did not release them from the obligation to provide tech support. Soon enough, support started receiving calls, and I was the developer who eventually received the escalation.
It turns out that an MSP hired by another MSP to install our product decided to hack around our per-seat licensing by creating one single user, then attaching the attributes of thousands of physical users onto that one logical user in our app.
As I mentioned before, our database did not have a query optimizer or proper execution engine. It could create individual tables, slap indexes on them, and query them individually. Complex queries were hand-rolled in the app itself, completely unequipped to deal with substantial changes to the shape of the data. There were no column statistics, there was no way to change query plans to accommodate data whose cardinality differed from what we expected. In other words, the only way to fix the customer's problem was to manually rewrite every single affected query!
As you can imagine, a situation like that was intractable in terms of responding to an urgent support request. IIRC that client was eventually "fired."
I was originally going to leave this as the conclusion of the story, until I realized that it kicked off a series of events that led to even more database performance problems in the future...
Various experiences such as the one above finally convinced upper management that we had to switch to a relational database. Since the application contained nothing but hand-rolled queries, this changeover would necessitate a massive rewrite of most of the application so that we could decouple data access from business logic.
The edict from the VP of development was that queries would be placed into stored procedures. There would be a hand-rolled ORM that would call the stored procedures and then deserialize their results into appropriate data structures in memory.
You might have noticed the preceding paragraph mentioning things like an "edict" from an executive containing specific ideas as to how to do the rewrite. You might have also thought, "uh-oh," and you would be correct.
This development team possessed a culture of distrust and micromanagement. Only managers were considered to be competent; the ICs were expected to just be drones who did what they were told. This resulted in developers who often were afraid to think for themselves; to do so would just invite a world of hurt. I don't think that the developers were stupid people, but the culture was so stifling that they were coerced into doing stupid things.
Remember how this rewrite was needed to eliminate the hand-rolled queries? Our developers were too shellshocked to recognize that the refactoring sometimes needed multiple passes: they would move the first layer of joins into stored procedures, but then take those results and join them together in business logic.
That all performed just fine under dBASE, where there was virtually no latency. It performed decently on a developer's freshly installed local VM, which ran the DBMS inside it.
It did not perform well when any amount of latency was introduced between the application and the DBMS, or when there was actual data populated in the database.
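A toy illustration of why the "join in business logic" pattern collapses under latency: every parent row costs its own round trip to the DBMS. The class and numbers below are invented for the sketch:

```python
class FakeDB:
    """Stands in for a remote DBMS; counts round trips instead of doing I/O."""
    def __init__(self):
        self.round_trips = 0
        self.users = [1, 2, 3]
        self.attrs = {1: ["a"], 2: ["b"], 3: ["c"]}

    def query_users(self):
        self.round_trips += 1
        return list(self.users)

    def query_attrs(self, user_id):
        self.round_trips += 1
        return list(self.attrs[user_id])

db = FakeDB()
# First pass: fetch the parent rows (one round trip)...
rows = db.query_users()
# ...then "join" in app code: one more round trip per parent row.
joined = [(u, a) for u in rows for a in db.query_attrs(u)]
print(db.round_trips)  # 4 round trips for 3 users: 1 + N
```

With an in-process database those 1 + N calls are nearly free; add even a few milliseconds of network latency per call and page loads degrade linearly with the data, which is exactly what the telemetry below revealed.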
Once again, nobody realized any of this until the new version of the app was being used in production.
As people started to smarten up about the situation, one developer added telemetry to track how many round-trips were required between the application and the DBMS to render a single web page in its admin panel. Each page needed dozens.
Of course, management learned the wrong lesson from this: it reinforced their views that the developers were all a bunch of mouth-breathing idiots who needed even more supervision. But that's another story for another day...
TBH I haven't had any for a long time. Database and cloud management tools have gotten really nice. The last time I had issues was a decade ago when we were using old-fashioned shared hosting, which meant slow database speeds. The worst thing that ever happened was when a table holding log data ballooned to the point where we ran out of storage space, which broke everything else. And that was a case of just not being aware that the software we were using was storing the log data that way.
It was a long time ago and I'm not sure I remember the details right, and it's not about performance, so I don't think this story is usable.
But there was a job when I was quite young and joined a startup. The database guy was on vacation. I thought the table and column names were pretty hard to understand, so I tried renaming a few things for clarity.
When he got back, we restored from backup.
But there was another time, later on, when I discovered that the daily database backup was not actually working, because the tape had gotten ejected from the drive.