There is plenty of material out there on how to optimize selects, how query plans work and so on, but almost no information on how to optimize insert/update/delete operations. At least I haven’t found what I needed, so I want to tell my story.
First of all, a bit about the environment: a large part of my database serves search (it holds tens of millions of records) and needs:
a) fast search (so there must be some indexes).
b) fast updates with lots of data (several million updated records a day is usual).
c) some sort of uniqueness, so that the cache contains only the latest record for a given condition.
Basic insert/update/delete optimization is pretty simple: every index you have slows updates down, so don’t create too many of them. But there is another thing to consider: how many index pages get updated when the data is updated.
In my situation goal c) was achieved with an md5 of some fields, stored as equality_key. This makes quite a good key: it has a near-perfect distribution (values practically never repeat and they are spread very evenly). It is also convenient in operations: a single field in the index, a single equality in a join.
But there was one problem: when I tried to insert, let’s say, 100K rows with new equality keys, they were spread across almost every page of the index, so the operation was very slow, thanks to the good distribution of equality_key. Sadly, the situation only gets worse over time as the overall size grows, because updates remain just as random.
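To get a feel for the scale of the problem, here is a rough back-of-the-envelope sketch in Python. It assumes the keys land on index pages uniformly at random, which is what a hash like md5 gives you. The index size and the number of entries per page are made-up figures for illustration, not measurements from my database.

```python
def expected_pages_touched(total_pages: int, inserted_rows: int) -> float:
    """Expected number of distinct index pages hit by uniformly random keys."""
    # Probability that one particular page is missed by every inserted row.
    untouched = (1.0 - 1.0 / total_pages) ** inserted_rows
    return total_pages * (1.0 - untouched)


# Hypothetical sizes: 50M index entries, ~200 entries per page, 100K new rows.
total_pages = 50_000_000 // 200
touched = expected_pages_touched(total_pages, 100_000)
print(f"about {touched:,.0f} of {total_pages:,} pages touched")
print(f"about {100_000 / touched:.2f} new rows per touched page")
```

With these assumed sizes almost every inserted row dirties a page of its own, so the write cost is close to one page per row no matter how the batch is sent.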
The solution in my case was to «group» this index in the order in which it is updated (in my case hotel_id, departure_date, and only then equality_key). This gives much better locality for updates and has little impact on search performance (thanks to locality the impact might even be positive). I have some numbers, but they wouldn’t say much and they are not the purpose of this post. In my case update performance improved by an order of magnitude, and it shouldn’t degrade as much as the DB grows, so now I’m happy.
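A minimal sketch of what such a grouped index could look like, using Python's built-in sqlite3 module only so that the example is runnable anywhere. The table name offers, the price column and the index name are hypothetical, and my real database and schema differ. How the uniqueness from goal c) is enforced is left out here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE offers (              -- hypothetical table
        hotel_id       INTEGER NOT NULL,
        departure_date TEXT    NOT NULL,
        equality_key   TEXT    NOT NULL,  -- md5 hex of the identifying fields
        price          INTEGER NOT NULL
    );
    -- Before: CREATE INDEX idx_offers_key ON offers (equality_key);
    -- After: leading columns follow the way the data arrives in updates.
    CREATE INDEX idx_offers_grouped
        ON offers (hotel_id, departure_date, equality_key);
""")

# A lookup that supplies all three columns can still use the grouped index.
plan = conn.execute(
    """EXPLAIN QUERY PLAN
       SELECT price FROM offers
       WHERE hotel_id = ? AND departure_date = ? AND equality_key = ?""",
    (42, "2024-07-01", "0" * 32),
).fetchall()
plan_text = " ".join(row[-1] for row in plan)  # last column is the plan detail
print(plan_text)
```

The price of this layout is that joins and lookups have to supply hotel_id and departure_date along with equality_key; a search by equality_key alone can no longer use the index efficiently.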
The overall rule when optimizing inserts/updates/deletes is to look at how many index pages would be updated in a typical update. (An index can be thought of as the table records sorted in index order; the reality is quite different, but the model works nicely in many situations.) Ideally, only the pages related to the new data should be updated, almost in full, and the others should not be touched.
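The sorted-records model is easy to play with. The sketch below is a toy simulation, not a real B-tree: it sorts the existing keys, cuts them into fixed-size pages and counts how many distinct pages a batch of new keys would fall into. The page size, the data volumes and the helper names md5_key and pages_touched are all invented for the illustration.

```python
import hashlib
from bisect import bisect_right

PAGE_SIZE = 100  # hypothetical number of index entries per page


def md5_key(hotel_id, day, offer):
    """Stand-in for equality_key: md5 over the identifying fields."""
    return hashlib.md5(f"{hotel_id}|{day}|{offer}".encode()).hexdigest()


def pages_touched(existing_keys, new_keys, page_size=PAGE_SIZE):
    """Count distinct pages hit when new_keys go into a sorted index."""
    ordered = sorted(existing_keys)
    boundaries = ordered[::page_size]  # first key of every page
    return len({bisect_right(boundaries, key) for key in new_keys})


# Existing data: 200 hotels x 50 days x 10 offers = 100,000 rows (1,000 pages).
rows = [(h, d, o) for h in range(200) for d in range(50) for o in range(10)]
# One update batch: 500 new offers, all for hotel 7.
batch = [(7, d, o) for d in range(50) for o in range(10, 20)]

# Index on equality_key alone.
by_hash = pages_touched([md5_key(*r) for r in rows],
                        [md5_key(*r) for r in batch])
# Index on (hotel_id, departure_date, equality_key).
by_group = pages_touched([(h, d, md5_key(h, d, o)) for h, d, o in rows],
                         [(h, d, md5_key(h, d, o)) for h, d, o in batch])
print("pages touched, hash only:", by_hash)
print("pages touched, grouped:  ", by_group)
```

In this toy setup the hash-only index scatters the batch over hundreds of pages, while the grouped index keeps it within the handful of pages that belong to hotel 7.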
This might be very simple and obvious once you know about it. But some people (like me) don’t know it. I hope this helps.