MySQL Database Optimization

Just to have them all in one place, and for posterity’s sake, I’ve pulled together some points on how to optimise your MySQL programming.

First off, the result of Session 1 of MySQL Camp 1:

  1. Use EXPLAIN to profile the query execution plan
  2. Use Slow Query Log (always have it on!)
  3. Don’t use DISTINCT when you have or could use GROUP BY
  4. Insert performance
    1. Batch INSERT and REPLACE
    2. Use LOAD DATA instead of INSERT
  5. LIMIT m,n may not be as fast as it sounds
  6. Don’t use ORDER BY RAND() if you have > ~2K records
  7. Use SQL_NO_CACHE when you are SELECTing frequently updated data or large sets of data
  8. Avoid wildcards at the start of LIKE queries
  9. Avoid correlated subqueries in the SELECT and WHERE clauses (and try to avoid IN)
  10. No calculated comparisons — isolate indexed columns (see the sketch after this list)
  11. ORDER BY and LIMIT work best with equalities and covered indexes
  12. Separate text/blobs from metadata, don’t put text/blobs in results if you don’t need them
  13. Derived tables (subqueries in the FROM clause) can be useful for retrieving BLOBs without sorting them. (A self-join can speed up a query if the first part finds the IDs and then uses them to fetch the rest)
  14. ALTER TABLE…ORDER BY can take data sorted chronologically and re-order it by a different field — this can make queries on that field run faster (maybe this goes in indexing?)
  15. Know when to split a complex query and join smaller ones
  16. Delete small amounts at a time if you can
  17. Make similar queries consistent so cache is used
  18. Have good SQL query standards
  19. Don’t use deprecated features
  20. Turning OR on multiple indexed fields (before 5.0) into UNION may speed things up (with LIMIT); from 5.0 onwards the index_merge optimization should pick this up.
  21. Don’t run COUNT(*) on InnoDB tables for every search; do it occasionally and/or keep summary tables, or if you need it for the total number of rows, use SQL_CALC_FOUND_ROWS and SELECT FOUND_ROWS()
  22. Use INSERT … ON DUPLICATE KEY UPDATE (or INSERT IGNORE) to avoid having to SELECT first
  23. Use a groupwise maximum instead of subqueries
  24. Avoid using IN(…) when selecting on indexed fields; it can kill the performance of the SELECT query.
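
To illustrate tips 1 and 10 above, here is a minimal sketch; the `orders` table, its columns and its indexes are hypothetical. EXPLAIN shows whether an index is used, and keeping the indexed column free of calculations lets MySQL actually use that index.

```sql
-- Hypothetical table: orders(id, customer_id, created_at),
-- with indexes on customer_id and created_at.

-- Check the execution plan; look at the "key" and "rows" columns of the output.
EXPLAIN SELECT id FROM orders WHERE customer_id = 42;

-- Bad: the calculation wraps the indexed column, so the created_at index cannot be used.
SELECT id FROM orders
WHERE TO_DAYS(NOW()) - TO_DAYS(created_at) <= 7;

-- Better: isolate the indexed column on one side of the comparison.
SELECT id FROM orders
WHERE created_at >= NOW() - INTERVAL 7 DAY;
```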

Scaling Performance Tips:

  1. Use benchmarking
  2. Isolate workloads: don’t let administrative work (e.g. backups) interfere with customer-facing performance.
  3. Debugging sucks, testing rocks!
  4. As your data grows, indexing may change (cardinality and selectivity change), and your structure may need to change too. Make your schema as modular as your code. Make your code able to scale. Plan for and embrace change, and get developers to do the same.

Network Performance Tips:

  1. Minimize traffic by fetching only what you need.
    1. Use paging/chunked data retrieval to limit result set size (see the sketch after this list)
    2. Don’t use SELECT *
    3. Be wary of lots of small quick queries if a longer query can be more efficient
  2. Use multi_query if appropriate to reduce round-trips
  3. Use stored procedures to avoid bandwidth wastage
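
As a sketch of points 1.1 and 1.2 above (the `articles` table and its columns are made up): name only the columns you need, and page through large result sets rather than pulling everything over the wire.

```sql
-- Wasteful: drags every column (including any TEXT/BLOBs) across the network.
SELECT * FROM articles;

-- Better: explicit column list, one page at a time.
SELECT id, title, published_at
FROM articles
ORDER BY published_at DESC
LIMIT 40, 20;   -- skip 40 rows, return the next 20 (i.e. page 3 at 20 rows per page)
```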

OS Performance Tips:

  1. Use proper data partitions
    1. For Cluster: start thinking about Cluster *before* you need it
  2. Keep the database host as clean as possible. Do you really need a windowing system on that server?
  3. Utilize the strengths of the OS
  4. Pare down cron scripts
  5. Create a test environment
  6. Source control your schema and config files
  7. For LVM snapshot backups of InnoDB, restore to a different instance of MySQL so InnoDB can roll forward
  8. Partition appropriately
  9. Partition your database when you have real data — do not assume you know your dataset until you have real data

MySQL Server Overall Tips:

  1. innodb_flush_log_at_trx_commit=0 can help slave lag
  2. Optimize for data types, use consistent data types. Use PROCEDURE ANALYSE() to help determine the smallest data type for your needs.
  3. Use optimistic locking, not pessimistic locking; try to use shared locks, not exclusive locks (LOCK IN SHARE MODE vs. FOR UPDATE)
  4. If you can, compress text/BLOBs
  5. Compress static data
  6. Don’t back up static data as often
  7. enable and increase the query and buffer caches if appropriate
  8. config params — http://docs.cellblue.nl/2007/03/17/easy-mysql-performance-tweaks/ is a good reference
  9. Config variables & tips:
    1. use one of the supplied config files
    2. key_buffer, unix cache (leave some RAM free), per-connection variables, innodb memory variables
    3. be aware of global vs. per-connection variables
    4. Check SHOW STATUS and SHOW VARIABLES (GLOBAL|SESSION scope in 5.0 and up); see the sketch after this list
    5. Be aware of swapping, especially on Linux (“swappiness”); bypass the OS file cache for InnoDB data files with innodb_flush_method=O_DIRECT if possible (this is also OS-specific)
    6. defragment tables, rebuild indexes, do table maintenance
    7. If you use innodb_flush_log_at_trx_commit=1, use a battery-backed hardware write cache controller
    8. More RAM is good, and so is faster disk speed
    9. use 64-bit architectures
  10. --skip-name-resolve
  11. Increase myisam_sort_buffer_size to optimize large inserts (this is a per-connection variable)
  12. Look up the memory tuning parameter for on-insert caching
  13. Increase the temp table size in a data warehousing environment (default is 32MB) so it doesn’t write to disk (also constrained by max_heap_table_size, default 16MB)
  14. Run in SQL_MODE=STRICT to help identify warnings
  15. Put the /tmp dir on a battery-backed write cache
  16. Consider battery-backed RAM for the InnoDB log files
  17. Use --safe-updates for the client
  18. Redundant data is redundant
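
A quick sketch of items 2 and 9.4 above; the `customers` table is hypothetical and the buffer size is only an example. It shows how to inspect server variables and status counters, set a per-connection variable, and ask PROCEDURE ANALYSE() (available up to MySQL 5.7) to suggest smaller data types.

```sql
-- Inspect configuration and runtime counters (GLOBAL|SESSION scope works in 5.0 and up).
SHOW GLOBAL VARIABLES LIKE 'key_buffer_size';
SHOW GLOBAL STATUS LIKE 'Qcache%';

-- Per-connection variable: raise it only for the session doing the big load.
SET SESSION myisam_sort_buffer_size = 64 * 1024 * 1024;

-- Ask MySQL to suggest the smallest data type for each column of a (hypothetical) table.
SELECT * FROM customers PROCEDURE ANALYSE();
```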

Storage Engine Performance Tips:

  1. InnoDB ALWAYS keeps the primary key as part of each index, so do not make the primary key very large
  2. Utilize different storage engines on master/slave, e.g. if you need full-text indexing on a table.
  3. BLACKHOLE engine and replication is much faster than FEDERATED tables for things like logs.
  4. Know your storage engines and what performs best for your needs, know that different ones exist.
    1. e.g. use MERGE or ARCHIVE tables for logs (see the sketch after this list)
    2. Archive old data — don’t be a pack-rat! 2 common engines for this are ARCHIVE tables and MERGE tables
  5. Use row-level instead of table-level locking for OLTP workloads
  6. Try out a few schemas and storage engines in your test environment before picking one.
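
A sketch of point 4.1 above; the log table names and columns are made up. An ARCHIVE table gives compressed, insert-only storage for historical logs, and a MERGE table presents several identical MyISAM log tables as one.

```sql
-- Compressed, insert/select-only storage for old log data.
CREATE TABLE log_archive (
  logged_at DATETIME NOT NULL,
  message   VARCHAR(255) NOT NULL
) ENGINE=ARCHIVE;

-- Identical MyISAM tables exposed as one logical table through MERGE.
CREATE TABLE log_2007 (
  logged_at DATETIME NOT NULL,
  message   VARCHAR(255) NOT NULL
) ENGINE=MyISAM;
CREATE TABLE log_2008 LIKE log_2007;

CREATE TABLE log_all (
  logged_at DATETIME NOT NULL,
  message   VARCHAR(255) NOT NULL
) ENGINE=MERGE UNION=(log_2007, log_2008) INSERT_METHOD=LAST;
```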

Database Design Performance Tips:

  1. Design sane query schemas. Don’t be afraid of table joins; they are often faster than denormalization
  2. Don’t use boolean flags
  3. Use Indexes
  4. Don’t Index Everything
  5. Do not duplicate indexes
  6. Do not use large columns in indexes if the ratio of SELECTs:INSERTs is low.
  7. be careful of redundant columns in an index or across indexes
  8. Use a clever key and ORDER BY instead of MAX
  9. Normalize first, and denormalize where appropriate.
  10. Databases are not spreadsheets, even though Access really really looks like one. Then again, Access isn’t a real database
  11. Use INET_ATON and INET_NTOA for IP addresses, not char or varchar (see the sketch after this list)
  12. make it a habit to REVERSE() email addresses, so you can easily search domains (this will help avoid wildcards at the start of LIKE queries if you want to find everyone whose e-mail is in a certain domain)
  13. A NULLable column can take more room to store than a NOT NULL one
  14. Choose appropriate character sets & collations: UTF-16 will store each character in 2 bytes whether it needs it or not, and latin1 is faster than UTF-8.
  15. Use Triggers wisely
  16. use min_rows and max_rows to specify approximate data size so space can be pre-allocated and reference points can be calculated.
  17. Use HASH indexing for indexing across columns with similar data prefixes
  18. Use the PACK_KEYS table option for int data (MyISAM)
  19. Be able to change your schema without ruining the functionality of your code
  20. Segregate tables/databases that benefit from different configuration variables
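
A sketch of points 11 and 12 above; the `users` table is hypothetical. IP addresses are stored as unsigned integers and e-mail addresses are stored reversed, so both columns can be searched through ordinary indexes without a leading wildcard.

```sql
CREATE TABLE users (
  id        INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  ip        INT UNSIGNED NOT NULL,
  email_rev VARCHAR(255) NOT NULL,
  KEY (ip),
  KEY (email_rev)
);

INSERT INTO users (ip, email_rev)
VALUES (INET_ATON('192.168.0.1'), REVERSE('someone@example.com'));

-- Convert back for display.
SELECT INET_NTOA(ip), REVERSE(email_rev) FROM users;

-- Everyone in a domain, with a trailing wildcard so the index on email_rev can be used.
SELECT id FROM users
WHERE email_rev LIKE CONCAT(REVERSE('@example.com'), '%');
```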

The second set of tips (32, he says) comes from Alexander Skakunov over at AjaxLine:

  1. Use persistent connections to the database to avoid connection overhead.
  2. Check that all tables have PRIMARY KEYs on columns with high cardinality (many distinct values, so few rows match any given key value). A `gender` column has low cardinality (selectivity); a unique user id column has high cardinality and is a good candidate to become a primary key.
  3. All references between different tables should usually be done with indexes (which also means the columns must have identical data types, so that joins based on them are faster). Also check that fields you often search on (those that appear frequently in WHERE, ORDER BY or GROUP BY clauses) have indexes, but don’t add too many: the worst thing you can do is add an index on every column of a table (I haven’t seen a table that needs more than 5 indexes, even tables 20-30 columns wide). If you never refer to a column in comparisons, there’s no need to index it.
  4. Using simpler permissions when you issue GRANT statements enables MySQL to reduce permission-checking overhead when clients execute statements.
  5. Use less RAM per row by declaring columns only as large as they need to be to hold the values stored in them.
  6. Use the leftmost index prefix: in MySQL you can define an index on several columns, and the left part of that index can be used as a separate index, so you need fewer indexes.
  7. When your index consists of many columns, why not create a hash column that is short, reasonably unique, and indexed? Then your query will look like this:
      SELECT *
      FROM table
      WHERE hash_column = MD5( CONCAT(col1, col2) )
        AND col1='aaa' AND col2='bbb';
  8. Consider running ANALYZE TABLE (or myisamchk --analyze from the command line) on a table after it has been loaded with data to help MySQL better optimize queries.
  9. Use the CHAR type when possible (instead of VARCHAR, BLOB or TEXT) when the values of a column have a constant length: an MD5 hash (32 symbols), ICAO or IATA airport codes (4 and 3 symbols), a BIC bank code (3 symbols), etc. Data in CHAR columns can be found faster than in variable-length columns.
  10. Don’t split a table if you just have too many columns. In accessing a row, the biggest performance hit is the disk seek needed to find the first byte of the row.
  11. Declare a column as NOT NULL if it really is; this speeds up table traversal a bit.
  12. If you usually retrieve rows in the same order (expr1, expr2, …), run ALTER TABLE … ORDER BY expr1, expr2, … to optimize the table.
  13. Don’t use a PHP loop to fetch rows from the database one by one just because you can; use IN() instead, e.g.:
      SELECT *
      FROM `table`
      WHERE `id` IN (1,7,13,42);
  14. Use column default values, and insert only those values that differ from the default. This reduces query parsing time.
  15. Use INSERT DELAYED or INSERT LOW_PRIORITY (for MyISAM) to write to your change log table. Also, if it’s MyISAM, you can add the DELAY_KEY_WRITE=1 table option; this makes index updates faster because they are not flushed to disk until the table is closed.
  16. Consider storing user session data (or any other non-critical data) in a MEMORY table; it’s very fast.
  17. For your web application, images and other binary assets should normally be stored as files. That is, store only a reference to the file rather than the file itself in the database.
  18. If you have to store big amounts of textual data, consider using a BLOB column to hold compressed data (MySQL’s COMPRESS() seems to be slow, so gzipping on the PHP side may help) and decompressing the contents on the application server side. In any case, benchmark it.
  19. If you often need to calculate COUNT or SUM based on information from a lot of rows (articles rating, poll votes, user registrations count, etc.), it makes sense to create a separate table and update the counter in real time, which is much faster. If you need to collect statistics from huge log tables, take advantage of using a summary table instead of scanning the entire log table every time.
  20. Don’t use REPLACE (which is DELETE+INSERT and wastes ids): use INSERT … ON DUPLICATE KEY UPDATE instead (i.e. INSERT plus UPDATE if a conflict takes place). The same technique works when you would otherwise first SELECT to find out whether the data is already in the database, then run either INSERT or UPDATE. Why choose yourself? Rely on the database side (see the sketch after this list).
  21. Tune MySQL caching: allocate enough memory for the buffer (e.g. SET GLOBAL query_cache_size = 1000000) and define query_cache_min_res_unit depending on average query resultset size.
  22. Divide complex queries into several simpler ones — they have more chances to be cached, so will be quicker.
  23. Group several similar INSERTs into one long INSERT with multiple VALUES lists to insert several rows at a time: the query will be quicker because connecting, sending and parsing a query costs 5-7 times as much as the actual data insertion (depending on row size). If that is not possible, use START TRANSACTION and COMMIT if your database is InnoDB, otherwise use LOCK TABLES; this benefits performance because the index buffer is flushed to disk only once, after all INSERT statements have completed. In this case, unlock your tables every 1000 rows or so to allow other threads access to the table.
  24. When loading a table from a text file, use LOAD DATA INFILE (or my tool for that); it’s 20-100 times faster.
  25. Log slow queries on your dev/beta environment and investigate them. This way you can catch queries whose execution time is high, queries that don’t use indexes, and also slow administrative statements (like OPTIMIZE TABLE and ANALYZE TABLE).
  26. Tune your database server parameters: for example, increase buffers size.
  27. If you have lots of DELETEs in your application, or updates of dynamic-format rows (if a table has a VARCHAR, BLOB or TEXT column, its rows have dynamic format) in a MyISAM table to a longer total length (which may split the row), schedule an OPTIMIZE TABLE run every weekend via cron. This defragments the table, which means faster queries. If you don’t use replication, add the LOCAL keyword to make it faster.
  28. Don’t use ORDER BY RAND() to fetch several random rows. Fetch 10-20 entries (the last by time added or by ID) and use array_rand() on the PHP side. There are also other solutions.
  29. Consider avoiding the HAVING clause; it’s rather slow.
  30. In most cases, a DISTINCT clause can be considered as a special case of GROUP BY; so the optimizations applicable to GROUP BY queries can be also applied to queries with a DISTINCT clause. Also, if you use DISTINCT, try to use LIMIT (MySQL stops as soon as it finds row_count unique rows) and avoid ORDER BY (it requires a temporary table in many cases).
  31. When I read “Building Scalable Web Sites”, I found that it is sometimes worth de-normalising some tables (Flickr does this), i.e. duplicating some data in several tables to avoid expensive JOINs. You can maintain data integrity with foreign keys or triggers.
  32. If you want to test a specific MySQL function or expression, use the BENCHMARK() function to do that.
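
To illustrate points 20 and 23 above, a minimal sketch; the `counters` and `log` tables are hypothetical. The first statement is an upsert that replaces the SELECT-then-INSERT/UPDATE pattern, the second inserts several rows in one round-trip.

```sql
-- Upsert: insert the row, or bump the counter if the key already exists
-- (assumes page_id is the primary key or a unique key).
INSERT INTO counters (page_id, views)
VALUES (42, 1)
ON DUPLICATE KEY UPDATE views = views + 1;

-- One connection, one parse, several rows.
INSERT INTO log (logged_at, message) VALUES
  (NOW(), 'first'),
  (NOW(), 'second'),
  (NOW(), 'third');
```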

Now some of the points need to be read together, e.g. the advice on the IN() function depends on your indexing.
But some good pointers, nonetheless! Also read this one, over at mysqlperformancetuning.com