Re: MySQL Partitioning

From: Gordon Burditt <gordonb.3g8g3_at_burditt.org>
Date: Tue, 23 May 2017 16:36:26 -0500
Message-ID: <5KGdnQ_4z5jHNrnEnZ2dnUU7-cvNnZ2d_at_posted.internetamerica>


> I'm currently load testing different partitioning schemes but
> while that was waiting to finish, I wanted to see if anyone has any
> opinions on the different schemes that I'm testing and whether or
> not there will be gotchas in the future that I should watch for.
 

> We have a huge table that we're partitioning out (more to come)

Should I assume that this means 1,000 multi-terabyte disk drives, minimum? Or half of all the drives Google owns? There are many, many different opinions of what "huge" is when it comes to databases.

> and the two schemes we were deciding on were database sharding and
> table sharding.
 

> For database sharding, we would have many logical databases on
> one MySQL instance. Is there anything about this I should be worried
> about? Overhead of databases? I'm not too familiar with the internals
> but maybe the mapping between databases and tables is sufficient
> enough overhead that this might be an issue?

There is some overhead in having many databases, but unless you really abuse it, it's probably minor compared to other things you don't mention. MyISAM distinguishes databases on disk by making a subdirectory for each database under a common parent directory (and there is a way to distribute those directories manually, e.g. by symlinking them onto other filesystems).

If you put a LOT of files/directories (e.g. 500,000) in a single OS directory, that alone can cause a performance problem. It once made UUCP (yes, it's way obsolete) grind to a halt, because searching the spool directory for a job to run took longer than the communication timeout. That was decades ago, on much slower disks than are available today, and UUCP was eventually modified to use a separate spool directory for each partner system. I note that current software, such as mail transport agents, uses levels of subdirectories to avoid (or at least reduce) this problem with on-disk queues.

MyISAM also uses several files per table in the database's OS directory, so the same issue applies to a database with lots of tables. I suspect that one database with 100,000,000 tables and 100,000,000 databases with one table each are both worse than, say, 10,000 databases with 10,000 tables each.

However, I suspect that there won't be much difference between one database with 100 tables, 100 databases with one table each, or 10 databases with 10 tables each. Here, you might worry more about wind resistance on the user's fingernails slowing down his typing speed (yes, this is a real problem if the user is floating in mid-air hanging on to a door in a tornado). All three of those will probably result in OS directory sizes of at most 4096 bytes for the directory of databases and maybe 16384 bytes for the directory containing an individual database if the names are kept reasonably short.

Yes, you can cache the contents of directories, but you have to expect that the contents will occasionally change.

I do not know the full details of what InnoDB uses: it keeps its data dictionary in internal indexed tables, and depending on configuration the table data lives either in one big shared tablespace file or (with innodb_file_per_table) in a separate .ibd file per table inside the database's directory. Either way, this could be slow if there is a lot of dynamic creation/deletion of databases and tables.
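
If it helps to see the idea concretely, here is a rough Python sketch of how a shard key could be spread over a fixed grid of databases and tables so that no single on-disk directory gets unreasonably large. The names (shard_db_NNN, t_NNN) and the 100 x 100 split are made-up illustrations, not a recommendation for your particular data.

    # Hypothetical sketch: spread shards across databases and tables so that no
    # single OS directory (one per database under MyISAM) collects too many files.
    # The naming scheme and the 100 x 100 split are assumptions for illustration.

    def shard_location(shard_key: int, n_databases: int = 100, n_tables: int = 100):
        """Map a numeric shard key to a (database, table) pair."""
        bucket = shard_key % (n_databases * n_tables)
        db = bucket // n_tables   # which database (a directory on disk for MyISAM)
        tbl = bucket % n_tables   # which table inside that database
        return f"shard_db_{db:03d}", f"t_{tbl:03d}"

    if __name__ == "__main__":
        # 250007 is just an example key; every key lands in exactly one table.
        print(shard_location(250007))   # -> ('shard_db_000', 't_007')

Every shard key lands in exactly one table, and each database directory holds a bounded number of tables no matter how many shards you add.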

> For table sharding, we will have one database on one MySQL instance
> that will have many tables representing the shards. Are there certain
> overheads in accessing many of these different tables that I should
> worry about because they're all on one database?

Some considerations you didn't mention, which may be much more important:

Can you split this data up on several drives, permitting simultaneous I/O to different parts of the data on each drive?
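
As one way of doing that with MyISAM-style per-database directories, here's a rough sketch of the "distribute the directories manually" idea mentioned above: put each shard database on a different drive and symlink it into the MySQL data directory. The paths and mount points are invented, and you should check that your MySQL version and OS actually support symlinked database directories before depending on this.

    import os

    DATADIR = "/var/lib/mysql"                      # assumed MySQL data directory
    DRIVES = ["/disk0", "/disk1", "/disk2", "/disk3"]  # assumed mount points

    def place_database(db_name: str, index: int) -> None:
        """Create the database directory on a drive and symlink it into DATADIR."""
        real_dir = os.path.join(DRIVES[index % len(DRIVES)], "mysql-data", db_name)
        os.makedirs(real_dir, exist_ok=True)
        link = os.path.join(DATADIR, db_name)
        if not os.path.exists(link):
            os.symlink(real_dir, link)

    # Example: spread 16 shard databases over the 4 drives, 4 per drive.
    # for i in range(16):
    #     place_database(f"shard_db_{i:03d}", i)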

How much database writing is done relative to database reading? If the workload is mostly reads, then mirroring a table on N drives lets you have N reads in progress on the same table at the same time. This does not help writes, however: every change has to be applied to each mirror (though those updates can be done in parallel).
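
A quick back-of-the-envelope calculation makes that trade-off obvious. The IOPS figure and the workload numbers below are just made-up examples:

    # Rough sketch of the read/write trade-off for a table mirrored on N drives.
    def mirrored_capacity(n_mirrors: int, reads_per_sec: float, writes_per_sec: float,
                          iops_per_drive: float = 200.0) -> float:
        """Fraction of the total drive I/O capacity this workload would use."""
        read_load = reads_per_sec                 # each read hits only one mirror
        write_load = writes_per_sec * n_mirrors   # each write hits every mirror
        return (read_load + write_load) / (n_mirrors * iops_per_drive)

    # Mostly-read workload: mirroring helps a lot.
    print(mirrored_capacity(4, reads_per_sec=600, writes_per_sec=10))   # ~0.80
    # Mostly-write workload: mirroring buys almost nothing.
    print(mirrored_capacity(4, reads_per_sec=10, writes_per_sec=600))   # ~3.01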

You didn't say anything about the query load, or what kind of load it is. 10 queries per second? 1,000,000 queries per second? Is this web traffic, where it's important that 99% of the queries are answered in under 1 second? An overnight batch job, where what matters is that the whole job finishes within 8 hours, and speedy response to individual queries is not an issue? Real-time use in a nuclear-powered self-driving car, where even rare 5-second delays may result in a flaming crash? Or a mixture of all of these?
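
If you do have (or can collect) per-query timings, checking a requirement like "99% answered in under 1 second" is simple. The numbers below are fake; the latencies would come from your own query log or benchmark:

    def p99(latencies_sec):
        """Return the 99th-percentile latency (nearest-rank method)."""
        ordered = sorted(latencies_sec)
        rank = max(0, int(0.99 * len(ordered)) - 1)
        return ordered[rank]

    sample = [0.02, 0.05, 0.04, 0.9, 0.03, 1.7, 0.06] * 100  # fake measurements
    print("p99:", p99(sample), "seconds")
    print("share under 1s:", sum(t < 1.0 for t in sample) / len(sample))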

Can you put the database on SSD? Or at least the parts of the database that don't change a lot?