Thursday, November 14, 2013

Will AWS's plans for PostgreSQL RDS help it finally pick up?

"Amazon to add Postgres to its most-favored database list" says GigaOM:

http://gigaom.com/2013/11/12/amazon-to-add-postgres-to-its-most-favored-database-list/
"To many this is no-brainer. Amazon wants to support the databases that its developer audiences want to use. This is simply a  case of Amazon responding to user demand and oh-by-the-way making its cloud infrastructure more attractive to a specific target audience. Some say Postgres has gained traction since Oracle’s acquisition of MySQL via its Sun buyout a few years back."

Some people I know said "yeah, the writing was on the wall...". Well, was it?? Really?

AWS finally got the time to "plan" for supporting Postgres now? After supporting MySQL, Oracle and SQL Server for almost 3 years?! Writing was on the wall? Where can I find a wall this old?

PostgreSQL has not picked up. 

This is why it is a distant 4th on Amazon's list. The writer of the text above also makes a clear effort not to pick a side here... "to many this is a no-brainer" or "some say Postgres has gained traction".

It has been around for ages, through many "oh! it's now happening!" events, such as the acquisition of MySQL by Sun, then by Oracle...

Technically, PostgreSQL's few superior capabilities, especially around online schema modifications (which are getting more important these days!), probably could not change its fate; it is still held back by too many inferior capabilities around performance, robustness and ecosystem...
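To make the schema-modification point concrete: PostgreSQL's DDL is transactional, so a schema change can be bundled with data changes and rolled back as one unit, something MySQL's implicitly-committing DDL can't do. A minimal sketch (Go, assuming the lib/pq driver; table, column and connection details are made up):

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // assumption: the lib/pq PostgreSQL driver is used
)

func main() {
	// Connection string is illustrative only.
	db, err := sql.Open("postgres", "postgres://user:pass@localhost/appdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// In PostgreSQL, DDL participates in transactions: the ALTER TABLE below is
	// invisible to other sessions until COMMIT and can be rolled back together
	// with the data change that depends on it.
	tx, err := db.Begin()
	if err != nil {
		log.Fatal(err)
	}
	if _, err := tx.Exec(`ALTER TABLE orders ADD COLUMN priority INT`); err != nil {
		tx.Rollback()
		log.Fatal(err)
	}
	if _, err := tx.Exec(`UPDATE orders SET priority = 1 WHERE amount > 1000`); err != nil {
		tx.Rollback() // the new column disappears together with the failed update
		log.Fatal(err)
	}
	if err := tx.Commit(); err != nil {
		log.Fatal(err)
	}
}
```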

So - with plans for RDS, will Postgres now pick up? 

Feel free to share your thoughts... 

Wednesday, April 24, 2013

Concurrency is not parallelism

Not so new, but still a good piece of reading: http://blog.golang.org/2013/01/concurrency-is-not-parallelism.html
"Concurrency is the composition of independently executing processes, while Parallelism is the simultaneous execution of (possibly related) computations"
As I wrote several times in the past, in OLTP throughput is king, and concurrency is the main thing that is put to the test.

Concurrency is where Facebook gets a million "Like"s every second: each "Like" is independent, and they all need to be processed concurrently.

Parallelism is where a few concurrent activities, say a few analytic reports, run in Oracle Exadata, Vertica or GreenPlum. Every report is sliced into many related computations that execute simultaneously.
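To make the distinction concrete, here is a minimal Go sketch (the workloads and function names are mine, not from the talk): handleLikes processes many independent events concurrently, while runReport slices one job into related computations that run in parallel and get merged back into a single result.

```go
package main

import (
	"fmt"
	"sync"
)

// Concurrency: many independent "Like" events, each one a tiny transaction of its own.
func handleLikes(likes <-chan int) {
	var wg sync.WaitGroup
	for like := range likes {
		wg.Add(1)
		go func(id int) { // each "Like" is processed independently of all the others
			defer wg.Done()
			fmt.Println("stored like", id)
		}(like)
	}
	wg.Wait()
}

// Parallelism: one report, sliced into related computations that run simultaneously
// and are merged back into a single result.
func runReport(rows []int) int {
	const workers = 4
	chunk := (len(rows) + workers - 1) / workers
	partials := make(chan int, workers)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		lo, hi := w*chunk, (w+1)*chunk
		if lo > len(rows) {
			lo = len(rows)
		}
		if hi > len(rows) {
			hi = len(rows)
		}
		wg.Add(1)
		go func(part []int) {
			defer wg.Done()
			sum := 0
			for _, v := range part {
				sum += v
			}
			partials <- sum // one piece of the same single job
		}(rows[lo:hi])
	}
	wg.Wait()
	close(partials)
	total := 0
	for p := range partials {
		total += p
	}
	return total
}

func main() {
	likes := make(chan int, 3)
	for i := 1; i <= 3; i++ {
		likes <- i
	}
	close(likes)
	handleLikes(likes)

	fmt.Println("report total:", runReport([]int{1, 2, 3, 4, 5, 6, 7, 8}))
}
```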

Are these the same?

From 50,000 feet we see many things running at the same time: in parallel, concurrently, maybe even distributed. But we need to be accurate. There is a huge difference, and it is in the source: how many "original" transactions did we have to process? A million "Like"s vs. a few big analytic reports. In both cases I see a million operations coming out the back, but:
In the "Like"s use case - those are the real transactions, concurrently running, distributed.
In the report use case - those are million pieces of the same initial single job.

Important! Not to be confused! Big difference! One is great for throughput scalability and one is not. More in my next post.


Wednesday, April 3, 2013

MySQL thread pool and scalability examples

Nice article about the SimCity outage and ways to defend databases: http://www.mysqlperformanceblog.com/2013/03/16/simcity-outages-traffic-control-and-thread-pool-for-mysql/

The graphs showing throughput with and without the thread pool come from a benchmark performed by Oracle, taken from here:
http://www.mysql.com/products/enterprise/scalability.html

The main takeaway is this graph (all rights reserved to Oracle, picture at original URL):
20x Better Scalability: Read/Write
Scalability is where throughput can grow and grow as demand grows. I need to get more from the database, and the question is: "can it scale to give it to me?". Scalability is where the response time remains "acceptable" while the throughput grows and grows.

Every database has a "knee point".
  1. In the best-case scenario, at this knee point throughput goes into a flat plateau, and at the same point, BTW, response time starts climbing past the non-acceptable point.
  2. In the worse-case scenario, at this knee point throughput, instead of a flat plateau, takes a plunge, and at the same point, BTW, response time climbs fast to the roof.
Actually, the red best-case scenario is pretty bad... There's NO scalability there; throughput has a hard limit! It's around 6,500 transactions per second. I need to do more on my DB, there are additional connections - but the DB is not giving even 1 inch of additional throughput. It doesn't scale.

The thread pool feature is no more than a defense mechanism. It doesn't break the scalability limit of a single machine; rather, its job is to defend the database from death.
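To illustrate what such a defense mechanism does, here is a rough application-side analogue in Go (this is my own sketch, not MySQL's actual thread pool plugin; the driver, DSN and table are assumptions): cap how many statements hit the database at once, so a connection spike queues up instead of pushing the server past its knee point.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql" // assumption: this MySQL driver is used
)

// maxActive caps how many statements run inside the database at once.
// Everything above the cap waits in line instead of piling up in the server -
// the same "defense" role the thread pool plays on the server side.
const maxActive = 32

var slots = make(chan struct{}, maxActive)

func guardedExec(db *sql.DB, query string, args ...interface{}) error {
	slots <- struct{}{}        // take a slot, or block until one frees up
	defer func() { <-slots }() // return the slot when the statement is done
	_, err := db.Exec(query, args...)
	return err
}

func main() {
	// DSN is illustrative only.
	db, err := sql.Open("mysql", "user:pass@tcp(127.0.0.1:3306)/appdb")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Even if thousands of goroutines call guardedExec at the same time,
	// at most maxActive statements are in flight against the database.
	if err := guardedExec(db, "UPDATE counters SET hits = hits + 1 WHERE id = ?", 1); err != nil {
		log.Fatal(err)
	}
}
```

On the client side, Go's database/sql offers db.SetMaxOpenConns for a similar per-application cap; the point of the server-side thread pool is that it protects the database from all clients at once.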

Real scalability is when the throughput graph is neither dropping nor going flat - it goes up and up and up with a stable response time. This can be achieved only by scale-out. Get 7,500 TPS from 1 database with 32 connections, then add an additional database and the straight line going up will reach, say, 14,000. A system with 3 databases can support 96 connections and 21,000 TPS... and on and on it goes... 

Data needs to be distributed across those databases, so the load can be distributed as well. Maintaining this distributed data on the scaled-out database is the key... I'll touch on that in future posts. 

Tuesday, March 26, 2013

They say: "Relational Databases Aren't Dead"

This is a good read, claiming: "Relational Databases Aren't Dead. Heck, They're Not Even Sleeping", http://readwrite.com/2013/03/26/relational-databases-far-from-dead. A key quote:
"While not comprehensive, the uses for NoSQL databases center around the acquisition of fast-growing data or data that does not easily fit within uniform structures."

There were 2 parts in the statement about NoSQL's uses. I'll start with the latter:


"data that does not easily fit within uniform structures" - NoSQL is probably the right choice, hmm although I always encourage thinking and architecting in advance. And also online structure changes do exist in the RDBMS world and recently in MySQL: http://dev.mysql.com/doc/refman/5.6/en/innodb-online-ddl.html...
I would definitely warn about the caveats of NoSQL when it comes to actually using and querying the data that is so easily stored there...

"acquisition of fast-growing data" - is no longer a no-go for RDBMS and MySQL database. Distributed RDBMS solutions do exist today and they can exploit performance and scalability from the good old MySQL itself

What do you think?

Friday, January 4, 2013

Partial replication and sharding

I came across this: http://stackoverflow.com/questions/14136633/difference-between-partial-replication-and-sharding
I was wondering if sharding is an alternate name for partial replication or not. What I have figured out is that:
  • Partial Repl. – each data item has only copies at some but not all of the nodes (‘Sharding’?)
  • Pure Partial Repl. – has only copies of a subset of the data item but no node contains a full copy of the database
  • Hybrid Partial Repl. – a set of nodes are full replicas and another set of nodes are partial replicas

I thought it was a good topic, and I wrote a really nice answer, but there was a problem when I pressed "post this answer", probably some error or mistake on their side. Anyway - this is what I have to say about partial replication and sharding:

Partial replication is an interesting approach: you distribute the data with replication from a master to slaves, each holding only a portion of the data. Eventually you get an array of smaller, read-only DBs, each containing a portion of the data. Reads can very well be distributed and parallelized. 

But what about the writes? 

Those are still clogged in 1 big, fat, lazy master database. Tasks such as buffer management, locking, thread locks/semaphores and recovery are the real bottleneck of OLTP; they make writes impossible to scale... As I wrote in many previous posts, for example here: http://database-scalability.blogspot.com/2012/08/scale-up-partitioning-scale-out.html.
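A rough sketch of the shape of the problem (Go; the host names and the modulo split are my own illustration, not any particular product): reads fan out across the partial replicas by key, but every single write still funnels into the one master.

```go
package main

import "fmt"

// A partial-replication topology, roughly: one master takes ALL the writes,
// while each read-only replica holds only a slice of the data.
type Topology struct {
	Master   string
	Replicas []string // replica i holds the keys where key % len(Replicas) == i
}

// Reads are distributed: pick the replica that holds this key's portion.
func (t Topology) ForRead(key int) string {
	return t.Replicas[key%len(t.Replicas)]
}

// Writes are not distributed: no matter the key, they all land on the master,
// which is exactly where the single-machine bottleneck stays.
func (t Topology) ForWrite(key int) string {
	return t.Master
}

func main() {
	t := Topology{
		Master:   "master-db:3306",
		Replicas: []string{"replica-0:3306", "replica-1:3306", "replica-2:3306"},
	}
	fmt.Println("read key 17 from", t.ForRead(17))  // spread across replicas
	fmt.Println("write key 17 to ", t.ForWrite(17)) // always the same master
}
```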

Sharding is where every piece of data lives in only one place, within an array of DBs. Each database is the complete owner of its data: that data is read from there and written to there. This way, reads and writes are distributed and parallelized, and real scale-out can be achieved. 
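By contrast, a minimal sharding sketch (hash-by-key routing; the shard names and the modulo scheme are my own illustration): the same routing function serves both reads and writes, so both spread across the array of DBs.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shards is the array of DBs; each piece of data lives on exactly one of them.
var shards = []string{"shard-0:3306", "shard-1:3306", "shard-2:3306", "shard-3:3306"}

// shardFor maps a sharding key (say, a user ID) to the one database that owns it.
// Reads AND writes for that key both go there, so both scale out.
func shardFor(key string) string {
	h := fnv.New32a()
	h.Write([]byte(key))
	return shards[h.Sum32()%uint32(len(shards))]
}

func main() {
	for _, user := range []string{"alice", "bob", "carol"} {
		fmt.Printf("user %s -> %s (reads and writes)\n", user, shardFor(user))
	}
}
```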

The idea behind sharding is great - it's the ultimate scaling solution (ask Facebook, Google, Twitter and all the other big guys) - but it's a mess to handle and maintain. It's hard as hell if done by yourself; ScaleBase provides an automatic, transparent scale-out machine that does all that, so you won't have to...