Daniel Allen Deutsch

What We Talk About When We Talk About Scaling

June 10, 2018

I had a conversation with a co-worker recently about the different phases of scaling at companies. I thought it was interesting, because my current opinion on these phases is very different than what I would have said a year ago. Additionally, it is common for developers to say "I want to work on scaling problems". This attempts to get at what that actually means.

Phase 1: Ship as quickly as possible

This has been well documented other places, but the goal of an early-stage startup is to find product-market fit. Without revenue or customers, there is no company. So, although it hurts my soul, deploying code to production that isn't performant and doesn't have tests is not unreasonable.

Yes—future developers will come in and say "I can't believe how bad all the code written 2 years ago is!" But 1. what should be optimized for changes with the stage of the company and 2. where would the fun be if we weren't complaining about the developers that came before us?

Phase 2: Improve your application code

Do you know the difference in Rails between Event.all.sort(&:created_at).last.id and Event.order(:created_at).last.id? The first instantiates an object for every Event and loads it into memory, then compares the created_at time in ruby across all records to sort them, then grabs the last one's id. The second queries the database to get the last created_at record and then grabs the id. The point: these look very similar, but one can be *very* slow and one should always be fast.

I have a suspicion devs are quick to say "We now have X million records, everything is getting slow. We need to do {{ insert fancy idea }} to scale. The truth is, there is probably a significant amount of low-hanging fruit in the application. Examples include `n + 1` queries, executing redundant queries within the same request, loading unnecessary objects into memory, poorly indexed database tables, etc.

To find these, determining which requests are slow with application logs and database logs (slow query log) is the best place to start.

Phase 3: Fancy things

One level up is to actually introduce fancy things. The two most obvious are caching and sharding.

Caching holds a strong place in my heart, mostly because many strategies are not very difficult to implement and provide huge gains. As an example, counting is very expensive in MySQL. If you need to keep displaying the total number of messages in a user's inbox, you'll hit performance issues when you cross over a certain threshold. Instead, shove the total count in Redis (or somewhere) with a TTL (time to live) of 5 minutes the first time the user asks for it. When the total is needed again 2 seconds later, grab the count from Redis rather than re-computing.

Obviously this gets at all sorts of Product questions. But the idea of writing something down to use again later is very appealing, when the alternative is many expensive calculations. (There are even fancy ways to maintain exactness with caches. What if every time a user got an email you incremented the counter in Redis and every time a user deleted an email you decremented the counter?)

Sharding is likely a bigger undertaking, but gets at the Holy Gail of supporting infinite growth with horizontal scaling. The word 'shard' means sliver or fragment—like a shard of glass. What if instead of 1 relational database with all the data, you had 5 relational databases with one-fifth of the data? That would make queries faster!

This gets complicated because, for example, how to you do a database join across db instances? (I don't have the answer, but there are some off-the-shelf-tools.) But if you are clever, you can often circumvent the difficult problems. If you are a SaaS company, maybe you can shard by customer. Put all your data related to customers A-E on one database, customers F-J on another database, etc.

This will require a tricky migration and changes to application code, but it's reasonable. Another gotcha is that costs will go up: 5 databases will probably be more expensive than 1.

Phase 4: Running forks

At Google, Facebook, Amazon, etc. scale, all bets are off. It's my understanding that Facebook, for example, maintains its own fork of MySQL. This seems reasonable—they need to process things faster and differently than the rest of the world, so they require their own tools for their specific use-cases. My only call-out would be that you should need to be pretty big before something like this happens.

Of course, with smaller libraries/gems/crates/packages this may be necessary sooner. Most libraries aim to work with so many different shapes of code, that it is likely to make trade-offs that work against your application. Maintaining a fork or monkey-patch comes with its own set of headaches, but may be worth investigating if it helps support scaling.

Thoughts

I have spent a decent amount of time working on Phase 2 and Phase 3. I wouldn't be surprised if a whole class of things exists between Phases 3 and 4 that I just don't know about. Additionally, the Phases don't have to be linear. Introducing caching early is unlikely to hurt anyone.

But please please please, don't skip Phase 2.

Have anything to say? Questions or feedback? Tweet at me @cmmn_nighthawk!