Scale up or scale out?
I often advocate scale-out architectures. When it comes to the choice of technology, I usually side with the guys at YouTube (http://highscalability.com/youtube-architecture) and say that you should choose the technology that allows you to be as productive and creative as possible. There is a reason we do not build web-based systems in assembler or pure C.
Granted, code written in a low-level language may run faster, but there is usually a high cost to pay in the development phase. Besides, most of the time is spent on remote calls anyway, right?
Even though I consider all of the above to be true, I want to balance the discussion by talking about the impact of performance, or rather the lack of performance, in medium to large systems. By medium to large I mean tens to hundreds of servers.
For the sake of argument and simplicity, I will define performance as the ability to deliver the same amount of meaningful work to the end user with fewer physical servers. This may not be the most stringent or correct definition, but it will do for this post.
Complexity
Consider hosting, monitoring and maintaining a system that consists of 4 servers (e.g. 2 frontend/business servers and 2 database servers). Now consider the same system scaled out to 100 servers (e.g. 80 frontend/business servers and 20 database servers). What are the differences between running the small-scale system and running the large one?
More specifically, what does this mean for:
- Deployment? 2 servers are easy to do manually, but 80?
- Rollbacks of releases?
- Monitoring? 2 processes may fit nicely on a screen, but 80?
- Hardware failure?
- Redundancy?
- Network routing?
How does monitoring and checking the pulse of 4 servers compare to doing the same for 100 servers?
Obviously it will be more complex to care for a larger system, but my argument is that the complexity grows quickly and that it grows in more than one aspect. Increased complexity will also affect many of the daily routines and project cycles in the company. Maintenance costs will certainly go up, and project throughput will most likely decline as well. Releases and infrastructure changes must suddenly be coordinated and carefully planned. New functionality and added services must account for a more intricate integration. More constraints, such as network bandwidth and a growing number of RPCs, will start to play a part. What is the total cost for the company?
Hardware
What about machine failure? According to this blog, http://www.linesave.co.uk/google_search_engine.html, Google has about 60 000 servers and predicts that 60 machines will fail every day. This means that each server has a predicted failure chance of 60 / 60 000 = 0.001 per day. Below is a chart for the chance of machine failure within a month.
As you add servers to the system, the chance that at least one server goes down increases. This will put additional load on the operations personnel.
Real life example
Our primary product at Cubeia is Firebase, a game server tailored for casual games. If we look at one of our competitors (whose name I will not mention here), we can compare our deployment requirements for a poker network setup targeted at 25 000 concurrent users. Running on Firebase we could almost host this on a single server (v1.7, eight cores, 4 GB RAM, cost approx €2000), but let's scale out to four servers for redundancy (i.e. a server failure will not bring down the system).
Our competitor states a need for:
- 13 Lobby servers
- 50 Poker game servers
All in all, 63 servers for running the same functionality (presumably, since we cannot compare every detail).
What are the costs of running a system with 4 servers versus a system with 63 servers?
Predicted chance of at least one machine failure within a month:
- 4 servers: 11.31%
- 63 servers: 84.91%
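The two figures above appear to follow from the 0.001 daily failure chance quoted earlier. The sketch below is my reconstruction, not an official formula: it assumes that failures are independent and that a month is 30 days, and the function name is made up for illustration.

```python
# Daily failure chance per server, from the figures quoted earlier:
# 60 failures per day across 60 000 servers.
DAILY_FAILURE_CHANCE = 60 / 60_000  # 0.001


def chance_of_any_failure(servers: int, days: int = 30,
                          daily_chance: float = DAILY_FAILURE_CHANCE) -> float:
    """Chance that at least one of `servers` machines fails within `days`.

    Assumes every server fails independently with the same daily chance
    (an assumption made for this sketch).
    """
    # Chance that every server survives every single day of the period.
    all_survive = (1.0 - daily_chance) ** (servers * days)
    return 1.0 - all_survive


for servers in (4, 63):
    print(f"{servers:>2} servers: {chance_of_any_failure(servers):.2%}")
```

Under these assumptions the script prints 11.31% for 4 servers and 84.91% for 63 servers, matching the list above.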
As a side note, according to this Gartner press release, http://www.gartner.com/it/page.jsp?id=1015715, a single x86 server costs about $400 per year in power alone. Running the 4 server system instead of the 63 server one saves 59 servers, so the power savings alone would be about $23 600 per year.
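The power arithmetic can be checked in a few lines. This is only a back-of-the-envelope sketch: it assumes every server in both setups draws the same $400 per year, which real hardware will not do exactly, and the names are made up for illustration.

```python
# Yearly power cost per x86 server, from the Gartner figure quoted above.
POWER_COST_PER_SERVER_USD = 400


def yearly_power_cost(servers: int) -> int:
    """Yearly power cost in USD, assuming a flat cost per server."""
    return servers * POWER_COST_PER_SERVER_USD


small_setup = yearly_power_cost(4)    # the four-server Firebase setup
large_setup = yearly_power_cost(63)   # the competitor's 63-server setup
saved = large_setup - small_setup

print(f"4 servers:  ${small_setup} per year")
print(f"63 servers: ${large_setup} per year")
print(f"Saved:      ${saved} per year")
```

The difference is 59 servers at $400 each, i.e. $23 600 per year.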
Some Last Words
I am not advocating that you should spend an insane number of man-years polishing every function call and algorithm to achieve performance in its most glorious perfection. If you are a startup or a small-scale company, then agility and release speed are probably the most important things to you right now. But as with everything in life, there is another side to consider as well, and if your system is growing, this side will become increasingly important.
So, my point is: buy that extra core-server, go for SSDs in your database, remove unnecessary CPU-intensive algorithms, and work out the contention and bottlenecks in your implementation! And be proud of it!
Keeping complexity and deployment sizes down will be important as you grow.
You can contact him at: fredrik.johansson(at)cubeia.com