Cubeia Firebase comes with very good High Availability support right out of the box: a Firebase cluster isn't supposed to drop any events at all, within reasonable, pragmatic boundaries. For reference, our release testing loads a cluster with 1000 events per second while testing fail-over. This corresponds to at least 20 000 poker players (depending on player speed), so it should be pretty safe even for you.
The Problem
Internally, Firebase replicates events and state to minimize the risk of data loss. However, there is one place where this is particularly hard, and that is between the clients and the server. Here's why:
Because Firebase maps incoming events to players one-to-many rather than one-to-one, we have chosen not to extend the transaction boundaries to the clients themselves, as this would be extremely costly.
This means that if a client loses its TCP connection unexpectedly, messages may be lost. This can come about in several scenarios, for example:
- A client role node in the Firebase cluster crashes. This will normally disconnect all players from the node, and upon reconnection they will be load balanced to another server.
- The client application crashes. When opened again, the client will reconnect to the Firebase cluster.
- A glitch in the routing between the client and the server disconnects the client.
In any of the cases we’ll look at today, the client will be able to realize it has lost the connection and reconnect.
Preconditions
The only thing that is really necessary to start handling reconnects gracefully is for the client to recognise that it is reconnecting. If the reconnect is handled in-process this should be trivial, and when a client starts anew it may look at a cookie or saved state to see if it was closed correctly, i.e. a "dirty state" flag.
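As a sketch of the idea (the file name and mechanism here are placeholders, not anything Firebase mandates), a desktop client might use a marker file that is created on startup and deleted on clean shutdown:

```java
import java.io.File;
import java.io.IOException;

public class DirtyStateFlag {

    private final File marker = new File(System.getProperty("user.home"), ".mygame-running");

    /** Returns true if the previous session did not shut down cleanly. */
    public boolean wasDirtyShutdown() throws IOException {
        boolean dirty = marker.exists();
        marker.createNewFile(); // mark this session as running
        return dirty;
    }

    /** Call on clean shutdown to clear the flag. */
    public void markCleanShutdown() {
        marker.delete();
    }
}
```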
Also, you need to mark crucial events with IDs and optionally a flag which tells you whether an event is "resent" (more on this below), and also make sure such events have corresponding answers from the server. Remember that Firebase is an asynchronous system: you send a critical event, and some time later you get an answer, hence the importance of an event ID.
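As an illustration (the class and field names are made up for this post, they are not part of the Firebase API), a critical event could be as simple as the class below, with the server response carrying the same ID back:

```java
import java.io.Serializable;

/** A client event that must not be lost; the ID ties it to its response. */
public class CriticalEvent implements Serializable {

    private final long eventId;   // unique per client session
    private final boolean resent; // true when replayed after a reconnect
    private final byte[] payload; // the actual game action, e.g. a bet

    public CriticalEvent(long eventId, boolean resent, byte[] payload) {
        this.eventId = eventId;
        this.resent = resent;
        this.payload = payload;
    }

    public long getEventId() { return eventId; }
    public boolean isResent() { return resent; }
    public byte[] getPayload() { return payload; }
}
```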
And which events are "critical"? That's up to you really, but probably those that can significantly alter the outcome of the game: the events you really don't want to lose. The key word is "significantly"; you probably don't want to enforce a call/response pattern on all events in all games. In gambling, most events such as bet, call and fold are significant, whereas in a social shooter perhaps very few are.
Idempotency to the Rescue
In event processing, idempotency usually means that an event can be resent without changing the outcome. In an idempotent system, if event X is sent ten times, only the first one received will be executed; the rest will be ignored. For Firebase this means that the client should be able to send a critical event several times, and if the event has already been acted upon, the server will simply respond as if it had just been executed.
Let’s break it down:
- The client must keep a backlog of sent "critical" events for which it has yet to receive answers. For each critical event it saves the ID in a map together with the entire event, and for each response to a critical event the ID is removed from the map. If the client is not awaiting any response, the map will be empty.
- When a reconnection occurs, the client should check whether the above map is empty; if not, any event therein may theoretically have been dropped. Each event in the map should be resent, optionally with a flag detailing that this is indeed a resent event (see the client sketch after this list).
- The server should keep a map of ID to response objects for a reasonable amount of time, or be able to recalculate a response without changing the internal state. So if it receives an event it has already executed, it should recalculate the response, or fetch the response from the map (see the server sketch below).
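On the client side, the bookkeeping from the first two bullets might look like this sketch, reusing the CriticalEvent class from above (the Connection interface is a hypothetical stand-in for your transport):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Client-side tracking of critical events awaiting a server response. */
public class CriticalEventTracker {

    private final Map<Long, CriticalEvent> pending = new ConcurrentHashMap<>();

    /** Record the event as it is sent. */
    public void onSent(CriticalEvent event) {
        pending.put(event.getEventId(), event);
    }

    /** Remove the event when its response arrives. */
    public void onResponse(long eventId) {
        pending.remove(eventId);
    }

    /** On reconnect, resend everything still unanswered, flagged as resent. */
    public void onReconnect(Connection connection) {
        for (CriticalEvent event : pending.values()) {
            connection.send(new CriticalEvent(event.getEventId(), true, event.getPayload()));
        }
    }

    /** Hypothetical transport abstraction. */
    public interface Connection {
        void send(CriticalEvent event);
    }
}
```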
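And on the server side, the simplest variant of the third bullet is a bounded cache of responses keyed by event ID, sketched here with a LinkedHashMap acting as an LRU cache (in production you would likely want to size and expire it more carefully):

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Server-side idempotency guard: replays cached responses for duplicate IDs. */
public class IdempotentProcessor {

    private static final int MAX_CACHED = 10_000;

    // LRU cache of eventId -> response, evicting the oldest entries.
    private final Map<Long, byte[]> responses = new LinkedHashMap<Long, byte[]>(16, 0.75f, true) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
            return size() > MAX_CACHED;
        }
    };

    public synchronized byte[] process(long eventId, byte[] payload) {
        byte[] cached = responses.get(eventId);
        if (cached != null) {
            return cached; // duplicate: replay the answer, do not execute again
        }
        byte[] response = execute(payload); // actually changes game state
        responses.put(eventId, response);
        return response;
    }

    private byte[] execute(byte[] payload) {
        // game-specific logic goes here; echoed for the sake of the sketch
        return payload;
    }
}
```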
Example: Load Test
The load test we're using internally is very simple: each client sends a sequence of integers and the server simply echoes the integers right back. So the client may send "5" and will wait for the server to send "5" back. It then waits for X milliseconds before increasing the sequence to "6" and repeating. This simple game allows us not only to determine the exact load at any given time (i.e. events per second), but also to verify that Firebase honours its FIFO behaviour.
So the client keeps a queue of integers it expects the server to respond to, and it also has an integer sequence to create new events from. The queue of integers acts as the client-side map: if it is empty, there are no outstanding responses. On reconnect the client checks the queue and resends any integers found in it. The server simply echoes the integers back to the client, but also checks that the sequence is intact (again verifying FIFO ordering), except in the case of reconnects, when the sequence is allowed to jump back a few steps.
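A stripped-down version of that client logic might look like this, with the queue standing in for the backlog map (the Connection interface is again a hypothetical transport; the bookkeeping is the point):

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Skeleton of the echo load test client: a sequence out, a queue of expected echoes. */
public class EchoClient {

    private final Deque<Integer> awaiting = new ArrayDeque<>();
    private int sequence = 0;

    /** Send the next integer in the sequence and remember it. */
    public void sendNext(Connection connection) {
        int next = ++sequence;
        awaiting.addLast(next);
        connection.send(next);
    }

    /** Verify the echo matches the oldest outstanding integer (FIFO check). */
    public void onEcho(int value) {
        Integer expected = awaiting.pollFirst();
        if (expected == null || expected != value) {
            throw new IllegalStateException("FIFO violation: expected " + expected + ", got " + value);
        }
    }

    /** On reconnect, resend everything we never got an echo for. */
    public void onReconnect(Connection connection) {
        for (int outstanding : awaiting) {
            connection.send(outstanding);
        }
    }

    /** Hypothetical transport abstraction. */
    public interface Connection {
        void send(int value);
    }
}
```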
Finally
Hopefully this has straightened out a few questions. Keeping strict idempotency in any system is tricky, but it will give you a very robust platform, and paired with Firebase it will be damn near bullet-proof!