You might already have heard that we’ve been working a lot with the Google App Engine platform.
When it first came out, it was something of a revolutionary solution. It gives you access to application development on a cloud-native platform similar to what Google uses internally for its public services. The key advantage is scalability, but that comes at a price: you have to be very conscious of the underlying paradigms.
In case you’re not familiar with App Engine, these are the basics:
- It was serverless before it was cool: you don’t manage servers, and while there are server (container) instances automatically provisioned for you, you primarily develop HTTP request handlers and you don’t have to care about the rest.
- It provides a scalable NoSQL database (now called Cloud Datastore, part of GCP).
- It provides other, similarly scalable services for cache, search, task queues, and so on. You get all the elements that you get from a regular enterprise platform, for example with Java EE. While these elements look similar, there are very important differences.
A trivial example is the NoSQL database. In Cloud Datastore, the amount of data you store doesn’t affect the performance of your CRUD operations and queries. The advantage here is that you don’t have to bother with capacity planning. Everything scales automatically and you pay for what you use. If your application goes viral and needs to scale tenfold, you don’t really have to do anything about it. On the other hand, there are very strict constraints on queries and consistency (among other things). For example, while you can use transactions with this database, a single transaction can only touch entities from up to 25 so-called entity groups.
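The 25-entity-group limit means a bulk update has to be split into several independent transactions. Here is a minimal sketch of just the partitioning step; `TxnBatcher` is a hypothetical helper name and the transaction logic itself is left out:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: splitting a bulk update into chunks that respect the
// 25-entity-group-per-transaction limit. Only the partitioning step is
// shown; each chunk would then be processed in its own transaction,
// which means the overall update is no longer atomic.
public class TxnBatcher {
    static final int MAX_GROUPS_PER_TXN = 25;

    // Partition entity keys into chunks of at most 25.
    static List<List<String>> partition(List<String> keys) {
        List<List<String>> chunks = new ArrayList<>();
        for (int i = 0; i < keys.size(); i += MAX_GROUPS_PER_TXN) {
            chunks.add(keys.subList(i, Math.min(i + MAX_GROUPS_PER_TXN, keys.size())));
        }
        return chunks;
    }
}
```

Note the tradeoff this makes explicit: once the work is split, a failure halfway through leaves some chunks committed and some not, which is exactly the partial-failure scenario discussed later.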
Don’t break scalability
A very important concern of application development is scalability, especially in a cloud-native environment. A traditional SQL database will not scale significantly better just because you only use it for simple use cases. Cloud Datastore scales well, but it doesn’t let you do complex things. (Note: if you haven’t come across Cloud Spanner yet, make sure to check it out.)
You can still break your application’s scalability by adding bottlenecks, for instance a globally unique sequence counter for certain business entities that you create. Scalability problems can also be caused by features relying on low-throughput queues, or background batch-processing jobs that are expected to finish in a certain time, but bottlenecks sometimes come from very surprising places.
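The classic remedy for a globally unique counter bottleneck on App Engine is a sharded counter: writes are spread across N shard entities, and reads sum them up. A minimal sketch of the idea follows, using an `AtomicLongArray` as an in-memory stand-in for the shard entities:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLongArray;

// Sketch of the sharded-counter idea: each write goes to a randomly
// chosen shard so no single entity becomes a write bottleneck; reads
// sum all shards. An AtomicLongArray stands in for N counter entities.
public class ShardedCounter {
    private final AtomicLongArray shards;

    public ShardedCounter(int shardCount) {
        this.shards = new AtomicLongArray(shardCount);
    }

    // Increment one random shard instead of a single global counter.
    public void increment() {
        int shard = ThreadLocalRandom.current().nextInt(shards.length());
        shards.incrementAndGet(shard);
    }

    // The total is the (possibly slightly stale) sum of all shards.
    public long total() {
        long sum = 0;
        for (int i = 0; i < shards.length(); i++) sum += shards.get(i);
        return sum;
    }
}
```

The price of this design is that the total is assembled from multiple reads, so it can lag slightly behind the true value; that is usually acceptable for counters, but it is exactly the kind of consistency tradeoff this article is about.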
When you start creating an application or add a new feature, you’ll make certain assumptions about the cardinality of entities and their relations. It’s reasonable not to optimize prematurely, not to create a fully scalable system upfront when you don’t know if your application will ever hit a thousand users. But if the application continues to grow, you might start facing limitations, like the maximum size of a datastore entity, the maximum size of a queue task, or the number of entities you can process in a single request or that fits safely in memory.
Failures in a cloud-native environment
As Google also emphasizes in their SRE guidelines, there’s no such thing as 100% uptime. In a cloud-native, distributed, scalable system, this has implications for application development, too, not just for operations.
This is quite hard to get used to when coming from a Java EE mindset.
Any request can fail at any time, including yours. A network connection error to your database shouldn’t be something extraordinary, but part of regular business. In a large enough system it will just happen too often to let it bother you.
An important implication of this is that you have to make all your external calls (including datastore, cache, and search) with sufficient retries and error handling. In the case of the cache, for example, your application should function properly even if the cache is fully inaccessible; this is a best practice anyway.
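A retry wrapper for such external calls could look like the following sketch. The parameter names are assumptions, and for simplicity every exception is retried; in practice you would only retry transient errors and add random jitter to the delay:

```java
import java.util.concurrent.Callable;

// Sketch: retrying an external call with exponential backoff. For
// simplicity every exception is retried; a real version would only
// retry transient errors and add jitter to the delay.
public class Retry {
    public static <T> T withRetries(Callable<T> call, int maxAttempts, long baseDelayMs)
            throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e;
                // Exponential backoff: baseDelayMs * 2^attempt.
                Thread.sleep(baseDelayMs << attempt);
            }
        }
        throw last; // all attempts failed; surface the last error
    }
}
```

Used as `Retry.withRetries(() -> datastoreClient.get(key), 3, 50)`, the call is attempted up to three times before the error is propagated (the `datastoreClient` name here is purely illustrative).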
A tricky case of an RPC failure is when your request is fulfilled on the service side but you fail to receive the response due to a network error. If you retry and the operation is not idempotent, the retry might fail and break your request even though the original operation actually succeeded. In other exotic cases, some external systems might return HTTP 200 and still not perform the requested operation. But your request can also fail unexpectedly at any time. There might be a problem in your application that causes an OutOfMemoryError, which ultimately breaks other concurrent requests.
In such cases App Engine might forcibly terminate the instance. Basically, your Java code can terminate at any point, without executing the finally blocks, just as if the VM or the entire server had crashed. This doesn’t happen very often, of course, but it does happen.
With the introduction of Cloud Firestore, the new generation of Cloud Datastore (not to be confused with the Firebase Realtime Database), new database instances also get higher consistency guarantees than previous ones. Prior to this, queries were only eventually consistent, so it took an undefined amount of time until entity updates were reflected in query results. Queries are now strongly consistent in this sense, but there can still be other surprising scenarios. For example, if you’re processing a set of entities by querying for them in decreasing order of last modification time and iterating over the results with a cursor, your query might miss certain entities if they’ve been modified in the meantime.
Consistency problems can also arise from the lack of referential integrity. In short, while you can make any kind of reference between a PostalAddress and a User entity by adding ‘id’ or ‘key’ fields on them, the database won’t ensure that the referred entities are actually there or that the references are up-to-date.
Cloud Datastore, for example, lets you work with Entity Groups to better handle consistency. There are great articles in the documentation about this topic. Other NoSQL databases provide similar solutions, but the point is that you still won’t get the full solution you’re used to getting in traditional databases.
This implies that when you read data from the datastore, you have to be sufficiently prepared for missing entities and outdated references.
For example, the user registration request might have partially failed and didn’t create the default PostalAddress that is usually created. It’s natural that a postal service will fail if there’s no postal address, but you probably want the user to be able to view their profile information even if the default postal address is missing, so you shouldn’t throw a NullPointerException.
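A defensive read like the one above could be sketched as follows, with a `Map` standing in for the datastore and simplified, hypothetical `User` and `PostalAddress` classes standing in for the real entities:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Sketch: treating a dangling reference as an expected case. The Map
// stands in for the datastore; User and PostalAddress are simplified,
// hypothetical stand-ins for the real entities.
public class ProfileReader {
    static class PostalAddress {
        final String street;
        PostalAddress(String street) { this.street = street; }
    }
    static class User {
        final String id;
        final String defaultAddressId;
        User(String id, String defaultAddressId) {
            this.id = id;
            this.defaultAddressId = defaultAddressId;
        }
    }

    static final Map<String, PostalAddress> addresses = new HashMap<>();

    // Return the referenced address if it exists, an empty Optional
    // otherwise, instead of letting a NullPointerException escape.
    static Optional<PostalAddress> defaultAddress(User user) {
        return Optional.ofNullable(addresses.get(user.defaultAddressId));
    }

    // The profile renders even when the reference is dangling.
    static String profileSummary(User user) {
        return user.id + ": " + defaultAddress(user)
                .map(a -> a.street)
                .orElse("(no postal address on file)");
    }
}
```

The point is that the missing entity is mapped to an explicit, user-visible state rather than an exception, so the rest of the profile remains accessible.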
Depending on your use case, you can of course achieve higher consistency guarantees, but the point is that this comes at a certain price in terms of scalability and robustness. The SRE mindset here will encourage you not to try to stick to full consistency all the time at any price, but rather consider the required, reasonable level of consistency for the given use case.
In a cloud-native environment, application development means being prepared for a scenario where many of your requests fail, and the ones that succeed return invalid results 🙂 Okay, the reality is not that harsh, but in large enough systems, unexpected scenarios tend to occur regularly.
Now let’s look at some important considerations to keep in mind during application development in a cloud-native environment. A simpler aspect of this is that when you implement an API that returns a certain resource, you have to carefully decide which related entities should cause an error if they’re missing, and which ones are just nice to have, served on a best-effort basis. You also have to consider how you can let the end user manually fix a certain inconsistency once it has occurred. In the case of other inconsistencies, you might want the customer workflow to be blocked to avoid further data corruption.
Things get more complicated when designing update operations, especially if external systems are involved. In App Engine, you can use a single transaction for datastore and task queue operations, but this transaction won’t extend to other services or to external calls, so you have to forget the kind of distributed transactions that you get from traditional enterprise systems.
In the case of a business update operation, you have to consider each step to see what happens if it fails or if the execution is terminated at that point. You don’t necessarily have to handle all errors gracefully. It’s okay to sometimes return HTTP 500, even to end users, but you should try not to leave the system in an inconsistent state. The worst kind of error is when business data becomes inaccessible to the end user in a way that they cannot manually fix.
If you use locking, make sure that all your locks come with a specific expiration and that the expirations are relatively short. As a request can terminate at any point or fail with an unexpected error, the lock might not be properly released. A stuck lock with an unnecessarily long expiration might block user operations for an annoyingly long time.
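An expiring lock of this kind could be sketched like this, with a `ConcurrentHashMap` standing in for memcache or datastore lock entities, and the caller supplying the current time so the expiry behavior is easy to exercise:

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch: a lock that always carries an expiration, so a request that
// dies without releasing it only blocks others for a bounded time. The
// ConcurrentHashMap stands in for memcache or datastore lock entities.
public class ExpiringLock {
    private final ConcurrentHashMap<String, Long> deadlines = new ConcurrentHashMap<>();
    private final long ttlMs;

    public ExpiringLock(long ttlMs) { this.ttlMs = ttlMs; }

    // Atomically take the lock; a lock whose deadline has passed counts as free.
    public boolean tryAcquire(String key, long nowMs) {
        boolean[] acquired = {false};
        deadlines.compute(key, (k, deadline) -> {
            if (deadline == null || nowMs >= deadline) {
                acquired[0] = true;
                return nowMs + ttlMs; // take (or re-take) the expired lock
            }
            return deadline; // still held and not yet expired
        });
        return acquired[0];
    }

    public void release(String key) { deadlines.remove(key); }
}
```

Choosing the TTL is the interesting design decision: long enough to cover a legitimately slow request, short enough that a crashed request doesn’t block users for an annoying amount of time.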
When you modify the database, try to find an order of operations that is unlikely to leave the data inconsistent. You can approximate the guarantees of a traditional system by using transactions and entity groups, but these can easily introduce inadvertent bottlenecks, so use them wisely.
Reentrancy and idempotency
In the same way that you retry external calls if they fail with specific response codes, you should expect that your requests will also be retried if they fail.
This is the default behavior, for example, with Cloud Tasks push queues. It’s also important that while Cloud Tasks guarantees durability and delivery, it only guarantees to execute a task at least once. In practice, this means you have to expect any background operation to be invoked multiple times. I’ve rarely seen this happen, but duplicate execution isn’t only caused by the queue system itself. The operation you implemented can also fail just before the very end of the request, when all business operations have already been performed and the request is just adding some diagnostic logs to certain queues; due to a small programming mistake and a rare condition, it throws an exception. The task will then be retried by the queue, according to the configured retry policy.
This means that any lock or mutex mechanism that you use has to be sufficiently re-entrant. By sufficient I mean that even if a queue task retry cannot re-enter the mutex instantly, the lock timeout and the task queue retries are configured in a way that the mutex can be re-entered in a reasonable time.
Idempotency can be trickier. You have to carefully evaluate the options depending on the business case. It’s not a problem if a forum comment is posted twice, but be more careful with money transfers. Re-evaluating all initial conditions might make the main success scenario more costly just to handle very rare failures. Other operations can be made idempotent by design.
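One common way to make an operation like a money transfer idempotent by design is to record each request’s idempotency key together with its result, so a duplicate delivery becomes a no-op. A minimal in-memory sketch of that pattern, with the map standing in for a datastore entity keyed by request id:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: making a money-transfer style operation idempotent by
// recording each request's idempotency key with its result. The map
// stands in for a datastore entity keyed by the request id.
public class IdempotentTransfers {
    private final Map<String, Long> completed = new ConcurrentHashMap<>();
    private long balance;

    public IdempotentTransfers(long initialBalance) { this.balance = initialBalance; }

    // Re-running the same requestId returns the recorded result
    // instead of moving the money twice.
    public synchronized long transfer(String requestId, long amount) {
        Long previous = completed.get(requestId);
        if (previous != null) return previous; // duplicate delivery: no-op
        balance -= amount;
        completed.put(requestId, balance);
        return balance;
    }

    public synchronized long balance() { return balance; }
}
```

In a real datastore, the key lookup and the balance update would have to go into a single transaction so that the result record and the money movement cannot diverge; that extra read is the kind of idempotency cost discussed below.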
Things get more interesting when you operate with external systems. External systems tend to respond more slowly, there is a higher chance of network errors, and they usually offer lower availability than GCP platform services. Slower response times can seriously limit the number of operations you can do in a single server request, leaving less time for retries. Ensuring idempotency can also come at a higher price: while an additional datastore read taking 10 ms can be a reasonable tradeoff to make a critical operation idempotent, one taking several hundred ms can severely affect the user experience.
For me, coming from a Java EE background, it was difficult to get used to approaching application development in a cloud-native environment this way. And, of course, this is not really about Java EE. You can develop scalable and resilient applications on top of Java EE, but industry practices and the way big application servers are constructed all guide you toward taking certain things for granted. If those things break, system operators come and fix them.
Having gotten used to the cloud approach, it’s very rewarding to see an application effortlessly scale and see how resiliently it handles certain incidents.