This is a post rewrite. The original post can found at the bottom of this post.
The reason why I am rewriting this post is because Patrick Chanezon (from Google), has added a kind and respectful comment to this post. Given the huge amount of traffic this post has generated (never expected nor wanted) I don’t want to damage the GAE project which can be a great platform in the future. People haven’t understand my point apparently because of the dramatic writing style of the original post.
The reason why we left the GAE platform is solely because of its unreliability as of september-octuber-november 2010. The last 3 months we saw the site down more times that we considered reasonable and we couldn’t find a way to avoid it so eventually we decided to abandon the platform. When the GAE team be able to stabilize the platform, it will perhaps be a great platform for scalability.
I emphasized the GAE limitations because some of them were hard to work around when we were developing although we did a great job making everything work, but when the platform let me down because of its many 500 errors,I got the feeling that we had been loosing our time with a platform not enough mature.
As a young and small entrepreneur (I am the one who fund our projects training developers all around the country), the development effort was big and I calculated the amount of money and wrote it in the post, but any experienced developer knows that the quantity I wrote is not that much. Some of my readers are friends who have invested money in the project and I wanted to give them an idea of the effort we made in terms of money in the GAE platform.
When we started developing on GAE one year ago, there was less documentation that there are today. But even today, I haven’t found any document saying… “if the application can’t load into memory within 1 second, it might not load and return 500 error codes”. This is something that should be written in the documentation as soon as possible to help others understand if GAE is a good fit for their apps or not. Django can load into memory in less than a second in my local machine I guess but apparently not in GAE and we build our app in Django because Webapp framework miss many things like security. With Django we don’t have to care about XSS and i18n is easy while in Webapp it not that easy. So if GAE can’t run with Django we had to migrate. But we were using Django Helper, developed but Google engineers if I recall correctly so it is quite disappointing to read from another Google engineer that now we have to load under a second. Sounds contradictory. Sounds not professional.
Regarding the GAE for Business, I believe this is really new. I remember joining the business plan/account when it was kind of beta and I thought it was still in beta because I never got news about it. So yes, we wanted to pay for better support and SLA and at that point we couldn’t. As a developer I can’t be checking this things every day because I’ve got many things to do so I guess Google have to inform me if the business account is now working or what.
The support from GAE engineers have been really bad and this is a mistake they know but I hope in the future it gets better for those who are investing in the platform.
In our current hosting we still don’t do JOINs because we don’t need them and we’ve learned a lot of things about scalability thanks to GAE restrictions. Data DEnormalization is one of them and we keep it like that. I wrote about the limitations because wanted to warn users who haven’t read about No-SQL. When I said “normalization” in the original post I meant that before saving strings, we transform them in a way that we can search for them (uppercase and things like that). When the application was running, it wasn’t slow. We managed to optimize our algorithms and developed a caching architecture which works really well and have a nice API.
Although many people thought we have lost money for not reading docs, the reality is that we left GAE for its unreliability and nothing more.
We enjoyed certain things from the GAE platform like the feature of changing versions with a single click and the ability to add new fields to Kinds dynamically.
It is nice if GAE success and I am happy to see users who are enjoying the platform but for us, I have had enough for now. I also don’t want to try Azure. The more experience I gain, the less I trust platforms/frameworks which code I can’t see.
Thanks for reading and understanding my English which I think should be understandable.
——— Original post written on 21 nov 2010:
Choosing GAE as the platform four our project is a mistake which cost I estimate in about 15000€. Considering it’s been my money, it is a “bit” painful.
GAE is not exactly comparable to Amazon, Rackspace or any of this hosting services, because it is a PaaS (Platform as a Service) rather than just a cluster of machines. This means that you just use the platform and gain scalability, high availability and all those things we want for our websites, without any other software architecture. Cool, isn’t it.
You do not pay until you get a lot of traffic to your site so it sounds very appealing for a small startup. I read about this in a book called “The web success startup guide” by Bob Walsh.
So yes, everything looked cool and we decided to go with this platform a couple of months ago.
It supports Python and Django (without the ORM) which we love so we tried it out. We made a spike, kind of ‘hello world’ and it was easy and nice to deploy.
When we started developing features we started realizing the hard limitations imposed but the platform:
1. It requires Python 2.5, which is really old. Using Ubuntu that means that you need a virtualenv or chroot with a separate environment in order to work with the SDK properly: Ok, just a small frustration.
2. You can’t use HTTPS with your own domain (naked domain as they called) so secure connections should go though yourname.appspot.com: This just sucks.
3. No request can take more than 30 secons to run, otherwise it is stopped: Oh my god, this has been a pain in the ass all the time. When we were uploading data to the database (called datastore a no-sql engine) the uplaod was broken after 30 seconds so we have to split the files and do all kind of difficult stuff to manage the situation. Running background tasks (cron) have to be very very well engineered too, because the same rule applies. There are many many tasks that need to take more than 30 secons in website administration operations. Can you imagine?
4. Every GET or POST from the server to other site, is aborted if it has not finished within 5 seconds. You can configure it to wait till 10 seconds max. This makes impossible to work with Twitter and Facebook many times so you need intermediate servers. Again, this duplicates the time you need to accomplish what seemed to be a simple task
5. You can’t use python libraries that are build on C, just libraries written in python: Forget about those great libraries you wanted to use.
6. There are no “LIKE” operator for queries. There is a hack that allows you to make a kind of “starts with” operator. So forget about text search in database. We had to work around this normalizing and filtering data in code which took as 4 times the estimated time for many features.
7. You can’t join two tables. Forget about “SELECT table1, table2…” you just can’t. Now try to think about the complexity this introduces in your source code in order to make queries.
8. Database is really slow. You have to read about how to separate tables using inheritance so that you can search in a table, get the key and then obtain its parent in order to avoid deserialization performance and all kind and weird things.
9. Database behavior is not the same in the local development server than in google servers. So you need some manual or selenium tests to run them against the cloud, because your unit and integration tests won’t fail locally. Again this means more money.
10. If you need to query on a table asking for several fields, you need to create indexes. If those fields are big, it is easy to get the “Too many indexes” runtime exception. That will not happend to you in a “hello world” application, but enterprise applications that need to query using several parameters are easy to run into this trouble. If you use a StringListProperty, you might not find this problem until one user populates this field with a long list. Really nasty thing. We had to re-engineer our search engine once in production becase we didn’t know this till it happened.
11. No query can retrieve more than 1000 records. If there are more than 1000, you have to deal with offsets in your code to paginate results. Taking into account that you have to filter results in code sometimes because of the the limitations of GQL (the query language), this means again more complexity for your code. More and more problems, yeiiii!
11. You can’t access the file system. Forget about saving uploads to filesystem and thouse tipical things. You have to read and learn about the inconvenient blobs. Again, more development time wasted.
12. Datastore and memcache can fail sometimes. You have to defend your app against database failues. What? Yep, make sure you know how to save user data when this happens, at any time in the request process, but remember… you can’t use the filesystem. Isn’t it fun?
13. Memcache values max size is 1megabyte. Wanted to cache everything? You were just dreaming mate, just 1 megabyte. You need your own software architecture to deal with caching.
There are some more things that make the development hard with this platform but I am starting forgeting about them. Which is great.
So why did we stay with this platform? Because we were finding the problems as we go, we didn’t know many of them. And still, once you overcome all the limitations with your complex code, you are suppoused to gain scalabilty for millions of users. After all, you are hosted by Google. This is the last big lie.
Since the last update they did in september 2010, we starting facing random 500 error codes that some days got the site down 60% of the time. 6 times out of 10, users visiting the site couln’t register or use the site. Many people complained about the same in the forums while google engineers gave us no answer. After days they used to send and email saying that they were aware of the problem and were working on it, but no solutions where given. In november, an engineer answered in a forum that our website had to load into memory in less than 1 second, as pretty much every request was loading a completelly new instance of the site. 1 second? Totally crazy. We have 15000 lines of code and we have to load django. How can we make it under 1 second?
They encourage people to use the light webapp framework. Webapp is great for a “hello world” but it is bullshit for real applications.
So once we suffered this poor support from google and we understood that they were recommending GAE just for “hello world” applications, we decided to give up on it finally.
This is not serious nor professional from google. We are more than dissapointed with GAE.
For us, GAE has been a failure like Wave or Buzz were but this time, we have paid it with our money. I’ve been too stubborn just because this great company was behind the platform but I’ve learned an important lesson: good companies make mistakes too. I didn’t do enough spikes before developing actual features. I should have performed more proofs of concept before investing so much money. I was blind.
After this we wanted to control the systems so we have gone to a classical hosting based on PostgreSQL, Nginx, Apache2 and all that stuff. We relay on great system administrators who give us great support and make us feel confident. We are really happy to be able to make SQL queries and dump databases and all those great tools that we were missing in the dammed GAE.
Because we are software craftmen, we designed all our code using TDD and the rest of XP practices. Thanks to this, we migrated 15000 lines of code in just 1 week (and just a pair of developers) from one platform to another which has been a deep change. From No-SQL back to SQL. From one database access API to another. If we didn’t had these great test bateries, migration would have taken months.
Migrating data took us 1 week because of encoding problems and data transformations.
You might agree or not with me on this post but there is one certain reality: developing on GAE introduced such a design complexity that working around it pushes us 5 months behind schedule. Now that we had developed tools and workarounds for all the problems we found in GAE, we were starting being fast with the development of features eventually, and at that point we found the cloud was totally unstable, doggy.