Trading Performance and Quality
Posted: March 1, 2011
In a tweet to Michael Feathers (of Working Effectively with Legacy Code fame), I asked if performance is an excuse for writing legacy code. The answer: “performance is a case where you’d better have tests”, and I agree wholeheartedly with him. Performance improvement in existing, non-trivial business applications is mostly accomplished by measuring and refactoring code, since the programming language and the platform are already chosen. Unfortunately, when confronted with legacy code in an application of non-trivial complexity, measuring is difficult but feasible given a lot of patience and resources, while refactoring is almost out of reach.
This is exactly why unlegacity is important when performance is an issue. The key is control. Automated testing and quality help a team control their software, making it easier to improve performance by measuring and refactoring. If traffic to your website doubles, throwing twice as many servers in the pool might not work as well as expected, but will probably stop the hemorrhaging. If traffic quadruples in a matter of months or weeks, trouble is looming. Now, on top of that, imagine not having control over the software. All of a sudden, that guy you’re paying to maintain the website cannot handle all the requests thrown at him. Management sends more soldiers to war and chaos slowly builds. You realize nobody except this guy knows the caprices and the sinuous paths leading to the hidden secrets of the application. Even he opens a few Pandora’s boxes on the way, and all hell breaks loose in a matter of days. Downtime, pissed off bosses and investors, money lost, etc.: you know the drill.
Now, if a team has control over the software, the story is a bit different. You can send your best programmers into the battlefield and, since they have a short feedback loop, they’ll work faster and make fewer mistakes. Furthermore, implementing that shiny new caching solution now seems very easy. Since your software is tested, inserting an additional layer is not a matter of surgical precision but simply of moving furniture around. In the end, two teams, one working with legacy software and the other with solid software, will both successfully bring the website back up, although one can sleep at night and the other waits for the next stampede. You can guess which one did it faster and for less money.
The three monoliths
I am going to share a personal story to illustrate this. Back in summer 2009, I worked full-time as a freelancer on a high-traffic website generating buckets of gold for its owners. Of course, the buckets of gold would stop coming if the website went down, so keeping it up was their main priority. After a breakthrough in a new market, traffic multiplied by 10 in a matter of weeks. This promised more buckets of gold. That is, if response times were not abysmal.
Well, long story short, the website was inherited from previous owners who felt time to market should be close to zero, dragging quality down at the same time. Robert C. Martin says it all: “They had rushed the product to market and had made a huge mess in the code. As they added more and more features, the code got worse and worse until they simply could not manage it any longer. It was the bad code that brought the company down.” [Martin 2008]. Truckloads of freelancers and outsourced development shops were given the keys to the SVN repository, and out of it grew not one, not two, but three different and completely independent web applications, acting to the outside world as one. They interacted through a shared database. Two of the three applications were erected in good ol’ PHP4-styled transaction scripts with absolutely nasty hidden interactions through global variables, key-value caches and “special, hidden” tables. The third one was built by a knowledgeable team of professional developers from a small development shop in New York City, which I had the chance to work with. Although the third application was independent of the two other messes, it still had to interact with them. Fortunately enough, the model was simple and this was not a problem. This was important, as it was running one of the most critical parts of the infrastructure, namely a personal messaging system interacting through various schemes with half a dozen APIs. Despite that, from time to time, various invariants were not respected and stuff exploded.
During development, our team had come up with a way to isolate persistence, using a home-made Database Abstraction Layer (DBAL) and the Table Data Gateway pattern. Although I was not familiar with the pattern taxonomy at the time, below is an approximate schema of what it looked like.
In short, we stuck with transaction scripts but isolated them in very small, specific services, making the controllers very thin. Database queries all had to go through the gateway, which in turn relied on the DBAL for abstraction, making the transaction scripts ignorant of the actual implementation used to store the data.
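The layering described above can be sketched in a few lines. This is only an illustration in Python rather than the original PHP, and every name in it (`Dbal`, `MessageGateway`, `send_message`, the `messages` table) is hypothetical, with SQLite standing in for the real database:

```python
from __future__ import annotations

import sqlite3


class Dbal:
    """Hypothetical DBAL: the only layer that knows which driver is in use."""

    def __init__(self, connection: sqlite3.Connection) -> None:
        self._conn = connection
        self._conn.row_factory = sqlite3.Row

    def fetch_all(self, sql: str, params: tuple = ()) -> list[dict]:
        return [dict(row) for row in self._conn.execute(sql, params)]

    def execute(self, sql: str, params: tuple = ()) -> int | None:
        cursor = self._conn.execute(sql, params)
        self._conn.commit()
        return cursor.lastrowid


class MessageGateway:
    """Table Data Gateway: every query against `messages` lives here."""

    def __init__(self, dbal: Dbal) -> None:
        self._dbal = dbal

    def create_table(self) -> None:
        self._dbal.execute(
            "CREATE TABLE IF NOT EXISTS messages ("
            "id INTEGER PRIMARY KEY, "
            "recipient_id INTEGER NOT NULL, "
            "body TEXT NOT NULL)"
        )

    def insert(self, recipient_id: int, body: str) -> int | None:
        return self._dbal.execute(
            "INSERT INTO messages (recipient_id, body) VALUES (?, ?)",
            (recipient_id, body),
        )

    def find_by_recipient(self, recipient_id: int) -> list[dict]:
        return self._dbal.fetch_all(
            "SELECT id, recipient_id, body FROM messages "
            "WHERE recipient_id = ? ORDER BY id",
            (recipient_id,),
        )


def send_message(gateway: MessageGateway, recipient_id: int, body: str) -> int | None:
    """Small transaction script: business rules only, no SQL, no driver calls."""
    text = body.strip()
    if not text:
        raise ValueError("message body must not be empty")
    return gateway.insert(recipient_id, text)
```

Because `send_message` only talks to the gateway, it can be tested against an in-memory database or a stub, and the storage behind the gateway can change without touching the script.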
When presented with a similar schema in a video-conference three weeks before the rapid traffic increase, the programmers in the two other teams barked very loudly. “Too complicated,” one said. “Not performant,” the other said. And so it went. I kinda sided with them on complexity, although I was not convinced by the performance argument. I cheated a bit (sequence diagrams weren’t meant to represent procedural programming) and came up with a small schema of a classic way to approach persistence in transaction scripts in PHP.
Comparing both schemas, it appears the two other teams had a point: their approach looked less complex. On the other hand, our team argued that our design permitted an absolute separation of concerns, hence making testing easier. In fact, our monolith had a test coverage of about 90%, whilst tests were unheard of in the other two. In the end, all three teams parted ways and continued to develop through a shared database, much to our despair.
Thanks to the wonders of marketing, traffic started growing exponentially and it was time to react. The lead of the NYC team I was working with had us all sit through an hour-long video-conference brainstorm on how to cope with it. Our “Ivory Tower” (as other developers started calling it) mostly ran without any sort of application-level caching; only heavy database queries were cached, in memory tables or views. The problem was, we could not rely on the database servers anymore, since the two other monoliths were drowning them with useless queries, despite their extensive use of caching (another example of lack of control). A solution had to be found to cache vast amounts of denormalized data, so as to not rely on the database anymore. As explained earlier, all queries went through a domain-specific table data gateway. The solution was fairly simple: transparently introduce caching in the gateway classes. The new process is described in the activity diagram below.
Simply put, the table gateway would query the cache façade. If there was a hit, the cached value was returned. If there was a miss, the value would be fetched from the database, inserted in the cache through the façade and returned. Furthermore, since consistency was a requirement, the cache had to be invalidated explicitly when a change happened in the source of truth, i.e. the relational database. This was also solved easily; since all CRUD was handled by the same gateway, keeping track of keys was trivial. Implementing this far from perfect but reasonable solution took 1 (long) day and was up and running in no time, with very satisfying results. Since our monolith was not legacy, only minor defects appeared in production, as we could test our changes along the way. Liberated from the task of keeping our monolith alive, we could get back to working on money-earning features.
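A minimal sketch of that read-through-and-invalidate flow, again in Python rather than PHP. Everything here is an assumption made for illustration: `CacheFacade` is a dict standing in for a real key-value store, `FakeDatabase` replaces the relational database and counts reads, and the key format is invented.

```python
from __future__ import annotations


class CacheFacade:
    """Hypothetical cache façade; a dict stands in for the real cache."""

    def __init__(self) -> None:
        self._store: dict[str, list[dict]] = {}

    def get(self, key: str) -> list[dict] | None:
        return self._store.get(key)

    def set(self, key: str, value: list[dict]) -> None:
        self._store[key] = value

    def delete(self, key: str) -> None:
        self._store.pop(key, None)


class FakeDatabase:
    """Stand-in for the source of truth; counts reads to show cache hits."""

    def __init__(self) -> None:
        self.rows: list[dict] = []
        self.reads = 0

    def select_by_recipient(self, recipient_id: int) -> list[dict]:
        self.reads += 1
        return [r for r in self.rows if r["recipient_id"] == recipient_id]

    def insert(self, row: dict) -> None:
        self.rows.append(row)


class MessageGateway:
    """Gateway that owns both the queries and the cache keys they produce."""

    def __init__(self, db: FakeDatabase, cache: CacheFacade) -> None:
        self._db = db
        self._cache = cache

    @staticmethod
    def _key(recipient_id: int) -> str:
        return f"messages:recipient:{recipient_id}"

    def find_by_recipient(self, recipient_id: int) -> list[dict]:
        key = self._key(recipient_id)
        cached = self._cache.get(key)
        if cached is not None:  # hit: the database is not touched
            return cached
        rows = self._db.select_by_recipient(recipient_id)  # miss: go to the source
        self._cache.set(key, rows)
        return rows

    def insert(self, recipient_id: int, body: str) -> None:
        self._db.insert({"recipient_id": recipient_id, "body": body})
        # Explicit invalidation: the write path knows which key is now stale.
        self._cache.delete(self._key(recipient_id))
```

Callers of `find_by_recipient` and `insert` do not change at all when caching is added, which is what makes the change cheap when every query already goes through one gateway.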
At the same time, the two other teams were struggling. After a week of agony, they finally had some results which promised to alleviate the performance problems, but it’d take them another week to implement: they had to go through hundreds of weirdly interacting, hidden, and almost forgotten transaction scripts to add Memcached calls. At the end of the second week, they pushed their changes to the production website and other inexplicable interactions started to emerge. By the end of the third week, a miracle happened: they could handle all the traffic. Needless to say, by then a lot of buckets of gold had been lost.
What is complexity?
Now enlightened by this experience, I became interested in defining and measuring complexity. My first impression, after the persistence meeting 3 weeks before troubles began, was that our solution was much more complex. In the end, our code was much simpler to modify. The two other teams’ approach seemed simple at first, yet their code was much more complex to modify. Perceived complexity is just that: a perception. On the other hand, measured, defined and experienced complexity is a science. Measurements confirmed my first impression was false: cyclomatic complexity per method with the “classic” approach was twice as high as with our approach, while cyclomatic complexity per module was 5 to 10 times higher. Experience also confirmed my first impression was false: perceived simplicity does not mean meaningful simplicity, nor valuable simplicity.
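For readers unfamiliar with the metric, here is a rough sketch of how cyclomatic complexity per function can be approximated: start at 1 and add one for every decision point. It is written in Python against Python source using the standard `ast` module; it is not the tool used for the measurements above (which were taken on PHP code), and it deliberately ignores some constructs such as comprehensions and `match` statements.

```python
from __future__ import annotations

import ast

# Node types counted as one extra path each (an approximation of McCabe's metric).
_DECISION_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(source: str) -> dict[str, int]:
    """Return an approximate cyclomatic complexity for each function in `source`."""
    scores: dict[str, int] = {}
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        score = 1  # a function with no branches has exactly one path
        # Note: nested functions are also counted towards their enclosing function.
        for child in ast.walk(node):
            if isinstance(child, _DECISION_NODES):
                score += 1
            elif isinstance(child, ast.BoolOp):
                # `a and b and c` adds two short-circuit branches
                score += len(child.values) - 1
        scores[node.name] = score
    return scores


SAMPLE = '''
def flat(x):
    return x + 1

def branchy(x, y):
    if x and y:
        for i in range(x):
            if i % 2:
                return i
    return 0
'''
```

Calling `cyclomatic_complexity(SAMPLE)` gives 1 for `flat` and 5 for `branchy` (one base path, two `if` statements, one `for`, one `and`). Summing the scores over a module gives a crude per-module figure comparable in spirit to the one quoted above.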
Under the veil of complexity avoidance and performance tuning, the two other teams actually ran a shady operation of spaghetti cooking. In the end, three weeks of bad performance and unavailability scared away VC funding and the project ran out of money, just like what happened in Martin’s story. The sad thing is, our profession often does not seem to learn from its past mistakes.
What is quality?
I personally measure internal quality as the ease and speed with which a team of competent developers can respond to changing business needs. Lightning fast is how business is done in 2011 and it is how it will be done in the foreseeable future. For one thing, its velocity is increasing as you read this. Startups from emerging markets are eager to take a bite out of your market share. The software developers of recently established high-tech companies provide an insane amount of value to their business by using cutting-edge software engineering tools and practices. Managers expect more with less because that is what the market expects from them. The definition above is intrinsically linked to the needs of business, not some Ivory Tower elitism. Remember who pays you and why. Make every penny of your salary worth it. Making your code hard to maintain is not a fast track to job security but merely a fool-proof way of making your employer lose money, and in the end lowering the incentive to have you on board in the first place.
An invaluable lesson
I was once given an invaluable lesson by a friend and mentor: being a good programmer means delivering value in the short, medium and long run. Internal quality is the measure of long-term value since it makes it easier to respond to changing business needs and increase profitability. Quality helped my team respond to an urgent business problem at almost no cost and with very little disturbance. The lack of it dragged the two other teams through the mud and, in the end, probably drowned the business altogether.
Performance seems to be an easy excuse for poor design. I’ll be honest with you, I have yet to encounter a website performance problem that absolutely and without a single doubt must lead to poor internal quality. In the rare cases where good internal quality cannot be achieved with PHP, you owe it to yourself and to your company to move to some other, more suitable technology instead of butchering your way through hundreds of thousands of lines of unmaintainable code.
References and further reading:
- Beck 2000: Extreme Programming Explained: Embrace Change
- Fowler 2011: The Tradable Quality Hypothesis
- Fowler 2002: Patterns of Enterprise Application Architecture
- Martin 2008: What is clean code? (article)