How much storage is unlimited storage?

I am always very puzzled by unlimited storage offerings, because there is, of course, no such thing as unlimited storage. So really, how much is unlimited?

A company rightfully calling itself Idealize Hosting offers the following, for the modest sum of $5.89 per month.

IdealizeHostingBusinessPro

I am so glad I found these guys. At last, a hosting company that can fulfill my unquenchable thirst for storage. Considering that $5.89 only buys me 62 gigabytes of 99.99999% reliable storage on S3 and 589 gigabytes of storage on Glacier, Idealize’s offer is a bargain.

Registration was easy. Luckily, I have a spare domain to use for this trial.

FirstRegistrationStepIdealize

And I definitely did not forget to tick that box (warning: if your password is deemed too weak, you will need to tick this box again).

TickThisBoxIdealize

The last step is to agree to the TOS. Reading through it, one of my worst fears comes true. Among the prohibited activities:

Hosting large amounts of data not specifically tied (“linked”) to your hosting account.

Well, enterthebatcave.com is now a website dedicated to the download of large chunks of random data.

At last, the registration is completed, and Idealize sure wants me to know it. I have received no less than five email messages from them in less than a minute.

FiveEmailsFromIdealize

I find all the information I need in the message titled New Account Information. Their system is based on cPanel. Again, I get confirmation that I have unlimited storage space.

CpanelStatsIdealize

It’s now time to design my website. A quick phpinfo() tells me PHP 5.3.21 is installed (at the time this article was written, the latest release in the 5.3.x branch was 5.3.25). I am honestly surprised that their PHP installation is so up to date.

30 minutes later, the bat cave is finally ready.
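
The script behind the bat cave boils down to an endpoint that dumps random bytes to disk on every request. Below is a minimal sketch of the idea; the file name, directory and chunk size are illustrative, not the original code.

<?php
// batcave.php (illustrative name): write a chunk of random data to a new
// file on every request, so that hammering it with HTTP requests fills disk.
$dir = __DIR__ . '/chunks';
if (!is_dir($dir)) {
	mkdir($dir, 0755, true);
}

// Build one kilobyte of pseudo-random bytes, then repeat it to reach the
// desired size; good enough for filling "unlimited" storage.
$block = '';
for ($i = 0; $i < 1024; $i++) {
	$block .= chr(mt_rand(0, 255));
}
$data = str_repeat($block, 1024); // roughly 1 MiB per request

$file = $dir . '/' . uniqid('chunk_', true) . '.bin';
file_put_contents($file, $data);

echo 'wrote ' . strlen($data) . ' bytes to ' . basename($file) . "\n";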

The actual script is probably the shittiest code I have ever written but meh, whatever. Using ab(1) (ApacheBench), I shall now generate a few files. Strangely, after a fairly short while, no more data was being written to the newly created files. My stats in cPanel now read:

StatsAfterBatcave

Isn’t it odd that my “unlimited” storage cannot expand beyond “1000” MB? That number is too round to be a coincidence. Is 1000 MB Idealize’s definition of unlimited? To answer this question, I contacted Idealize’s tech support.

InitialTechSupportInquiry

I never heard back from them, but I got the following email.

IdealizeDiskUsageWarning

Oh well. I guess unlimited means 1 gigabyte.


Femtosecond optimizations

In 1974, Donald Knuth, who we can safely say is one of the founding fathers of modern computer science, wrote:

We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.

The PHP ecosystem is plagued with premature optimizers. Not that they intend to do evil on purpose; they have probably just lost some perspective.

Some PHP developers seem so convinced that calling functions is evil that they avoid it at all costs and instead program in big chunks of unimaginable cyclomatic complexity. To measure this factor, I implemented the same functionality in two different ways: implementation (1) is written with many small functions, while implementation (2) is written as one big chunk (13 userland function calls versus 0).

Implementation 1:

function calculateDistanceMatrix(array $strings) {
	$distances = array();
	for ($l1 = 0; $l1 < 10; $l1++)
	{
		$distances[$l1] = array();
		for ($l2 = 0; $l2 < 10; $l2++)
		{
			$distances[$l1][$l2] = levenshtein($strings[$l1], $strings[$l2]);
		}
	}
	return $distances;
}

function generateArrayOfRandomStrings($length, $num) {
	$strings = array();
	for ($i = 0; $i < $num; $i++)  {
		$strings[] = generateRandomString($length);
	}
	return $strings;
}

function generateRandomString($length) {
	$str = '';
	for ($e = 0; $e < $length; $e++) {
		$chr = chr(mt_rand(97, 122));
		if (mt_rand(0, 1)) {
			$chr = strtoupper($chr);
		}
		$str .= $chr;
	}
	return $str;
}

function printDistanceMatrix(array $distances) {
	foreach ($distances as $distancesRow)
	{
		foreach ($distancesRow as $distance)
		{
			printf("%02d ", $distance);
		}
		echo PHP_EOL;
	}
}

$strings = generateArrayOfRandomStrings(20, 10);
$distances = calculateDistanceMatrix($strings);
printDistanceMatrix($distances);

Implementation 2:

$strings = array();
for ($i = 0; $i < 10; $i++)  {
	$str = '';
	for ($e = 0; $e < 20; $e++) {
		$chr = chr(mt_rand(97, 122));
		if (mt_rand(0, 1)) {
			$chr = strtoupper($chr);
		}
		$str .= $chr;
	}
	$strings[] = $str;
}

$distances = array();
for ($l1 = 0; $l1 < 10; $l1++)
{
	$distances[$l1] = array();
	for ($l2 = 0; $l2 < 10; $l2++)
	{
		$distances[$l1][$l2] = levenshtein($strings[$l1], $strings[$l2]);
		printf("%02d ", $distances[$l1][$l2]);
	}
	echo PHP_EOL;
}

To get a conclusive benchmark, I used real HTTP requests against an Apache server running PHP 5.3 with APC. The two implementations were benchmarked with ApacheBench (ab), using 50,000 requests at a concurrency of 50 connections. The IS implementation is just a static page, containing the same output data as one of the two implementations, and is used as a reference.
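
With ab, each run is a single command per implementation (the host and script names below are placeholders):

ab -n 50000 -c 50 http://www.example.org/implementation1.php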

Impl.   Avg. Throughput (req/s)   Avg. Response Time (ms)   99th Percentile (ms)   Highest Response Time (ms)
IS      5659.09                   8.835                     12                     16
I1      450.20                    111.063                   171                    247
I2      466.68                    107.139                   153                    229

What does this amount to? In terms of response time, (2) does 3.66% better than (1). In terms of throughput, (2) also does 3.66% better than (1). Then comes the usual victory speech:

Don’t get me wrong, I’d prefer to maintain the first implementation but when it comes to performance, a 4% gain really warrants creating maintainability problems.

The difference is significant, but is it that important? Probably not. If we were talking about an operating system, or embedded software directing missiles at Soviet Russia (no harm intended to my fellow humans from that part of the world), my opinion would be slightly different, but a web site is quite a different beast.

Usually, the main cost of operating a web site is not renting hardware. In this fast-paced world, business needs change all the freaking time, and the cost of adapting to those changes is the real cost driver. Hardware is cheap. Good programmers are not. Saving 4% on hardware but paying a developer thousands of dollars to fix the resulting issues is a bad business decision.

These kinds of operations are what I call femtosecond optimizations. The name is a bit hyperbolic, since we can’t really go as far as saving femtoseconds, but you get the idea: they are useless, premature optimizations whose sole accomplishment is to diminish the quality of the code and increase maintenance costs. They are toxic, costly and difficult to manage. They are the root of (almost) all evil.

Trust good design. Make it work. Make it work right. And then, make it work fast. Not the other way around.


Trading Performance and Quality

There is a common saying in high-traffic website developer circles (especially in the PHP world) that quality can be traded for performance, as if one cannot exist with the other, or as if a bit of one can be exchanged, at some nominal rate, for a bit of the other. It is in fact a common misconception that internal quality leads to performance problems in business applications. Although this is true in certain domains (embedded software, real-time software, etc.) and on immature platforms (think JavaScript on Microsoft Firefox Downloader), it is, in my opinion, almost always an intellectual shortcut used to justify poor practices.

In a tweet to Michael Feathers (of Working Effectively with Legacy Code fame), I asked whether performance is an excuse for writing legacy code. His answer: “performance is a case where you’d better have tests”, and I agree wholeheartedly. Improving performance in non-trivial, existing business applications is mostly a matter of measuring and refactoring code, since the programming language and the platform are already chosen. Unfortunately, when confronted with legacy code of non-trivial complexity, measuring is difficult but feasible, given a lot of patience and resources; refactoring is almost out of reach.

This is exactly why unlegacity (keeping code out of legacy territory) matters when performance is an issue. The key is control. Automated testing and quality help a team control their software, making it easier to improve performance by measuring and refactoring. If traffic to your website doubles, throwing twice as many servers into the pool might not work as well as expected, but it will probably stop the hemorrhaging. If traffic quadruples in a matter of weeks or months, trouble is looming. Now, on top of that, imagine not having control over the software. All of a sudden, the guy you’re paying to maintain the website cannot handle all the requests thrown at him. Management sends more soldiers to war and chaos slowly builds. You realize nobody except this one guy knows the caprices and the sinuous paths leading to the hidden secrets of the application. Even he opens a few Pandora’s boxes along the way, and all hell breaks loose in a matter of days. Downtime, pissed-off bosses and investors, money lost, etc.: you know the drill.

Now, if a team has control over its software, the story is a bit different. You can send your best programmers into the battlefield and, given their short feedback loop, they’ll work faster and make fewer mistakes. Furthermore, implementing that shiny new caching solution now seems very easy: since your software is tested, inserting an additional layer is not a matter of surgical precision but simply of moving furniture around. In the end, two teams, one working with legacy software and the other with solid software, will both successfully bring the web site back up, although one can sleep at night while the other waits for the next stampede. You can guess which one did it faster and for less money.

The three monoliths

Let me throw in a personal story to illustrate this. Back in the summer of 2009, I worked full-time as a freelancer on a high-traffic website generating buckets of gold for its owners. Of course, the buckets of gold would stop coming if the website went down, so keeping it up was their main priority. After a breakthrough in a new market, traffic was multiplied by 10 in a matter of weeks. This promised even more buckets of gold; that is, provided response times were not abysmal.

Well, long story short, the website had been inherited from previous owners who felt time to market should be close to zero, dragging quality down with it. Robert C. Martin says it all: “They had rushed the product to market and had made a huge mess in the code. As they added more and more features, the code got worse and worse until they simply could not manage it any longer. It was the bad code that brought the company down.” [Martin 2008]. Truckloads of freelancers and outsourced development shops had been given the keys to the SVN repository, and out of it grew not one, not two, but three different and completely independent web applications, acting to the outside world as one. They interacted through a shared database. Two of the three applications were erected as good ol’ PHP4-style transaction scripts with absolutely nasty hidden interactions through global variables, key-value caches and “special, hidden” tables. The third one was built by a knowledgeable team of professional developers from a small development shop in New York City, with which I had the chance to work. Although this third application was independent of the two other messes, it still had to interact with them. Fortunately, the model was simple and this was not a problem, which mattered, because it ran one of the most critical parts of the infrastructure: a personal messaging system interacting through various schemes with half a dozen APIs. Despite that, from time to time, various invariants were not respected and stuff exploded.

During development, our team had come up with a way to isolate persistence, using a home-made Database Abstraction Layer (DBAL) and the Table Data Gateway pattern. Although I was not familiar with the pattern taxonomy at the time, below is an approximate diagram of what it looked like.

In short, we stuck with transaction scripts but isolated them in very small, specific services, keeping the controllers very thin. Database queries all had to go through the gateway, which in turn relied on the DBAL for abstraction, leaving the transaction scripts ignorant of the actual implementation used to store the data.
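
To give a feel for the layering, here is a minimal sketch of the idea; the Dbal interface and the MessageGateway and InboxService classes are illustrative, not the actual code.

<?php
// A transaction script talks to a Table Data Gateway, which delegates all SQL
// to the DBAL; the script never sees the storage implementation.
interface Dbal
{
	public function fetchAll($sql, array $params = array());
	public function execute($sql, array $params = array());
}

class MessageGateway
{
	private $dbal;

	public function __construct(Dbal $dbal)
	{
		$this->dbal = $dbal;
	}

	public function findByRecipient($userId)
	{
		return $this->dbal->fetchAll(
			'SELECT * FROM messages WHERE recipient_id = ?',
			array($userId)
		);
	}

	public function insert($senderId, $recipientId, $body)
	{
		$this->dbal->execute(
			'INSERT INTO messages (sender_id, recipient_id, body) VALUES (?, ?, ?)',
			array($senderId, $recipientId, $body)
		);
	}
}

// A thin, specific service: no SQL, no knowledge of how the data is stored.
class InboxService
{
	private $messages;

	public function __construct(MessageGateway $messages)
	{
		$this->messages = $messages;
	}

	public function listInbox($userId)
	{
		return $this->messages->findByRecipient($userId);
	}
}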

When presented with a similar diagram in a video-conference three weeks before the rapid traffic increase, the programmers in the two other teams barked very loudly. “Too complicated,” one said. “Not performant,” said the other. And so it went. I kind of sided with them on complexity, although I was not convinced by the performance argument. I cheated a bit (sequence diagrams weren’t meant to represent procedural programming) and came up with a small diagram of the classic way to approach persistence with transaction scripts in PHP.
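
That classic approach looks more or less like this; a made-up sketch, with table names and connection details that are not from the actual code.

<?php
// Classic PHP4-style transaction script: connection handling, SQL and output
// all live in the same page script, relying on global connection state.
mysql_connect('localhost', 'user', 'password');
mysql_select_db('website');

$userId = (int) $_GET['user_id'];
$result = mysql_query('SELECT * FROM messages WHERE recipient_id = ' . $userId);

while ($row = mysql_fetch_assoc($result)) {
	echo htmlspecialchars($row['body']), '<br />';
}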

Comparing both diagrams, the two other teams had a point: their approach did look less complex. On the other hand, our team argued that this design permitted a complete separation of concerns, which made testing easier. In fact, our monolith had test coverage of about 90%, whereas tests were unheard of in the two others. In the end, all three teams parted ways and continued to develop against a shared database, much to our despair.

Thanks to the wonders of marketing, traffic started growing exponentially and it was time to react. The lead of the NYC team I was working with had us all sit through an hour-long video-conference brainstorm on how to cope with it. Our “Ivory Tower” (as the other developers had started calling it) mostly ran without any sort of caching; heavy database queries were cached in memory tables or views. The problem was that we could no longer rely on the database servers, since the two other monoliths were drowning them in useless queries despite their extensive use of caching (another example of lack of control). A solution had to be found to cache vast amounts of denormalized data, so as not to rely on the database anymore. As explained earlier, all queries went through a domain-specific table data gateway. The solution was fairly simple: transparently introduce caching in the gateway classes. The new process is described in the activity diagram below.

Simply put, the table gateway would query the cache façade. On a hit, the cached value was returned; on a miss, the value was fetched from the database, inserted into the cache through the façade, and returned. Furthermore, since consistency was a requirement, the cache had to be invalidated explicitly whenever a change happened in the source of truth, i.e. the relational database. This too was solved easily: since all CRUD went through the same gateway, keeping track of keys was trivial. Implementing this far-from-perfect but reasonable solution took one (long) day and was up and running in no time, with very satisfying results. Because our monolith was not legacy, we could test our changes along the way, and only minor defects appeared in production. Liberated from the task of keeping our monolith alive, we could get back to working on money-earning features.
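
In code, the change amounted to little more than extending the gateway. Here is a hypothetical sketch, building on the gateway sketch above; the CacheFacade interface and the key scheme are illustrative.

<?php
// The gateway consults the cache façade first, falls back to the database on
// a miss, and explicitly invalidates the affected keys on writes.
interface CacheFacade
{
	public function get($key);                   // returns false on a miss
	public function set($key, $value, $ttl = 0);
	public function delete($key);
}

class CachingMessageGateway extends MessageGateway
{
	private $cache;

	public function __construct(Dbal $dbal, CacheFacade $cache)
	{
		parent::__construct($dbal);
		$this->cache = $cache;
	}

	public function findByRecipient($userId)
	{
		$key = 'messages.recipient.' . $userId;

		$rows = $this->cache->get($key);
		if ($rows !== false) {
			return $rows;                           // cache hit
		}

		$rows = parent::findByRecipient($userId);   // cache miss: hit the database
		$this->cache->set($key, $rows, 300);
		return $rows;
	}

	public function insert($senderId, $recipientId, $body)
	{
		parent::insert($senderId, $recipientId, $body);
		// The source of truth changed: invalidate the affected key explicitly.
		$this->cache->delete('messages.recipient.' . $recipientId);
	}
}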

At the same time, the two other teams were struggling. After a week of agony, they finally had some results that promised to alleviate the performance problems, but it would take them another week to implement: they had to go through hundreds of weirdly interacting, hidden, almost forgotten transaction scripts to add Memcached calls. At the end of the second week, they pushed their changes to the production web site, and other inexplicable interactions started to emerge. By the end of the third week, a miracle happened: they could handle all the traffic. Needless to say, by then a lot of buckets of gold had been lost.

What is complexity?

Enlightened by this experience, I became interested in defining and measuring complexity. My first impression, after the persistence meeting three weeks before the troubles began, was that our solution was much more complex. In the end, our code was much simpler to modify, while the two other teams’ approach, which seemed simple at first, turned out to be much harder to modify. Perceived complexity is just that: a perception. Measured, defined and experienced complexity, on the other hand, is a science. Measurements showed my first impression was wrong: cyclomatic complexity per method with the “classic” approach was twice as high as with ours, and cyclomatic complexity per module was 5 to 10 times higher. Experience confirmed it too: perceived simplicity is not the same as meaningful, valuable simplicity.

Under the veil of complexity avoidance and performance tuning, the two other teams were actually running a shady spaghetti-cooking operation. In the end, three weeks of bad performance and unavailability scared away the VC money and the project ran out of funds, just like in Martin’s story. The sad thing is that our profession does not seem to learn from its past mistakes.

What is quality?

If quality is not the antithesis of performance, then achieving quality should in itself be one of the ultimate goals of every developer, since it provides a great deal of value to the business, especially when performance is an issue. But how can we define quality? Business users tend to measure quality in terms of user experience, i.e. external quality, which is their specialty and responsibility. Our responsibility, on the other hand, is to measure and increase internal quality [Beck 2000].

I personally measure internal quality as the ease and speed with which a team of competent developers can respond to changing business needs. Lightning fast is how business is done in 2011, and it is how it will be done for the foreseeable future; if anything, its velocity increases as you read this. Startups from emerging markets are eager to take a bite out of your market share. The software developers of recently established high-tech companies provide an insane amount of value to their business by using cutting-edge software engineering tools and practices. Managers expect more with less, because that is what the market expects of them. The definition above is intrinsically linked to the needs of the business, not to some Ivory Tower elitism. Remember who pays you and why, and make every penny of your salary worth it. Making your code hard to maintain is not a fast track to job security; it is merely a fool-proof way of making your employer lose money, which in the end lowers the incentive to have you on board in the first place.

An invaluable lesson

I was once given an invaluable lesson by a friend and mentor: being a good programmer means delivering value in the short, medium and long run. Internal quality is the measure of long-term value, since it makes it easier to respond to changing business needs and increases profitability. Quality helped my team respond to an urgent business problem at almost no cost and with very little disturbance. Inferiority dragged the two other teams through the mud and, in the end, probably drowned the business altogether.

Performance seems to be an easy excuse for poor design. I’ll be honest with you: I have yet to encounter a website performance problem that absolutely, without a single doubt, must lead to poor internal quality. In the rare cases where good internal quality cannot be achieved with PHP, you owe it to yourself and to your company to move to another, more suitable technology instead of butchering your way through hundreds of thousands of lines of unmaintainable code.

References and further reading:

[Beck 2000] Kent Beck, Extreme Programming Explained: Embrace Change, Addison-Wesley, 2000.
[Martin 2008] Robert C. Martin, Clean Code: A Handbook of Agile Software Craftsmanship, Prentice Hall, 2008.

In Soviet Russia, the trunk breaks you

I was kind of skeptical about all the hype surrounding Git (I know, I must be the last geek to join the bandwagon). And then it struck me: everyone says it’s awesome, so maybe I should teach myself more than just cloning Git repositories? I was very pleased with my time investment; let me just say my heart went pitter-patter and tears of geek joy wrestled their way down my cheeks when I got a real measure of the power of this beast.

One of the immediate advantages is that I can now commit broken code to the trunk! I’m often forced to do just that when I want to leave my desk at work, come back home and enjoy a few hours of programming from my living room whilst watching some Greek tragedy unfold (America’s Next Top Model, Rock of Love, A Shot at Love with Tila Tequila, etc.) and enjoying a slice of pizza.

I'm in your repository, breaking your trunk

Git is 100% distributed, which means one travels with a copy of the whole repository and pushes to an external repository when need be. I’ll never break the trunk again, I promise!
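
In practice, that workflow boils down to a handful of commands (the commit messages and the remote and branch names here are just examples):

# At the office: commit the half-finished work to the local repository only.
git commit -am "WIP: feature half done, tests still red"

# Back home: keep committing locally as much as needed...
git commit -am "Finish the feature, tests green"

# ...and only push the whole series to the shared repository once it works.
git push origin master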