How much storage is unlimited storage?

I am always very puzzled by unlimited storage offerings because there is, of course, no such thing as unlimited storage. So really, how much is unlimited?

A company aptly calling itself Idealize Hosting offers the following for the modest sum of $5.89 per month.

[Screenshot: Idealize Hosting “Business Pro” plan]

I am so glad I found these guys. At last, a hosting company that can fulfill my unquenchable thirst for storage. Considering that $5.89 only buys me 62 gigabytes of 99.99999% reliable storage on S3 and 589 gigabytes of storage on Glacier, Idealize’s offer is a bargain.

Registration was easy. Luckily, I have a spare domain to use for this trial.

[Screenshot: first registration step]

And I definitely did not forget to tick that box (warning: if your password is deemed too weak, you will need to tick this box again).

[Screenshot: the checkbox in question]

The last step is to agree to the TOS. Reading through it, one of my worst fears comes true. The following is prohibited:

Hosting large amounts of data not specifically tied (“linked”) to your hosting account.

Well, enterthebatcave.com is now a website dedicated to the download of large chunks of random data.

At last, the registration is completed, and Idealize sure wants me to know it. I have received no fewer than five email messages from them in less than a minute.

[Screenshot: five emails from Idealize]

I find all the information I need in the message titled New Account Information. Their system is based on cPanel. Again, I get the confirmation that I have unlimited storage space.

[Screenshot: cPanel account statistics]

It’s now time to design my website. A quick phpinfo tells me PHP 5.3.21 is installed (at the time this article was written, the latest available version of the 5.3.x branch was 5.3.25). I am honestly surprised that their PHP installation is this up to date.

30 minutes later, the bat cave is finally ready.

This is probably the shittiest code I have ever written, but meh, whatever. Using ab(1) (Apache Benchmark), I shall now generate a few files.
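The script itself never made it into this post, so here is a rough, invented sketch of the kind of thing it does, namely appending a chunk of random bytes to a file on every request (the file name and chunk size are made up):

// Hypothetical sketch, not the original script: each request appends
// random bytes to a file, so hammering the URL with ab(1) should make
// disk usage grow quickly.
$chunkSize = 1024 * 1024; // 1 MB per request, chosen arbitrarily
$file = __DIR__ . '/batcave-' . mt_rand(0, 9) . '.bin';

$chunk = '';
for ($i = 0; $i < $chunkSize; $i++) {
    $chunk .= chr(mt_rand(0, 255));
}

file_put_contents($file, $chunk, FILE_APPEND);
echo 'Appended ' . $chunkSize . ' bytes to ' . basename($file);

Pointing ab at it with something like ab -n 100000 -c 100 http://enterthebatcave.com/fill.php (the script name is invented) should, in theory, eat through “unlimited” storage rather quickly.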

Strangely, after a fairly short while, no more data was being written to the newly created files. My stats in cPanel are now:

[Screenshot: cPanel statistics after running the bat cave]

Isn’t it odd that my “unlimited” storage cannot expand beyond 1000 MB? That number is too round to be a coincidence. Is 1000 MB the definition of unlimited for Idealize? To answer this question, I contacted Idealize’s tech support.

[Screenshot: initial tech support inquiry]

I never heard back from them, but I got the following email.

[Screenshot: disk usage warning email from Idealize]

Oh well. I guess unlimited means 1 gigabyte.


The unknown value of Value Objects

Domain-Driven Design (Evans 2003, Nilsson 2006) established a succinct taxonomy to describe classes, in the context of the upper layers of an application (i.e. non-infrastructure):

  1. Entities: objects that have a unique identity and a non-trivial life cycle (e.g. an invoice in an accounting application)
  2. Services: objects which handle non-trivial operations spanning many objects (e.g. a cargo router)
  3. Value objects: simple objects with no identity (e.g. a phone number)

One of the main rules of DDD is that value objects are immutable. This is often not as self-evident as it seems, as many programmers are not even aware of the state they leak and create. Sadly, state is often positively correlated with entropy. Hence, taking steps to contain and to limit state is one of the keys to taming complexity in an application.
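To make this concrete before diving into the main example, here is a minimal sketch of the phone number value object mentioned above, assuming plain PHP and an intentionally naive validation rule:

class PhoneNumber
{
    private $number;

    public function __construct($number) {
        // naive validation, for illustration only
        if (!preg_match('/^\+?[0-9]{7,15}$/', $number)) {
            throw new DomainException('Invalid phone number');
        }
        $this->number = $number;
    }

    // No setters: once created, the value never changes. Two equal
    // phone numbers are fully interchangeable, which is the whole
    // point of a value object.
    public function equals(PhoneNumber $other) {
        return $this->number === $other->number;
    }

    public function __toString() {
        return $this->number;
    }
}

A class like this has no identity and no life cycle: it simply is its value.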

Even if the conceptual complexity of your projects does not warrant an approach as advanced as DDD, using value objects (often called immutable objects outside of DDD) has incredible benefits. Sometimes, most of the complexity in identity-less objects is accidental and related to the way simple operations are handled. To demonstrate this, I will use an example I’ve seen in a project recently. The RationalNumber class represents a rational number with such operations as adding, subtracting, multiplying and dividing (only adding and dividing are shown for brevity). Here is a typical use case.

$half = new RationalNumber(1, 2);
$third = new RationalNumber(1, 3);

$half->add($third);
$half->divide($third);

var_dump((string) $half);
var_dump((string) $third);

Without looking at the actual implementation, what would you expect the output to be? In real-world arithmetic, these are always true:

\dfrac{1}{2} + \dfrac{1}{3} = \dfrac{5}{6}
\dfrac{1}{2} \div \dfrac{1}{3} = \dfrac{3}{2}

In other words, “adding” does not change the nature of the addends, nor does “dividing” change the nature of the divisor and the dividend. It seems fairly intuitive that our implementation of arithmetic should follow the same rules, i.e. that the previous code outputs (we’ll come back to how the results of the addition and division are made available a bit later):

string(3) "1/2"
string(3) "1/3"

Well, in PHP, it all depends on how you implement the RationalNumber class. Let’s start with the classic mutable implementation:

class RationalNumber
{
    private $numerator;
    private $denominator;

    public function __construct($numerator, $denominator) {
        $this->setValue($numerator, $denominator);
    }

    public function setValue($numerator, $denominator) {
        $this->numerator = $numerator;

        if ($denominator == 0) {
            throw new DomainException('Denominator cannot be 0');
        }
        $this->denominator = $denominator;
    }

    public function add(RationalNumber $addend) {
        $numerator = $this->numerator * $addend->denominator + $addend->numerator * $this->denominator;
        $denominator = $this->denominator * $addend->denominator;
        $this->setValue($numerator, $denominator);
        return $this;
    }

    public function divide(RationalNumber $divisor) {
        $numerator = $this->numerator * $divisor->denominator;
        $denominator = $this->denominator * $divisor->numerator;
        $this->setValue($numerator, $denominator);
        return $this;
    }

    public function __toString() {
        return sprintf('%d/%d', $this->numerator, $this->denominator);
    }
}

Using the previous code, the output is:

string(3) "15/6"
string(3) "1/3"

The $half object had its state mutated twice, which yields a strange result. Why isn’t the other addend also modified? Doesn’t it seem odd that an object we refer to by a meaningful name ($half) now means something completely different? It also means that changing the order of the operations (i.e. calling the add() method on one addend rather than the other) changes the resulting state of the program. Is this really what we want?

A programmer who is not well acquainted with the implementation of the RationalNumber class will use the design decisions communicated by the public interface to make rational assumptions. Is it that far-fetched to represent these arithmetic operations:

a = \dfrac{1}{3} + \dfrac{1}{2}
b = \dfrac{1}{2} \div \dfrac{1}{3}

Using the code below?

$half = new RationalNumber(1, 2);
$third = new RationalNumber(1, 3);

$a = $third->add($half);
$b = $half->divide($third);

var_dump((string) $a);
var_dump((string) $b);
var_dump((string) $half);
var_dump((string) $third);

Outputs:

string(3) "5/6"
string(4) "6/10"
string(4) "6/10"
string(3) "5/6"

This all seems contrary to the laws of arithmetic and is far from cohesive and intuitive. What if you do not want to lose the value contained in one of the variables? For example, representing the following equation is clearly a PITA if the mutable route is used:

(a + b + c) \times a \div b

In fact, quite a profound knowledge of the internals of RationalNumber is necessary to implement the equation correctly. A correct implementation would be:

$aCopy = clone $a;

$a->add($b);
$a->add($c);
$a->multiply($aCopy);
$a->divide($b);

Does that seem right? The more complex the arithmetic expression to represent, the more problems arise.

To add to our grief, the implementation has a fundamental flaw. Take a look at it. Twice. Thrice. You probably have not found it. Let me help you:

$d = new RationalNumber(3, 7);
try {
    $d->setValue(5, 0);
} catch (Exception $e) {
    var_dump((string) $d);
}

Produces output:

string(3) "3/7"

The object ends up in an inconsistent state: it now reads 5/7, a value nobody ever asked for. Of course, there is an easy way to fix this particular case: move the guard in setValue() before the first assignment. However, the problem is much more profound. This is a typical case of the defect introduction/fault isolation effort dichotomy: it is very easy for a small change to create an execution branch which can put the object in an inconsistent state, and it is very hard to discover that possibility, even with a test. If such an error ever happens in staging or, even worse, in production, it will percolate high into the upper layers and may have serious repercussions. The very nature of the implementation is responsible for this dichotomy.
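For the record, the local fix mentioned above is trivial: validate before touching any state, so that a failure leaves the object exactly as it was.

    public function setValue($numerator, $denominator) {
        // validate first: nothing is assigned if the input is invalid
        if ($denominator == 0) {
            throw new DomainException('Denominator cannot be 0');
        }

        $this->numerator = $numerator;
        $this->denominator = $denominator;
    }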

In itself, the mutable implementation is not outright wrong, but it certainly adds accidental complexity to the application and can cause unexpected state inconsistencies. Why bother, when all of this can be avoided by using an immutable implementation instead?

class RationalNumber
{
    private $numerator;
    private $denominator;

    public function __construct($numerator, $denominator) {
        if ($denominator == 0) {
            throw new DomainException('Denominator cannot be 0');
        }

        $this->numerator = $numerator;
        $this->denominator = $denominator;
    }

    public function add(RationalNumber $addend) {
        $numerator = $this->numerator * $addend->denominator + $addend->numerator * $this->denominator;
        $denominator = $this->denominator * $addend->denominator;
        return new self($numerator, $denominator);
    }

    public function divide(RationalNumber $divisor) {
        $numerator = $this->numerator * $divisor->denominator;
        $denominator = $this->denominator * $divisor->numerator;
        return new self($numerator, $denominator);
    }

    public function __toString() {
        return sprintf('%d/%d', $this->numerator, $this->denominator);
    }
}

Now, the arithmetic operations:

(a + b + c) \times a \div b

Can be implemented intuitively as:

$result = $a->add($b)->add($c)->multiply($a)->divide($b); 

No cloning is necessary, as a new object is created at each step. Also, since we got rid of state mutations, inconsistency can never occur. Any of the initial values can be reused as-is, since their state has not mutated.

Always remember that mutable state is heavy in consequences. Permitting state mutation unnecessarily often causes accidental complexity, which makes your application more vulnerable to change and introduces completely avoidable risk. The value of value objects lies in their cohesiveness, intuitiveness and simplicity. Use them as often as possible to avoid the pitfalls of accidental mutable state.


Htaccess considered slow

Are .htaccess configuration files slow? One could guess so, considering they are checked for changes on every single request. To verify this, I tested three scenarios (a representative configuration for each is sketched after the list):

  1. No rewrite rule.
  2. One rewrite rule, matched, rule is in .htaccess file.
  3. One rewrite rule, matched, rule is in Apache configuration file.
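The exact rule used in the benchmark is not reproduced here; a representative setup, with an invented URL pattern and paths, could look like the following. For scenario 2, the rule lives in an .htaccess file at the document root:

# Scenario 2: .htaccess in the document root (the pattern is invented)
RewriteEngine On
RewriteRule ^page/([0-9]+)$ index.php?page=$1 [L]

For scenario 3, the same rule moves into the main Apache configuration and .htaccess lookups are switched off entirely:

# Scenario 3: same rule in the server configuration
<Directory /var/www/bench>
    # AllowOverride None stops Apache from even looking for .htaccess
    # files on each request
    AllowOverride None
    RewriteEngine On
    RewriteRule ^page/([0-9]+)$ index.php?page=$1 [L]
</Directory>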

Here are the results, with concurrency = 100 and 100 000 requests:

Scenario   Avg. throughput (#/s)   Avg. response time (ms)
1          5304.41                 18.852
2          4321.07                 23.142
3          4644.17                 21.532

Roughly speaking, not using an .htaccess file for a simple rewrite rule means a 7% increase in throughput and a 7% decrease in response time (scenario 3 versus scenario 2), which is non-negligible and easy to obtain, especially since it does not affect code quality at all.

Even more surprising is the slowness of the rewrite engine itself. Granted, regular expressions are not free, but I would have guessed that Apache caches the matches in a hashmap, on which lookups are practically free (O(1) complexity). So how do you explain the 14% difference in throughput between scenarios 1 and 3?


Femtosecond optimizations

In 1974, Donald Knuth, who we can safely say is one of the founding fathers of modern Computer Science, said:

We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.

The PHP ecosystem is plagued with premature optimizers. Not that they intend any evil, but they have probably lost some perspective.

Some PHP developers seem so convinced that calling functions is evil that they avoid it at all costs and program in big chunks of unimaginable cyclomatic complexity instead. Hence, I dared to measure this factor by implementing the same functionality in two different ways. Implementation (1) is written using many functions and implementation (2) is written as one big chunk (13 user-land function calls versus 0).

Implementation 1:

function calculateDistanceMatrix(array $strings) {
	$distances = array();
	for ($l1 = 0; $l1 < 10; $l1++)
	{
		$distances[$l1] = array();
		for ($l2 = 0; $l2 < 10; $l2++)
		{
			$distances[$l1][$l2] = levenshtein($strings[$l1], $strings[$l2]);
		}
	}
	return $distances;
}

function generateArrayOfRandomStrings($length, $num) {
	$strings = array();
	for ($i = 0; $i < $num; $i++)  {
		$strings[] = generateRandomString($length);
	}
	return $strings;
}

function generateRandomString($length) {
	$str = '';
	for ($e = 0; $e < $length; $e++) {
		$chr = chr(mt_rand(97, 122));
		if (mt_rand(0, 1)) {
			$chr = strtoupper($chr);
		}
		$str{$e} = $chr;
	}
	return $str;
}

function printDistanceMatrix(array $distances) {
	foreach ($distances as $distancesRow)
	{
		foreach ($distancesRow as $distance)
		{
			printf("%02d ", $distance);
		}
		echo PHP_EOL;
	}
}

$strings = generateArrayOfRandomStrings(20, 10);
$distances = calculateDistanceMatrix($strings);
printDistanceMatrix($distances);

Implementation 2:

$strings = array();
for ($i = 0; $i < 10; $i++)  {
	$str = '';
	for ($e = 0; $e < 20; $e++) {
		$chr = chr(mt_rand(97, 122));
		if (mt_rand(0, 1)) {
			$chr = strtoupper($chr);
		}
		$str{$e} = $chr;
	}
	$strings[] = $str;
}

$distances = array();
for ($l1 = 0; $l1 < 10; $l1++)
{
	$distances[$l1] = array();
	for ($l2 = 0; $l2 < 10; $l2++)
	{
		$distances[$l1][$l2] = levenshtein($strings[$l1], $strings[$l2]);
		printf("%02d ", $distances[$l1][$l2]);
	}
	echo PHP_EOL;
}

To create a conclusive benchmark, I used real HTTP requests against an Apache server running PHP 5.3 with APC. The two implementations were benchmarked using the “Apache Benchmark” tool (ab), with 50 000 requests and a concurrency of 50 connections. The IS implementation is just a static page containing the same output data as the two implementations, and is used as a reference point.

Impl.   Avg. Throughput (#/s)   Avg. Response Time (ms)   99th Percentile (ms)   Highest Response Time (ms)
IS      5659.09                 8.835                     12                     16
I1      450.20                  111.063                   171                    247
I2      466.68                  107.139                   153                    229

What does this amount to? In terms of response time, (2) does 3.66% better than (1). In terms of throughput, (2) also does 3.66% better than (1). Then comes the usual victory speech:

Don’t get me wrong, I’d prefer to maintain the first implementation but when it comes to performance, a 4% gain really warrants creating maintainability problems.

The difference is significant, but is it that important? Probably not. If we were talking about an operating system or embedded software directing missiles at Soviet Russia (no harm intended to my fellow humans from that part of the world), my opinion would be slightly different, but a web site is quite a different beast.

Usually, the main cost of operating a web site is not renting hardware. In this fast-paced world, business needs change all the freaking time. How much it costs to adapt to these changes is the real cost driver. Hardware is cheap. Good programmers are not. Saving 4% on hardware costs but paying a developer thousands of dollars to fix issues is a bad business decision.

These kinds of operations are what I call femtosecond optimizations. Of course, the term is a bit hyperbolic since we cannot really go as far as saving femtoseconds, but you get the idea: they are useless, premature optimizations whose sole accomplishment is to diminish the quality of the code and increase maintenance costs. They are toxic, costly and difficult to manage. They are the root of (almost) all evil.

Trust good design. Make it work. Make it work right. And then, make it work fast. Not the other way around.


Trading Performance and Quality

There is a common saying in high-traffic website developer circles (especially in the PHP world) that quality can be traded for performance, as if one cannot exist alongside the other, or as if a bit of one can be exchanged, at a nominal rate, for a bit of the other. It is in fact a common misconception that perceived internal quality leads to performance problems in business applications. Although this is true in certain domains (embedded software, real-time software, etc.) and on immature platforms (think Javascript on Microsoft Firefox Downloader), it is, in my opinion, almost always an intellectual shortcut used to justify poor practices.

In a tweet to Michael Feathers (of Working Effectively with Legacy Code fame), I asked whether performance is an excuse for writing legacy code. The answer: “performance is a case where you’d better have tests”, and I agree wholeheartedly. Performance improvement in non-trivial, existing business applications is mostly accomplished by measuring and refactoring code, since the programming language and the platform are already chosen. Unfortunately, when confronted with legacy code in an application of non-trivial complexity, measuring is difficult yet feasible with a lot of patience and resources, while refactoring is almost out of reach.

This is exactly why unlegacity is important when performance is an issue. The key is control. Automated testing and quality help a team control their software, making it easier to improve performance by measuring and refactoring. If traffic to your website doubles, throwing twice as many servers into the pool might not work as well as expected, but it will probably stop the hemorrhaging. If traffic quadruples in a matter of months or weeks, trouble is looming. Now, on top of that, imagine not having control over the software. All of a sudden, that guy you’re paying to maintain the website cannot handle all the requests thrown at him. Management sends more soldiers to war and chaos slowly builds. You realize nobody except this guy knows the caprices and the sinuous paths leading to the hidden secrets of the application. Even he opens a few Pandora’s boxes along the way, and all hell breaks loose in a matter of days. Downtime, pissed off bosses and investors, money lost, etc.: you know the drill.

Now, if a team has control over the software, the story is a bit different. You can send your best programmers onto the battlefield and, since they have a short feedback loop, they’ll work faster and make fewer mistakes. Furthermore, implementing that shiny new caching solution now seems very easy. Since your software is tested, inserting an additional layer is not a matter of surgical precision but simply a matter of moving furniture around. In the end, two teams, one working with legacy software and the other with solid software, will both successfully bring the web site back up, although one can sleep at night and the other waits for the next stampede. You can guess which one did it faster and for less money.

The three monoliths

I am going to throw in a personal story to illustrate this. Back in the summer of 2009, I worked full-time as a freelancer on a high-traffic website generating buckets of gold for its owners. Of course, the buckets of gold would stop coming if the website went down, so keeping it up was their main priority. After a breakthrough in a new market, traffic was multiplied by 10 in a matter of weeks. This promised more buckets of gold. That is, if response times had not been abysmal.

Well, long story short, the website was inherited from previous owners who felt time to market should be close to zero, dragging quality down with it. Robert C. Martin says it all: “They had rushed the product to market and had made a huge mess in the code. As they added more and more features, the code got worse and worse until they simply could not manage it any longer. It was the bad code that brought the company down.” [Martin 2008]. Truckloads of freelancers and outsourced development shops were given the keys to the SVN repository, and out of it grew not one, not two but three different and completely independent web applications, acting to the outside world as one. They interacted through a shared database. Two of the three applications were erected as good ol’ PHP4-style transaction scripts with absolutely nasty hidden interactions through global variables, key-value caches and “special, hidden” tables. The third one was built by a knowledgeable team of professional developers from a small development shop in New York City, which I had the chance to work with. Although the third application was independent of the two other messes, it still had to interact with them. Fortunately, the model was simple and this was not a problem. This mattered, as it was running one of the most critical parts of the infrastructure, namely a personal messaging system interacting through various schemes with half a dozen APIs. Despite that, from time to time, various invariants were not respected and stuff exploded.

During development, our team had come up with a way to isolate persistence, using a home-made Database Abstraction Layer (DBAL) and the Table Data Gateway pattern. Although I was not familiar with the pattern taxonomy at the time, here is roughly how it worked.

In short, we stuck with transaction scripts but isolated them in very small, specific services, making the controllers very thin. Database queries all had to go through the gateway, which in turn relied on the DBAL for abstraction, making the transaction scripts ignorant of the actual implementation used to store the data.
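The original schema was a diagram; a rough sketch of the same idea in code, with invented class and table names, would look something like this:

// Invented names, for illustration only: all SQL lives in the gateway,
// and the gateway only ever talks to the home-made DBAL.
class MessageTableGateway
{
    private $dbal;

    public function __construct(DatabaseAbstractionLayer $dbal) {
        $this->dbal = $dbal;
    }

    public function findByRecipient($recipientId) {
        return $this->dbal->fetchAll(
            'SELECT * FROM messages WHERE recipient_id = ?',
            array($recipientId)
        );
    }

    public function insert(array $row) {
        $this->dbal->insert('messages', $row);
    }
}

// A small, specific service: the transaction script itself, kept thin
// enough that the controller only ever calls one method.
class InboxService
{
    private $gateway;

    public function __construct(MessageTableGateway $gateway) {
        $this->gateway = $gateway;
    }

    public function listMessagesFor($recipientId) {
        return $this->gateway->findByRecipient($recipientId);
    }
}

The transaction scripts never see SQL or the storage driver, so swapping the storage implementation, or wrapping the gateway, stays a local change.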

When presented with a similar schema in a video-conference three weeks before the rapid traffic increase, the programmers in the two other teams barked very loudly. “Too complicated”, one said. “Not performant”, the other said. And so it went. I kinda sided with them on complexity, although I was not convinced by the performance argument. I cheated a bit (sequence diagrams were not meant to represent procedural programming) and came up with a small schema of the classic way to approach persistence with transaction scripts in PHP.

By comparing both schemas, it appears the two other teams had made their point: their approach looked less complex. On the other hand, our team argued that this design permitted an absolute separation of concerns, hence making testing easier. In fact, our monolith had a test coverage of about 90%, whilst tests were unheard of in the other two. In the end, all three teams parted ways and continued to develop against a shared database, much to our despair.

Thanks to the wonders of marketing, traffic started growing exponentially and it was time to react. The lead of the NYC team I was working with had us all sit through an hour-long video-conference brainstorm on how to cope with it. Our “Ivory Tower” (as the other developers had started calling it) mostly ran without any sort of application-level caching; heavy database queries were cached in memory tables or views. The problem was, we could not rely on the database servers anymore, since the two other monoliths were drowning them with useless queries despite their extensive use of caching (another example of lack of control). A solution had to be found to cache vast amounts of denormalized data, so as not to rely on the database anymore. As explained earlier, all queries went through a domain-specific table data gateway. The solution was fairly simple: transparently introduce caching in the gateway classes. The new process went as follows.

Simply put, the table gateway would query the cache façade. If there was a hit, that value was returned. If there was a miss, the value would be fetched from the database, inserted into the cache through the façade and returned. Furthermore, since consistency was a requirement, the cache had to be invalidated explicitly whenever a change happened in the source of truth, i.e. the relational database. This was also solved easily; since all CRUD was handled by the same gateway, keeping track of keys was trivial. Implementing this far-from-perfect but reasonable solution took one (long) day and it was up and running in no time, with very satisfying results. Since our monolith was not legacy, only minor defects appeared in production, because we could test our changes along the way. Liberated from the task of keeping our monolith alive, we could get back to working on money-earning features.
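Continuing the sketch above (still with invented names, and assuming the façade returns false on a miss), the transparent caching amounts to a read-through and an explicit invalidation inside the gateway:

class MessageTableGateway
{
    private $dbal;
    private $cache; // cache façade, e.g. a thin wrapper around Memcached

    public function __construct(DatabaseAbstractionLayer $dbal, CacheFacade $cache) {
        $this->dbal = $dbal;
        $this->cache = $cache;
    }

    public function findByRecipient($recipientId) {
        $key = 'messages.recipient.' . $recipientId;

        $rows = $this->cache->get($key);
        if ($rows !== false) {
            return $rows; // hit: the database is never touched
        }

        // miss: fall back to the database, then populate the cache
        $rows = $this->dbal->fetchAll(
            'SELECT * FROM messages WHERE recipient_id = ?',
            array($recipientId)
        );
        $this->cache->set($key, $rows);

        return $rows;
    }

    public function insert(array $row) {
        $this->dbal->insert('messages', $row);

        // the source of truth changed: invalidate the affected key
        // explicitly so readers never see stale data
        $this->cache->delete('messages.recipient.' . $row['recipient_id']);
    }
}

Since every caller already went through the gateway, none of the transaction scripts had to change.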

At the same time, the two other teams were struggling. After a week of agony, they finally had some results which promised to alleviate the performance problems, but it would take them another week to implement: they had to go through hundreds of weirdly interacting, hidden, and almost forgotten transaction scripts to implement Memcached calls. At the end of the second week, they pushed their changes to the production web site and other inexplicable interactions started to emerge. By the end of the third week, a miracle happened: they could handle all the traffic. Needless to say, by then a lot of buckets of gold had been lost.

What is complexity?

Now enlightened by this experience, I became interested in defining and measuring complexity. My first impression, after the persistence meeting three weeks before the troubles began, was that our solution was the more complex one. In the end, our code was much simpler to modify. On the other hand, the two other teams’ approach seemed simple at first, yet their code was much more complex to modify. Perceived complexity is just that: a perception. Measured, defined and experienced complexity, on the other hand, is a science. Measurements confirmed that my first impression was false: cyclomatic complexity per method with the “classic” approach was twice as high as that of our approach, while cyclomatic complexity per module was 5 to 10 times higher. Experience also confirmed that my first impression was false: perceived simplicity does not mean meaningful simplicity, nor valuable simplicity.

Under the veil of complexity avoidance and performance tuning, the two other teams were actually running a shady spaghetti-cooking operation. In the end, three weeks of bad performance and unavailability scared away VC money and the project ran out of funds, just as in Martin’s story. The sad thing is, our profession does not seem to learn from its past mistakes.

What is quality?

If quality is not the antithesis of performance, then achieving quality should in itself be one of the ultimate goals of every developer, since it provides a great deal of value to the business, especially when performance is an issue. But in the end, how can we define quality? Business users tend to measure quality in terms of user experience, i.e. external quality, which is their specialty and responsibility. Our responsibility, on the other hand, is to measure and increase internal quality [Beck 2000].

I personally measure internal quality as the ease and speed with which a team of competent developers can respond to changing business needs. Lightning fast is how business is done in 2011 and how it will be done for the foreseeable future. For one thing, its velocity increases as you are reading this. Startups from emerging markets are eager to take a bite out of your market share. The software developers of recently established high-tech companies provide an insane amount of value to their business by using cutting-edge software engineering tools and practices. Managers expect more with less because that is what the market expects from them. The definition above is intrinsically linked to the needs of the business, not to some Ivory Tower elitism. Remember who pays you and why. Make every penny of your salary worth it. Making your code hard to maintain is not a fast track to job security but merely a fool-proof way of making your employer lose money, which in the end lowers the incentive to have you on board in the first place.

An invaluable lesson

I was once given an invaluable lesson by a friend and mentor: being a good programmer means delivering value in the short, medium and long run. Internal quality is the measure of long-term value, since it makes it easier to respond to changing business needs and to increase profitability. Quality helped my team respond to an urgent business problem at almost no cost and with very little disturbance. Inferiority dragged the two other teams through the mud and, in the end, probably drowned the business altogether.

Performance seems to be an easy excuse for poor design. I’ll be honest with you: I have yet to encounter a website performance problem that absolutely, without a single doubt, must lead to poor internal quality. In the rare cases where good internal quality cannot be achieved with PHP, you owe it to yourself and to your company to move to some other, more suitable technology instead of butchering your way through hundreds of thousands of lines of unmaintainable code.

References and further reading:

  • Beck, Kent. Extreme Programming Explained: Embrace Change, 2000.
  • Feathers, Michael. Working Effectively with Legacy Code, 2004.
  • Martin, Robert C. Clean Code: A Handbook of Agile Software Craftsmanship, 2008.


Doctrine 1 – Three ways to get a record’s attributes

Working with Doctrine 1 can seriously wreck your nerves if your objects form a complex graph and come in gargantuan quantities. I recently had to fight my way through the meanders of D1 to bring performance back to acceptable levels. After spending about an hour trying to optimize a rather simple routine working with about 10 000 objects, I realised, thanks to Xdebug’s profiler, that one of the bottlenecks was the magic getter method on my Doctrine records. This seemed strange to me, so I investigated further.

One of my first stops was the Doctrine_Record class. Click here [github] to see the source: the culprit is around line 1336 as of 1ccee085f49ab0be17ee. A lot of shady, mostly unnecessary (for my use case) stuff happens in the background each time a property is accessed. Multiply this by a few thousand accesses and you get a mess.

[Image: screaming lady, just like me]

Analyzing the code and the call graph of the get() method, I came to the conclusion that my first performance killer was unexpected lazy loading. I tried as hard as I could to explicitly join all the dependencies in my initial DQL query, but my objects had many, many associations, so I always ended up losing track and forgetting a handful. Since lazy loading is, by definition, very stealthy, it was hard for me to track down all the occurrences. All I knew was that each user query generated about one hundred database queries. My redemption came when I discovered that explicitly using $record->get('association', false) instead of $record->association makes Doctrine squeal if the data is missing, since you are trying to access a property that was not fetched and that you explicitly prohibited from lazy loading (hence the false parameter). Changing all the occurrences of direct accesses and getters on my records to use the get() method explicitly took time, but it was definitely worth it, since it gave me absolute control over what was loaded and when. I could then go and fix the missing explicit joins.

[Image: I still was not satisfied (my hair actually looks like that, sometimes)]

Although the gain in execution speed was humongous, I still was not satisfied. Whilst researching this issue, it became evident that the get() method was doing way too much, at least compared to what I expected and needed it to do. The enormous amount of overhead it caused made it one of the bottlenecks of my code. It turns out most of that overhead is avoidable when dealing with properties that you do not want lazy-loaded anyway; one simply has to use the rawGet() method instead. NULL will be returned if the property has not been fetched; no questions asked, no relationships verified, no accessor/mutator poking and, most importantly, no lazy loading. As rawGet() cannot be used to access related objects, its usage is limited to properties. Scrolling through my code for hours, replacing implicit gets with explicit rawGets, eventually paid off; the speed increase was about 20%.

Oh well, I just spared you some trouble, didn’t I? Here is a succinct summary of the methods available to access properties and associations (a short illustration follows the list):

  • $record->property (through PHP’s magic __get method), $record->get('property'): will lazy load, works for properties and relations whatever the state of the object is.
  • $record->get('property', false): will not lazy load, works for properties and relations. If your object’s data was only partially fetched and/or your related records were not explicitly joined, this will fail.
  • $record->rawGet('property'): works only for properties. Will not lazy load.
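To illustrate, assuming a hypothetical User record with a Profile relation and a name column (none of which come from my actual project):

$user = Doctrine_Core::getTable('User')->find(1);

// 1. Magic accessor / get(): will silently lazy load Profile with an
//    extra query if it was not joined in the original DQL query.
$profile = $user->Profile;
$profile = $user->get('Profile');

// 2. get() with $load = false: never lazy loads. If Profile was not
//    explicitly joined, this fails loudly instead of quietly hitting
//    the database.
$profile = $user->get('Profile', false);

// 3. rawGet(): properties only, no relations, no accessors, no lazy
//    loading. Returns NULL if the column was not fetched.
$name = $user->rawGet('name');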

This highlights some of the weak points of the Active Record pattern. First, if your domain is relatively complex, chances are you’ll end up crying in the foetal position under your desk, contemplating the big ball of mud you’ve just created. I eventually ended up with DQL queries joining 20 different records and had to further abstract the data fetching into a repository to keep my code sane; so much for simplicity. Second, the overhead can sometimes be unbearable. Just try to var_dump a Doctrine_Record derivative and you’ll see what I mean: pages and pages of intertwined classes interacting with your record in ways you can’t imagine. Again, this ends up making your life miserable if you have a relatively complex domain. As Fowler puts it so well in P of EAA, “Active Record is a good choice for domain logic that isn’t too complex, such as creates, reads, updates and deletes [CRUD]”.

Hence my suggestion: if you have a complex domain and still want to use PHP, try Doctrine 2 and you’ll never go back to Active Record. It implements the Data Mapper, Repository and Entity Manager patterns and lets you manage your objects whilst it concentrates on doing what it does best: persistence.


In Soviet Russia, the trunk breaks you

I was kind of skeptical about all the hype surrounding Git (I know, I must be the last geek to join the bandwagon). And then it struck me: everyone says it’s awesome, so maybe I should teach myself more than just cloning Git repositories? I was very pleased with my time investment; let me say my heart went pitter-patter and tears of geek joy wrestled their way down my cheeks when I got a real measure of the power of this beast.

One of the immediate advantages is that I can now commit broken code to the trunk! I’m often forced to do just that when I actually want to leave my desk at work, come back home and enjoy a few hours of programming from my living room whilst watching some Greek tragedy unfold (America’s Next Top Model, Rock of Love, A Shot at Love with Tila Tequila, etc.) and enjoying a slice of pizza.

[Image: I'm in your repository, breaking your trunk]

Git is 100% distributed, which means you travel with a copy of the whole repository and push to an external repository when need be. I’ll never break the trunk again, I promise!

