Mysterious Glogotypes

Something strange about Google's logotype

Google's front page must be the most frequently loaded webpage on the Internet today. With Google's heavy focus on performance, one would suspect that this page has been obsessively optimized. If your page is loaded hundreds of millions of times per day, there is real money to be saved by eliminating a single byte that needs transferring, and the small difference in user retention that a microscopic speedup in page load time can bring is also worth a lot of trouble. Marissa Mayer said some interesting things about how important page load time is for Google in her 2007 presentation at the Seattle Conference on Scalability. It is a 1-hour presentation but well worth watching. If you don't feel like watching a long video presentation, you can check out Steve Souders' recent article "Business impact of high performance".

Anyway, when testing our spanking new web page analyzer recently, we noticed that when we analyzed www.google.com and had our analyzer emulate different user agents, the Google logotype image on that page came in different versions, depending on what user agent we pretended to be. Nothing really surprising about this fact. Older web browsers don't support new image formats, for example, so sending the logotype as a GIF image makes sense when Google detects that an older (or unknown) browser is used, while a PNG image might be appropriate for a newer browser.
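
For readers who want to try the same experiment, here is a minimal sketch of user agent emulation using only Python's standard library. It is not the analyzer's actual code. The `USER_AGENTS` table, the `build_request` helper and the "ExampleAnalyzer/1.0" string are names we made up for illustration, and the Firefox string is only a typical example of what that browser sends.

```python
import urllib.request

# Illustrative User-Agent strings (assumptions, not the analyzer's real values).
USER_AGENTS = {
    "firefox-3.5": ("Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; "
                    "rv:1.9.1) Gecko/20090624 Firefox/3.5"),
    "unknown": "ExampleAnalyzer/1.0",
}

def build_request(url, agent_key):
    """Build a GET request that identifies itself as the chosen user agent."""
    headers = {"User-Agent": USER_AGENTS[agent_key]}
    return urllib.request.Request(url, headers=headers)

for key in USER_AGENTS:
    req = build_request("http://www.google.com/", key)
    # Passing req to urllib.request.urlopen() would fetch the page; the HTML
    # that comes back can then be compared between the different user agents.
    print(key, "->", req.get_header("User-agent"))
```

Comparing the image references in the HTML returned for each user agent is enough to see whether the server varies the logotype.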

But, why on earth chop up the GIF logotype into four different pictures?

To someone who isn't interested in performance issues, this may seem like the most boring non-question since... well, ever actually. But to the rest of us it is a bit perplexing. Or is it? You be the judge. This is what you get when you load www.google.com with Firefox 3.5.

The image is clickable if you want to view the result interactively and learn more. It shows the page load diagram for www.google.com in our page analyzer, emulating Firefox 3.5 as the user agent. The first thing that happens is that we get redirected to www.google.se, because our request originated from an IP address Google correctly identified as Swedish. Then we load www.google.se/ (the HTML) and finally the Google logotype, which is called logo_plain.png and is 7.4 KB in size. A single PNG image in this case. You can see this PNG being rendered in the bottom right corner of the screenshot.

 

Now, if we try and change our user agent emulation and instead tell Google who we really are (the "Load Impact Page Analyzer"), we get something different:

Look at this - Google sends us the logotype as a GIF, chopped up into four different images. The images are called hp0.gif, hp1.gif, hp2.gif and hp3.gif. On the screenshot below the hp0.gif image is shown on the lower right. You can see that it is the "Goo" part of the Google logo.

 

Why do they do this? Compatibility-wise it makes sense to send GIF images when you don't recognize the user agent, because we might be using some old browser that can't support e.g. PNG images. GIF is probably the most widely supported image file format in existence. But why not send a single GIF?  Why the four different parts?

Is it performance-related? Using more than one TCP connection to fetch objects can result in performance gains if there are many objects to fetch, as more objects can be requested concurrently. With HTTP pipelining this might be less of an issue, but older browsers that can only speak HTTP/1.0 do not support pipelining. On the other hand, the total number of objects on the Google front page is small. The objects themselves are small in size too. Old browsers might not use more than two concurrent connections. Maybe they will only use one. In that case, dividing an image into several parts is most likely bad for performance.
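
To put rough numbers on that reasoning, here is a back-of-the-envelope sketch. It assumes that every object is small enough for its fetch time to be dominated by one network round trip, that there is no pipelining, and that connection setup cost can be ignored. The function name `estimated_fetch_time` and the 100 ms round-trip time are our own illustrative choices, not measurements.

```python
import math

def estimated_fetch_time(num_objects, connections, rtt_ms):
    """Rough time (ms) to fetch small objects without pipelining.

    Each connection carries one request per round trip, so the objects
    are fetched in ceil(num_objects / connections) sequential waves.
    """
    if num_objects == 0:
        return 0.0
    waves = math.ceil(num_objects / connections)
    return float(waves * rtt_ms)

RTT_MS = 100  # assumed round-trip time, for illustration only

for connections in (1, 2, 4):
    single = estimated_fetch_time(1, connections, RTT_MS)
    split = estimated_fetch_time(4, connections, RTT_MS)
    print(f"{connections} connection(s): single image {single} ms, "
          f"four pieces {split} ms")
```

In this simplified model the four-piece logo is never faster than the single image. It only breaks even once the browser opens four connections, and it is four times slower on a single connection.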

Also, if it makes sense from a performance, or any other, perspective to divide the logo into four separate images, why isn't logo_plain.png delivered as four separate images too, when we use Firefox 3.5 to retrieve the page? Firefox 3.5 is definitely multi-connection capable - actually, one could call it a multi-connection fetishist, just like most modern browsers. The same question applies if the split has something to do with Google's need to frequently change its logo (i.e. if it wants part of it to be cached on the client side, while still being able to alter other parts of it). Why not do that for all modern browsers too?

Then we started testing other Google sites. google.com (without the redirect to google.se), google.cn and google.jp, for instance, all deliver a single GIF image rather than a PNG image, no matter what browser emulation we use. Why is that? Why do we get PNGs here in Sweden, while Americans and Asians get GIFs?

We asked a very technically knowledgeable Google employee about all this, but he had no clue as to the reason, so now we're down to guessing. There must be some kind of strategy behind the decision to serve certain browsers a 4-GIF version of the logo, others a single GIF and yet others a PNG, but what is it? If it had been any site but Google, we would have guessed it was an artifact of how the different logotypes (different countries and regions have different Google logotypes on their localized Google sites) are manufactured, but considering how much effort Google spends on performance optimization, it just doesn't seem likely that this is the cause here. It should not be all that difficult to convert all logotype images to the same format, if it made sense from a performance and compatibility perspective.

Can someone who knows please help shed some light on this mystery?

 

CSS woes

Implementing CSS support for the Page Analyzer

Our new web page analyzer is getting more popular by the day. Just after its release we even had a call from a well-known Internet organization who wanted to license it for their internal use, to gather general statistics about web site performance. They were impressed with the analyzer, but quickly noticed that it had a flaw - it would always load all resources that were referred to in a CSS file, no matter if the resources were needed to render the web page in question, or not.

Many sites today have a single CSS file that is used for the whole site. That CSS will contain references to e.g. various images that are used for backgrounds on different pages on the site. However, a web browser that loads the CSS file will parse it and realize that while the background object a.jpg is needed on the current page, another background object b.jpg isn't. It is used on some other page on the site, but not on the page that is currently being loaded and rendered. The web browser will see this and refrain from loading b.jpg.
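
The idea can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the analyzer's implementation: it only understands plain tag, `.class` and `#id` selectors, ignores the cascade, and the names `ElementCollector`, `selector_matches` and `needed_urls` are made up for the example.

```python
import re
from html.parser import HTMLParser

class ElementCollector(HTMLParser):
    """Collect the tag names, ids and classes that occur in an HTML page."""

    def __init__(self):
        super().__init__()
        self.tags, self.ids, self.classes = set(), set(), set()

    def handle_starttag(self, tag, attrs):
        self.tags.add(tag)
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)
            elif name == "class" and value:
                self.classes.update(value.split())

RULE_RE = re.compile(r"([^{}]+)\{([^{}]*)\}")  # "selectors { declarations }"
URL_RE = re.compile(r"url\(\s*['\"]?([^'\")]+)['\"]?\s*\)")

def selector_matches(selector, page):
    """True if a simple selector (tag, .class or #id) matches the page."""
    selector = selector.strip()
    if selector.startswith("#"):
        return selector[1:] in page.ids
    if selector.startswith("."):
        return selector[1:] in page.classes
    return selector.lower() in page.tags

def needed_urls(css, html):
    """Return the url() references of rules that apply to this page."""
    page = ElementCollector()
    page.feed(html)
    urls = []
    for selectors, body in RULE_RE.findall(css):
        if any(selector_matches(s, page) for s in selectors.split(",")):
            urls.extend(u.strip() for u in URL_RE.findall(body))
    return urls

css = "#home { background: url(a.jpg); } #about { background: url('b.jpg'); }"
html = '<html><body><div id="home">Welcome</div></body></html>'
print(needed_urls(css, html))  # ['a.jpg']
```

Only a.jpg is reported, because nothing on the page matches the `#about` rule that refers to b.jpg.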

What this all means is that our page analyzer would often load a lot more objects (images etc) than a real browser would, which of course meant that the total page load time would be longer than you'd experience with a real browser. In short, our browser emulation would not be very accurate in many cases.

An interesting side note here is that no other page analyzers seem to fully support CSS. We have looked around and found that all available analyzers seem to have this same problem.

We decided to see what we could do about implementing true CSS support in our analyzer, and started by looking at available CSS libraries for Python (the analyzer is written mainly in Python). We found two contenders - CSSutils and CSSEngine. After looking at both, we decided to go with CSSEngine as the code base seemed to be a better starting point for what we wanted to do. The drawback with that library was that it had not been maintained for the last 3 years or so.

We cleaned up the code a bit, made it more forgiving so it wouldn't croak and die when it encountered the sometimes horrible mess that constitutes CSS files on the Internet, and made it CSS3-aware (CSS version 3 is fairly new, so the library had no support for it).

Then we started testing our new CSS parsing capabilities, and found to our disappointment that it was slow. For the sites out there with large CSS files, the parser was really slow. Glacial, in fact. What to do? We started out by using Cython to convert the Python code to C and then compile it. This improved speed 3-4 times, but it was still too slow. Parsing a really large CSS file would take 15 seconds or more, and that wasn't acceptable.

So the hunt went on to improve speed. It turned out that the culprit was the CSS parsing, mainly the execution of all the scattered regular expressions. We found Ply, a Python implementation of the popular Unix programs Lex and Yacc. After some more searching we found css-py, a lexer and parser-grammar written using Ply. Unfortunately css-py seems to have been unmaintained for over a year, and both the lexer and parser-grammar needed a lot of modification to support our needs: handling broken CSS and CSS3. In the end we used parts of the lexer but had to rewrite the parser-grammar.
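
To illustrate the difference between scattered regular expressions and a Lex-style lexer, here is a small sketch that uses only Python's standard `re` module. It is neither Ply nor css-py, and the token names and patterns are our own simplified assumptions covering just a fraction of CSS. The point is only that one combined pattern scans the input once, and that unknown characters become `OTHER` tokens instead of aborting the scan.

```python
import re

# Simplified, illustrative token set; order matters (first match wins).
TOKEN_SPEC = [
    ("COMMENT", r"/\*.*?\*/"),
    ("STRING",  r'"[^"]*"|' r"'[^']*'"),
    ("URI",     r"url\([^)]*\)"),
    ("HASH",    r"#[\w-]+"),
    ("IDENT",   r"-?[A-Za-z_][\w-]*"),
    ("NUMBER",  r"\d*\.?\d+(?:%|[A-Za-z]+)?"),
    ("LBRACE",  r"\{"),
    ("RBRACE",  r"\}"),
    ("COLON",   r":"),
    ("SEMI",    r";"),
    ("COMMA",   r","),
    ("WS",      r"\s+"),
    ("OTHER",   r"."),  # catch-all so broken CSS never stops the lexer
]

# One master pattern built from named groups, compiled a single time.
MASTER_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC),
    re.DOTALL,
)

def tokenize(css):
    """Yield (kind, text) pairs, skipping whitespace and comments."""
    for match in MASTER_RE.finditer(css):
        kind = match.lastgroup
        if kind in ("WS", "COMMENT"):
            continue
        yield kind, match.group()

for token in tokenize("a { color: #fff; } /* done */"):
    print(token)
```

A grammar-driven parser then consumes this token stream, which is the part Yacc-style tools such as Ply take care of.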

With the help of Ply, we managed to speed up execution enough that we were happy with the performance. The end result was that the parser/cascader became a mix of code from CSSEngine, css-py and newly written code. It supports the following:

So, while the user agent emulation we had before was very likely better than anything any other analyzer could come up with, it is now better yet. We now emulate the most important performance-affecting characteristics of modern browsers (and other user agents) and can therefore provide a fairly accurate analysis that tells the user how fast his/her page loads in different browsers. This is something that few, if any, other analyzers can claim to be able to do.

 

 

 
