CSS woes

Implementing CSS support for the Page Analyzer

Our new web page analyzer is getting more popular by the day. Just after its release we even had a call from a well-known Internet organization that wanted to license it for internal use, to gather general statistics about web site performance. They were impressed with the analyzer, but quickly noticed a flaw: it would always load every resource referred to in a CSS file, whether or not that resource was needed to render the web page in question.

Many sites today use a single CSS file for the whole site. That CSS will contain references to, for example, various images used as backgrounds on different pages of the site. However, a web browser that loads the CSS file will parse it and realize that while the background image a.jpg is needed on the current page, another background image, b.jpg, is not. It is used on some other page on the site, but not on the page that is currently being loaded and rendered. The browser will see this and refrain from loading b.jpg.
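To make the browser behavior concrete, here is a toy sketch in Python. It is not the analyzer's code: the stylesheet, the element dictionaries and the helper names (`matches`, `needed_urls`) are all made up for illustration, and it only understands simple selectors made of a tag, an id and classes. The point is just that a url() reference is collected only when its rule matches something on the page.

```python
import re

# Hypothetical, simplified site-wide stylesheet.
CSS = """
body.home    { background: url(a.jpg); }
body.contact { background: url("b.jpg"); }
#logo        { background-image: url('logo.png'); }
"""

RULE_RE = re.compile(r"([^{}]+)\{([^{}]*)\}")
URL_RE = re.compile(r"""url\(\s*['"]?([^'")]+?)['"]?\s*\)""")
SIMPLE_SELECTOR_RE = re.compile(r"^([a-zA-Z][\w-]*)?((?:[.#][\w-]+)*)$")


def matches(selector, element):
    """Return True if a simple selector (tag, #id, .class) matches an element.

    `element` is a dict with the keys "tag", "id" and "classes".
    Combinators, attribute selectors and pseudo-classes are not handled;
    such selectors are treated as non-matching.
    """
    selector = selector.strip()
    m = SIMPLE_SELECTOR_RE.match(selector)
    if not selector or not m:
        return False
    tag, rest = m.groups()
    if tag and tag.lower() != element["tag"]:
        return False
    for kind, name in re.findall(r"([.#])([\w-]+)", rest):
        if kind == "#" and name != element.get("id"):
            return False
        if kind == "." and name not in element.get("classes", ()):
            return False
    return True


def needed_urls(css, elements):
    """Collect url() references only from rules that match some element."""
    urls = []
    for selector_list, body in RULE_RE.findall(css):
        selectors = selector_list.split(",")
        if any(matches(s, e) for s in selectors for e in elements):
            urls.extend(URL_RE.findall(body))
    return urls


# A page with <body class="home"> and <div id="logo">.
page = [
    {"tag": "body", "id": None, "classes": ["home"]},
    {"tag": "div", "id": "logo", "classes": []},
]
print(needed_urls(CSS, page))  # ['a.jpg', 'logo.png']
```

Here b.jpg is never requested, because no element on the page matches `body.contact`. A real implementation also has to deal with the cascade, combinators, media queries and much more.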

What this all means is that our page analyzer would often load a lot more objects (images, etc.) than a real browser would, which of course meant that the total page load time would be longer than you'd experience with a real browser. In short, our browser emulation would not be very accurate in many cases.

An interesting side note here is that no other page analyzer seems to fully support CSS. We have looked around, and all the analyzers we found seem to have this same problem.

We decided to see what we could do about implementing true CSS support in our analyzer, and started by looking at available CSS libraries for Python (the analyzer is written mainly in Python). We found two contenders: CSSutils and CSSEngine. After looking at both, we decided to go with CSSEngine, as its code base seemed to be a better starting point for what we wanted to do. The drawback with that library was that it had not been maintained for the last three years or so.

We cleaned up the code a bit, made it more forgiving so it wouldn't croak and die when it encountered the sometimes horrible mess that constitutes CSS files on the Internet, and made it CSS3-aware (CSS version 3 is fairly new, so the library had no support for it).
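As an illustration of what "forgiving" can mean in practice, here is a minimal sketch of a declaration parser that skips malformed declarations instead of raising an exception, which is also how the CSS specification tells browsers to treat them. The function name `parse_declarations` is hypothetical and this is not code from CSSEngine. Splitting on ";" is a deliberate simplification that would break on values containing semicolons, such as data: URLs.

```python
def parse_declarations(block):
    """Parse 'property: value' pairs from the inside of a CSS rule block.

    Malformed declarations are collected in `skipped` rather than
    aborting the whole parse.
    """
    declarations = {}
    skipped = []
    for chunk in block.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # stray or doubled semicolon
        name, sep, value = chunk.partition(":")
        name, value = name.strip().lower(), value.strip()
        if not sep or not name or not value:
            skipped.append(chunk)  # missing colon, property name or value
            continue
        declarations[name] = value
    return declarations, skipped


good, bad = parse_declarations(
    "color: red; background url(a.jpg); ; width: ; margin:0"
)
print(good)
print(bad)
```

In this example the `color` and `margin` declarations survive, while the declaration with the missing colon and the one with the missing value end up in the skipped list.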

Then we started testing our new CSS parsing capabilities, and found to our disappointment that it was slow. For the sites out there with large CSS files, the parser was really slow. Glacial, in fact. What to do? We started out by using Cython to convert the Python code to C and then compile it. This made it 3-4 times faster, but that was still too slow. Parsing a really large CSS file would take 15 seconds or more, and that wasn't acceptable.

So the hunt for more speed went on. It turned out that the culprit was the CSS parsing, mainly the execution of all the regular expressions scattered through the code. We found Ply, a Python implementation of the popular Unix programs Lex and Yacc. After some more searching we found css-py, a lexer and parser grammar written using Ply. Unfortunately css-py seems to have been unmaintained for over a year, and both the lexer and the parser grammar needed a lot of modification to support our needs: handling broken CSS and CSS3. In the end we used parts of the lexer but had to rewrite the parser grammar.
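To give a feel for why a lex-style tokenizer helps, here is a small sketch that uses only Python's standard `re` module rather than Ply itself. Instead of trying many separate regular expressions against the input, all token patterns are combined into one compiled expression and the input is scanned once. The token names and patterns below are illustrative assumptions, not the css-py lexer and far from a complete CSS grammar.

```python
import re

# Illustrative token set; order matters (URL must come before IDENT).
TOKEN_SPEC = [
    ("COMMENT", r"/\*.*?\*/"),
    ("URL",     r"url\(\s*[^)]*\)"),
    ("STRING",  r'"[^"\n]*"' r"|'[^'\n]*'"),
    ("HASH",    r"#[\w-]+"),
    ("IDENT",   r"-?[A-Za-z_][\w-]*"),
    ("NUMBER",  r"[+-]?\d*\.?\d+(?:%|[A-Za-z]+)?"),
    ("WS",      r"\s+"),
    ("PUNCT",   r"[{}:;,.>+~*()\[\]=!/]"),
    ("BAD",     r"."),  # anything else: reported, never fatal
]

# One master pattern with a named group per token type.
MASTER_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC),
    re.DOTALL,
)


def tokenize(css):
    """Yield (kind, text) pairs in a single pass over the input."""
    for m in MASTER_RE.finditer(css):
        kind = m.lastgroup
        if kind in ("WS", "COMMENT"):
            continue  # not interesting to the parser
        yield kind, m.group()


for token in tokenize("body.home { background: url(a.jpg) #fff; margin: 0 10px }"):
    print(token)
```

The catch-all BAD token means broken input produces a token the parser can choose to skip, rather than stopping the scan. With Ply the token rules are declared differently and a Yacc-style grammar consumes the tokens, but the single-pass scanning idea is the same.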

With the help of Ply, we managed to speed up execution enough that we were happy with the performance. The end result was a parser/cascader that is a mix of code from CSSEngine, code from css-py, and newly written code. It supports the following:

So, while the user agent emulation we had before was very likely better than anything any other analyzer could come up with, it is now better yet. We now emulate the most important performance-affecting characteristics of modern browsers (and other user agents) and can therefore provide a fairly accurate analysis that tells users how fast their page loads in different browsers. This is something that few, if any, other analyzers can claim to do.
