An SSL certificate crawler and some CA statistics
There have been several articles presenting statistics about SSL usage on Web sites, e.g. the EFF SSL Observatory, the "TLS Prober" from Opera, or the recent SSL Pulse. This article introduces yet another SSL crawl, performed with a crawler written using Qt. Please see below for instructions on how to get the code and run it yourself.
The findings presented here result from a crawl of the Alexa top 100 000 sites (see section "Mode of operation" below for details), which found almost 60 000 certificates in a little less than 8 hours. The resulting .csv file can be downloaded from http://www.sendspace.com/file/g295kx.
For this article, the SSL crawler has been used to find out which Certificate Authorities (CAs) are used most on the Web. As can be seen in the image below, VeriSign is the biggest CA in terms of the number of web site certificates that chain up to it (among the Alexa top 100 000 sites): 13.8% of web sites have a VeriSign certificate; the runner-up is AddTrust AB with 10.2%. The next in line are not far off: ValCert with 8.1%, GeoTrust with 8.0% and Equifax with 7.9%.
The diagrams here do not distinguish between valid trusted chains (by default, Qt trusts the root certificates shipped by the distribution at /etc/ssl/certs on Linux) and untrusted certificates (e.g. self-signed ones, incomplete chains, or certificates chaining to an unknown root). This might explain the strange values "SomeOrganization" and "none" in the Organization attribute of the root certificate subject (together 5.1%), which seem to come from self-signed certificates: those values probably stem from a default certificate shipping with the Apache Web Server or similar, which many sites simply do not bother to change.
A possible interpretation of the big bar labeled "(rest)" goes along the same lines: these organizations can be partitioned into smaller CAs and a considerable amount of junk values; a closer look at those might be taken in the next iteration of the SSL crawl.
Keeping the junk certificates in mind and considering the percentages presented earlier again, the first five organizations (VeriSign, AddTrust, ValCert, GeoTrust and Equifax) make up almost half of all certificate organizations (48%); if the untrusted certificates were excluded, this percentage would of course be a lot higher (the author's wild guess: roughly 75%).
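Tallying these per-organization percentages from the crawler's .csv output is a simple counting exercise. The following Python sketch illustrates the idea; the column name root_org is an assumption, so adjust it to whatever header the actual CSV output uses.

```python
import csv
from collections import Counter

def top_root_organizations(csv_path, column="root_org", n=5):
    """Tally the Organization attribute of root-certificate subjects
    from a crawl CSV and return the top n as (org, percentage) pairs.

    The column name "root_org" is an assumption; adjust it to match
    the real CSV header produced by the crawler.
    """
    counts = Counter()
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            # Empty Organization values are folded into "none",
            # matching the junk values discussed above.
            counts[row.get(column, "") or "none"] += 1
    total = sum(counts.values())
    return [(org, 100.0 * c / total) for org, c in counts.most_common(n)]
```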
Another interesting question is which country these organizations actually come from. But first, let us take a look at where the sites themselves come from, or at least their certificates: many "site" certificates (in this article, a certificate at the end of the chain containing a DNS name or IP address in its Subject Common Name attribute) contain a Country field alongside the Common Name attribute, which denotes where the web site is located.
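As an aside, these Subject attributes (Common Name, Country) are easy to inspect outside the Qt crawler as well. This small Python sketch, illustrative only, flattens the nested subject structure that the standard library's ssl.getpeercert() returns into a plain dict:

```python
def subject_fields(subject):
    """Flatten a certificate subject as returned by
    ssl.getpeercert()['subject'] (a tuple of RDNs, each a tuple of
    (name, value) pairs) into a dict for easy lookup of attributes
    like commonName and countryName."""
    return {name: value for rdn in subject for (name, value) in rdn}
```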
As can be seen in the diagram below, and not surprisingly, most sites are located in the US. The empty string as the second most common value casts doubt on how reliable this method is, and the next values in the list, Japan, Germany and Great Britain, seem a bit surprising. China, for instance, is far behind with fewer than 800 certificates (Japan: 2808). So again, this diagram is more a rough hint at where Web sites come from than a reliable source.
More interesting might be where the organizations that own the root certificates are located. The diagram below shows that more than half of all root certificates are from the US (56%). Ignoring empty strings and other non-decipherable values ("--"), the other countries that site certificates chain up to seem quite surprising: Sweden (SE), South Africa (ZA) and Belgium (BE) are high up in the list. However, this is simply because big Certificate Authorities are located in those countries, namely AddTrust AB (Sweden), Thawte (South Africa) and GlobalSign (Belgium). All other top 10 Certificate Authorities are located in the US.
2. Mode of operation
The crawler works as follows: for every site listed in a text file (obtained from the Alexa Top Sites list), it tries to fetch the contents of its https:// URL; if that works, it parses the result, i.e. the HTML body, for links to SSL sites and crawls those as well. If the original site is not available over HTTPS, the crawler fetches the contents of its plain http:// URL and likewise searches the result for more https:// URLs.
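The link-harvesting step of this strategy can be sketched in a few lines. The actual crawler is written with Qt, so the Python below is only an illustration of the idea, not the real implementation:

```python
from html.parser import HTMLParser

class HttpsLinkParser(HTMLParser):
    """Collect all https:// links from an HTML body, mirroring the
    crawler's approach of following SSL links found on each page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.startswith("https://"):
                    self.links.append(value)

def extract_https_links(html):
    """Return the https:// URLs linked from an HTML document."""
    parser = HttpsLinkParser()
    parser.feed(html)
    return parser.links
```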
3. Plans for the future
The diagrams above are just a glimpse of what can be crawled, and work on the code is still in progress. Here are some ideas for the future; more ideas and code contributions are of course very much welcome.
crawl Alexa top million sites
store certificates in a database for later processing instead of parsing them on the fly
check for weak hash functions (MD5) and low public key sizes
try different SSL versions: check for old SSL versions (SSL 2), and sites supporting TLS 1.1 and 1.2
distinguish between properly validating certs and others (self-signed, incomplete chain, untrusted root etc.)
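The idea of trying different SSL/TLS versions could be prototyped outside Qt. As a sketch under stated assumptions, Python's ssl module can pin a handshake to a single protocol version; note that ssl.TLSVersion only covers TLS 1.0 through 1.3, so it cannot probe the ancient SSL 2 mentioned above, and host/port below are placeholders:

```python
import socket
import ssl

def pinned_context(version):
    """Build a client context that accepts exactly one TLS version."""
    ctx = ssl.create_default_context()
    ctx.minimum_version = version
    ctx.maximum_version = version
    return ctx

def supports_tls_version(host, version, port=443, timeout=5.0):
    """Return True if the server completes a handshake when the
    client is pinned to the given TLS version."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with pinned_context(version).wrap_socket(sock, server_hostname=host):
                return True
    except (ssl.SSLError, OSError):
        return False
```

For example, supports_tls_version("example.com", ssl.TLSVersion.TLSv1_2) would report whether that host accepts a TLS 1.2-only client.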
4. Source code
The code has been tested on Kubuntu Linux 12.04; it might well work on Mac and Windows, but has never been tested there. Feel free to contribute code for those platforms.
Getting the code
The code is freely available on GitHub under the LGPL; to check it out, run
git clone https://github.com/peter-ha/qt-ssl-crawl.git
Building the code
To build the code, a recent version of Qt 5 is needed. Please note that the Qt 5 Alpha is not recent enough to run the code, as it is missing an important commit. Because of source-incompatible API changes, the code does not currently work with Qt 4, but support would be easy to add.
Running the code
The code can be run like this, for example:
./qt-ssl-crawl 1 100 > top-100-sites.csv
This will crawl sites 1 to 100 from the Alexa top sites list, which the crawler will tell you to download before it runs. For the statistics presented earlier, the crawler was invoked with "./qt-ssl-crawl 1 10000 > crawl-output.csv". Right now, plenty of debug information is printed to the screen, so it is wise to redirect the standard output to a file. Be aware that the crawler waits until the very end before it writes its results to the file (see also the list of things to improve above).
This article has presented a free SSL crawler, along with some results from its first usage. Please leave feedback if you have found this interesting, or have suggestions on what to do better or what to crawl next.