Setting Up and Reading in the Data

The data set used here comes mainly from the Measurement Lab data set on Google BigQuery; only the ASNs corresponding to the client IPs were collected separately, using Team Cymru's IP to ASN mapping. There were 148524 unique client IP addresses in this Measurement Lab data set; for 768 of these IPs, Team Cymru's IP to ASN mapping failed to find the corresponding ASNs.
I tried to find the missing ASN info by scraping a lookup web page, but without success, as I reached my query limit before I could get all the results.

Cleaning up the Data


After the above function is applied to df, the data frame will contain two extra columns, daily_labels and weekly_labels, that will help with graphing the data once it is grouped first by week and then by day.
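For readers who want to reproduce this step, here is a minimal sketch of what such a labelling function might look like. The function name add_time_labels, the input column date, and the label formats are all assumptions of mine, not the original code; only the column names daily_labels and weekly_labels come from the text.

```r
# Hypothetical helper: assumes df has a column `date` holding the test date.
add_time_labels <- function(df) {
  df$date <- as.Date(df$date)
  # Day label: the calendar date as text
  df$daily_labels <- format(df$date, "%Y-%m-%d")
  # Week label: the Monday that starts the week containing each test
  # (%u is the weekday number, with Monday = 1 and Sunday = 7)
  weekday <- as.integer(format(df$date, "%u"))
  week_start <- df$date - (weekday - 1)
  df$weekly_labels <- format(week_start, "%Y-%m-%d")
  df
}

# Made-up rows, for illustration only
df <- data.frame(
  date = c("2010-03-17", "2010-03-21", "2010-03-22"),
  throughput = c(310, 295, 330)
)
df <- add_time_labels(df)
```

Labelling each week by its Monday keeps the labels sortable as plain text, which is convenient when they end up on a plot axis.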

Finding the Daily Throughput

I grouped and summarized the data by day, finding the daily median, mean, and variance of throughput.
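A minimal base R sketch of this kind of daily summary follows. The column names date and throughput, the function name summarise_daily, and the toy values are assumptions for illustration; the original code may well have used a different approach.

```r
# Made-up test results, for illustration only
tests <- data.frame(
  date = as.Date(c("2010-03-15", "2010-03-15", "2010-03-15",
                   "2010-03-16", "2010-03-16")),
  throughput = c(100, 200, 600, 400, 800)
)

# Hypothetical helper: one row per day with median, mean, variance, and count
summarise_daily <- function(df) {
  groups <- split(df$throughput, df$date)
  data.frame(
    date        = as.Date(names(groups)),
    median_kbps = unname(vapply(groups, median, numeric(1))),
    mean_kbps   = unname(vapply(groups, mean, numeric(1))),
    var_kbps    = unname(vapply(groups, var, numeric(1))),
    n_tests     = unname(vapply(groups, length, integer(1)))
  )
}

daily <- summarise_daily(tests)
```

Keeping the number of tests per day alongside the summaries is handy later, since it is needed for the comparison between throughput and test counts.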

Plotting the Daily Summaries of the Throughput Data

And finally, using ggplot to create the plot:



My first question, after seeing the "Daily Median of Throughput in Kbps & Number of Tests per Day in Egypt" graph, was: is there a correlation between throughput and the number of tests? It turns out that there isn’t much:
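One way to run this kind of check is with cor() from base R. The sketch below uses made-up numbers and assumed column names (median_kbps, n_tests), so it only shows the mechanics, not the actual result for the Egypt data.

```r
# Made-up daily summaries, for illustration only
daily <- data.frame(
  median_kbps = c(310, 295, 330, 305, 320, 300),
  n_tests     = c(40, 55, 38, 61, 47, 52)
)

# Pearson correlation measures linear association
r <- cor(daily$median_kbps, daily$n_tests)

# Spearman rank correlation is less sensitive to outliers and skew
rho <- cor(daily$median_kbps, daily$n_tests, method = "spearman")
```

Both coefficients lie between -1 and 1; values near 0 would be consistent with the "not much correlation" observation.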

And just to rule out some kind of lagged correlation, I used the sample cross-correlation function (CCF), which helps identify lags of the x-variable that might be useful predictors of y.
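To make the lag convention concrete, here is a small sketch with simulated data (not the Measurement Lab data) in which y is deliberately built to follow x by two steps. In R, the lag k reported by ccf(x, y) estimates the correlation between x[t+k] and y[t], so an x that leads y shows up at a negative lag.

```r
set.seed(1)
n <- 100

# Simulated series: y copies x from two steps earlier, plus a little noise
x <- rnorm(n)
y <- c(rnorm(2), x[1:(n - 2)]) + rnorm(n, sd = 0.1)

# Sample cross-correlation for lags -10..10, without drawing the plot
cc <- ccf(x, y, lag.max = 10, plot = FALSE)

# Lag with the largest absolute cross-correlation
best_lag <- cc$lag[which.max(abs(cc$acf))]
```

With real data, the interesting question is whether any lag stands clearly outside the confidence band that ccf() draws when plot = TRUE.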



Grouping Data by ASN and Days


And below is the new data frame. The problem is that the rows within each group might not be in the correct order, and there isn't a universal scale (besides the dates) that can be used for plotting. These problems are addressed by sorting the data frame and adding a new column, day, to it.
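A minimal sketch of the sorting and the day column, under my own assumptions: the data frame is called daily_by_asn, it has columns asn and date, and day counts days from the earliest date in the whole data set. The ASNs used are from the private-use range and the values are made up.

```r
# Made-up rows in scrambled order, for illustration only
daily_by_asn <- data.frame(
  asn  = c(64512, 64513, 64512, 64513),
  date = as.Date(c("2010-03-17", "2010-03-15", "2010-03-15", "2010-03-16")),
  median_kbps = c(320, 290, 310, 300)
)

# Sort by ASN first, then chronologically within each ASN
daily_by_asn <- daily_by_asn[order(daily_by_asn$asn, daily_by_asn$date), ]
rownames(daily_by_asn) <- NULL

# Shared numeric scale: day 1 is the earliest date across all ASNs
first_date <- min(daily_by_asn$date)
daily_by_asn$day <- as.numeric(
  difftime(daily_by_asn$date, first_date, units = "days")
) + 1
```

Counting from a single global start date, rather than per ASN, means the same day value refers to the same calendar date in every group.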

The first six rows of this final form of the data frame are shown above. Now we are almost ready to graph, except that the data for some of the ASNs is very sparse: they have only a few data points. In order to have a cleaner, simpler, and less misleading graph, the ASNs with fewer than 40 daily data points are eliminated from the data set.
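The filtering could be done along these lines. The name filtered_daily_by_asn and the threshold of 40 come from the text; the input name daily_by_asn, the column names, and the toy group sizes are assumptions.

```r
# Made-up data: three ASNs with 45, 40, and 12 daily data points
daily_by_asn <- data.frame(
  asn = rep(c(64512, 64513, 64514), times = c(45, 40, 12)),
  day = c(1:45, 1:40, 1:12)
)

min_points <- 40

# Count daily data points per ASN and keep those with at least min_points
counts <- table(daily_by_asn$asn)
keep <- names(counts)[counts >= min_points]

filtered_daily_by_asn <-
  daily_by_asn[as.character(daily_by_asn$asn) %in% keep, ]
```

In this toy example the third ASN, with only 12 points, is dropped, while the one with exactly 40 points is kept.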

And now let's check the group size of each of the ASNs within the newly created data frame filtered_daily_by_asn. They had all better be at least 40!
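A quick check of this kind can be done with table(). The sketch below uses a made-up stand-in for filtered_daily_by_asn with an assumed asn column.

```r
# Made-up stand-in for the filtered data frame
filtered_daily_by_asn <- data.frame(
  asn = rep(c(64512, 64513), times = c(45, 40))
)

# Number of daily data points per ASN
group_sizes <- table(filtered_daily_by_asn$asn)
print(group_sizes)

# TRUE only if every remaining ASN has at least 40 data points
all_large_enough <- all(group_sizes >= 40)
```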

And they are! Yay.

Now we are almost ready to graph, but let's "prettify" the data a bit. Let's start with adding the prefix "AS" to all the AS numbers in the data frame:
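Adding the prefix is a one-liner, but there is a small trap worth sketching: when numbers stored as doubles are converted to text implicitly, round values such as 100000 can come out in scientific notation. The helper name add_as_prefix is hypothetical, and the numbers are arbitrary examples.

```r
# Hypothetical helper: formats each number without scientific notation
add_as_prefix <- function(asn) {
  sprintf("AS%.0f", as.numeric(asn))
}

asn <- c(64512, 100000)

# Implicit conversion can produce labels like "AS1e+05" for round numbers
naive <- paste0("AS", asn)

# Explicit formatting keeps every digit
labels <- add_as_prefix(asn)
```

Once the prefix is added the column is text, so any later lookup table needs to use the same "AS..." form for its keys.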

Next, in order to have a more informative graph, let's identify the ASNs not just by number but by name as well. Here is the list linking each AS number to its name:

The above information can be found on Hurricane Electric's website.

Now, to insert this new information into our data frame:
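One way to attach the names is a lookup with match(), sketched below. The column name ASN_desc comes from the text; the lookup table asn_names, the asn column, and the placeholder names and numbers are invented for illustration and are not the real Egyptian ASNs.

```r
# Placeholder lookup table (not real ASN names)
asn_names <- data.frame(
  asn      = c("AS64512", "AS64513"),
  ASN_desc = c("Example ISP A", "Example ISP B"),
  stringsAsFactors = FALSE
)

# Made-up stand-in for the filtered data frame
filtered_daily_by_asn <- data.frame(
  asn = c("AS64512", "AS64513", "AS64512", "AS64514"),
  day = c(1, 1, 2, 1),
  stringsAsFactors = FALSE
)

# match() returns, for each row, the position of its ASN in the lookup table
idx <- match(filtered_daily_by_asn$asn, asn_names$asn)
filtered_daily_by_asn$ASN_desc <- asn_names$ASN_desc[idx]
```

Unlike merge(), which sorts the result by the key column by default, this keeps the existing row order, and any ASN missing from the lookup table simply gets NA as its description.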

And, just to be sure that all has gone well, let's look at the first few lines of the data frame:

And. D'oh! The column of interest is not shown. Oh well, let's move on. We'll find out sooner or later whether the ASN_desc column contains the information it should ...

Plotting Daily Throughput grouped by ASN






Egypt's ASNs and Throughput During September 2009 - December 2011
