Using team-cymru.org's API

Given a list of unique IP addresses I was interested in finding their corresponding ASNs. This task was made a bit more difficult by the fact that the IP addresses I had were from Egypt and most free IP databases are lacking when it comes to North African countries.

The goal was to find the ASNs of 126136 unique IP addresses.

I started by accessing team-cymru.org's IP database using their whois API, following their instructions as found at http://www.team-cymru.org/IP-ASN-mapping.html#whois.

Here is what I did:
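A minimal sketch of this step, assuming the unique addresses sit in a character vector called userIPs (the name matches the one referenced later in the post; the sample addresses below are made up):

```r
# Assume the unique IP addresses are already in a character vector;
# these sample addresses are made up for illustration.
userIPs <- c("163.121.116.217", "41.233.84.141", "196.218.60.5")

# Write one IP per line to a plain text file.
writeLines(userIPs, "userIPs.txt")
```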

Now, this text file I just created needed a "begin" at the beginning and an "end" at the end, which I added using TextEdit, the Mac's built-in text editor (found via Finder in the Applications folder). The resulting file can be found here.
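For what it's worth, the same edit can be scripted from R instead of done by hand in TextEdit; a sketch, with made-up file contents standing in for the real userIPs.txt:

```r
# Sample file standing in for the real userIPs.txt.
writeLines(c("163.121.116.217", "41.233.84.141"), "userIPs.txt")

# Wrap the IP list between "begin" and "end", as the bulk
# whois interface expects, and write it back out.
ips <- readLines("userIPs.txt")
writeLines(c("begin", ips, "end"), "userIPs.txt")
```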

I then opened a command line window from the directory containing the userIPs.txt file and submitted the file to team-cymru.org's site:
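Team Cymru's instructions use netcat to talk to their bulk whois server on port 43; a command along these lines, using the file names from this post, does the submission (network access required):

```shell
# Submit the IP list to Team Cymru's bulk whois service (port 43)
# and capture the reply; requires netcat and network access.
netcat whois.cymru.com 43 < userIPs.txt > userIPsToASNs.txt
```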

The output from the website was saved into the text file userIPsToASNs.txt. The text file can be found here.

After some trial and error, I ended up reading this resulting text file into R using:
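The bulk whois reply is pipe-delimited ("AS | IP | AS Name"), so a call along these lines works; the exact arguments are a reconstruction rather than the author's original ones, and the inline sample below stands in for the real userIPsToASNs.txt:

```r
# A small inline sample mimicking the pipe-delimited whois reply.
sample_output <- c(
  "24863   | 163.121.116.217  | LINKdotNET-AS",
  "NA      | 10.0.0.1         | NA"
)

# sep = "|" splits on the pipes; fill = TRUE tolerates short rows.
# Note that the surrounding blanks are kept, which explains the
# whitespace cleanup needed later in the post.
asn_table <- read.table(textConnection(sample_output),
                        sep = "|", fill = TRUE,
                        stringsAsFactors = FALSE)
```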

Then I gave the columns names and looked at the results:
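A sketch of that step; the column names ASN, IP, and Organization are taken from later in the post, and the sample rows are made up:

```r
# Made-up sample standing in for the freshly read-in table.
asn_table <- data.frame(
  V1 = c("24863   ", "NA      "),
  V2 = c(" 163.121.116.217  ", " 10.0.0.1 "),
  V3 = c(" LINKdotNET-AS ", " NA"),
  stringsAsFactors = FALSE
)

# Name the columns as in the post, then inspect the first rows.
colnames(asn_table) <- c("ASN", "IP", "Organization")
head(asn_table)
```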

Wow, things don't look so good; there are NAs everywhere.

So, exactly how many of the entries contain NAs?
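One way to count them, on made-up sample data:

```r
# Made-up sample: two of the four ASN lookups came back empty.
asn_table <- data.frame(ASN = c("24863", NA, "8452", NA),
                        stringsAsFactors = FALSE)

# Count and express the missing entries as a fraction of the total.
n_missing <- sum(is.na(asn_table$ASN))
n_missing / nrow(asn_table)
```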

Well, okay, it is not too bad: 768 out of the 126136 IPs, about 0.6%, were not identified.

Aside: I must mention that the text file returned by team-cymru.org contained more rows, and more distinct IP addresses, than the number of entries of userIPs. I don't quite understand why this happened. As far as using the results goes, the extra information caused no trouble, since I used a "natural join" on the IP column to add the additional columns ASN and Organization to an already existing data frame.
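In R, merge() on the shared IP column behaves like a natural join; a sketch with made-up data frames, duplicating one asn_table row to mimic the extra entries mentioned above:

```r
# Made-up existing data frame keyed by IP.
existing <- data.frame(IP   = c("163.121.116.217", "41.233.84.141"),
                       hits = c(10, 3),
                       stringsAsFactors = FALSE)

# Made-up lookup table, with one duplicated row as in the post.
asn_table <- data.frame(IP  = c("163.121.116.217", "41.233.84.141",
                                "41.233.84.141"),
                        ASN = c("24863", "8452", "8452"),
                        Organization = c("LINKdotNET-AS", "TE-AS", "TE-AS"),
                        stringsAsFactors = FALSE)

# merge() on the common IP column is a natural join; dropping
# duplicate rows first keeps the result one row per IP.
joined <- merge(existing, unique(asn_table), by = "IP")
```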

It turned out that the data frame asn_table had a lot of extra spaces everywhere. This made comparing IP addresses impossible, since the string " 163.121.116.217  " does not equal "163.121.116.217". So the first step, before doing anything else, was to clean up the extra blanks:
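A sketch of the cleanup, using base R's trimws() on made-up sample data (trimws() needs R >= 3.2; gsub("^ +| +$", "", x) is an alternative on older versions):

```r
# Made-up sample with the stray blanks described in the post.
asn_table <- data.frame(IP  = c(" 163.121.116.217  ", " 10.0.0.1 "),
                        ASN = c("24863   ", "8452  "),
                        stringsAsFactors = FALSE)

# Strip leading and trailing whitespace from every column.
asn_table[] <- lapply(asn_table, trimws)
```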

Next, I identified and collected the IPs whose ASNs were still unidentified:
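A sketch of that selection, on made-up sample data; the vector name missingInfo_IPs is the one used in the rest of the post:

```r
# Made-up sample: one lookup succeeded, one came back empty.
asn_table <- data.frame(IP  = c("163.121.116.217", "10.0.0.1"),
                        ASN = c("24863", NA),
                        stringsAsFactors = FALSE)

# Keep the IPs whose ASN lookup returned nothing.
missingInfo_IPs <- asn_table$IP[is.na(asn_table$ASN)]
```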

Then I collected the IPs with known ASNs into a data frame and saved it as IP2ASN.csv so that it could be used later. The .csv file can be found here.
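A sketch of the save step, on made-up sample data; the file name IP2ASN.csv is the one from the post:

```r
# Made-up sample; the real table had ~125k resolved rows.
asn_table <- data.frame(IP  = c("163.121.116.217", "10.0.0.1"),
                        ASN = c("24863", NA),
                        Organization = c("LINKdotNET-AS", NA),
                        stringsAsFactors = FALSE)

# Keep only the rows with a known ASN and save them for later use.
IP2ASN <- asn_table[!is.na(asn_table$ASN), ]
write.csv(IP2ASN, "IP2ASN.csv", row.names = FALSE)
```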

Identifying the Remaining IPs

In the previous section, the IPs whose ASNs were still missing were saved into the vector missingInfo_IPs.

I searched the web for an IP database that contained the information I needed - for free.

I found Hurricane Electric Internet Services, where the ASNs of some of the IPs from missingInfo_IPs were identifiable. However, this website had no API and no way of getting information from it in bulk, except perhaps through scraping. So that is what I tried next.

Let's see what R can get for us from Hurricane Electric's site for this IP:
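A sketch of the scrape; the bgp.he.net/ip/&lt;address&gt; URL pattern is an assumption inferred from the query-limit message quoted later in the post, the example IP is the one mentioned earlier, and the call needs network access:

```r
# Fetch Hurricane Electric's page for one IP (needs network access);
# the /ip/<address> URL pattern is an assumption.
ip <- "163.121.116.217"
page <- readLines(paste0("http://bgp.he.net/ip/", ip))
length(page)   # the post saw 239 lines for its example IP
```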

The readLines command got quite a bit of information from the webpage, specifically 239 lines of it:

[1] "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\""
[2] " \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
[3] "<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\">"
[4] "<head>"

 

The information useful to me was on lines 103 and 109. Using some regular expressions, one could extract the relevant information and discard the rest.
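I don't know exactly what the markup on those two lines looked like, but stripping the HTML tags with a regular expression is the kind of extraction meant here; the sample line below is made up:

```r
# Made-up sample line resembling what a scraped HTML line might hold.
line <- "<a href=\"/AS24863\">AS24863</a>"

# Drop anything between angle brackets, keeping the text in between.
gsub("<[^>]+>", "", line)
```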

But before that, I tried to get rid of the IPs for which even Hurricane Electric could not find the ASNs.

So, I looked at the scraped page for an IP on which Hurricane Electric had no information:

This time, there were 103 lines of information:

 

As it turns out, the phrase "did not return any results" (line 85 in this case) appears whenever an IP address is unrecognized by Hurricane Electric. So I decided to use "did not return any results" as the pattern to search for with regular expressions.
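A sketch of that check, on a made-up page:

```r
# Made-up scraped page for an IP Hurricane Electric knows nothing about.
page <- c("<html>",
          "Your search for 10.0.0.1 did not return any results.",
          "</html>")

# TRUE if the no-results phrase appears anywhere in the page.
no_result <- any(grepl("did not return any results", page))
```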

The goal was to identify those IPs for which there was no ASN, and then extract the relevant information for those for which an ASN was returned. So I wrote the following (inelegant) "for loop":
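A reconstruction of the loop as described: it visits bgp.he.net once per IP, prints the counter, and sorts the addresses into "unknowable" and "knowable" buckets without yet extracting anything for the latter. The names missingInfo_IPs and knowableIPs come from the post; the URL pattern is an assumption, and network access is required:

```r
# Reconstructed sketch of the (inelegant) for loop; needs network access.
# missingInfo_IPs would normally hold the 768 unresolved addresses;
# here it is a made-up one-element list for illustration.
missingInfo_IPs <- c("163.121.116.217")

unknowableIPs <- character(0)
knowableIPs   <- character(0)
counter <- 0
for (ip in missingInfo_IPs) {
  counter <- counter + 1
  print(counter)   # progress indicator, printed at each iteration
  page <- readLines(paste0("http://bgp.he.net/ip/", ip))
  if (any(grepl("did not return any results", page))) {
    unknowableIPs <- c(unknowableIPs, ip)   # Hurricane Electric has nothing
  } else {
    knowableIPs <- c(knowableIPs, ip)       # extraction deferred, as in the post
  }
}
```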

I printed out the iteration number (counter) at each iteration to keep track of progress. My rough estimate was that each iteration took about 1 to 3 seconds, and there were 768 iterations to go through. The loop could take quite a bit of time to finish, and I did not want to sit around not knowing whether things were progressing or had stalled at a particular point.

I put off extracting the information for the "knowable" IPs (I was going to write a second loop for the knowableIPs) - and this was a major mistake.

After a few hundred iterations, the website locked me out. I realized this belatedly, when I finally modified the "for loop" and began to print out the returned values. As it turns out, I was being told repeatedly that "You have reached your query limit on bgp.he.net" - I just didn't listen.

All in all, I never got the ASNs for any of the 768 IPs in the missingInfo_IPs vector.

 

Finding ASNs from IP Addresses Using R
