Network Diagrams and Python Web Crawlers
I'm fascinated by networks and data visualization. I've always wanted to try my hand at making some of the inspiring images I see on blogs like FlowingData. This network diagram is my first amateur attempt.
The code
I started by writing a rather simple web crawler in Python. The logic for the bot was:
1. Open a page
2. Create a list of all the links on that page (capture the total number of links)
3. For each link, create a new bot to follow the link and start the whole process again.
This was a great chance to use the threading module in Python. I am not an expert in threading or multiprocessing, but threading allowed me to create a new bot for each link I wanted to follow.
Here is the code for my spider class:
'''
Created on Jun 13, 2012
@author: Alex Baker
'''
# imports
import urllib2, BeautifulSoup, time
from threading import Thread

class spider1():
    def scan(self, url, mem, f):
        # Get the url
        usock = urllib2.urlopen(url)
        # Your current URL is now your "old" url and
        # all the new ones come from the page
        old_url = url
        # Read the data to a variable
        data = usock.read()
        usock.close()
        # Create a Beautiful Soup object to parse the contents
        soup = BeautifulSoup.BeautifulSoup(data)
        # Get the title
        title = soup.title.string
        # Get the total number of links
        count = len(soup.findAll('a'))
        # For each link, create a new bot and follow it.
        for link in soup.findAll('a'):
            try:
                # Clean up the url
                url = link.get('href').strip()
                # Avoid some types of link like # and javascript
                if url[:1] in ['#', '/', '', '?', 'j']:
                    continue
                # Also, avoid following the same link
                elif url == old_url:
                    continue
                else:
                    # Get the domain - not interested in other links
                    url_domain = url.split('/')[2]
                    # Build a domain link for our bot to follow
                    url = "http://%s/" % (url_domain)
                    # Make sure that you have not gone to this domain already
                    if self.check_mem(url, mem) == 0:
                        try:
                            # Create your string to write to file
                            text = "%s,%s,%s\n" % (old_url, url, count)
                            # Write to your file object
                            f.write(text)
                            print text
                            # Add the domain to the "memory" to avoid it going forward
                            mem.append(url)
                            # Spawn a new bot to follow the link
                            spawn = spider1()
                            # Set it loose!
                            Thread(target=spawn.scan, args=(url, mem, f)).start()
                        except Exception, errtxt:
                            # For threading errors, print the error.
                            print errtxt
                        except:
                            # For any other type of error, give the url.
                            print 'error with url %s' % (url)
            except:
                # Just keep going - avoids allowing the thread to end in error.
                continue

    def check_mem(self, url, mem):
        # Quick function to check in the "memory" whether the domain has already been visited.
        try:
            mem.index(url)
            return 1
        except:
            return 0
As you can see, the code is simplistic: it only considers the domain/sub-domain rather than each individual link. Also, because it checks to make sure that no domain is visited twice, each domain shows up as a destination only once in the output.
To run the class, I used something like this:
mem = []
f = open('output.txt', 'w')
url = 'http://justanasterisk.com'  # write the url here
s = spider1()
s.scan(url, mem, f)
Once started, it doesn't stop - so kill it after a while (or build that in). Running this on my MacBook, I recorded 27,000 links in about 10 minutes.
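If you would rather not kill it by hand, here is a minimal sketch of one way to build a stop into the driver script: run the first bot in its own thread, sleep for a fixed duration, then flush the file and hard-exit the process. The 10-minute cutoff below is just an example value, not part of the original code.

import os, time
from threading import Thread

mem = []
f = open('output.txt', 'w')
url = 'http://justanasterisk.com'

s = spider1()
# Run the first bot in a thread so the main script keeps control
Thread(target=s.scan, args=(url, mem, f)).start()

# Let the bots crawl for ten minutes, then stop everything
time.sleep(600)
f.flush()
os._exit(0)  # hard exit - the spawned threads will not stop on their own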
The data
The number of data points is small in comparison to some of the sets I've explored using BigQuery or Amazon SimpleDB. However, I wanted to make a visualization, and I realized that the number of pixels would really define how many data points were useful. I figured that 10 minutes of crawling would give me the structure I wanted. I used my blog justanasterisk.com as the starting point. I won't attach the data (you can create that yourself), but suffice to say that each line was:
source, destination, # of links on source page
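If you want a quick sanity check on the output before firing up a visualization tool, a few lines of Python (my addition here, not part of the crawler itself) will read the edge list back in:

import csv

edges = []
with open('output.txt') as f:
    for source, destination, count in csv.reader(f):
        edges.append((source, destination, int(count)))

print 'edges:', len(edges)
print 'unique domains reached:', len(set(d for s, d, c in edges))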
The visualization
Here is where I was out of my element. I browsed a few different tools, and the best (read: easiest) solution for my needs was Cytoscape. It is simple to use and has several presets included to make you feel like you've done some serious analysis. For the image above, I used one of the built-in layouts (modified slightly) and a custom visual style.
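If you would rather stay in Python instead of using Cytoscape, a rough equivalent using networkx and matplotlib (which I did not use for the image above) would look something like this:

import csv
import networkx as nx
import matplotlib.pyplot as plt

G = nx.DiGraph()
with open('output.txt') as f:
    for source, destination, count in csv.reader(f):
        G.add_edge(source, destination, weight=int(count))

# A force-directed layout, roughly comparable to Cytoscape's built-in presets
pos = nx.spring_layout(G)
nx.draw(G, pos, node_size=20, with_labels=False, arrows=False)
plt.savefig('network.png', dpi=300)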
I won't underwhelm you with further details, but shoot me an email if you want more. I'll probably add a few more images to this post when I get them rendered.
Best,
~ab