0xDE (11011110) wrote,

Strong components of the Wikipedia graph

I recently covered strong connectivity analysis in my graph algorithms class, so I've been playing today with applying it to the link structure of (small subsets of) Wikipedia.

For instance, here's one of the strong components among the articles linked from Hans Freudenthal (a mathematician of widely varied interests): Algebraic topology, Freudenthal suspension theorem, George W. Whitehead, Heinz Hopf, Homotopy group, Homotopy groups of spheres, Humboldt University of Berlin, Luitzen Egbertus Jan Brouwer, Stable homotopy theory, Suspension (topology), University of Amsterdam, Utrecht University. Mostly this makes sense, but I'm not quite sure how the three universities got in there. Maybe from their famous faculty members?

It's limited to relatively small graphs by the time it takes to grab the data from the Wikipedia servers, but the code is surprisingly easy. Given a set of articles, here's all it takes to build the graph:
def blueLinksFrom(article):
    """Which Wikipedia articles are linked from the given article?"""
    api = "http://en.wikipedia.org/w/api.php?action=parse&page=" + \
          urllib.quote(article.encode('utf-8')) +\
          "&format=json&prop=links"
    linkdata = json.loads(urllib2.urlopen(api).read())
    linkarray = linkdata['parse']['links']
    return {L['*'] for L in linkarray if 'exists' in L and ':' not in L['*']}

graph = {v:blueLinksFrom(v)&articles for v in articles}
Vaguely relatedly (well, it's about graphs and Wikipedia, if nothing else) I discovered two things about Halin graphs today: (1) Like many things in mathematics, they're named after the wrong person. Halin graphs (or at least, 3-regular Halin graphs) actually go back to the work of Thomas Kirkman, over 100 years before Halin. (2) For about the past 3 1/2 years, Wikipedia may have been using the wrong Halin reference, copied from MathWorld. I can't tell for sure because I don't read German, but the 1964 paper it was citing (and that MathWorld still cites) seems to be about other topics. More reliable publications in this area instead cite a different paper by Halin, published in 1971.
Tags: graph algorithms, wikipedia
  • Error

    default userpic

    Your reply will be screened

    Your IP address will be recorded 

  • 4 comments