0xDE (11011110 ) wrote,

Relational events vs graphs

I suppose I should say something about my paper, "Windows into Relational Events: Data Structures for Contiguous Subsequences of Edges" (with Michael Bannister, Chris DuBois, and Padhraic Smyth), which has been accepted at SODA and just appeared online as arXiv:1209.5791.

The point of the paper is to look at data sets consisting of (directed or undirected) edges connecting pairs of vertices drawn from some vertex set, with timestamps on the edges. For example, edits to Wikipedia can be modeled in this way: the vertices are editors and Wikipedia articles, and each edge represents an edit made at some time by some editor to some article. This kind of data set has been studied in social network analysis for quite some time under several names; in the name we use, "relational events", the "relational" part refers to the pair of vertices and the "event" part refers to the timestamps.

You might think that such a thing is a graph, but it's really not. You can get a graph (or rather a multigraph) from it, by forgetting the time steps and just keeping the vertex pairs, but it's probably not the graph you want, because it aggregates everything together on too coarse a timescale. You don't get a graph for any single instant of time, either, instead you just get a single edge (assuming all timestamps are distinct, which can be made to happen by an appropriate tie-breaking rule). It's not even a dynamic graph in the sense of dynamic graph algorithms, because those have edges that persist over long time periods from an insertion to a later deletion, with the whole graph changing only gradually.

What you probably want to do with this sort of data is to form a sequence of graphs by aggregating together the edge events that occur within certain windows of time, and examine the changing behavior of the graphs in this sequence. And if you do that, for a fixed window size, and you slide the window across the timeline, you really do get a dynamic graph with an insertion event when an edge's timestamp enters the window and a deletion event when it leaves the window. But this requires knowing a fixed window size to use, and we want to allow exploratory forms of data analysis in which the window size is not fixed in advance. So the type of result we show is: you can preprocess the data set in near-linear time and space, in such a way that you can query the properties of the graphs formed by any given window into the event timeline, in near-constant time per query.

Here "query the properties" is a bit vague, but what we mean by it is that we can compute many numerical invariants that are standard in social network analysis: the number of connected components of the graph formed by the windowed set of edges, an approximation to the degree distribution of the graph, the reciprocity (the proportion of connected pairs of vertices that have two-way connections), etc. We also included a couple of properties that make sense only for relational events, and not for graphs without timestamps; for instance, given a starting set of "infected" vertices, we can ask how many other vertices could be reached from these vertices by time-monotonic paths.

The technical details of our algorithms involve matroids and cutting-and-linking trees and orthogonal range counting and path-copying persistence: techniques that may seem a bit fearsome if you don't already know about them, but that are not particularly new. If you do already know these tools then I think our results are best characterized as "simple in retrospect", so for theoreticians the point of the paper is less about technique and more about the data model. But in this context I think that (other than being a potential obstacle to publication) being simple is a good thing: it makes it more likely that these algorithms are usable. And despite some efforts on our part to make the data structures as uncomplicated as possible, my two less-theoretical co-authors still don't seem convinced that they're simple enough, so there may still be room for improvement in that direction.

Incidentally, this is what my earlier blog post binary numbers without the zeros was for. We use the counting scheme described in that post to set up a sequence of updates to a persistent binary tree, which forms the basis of a range counting data structure whose query times depend on the window size rather than on the size of the whole data set.
Tags: data structures, graph algorithms, papers, social networks, wikipedia
  • Post a new comment

    Error

    default userpic

    Your reply will be screened

    Your IP address will be recorded 

  • 4 comments