#19 Add a getPairs() method to the Graph interface

closed
nobody
5
2010-12-01
2010-10-21
Jamie
No

It's nice that edge doesn't need to contain references to source and target vertices, but it can be a bit awkward to have to getEdges and then getEndpoints(). This is especially true for situations (e.g. timestamped edges) where there are multiple edges connecting pairs and the implementing class is likely to know a smarter way to get a set of Pairs. If there isn't a problem with this I'd be fine doing this myself, I just thought I'd post it here first in case there are reasons for not wanting this.

Discussion

  • We are very, very reluctant to add more methods onto the Graph interface.

    That said, I'm not sure exactly what use case you're trying to suggest. Can you post a short code snippet that shows what you'd like to be able to do?

    (Side note: if you're working with a Multigraph and you want to get all the edges connecting two vertices, that's what findEdgeSet() is for. But I'm not sure that's what you were getting at.)

    Joshua

    PS: You know, for someone whose login is 'inspired2apathy' you're sure posting a lot. ;)

     
  • Jamie
    Jamie
    2010-10-22

    I can certainly appreciate the reluctance to clutter the Graph interface. Even with just the add operations, it can be aesthetically dissatisfying to see the different versions e.g. for Hypergraph and Graph.

    My use case is any time I want to get the set of Pairs in a Graph. For instance, if I want to compare two versions of the same network at two different times as represented by two different Graphs. At the most basic level, you might want to compare the set of pairs to see who's connected at one time and not in another. Or you may want to do something comparing the number of edges connecting people at the two time periods. Either way, the starting point in comparing two networks, will frequently be the set of pairs. Of course, you can use the Scorers to compare structure as well, but the set of pairs is a good starting point.

    Alternatively, a lot of graph level scores/indices e.g. density require something more like the number of pairs than the number of edges.

    Now, since I've extended the Graph interface to support a ordered, navigable sequence of (e.g. timestamped) edges for each Pair, the list of edges is much much longer than the number of pairs. In one dataset I'm playing with now, I have 300,000 edges for about 3,000 vertices and 30,000 unique pairs. However, computationally, it's working quite nicely, so I'm planning to scale up to larger datasets. I'd like to get this included in JUNG if you're interested.

    I situations (such as mine) when the set of edges is larger than the set of pairs, it seems silly to have to do something like:

    HashSet<Pair<Integer>> checked = new HashSet<Pair<Integer>>();
    for (ComparableEdge<Integer,Integer> edge : comparableGraph.getEdges()){
    Pair<Integer> pair = comparableGraph.getEndpoints(edge);
    checked.add(pair);
    }
    System.out.println(checked.size());

    to get the set of pairs.

    Although my case is extreme, this crops up even when you allow undirected and directed edges connecting the same pair. It would be very nice to have a direct way to get at the set of unique pairs for any graph object. Otherwise, I need to know something about the specific Graph implementation in order to know if

    HashSet<Pair<V>> checked = new HashSet<Pair<V>>();
    for (E edge : thisGraph.getEdges()){
    Pair<V> pair = thisGraph.getEndpoints(edge);
    if (checked.contains(pair))
    continue;

    doPairAnalysisOrOperation(...)

    checked.add(pair);
    }

    will do the same thing as:

    for (E edge : thisGraph.getEdges()){
    Pair<V> pair = thisGraph.getEndpoints(edge);
    doPairAnalysisOrOperation(...)
    }

    And requiring information about the implementation to know something about the semantics of the only method I can use seems undesirable.

    PS: Indeed, it's the curse of the screenname my high school self thought was clever.

     
  • A couple more comments on this, and then I'm going to close this request. We thank you for your thoughtful request, but don't think that it's a good thing to put into the released API.

    (0) Part of my reluctance to do this, aside from wanting to not make the API more cluttered, is that it exposes some assumptions about the internal storage. While what you describe would be natural with the existing implementations, it would be considerably less so with (for example) a matrix-based implementation.

    (1) The way that you've set up your graph sounds like you're wanting to treat the graph as a multigraph in some cases (when you care about timestamps) and as a non-multigraph in others (when you only care about which pairs are connected). A couple of alternatives you might consider:
    - Create two graphs over the same vertex set (the vertex objects themselves can be shared), one in which you have one edge per timestamp, and one in which you have one edge per
    - Create a single graph, but instead of creating an edge for each timestamp, create a single edge and associate all relevant timestamps with it.

    As a side note, if you really, really want the getPairs() behavior for your own purposes, you can of course add it to your own private version of Graph (or even just override the relevant implementation class and add the method).

    Thanks again for your suggestion. Feel free to keep making them if you notice other things that you think could be better.

    Joshua

     
    • status: open --> closed
     
  • I'm willing to concede defeat, but I thought I'd point out why your solution doesn't really solve my problem. The point of my NavigableMultiGraph or whatever I end up calling it is that you can easily examine subgraphs. For instance, you can efficiently [ie O(k*log(T)) where k is the constant time hash lookup and log(T) is a sorted tree lookup] get a subgraph for one month's activity and compare it to another month's activity. To do what you're saying I would have to complicate the API by adding options for creating a multigraph subgraph and options for creating a simple graph subgraph, whereas right now any subgraph of a NavigableMultiGraph is itself a NavigableMultiGraph.

    A philosophical reason why I disagree with the decision is that I would argue that you should be able to perform the same operations on a MultiGraph as a simple "no-duplicates-edge" graph. Whether you're interested in Pairs or edge Es should a question of the analysis methodology and not the data structure.

    I also really don't see what assumptions it exposes. I don't see how a getPairs() function would be any less natural with alternative Graph implementations than getEdges() already is. I would argue that a getPairs() function would just be a convenience function with the same kind of relationship to getEdges() as getPredecessors(V v) already has to getInEdges(V v).

    While you are certainly right, that I can add it to my own Graph classes, that forces me to choose to either write generic analysis code that will run slowly on my actual data/Graph, or to write special-purpose analysis code that only works on my data/Graph. One of the requested areas for development is graph/network statistics and analysis, and I think that the absence of a getPairs() inhibits certain kinds of general analysis/algorithms using the Graph interface.

     
  • Jamie
    Jamie
    2010-12-04

    Sorry, I forgot to login. That comment was me again.

     
  • Jamie
    Jamie
    2010-12-04

    As an addendum, I think this is especially important because I see a general trend towards multi-graphs across network analysis (esp. social), including evidence-based/behavioral trace/transaction networks (e.g. email, trade) as well as multiple types of relationship (e.g. trust, friendship, authority, respect, co-work, economic, political). I think that in order to be prepared for these kinds of data, JUNG needs to be equipped to permit **efficient** computation of traditional "no-duplicate-edge" graph/network analysis on these kinds of multi-graphs.

     
  • Addressing your comments in reverse order:

    (1) I was doing analysis over multigraphs and multi-modal graphs several years ago; my experiences in that context are part of what has shaped the design of JUNG. I've been a bit out of the research end of things in SNA for a few years, but I'm definitely aware of the need for good support of those types of scenarios. So I appreciate your continued engagement on this issue.

    (2) I think--although I am not certain--that I now understand why you don't think that either of my proposed solutions works for you, i.e., you want to extract constrained subsets of the (timestamped) edges. (Certainly this means that the second proposed solution doesn't work.)

    However, getPairs() doesn't help you in that context, either: if you've specified a time interval for which a particular pair of vertices has no corresponding connecting edges, getPairs() is still nevertheless going to believe that those edges are connected (unless you build the set at call time with a specified Predicate, which both is not compatible with your proposed API and isn't any more efficient than letting the caller build the set themselves). There may be a clever way to solve that problem, but I haven't thought of one yet. So far all I've got involves considering only a specific limited set of (preferably non-overlapping) intervals, and creating a separate graph for each such interval (ick).

    (3) In what sense can you not perform the same operations on a multigraph as you can on a simple graph?

    I look forward to your further comments. (Feel free to move this discussion over to the forum, incidentally, so that it will get somewhat broader visibility.)

    Joshua

     
  • Thanks for following up. I appreciate the comments and I'll try to find a way to package these thoughts up and move them over to the forums.

    (1)Indeed, the MultiGraph implementations seem quite nice. I have found myself wondering recently if there might not be a more flexible way of implementing EdgeType that would allow for user-defined types, but I digress.

    And reversing things slightly:
    (3) I don't know that there are necessarily any things that you can't do on a MultiGraph, but I can think of a number of things that are less convenient/efficient to do on a MultiGraph *if you're forced to treat it like a MultiGraph.* I think it's often probably just things like normalizing constants. For example, Newma-Girvan define their modularity statistic as how much more connections there are within groups than you'd expect in a random graph. The "than you'd expect in a random graph" is just <group size>^2 * average density, with density being [# pairs]/[# nodes]^2. Frankly, I think much of the time where you'd want the total number of pairs would be because you're getting the density and comparing it to some more interesting computed value. Maybe a getDensity() function would be more useful/inoffensive?

    (2) Not to get too bogged down, but here's what I'm doing. Rather than parameterize low level queries like getNeighbors(V v) or getInEdges(V v) with the desired (usually) time interval, I've tried to make it easy to get subsets of the whole network for a particular time interval. This is relatively painless because I'm using TreeMaps to store the edges for each specific pair, and a subMap(..) of a TreeMap is essentially a reference to a particular subset of the parent tree. So there's no duplication of data, plus it's fast, plus you get the not undesirable side-effect of changes to the original network affecting the subnetwork and vice versa. The reason I could have a faster getPairs() implementation is that I'm actually storing a TreeMap for each pair. I can just check whether the TreeMap for each pair is non-null and non-empty and return pair instead of returning all of the contents of the TreeMap as edge objects. I should probably just start a thread in the forums and post what I'm doing.

     
  • Jamie:

    A couple of brief responses:

    (1) What kind of user-defined EdgeTypes would you want? I'm not opposed to adding additional types if there's a compelling use case.

    (2) Are you saying you're actually storing a TreeMap for each pair of vertices in the graph? That sounds really inefficient. Plus it means that to build getPairs() that you're doing O(n^2) null/empty checks, which is actually less efficient than simply building the pairs yourself externally to the graph class.

    I think it would help if you could post some of the code that you've written, to make this more concrete.

    (3) It's been a long time since I looked at the definition of the modularity statistic, so I don't remember the details...but if you're looking at things like the clustering coefficient, you just ask nodes for their neighbors, which is sort of a sidewise way of getting pairs.

    We are definitely not going to put things like getDensity() in the Graph API. That's a derived statistic and anything like that is handled outside that interface (in the algorithms package, generally). It would actually be easier to convince me to add getPairs(). :)

    Joshua