One of the reasons that I like about graph theory so much is because of
how it ties into computer science. Graphs are fundamental to many problems in
computer science, and a lot of the work in graph theory has direct implications for
algorithm design. It also has a lot of resonance for me, because the work I do involves
tons and tons of graphs: I don't think I've gotten through a week of work in the last decade without
implementing some kind of graph code.
Since I've described a lot of graph algorithms, and I'm going to
describe even more, today I'm going to talk a bit about how to represent graphs
in programs, and some of the tradeoffs between different representations.
There are two different basic graph representations: matrix-based representations, and list
The matrix-based representation is called an adjacency matrix. For a graph G with N nodes, an adjacency matrix is an N×N matrix of true/false values, where entry [a,b] is true if and only if there is an edge between a and b. If G is an undirected graph, then the matrix is (obviously)
symmetric: [a,b]=[b,a]. If the graph is directed, then [a,b]=true means that there is an edge from a to b.
As usual, things are clearer with an example. So, over the right, we have a simple
ten node undirected graph, with vertices A through J. Right below, there's a 10×10 boolean
table showing how the graph would be represented as an adjacency matrix. In the matrix,
a value of "1" in a cell means that there's an edge connected the corresponding vertex indices; a "0" indicates that there is not.
represented as an adjacency matrix.
The list-based representation, commonly called an adjacency list, has a list for each vertex in the graph. For
a vertex v, that list contains identifiers for the vertices which are adjacent to v. So again for our example:
A common optimization of the adjacency list is to use a data object to represent a vertex, and to use
pointers/references to vertex objects as the vertex identifiers in the adjacency list for a vertex. Depending
on the application, the adjacency lists could be either arrays, linked lists, or some other similar structure.
To give you some sense of what a simple bit of code looks like, here's a pair of Python functions to check if two vertices x and y are
adjacent in a matrix. The first piece of code is for an adjacency matrix; the second for an adjacency list.
def IsAdjacentInMatrix(matrix, x, y): return matrix[x][y]
def IsAdjacentInList(x, y): for n in x.neighbors: if y == n: return true return false
When you're writing a graph-based algorithm, choosing which representation - and which variation of which representation
- is one of your most important decisions. There's no cut and dried correct answer: which representation is right is dependent on what you want to do, and what constraints you have. How fast you need things to be; how much memory you
can afford to use; how big your graph is; how many edges you graph will have - all contribute to the choice of the best representation.
The boolean adjacency matrix is very compact - you can get away with as little as N2/2 bits of storage to
represent the edges in a graph. Certain algorithms - like transitive closure - are extremely efficient in the adjacency
matrix. If you really only need to know the structure of the edges, and you don't have other data that needs to be
associated with the edges, then an adjacency matrix can be an extremely good choice. If your matrix is moderately sized,
you can often get the entire adjacency matrix into the cache at once, which has a dramatic impact on performance.
Even if you need to associate a little bit of information with an edge, that's often easily doable using an adjacency
matrix. For example, to represent a weighted graph, you can use the edge-weight as the value of the matrix entry if there
is an edge. (This does require that you be able to identify a value which will absolutely never be an edge weight, to use
to mark cells that are not edges. If you're using floating point numbers for weight, then NaN is a good choice.) On the
other hand, if you have a very large matrix with very few edges, then an adjacency matrix will perform very poorly and
waste an enormous amount of space. Consider a graph with 1 million vertices, and 1 million edges. That means that each
vertex has, on average, two edges - but you're using one million bits per vertex. That's very wasteful, and dealing with a
matrix of 1012 bits is not going to be particularly fast.
Using an adjacency list works much better if you have a sparse matrix - it only uses space for edges that are actually
present in the graph. It's also often easier to associate data with vertices and edges in a good way using adjacency lists;
and it can often produce better performance in that case as well, due to good locality. (When you traverse an edge in an
adjacency list using the pointer implementation described above, you're pulling the vertex into your cache; then when you
go to look at the data for the vertex, it's hot in the cache. In an adjacency matrix, you'd use a separate table for that,
which would mean fetching from a different area of memory.)
The adjacency list is often better for certain kinds of graph traversals: if you need to walk through a graph
collecting information about visited vertices, the locality properties of the adjacency list structure can have a great
effect on performance. But if you want to know something about connectivity, or simple pathfinding in a relatively dense
graph, then the adjacency matrix is usually the right choice.