General Graphs in SQL

From: -CELKO- <jcelko212_at_earthlink.net>
Date: 15 Dec 2005 15:49:00 -0800
Message-ID: <1134690540.188164.89030_at_z14g2000cwz.googlegroups.com>



This is long, so bear with me.

I am trying to come up with a way to model a general graph in SQL which does not use the adjacency list model. Here is what I have so far and I'd like to get any ideas on it.
What if we attempt to store all the paths in a directed graph in a single table in an RDBMS? The table for this would look like this, with all the edges directed downward (pardon the ASCII artwork):

  A
 / \
B C
| \ |
F D

    |
    E

So start with a simple table of paths.

CREATE TABLE Path
(path_nbr INTEGER NOT NULL,

 step_nbr INTEGER NOT NULL
   CHECK (path_nbr >= 0),
 node_id CHAR(1) NOT NULL,
  PRIMARY KEY (path_nbr, step_nbr));

Each path is assigned an id number and the steps are numbered from zero
(the start of the path) to (k), the final step. Using the simple six
node graph, the one edge paths are:

1 0 A
1 1 B
2 0 B
2 1 F
3 0 C
3 1 D
4 0 B
4 1 D
5 0 D
5 1 E

Now we can add the two edge paths:

6 0 A
6 1 B
6 2 F
7 0 A
7 1 B
7 2 D
8 0 A
8 1 C
8 2 D
9 0 B
9 1 D
9 2 E

And finally the three Edge paths:

10 0 A
10 1 B
10 2 D
10 3 E
11 0 A
11 1 B
11 2 D
11 3 E

These rows can be generated from the single edge paths using a CTE
(Common Table Expression) or with a loop in a procedural language, such
as SQL/PSM. Obviously, there are fewer longer paths, but as the number of edges increases so does the number of paths. By the time you get to a realistic sized graph, the number of rows is huge.

However, it is easy to find a path between two nodes:

SELECT DISTINCT start_node, :end_node,

              (P2.step_nbr- P1.step_nbr) AS distance   FROM Paths AS P1, Paths AS P2

 WHERE P1.path_nbr = P2.path_nbr
   AND P1.step_nbr <= P2.step_nbr
   AND P1.node_id = :start_node
   AND P2.node_id = :end_node;

Notice the use of SELECT DISTINCT because most paths will be a sub-path of one or more longer paths. Without it, the search for all paths from A to D in this simple graph would return:

7 0 A
7 1 B
7 2 D
8 0 A
8 1 C
8 2 D
10 0 A
10 1 B
10 2 D
11 0 A
11 1 B
11 2 D

However, there are only two distinct paths, namely (A, B, D) and (A, C, D). In a realistic graph with lots of connections, there is a good chance that a large percentage of the table will be returned.

Can we do anything to avoid the size problems? Yes, and no.

In this graph, most of the paths are redundant and can be removed and replaced by a set of sub-paths that cover all of the paths in the original graph. This is easy enough to do by hand for this simple graph:

1 0 A
1 1 B
1 2 F

2 0 A
2 1 B
2 2 D
2 3 E

3 0 A
3 1 C
3 2 D
3 3 E

I have no idea if there is already a name in the literature for such a set of paths, so I will call it a "covering path set" and hope that it sticks. Furthermore, I have no idea how to construct it; no idea how to test that the set has the minimal number of paths or how to find all possible such sets other than brute force.

Here are some random thoughts I have had so far:

  1. The first observations are that if we have a cycle of size (k) in the graph, we would split them into two paths of (k-1) nodes that overlap. If the graph has cycles that overlap or form a lattice, the number of possible paths increases; notice the way that this graph has two paths between node A and D that split and re-converge.
  2. While search queries are easier in this model, dropping, changing or adding a single edge can alter the entire structure, forcing us to re-build the entire table. The combinatory explosion problem again!
  3. Can I use Dijkstra's minimal spanning tree algorithm in some way to get a starting point?
Received on Fri Dec 16 2005 - 00:49:00 CET

Original text of this message