Associative Grouping using tSQL (Part 2)

Posted by Niall's Data Blog on Friday, July 5, 2019

This is part of series of posts about associative grouping:

In part one of this series we looked at how we could use recursive CTE’s to find overlapping groups in two columns in a table, in order to put them into a new super group of associated groups.

Since I wrote that post, SQL Server 2019 CTP 3.1 has been released, and with it comes some enhancements to the graph processing functionality, namely the new SHORTEST_PATH() function.

To use this code yourself you will need to be using SQL Server 2019 CTP 3.1 or later. The simplest way to play with the CTP releases is to use Docker. Here is a good guide to getting setup by DBA From The Cold. The images are available on the Docker Hub, just make sure you pull this image.

mcr.microsoft.com/mssql/server:vNext-CTP3.1-ubuntu

Demo Data

Once we are setup, we can load some sample data into a table like in the last post. The difference here is we will use the AS NODE syntax to make the table act as the nodes in a graph, and a extra columns $node_id is automatically added.

-- Nodes
DROP TABLE IF EXISTS SuppliersGraph;
CREATE TABLE SuppliersGraph
(
    Id INT IDENTITY(1,1) NOT NULL
   ,SupplierName VARCHAR(30) NOT NULL
   ,TaxNumber INT NOT NULL
   ,BankSortCode CHAR(8) NOT NULL
   ,BankAccountNumber INT NOT NULL
)AS NODE;

INSERT INTO SuppliersGraph (SupplierName, TaxNumber, BankSortCode, BankAccountNumber)
VALUES ('AdventureWorks Ltd.',      12345678, '02-11-33', 12345678)
      ,('AdventureWorks',           12345678, '02-55-44', 15161718)
      ,('ADVENTURE WORKS',          23344556, '02-55-44', 15161718)
      ,('ADVENTURE WORKS LTD.',     23344556, '02-77-66', 99887766)
      ,('AW Bike Co',               23344556, '02-88-00', 11991199)
      ,('Big Bike Discounts (AW)',  55556666, '02-88-00', 11991199)
      ,('Contoso Corp',             90001000, '02-99-02', 12341234);

And here is what the data looks like, coloured to show the chain of links between the rows.

IdSupplier NameTax NumberBank Sort CodeBank Account NumberRequired Output
1AdventureWorks Ltd.1234567802-11-33123456781
2AdventureWorks1234567802-55-44151617181
3ADVENTURE WORKS2334455602-55-44151617181
4DVENTURE WORKS LTD.2334455602-77-66998877661
5AW Bike Co2334455602-88-00119911991
6Big Bike Discounts5555666602-88-00119911991
7Contoso Corp9000900002-99-02123412342

This produces the same simple disconnected graphs as last time. We have one graph with 6 nodes, and another graph with a single node.

The Nodes are numbered using the Id’s of the rows
The Nodes are numbered using the Id’s of the rows

Now we need to query the table, and use the groups of tax numbers, and the groups of bank details to build a table of the graph edges.

-- Edges
DROP TABLE IF EXISTS SupplierEdges;
CREATE TABLE SupplierEdges
(
    LinkType VARCHAR(100) NOT NULL
)AS EDGE;

WITH CTE_Edges
AS (-- Add the tax number links
    SELECT s1.Id AS FromId
          ,s2.Id AS ToId
          ,'TaxNumber' AS LinkType
    FROM SupplierNodes AS s1
    INNER JOIN SupplierNodes AS s2
    ON s1.TaxNumber = s2.TaxNumber
    UNION
    -- And the bank account links
    SELECT s1.Id AS FromId
          ,s2.Id AS ToId
          ,'BankSortCode, BankAccountNumber' AS LinkType
    FROM SupplierNodes AS s1
    INNER JOIN SupplierNodes AS s2
    ON s1.BankSortCode = s2.BankSortCode
        AND s1.BankAccountNumber = s2.BankAccountNumber
   )
INSERT INTO SupplierEdges ($from_id, $to_id, LinkType)
SELECT (SELECT $node_id FROM SupplierNodes WHERE Id = e.FromId) AS FromId
      ,(SELECT $node_id FROM SupplierNodes WHERE Id = e.ToId) AS ToId
      ,STRING_AGG(e.LinkType, ' / ')
FROM CTE_Edges AS e
GROUP BY e.FromId
        ,e.ToId;

SELECT *
FROM SupplierEdges;

Associating the Groups

Now that we have the data, we can use the new SHORTEST_PATH function to traverse the graph. The documentation examples use the function to find the shortest path from one node to another. In our case, we want to find all the paths between all connected nodes. This would be a bad idea in a huge graph, such as all the users on LinkedIn, but as this scenario deals with the idea of lots of smaller, disconnected graphs which we want to use to groups nodes, the approach should be fine. Here is the ode to navigate the graph.

SELECT s1.Id
          ,CAST(s1.Id AS VARCHAR(10)) + ':' + STRING_AGG(s2.Id, ':') WITHIN GROUP (GRAPH PATH) AS GraphPath
          ,COUNT(s2.Id) WITHIN GROUP (GRAPH PATH) AS LevelCount
    FROM SupplierNodes AS s1
        ,SupplierEdges FOR PATH AS lnk
        ,SupplierNodes FOR PATH AS s2
    -- The SHORTEST_PATH() function is brand new in SQL Server 2019 CTP3.1
    WHERE MATCH (SHORTEST_PATH ( s1(-(lnk)->s2)+) )

Note: If we wanted to limit the number of nodes traversed, we can as described in the docs.

The output from this looks like this (abbreviated for readability)

IdGraphPathLevelCount
11:11
11:21
22:11
22:3:52
33:2:12
33:5:62
66:5:3:23
11:2:3:5:64
66:5:3:2:14

The Id column represents the start node for the graph iteration. The GraphPath column uses the STRING_AGG function to build a colon delimited list of all of the nodes visited as the graph was traversed.

We can take this output, and use the STRING_SPLIT function to unpivot this path data. If we group on the node id returned from the string split, then MIN(Id) will give us the lowest Id of all the nodes in the graph. As nodes are only a member of a single graph, this produces the associative grouping.

WITH CTE_Links
AS (SELECT s1.Id
          ,CAST(s1.Id AS VARCHAR(10)) + ':' + STRING_AGG(s2.Id, ':') WITHIN GROUP (GRAPH PATH) AS GraphPath
          ,COUNT(s2.Id) WITHIN GROUP (GRAPH PATH) AS LevelCount
    FROM SupplierNodes AS s1
        ,SupplierEdges FOR PATH AS lnk
        ,SupplierNodes FOR PATH AS s2
    -- The SHORTEST_PATH() function is brand new in SQL Server 2019 CTP3.1
    WHERE MATCH (SHORTEST_PATH ( s1(-(lnk)->s2)+) )
   )
,CTE_Groups
AS (SELECT lnk.value AS Id
          ,MIN(lnks.Id) AS GroupId
    FROM CTE_LINKS AS lnks
    CROSS APPLY STRING_SPLIT(lnks.GraphPath, ':') AS lnk
    GROUP BY lnk.value
   )
SELECT s.*
      ,DENSE_RANK() OVER (ORDER BY g.GroupId) AS GraphId
FROM CTE_Groups AS g
INNER JOIN SupplierNodes AS s
  ON s.Id = g.Id
ORDER BY s.Id;

The output of this gives the same answers as the recursive CTE method.

IdSupplierNameTax NumberBank Sort CodeBank Account NumberGraph GroupRequired Output
1AdventureWorks Ltd.1234567802-11-331234567811
2AdventureWorks1234567802-55-441516171811
3ADVENTURE WORKS2334455602-55-441516171811
4ADVENTURE WORKS LTD.2334455602-77-669988776611
5AW Bike Co2334455602-88-001199119911
6Big Bike Discounts (AW)5555666602-88-001199119911
7Contoso Corp9000100002-99-021234123422

Performance

So how does this perform compared to the recursive CTE approach? Well for this simple set of data the performance was pretty much identical at around 16ms. But changing the sample data to add a few more nodes in the existing groups increases the edge count quite a bit.

INSERT INTO SupplierNodes (SupplierName, TaxNumber, BankSortCode, BankAccountNumber)
VALUES ('AdventureWorks Ltd.',      12345678, '02-11-33', 12345678)
      ,('AdventureWorks Limited',   12345678, '02-54-44', 32165487)
      ,('AdventureWorks',           12345678, '02-55-44', 15161718)
      ,('AW Limited',               98768765, '02-55-44', 15161718)
      ,('ADVENTURE WORKS',          98765432, '02-55-44', 15161718)
      ,('AW Bike Co',               98765432, '02-66-00', 11991199)
      ,('AW Bike Co',               12345678, '02-66-00', 11991199)
      ,('Big Bike Discounts (AW)',  55556666, '02-66-00', 11991199)
      ,('Contoso Corp',             90001000, '02-99-02', 12341234);

Which gives us a graph that looks like this.

A bigger graph
A bigger graph

Comparing the performance when processing these two graphs made a lot more difference. Running in a small Docker container, the CTE approach took almost 13 seconds, where the graph approach took only 42ms.

Next Steps…

As I get time in the future I’d like to look at the performance of these solutions, particularly to try and gauge how the approaches scale as we increase the number of rows (and therefore disconnected graphs) in the table. We will also consider how performance changes as we make our graphs:

  • Larger (more hops)
  • Cyclic (more nodes connected in circles, blue and green in the above graph)
  • Complex (more interconnections between nodes, yellow in the above graph)

It will also be interesting to look at what options there are for for people using Azure with the Data Lakehouse architecture (Databricks & synapse). Recursive CTE’s and tSQL graph processing are not supported in Synapse SQL Pools, but spark has support for graph processing using the GraphFrames library.

For anyone wanting to learn more about graph databases in general, VisualGo has some great tutorials.

In the meantime the demo code for this post is available on this Github Gist, and requires SQL Server 2019 CTP 3.1 or higher to run.