Associative Grouping using tSQL (Part 1)

This is part of series of posts about associative grouping:

Recently I was asked by a friend to have a look at an interesting query problem he had been looking at. He was trying to find overlapping groups in two columns, in order to put them into a new super group of associated groups. The simplest way to describe the problem is with an demo, so off we go…

Demo Data

In our scenario, we have a that company has grown through acquisitions, with each one bringing a new set of suppliers. Some of those supplier lists have overlapped, hence the need to cleanse that data as it is consolidated into a single system. However, as is often the case in this situation, the duplicate suppliers do not match on the natural business keys. Lets dive straight into an example…

Here’s the SQL to create the sample table.

CREATE TABLE Suppliers
(
    Id INT IDENTITY(1,1) NOT NULL
   ,SupplierName VARCHAR(30) NOT NULL
   ,TaxNumber INT NOT NULL
   ,BankSortCode CHAR(8) NOT NULL
   ,BankAccountNumber INT NOT NULL
);

INSERT INTO Suppliers (SupplierName, TaxNumber, BankSortCode, BankAccountNumber)
VALUES ('AdventureWorks Ltd.',      12345678, '02-11-33', 12345678)
      ,('AdventureWorks',           12345678, '02-55-44', 15161718)
      ,('ADVENTURE WORKS',          23344556, '02-55-66', 15161718)
      ,('ADVENTURE WORKS LTD.',     23344556, '02-77-44', 99887766)
      ,('AW Bike Co',               23344556, '02-88-00', 11991199)
      ,('Big Bike Discounts',       55556666, '02-88-00', 11991199)
      ,('Contoso Corp',             90009000, '02-99-02', 12341234);

And here’s what the data looks like. The matching tax numbers and bank account details are highlighted, and the final column is added to show the required result.

Id	Supplier Name	Tax Number	Bank Sort Code	Bank Account Number	Required Output
1	AdventureWorks Ltd.	12345678	02-11-33	12345678	1
2	AdventureWorks	12345678	02-55-44	15161718	1
3	ADVENTURE WORKS	23344556	02-55-44	15161718	1
4	DVENTURE WORKS LTD.	23344556	02-77-66	99887766	1
5	AW Bike Co	23344556	02-88-00	11991199	1
6	Big Bike Discounts	55556666	02-88-00	11991199	1
7	Contoso Corp	90009000	02-99-02	12341234	2

We have a table containing supplier names, tax numbers, and bank details for invoicing. It is desired to group the data such that if any records have the same tax number or bank details they are added to a group. In the example above, rows 1 to 5 need to be in the same group, even though row 1 has completely different tax and bank details to rows 4 and 5! Row 6 is for a completely different supplier, and as such should be in it’s own group.

Preparing the Data

We can write a query to add two new columns to that data to hold group numbers, one for the tax number, and another for the combination of bank sort code and account number using the DENSE_RANK() window function. This will simplify the logic of our algorithm as we move forward.

SELECT Id
      ,SupplierName
      ,TaxNumber
      ,BankSortCode
      ,BankAccountNumber
      ,DENSE_RANK() OVER (ORDER BY TaxNumber) AS TaxGroup
      ,10 + DENSE_RANK() OVER (ORDER BY BankSortCode, BankAccountNumber) AS BankGroup
FROM Suppliers
ORDER BY Id;

Note: Adding 10 to the dense rank is just to give different numbers for the two groups in the examples for clarity.

The result set looks like this.

Id	Supplier Name	Tax Number	Bank Sort Code	Bank Account Number	Tax Group	Bank Group
1	AdventureWorks Ltd.	12345678	02-11-33	12345678	1	11
2	AdventureWorks	12345678	02-55-44	15161718	1	12
3	ADVENTURE WORKS	23344556	02-55-44	15161718	2	12
4	ADVENTURE WORKS LTD.	23344556	02-77-66	99887766	2	13
5	AW Bike Co	23344556	02-99-00	11991199	2	14
6	Big Bike Discounts	55556666	02-99-00	11991199	3	14
7	Contoso Corp	90009000	02-99-02	12341234	4	15

Adding the new group columns means we have a single column that represents the multi-column groups of the raw data, and we have a nice small data type to work with instead of strings.

The colour coding of those columns shows how the goops on the two columns act like links of a chain, linking the rows together. This is what a graph database would store - the individual records are nodes, linked together by the group numbers or edges.

Now we can define the problem more clearly. We are looking at a number of disconnected graphs, and we want to assign a unique group number to each of those individual graphs. The associations between the node of the graph are made by the nodes having either the same tax or bank account details. Here is the same data drawn as a graph.

The Nodes are numbered using the Id’s of the rows, and the edges are coloured to match the tables above. — The Nodes are numbered using the Id's of the rows, and the edges are coloured to match the tables above.

Note: Unfortunately the new graph features added to SQL Server 2017 are quite limited, and are unable to help us solve this particular problem.

Instead we will solve the problem using the tools available in tSQL. We start by using the last query to build a query to to define how a bank account group links two different tax groups together.

WITH CTE_Groups
AS (SELECT s.Id
          ,DENSE_RANK() OVER (ORDER BY s.TaxNumber) AS TaxNumberGroup
          ,10 + DENSE_RANK() OVER (ORDER BY s.BankSortCode, s.BankAccountNumber) AS BankAccountGroup
    FROM Suppliers AS s
   )
SELECT g1.TaxNumberGroup
      ,g1.BankAccountGroup
      ,g2.TaxNumberGroup AS TaxNumberGroupLink
      ,( CASE WHEN g1.TaxNumberGroup > g2.TaxNumberGroup THEN 1 ELSE 0 END ) AS IsDupeEdge
FROM CTE_Groups AS g1
LEFT JOIN CTE_Groups AS g2
ON g1.BankAccountGroup = g2.BankAccountGroup
   AND g1.TaxNumberGroup <> g2.TaxNumberGroup
GROUP BY g1.TaxNumberGroup
        ,g1.BankAccountGroup
        ,g2.TaxNumberGroup;

And here are the results, colour coded to match the previous tables.

TaxNumberGroup	BankAccountGroup	TaxNumberGroupLink	IsDupeEdge
1	11	NULL	0
1	12	2	0
2	12	1	1
2	13	NULL	0
2	14	3	0
3	14	2	1
4	15	NULL	0

This has abstracted out the links between tax groups and bank account groups from the actual entities that have those attributes. Where we have a pair of rows that represent directed edges in a graph, we have marked a single one as being a duplicate. This is important as we recursively walk the graph, as we don’t want to walk the same path in different directions. We must however include both directions, or we may not be able to walk all the possible routes through the graph.

Associating the Groups

We can build a recursive CTE (common table expression) to walk through this structure, and assign unique group numbers to the distinct graphs.

WITH CTE_Groups
AS (SELECT s.Id
          ,s.SupplierName
          ,s.TaxNumber
          ,s.BankSortCode
          ,s.BankAccountNumber
          ,DENSE_RANK() OVER (ORDER BY s.TaxNumber) AS TaxNumberGroup
          ,10 + DENSE_RANK() OVER (ORDER BY s.BankSortCode, s.BankAccountNumber) AS BankAccountGroup
    FROM Suppliers AS s
   )
,CTE_Links
AS (SELECT g1.TaxNumberGroup
          ,g1.BankAccountGroup
          ,g2.TaxNumberGroup AS TaxNumberGroupLink
          ,( CASE WHEN g1.TaxNumberGroup > g2.TaxNumberGroup THEN 1 ELSE 0 END ) AS IsDupeEdge
          ,CAST( ':' + CAST(g1.TaxNumberGroup AS VARCHAR(10)) + ':' + CAST(g1.BankAccountGroup AS VARCHAR(10)) + ':' + CAST(g2.TaxNumberGroup AS VARCHAR(10)) + ':' AS VARCHAR(MAX) ) AS FullPath
          ,CAST( CAST(g1.BankAccountGroup AS VARCHAR(10)) + ':' + CAST(g2.TaxNumberGroup AS VARCHAR(10)) + ':' AS VARCHAR(MAX) ) AS SubPath
    FROM CTE_Groups AS g1
    LEFT JOIN CTE_Groups AS g2
    ON g1.BankAccountGroup = g2.BankAccountGroup
        AND g1.TaxNumberGroup <> g2.TaxNumberGroup
    GROUP BY g1.TaxNumberGroup
            ,g1.BankAccountGroup
            ,g2.TaxNumberGroup
   )
,RCTE_PathWalker
AS (SELECT l.TaxNumberGroup
          ,l.BankAccountGroup
          ,l.TaxNumberGroupLink
          ,l.TaxNumberGroup AS GraphGroup -- Reuse as unique per disconnected graph
          ,1 AS IterationCount
          ,l.FullPath
    FROM CTE_Links as l
    WHERE l.IsDupeEdge = 0
    UNION ALL
    SELECT c.TaxNumberGroup
          ,c.BankAccountGroup
          ,c.TaxNumberGroupLink
          ,p.GraphGroup
          ,p.IterationCount + 1
          ,p.FullPath + c.SubPath
    FROM RCTE_PathWalker AS p
    INNER JOIN CTE_Links AS c
      ON c.TaxNumberGroup = p.TaxNumberGroupLink
         -- Guard against circular paths through the graph
         AND p.FullPath NOT LIKE '%' + c.FullPath + '%'
    -- Another gaurd against circular paths through the graph
    WHERE p.IterationCount < 100
   )
SELECT s.Id
      ,s.SupplierName
      ,s.TaxNumber
      ,s.BankSortCode
      ,s.BankAccountNumber
      ,g.GraphGroup
FROM CTE_Groups AS s
LEFT JOIN (SELECT pw.TaxNumberGroup
                 ,DENSE_RANK() OVER (ORDER BY MIN(pw.GraphGroup)) AS GraphGroup
           FROM RCTE_PathWalker AS pw
           GROUP BY pw.TaxNumberGroup
          )AS g
  ON g.TaxNumberGroup = s.TaxNumberGroup;

The first two CTE’s are the bits we already covered. In the recursive CTE we use the IsDupeEdge to filter filter out the rows that would make us search the same path from both directions. We then recursively join to the next link in the chain, checking the path of the child is not already in the full path of the parent to prevent following circular chains.

The output of the recursive part looks like this.

Tax Number Group	Bank Account Group	Tax Number Group Link	Graph Group	Iteration Count	Full Path
1	11	NULL	1	1	NULL
1	12	2	1	1	:1:12:2:
2	13	NULL	2	1	NULL
2	14	3	2	1	:2:14:3:
4	15	NULL	4	1	NULL
3	14	2	2	2	:2:14:3:14:2:
2	12	1	2	3	:2:14:3:14:2:12:1:
1	12	2	2	4	:2:14:3:14:2:12:1:12:2:
2	12	1	1	2	:1:12:2:12:1:
2	14	3	1	2	:1:12:2:14:3:
3	14	2	1	3	:1:12:2:14:3:14:2:
2	12	1	1	4	:1:12:2:14:3:14:2:12:1:

The final part then simply groups by the tax group number we calculated at the beginning, using a DENSE_RANK() to normalise the final graph group numbers. This is joined to the result of the first CTE, which gives us the original table, with the GraphGroup, which associates the groups of tax numbers and bank accounts.

Id	SupplierName	Tax Number	Bank Sort Code	Bank Account Number	Graph Group	Required Output
1	AdventureWorks Ltd.	12345678	02-11-33	12345678	1	1
2	AdventureWorks	12345678	02-55-44	15161718	1	1
3	ADVENTURE WORKS	23344556	02-55-44	15161718	1	1
4	ADVENTURE WORKS LTD.	23344556	02-77-66	99887766	1	1
5	AW Bike Co	23344556	02-88-00	11991199	1	1
6	Big Bike Discounts (AW)	55556666	02-88-00	11991199	1	1
7	Contoso Corp	90001000	02-99-02	12341234	2	2

The final output, with the required output column from the introduction added.

Next Steps…

In part 2 we will also look at the new graph DB features added to SQL 2019 CTP3.1, and see how they can help solve the problem, and how the performance compares to the recursive CTE approach.

We have seen how we can associate two different groupings applied to a set of data. However, using a recursive method can sometimes perform quite slow when using larger data sets. In part 3 we will have a look at how the performance scales with more data, and different complexity graphs.

For anyone wanting to learn more about graph databases in general, VisualGo has some great tutorials.

A full copy of the demo code is available here on GitHub Gist, and should work on SQL 2008 onwards.

Associative Grouping using tSQL (Part 1)

Demo Data

Preparing the Data

Associating the Groups

Next Steps…

CATALOG

FEATURED TAGS