Associative Grouping using tSQL (Part 1)

Posted by Niall's Data Blog on Friday, June 21, 2019

This is part of series of posts about associative grouping:

Recently I was asked by a friend to have a look at an interesting query problem he had been looking at. He was trying to find overlapping groups in two columns, in order to put them into a new super group of associated groups. The simplest way to describe the problem is with an demo, so off we go…

Demo Data

In our scenario, we have a that company has grown through acquisitions, with each one bringing a new set of suppliers. Some of those supplier lists have overlapped, hence the need to cleanse that data as it is consolidated into a single system. However, as is often the case in this situation, the duplicate suppliers do not match on the natural business keys. Lets dive straight into an example…

Here’s the SQL to create the sample table.

CREATE TABLE Suppliers
(
    Id INT IDENTITY(1,1) NOT NULL
   ,SupplierName VARCHAR(30) NOT NULL
   ,TaxNumber INT NOT NULL
   ,BankSortCode CHAR(8) NOT NULL
   ,BankAccountNumber INT NOT NULL
);

INSERT INTO Suppliers (SupplierName, TaxNumber, BankSortCode, BankAccountNumber)
VALUES ('AdventureWorks Ltd.',      12345678, '02-11-33', 12345678)
      ,('AdventureWorks',           12345678, '02-55-44', 15161718)
      ,('ADVENTURE WORKS',          23344556, '02-55-66', 15161718)
      ,('ADVENTURE WORKS LTD.',     23344556, '02-77-44', 99887766)
      ,('AW Bike Co',               23344556, '02-88-00', 11991199)
      ,('Big Bike Discounts',       55556666, '02-88-00', 11991199)
      ,('Contoso Corp',             90009000, '02-99-02', 12341234);

And here’s what the data looks like. The matching tax numbers and bank account details are highlighted, and the final column is added to show the required result.

IdSupplier NameTax NumberBank Sort CodeBank Account NumberRequired Output
1AdventureWorks Ltd.1234567802-11-33123456781
2AdventureWorks1234567802-55-44151617181
3ADVENTURE WORKS2334455602-55-44151617181
4DVENTURE WORKS LTD.2334455602-77-66998877661
5AW Bike Co2334455602-88-00119911991
6Big Bike Discounts5555666602-88-00119911991
7Contoso Corp9000900002-99-02123412342

We have a table containing supplier names, tax numbers, and bank details for invoicing. It is desired to group the data such that if any records have the same tax number or bank details they are added to a group. In the example above, rows 1 to 5 need to be in the same group, even though row 1 has completely different tax and bank details to rows 4 and 5! Row 6 is for a completely different supplier, and as such should be in it’s own group.

Preparing the Data

We can write a query to add two new columns to that data to hold group numbers, one for the tax number, and another for the combination of bank sort code and account number using the DENSE_RANK() window function. This will simplify the logic of our algorithm as we move forward.

SELECT Id
      ,SupplierName
      ,TaxNumber
      ,BankSortCode
      ,BankAccountNumber
      ,DENSE_RANK() OVER (ORDER BY TaxNumber) AS TaxGroup
      ,10 + DENSE_RANK() OVER (ORDER BY BankSortCode, BankAccountNumber) AS BankGroup
FROM Suppliers
ORDER BY Id;
Note: Adding 10 to the dense rank is just to give different numbers for the two groups in the examples for clarity.

The result set looks like this.

IdSupplier NameTax NumberBank Sort CodeBank Account NumberTax GroupBank Group
1AdventureWorks Ltd.1234567802-11-3312345678111
2AdventureWorks1234567802-55-4415161718112
3ADVENTURE WORKS2334455602-55-4415161718212
4ADVENTURE WORKS LTD.2334455602-77-6699887766213
5AW Bike Co2334455602-99-0011991199214
6Big Bike Discounts5555666602-99-0011991199314
7Contoso Corp9000900002-99-0212341234415

Adding the new group columns means we have a single column that represents the multi-column groups of the raw data, and we have a nice small data type to work with instead of strings.

The colour coding of those columns shows how the goops on the two columns act like links of a chain, linking the rows together. This is what a graph database would store - the individual records are nodes, linked together by the group numbers or edges.

Now we can define the problem more clearly. We are looking at a number of disconnected graphs, and we want to assign a unique group number to each of those individual graphs. The associations between the node of the graph are made by the nodes having either the same tax or bank account details. Here is the same data drawn as a graph.

The Nodes are numbered using the Id’s of the rows, and the edges are coloured to match the tables above.
The Nodes are numbered using the Id's of the rows, and the edges are coloured to match the tables above.

Note: Unfortunately the new graph features added to SQL Server 2017 are quite limited, and are unable to help us solve this particular problem.

Instead we will solve the problem using the tools available in tSQL. We start by using the last query to build a query to to define how a bank account group links two different tax groups together.

WITH CTE_Groups
AS (SELECT s.Id
          ,DENSE_RANK() OVER (ORDER BY s.TaxNumber) AS TaxNumberGroup
          ,10 + DENSE_RANK() OVER (ORDER BY s.BankSortCode, s.BankAccountNumber) AS BankAccountGroup
    FROM Suppliers AS s
   )
SELECT g1.TaxNumberGroup
      ,g1.BankAccountGroup
      ,g2.TaxNumberGroup AS TaxNumberGroupLink
      ,( CASE WHEN g1.TaxNumberGroup > g2.TaxNumberGroup THEN 1 ELSE 0 END ) AS IsDupeEdge
FROM CTE_Groups AS g1
LEFT JOIN CTE_Groups AS g2
ON g1.BankAccountGroup = g2.BankAccountGroup
   AND g1.TaxNumberGroup <> g2.TaxNumberGroup
GROUP BY g1.TaxNumberGroup
        ,g1.BankAccountGroup
        ,g2.TaxNumberGroup;

And here are the results, colour coded to match the previous tables.

TaxNumberGroupBankAccountGroupTaxNumberGroupLinkIsDupeEdge
111NULL0
11220
21211
213NULL0
21430
31421
415NULL0

This has abstracted out the links between tax groups and bank account groups from the actual entities that have those attributes. Where we have a pair of rows that represent directed edges in a graph, we have marked a single one as being a duplicate. This is important as we recursively walk the graph, as we don’t want to walk the same path in different directions. We must however include both directions, or we may not be able to walk all the possible routes through the graph.

Associating the Groups

We can build a recursive CTE (common table expression) to walk through this structure, and assign unique group numbers to the distinct graphs.

WITH CTE_Groups
AS (SELECT s.Id
          ,s.SupplierName
          ,s.TaxNumber
          ,s.BankSortCode
          ,s.BankAccountNumber
          ,DENSE_RANK() OVER (ORDER BY s.TaxNumber) AS TaxNumberGroup
          ,10 + DENSE_RANK() OVER (ORDER BY s.BankSortCode, s.BankAccountNumber) AS BankAccountGroup
    FROM Suppliers AS s
   )
,CTE_Links
AS (SELECT g1.TaxNumberGroup
          ,g1.BankAccountGroup
          ,g2.TaxNumberGroup AS TaxNumberGroupLink
          ,( CASE WHEN g1.TaxNumberGroup > g2.TaxNumberGroup THEN 1 ELSE 0 END ) AS IsDupeEdge
          ,CAST( ':' + CAST(g1.TaxNumberGroup AS VARCHAR(10)) + ':' + CAST(g1.BankAccountGroup AS VARCHAR(10)) + ':' + CAST(g2.TaxNumberGroup AS VARCHAR(10)) + ':' AS VARCHAR(MAX) ) AS FullPath
          ,CAST( CAST(g1.BankAccountGroup AS VARCHAR(10)) + ':' + CAST(g2.TaxNumberGroup AS VARCHAR(10)) + ':' AS VARCHAR(MAX) ) AS SubPath
    FROM CTE_Groups AS g1
    LEFT JOIN CTE_Groups AS g2
    ON g1.BankAccountGroup = g2.BankAccountGroup
        AND g1.TaxNumberGroup <> g2.TaxNumberGroup
    GROUP BY g1.TaxNumberGroup
            ,g1.BankAccountGroup
            ,g2.TaxNumberGroup
   )
,RCTE_PathWalker
AS (SELECT l.TaxNumberGroup
          ,l.BankAccountGroup
          ,l.TaxNumberGroupLink
          ,l.TaxNumberGroup AS GraphGroup -- Reuse as unique per disconnected graph
          ,1 AS IterationCount
          ,l.FullPath
    FROM CTE_Links as l
    WHERE l.IsDupeEdge = 0
    UNION ALL
    SELECT c.TaxNumberGroup
          ,c.BankAccountGroup
          ,c.TaxNumberGroupLink
          ,p.GraphGroup
          ,p.IterationCount + 1
          ,p.FullPath + c.SubPath
    FROM RCTE_PathWalker AS p
    INNER JOIN CTE_Links AS c
      ON c.TaxNumberGroup = p.TaxNumberGroupLink
         -- Guard against circular paths through the graph
         AND p.FullPath NOT LIKE '%' + c.FullPath + '%'
    -- Another gaurd against circular paths through the graph
    WHERE p.IterationCount < 100
   )
SELECT s.Id
      ,s.SupplierName
      ,s.TaxNumber
      ,s.BankSortCode
      ,s.BankAccountNumber
      ,g.GraphGroup
FROM CTE_Groups AS s
LEFT JOIN (SELECT pw.TaxNumberGroup
                 ,DENSE_RANK() OVER (ORDER BY MIN(pw.GraphGroup)) AS GraphGroup
           FROM RCTE_PathWalker AS pw
           GROUP BY pw.TaxNumberGroup
          )AS g
  ON g.TaxNumberGroup = s.TaxNumberGroup;

The first two CTE’s are the bits we already covered. In the recursive CTE we use the IsDupeEdge to filter filter out the rows that would make us search the same path from both directions. We then recursively join to the next link in the chain, checking the path of the child is not already in the full path of the parent to prevent following circular chains.

The output of the recursive part looks like this.

Tax Number GroupBank Account GroupTax Number Group LinkGraph GroupIteration CountFull Path
111NULL11NULL
112211:1:12:2:
213NULL21NULL
214321:2:14:3:
415NULL41NULL
314222:2:14:3:14:2:
212123:2:14:3:14:2:12:1:
112224:2:14:3:14:2:12:1:12:2:
212112:1:12:2:12:1:
214312:1:12:2:14:3:
314213:1:12:2:14:3:14:2:
212114:1:12:2:14:3:14:2:12:1:

The final part then simply groups by the tax group number we calculated at the beginning, using a DENSE_RANK() to normalise the final graph group numbers. This is joined to the result of the first CTE, which gives us the original table, with the GraphGroup, which associates the groups of tax numbers and bank accounts.

IdSupplierNameTax NumberBank Sort CodeBank Account NumberGraph GroupRequired Output
1AdventureWorks Ltd.1234567802-11-331234567811
2AdventureWorks1234567802-55-441516171811
3ADVENTURE WORKS2334455602-55-441516171811
4ADVENTURE WORKS LTD.2334455602-77-669988776611
5AW Bike Co2334455602-88-001199119911
6Big Bike Discounts (AW)5555666602-88-001199119911
7Contoso Corp9000100002-99-021234123422

The final output, with the required output column from the introduction added.

Next Steps…

In part 2 we will also look at the new graph DB features added to SQL 2019 CTP3.1, and see how they can help solve the problem, and how the performance compares to the recursive CTE approach.

We have seen how we can associate two different groupings applied to a set of data. However, using a recursive method can sometimes perform quite slow when using larger data sets. In part 3 we will have a look at how the performance scales with more data, and different complexity graphs.

For anyone wanting to learn more about graph databases in general, VisualGo has some great tutorials.

A full copy of the demo code is available here on GitHub Gist, and should work on SQL 2008 onwards.