Monday, May 4, 2009

String and Custom Aggregations using CTE and Recursive Queries

You'll all be familiar with normal aggregation.

For example I can get the number of subcategories per category in AdventureWorks:

SELECT c.Name,
COUNT(*)
as SubCategoryCount
FROM Production.ProductCategory c
JOIN Production.ProductSubCategory s
ON c.ProductCategoryID = s.ProductCategoryID
GROUP
BY c.Name

Which returns:

Name

SubCategoryCount

Accessories

12

Bikes

3

Clothing

8

Components

14


 

SQL 2005 supports 13 aggregation functions: AVG, CHECKSUM, CHECKSUM_AGG, COUNT, COUNT_BIG, GROUPING, MAX, MIN, SUM, STDEV, STDEVP, VAR and VARP but nothing to aggregate strings.
You can create custom CLR aggregation functions but on some occasions, a simple SQL recursive query may be able to do what you need.

Let's have a look first at a case that doesn't require recursive queries.
Let's create a comma delimited string of all the available categories.
The data comes from the Product.ProductCategory table:

select
*
from Production.ProductCategory

And returns

ProductCategoryID

Name

rowguid

ModifiedDate

1

Bikes

CFBDA25C-DF71-47A7-B81B-64EE161AA37C

1998-06-01 00:00:00.000

2

Components

C657828D-D808-4ABA-91A3-AF2CE02300E9

1998-06-01 00:00:00.000

3

Clothing

10A7C342-CA82-48D4-8A38-46A2EB089B74

1998-06-01 00:00:00.000

4

Accessories

2BE3BE36-D9A2-4EEE-B593-ED895D97C2A6

1998-06-01 00:00:00.000


To get the comma delimited list of Categories, we can use the COALESCE function and a string variable:

DECLARE @categories varchar(200)
SET @categories =
NULL

SELECT @categories =
COALESCE(@categories +
',','')
+
Name
FROM
Production.ProductCategory

SELECT @categories

And returns, as expected a string "Accessories,Bikes,Clothing,Components"


 

But now, how would we return the recordset of categories with a second column containing the list of subcategories.

I can get the list of subcategories for a single category but not for each category using the COALESCE method.

This is where recursive queries help. First I need to number each subcategory within the category. I can do that using RANK and ROW_NUMBER functions.


 

SELECT

    c.Name as Category,

    s.Name as SubCategory,

ROW_NUMBER()
OVER(ORDER
BY c.Name,s.Name)
+ 1
        -
RANK()
OVER(ORDER
BY c.Name)
as SubCategoryNumber
FROM Production.ProductCategory c
JOIN Production.ProductSubCategory s
ON c.ProductCategoryID = s.ProductCategoryID

And now the recursive query where each loop of the query adds to the comma delimited string with the initial value equal to the first subcategory


 

WITH


SubCategories(Category, Subcategory, SubCategoryNumber)
AS
(

    SELECT

        c.Name as Category,

        s.Name as SubCategory,
        ROW_NUMBER()
OVER(ORDER
BY c.Name,s.Name)
+ 1
            -
RANK()
OVER(ORDER
BY c.Name)
as SubCategoryNumber
    FROM Production.ProductCategory c
    JOIN Production.ProductSubCategory s
    ON c.ProductCategoryID = s.ProductCategoryID ),

TempCategories(Category, Subcategories, SubCategoryNumber)
AS
(

    SELECT Category,
CAST(Subcategory as
varchar(max)), SubCategoryNumber
    FROM SubCategories
    WHERE SubCategoryNumber = 1

    UNION ALL

    SELECT TempCategories.Category,
CAST(TempCategories.SubCategories +
','
+ SubCategories.Subcategory as
varchar(max)), TempCategories.SubCategoryNumber + 1
    FROM TempCategories
    JOIN SubCategories
    ON TempCategories.Category = SubCategories.Category and TempCategories.SubCategoryNumber + 1 = SubCategories.SubCategoryNumber ),

MaxCategories(Category, SubCategoryNumber)
AS
(

    SELECT Category,
MAX(SubCategoryNumber)
AS SubCategoryNumber
    FROM SubCategories
    GROUP
BY Category ),

Categories (Category, Subcategories)
AS
(

    SELECT TempCategories.Category, TempCategories.Subcategories
    FROM TempCategories
    JOIN MaxCategories
    ON TempCategories.Category = MaxCategories.Category AND TempCategories.SubCategoryNumber = MaxCategories.SubCategoryNumber )

SELECT
*
FROM Categories

Which returns:


 

Category

Subcategories

Components

Bottom Brackets,Brakes,Chains,Cranksets,Derailleurs,Forks,Handlebars,Headsets,Mountain Frames,Pedals,Road Frames,Saddles,Touring Frames,Wheels

Clothing

Bib-Shorts,Caps,Gloves,Jerseys,Shorts,Socks,Tights,Vests

Bikes

Mountain Bikes,Road Bikes,Touring Bikes

Accessories

Bike Racks,Bike Stands,Bottles and Cages,Cleaners,Fenders,Helmets,Hydration Packs,Lights,Locks,Panniers,Pumps,Tires and Tubes


 

If we look at each query:


 

SubCategories, we've seen already, returns the full list of categories and subcategories with a subcategory number, string at 1, for each category.

TempCategories is the recursive query and adds at each level the previous subcategory list with the current subcategory, so the first subcategory contains itself, the second catains the first and itself, the third contains the first, the second and itself and so on. We then need to only keep the last record of each category that contains every subcategory.

MaxCategories contains the highest SubcategoryNumber for each category.

Categories joins TempCategories and MaxCategory to only retain the highest subcategory number per category. The result we need.


 

The aggregation operation defined in TempCategories (here adding the string together) could be replaced with any custom aggregation. It iteratively, in a known sequence, combines the previous calculated value with the new value for that row.


 

0 comments:

Post a Comment