Less is More :(

The majority of work I do, as a programmer who wears a shirt and tie, involves databases. And I try to keep things nice and simple when working with databases. What I mean by simple is, extra tables in a database is fine, even triple the “minimum number” (whatever that is), if they all mostly follow some pattern. Constructing a database schema is then a simple application of the rules devised when designing the pattern, and means there’s not much to think about or have go wrong. With a judicious choice of pattern, ORM tools like Hibernate or other things like .NET Entity Framework can start working for you rather than against you. It’s even possible to write code to check over the schema and ensure that the pattern is being followed, and then write code to write code to do all the data access (yes, write code to write code…)

My boss, on the other hand, has a slightly different idea of simplicity. Namely, in his view, simplicity = minimise number of tables, and rely on stored procedures for everything. The arguments I want to win with him are not really what I want to blog about today.

Today I’d like to describe how I implemented strongly-typed lists of objects (think List<SomeType> in C# / list<AClass> in C++) adhering to my boss’s vision of minimising table count, using only three tables and a pile of T-SQL and custom constraints, instead of the “more complex” system of lots of join tables and foreign key constraints. I do not particularly recommend this mode of implementation. For a start, it has made use of .NET Entity Framework, shall we say, “interesting.”

So consider directly representing classes, such as a Product class, each with a table in the schema, such as tProduct. Suppose the Product class has properties with the types List<Extras> and List<Retailers>, representing the extras one can buy with a product, and the retailers retailing a product. (Suppose we also model Extras and Retailers with tExtra and tRetailer, respectively.)

If we were implementing this with regular old join tables, there would be no columns in tProduct to represent the lists, rather, there would be some table tProductExtras, pointing at both tProduct and tExtra with FK constraints (call the columns ProductID and ExtraID), and a tProductRetailers, pointing at both tProduct and tRetailer in a similar fashion.

But now my boss comes along and says, “Oh, that’s too many tables. Your join tables really have the same structure, two columns of type INT, so they could easily be merged.” And he’s right on the second point, but with one caveat. If we had one join table for the whole database, call it tJoin, that simply had e.g. ParentID and ItemID, we cannot use foreign key constraints on either of them, since ParentID points to any of the tables that represent objects possessing a List of something, and ItemID points to any of the tables representing objects being part of a List. Which means there’s no database-level way of enforcing data integrity in tJoin, right? Lists are now so untyped it’s not funny.

Well, there is a way. (My sincere apologies.) Enter tList and tClass. For simplicity, let tClass just be a table with a two columns, ID as usual, and Name, and containing with names of classes, each one with a table in the database. So going back to the example above, tClass has (1, "Product"), (2, "Extra"), and (3, "Retailer").

Now define tList as follows: give it an ID column, and a column which is a FK to tClass, call it ClassID. The ClassID is going to give us the type of object being contained in the list—the “T” in “List”, as it were. tJoin is now going to have the columns ListID (FK to tList), and ItemID. (It needs to have something else if the List is going to be a proper list and specify the order in which items appear, but if this bugs you, just imagine we have an Order column in tJoin that specifies what order things appear in the list.)

CREATE TABLE dbo.tClass (
    ID int IDENTITY(0,1) NOT NULL,
    Name varchar(50) NOT NULL,
    CONSTRAINT PK_tClass PRIMARY KEY CLUSTERED
    (ID ASC))
CREATE TABLE dbo.tList (
    ID int IDENTITY(0,1) NOT NULL,
    ClassID int &nbsp;NOT NULL,
    CONSTRAINT PK_tList PRIMARY KEY CLUSTERED
    (ID ASC))
 GO
 ALTER TABLE dbo.tList WITH CHECK ADD CONSTRAINT [FK_tList_tClass] FOREIGN KEY(ClassID)
     REFERENCES dbo.tClass (ID)
 GO
 CREATE TABLE dbo.tJoin (
     ListID int NOT NULL,
     ItemID int NOT NULL,
)
GO
ALTER TABLE dbo.tJoin WITH CHECK ADD CONSTRAINT [FK_tJoin_tList] FOREIGN KEY(ListID)
    REFERENCES dbo.tList (ID)
GO

Now go back to tProduct. It is now going to have two FK columns to tList, call them ExtrasListID and RetailersListID. These will adequately model the Product having attributes of type List<Anything>. But really we also want to be sure that the tProduct columns point to the right kind of lists, i.e. that ExtrasListID refers to a tList referring to tClass with name ‘Extra’, and similarly for RetailersListID. Here comes the first custom constraint.

In SQL Server, one cannot have arbitrary code in a constraint—not even a SELECT. Only fairly simple things are allowed, such as =, >, IN, and BETWEEN. How is it possible, then, to enforce this kind of policy? It turns out that scalar user-defined functions (UDFs) are allowed. So we create a UDF like so, to perform a SELECT for us:

CREATE FUNCTION [dbo].[fListIsOfType]
(
    @ListID int,
    @ClassName varchar(50)
)
RETURNS int
AS
BEGIN
    DECLARE @Result int
    SELECT @Result = COUNT(*) FROM tList INNER JOIN tClass ON tList.ClassID = tClass.ID
    WHERE tList.ID = @ListID AND tClass.Name = @ClassName
    RETURN @Result
END

Now we have the pleasure of creating a custom checked constraint on each list column like so:

ALTER TABLE [dbo].[tProduct] WITH CHECK ADD CONSTRAINT [CK_tProduct_Extras] CHECK (([dbo].[fListIsOfType]([ExtrasListID],'Extra')>(0)))
GO
ALTER TABLE [dbo].[tProduct] CHECK CONSTRAINT [CK_tProduct_Extras]
GO
ALTER TABLE [dbo].[tProduct] &nbsp;WITH CHECK ADD CONSTRAINT [CK_tProduct_Retailers] CHECK (([dbo].[fListIsOfType]([RetailersListID],'Retailer')>(0)))
GO
ALTER TABLE [dbo].[tProduct] CHECK CONSTRAINT [CK_tProduct_Retailers]
GO

Great. Now we can be pretty sure that ExtrasListID is a List<Extra>, and not a List<SomethingElse>. We could stop here and rely on stored procedures to do the rest of the work—ensuring items are of the correct type when adding to lists and so on—but that’s no fun. Especially when *ahem* somebody invariably ignores the stored procedures and puts crap data in there manually.

The other side of tJoin requires constraining. This is a bit harder. Here’s how I did it. First, I wrote another scalar UDF like so:

CREATE FUNCTION [dbo].[fItemBelongs]
(
    @ItemID int,
    @ListID int
)
RETURNS int
AS
BEGIN
    DECLARE @Result int
    DECLARE @ClassName varchar(50)
    SELECT @ClassName = tClass.Name FROM tList INNER JOIN tClass ON tList.ClassID = tClass.ID WHERE tList.ID = @ListID
    SELECT @Result = CASE
        WHEN @ClassName = 'Extra' THEN (SELECT COUNT(*) FROM tExtra WHERE ID = @ItemID)
        WHEN @ClassName = 'Retailer' THEN (SELECT COUNT(*) FROM tRetailer WHERE ID = @ItemID)
        ELSE 0
    END
    RETURN @Result
END

Now I didn’t finish adding cases to the CASE expression, because I have a lot of tables. It’s boring.

The reason we can’t just write one metaquery, instead of a whole heap inside something like CASE, that finds the right table to do the lookup in is…because UDFs can’t execute arbitrary SQL. I’d love to just write

-- ...
DECLARE @sql nvarchar(max), @params nvarchar(max)
DECLARE @count int
SET @sql = 'SELECT @count = COUNT(*) FROM t' + @ClassName + ' WHERE ID = @ItemID'
SET @params = '@count int OUTPUT, @ItemID int'
EXEC sp_executesql @sql, @params, @count OUTPUT, @ItemID
RETURN @count
-- ..

and put that into a UDF, but due to the limitations we can’t (and fair enough, if we could execute stuff like that in a UDF, forget about query optimisation).

I stopped adding cases, but I used the function as the basis for something else. I wrote a stored procedure that updates the fItemBelongs function as follows.

CREATE PROCEDURE [dbo].[sUpdateItemBelongsTest]
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @sql nvarchar(max)
    SET @sql = 'ALTER FUNCTION fItemBelongs
    (
        @ItemID int,
        @ListID int
    )
    RETURNS int
    AS
    BEGIN
        DECLARE @Result int
        DECLARE @ClassName varchar(50)
        SELECT @ClassName = tClass.Name FROM tList INNER JOIN tClass ON tList.ClassID = tClass.ID WHERE tList.ID = @ListID
        SELECT @Result = CASE ' +
            REPLACE((SELECT '     WHEN @ClassName = ''' + ClassName + ''' THEN (SELECT COUNT(*) FROM t' + ClassName + ' WHERE ID = @ItemID)
            '    FROM tClass FOR XML PATH('')), '&#x0D;', '') +
            '    ELSE 0
            END
        RETURN @Result
    END'
    PRINT @sql
    EXEC sp_executesql @sql
END

“What the hell is that about?” I hear you ask. Well. It builds the ALTER FUNCTION statement using rows out of the tClass table, prints the statement just to show me what is being executed, then executes it. To build the massive CASE, it concatenates a pile of strings of the form WHEN ClassName = 'Blah' (SELECT COUNT(*) FROM tBlah WHERE ID = @ItemID, using the FOR XML PATH('') concatenation trick on the tClass table.

Now using the function is as easy as

ALTER TABLE [dbo].[tJoin] WITH CHECK ADD CONSTRAINT [CK_tJoin_ItemsBelong] CHECK (([dbo].[fItemBelongs]([ItemID],[ListID])>(0)))
GO
ALTER TABLE [dbo].[tJoin] CHECK CONSTRAINT [CK_tJoin_ItemsBelong]
GO

and to ensure you always have a current fItemBelongs, consider adding the stored procedure to a trigger on changes to tClass.

Finally, I wanted to ensure that tClass only referred to things that have tables—specifically, tables with a proper ID column of the right type and so on. To do this, I wrote another UDF.

CREATE FUNCTION [dbo].[fClassHasTable]
(
    @ClassName varchar(50)
)
RETURNS int
AS
BEGIN
    DECLARE @Result int
    SELECT @Result = COUNT(*) FROM sys.tables T
        INNER JOIN sys.columns C ON T.object_id = C.object_id
        WHERE T.name = 't' + @ClassName
        AND C.name = 'ID'
        AND C.user_type_id = 56        -- 56 = int, YMMV.
        AND C.is_nullable = 0
        AND C.is_identity = 1
    RETURN @Result
END

And using it in another constraint was pretty straightforward.