The truth about CTE’s?

I had an interesting conversation where a couple of colleagues were saying that CTE’s really work properly.  What do I mean by this?   CTE or Common Table Expressions is a SQL feature that Microsoft added to the T-SQL Language in 2005.   The feature was added to solve a sub query problem.  What exactly is a CTE?    Let’s start by examining the problem.

CTE’s were introduced to solve two problems in SQL Server.

  • Multiple instances of the same subquery
  • Hierarchical (recursive) queries

 

For this post I’m not going to talk about Hierarchical (recursive) queries as that is a subject all of it’s own.

 

Let’s talk about using the multiple instance problem.

Here’s an example of a SQL-2000 based query

select  ColA
from (
     select
        ColA, ColB
     from MyTable
     where ColC = 'Flintstones'
     ) as a
where ColB = 'Fred'
union all
select  ColA
from (
     select
        ColA, ColB
     from MyTable
     where ColC = 'Flintstones'
     ) as a
where ColB = 'Wilma'

 

Points to note about this code block.  Is that the same query is used twice within the same sql statement.  This presents two issues.  Firstly, We’ve got the same subquery being accessed twice, which means we are repeating ourselves when writing the code and secondly, we have an issue where that sub query is being executed twice.  Once for each side of the union.

Let’s see about solving these problems using pre-2005 syntax

select
        ColA, ColB
into #MyTempTable
from MyTable
where ColC = 'Flintstones'

select  ColA
from #MyTempTable as a
where ColB = 'Fred'
union all
select  ColA
from #MyTempTable as a
where ColB = 'Wilma'

drop #MyTempTable

This example is using a temp table.  It can also be modified to use a table variable.   This solves the two issues, but brings about another issue.    The issues solved are that the subquery is now executed only once and that there is only one piece of code doing the work which will make maintenance easier.   However the downside is that we are using a temporary table.    As we are only intending to use the query this one time, that can be an expensive option when compared against executing the query twice.   (This is a whole debate on the cost of writes to tempdb v running a query multiple times, let’s not get into that today)

 

In 2005, Microsoft added the CTE feature to give us another way to solve the problem.

with MyCTE as (
     select
        ColA, ColB
     from MyTable
     where ColC = 'Flintstones'
)
select  ColA
from MyCTE as a
where ColB = 'Fred'
union all
select  ColA
from MyCTE as a
where ColB = 'Wilma'

This is supposed to be that the Common Table (MyCTE) is executed once and can be referenced multiple times.   By that I mean that SQL Server executes MyCTE and creates a dataset of the result.  When that result is referenced in the main query, instead of MyCTE’s result set being calculated every time it’s referenced, once created the result set is reused.   It’s auto cleaned up once the query (the whole CTE) has finished.

 

Now for the confusion, my colleagues are saying that the Query from MyCTE is re-run every time it’s referenced within the CTE!   So in effect, whilst it’s making the code easier to read and maintain, it’s still working under the hood like the first block of code from this post.

 

Let’s take a closer look at the version with the Sub queries.    There’s the execution plan for this query.

Now let’s compare this against the CTE version of the query.

There’s something suspicious happening here.  Both of the execution plans look exactly the same!  So from that we can assume that SQL is doing the same thing under the hood right?    Or could it be that the execution plan is telling little white lies?

So what could be going on?    Well, there’s a programming technique called Lazy Loading.   Imaging this scenario.  SQL Server executes the CTE, the query plan looks like it’s running the sub query multiple times…. however, because it’s a CTE, sql actually only runs it one time, and stores the result set.  The second pass of the dataset should be really cheap as the lookup part of the work is already done and compiled.

If SQL Server is doing this, is should be possible to find some evidence.    Let’s see what we can find.

I’ve filled the table up with a million rows of junk.  Also please note that there’s no indexing on the table deliberately so that I can force a table scan.  This will allow me to easily see if there are multiple scans of the table happening as opposed to a single scan.

With the newly filled table.  Here’s the output of the query

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 1 ms.

(2 row(s) affected)
Table ‘MyTable’. Scan count 18, logical reads 9094, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row(s) affected)

SQL Server Execution Times:
CPU time = 360 ms, elapsed time = 82 ms.
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.

 

So, if the CTE is processing things more efficiently, the logical reads should be lower.  With this query, I’d expect about 5000 reads maybe?

Here’s the output of the CTE version of the query.

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.

(2 row(s) affected)
Table ‘MyTable’. Scan count 18, logical reads 9094, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row(s) affected)

SQL Server Execution Times:
CPU time = 251 ms, elapsed time = 115 ms.
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.

 

This is surprising, it’s still doing the same number of logical reads.    Hmm, does this mean that there’s something wrong with my testing mechanism?

Let’s check by running the version of the query that uses a temp table.  Here’s the output

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.
Table ‘MyTable’. Scan count 1, logical reads 4547, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(3 row(s) affected)

(1 row(s) affected)

SQL Server Execution Times:
CPU time = 78 ms, elapsed time = 138 ms.
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 1 ms.

(2 row(s) affected)
Table ‘#MyTempTable________________________________________________________________________________________________________000000000005’. Scan count 2, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row(s) affected)

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 45 ms.

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.

 

There’s two separate queries taking place.  first is 4547 logical reads, and the second has 2 logical reads making for a total of 4549.  So using this example, it does indeed seem that SQL Server (specificallcy Microsoft SQL Server 2016 (SP1-GDR) (KB4019089) – 13.0.4206.0 (X64) Jul 6 2017 07:55:03 Copyright (c) Microsoft Corporation Developer Edition (64-bit) on Windows 10 Pro 6.3 <X64> (Build 16299: ))  does treat cte’s exactly the same as sub queries.  There is no difference between writing a CTE and writing a sql statement with sub queries.

 

There is still one question that is burning on my mind.  The example that I’m using above is rather simplistic.  There is a possibility that the query optimiser is doing something with the statistics on the table and deciding that it’s not worth creating a dataset for the CTE query.  To answer that part of the question, I think further testing will be needed using a query that is more complicated.

Leave a Reply

Your email address will not be published. Required fields are marked *