This is second part of 2-part series on SQL clauses. Juan covered the IN() in the first part. We now turn to a close sibling, EXISTS clause.
As Juan demonstrated, IN() can be useful for some situations where we want to match a subset of another table without necessarily changing the output due to joining the tables. In some cases, the subquery itself might be quite complicated and may take lot of processing. We can simplify a bit by using EXISTS.
The difference between IN() and EXISTS() is that IN() returns a resultset which is then evaluated by a main query. EXISTS() simply returns a Boolean value without returning any data about the subquery at all. To illustrate, consider this statement:
SELECT *
FROM tblOrders
WHERE CustomerID IN (
SELECT CustomerID
FROM tblCustomers
WHERE State = 'IL'
);
Suppose we have 3 customers in the state of Illnois. The IN() can be considered equivalent to this:
SELECT *
FROM tblOrders
WHERE CustomerID IN (4, 8, 15);
That is, the subquery returns a list of possible values that is then evaluated by the outer query to determine if the order is to be outputted or not.
Let’s see how we can express the same thing using EXISTS.
SELECT *
FROM tblOrders
WHERE EXISTS (
SELECT NULL
FROM tblCustomers
WHERE tblOrders.CustomerID = tblCustomers.CustomerID
AND tblCustomers.State = 'IL'
);
There are a number of differences. Let’s go over them one by one.
First, you will notice that we do not say “CustomerID = EXISTS(…)”. EXISTS() itself only returns a Boolean value and therefore does not need to be compared to any column and for that reason, it stands on its own without any comparison.
Second, we relate the inner query by correlating tblOrders.CustomerID from outer query to tblCustomers.CustomerID. Some of you might be wondering, “but isn’t correlated subqueries bad?” Not always the case. For example, SQL Server tends to treat an EXISTS as a “semi-join” and thus evaluates it quite efficiently.
Thirdly, we apply the same criteria that we used in the original example, filtering down to only customers that are in the Illinois. Therefore, any customers that has ID of say, 3 and is in state of CA, the subquery would logically evaluate to zero results which causes the EXISTS() to return false. This brings us to the next point.
Lastly, the content of the subquery is totally irrelevant. I wrote “SELECT NULL” because there is nothing that will be evaluated in the SELECT. If you’re not convinced, then try this:
... WHERE EXISTS (
SELECT 1 / 0
FROM tblCustomers
);
Surely the division by zero should cause the query to fail with an error, right? Nope. It just blows through, see that there are records in the tblCustomers and therefore it returns true. This saves us one step compared to the IN() which requires evaluation of the output of IN(). The only thing that should matter is whether there is zero or non-zero records matched as the result of the subquery. That’s why we have 2 criteria; matching on the CustomerID with the tblOrders.CustomerID in addition to the state criteria.
So, which is faster?
Generally speaking, joins should be the first thing to try when building the query. We would consider using either IN() or EXISTS() if we wanted to compare data from other tables without changing the output as joining does. It tends to depend more on the schema and the content of query — sometimes IN() is the better performing because it is able to evaluate the whole subquery at once then used as a 2nd step. Sometimes EXISTS() is better because there is no resultset to evaluate. For smaller queries, you won’t see significant differences in either methods’ performance. For bigger queries or where queries is run frequently, you might want to benchmark and see which is more suited for the job. Note furthermore that different database engines have different “preferences” — Consider those two different approaches we can use to show only Orders that have no customers associated with the order:
Frustrated Join:
SELECT *
FROM tblOrders
LEFT JOIN tblCustomers
ON tblOrders.CustomerID = tblCustomers.CustomerID
WHERE tblCustomers.CustomerID IS NULL;
Note how the join is “frustrated” by the predicate “tblCustomers.CustomerID IS NULL”, which restrict the output to only those orders that has no customers
Existence check:
SELECT *
FROM tblOrders
WHERE NOT EXISTS (
SELECT NULL
FROM tblCustomers
WHERE tblCustomers.CustomerID = tblOrders.CustomerID
);
Likewise, we correlate the CustomerID between two tables in the EXISTS clause and put down NOT to reverse the result that would be outputted.
Both approaches would yield identical output — that is, we get a list of orders that has no customers associated. Access database engine is more likely to prefer the frustrated join whereas SQL Server tends toward the existence check. Different database engines also have their “preferences” so for queries where performance is crucial to you, it is usually useful to compare different expressions of SQL and see which works best for your specific query. As mentioned before, the schema and content of query can strongly influence the performance so those should be the first thing to consider. But when you have to work with the schema as it is, it is very useful to recognize common SQL patterns for the same problem so you have more tools at your disposal for getting most bang from even the most complicated query.
Happy selecting!