SQL Server has a feature for partitioning tables and indexes. Partitioning can be implemented at many levels, however. You can create multiple tables for one logical data set, you can split the set into multiple databases, and you can even split it among different servers. Sharding is another term. It refers to partitioning data to horizontally scale out compute resources and storage.
There are different methods of handling sharding. Each of them need a central point of connection to handle querying the data on the other shards. This is typically called the control node. The method I am to discuss today is one where linked servers are used to connect the various shards.
It is always important to expend the least amount of resources with any query to a database. With this architecture, eliminating nodes from each query execution is an easy and effective means of limiting the processing cost of a query.
This view UNIONs the dbo.Sales table on the control node and two remote shards. The view explicitly defines the id ranges which exist on each shard. This leverages SQL Server’s contradiction detection to drop remote servers calls from the query, when those ids are not required for the query output.
Click here to download the full setup script with create database statements, object creation, and sample data.
DROP VIEW IF EXISTS dbo.vwSales; GO CREATE VIEW dbo.vwSales AS SELECT id, amount FROM [LinkedServerElimination].dbo.Sales WHERE id <= 20 UNION ALL SELECT id, amount FROM [.\SQL2017_1].[LinkedServerElimination].dbo.Sales WHERE id > 20 AND id <= 40 UNION ALL SELECT id, amount FROM [.\SQL2016].[LinkedServerElimination].dbo.Sales WHERE id > 40 GO
With this view, we can see the linked server elimination in action by tracing the connections on each shard and verifying that no activity occurs on shards which do not have the necessary Sales ids. That method is the most complete but not as easy to visualize. A better picture of what is happen can be seen with the execution plans. Below are the actual execution plans on freshly recompiled versions of the same query, with different parameter inputs.
This query spans all three shards and produces a plan with two remote server calls and the self-reference on the control node. This is meant to show that the view can cover the full scope of data and to call out the cost of each remote server call.
DECLARE @min_id TINYINT = 0 DECLARE @max_id TINYINT = 100 SELECT * FROM dbo.vwSales s WHERE s.id >= @min_id AND s.id <= @max_id
Next, the query only references a small number of Sales ids, which all live on the local node’s dbo.Sales table. Therefore, no remote server calls are required.
DECLARE @min_id TINYINT = 0 DECLARE @max_id TINYINT = 10 SELECT * FROM dbo.vwSales s WHERE s.id >= @min_id AND s.id <= @max_id
This query spans two nodes, the local node and one remote node.
DECLARE @min_id TINYINT = 19 DECLARE @max_id TINYINT = 33 SELECT * FROM dbo.vwSales s WHERE s.id >= @min_id AND s.id <= @max_id
There you have it, contradiction detection, partition elimination, and sharding, in action.
I must admit, however, I used the OPTION (RECOMPILE) on all those queries to make sure the plans looked nice for this demonstration. If I had not, the execution plans would all have looked like the one with three legs to the plan. Without the query hint, work would still not occur on unnecessary shards, though. To confirm this, you would have to check the sessions on each shard, as I mentioned above. With that fact, I have technically not demonstrated proof of my claims. This is good because you should never believe anything you read on the internet anyways.
Go test it out for yourself and have fun with partition elimination.