SSIS vs T-SQL – which one is fastest for ETL tasks?

There are many myths around SSIS vs T-SQL performance. I have met several so-called “Business Intelligence experts” who keep claiming that T-SQL is a “toy language” and that if you really want performance you have to rewrite all your ETL in SSIS. On the other hand, the relational engine people are convinced of the opposite, on the grounds that with T-SQL everything is processed within the SQL engine. Who’s right and who’s wrong? I thought it was time someone dispelled some of these myths.

Scope

Firstly, there are many other aspects to choosing an ETL tool besides performance, such as maintainability, the current skill set of your team, and specific features available only in SSIS or only in T-SQL. However, the scope of this article is performance only – the other aspects could make a good topic for another blog article.

Test case

Real testing is the only thing that counts. I could give many reasons on both sides why either one should be the fastest. In reality it depends on many factors, so don’t take the results in this article as a universal truth. Still, I find it interesting to look at a common ETL task: loading a Fact table. In this test case the data is already inside SQL Server, so the results would differ for a test case that loads external data.

The scripts for creating these tables and test data are published further down in this article.

My test case is to load this Fact table with 1 million rows from a Source table. As part of the loading job, 10 dimension keys are converted from natural keys to surrogate keys (see the picture below). As most Business Intelligence people know, using surrogate keys in Fact tables is a best practice, so this is something very common in Business Intelligence solutions.

I ran two test cases: one where the Fact table is initially empty and one where it already contains 4 million rows.

To make the SSIS and T-SQL results comparable, it is very important that both solutions follow best practices and are no more complicated than necessary (I will avoid any deeper performance tuning in this test). For more in-depth information about how to improve performance, see the excellent article “The Data Loading Performance Guide” by the SQLCAT team.

Note that you could get different results depending on your environment. I’m running the tests on SQL Server 2012, on a laptop with a quad-core CPU and an SSD.

T-SQL

The straightforward solution is to do a JOIN between the Source table and the Dimension tables, so it would look something like this:

SELECT Dim1.ID, …
FROM dbo.Source
JOIN dbo.Dim1 ON Source.Dim1_Key = Dim1.[Key]

However, if we are not careful, we will get bad performance and a warning.


Beware of execution plans like this! The real performance killer here is the sort operator (note the warning icon) that spills to tempdb instead of running in memory. This happens because the query optimizer in the SQL Server relational engine underestimates the number of rows. This is very typical for data warehouses, because of data skew, and very bad for performance. Updating the statistics on the tables doesn’t help; we need to help the query optimizer in other ways.

  1. Fixing the estimated number of rows can be done by changing the JOIN into a “LEFT OUTER JOIN” instead. That guarantees to the query optimizer that we will get 1 million rows, and since all rows match anyway there is no difference in the output.
  2. We also, in my experience, get better execution plans and performance if we disable the foreign keys while loading and then re-enable them once loading has finished, so they get verified. We could also skip the validation completely, but that would in my opinion be bad practice. To disable the foreign keys we use “ALTER TABLE … NOCHECK …” and to re-enable them “ALTER TABLE … WITH CHECK …” (the exact statements are shown below and in the full script at the end of this article).
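
These are the statements used in the full script: the first runs before the load, the second after it.

-- Disable all foreign key checks on the Fact table before loading
ALTER TABLE dbo.Fact NOCHECK CONSTRAINT ALL

-- Re-enable and re-validate the foreign keys after loading, so they are trusted again
ALTER TABLE dbo.Fact WITH CHECK CHECK CONSTRAINT ALL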

Another thing we can do is to make sure the operation is minimally logged (so that an excess amount of data is not written to the transaction log, which would cost us a lot of unnecessary I/O). To do that, we add a TABLOCK hint at the beginning of the INSERT statement. Make sure the database is running in the simple or bulk-logged recovery model. When the table contains existing data, we also need to enable trace flag 610:

DBCC TRACEON(610)
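
If you want to double-check the recovery model before running the load (assuming the target database is the current database), you can query sys.databases:

-- Should return SIMPLE or BULK_LOGGED for minimal logging to be possible
SELECT name, recovery_model_desc
FROM sys.databases
WHERE name = DB_NAME()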

Finally, to get performance measurements, we enable statistics output:

SET STATISTICS TIME ON
SET STATISTICS IO ON

The resulting T-SQL script is shown at the end of this article. We now get a nice parallel execution plan with 10 Hash Matches.

Execution times for loading the data and then verifying the foreign keys, starting with an empty Fact table:

CPU time = 49661 ms,  elapsed time = 8278 ms.
CPU time = 9345 ms,  elapsed time = 3772 ms.

So the total duration for this solution (loading an empty Fact table) is 8.278 + 3.772 = 12.1 seconds.

For the case when we have 4 million existing rows, the loading takes around 11.4 + 9.0 = 20.4 seconds.

SSIS

With SSIS we have two options: either running SSIS on the same machine as the relational engine or on a separate machine. I tested both cases.

Creating an SSIS package for this ETL task is very straightforward. Basically we use an OLE DB source, one Lookup transform per dimension, and then write to the Fact table (see the sketch below).
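
As a rough sketch (based on the table definitions in the scripts at the end of this article), the OLE DB source and each Lookup transform can be driven by queries like these:

-- OLE DB source: read the measures and the natural keys
SELECT Measure1, Measure2, Measure3,
       Dim1_Key, Dim2_Key, Dim3_Key, Dim4_Key, Dim5_Key,
       Dim6_Key, Dim7_Key, Dim8_Key, Dim9_Key, Dim10_Key
FROM dbo.[Source]

-- Lookup for the first dimension (repeat for Dim2 through Dim10)
SELECT ID, [Key] FROM dbo.Dim1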

To improve performance we use a network packet size of 32767 instead of the default 4096 (configured on the OLE DB connection manager). We also issue a table lock (like the TABLOCK hint we used in the T-SQL solution) and avoid checking constraints (the foreign keys) while loading the Fact table.

However, we want them to be verified after the load; otherwise they will be flagged as is_not_trusted by the SQL Server relational engine. The easiest way to achieve this is to add a pre-load step and a post-load step that execute the same statements as in the T-SQL solution.

Running on the same machine

To get the best performance, we have to ensure we use the fastest possible communication method, shared memory, between SSIS and the SQL Server database engine. This can be done by specifying lpc: in front of the server name in the connection string. I use the SQL Server Native Client 11.0 (SQLNCLI11.1) for the connection. I also use the SQL Server Destination (which only works when SSIS and the database run on the same server).
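
As an illustration (the server and database names are placeholders), the connection string could look something like this; the lpc: prefix forces shared memory, and the larger packet size is set here as well:

Provider=SQLNCLI11.1;Data Source=lpc:MYSERVER;Initial Catalog=MyDW;Integrated Security=SSPI;Packet Size=32767;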

Now the question is: how fast is this SSIS package? If we run it within Visual Studio, the GUI adds some overhead, so we run it from the command line through dtexec to get a better estimate.
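
For example (the package path is just a placeholder):

dtexec /FILE "C:\SSIS\LoadFact.dtsx"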

The total execution time in SSIS for loading the empty Fact table is approximately 12.2 seconds. Loading the Fact table with 4 million existing rows takes approximately 20.4 seconds. This is almost exactly the same as the results for the T-SQL solution!

Running SSIS on a separate computer

What about running SSIS on a separate computer? I tested that as well, over a 1 Gbit LAN. Running on separate computers means we get more CPU power, but it also increases the overhead, as the data needs to go through the OLE DB and network drivers and be transmitted over the network.

The result for loading the empty Fact table is an execution time of 36 seconds.

This can surely be improved, probably by dividing the data flow over multiple connections, but it is nowhere near the performance of running everything on the same computer. For simple tasks like this, which don’t require much CPU and where there is no other load on the machine, running SSIS on a separate computer does not improve performance.

Conclusions

While my test is simple and limited, it strongly indicates that T-SQL and SSIS do not have any significant performance difference when the data is already within the database. Of course, it depends on doing things right – “a fool with a tool is still a fool”. The differences lie in other areas such as maintainability, required skills and specific features.

Loading external data is a different story. This is not covered by this article, but SSIS greatly outperforms the SQL Server database engine when loading data from flat files or external databases (even SQL Server databases).

Also, running SSIS on a separate computer does not increase performance unless the ETL task is pretty CPU intensive or there are other reasons it’s necessary to offload the database engine.

Can the performance of my examples be improved? Sure it can! Here are some resources for further reading:

Scripts

Create the tables and indexes

CREATE TABLE [dbo].[Dim1](
[ID] [int] IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
[Key] [varchar](10) NOT NULL,
)

CREATE TABLE [dbo].[Dim2](
[ID] [int] IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
[Key] [varchar](10) NOT NULL,
)

CREATE TABLE [dbo].[Dim3](
[ID] [int] IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
[Key] [varchar](10) NOT NULL,
)

CREATE TABLE [dbo].[Dim4](
[ID] [int] IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
[Key] [varchar](10) NOT NULL,
)

CREATE TABLE [dbo].[Dim5](
[ID] [int] IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
[Key] [varchar](10) NOT NULL,
)

CREATE TABLE [dbo].[Dim6](
[ID] [int] IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
[Key] [varchar](10) NOT NULL,
)

CREATE TABLE [dbo].[Dim7](
[ID] [int] IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
[Key] [varchar](10) NOT NULL,
)

CREATE TABLE [dbo].[Dim8](
[ID] [int] IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
[Key] [varchar](10) NOT NULL,
)

CREATE TABLE [dbo].[Dim9](
[ID] [int] IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
[Key] [varchar](10) NOT NULL,
)

CREATE TABLE [dbo].[Dim10](
[ID] [int] IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
[Key] [varchar](10) NOT NULL,
)

CREATE TABLE [dbo].[Fact](
[ID] [int] IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
[Measure1] [float] NOT NULL,
[Measure2] [int] NOT NULL,
[Measure3] [int] NOT NULL,
[Dim1_Id] [int] NOT NULL,
[Dim2_Id] [int] NOT NULL,
[Dim3_Id] [int] NOT NULL,
[Dim4_Id] [int] NOT NULL,
[Dim5_Id] [int] NOT NULL,
[Dim6_Id] [int] NOT NULL,
[Dim7_Id] [int] NOT NULL,
[Dim8_Id] [int] NOT NULL,
[Dim9_Id] [int] NOT NULL,
[Dim10_Id] [int] NOT NULL
)

CREATE TABLE [dbo].[Source](
[ID] [int] IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
[Measure1] [float] NOT NULL,
[Measure2] [int] NOT NULL,
[Measure3] [int] NOT NULL,
[Dim1_Key] [varchar](10) NOT NULL,
[Dim2_Key] [varchar](10) NOT NULL,
[Dim3_Key] [varchar](10) NOT NULL,
[Dim4_Key] [varchar](10) NOT NULL,
[Dim5_Key] [varchar](10) NOT NULL,
[Dim6_Key] [varchar](10) NOT NULL,
[Dim7_Key] [varchar](10) NOT NULL,
[Dim8_Key] [varchar](10) NOT NULL,
[Dim9_Key] [varchar](10) NOT NULL,
[Dim10_Key] [varchar](10) NOT NULL
) ON [PRIMARY]

GO

ALTER TABLE [dbo].[Fact] ADD CONSTRAINT [FK_Fact_Dim1] FOREIGN KEY([Dim1_Id]) REFERENCES [dbo].[Dim1] ([ID])
ALTER TABLE [dbo].[Fact] ADD CONSTRAINT [FK_Fact_Dim2] FOREIGN KEY([Dim2_Id]) REFERENCES [dbo].[Dim2] ([ID])
ALTER TABLE [dbo].[Fact] ADD CONSTRAINT [FK_Fact_Dim3] FOREIGN KEY([Dim3_Id]) REFERENCES [dbo].[Dim3] ([ID])
ALTER TABLE [dbo].[Fact] ADD CONSTRAINT [FK_Fact_Dim4] FOREIGN KEY([Dim4_Id]) REFERENCES [dbo].[Dim4] ([ID])
ALTER TABLE [dbo].[Fact] ADD CONSTRAINT [FK_Fact_Dim5] FOREIGN KEY([Dim5_Id]) REFERENCES [dbo].[Dim5] ([ID])
ALTER TABLE [dbo].[Fact] ADD CONSTRAINT [FK_Fact_Dim6] FOREIGN KEY([Dim6_Id]) REFERENCES [dbo].[Dim6] ([ID])
ALTER TABLE [dbo].[Fact] ADD CONSTRAINT [FK_Fact_Dim7] FOREIGN KEY([Dim7_Id]) REFERENCES [dbo].[Dim7] ([ID])
ALTER TABLE [dbo].[Fact] ADD CONSTRAINT [FK_Fact_Dim8] FOREIGN KEY([Dim8_Id]) REFERENCES [dbo].[Dim8] ([ID])
ALTER TABLE [dbo].[Fact] ADD CONSTRAINT [FK_Fact_Dim9] FOREIGN KEY([Dim9_Id]) REFERENCES [dbo].[Dim9] ([ID])
ALTER TABLE [dbo].[Fact] ADD CONSTRAINT [FK_Fact_Dim10] FOREIGN KEY([Dim10_Id]) REFERENCES [dbo].[Dim10] ([ID])

GO

CREATE UNIQUE NONCLUSTERED INDEX UX_Dim1 ON dbo.Dim1 ([Key]) WITH (FILLFACTOR = 100, DATA_COMPRESSION = NONE)
CREATE UNIQUE NONCLUSTERED INDEX UX_Dim2 ON dbo.Dim2 ([Key]) WITH (FILLFACTOR = 100, DATA_COMPRESSION = NONE)
CREATE UNIQUE NONCLUSTERED INDEX UX_Dim3 ON dbo.Dim3 ([Key]) WITH (FILLFACTOR = 100, DATA_COMPRESSION = NONE)
CREATE UNIQUE NONCLUSTERED INDEX UX_Dim4 ON dbo.Dim4 ([Key]) WITH (FILLFACTOR = 100, DATA_COMPRESSION = NONE)
CREATE UNIQUE NONCLUSTERED INDEX UX_Dim5 ON dbo.Dim5 ([Key]) WITH (FILLFACTOR = 100, DATA_COMPRESSION = NONE)
CREATE UNIQUE NONCLUSTERED INDEX UX_Dim6 ON dbo.Dim6 ([Key]) WITH (FILLFACTOR = 100, DATA_COMPRESSION = NONE)
CREATE UNIQUE NONCLUSTERED INDEX UX_Dim7 ON dbo.Dim7 ([Key]) WITH (FILLFACTOR = 100, DATA_COMPRESSION = NONE)
CREATE UNIQUE NONCLUSTERED INDEX UX_Dim8 ON dbo.Dim8 ([Key]) WITH (FILLFACTOR = 100, DATA_COMPRESSION = NONE)
CREATE UNIQUE NONCLUSTERED INDEX UX_Dim9 ON dbo.Dim9 ([Key]) WITH (FILLFACTOR = 100, DATA_COMPRESSION = NONE)
CREATE UNIQUE NONCLUSTERED INDEX UX_Dim10 ON dbo.Dim10 ([Key]) WITH (FILLFACTOR = 100, DATA_COMPRESSION = NONE)

Generate sample data

insert dbo.Dim1([Key]) select top 100 left(newid(), 10) from master.dbo.spt_values
insert dbo.Dim2([Key]) select top 100 left(newid(), 10) from master.dbo.spt_values
insert dbo.Dim3([Key]) select top 100 left(newid(), 10) from master.dbo.spt_values
insert dbo.Dim4([Key]) select top 100 left(newid(), 10) from master.dbo.spt_values
insert dbo.Dim5([Key]) select top 100 left(newid(), 10) from master.dbo.spt_values
insert dbo.Dim6([Key]) select top 100 left(newid(), 10) from master.dbo.spt_values
insert dbo.Dim7([Key]) select top 100 left(newid(), 10) from master.dbo.spt_values
insert dbo.Dim8([Key]) select top 1000 left(newid(), 10) from master.dbo.spt_values
insert dbo.Dim9([Key]) select top 1000 left(newid(), 10) from master.dbo.spt_values
insert dbo.Dim10([Key]) select top 10000 left(newid(), 10) from master.dbo.spt_values t1 cross join master.dbo.spt_values t2

-- Populate source table

;with cte as
(
select
t2.number * 1000 + t1.number rownum
,abs(checksum(newid())) % 100 + 1 randnum1
,abs(checksum(newid())) % 100 + 1 randnum2
,abs(checksum(newid())) % 100 + 1 randnum3
,abs(checksum(newid())) % 100 + 1 randnum4
,abs(checksum(newid())) % 100 + 1 randnum5
,abs(checksum(newid())) % 100 + 1 randnum6
,abs(checksum(newid())) % 100 + 1 randnum7
,abs(checksum(newid())) % 1000 + 1 randnum8
,abs(checksum(newid())) % 1000 + 1 randnum9
,abs(checksum(newid())) % 10000 + 1 randnum10
from master.dbo.spt_values t1
cross join master.dbo.spt_values t2
where t1.number < 1000 and t2.number < 1000 and t1.type = 'P' and t2.type = 'P'
)
insert into dbo.source
with (tablock)
(measure1, measure2, measure3, dim1_key, dim2_key, dim3_key, dim4_key, dim5_key, dim6_key, dim7_key, dim8_key, dim9_key, dim10_key)
select
checksum(newid()) * 0.000001
,checksum(newid()) % 1000
,checksum(newid()) % 10000
,Dim1.[Key]
,Dim2.[Key]
,Dim3.[Key]
,Dim4.[Key]
,Dim5.[Key]
,Dim6.[Key]
,Dim7.[Key]
,Dim8.[Key]
,Dim9.[Key]
,Dim10.[Key]
from cte
cross apply (select [Key] from dbo.Dim1 where [id] = cte.randnum1) Dim1
cross apply (select [Key] from dbo.Dim2 where [id] = cte.randnum2) Dim2
cross apply (select [Key] from dbo.Dim3 where [id] = cte.randnum3) Dim3
cross apply (select [Key] from dbo.Dim4 where [id] = cte.randnum4) Dim4
cross apply (select [Key] from dbo.Dim5 where [id] = cte.randnum5) Dim5
cross apply (select [Key] from dbo.Dim6 where [id] = cte.randnum6) Dim6
cross apply (select [Key] from dbo.Dim7 where [id] = cte.randnum7) Dim7
cross apply (select [Key] from dbo.Dim8 where [id] = cte.randnum8) Dim8
cross apply (select [Key] from dbo.Dim9 where [id] = cte.randnum9) Dim9
cross apply (select [Key] from dbo.Dim10 where [id] = cte.randnum10) Dim10

Load the fact table

set nocount on
truncate table dbo.fact

-- Run with cold cache
dbcc dropcleanbuffers

-- Disable foreign keys
alter table dbo.fact nocheck constraint all

set statistics io on
set statistics time on

INSERT INTO [dbo].[Fact]
with (tablock)
([Measure1],[Measure2],[Measure3],[Dim1_id],[Dim2_id],[Dim3_id],[Dim4_id],[Dim5_id],[Dim6_id],[Dim7_id],[Dim8_id],[Dim9_id],[Dim10_id])
select
measure1
,measure2
,measure3
,dim1.[id]
,dim2.[id]
,dim3.[id]
,dim4.[id]
,dim5.[id]
,dim6.[id]
,dim7.[id]
,dim8.[id]
,dim9.[id]
,dim10.[id]
from dbo.Source
left outer join dbo.dim1 on Source.Dim1_Key = dim1.[Key]
left outer join dbo.dim2 on Source.Dim2_Key = dim2.[Key]
left outer join dbo.dim3 on Source.Dim3_Key = dim3.[Key]
left outer join dbo.dim4 on Source.Dim4_Key = dim4.[Key]
left outer join dbo.dim5 on Source.Dim5_Key = dim5.[Key]
left outer join dbo.dim6 on Source.Dim6_Key = dim6.[Key]
left outer join dbo.dim7 on Source.Dim7_Key = dim7.[Key]
left outer join dbo.dim8 on Source.Dim8_Key = dim8.[Key]
left outer join dbo.dim9 on Source.Dim9_Key = dim9.[Key]
left outer join dbo.dim10 on Source.Dim10_Key = dim10.[Key]

-- Enable foreign keys
ALTER TABLE dbo.fact WITH CHECK CHECK CONSTRAINT all

set statistics io off
set statistics time off


5 responses to “SSIS vs T-SQL – which one is fastest for ETL tasks?”

  1. Pingback: When to use T-SQL or SSIS for ETL | James Serra's Blog

  2. Excellent article! It covered several myths that I wanted to sit down and benchmark myself. I have always preferred T-SQL as the technology to transform and load, and use SSIS for the transfer of data from one system to another’s staging environment then utilize it for the choreography of T-SQL execution.

  3. josh

    Thank you very much for this great article. I’m doing some research as we will build a new Data Warehouse and had a consultant here who says that SSIS is a bad tool and doing everything in the database is the way to go because it’s much more stable and faster. My guess is that he’s just more familiar with SQL than he is with SSIS…
    He wants to use a linked server instead of SSIS to connect to our oracle database and so on. I think I’ll have to be careful :-)
    Your article was a great help, I’ll do some tests before we start.

  4. Thanks for your comment. There are definitely cases where SQL has advantages over SSIS, but having an incompetent consultant is not a good reason for avoiding SSIS.

    Linked Servers are something I use very seldom, and they really are far slower than SSIS.

  5. Harry

    The next version of SQL Server will allow you to create a purely in-memory table which can be used for your initial target load table (staging); there you can massage/transform the data all you like before merging/loading it into a dimension or fact table. You can read about the Hekaton technology. Even though it is aimed at OLTP work, it can definitely be used for ELT work (massaging data in memory before loading it to target dim/fact tables).

    Linked server queries are a big NO-NO. Your whole database will get locked up until the linked-server query is done executing, and you also cannot take a backup of the database while it is sending a linked query out to another server. It really, truly sucks for large bulk operations. I speak from experience twice over.
