There are times when you may need to carry out UPDATES on large tables that are in use and constantly being inserted, deleted, or updated.
If you carry out a large UPDATE that affects all of the rows in the table, and in my case millions of rows, then the table will be locked for the duration of the update and any other processes that may need to carry out DML statements will be BLOCKED from doing so.
For example, you may experience long delays caused by locking when trying to return data from this table on a website or API service. Or even worse deadlocks but you will most definitely experience performance issues and if any SELECT statements that access the data don't use a WITH (NOLOCK) statement they too will have to wait in line for the UPDATE to finish.
Obviously wrapping WITH (NOLOCK) onto every SELECT statement is not a good solution unless you know what you are doing as it will provide dirty reads and you may end up giving your users old or incorrect data.
This might be fine for some scenarios but in critical applications where data integrity is key then you need another solution that provides data integrity and allows you to UPDATE the table without performance problems.
When I find myself requiring the need to UPDATE every record in a large table I use a BATCH UPDATE process which cuts the large UPDATE statement down into lots of small UPDATES that affect only a few rows at a time.
By doing this the UPDATE rattles through all the tows of even large tables very quickly as long as the batch size of records updated in each loop iteration is small enough not to cause locking that may affect front-end processes.
For example instead of the whole table getting locked for an hour with lots of blocked processes building up behind waiting for it to finish it would instead only be locked for lots of little time periods lasting seconds or less.
These smaller locking periods allow other processes in to do their work and if the batch size is small enough and you have appropriate indexes you might find that you won't experience a full table lock anyway.
There are various methods for carrying out this approach and you should tailor your BATCH SIZE to your own requirements. Before SQL 2005 you could use the: SET NOCOUNT 50 command to set the size of the batch but in SQL 2005 and beyond you can use a variable directly with an UPDATE TOP (@VAR) command.
This is an example of a BATCH UPDATE, that uses a column in the table called LastUpdated, which gets updated every time the row in the table is. You could do this either through stored procedures that update the table, or triggers on insert and update. However because on each loop I update this column it means on the next loop the same records won't get touched as the time is within the 20 minutes I have set as the BATCH to be updated,
Obviously, this must be tailored to your own system, whether you create an "updated" flag column that is defaulted to 0, and then on the BATCH UPDATE set to 1, and the WHERE statement that selects which TOP X records are looked at ignore any that have been set to 1.
You definitely need something to change on UPDATE, otherwise, you will find this process going on forever as there is no way to order the records for that the UPDATE statement so that it could keep getting the same TOP(X) records on each batch, enabling the process to rattle on forever with you scratching your head wondering why.
If you can wangle an ORDER BY statement with a convoluted statement then that might work, however having a simple date or flag that is updated within the batch, and is also checked on each loop so that the same records are not looked at over and over again is the easy answer to this issue.
SET NOCOUNT ON
SET DATEFORMAT YMD
DECLARE @ROWS INT, @TOTALROWS INT, @BATCHSIZE INT
SELECT @ROWS = 1,
@TOTALROWS = 0,
@BATCHSIZE = 50
-- As we start @ROWS at 1 and we know there are thousands of records to update
-- then when it gets to the stage where the UPDATE returns @@rowcount of 0
-- we have finished the criteria of the loop and so exit it the sanity check ensures this
DO WHILE @ROWS > 0
BEGIN
-- Show the time this batch started as it might take hours
PRINT 'Job started at ' + UPPER(FORMAT(GETDATE(),'hh:mm:ss dd/MMM/yyyy'))
-- We must have a way that we don't keep updating the same records over and over again,
-- so I use the LastUpdated date which gets updated on each batch update then checked
-- in the where clause to ensure the date is at least 20 minutes in the future
-- Update data in the table in batches to prevent blocking if we have to do this
-- whilst people are using the site and accessing the table at the same time
UPDATE TOP(@BATCHSIZE) MyTable
SET MyColumn = dbo.udf_SOME_FUNCTION(MyPK),
Flag = 1,
LastUpdated = GETDATE()
WHERE SomeDate > '2023-JAN-01'
AND Flag = 0 -- put in for the update
-- this could be cut out
AND DATEDIFF(MINUTE,LastUpdated,GETDATE())>20
SELECT @ROWS = @@ROWCOUNT, @TOTALROWS = @TOTALROWS + @ROWS
PRINT 'Updated ' + CAST(@ROWS as varchar) + ' in batch'
-- As this UPDATE job may take hours we want other processes
-- to be able to access the tables in case of locks therefore
-- we wait for 2 seconds between each BATCH to allow
-- time for these processes to aquire locks and do their job
WAITFOR DELAY '00:00:02'
-- samity check
IF @ROWS = 0
BREAK
END
PRINT 'Updated ' + CAST(@TOTALROWS as varchar) + ' total rows'
If you really wanted to give the SELECT statements looking at this table as you UPDATE it in batches then you could add in a DELAY within each loop e.g a 2-second DELAY after the UPDATE statement and the SELECT @ROWS = ..... that collects stats for you to look at after the process has finished would just be something like this.
WAITFOR DELAY '00:00:02'
So hopefully this might help you out if you are having to UPDATE large tables that are also in use at the same time by websites, API's, services, or 3rd parties.
© 2023 - By Strictly-Software
3 comments:
With the code u supply
SELECT @ROWS = @@ROWCOUNT, @TOTALROWS = @TOTALROWS + @ROWS
PRINT 'Updated ' + CAST(@ROWS as varchar) + ' in batch'
I would suggest the following changes to your post:
SET @ROWS = @@ROWCOUNT
WAITFOR DELAY '00:00:01'
declare @msg nvarchar(200)
set @msg = 'Updated ' + CAST(@ROWS as varchar) + ' in batch'
RAISERROR (@msg, 0, 1) WITH NOWAIT
The @@ROWCOUNT will only give a count of 1
when used as you have stated, at least in SQL 2008 it will.
So, it's best practice to "SET" the value to @ROWS seperately
not inconjunction with a compound SELECT statement.
That was giving a Val or 1 with SQL 2008
Also, instead of Print, I would use RaiseError.
This will give the immediate results, without
breaking your process...
Hi Carlos
I take on board your comment about RAISEERROR but I think you are wrong about the comment you said re:
"The @@ROWCOUNT will only give a count of 1
when used as you have stated, at least in SQL 2008 it will.
So, it's best practice to "SET" the value to @ROWS seperately
not inconjunction with a compound SELECT statement."
As I have tested this in both SQL 2005 and SQL 2008 and with a PRINT statement within the loop showing a rolling count of both the current batch and total batch you get the correct numbers.
E.G if you edited the code to put a break after 150 rows e.g:
SELECT @ROWS = @@ROWCOUNT, @TOTALROWS = @TOTALROWS + @ROWS
PRINT 'Updated ' + CAST(@ROWS as varchar) + ' in batch'
PRINT 'Updated ' + CAST(@TOTALROWS as varchar) + ' so far'
IF @TOTALROWS > 100
BREAK
Then in both SQL 2005 and 2008 I get the following message output:
Updated 50 in batch
Updated 50 so far
Updated 50 in batch
Updated 100 so far
Updated 50 in batch
Updated 150 so far
Updated 150 total rows
So the joining of the @@ROWCOUNT doesn't have any effect on the variable @TotalRows.
The wait idea is good if you want to allow more room for other processes and have problems running batch updates but so far with proper indexing I haven't found a need for a WAIT statement yet. However it something to think about.
Thanks for commenting.
Hi, thanks I was working at home I didb't have the code to copy so I wrote by hand. Forgot the WHILE LOOP which was the main thing and also the sanity check, the WAITFOR DELAY in case there are locks to give time for other processes to grab them, and the main part of the code so that it didn't repeat the UPDATE on one row. I am a numpty for writing it from memory , the main crux of the article i.e to do BIG UPDATES in batches and not cause locks etc remained true though it was just my memory that was bad. Thw rows being updated were old rows and the flag column was most important of all as it only looks at rows wbere the flag is 0 not 1 which is updated by the code, had to still update this from my home though so hope I got it right this time. Thanks though for making me double check the code I had written against the code Iwrotew out here.
Post a Comment