There are times when you may need to carry out UPDATES on large tables that are in use, with rows constantly being inserted, deleted, or updated.
If you carry out a large UPDATE that affects all of the rows in the table (in my case millions of rows), the table will be locked for the duration of the update, and any other processes that need to carry out DML statements will be BLOCKED until it finishes.
For example, you may experience long delays caused by locking when trying to return data from this table on a website or API service, or even worse, deadlocks. You will most definitely experience performance issues, and any SELECT statements that access the data without a WITH (NOLOCK) hint will have to wait in line for the UPDATE to finish.
Obviously, adding WITH (NOLOCK) to every SELECT statement is not a good solution unless you know what you are doing, as it allows dirty reads and you may end up giving your users old or incorrect data.
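For reference, that hint just gets tacked onto the table name in the FROM clause, something like this (using the same placeholder table and columns as the script further down):

SELECT MyPK, MyColumn
FROM MyTable WITH (NOLOCK)
WHERE SomeDate > '2023-JAN-01'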
This might be fine for some scenarios, but in critical applications where data integrity is key, you need another approach: one that preserves integrity and still allows you to UPDATE the table without causing performance problems.
When I find myself needing to UPDATE every record in a large table, I use a BATCH UPDATE process which cuts the large UPDATE statement down into lots of small UPDATES that each affect only a few rows at a time.
By doing this the UPDATE rattles through all the rows of even large tables very quickly, as long as the batch size of records updated in each loop iteration is small enough not to cause locking that affects front-end processes.
For example, instead of the whole table being locked for an hour with lots of blocked processes building up behind it waiting for it to finish, it is only locked for many short periods lasting seconds or less.
These smaller locking periods allow other processes in to do their work, and if the batch size is small enough and you have appropriate indexes, you might find that you don't experience a full table lock at all.
There are various methods for carrying out this approach, and you should tailor your BATCH SIZE to your own requirements. Before SQL 2005 you could use the SET ROWCOUNT 50 command to limit the number of rows a statement affected and therefore set the size of the batch, but in SQL 2005 and beyond you can use a variable directly with an UPDATE TOP (@VAR) command.
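As a rough sketch of the difference between the two approaches (using the same placeholder table and columns as the main script below):

-- SQL 2000 and earlier: limit the rows an UPDATE affects with SET ROWCOUNT
SET ROWCOUNT 50
UPDATE MyTable SET Flag = 1 WHERE Flag = 0
SET ROWCOUNT 0 -- always reset it afterwards

-- SQL 2005 and beyond: pass the batch size in with TOP, which accepts a variable
DECLARE @BATCHSIZE INT
SET @BATCHSIZE = 50
UPDATE TOP (@BATCHSIZE) MyTable SET Flag = 1 WHERE Flag = 0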
Below is an example of a BATCH UPDATE that uses a column in the table called LastUpdated, which gets updated every time the row is modified. You could maintain this column either through the stored procedures that update the table, or through triggers on insert and update. Because each loop also updates this column, the same records won't get touched on the next loop, as their LastUpdated time will now fall within the 20-minute window I have set for the batch.
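If you went down the trigger route, a minimal sketch might look something like this (the trigger name and the MyPK key column are just assumptions to match the example table below):

CREATE TRIGGER trg_MyTable_LastUpdated
ON MyTable
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON
    -- Stamp LastUpdated on any row touched by an insert or update.
    -- Direct trigger recursion is off by default, so this won't loop.
    UPDATE t
    SET LastUpdated = GETDATE()
    FROM MyTable t
    INNER JOIN inserted i ON i.MyPK = t.MyPK
END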
Obviously, this must be tailored to your own system. For example, you could create an "updated" flag column that defaults to 0 and is set to 1 by the BATCH UPDATE, with the WHERE clause that selects which TOP X records are touched ignoring any rows already set to 1.
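A flag column like the one used in the script below could be added with something as simple as this (the constraint name is just an example):

ALTER TABLE MyTable
ADD Flag BIT NOT NULL CONSTRAINT DF_MyTable_Flag DEFAULT (0)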
You definitely need something to change on each UPDATE, otherwise you will find the process going on forever: as there is no way to order the records for the UPDATE statement, it could keep picking up the same TOP(X) records on every batch, letting the loop rattle on indefinitely while you scratch your head wondering why.
If you can wangle in an ORDER BY with a more convoluted statement then that might work (see the CTE sketch after the main script below), however having a simple date or flag that is updated within the batch, and checked on each loop so that the same records are not looked at over and over again, is the easy answer to this issue.
SET NOCOUNT ON
SET DATEFORMAT YMD

DECLARE @ROWS INT, @TOTALROWS INT, @BATCHSIZE INT

SELECT @ROWS = 1,
       @TOTALROWS = 0,
       @BATCHSIZE = 50

-- As we start @ROWS at 1 and we know there are thousands of records to update,
-- the loop keeps running until an UPDATE returns an @@ROWCOUNT of 0, at which
-- point the loop condition fails and we exit; the sanity check below ensures this
WHILE @ROWS > 0
BEGIN
    -- Show the time this batch started as the whole job might take hours
    PRINT 'Batch started at ' + UPPER(FORMAT(GETDATE(),'HH:mm:ss dd/MMM/yyyy'))

    -- We must have a way to avoid updating the same records over and over again,
    -- so I use the LastUpdated date, which gets set on each batch update and then
    -- checked in the WHERE clause to ensure it is at least 20 minutes in the past.
    -- Update the data in the table in batches to prevent blocking, as we have to
    -- do this whilst people are using the site and accessing the table at the same time
    UPDATE TOP(@BATCHSIZE) MyTable
    SET MyColumn = dbo.udf_SOME_FUNCTION(MyPK),
        Flag = 1,
        LastUpdated = GETDATE()
    WHERE SomeDate > '2023-JAN-01'
    AND Flag = 0 -- only touch rows not already done; this check could be cut out
    AND DATEDIFF(MINUTE,LastUpdated,GETDATE()) > 20

    SELECT @ROWS = @@ROWCOUNT, @TOTALROWS = @TOTALROWS + @ROWS

    PRINT 'Updated ' + CAST(@ROWS as varchar) + ' rows in batch'

    -- As this UPDATE job may take hours we want other processes
    -- to be able to access the table in between batches, therefore
    -- we wait for 2 seconds between each BATCH to allow
    -- time for those processes to acquire locks and do their work
    WAITFOR DELAY '00:00:02'

    -- sanity check
    IF @ROWS = 0
        BREAK
END

PRINT 'Updated ' + CAST(@TOTALROWS as varchar) + ' total rows'
If you really wanted to give the SELECT statements looking at this table a chance to get in whilst you UPDATE it in batches, you can add a DELAY within each loop, e.g. a 2-second DELAY after the UPDATE statement and the SELECT @ROWS = ..... that collects stats for you to look at after the process has finished. It would just be something like this.
WAITFOR DELAY '00:00:02'
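If you do need control over which rows each batch picks up and can't rely on a date or flag column, one option (just a sketch, not part of the script above) is to drive the UPDATE through a CTE with TOP and ORDER BY, since UPDATE TOP() on its own does not support ORDER BY:

DECLARE @BATCHSIZE INT
SET @BATCHSIZE = 50

;WITH NextBatch AS
(
    SELECT TOP (@BATCHSIZE) MyPK, MyColumn, Flag, LastUpdated
    FROM MyTable
    WHERE SomeDate > '2023-JAN-01'
    AND Flag = 0
    ORDER BY MyPK -- lowest keys first so each batch moves forward through the table
)
UPDATE NextBatch
SET MyColumn = dbo.udf_SOME_FUNCTION(MyPK),
    Flag = 1,
    LastUpdated = GETDATE()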
So hopefully this might help you out if you are having to UPDATE large tables that are also in use at the same time by websites, APIs, services, or 3rd parties.
© 2023 - By Strictly-Software