TechDevOps.com



Huge Transaction Log File with an Always-On Availability Group Database in SQL Server 2016 STD
by BF (Senior MSSQL Operations - Enterprise, Cloud, Strategy)
2017-12-11








Description:

Working within a SQL Server 2016 Standard Always-On environment, I was faced with a very busy OLTP database whose transaction log file had reached 700 GB and was still growing. The database was in a two-node Availability Group in synchronous-commit mode. I needed to reduce the log file size as soon as possible, before it consumed all the drive space.


Issues with a large transaction log file in Always-On:
- The transaction log file on the Primary Replica will continue to grow, and so will the log on the Secondary Replica.
- Transaction log file growth (adding VLFs) is itself a logged operation and generates log records. Those records are sent to the Secondary Replica and, when replayed by REDO, cause the secondary's log file to grow as well.
- Crash recovery reads transaction log records from the start of the active portion of the transaction log. Recovery can therefore take a long time when the active portion is large, delaying the point at which the database is brought back online.



Analysis:

Check sys.databases log_reuse_wait_desc, the SQL Server:Database Replica > Redo Bytes Remaining counter, and the SQL Server:Databases > Log Bytes Flushed/sec counter.

log_reuse_wait_desc:
SELECT log_reuse_wait_desc, * FROM sys.databases WHERE name = 'xyz'
AVAILABILITY_REPLICA
This value shows up when your Primary Replica is waiting for a Secondary Replica to finish processing (REDO) log records.

Log Send Queue and Redo Queue Size:
SELECT ar.replica_server_name, drs.truncation_lsn, drs.log_send_queue_size, drs.redo_queue_size
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar ON drs.replica_id = ar.replica_id
WHERE drs.is_local = 0

SQL Server:Database Replica > Redo Bytes Remaining:
On the Secondary Replica, monitor the SQL Server:Database Replica > Redo Bytes Remaining counter. If this counter keeps increasing, the secondary replica cannot process log records fast enough for REDO, which forces the primary replica to wait and prevents SQL Server from truncating the log on the primary.

SQL Server:Databases > Log Bytes Flushed/sec:
On the Primary Replica, monitor the SQL Server:Databases > Log Bytes Flushed/sec counter and compare it against your baselines. This tells you the volume of log records being generated by transactions on the database.
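Both counters can also be read from T-SQL via the sys.dm_os_performance_counters DMV instead of PerfMon; a minimal sketch, where 'xyz' stands in for the database name:

```sql
-- Redo Bytes Remaining: run on the Secondary Replica.
-- Log Bytes Flushed/sec: run on the Primary Replica.
SELECT object_name, counter_name, instance_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name IN ('Redo Bytes Remaining', 'Log Bytes Flushed/sec')
  AND instance_name = 'xyz';
```

Note that Log Bytes Flushed/sec is a per-second counter, so sample cntr_value twice and compute the delta over the sampling interval to get a rate.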

Check Transaction Log Space (Size and Used)
DBCC SQLPERF(LOGSPACE)

Check # of VLFs:
DBCC LOGINFO
Typically Status = 2 means the VLF is active and cannot be reused/shrunk. With Always-On that changes: even if Status = 2, when log_reuse_wait_desc shows LOG_BACKUP or NOTHING the log file can be shrunk. You may need to execute the log backup and DBCC SHRINKFILE a few times to get the file down to a smaller size.
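To get a quick count of total vs. active VLFs, the DBCC LOGINFO output can be captured into a temp table; a sketch, assuming the eight-column output shape used by SQL Server 2012-2016 (run in the context of the database you are checking):

```sql
-- Column layout matches DBCC LOGINFO on SQL Server 2012-2016
-- (RecoveryUnitId was added in 2012).
CREATE TABLE #loginfo (
    RecoveryUnitId INT, FileId INT, FileSize BIGINT, StartOffset BIGINT,
    FSeqNo BIGINT, [Status] INT, Parity INT, CreateLSN NUMERIC(38,0));
INSERT INTO #loginfo EXEC ('DBCC LOGINFO');

SELECT COUNT(*) AS total_vlfs,
       SUM(CASE WHEN [Status] = 2 THEN 1 ELSE 0 END) AS active_vlfs
FROM #loginfo;

DROP TABLE #loginfo;
```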

Check for any Open Transactions:
DBCC OPENTRAN

Check Database Activity:
SELECT * FROM sysprocesses WHERE dbid = x
sp_who2



Solution:

After all the checks, the issue was that the Secondary Node was not catching up with the Primary Node on the log block replay (REDO). At this point we had two options. The first was to wait and see whether the REDO process would catch up. The second was to remove the secondary database from the Availability Group and reconfigure the AG. I chose the second option, as I don't like to wait much.


Option 1:

Currently the log_reuse_wait_desc status of the database Resload is "AVAILABILITY_REPLICA". This status occurs when REDO of the log records on the secondary replica is slow, when the secondary replica is not connected to the primary replica, or when the primary/secondary replicas are slow to send or redo transactions because of performance or disk issues.

In this scenario we can wait for all the transactions to complete and see whether the problem resolves itself, while checking server performance to verify progress. The risk is that the log file keeps growing until the disk fills up, which could take the application down.

Option 2:

We can remove the database from the AG, shrink the log file, and re-add it to the AG, making sure the disk does not get full in the meantime.

Option 2 - Steps:

On the Secondary Replica, expand the Availability Group > Availability Databases, then Suspend Data Movement and remove the secondary database from the AG.

On the Secondary Replica, expand the Databases section and notice the database has now gone into Restoring / Not Synchronizing mode. Delete the database on the Secondary Replica.

On the Primary Replica, expand the Availability Group > Availability Databases and remove the database from the Availability Group (confirm it is removed from the secondary as well).

On the Primary Replica, execute SELECT name, log_reuse_wait_desc FROM sys.databases and check the output.
- If the status shows NOTHING, go ahead and shrink the log file.
- If the status shows LOG_BACKUP, take a log backup of the database first; the log file can only be shrunk once the status shows NOTHING. Keep repeating the log backup and shrink cycle until the log file size comes down.
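The backup-and-shrink cycle above can be sketched in T-SQL as follows, with 'xyz', the logical log file name, and the backup path all being placeholders for your environment:

```sql
USE [xyz];

-- Repeat this pair until log_reuse_wait_desc shows NOTHING
-- and the file reaches the target size.
BACKUP LOG [xyz] TO DISK = N'X:\Backups\xyz_log.trn';

-- Logical log file name comes from:
--   SELECT name FROM sys.database_files WHERE type_desc = 'LOG'
DBCC SHRINKFILE (N'xyz_log', 1000);   -- target size in MB

SELECT name, log_reuse_wait_desc FROM sys.databases WHERE name = 'xyz';
```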

Once the transaction log file is shrunk to a small size (1000 MB in this case), go ahead and prepare to add the database back to the Availability Group:
- Primary Replica: take a full database backup, copy it to the Secondary Replica, and restore the backup with NORECOVERY.
- Primary Replica: take a log backup, copy it to the Secondary Replica, and restore the log with NORECOVERY.
- Add the database back to the AG using the AG Wizard with the Join Only option.
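The same steps can be done in T-SQL instead of the wizard; a sketch, where the database name 'xyz', the AG name 'MyAG', and the backup paths are placeholders:

```sql
-- On the Primary Replica:
BACKUP DATABASE [xyz] TO DISK = N'X:\Backups\xyz_full.bak';
BACKUP LOG [xyz] TO DISK = N'X:\Backups\xyz_log.trn';

-- On the Secondary Replica (after copying the backup files over):
RESTORE DATABASE [xyz] FROM DISK = N'X:\Backups\xyz_full.bak' WITH NORECOVERY;
RESTORE LOG [xyz] FROM DISK = N'X:\Backups\xyz_log.trn' WITH NORECOVERY;

-- On the Primary Replica: add the database to the AG.
ALTER AVAILABILITY GROUP [MyAG] ADD DATABASE [xyz];

-- On the Secondary Replica: join the restored database (the "join only" step).
ALTER DATABASE [xyz] SET HADR AVAILABILITY GROUP = [MyAG];
```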


UPDATE: This issue occurred again, and a restart of the Node 2 SQL Server services fixed the transaction log issue:
1. Restart the Node 2 SQL Server services. Node 1 log_reuse_wait_desc went to NOTHING.
2. Shrink the Node 1 transaction log. Node 1 log_reuse_wait_desc went to LOG_BACKUP.
3. Back up the log on Node 1.
4. Shrink the log on Node 1.
The log size is now reduced.


Resources:

The transaction log for database is full due to AVAILABILITY_REPLICA