Reproduce the error:
First, Create two new databases called DBPri and DBSec. DBPri is the Publication database. DBSec is the Subscriber database.
Next, Create a new table called Individuals in each of the databases. (Ensure Subscriber database does not have Identify field for the Primary Key)
Next, Insert a record in DBPri: insert into dbpri.dbo.individuals select 1, 'John', 'Smith' (this gets replicated to your DBSec database)
Next, Delete the record from DBSec: delete from dbsec.dbo.individuals where logid = 1
At this point the two tables are out of synch.
Next, Update that record in DBPri: update dbpri.dbo.individuals set fname = 'Jimmy', lname = 'Smith' where LogID = 1
Since LogID = 1 no longer exists in DBSec, Transactional Replication will fail. Replication Monitor will show an error in the
"Distribution to Subscriber History" tab. First, it will show a status = "Running" with Action Message: "Error executing a batch of commands.
Retrying individual commands." Then it will fail completely with error messsage: "The row was not found at the Subscriber when applying the
replicated command. (Source: MSSQLServer, Error number: 20598)"
At this point no Transactions are being replicated at all from Distributor to Subscriber database. The system is down.
Solutions:
#1. Using Replication Monitor, get the Transaction Sequence Number AND the Command ID.
ex. Command attempted: ...Transaction sequence number: 0x0000003200000B1C000400000000, Command ID: 1)
You should also see this exact error by executing this query in your Distribution database:
select * from distribution.dbo.msrepl_errors order by time desc (key fields: xact_seqno, command_id)
You should also check the how many transactions are in Distribution waiting to be sent to Subscriber. You do this by checking the watermark.
Watermark is the highest LSN in the Subscriber database. So, check see how many LSN's in the Distribution database have a higher LSN and this
indicates the # of transactions waiting in Distribution DB.
select * from distribution.dbo.MSrepl_commands where xact_seqno > (select transaction_timestamp from dbsec.dbo.msreplication_subscriptions)
#2. Using below T-SQL, w/ Transaction Sequence Number as input, locate the actual command is causing the error.
exec distribution.dbo.sp_browsereplcmds '0x0000003200000B1C000400000000', '0x0000003200000B1C000400000000'
ex. {CALL [sp_MSupd_dboIndividuals] (,'Jimmy',,1,0x02)}
At this point, you know the table is Individuals & you know the Primary Key = 1("1" seen in above parameter list) & you know it was trying to update a field to value = Jimmy. (You can confirm the table by executing this query: select * from dbpri.dbo.sysarticles where artid = 1 (artid=1 is from the above query output)) (If there are multiple records returned here, you need to focus on the CommandID from this recordset and the CommandID seen in Replication Monitor)
So, next query the DBSec database and notice this record is missing: select * from dbsec.dbo.individuals where Logid = 1
You could next insert the missing record directly in the Subscriber database(DBSec). Right after doing that the command that was failing would then succeed and replication would resume functioning.
*Notes: There may be other rows that are missing and that have to be manually inserted (or deleted) at the subscriber if there are more failures. If there is a huge backlog of commands you can also use -SkipErrors so Replication can get through the backlog. Once it have it caught up you can stop Distribution Agent and make sure your table(s) is 100% synced. Then remove the -SkipErrors parameter. ex. 20598:2627 Once you have the table(s) synched between the Publication DB and Subscriber DB, stop all Distribution Agents and execute sp_publication_validation and get a row count check and checksum check to confirm you are 100% synced.
Notes:
You can use -SkipErrors in the Distribution Agent Profile to get your replication back up and functioning quickly. Replication would replicate the rest of the changes and skip the command causing the error. This requires a start & stop of the Distribution Agent Job. That is a good short-term fix IF you needed Transactional Replication running as soon as possible. After one million of these errors "The row was not found at the Subscriber when applying the replicated command" seen in the system, Transactional Replication will stop functioning. Make sure you know exactly what the effects are BEFORE using this parameter.
Resources:
How to troubleshoot error 20598 "The row was not found at the Subscriber when applying the replicated command"
Misc:
SQL Job check on error "The row was not found at the Subscriber when applying the replicated command"
Use Distribution go
Declare @RepErrorsNew int --check last 5 mins; sql job executes every 5 mins
Select @RepErrorsNew = (select top 1 ID from Distribution.dbo.MSrepl_errors with (nolock) where error_text like '%The row was not found at the Subscriber when applying the replicated command.%' and [time] > dateadd(mi,-5,getdate()) order by [time] desc)
If @RepErrorsNew IS NOT NULL and @RepErrorsNew > 0 Begin RAISERROR('Transactional Replication Error Alert - Row Not Found!', 16, 1) End
|
|
|
|
|