Any of us who have been responsible for managing transactional replication knows that bad commands are possible and do happen. So, we have a less than perfect solution which requires DBA intervention to correct certain problems. So what do we do?
The Typical Response
Many DBAs see replication as a highly complex beast that is difficult to tackle. The basic operations like creating subscriptions and publications, and re-initializing publications we have down but when it comes to resolving issues with individual records or commands it always seems like it will take longer to fix it then to just redo it.
So that is what a lot of us do, we re-initialize every time there is a problem and hope that the problems are few and far between. This, however, isn’t a great option when you have at least one article in a publication that is large (say, 100 million records or above).
Replication will stop the flow of transactions for all articles in a publication even if only one has any bad commands in the queue. Then if we re-initialize a new snapshot will be created (locking related tables for the duration which could be a long time if your articles are large) and then either the tables are dropped and recreated or all data is deleted to make way for the full snapshot to be transmitted. This can take hours or even days to accomplish with table locks for a good amount of that time holding your applications in a down state.
Method #1 – Rebuild without Sync
The first article to discuss is The Trouble with Transactional Replication and large articles.
In this article Edward explains how large articles can be a time consuming factor in a re-sync of data for replication and how to completely rebuild your individual publications, thus clearing out any bad commands and/or errors, and get replication back up and running in minutes without having to create a new snapshot or re-sync any good data. You will still need to push up the missing records to your subscribers to make up for the initial issue but there are Data Compare tools out there for this (as he mentions).
Method #2 – Monkeying with the System Stored Procedures
In this article, Super Fast Transactional Replication Repair, Edward covers a way to troubleshoot at the article level rather than the publication level.
Let’s say you have a publication with 100 articles in it and there is one table where a delete statement was run on a non-updating subscriber. You now have a ‘row not found on subscriber’ error and all of the articles for that subscriber are now held up until the offending records are resolved.
What Edward covers here, is how to disable the error trapping on the system stored procedures specific to an individual, offending, article so that the others can continue to process. Then you can worry about fixing the individual article without everything else being down, or more importantly, progressively building up a queue which will need to be pushed once everything is fixed.