Today I noticed the data drive of one of my Production servers growing. It wasn’t growing overly fast and it had been quite some time since the last SAN drive expansion so I gave the thumbs up to my SAN techs to add 5GBs of space. I started with 5GBs just out of a hunch even though I knew in a couple weeks I’d be giving the green light to add more space again.
It turned out that adding 5GBs at first was a good idea because what I assumed with normal data growth turned out to be a ballooning distribution database. When I noticed my disk space alert again in the same day is when I got wise because this server should not grow 5 GBs in a day.
After a quick look it was easy to notice the distribution database was the issue with it being 48GBs which is far above normal for the publications that it handles.
I went ahead and pulled up my friend, the replication monitor, and noticed a red X over the publisher with the ballooned distribution database. Upon further investigation I noticed 5 publications were red and the offending subscription in each was for the same subscriber (our source database for our data warehouse which is used to perform warehouse tuning operations separate from the transactional data).
I pulled up the subscriber and noticed that it has no subscriptions. What apparently happened was, one of my DBAs had done a restore without using the KEEP_REPLICATION flag. This restored the database who was a subscriber but broke the transfer. Since the source still had the subscription registered in it all the transactions for the last 72 hours were being held and not purged because the default retention period for a distribution database is 72 hours.
So what I ended up doing was first dropping all of the subscriptions that were broken and then recreating them. This fixed the root cause but my distribution database was still large and had 0% free space to shrink.
What I had to do next was hit up the Google. I knew that the job ‘Distribution clean up: distribution’ was running and would eventually clean up my database but when I have a problem I like to stick to it until fixed and this job was running slowly. That is when I found clsilva’s post Replication: Distribution Cleanup and Paul Ibison’s post How to Massively Optimise Transactional Replication.
The command that clsilva talks about, EXEC dbo.sp_MSdistribution_cleanup @min_distretention = 0, @max_distretention = 72, is actually exactly what the job, Distribution clean up: distribution is running that is slow.
This process is slow because it is trying to purge a row count, which in my case, is very large. I ended up stopping this job and running the commands myself in steps as recommended by clsilva. I started with a 2 hour block and found that I was deleting records at a 2 row per second rate. This was sufficient evidence that to me that the process was running successfully but was simply very slow. When you have a working system stored procedure I never dive down deeper and try to tweak things myself. So what I ended doing was executed a script that ran this command decrementing the max and min values by 10 hours a piece in separate batches. The only purpose of this was so that I could have a 10 hour incremented print out in SSMS that would reassure me and give me a pseudo-progress bar.
Upon completion of my script I shrunk my distribution database down to its normal size and went on with my day.