The data-loss scenario I described in the DevX article "SQL Server 2005 Bug Alert: Merge Replication Could Mean Data Loss" (published November 2, 2007) is mainly due to the difference in SQL Server 2005's replication behavior when a publication database uses partition groups versus when the publisher is SQL Server 2000 Enterprise Edition or when the publication doesn't use partition groups. In the publication script example from that article, the parameter responsible for this change in behavior is @use_partition_groups. First introduced in SQL Server 2005, this parameter is meant to optimize replication by using partitions, but two existing bugs related to it opened the door to data loss.
The previous article suggested one solution: stop using partition groups altogether, because this data loss was an inherent behavior of the partition groups design. However, setting @use_partition_groups to false led to a different data-loss scenario, which finally prompted Microsoft to open an active bug (#442076). With this new information, I can explain exactly which SQL Server 2005 setting is causing the data loss.
In this addendum to the previous article, I identify the stored procedures that are responsible for the bad code and the two data-loss scenarios explained in that article (with accompanying scripts for reproducing both cases). The data-loss scenario presented in this article concerns only DBAs or database developers who use merge replication with join filtering, SQL Server 2005 as publisher/distributor, and SQL Server 2005 (including Express edition) as subscribers. If your subscribers are MSDE (Microsoft SQL Server 2000 Desktop Edition), the data-loss reproductions do not apply to you.
In the scripts presented in this article, the @use_partition_groups parameter is set to null, which is the same as enabling the parameter because its default value is true (see Listing 1). The reproduction presented in this article therefore behaves the same as using @use_partition_groups=N'true' in the publication script, and the compatibility level is not important in this case.
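For readers who want to see where the parameter sits, here is a minimal sketch of a publication script that sets it explicitly. It is not the article's Listing 1: the publication name SamplePub is hypothetical, every other sp_addmergepublication parameter is left at its default, and the script assumes it runs in a publication database that is already enabled for merge publishing.

```sql
-- Minimal sketch, not the article's Listing 1.
-- Run at the publisher, in the publication database.
-- N'SamplePub' is a hypothetical publication name.
EXEC sp_addmergepublication
    @publication          = N'SamplePub',
    @use_partition_groups = N'true';  -- per the article, null behaves the same way

-- Passing N'false' instead switches partition groups off, which is the
-- configuration in which bug #442076 is described as occurring.
```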
I've taken it upon myself to write this addendum because Microsoft unfortunately hasn't published the bug anywhere (I received only an e-mail confirmation via tech support), even though the data-loss scenarios it creates are pretty common. My own organization has gone through a difficult period of trying to recover both lost data and the confidence of our customers.
Anatomy of the Bug
Per the e-mail I received via tech support, the description of bug number 442076 is as follows:
When partition groups is not being used, there is a bug in sp_MSdelsubrowsbatch.
T1 is the parent table for PK-FK relation (has PK).
T2 is another table for the PK-FK relation (has FK).
T2 is the parent table for the join filters with HOSTNAME filter.
T1 is the child table for join filters.
Deletes in T2 at publisher should be propagated to the subscriber. However, publisher should not lose those changes.
What actually happens is if more than one row are deleted at the publisher, the following sync propagates the deletes to T2 and expands them to T1 at the subscriber. A subsequent merge then deletes the rows in T1 at the publisher leading to data loss.
Join filters are in the reverse order of PK-FK.
Happens only with setupbelongs. Does not repro if partition groups is used.
Happens only when more than one row are deleted.
The reason is that the delete of T2 for more than two rows that expand to deletes in T1 are flagged with a user delete instead of a system delete at the subscriber. Hence, when merge runs for the second time, it enumerates these deletes from the subscriber and sends them to the publisher.
Note that in this description dynamic filtering is not necessary; the bug occurs with static filters as well. It does not occur with MSDE subscribers, however. This active bug has not yet been published in the Knowledge Base, but Microsoft plans to fix it in Service Pack 3. SQL Server 2008 also inherited the bug, but apparently it will be fixed in the final release.
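The table and filter layout in the bug description can be sketched in T-SQL as follows. This is an illustration under stated assumptions, not the reproduction script that accompanies the article: the column names, the HostName column, the filter name T1_T2, and the publication name SamplePub are all hypothetical, and the publication is assumed to exist already with partition groups turned off.

```sql
-- Illustrative sketch only; all object and column names are hypothetical.
-- Assumes a merge publication N'SamplePub' already exists with
-- @use_partition_groups = N'false'.

-- T1 holds the primary key; T2 references it through a foreign key.
CREATE TABLE dbo.T1 (
    T1_ID int NOT NULL CONSTRAINT PK_T1 PRIMARY KEY
);

CREATE TABLE dbo.T2 (
    T2_ID    int           NOT NULL CONSTRAINT PK_T2 PRIMARY KEY,
    T1_ID    int           NOT NULL CONSTRAINT FK_T2_T1 REFERENCES dbo.T1 (T1_ID),
    HostName nvarchar(128) NOT NULL
);
GO

-- T2 is the parent of the join filter and carries the HOSTNAME filter.
EXEC sp_addmergearticle
    @publication         = N'SamplePub',
    @article             = N'T2',
    @source_owner        = N'dbo',
    @source_object       = N'T2',
    @subset_filterclause = N'[HostName] = HOST_NAME()';

-- T1 is published without a row filter of its own.
EXEC sp_addmergearticle
    @publication   = N'SamplePub',
    @article       = N'T1',
    @source_owner  = N'dbo',
    @source_object = N'T1';

-- T1 is the child of the join filter: the reverse of the PK-FK direction.
EXEC sp_addmergefilter
    @publication       = N'SamplePub',
    @article           = N'T1',
    @filtername        = N'T1_T2',
    @join_articlename  = N'T2',
    @join_filterclause = N'[T1].[T1_ID] = [T2].[T1_ID]',
    @join_unique_key   = 0,   -- T2.T1_ID is not unique in T2
    @filter_type       = 1;   -- join filter only
GO

-- The triggering change at the publisher: more than one T2 row is deleted
-- (assumes rows with these keys exist).
DELETE FROM dbo.T2 WHERE T2_ID IN (1, 2);
```

According to the description, the rows that disappear from T1 at the publisher are lost only after the second merge, so anyone checking a setup like this one would need to compare T1 at the publisher before and after two consecutive synchronizations.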