UPDATED SQL Server 2005 Bug Alert: Data Loss in Merge Replication

The data-loss scenario I described in the DevX article SQL Server 2005 Bug Alert: Merge Replication Could Mean Data Loss (published November 2, 2007) stems mainly from how SQL Server 2005’s replication behaves when a publication uses partition groups, compared with when the publisher is SQL Server 2000 Enterprise edition or the publication doesn’t use partition groups. In the publication script example from that article, the parameter responsible for this change in behavior is @use_partition_groups. First introduced in SQL Server 2005, this parameter is meant to optimize replication by using partitions, but two existing bugs related to it opened the door to data loss.

The previous article suggested that you stop using partition groups altogether, because the data loss appeared to be inherent in the design of partition groups. However, setting @use_partition_groups to false led to a different data-loss scenario, which finally prompted Microsoft to open an active bug (#442076). With this new information, I can explain exactly which SQL Server 2005 setting is causing the data loss.

In this addendum to the previous article, I identify the stored procedures that are responsible for the bad code and the two data-loss scenarios explained in that article (with accompanying scripts for reproducing both cases). The data-loss scenario presented in this article concerns only DBAs or database developers who use merge replication with join filtering, SQL Server 2005 as publisher/distributor, and SQL Server 2005 (including Express edition) as subscribers. If your subscribers are MSDE (Microsoft SQL Server 2000 Desktop Edition), you should not worry about the data-loss reproductions.

In the scripts presented in this article, the @use_partition_groups parameter is set to null (which is the same as enabling the parameter), and its default value is true (see Listing 1). The reproduction presented in this article is the same as using @use_partition_groups=N'true' in the publication script, and the compatibility level is not important in this case.
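Because a null value falls back to the default, it can be worth confirming what a publication actually ended up with. The following is a minimal sketch, not part of the original scripts: the database name PublicationDb and the publication name TestMergePub are hypothetical placeholders.

```sql
-- Run at the publisher.
-- PublicationDb and TestMergePub are hypothetical names; substitute your own.
USE PublicationDb;
GO

-- Returns one row of properties for the named merge publication.
-- The columns of interest here are use_partition_groups,
-- keep_partition_changes, and backward_comp_level.
EXEC sp_helpmergepublication @publication = N'TestMergePub';
```

Checking the stored value this way avoids having to infer the setting from the publication script alone.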

I’ve taken it upon myself to write this addendum because Microsoft unfortunately hasn’t published the bug anywhere (I just received e-mail confirmation via tech support), even though the data-loss scenarios it creates are pretty common. My own organization has gone through a difficult period of trying to recover lost data and the confidence of our customers.

Anatomy of the Bug
Per the e-mail I received via tech support, the description of bug number 442076 is as follows:

When partition groups is not being used, there is a bug in sp_MSdelsubrowsbatch.

T1 is the parent table for PK-FK relation (has PK).
T2 is another table for the PK-FK relation (has FK).
T2 is the parent table for the join filters with HOSTNAME filter.
T1 is the child table for join filters.

Deletes in T2 at publisher should be propagated to the subscriber. However, publisher should not lose those changes.
What actually happens is if more than one row are deleted at the publisher, the following sync propagates the deletes to T2 and expands them to T1 at the subscriber. A subsequent merge then deletes the rows in T1 at the publisher leading to data loss.

Join filters are in the reverse order of PK-FK.
Happens only with setupbelongs. Does not repro if partition groups is used.
Happens only when more than one row are deleted.

The reason is that the delete of T2 for more than two rows that expand to deletes in T1 are flagged with a user delete instead of a system delete at the subscriber. Hence, when merge runs for the second time, it enumerates these deletes from the subscriber and sends them to the publisher.

Note that in this description dynamic filtering is not necessary; the bug occurs with static filters as well. It does not occur with MSDE subscribers, however. This active bug is not published in Knowledge Base yet, but Microsoft plans to fix it in Service Pack 3. It was also inherited by SQL Server 2008, but apparently it will be fixed in the final release.
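To make the shape in the bug description concrete, here is a minimal sketch of such a pair of tables and filters. It is only an illustration under stated assumptions: the column names, the HostName filter column, and the publication name TestMergePub are hypothetical, and the publication is assumed to exist already. The point is that the join filter runs from T2 to T1, the reverse of the foreign key.

```sql
-- Hypothetical schema matching the T1/T2 description above.
CREATE TABLE dbo.T1 (
    T1Id    int NOT NULL PRIMARY KEY,            -- PK side of the relation
    Payload nvarchar(100) NULL
);

CREATE TABLE dbo.T2 (
    T2Id     int NOT NULL PRIMARY KEY,
    T1Id     int NOT NULL REFERENCES dbo.T1 (T1Id),  -- FK: T2 -> T1
    HostName nvarchar(128) NOT NULL                  -- compared with HOST_NAME()
);
GO

-- T2 carries the HOST_NAME() row filter, so it is the parent of the join filter.
EXEC sp_addmergearticle
    @publication         = N'TestMergePub',   -- hypothetical publication name
    @article             = N'T2',
    @source_owner        = N'dbo',
    @source_object       = N'T2',
    @subset_filterclause = N'[HostName] = HOST_NAME()';

EXEC sp_addmergearticle
    @publication   = N'TestMergePub',
    @article       = N'T1',
    @source_owner  = N'dbo',
    @source_object = N'T1';

-- Join filter: T1 (child) receives only the rows referenced by T2 (parent).
EXEC sp_addmergefilter
    @publication       = N'TestMergePub',
    @article           = N'T1',
    @filtername        = N'T1_T2',
    @join_articlename  = N'T2',
    @join_filterclause = N'[T1].[T1Id] = [T2].[T1Id]',
    @join_unique_key   = 0;   -- T2.T1Id is not unique in the parent article
```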

Reproducing the Data-Loss Scenario
This section explains how to reproduce two data-loss scenarios: Repro 1 and Repro 2. Repro 1 addresses the scenario described in the previous section (bug number 442076). Repro 2 addresses the data loss that occurs on a table where the join filters are in the direct order of PK-FK (not an active bug).

Repro 1
The relevant partition parameters for Repro 1 are as follows:

@keep_partition_changes = N'false',
@use_partition_groups = N'false',
@compatibility_level = N'90RTM'
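For context, these parameters belong to the call that creates the merge publication. The fragment below is a hedged sketch rather than the script from the download: PublicationDb and TestMergePub are hypothetical names, every other parameter is left at its default, and it uses the documented parameter name @publication_compatibility_level, which I take to be what the listing above abbreviates.

```sql
-- Run at the publisher. A distributor must already be configured.
-- PublicationDb and TestMergePub are hypothetical names.
EXEC sp_replicationdboption
    @dbname  = N'PublicationDb',
    @optname = N'merge publish',
    @value   = N'true';
GO

USE PublicationDb;
GO

EXEC sp_addmergepublication
    @publication                     = N'TestMergePub',
    @keep_partition_changes          = N'false',
    @use_partition_groups            = N'false',   -- precomputed partitions off
    @publication_compatibility_level = N'90RTM';
```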

The replication loses data on the Order table. The data that was filtered from the subscriber on a previous replication is lost on the second replication at the publisher.

Take the following steps to reproduce this data loss (download the scripts for Repro 1 here):

  1. Run the script Create Publication Database.sql.
  2. Run the script Insertions at the Publication Database.sql.
  3. Run the script Create Test Merge Publication.sql.
  4. Run the script Create Subscriber Database.sql.
  5. Run Snapshot Agent. Note: You should configure your distributor for the publisher if it hasn’t been before.
  6. Create the subscription at the subscriber (Create Subscription.sql).
  7. Replicate (you can use Windows XP Synchronization Manager if you’re using SQLE subscribers).
  8. Run the script First Batch Update (@Subscriber).sql.
  9. Replicate again. (This will make sure the replication works as expected.)
  10. Run the script Second Batch Update (@Subscriber).sql.
  11. Replicate again.

Data loss will occur, as the updated records from Step 8 (running the First Batch Update (@Subscriber).sql) are gone from the publication database after Step 11. Also, no conflict is recorded.
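Two quick checks can help confirm what happened. This is a sketch under assumptions: TestMergePub is a hypothetical publication name, and the meaning of the tombstone type codes is my reading of the system table documentation (type 1 is documented as a user delete), so treat the interpretation as something to verify on your own build.

```sql
-- At the publisher, in the publication database, after Step 11:
-- lists articles that have conflicts logged. An empty result is
-- consistent with "no conflict is recorded".
EXEC sp_helpmergearticleconflicts @publication = N'TestMergePub';
GO

-- At the subscriber, in the subscription database, after Step 10
-- and before Step 11: summarize the deletes the next merge will enumerate.
-- Assumption: type = 1 marks user deletes; other values mark
-- system-generated deletes such as rows leaving the partition.
SELECT type, COUNT(*) AS deleted_rows
FROM dbo.MSmerge_tombstone
GROUP BY type;
```

If rows that were merely filtered out of the subscriber's partition show up with the user-delete type, the next merge will upload them to the publisher as real deletes.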

Repro 2
The relevant partition parameters for Repro 2 are as follows:

@keep_partition_changes = N'false',
@use_partition_groups = N'false',
@compatibility_level = N'90RTM'

Some factors to note about this scenario are:

  • The enumeration of changes does not account for the deletions in the ApprProperty table at the publisher.
  • The filter joins in this case are in direct order of PK-FK (The Comment table has a FK to the Order table).
  • The hierarchy is as follows:
    1. The table User is filtered based on username. The PK on this table is UserId.
    2. The table Order has a FK to User (UserId). The PK on this table is OrderId.
    3. The table Comment has a FK to Order (OrderId).
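The hierarchy above can be sketched as follows. Only the table names and the UserId and OrderId keys come from the description; the remaining columns, the filter names, and the publication name TestMergePub are hypothetical. The fragment also assumes the three tables have already been added as articles named User, Order, and Comment, with the username row filter defined on User.

```sql
-- Hypothetical schema; [User] and [Order] are bracketed because
-- both are reserved words in T-SQL.
CREATE TABLE dbo.[User] (
    UserId   int NOT NULL PRIMARY KEY,
    UserName nvarchar(128) NOT NULL          -- assumed filter column
);

CREATE TABLE dbo.[Order] (
    OrderId int NOT NULL PRIMARY KEY,
    UserId  int NOT NULL REFERENCES dbo.[User] (UserId)
);

CREATE TABLE dbo.Comment (
    CommentId int NOT NULL PRIMARY KEY,      -- assumed key column
    OrderId   int NOT NULL REFERENCES dbo.[Order] (OrderId),
    Body      nvarchar(400) NULL             -- assumed payload column
);
GO

-- Join filters in the direct order of PK-FK: User -> Order -> Comment.
EXEC sp_addmergefilter
    @publication       = N'TestMergePub',    -- hypothetical publication name
    @article           = N'Order',
    @filtername        = N'Order_User',
    @join_articlename  = N'User',
    @join_filterclause = N'[User].[UserId] = [Order].[UserId]',
    @join_unique_key   = 1;   -- UserId is unique in the parent (User)

EXEC sp_addmergefilter
    @publication       = N'TestMergePub',
    @article           = N'Comment',
    @filtername        = N'Comment_Order',
    @join_articlename  = N'Order',
    @join_filterclause = N'[Order].[OrderId] = [Comment].[OrderId]',
    @join_unique_key   = 1;   -- OrderId is unique in the parent (Order)
```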

Take the following steps to reproduce the Repro 2 data loss (download the scripts for Repro 2 here):

  1. Run the script Create Publication Database.sql.
  2. Run the script Insertions (@Publication).sql.
  3. Run the script Create Publication.sql.
  4. Run Snapshot Agent.
  5. Run the script Create Subscriber Database.sql.
  6. Create the subscription (Create Subscription.sql).
  7. Replicate.
  8. Run the script First Insertions in Comments (@Subscriber).sql.
  9. Replicate again. (This will make sure the replication works as expected.)
  10. Run the script First Update_Order (@Subscriber).sql. (This update changes the order rows for filtering.)
  11. Replicate. (The filter works properly and all comments are there.)
  12. Replicate again.

The Comment table will lose data after Step 12. It should contain three comments, but the comments inserted in Step 8 (running the First Insertions in Comments (@Subscriber).sql) are lost.
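One simple way to observe the loss is to run the same count at the publisher after Step 11 and again after Step 12 and compare the results. This is a minimal sketch that assumes the table is dbo.Comment with an OrderId column, as in the hierarchy described earlier.

```sql
-- At the publisher, in the publication database.
-- Run once after Step 11 and once after Step 12, then compare.
SELECT c.OrderId, COUNT(*) AS comment_count
FROM dbo.Comment AS c
GROUP BY c.OrderId
ORDER BY c.OrderId;
```

A drop in comment_count between the two runs, with no deletes issued by any user, indicates that the second merge removed the rows.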

This information should be useful to DBAs and database developers who use merge replication systems with data filtering in SQL Server 2005.
