### WEBINAR:

On-Demand

Full Text Search: The Key to Better Natural Language Queries for NoSQL in Node.js

#### Data Compression

To apply what you've learned about binary numbers to data compression, follow the example in this section, which compresses binary numbers by replacing each maximal sequence of n 1's with the binary number n. The example does this only when the replacement makes the original message shorter. For example, for the original message 10111111111111111000111011 (26 bits), you will get the compressed message 10111100011011 (14 bits). (You actually can apply the same algorithm to a sequence of 0's.)

Replacing a maximal sequence of n 1's with the binary number n means you can't replace just a subset of 1's from the sequence of 1's. For example, you can't take five 1's from the 15 in the above example. You need to replace an entire piece, which will be flanked by 0's on both sides (as is the case when 1's are in the middle of a binary sequence) or bounded by a zero on one side (as is the case when 1's are at the very beginning or at the very end of a binary sequence).

Because this example mandates that replacement has to shorten the message, there is a lower limit to the sequences of 1's that you can replace. That limit is equal to three 1's. Indeed, 111 can be replaced by 11, which shortens the length of the binary message by one digit. Conversely, 11 can be replaced by 10, which will not shorten the length of the message.

Even though the smallest unit of information is one bit, processing data bit by bit is very inconvenient. Usually data is stored, transferred, or processed by bytes or by words that consist of 1, 2, 4, or 8 bytes. For my tests, I used 20-bit binary numbers generated in the previous example. You should align byte operations on those numbers to the nearest byte boundary (24 bits). However, for the purposes of data compression, the data doesn't need to be aligned to the byte boundary. It doesn't even need to keep the leading zeros, which I've deleted to make the remaining transformations lighter (see Listing 4).

**Listing 4.** Deleting Leading Zeros

```
SET NOCOUNT ON;
DELETE binFinal WHERE decNum = 0;
UPDATE binFinal
SET binNum = RIGHT(binNum, LEN(binNum) - PATINDEX('%[^0]%',binNum) + 1);
SELECT * FROM binFinal;
Result:
decNum binNum
----------- ----------------------------
1 1
2 10
3 11
4 100
5 101
6 110
7 111
8 1000
. . . . . . . . . . . . . . . . . . . .
11219 10101111010011
11220 10101111010100
. . . . . . . . . . . . . . . . . . . .
1048573 11111111111111111101
1048574 11111111111111111110
1048575 11111111111111111111
```

Now it's time to try compressing the generated binary numbers in accordance with the example rules. First of all, create and load a new table similar to binFinal but with one more column that will store compressed data (see Listing 5).

**Listing 5.** Create and Prepare Table for Data Compression

```
SET NOCOUNT ON
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'dbo.binWithCompr')
AND type in (N'U'))
DROP TABLE dbo.binWithCompr;
CREATE TABLE dbo.binWithCompr(
decNum int PRIMARY KEY,
originSeq varchar(50),
compresSeq varchar(50))
INSERT INTO binWithCompr
SELECT decNum, binNum, binNum FROM binFinal;
SELECT * FROM dbo.binWithCompr;
Result:
decNum originSeq compresSeq
----------- ----------------------- --------------
1 1 1
2 10 10
3 11 11
4 100 100
. . . . . . . . . . . . . . . . . .
```

Next, generate all the patterns of 1's presented in the binary sequences of the binFinal table (see Listing 6). The patterns can vary from 111 to 111111111111111111111.

**Listing 6.** Generate Patterns of 1's

```
DECLARE @maxLen int, @i int;
SELECT @maxLen = LEN(originSeq) FROM binWithCompr
WHERE decNum = (SELECT MAX(decNum) FROM binWithCompr)
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'dbo.patterns')
AND type in (N'U'))
DROP TABLE dbo.patterns
CREATE TABLE dbo.patterns(c1 varchar(100) PRIMARY KEY, decLength int);
SET @i = 3;
WHILE(@i <= @maxLen)
BEGIN
INSERT INTO dbo.patterns VALUES(REPLICATE('1', @i), @i);
SELECT @i = @i + 1;
END
SELECT * FROM dbo.patterns;
Result:
c1 decLength
-------------------------------- ------------
111 3
1111 4
11111 5
111111 6
1111111 7
11111111 8
. . . . . . . . . . . . . . . . . .
1111111111111111111 19
11111111111111111111 20
```

Next, replace each maximal pattern of 1's you found with its binary equivalent. The entire compression requires two steps:

- Replace all maximal sequences of n 1's with the decimal version of n, surrounded by '*'.
- Replace the decimal numbers with their binary equivalents and remove '*'.

Listing 7 describes the first compression step. Listing 8 is the final compression step.