inary data is natural for computers. Everything running on the computer’s physical layer, in the registers, ALU, memory, and so on, is just a result of manipulating a set of 0’s and 1’s, where 0 and 1 are the logical representations of two different voltage levels. While very convenient for machine processing, base-two numeral systems have some drawbacks. For one, each binary digit can keep only two values: 0 or 1. That means large binary numbers will produce very long binary messages, which cause problems with data transfer, processing, and storage.

One solution to the problem of long binary messages is data encoding that shortens the original message?in another words, data compression. You can employ any number of compression algorithms to compress data, but what happens under the hood? Rather than discuss the fairly large number of compression algorithms out there, this article focuses on the technical aspects of data compression, namely binary data manipulations that can be useful for any compression algorithm.

Author’s Note: This article developed as a SQL/Transact-SQL solution to a problem called “Bit Compressor” from the ACM International Collegiate Programming Contest (ACM-ICPC) 2006 World Finals (click here to view the problem). While the solution technique(s) can serve as an approach to completing practical tasks (generating and manipulating binary numbers, encrypting data using patterns, and detecting and fixing data transmission errors), the problem primarily serves as a brainteaser. |

#### How to Generate Binary Numbers in SQL

While binary data is native for computers, using binary data directly in the programming layers?above assemblers and C?is very inconvenient because the human eye and brain are much more familiar with decimal numbers. In addition, binary numbers can occupy a lot of storage space when represented in long character strings. This may be why SQL and SQL Server don’t have programming interfaces that allow direct manipulation of binary data. If you want to generate binary numbers with either technology, you need to find your own way to do it.

One approach for producing binary numbers is generating decimal numbers (which is very easy to do in SQL Server or SQL) and then converting them to their binary equivalents. To convert data from an n-base (decimal) to an m-base (binary) numeral system, you can use the following well-known algorithm:

- Take the number in the n-base system and repeatedly divide it by the radix of the m-base numeral system.
- Taken in order from last to first, the remainders of your divisions will give you the converted number.

For example, if you want to convert 87_{10} to its binary equivalent, you need to take 87 and repeatedly divide it by 2 as follows:

`87/2 = 43, remainder 143/2 = 21, remainder 121/2 = 10, remainder 110/2 = 5, remainder 05/2 = 2, remainder 12/2 = 1 remainder 0`

When you attach the remainders of all the intermediate results to the result of the last division, you get 1010111 = 2^{6} + 2^{4} + 2^{2} + 2^{1} + 2^{0} = 64 + 16 + 4 + 2 + 1 = 87_{10}. This approach works, but it may become very inefficient if you need to generate lots of binary numbers.

If you have hexadecimal numbers, you can use another simple conversion approach: just replace each hexadecimal digit by its binary equivalent. For example, you can convert E70_{16} = 14 * 16^{2} + 7 * 16^{1} = 3696_{10} to 111001110000_{2} = 2^{11} + 2^{10} + 2^{9} + 2^{6} + 2^{5} + 2^{4} = 3696_{10}.

Listing 1 shows a more interesting method for generating binary numbers with SQL cross joins.

Listing 1.How to Generate Binary Numbers Using Cross Joins

`SET NOCOUNT ONIF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'dbo.bin') AND type in (N'U'))DROP TABLE dbo.binCREATE TABLE bin(colBin varchar(50));INSERT INTO bin VALUES('0');INSERT INTO bin VALUES('1');SELECT (b1.colBin + b2.colBin) as binNumber FROM bin b1 CROSS JOIN bin b2 ORDER BY 1;Result:binNumber----------------------------------------00011011`

Incrementing the number of cross joins by one and adding one more addend to the select list in the query from Listing 1, you will get eight binary numbers (see Listing 2).

Listing 2.How to Generate More Binary Numbers Using Cross Joins

`SELECT (b1.colBin + b2.colBin + b3.colBin) as binNumber FROM bin b1 CROSS JOIN bin b2 CROSS JOIN bin b3 ORDER BY 1;Result:binNumber------------------------------------------------------------000001010011100101110111`

The number of rows in the result (the amount of generated binary numbers) is 2 to the power of the number of tables participating in the cross joins (in this case, 2^{3} = 8). If you continue incrementing the number of tables participating in cross joins together with the number of addends in the select list of the query, you theoretically can get any amount of binary numbers.

The problem with this approach is the growing query. If, for example, you need to generate all possible binary numbers in 20 digits, you need to produce the query with 20 cross joins and 20 addends in the select list. The script in Listing 3 helps solve the problem of a growing query.

I ran the script for the @binLen = 20 and generated all possible binary numbers in 20 digits: 2^{20} ? 1 = 1,048,575 numbers. To make it more convenient, I loaded the result into the table binFinal. That table stores all binary numbers together with their decimal equivalents.

#### Data Compression

To apply what you’ve learned about binary numbers to data compression, follow the example in this section, which compresses binary numbers by replacing each maximal sequence of n 1’s with the binary number n. The example does this only when the replacement makes the original message shorter. For example, for the original message 10111111111111111000111011 (26 bits), you will get the compressed message 10111100011011 (14 bits). (You actually can apply the same algorithm to a sequence of 0’s.)

Replacing a maximal sequence of n 1’s with the binary number n means you can’t replace just a subset of 1’s from the sequence of 1’s. For example, you can’t take five 1’s from the 15 in the above example. You need to replace an entire piece, which will be flanked by 0’s on both sides (as is the case when 1’s are in the middle of a binary sequence) or bounded by a zero on one side (as is the case when 1’s are at the very beginning or at the very end of a binary sequence).

Because this example mandates that replacement has to shorten the message, there is a lower limit to the sequences of 1’s that you can replace. That limit is equal to three 1’s. Indeed, 111 can be replaced by 11, which shortens the length of the binary message by one digit. Conversely, 11 can be replaced by 10, which will not shorten the length of the message.

Even though the smallest unit of information is one bit, processing data bit by bit is very inconvenient. Usually data is stored, transferred, or processed by bytes or by words that consist of 1, 2, 4, or 8 bytes. For my tests, I used 20-bit binary numbers generated in the previous example. You should align byte operations on those numbers to the nearest byte boundary (24 bits). However, for the purposes of data compression, the data doesn’t need to be aligned to the byte boundary. It doesn’t even need to keep the leading zeros, which I’ve deleted to make the remaining transformations lighter (see Listing 4).

Listing 4.Deleting Leading Zeros

`SET NOCOUNT ON;DELETE binFinal WHERE decNum = 0;UPDATE binFinal SET binNum = RIGHT(binNum, LEN(binNum) - PATINDEX('%[^0]%',binNum) + 1);SELECT * FROM binFinal; Result:decNum binNum----------- ----------------------------1 12 103 114 1005 1016 1107 1118 1000. . . . . . . . . . . . . . . . . . . .11219 1010111101001111220 10101111010100. . . . . . . . . . . . . . . . . . . .1048573 111111111111111111011048574 111111111111111111101048575 11111111111111111111`

Now it’s time to try compressing the generated binary numbers in accordance with the example rules. First of all, create and load a new table similar to binFinal but with one more column that will store compressed data (see Listing 5).

Listing 5.Create and Prepare Table for Data Compression

`SET NOCOUNT ONIF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'dbo.binWithCompr') AND type in (N'U'))DROP TABLE dbo.binWithCompr;CREATE TABLE dbo.binWithCompr( decNum int PRIMARY KEY, originSeq varchar(50), compresSeq varchar(50)) INSERT INTO binWithComprSELECT decNum, binNum, binNum FROM binFinal;SELECT * FROM dbo.binWithCompr;Result:decNum originSeq compresSeq----------- ----------------------- --------------1 1 12 10 103 11 114 100 100. . . . . . . . . . . . . . . . . .`

Next, generate all the patterns of 1’s presented in the binary sequences of the binFinal table (see Listing 6). The patterns can vary from 111 to 111111111111111111111.

Listing 6.Generate Patterns of 1's

`DECLARE @maxLen int, @i int; SELECT @maxLen = LEN(originSeq) FROM binWithCompr WHERE decNum = (SELECT MAX(decNum) FROM binWithCompr)IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'dbo.patterns') AND type in (N'U'))DROP TABLE dbo.patternsCREATE TABLE dbo.patterns(c1 varchar(100) PRIMARY KEY, decLength int);SET @i = 3;WHILE(@i <= @maxLen)BEGIN INSERT INTO dbo.patterns VALUES(REPLICATE('1', @i), @i); SELECT @i = @i + 1; ENDSELECT * FROM dbo.patterns;Result:c1 decLength-------------------------------- ------------111 31111 411111 5111111 61111111 711111111 8. . . . . . . . . . . . . . . . . . 1111111111111111111 1911111111111111111111 20`

Next, replace each maximal pattern of 1's you found with its binary equivalent. The entire compression requires two steps:

- Replace all maximal sequences of n 1's with the decimal version of n, surrounded by '*'.
- Replace the decimal numbers with their binary equivalents and remove '*'.

Listing 7 describes the first compression step. Listing 8 is the final compression step.

#### Message Decompression

When you have a table with original and compressed values, you can easily decompress data. All you need to do is query the table and find the original message corresponding to what you just received. Listing 9 is an example of message decompression.

Listing 9.Message Decompression

`DECLARE @compresSeq VARCHAR(20);SELECT @compresSeq = '10001011';SELECT * FROM binWithCompr WHERE compresSeq = @compresSeq;Result:decNum originSeq compresSeq----------- --------------------------------- ---------------139 10001011 10001011279 100010111 10001011491 111101011 10001011983 1111010111 1000101118431 100011111111111 1000101163487 1111011111111111 100010111048571 11111111111111111011 10001011`

Listing 10.How to Find Number of 1's and/or 0's in Binary Messages

`DECLARE @compresSeq VARCHAR(20);SELECT @compresSeq = '10001011';SELECT COUNT(*) AS num_of_1s, SUBSTRING(t2.originSeq, t1.decNum,1) AS binVvalue, t2.originSeq, t2.compresSeq FROM (SELECT decNum FROM binWithCompr WHERE decNum <= (SELECT MAX(LEN(originSeq)) FROM binWithCompr WHERE compresSeq = @compresSeq)) t1 CROSS JOIN (SELECT originSeq, compresSeq FROM binWithCompr WHERE compresSeq = @compresSeq) t2 WHERE SUBSTRING(t2.originSeq, t1.decNum,1) = '1' GROUP BY SUBSTRING(t2.originSeq, t1.decNum,1), t2.originSeq, t2.compresSeq;Result:num_of_1s binVvalue originSeq compresSeq----------- -------------- ------------------- --------------------7 1 111101011 1000101112 1 100011111111111 1000101119 1 11111111111111111011 100010118 1 1111010111 100010114 1 10001011 100010115 1 100010111 1000101115 1 1111011111111111 10001011`

As you can see, the information about the number of 1's in the original message received with the compressed message can help you uniquely identify the original message.But this is not always correct. In some cases, different original messages with the same number of 1's can be converted into the same compressed message. For example, if you run the script from Listing 10 for @compresSeq = '1010101', you will get the following result:

`num_of_1s binValue originSeq compresSeq----------- --------- ------------ -------------7 1 111110101 101010110 1 11111011111 10101017 1 101011111 10101017 1 101111101 10101014 1 1010101 1010101`

As you can see, three original messages have seven 1's and two 0's, which can be compressed into the message 1010101. Therefore, there is no way to uniquely identify the original message (unless you have some additional information about it). All that means is you should use this type of compression only in situations where some data loss is acceptable (lossy data compression/decompression such as in image processing and transmission).

#### Message Size Matters

The example uses a hash table I created to make the decompression process easier. For obvious reasons, the described approach can be applied only to relatively short (2-4 bytes) binary messages. The number of binary combinations in 32 bits is 2^{32} = 4,294,967,296. Generating that amount of binary numbers is difficult but doable. Generating all binary combinations for messages that are hundreds or thousands of bits in length, however, is impossible.

Nevertheless, you can use the techniques shown in this article in many data compression implementations and/or when you need to manipulate binary data.