Powerful, Flexible Text-Formatting Solutions in SQL Server

emoving extra empty spaces is a pretty common task that seems very simple, but in reality it is not as trivial as you may assume. The first impulsive solution that many programmers try is replacing two empty spaces with one, which works as long as the number of empty spaces doesn’t exceed two. The problem starts when text has three, four, or more empty spaces between words. If you try repeatedly replacing every two empty spaces with one in those scenarios, you can completely remove some of the empty spaces between the words.

The conventional approach for such a task would be checking for empty spaces, character by character, in a loop and removing the extra spaces. However, there is another approach (see Listing 1).

Listing 1. Removing Extra Empty Spaces

SET QUOTED_IDENTIFIER OFFGOSET NOCOUNT ONDECLARE @str varchar(2000)SELECT @str = " Bipartite graphs are useful for modelling matching problems."CREATE TABLE #words(singleWord varchar(50))SELECT @str = REPLACE(@str, CHAR(13), ' ');SELECT @str = REPLACE(@str, CHAR(10), ' '); -- Split by words and load text into the table SELECT @str = 'INSERT INTO #words(singleWord) SELECT A="' + REPLACE(@str, ' ', '"UNION ALL SELECT"') + '"'EXECUTE(@str);-- Delete spaces between the wordsSELECT * FROM #words; DELETE #words WHERE singleWord = '';SELECT * FROM #words; -- Restore original text without extra empty spacesSELECT @str = '';SELECT @str = @str + singleWord + ' ' FROM #words;SELECT @str;DROP TABLE #words

Using any of the techniques that were explained in the article “Fast Text Processing in SQL Server” (Listing 1 uses the technique with dynamic SQL), you can convert and load text into a table. The empty strings will represent the empty spaces between the words:

singleWord---------------------------------------BipartiteGraphs areusefulformodellingmatchingproblems.(17 row(s) affected)

All you would need to do now is delete the rows from the table #words where a singleWord column has the empty string values. Then you’d need to restore the original text, using a simple word accumulator, built on a string variable @str. The final result would be a normalized (without extra spaces) string: “Bipartite graphs are useful for modelling matching problems.”

Author’s Note: I received a few emails from readers complaining that some scripts from my article “Fast Text Processing in SQL Server” are not working. As it turns out, they used case-sensitive server collations. Please, be aware that my solutions have been tested for SQL_Latin1_General_CP1_CI_AS case-insensitive collation only. For case-sensitive collations, adjust the solutions correspondingly.

Simple Text Formatting

Now suppose you have text that you want to format in such a way that each row will consist of exactly nine words. (I chose nine, but a row can contain any reasonable number of words.) You’d just need to split the text and load it into the table, and then remove all the extra empty spaces using the technique in Listing 1. Finally, you would restore the original text, counting the words. Listing 2 shows the solution.

Listing 2. Text Formatting by the Number of Words in a Row

SET QUOTED_IDENTIFIER OFFGOSET NOCOUNT ONDECLARE @str varchar(2000)SELECT @str = "Benjamin Franklin (January 17, 1706?April 17, 1790) was one of the mostprominent of Founders and early political figures and statesmen of theUnited States. Considered the earliest of the Founders, Franklin wasnoted for his ingenuity, diversity of interests, and wit, and wasextraordinarily influential in the development of the American Revolutiondespite never holding national elective office. Born in Boston, Massachusetts,to a tallow-maker, Franklin became a newspaper editor, printer, and merchant in Philadelphia, Pennsylvania, becoming very wealthy. He spent many years in England and published the famous Poor Richard's Almanack and Pennsylvania Gazette.He formed both the first public lending library and volunteer fire department in America as well as the Junto, a political discussion club.";-- preparation before splittingSELECT @str = REPLACE(@str, CHAR(13), ' ' );SELECT @str = REPLACE(@str, CHAR(10), ' '); -- splitting the text using recursion techniqueWITH ProcessText AS (SELECT 1 as startPos, CHARINDEX (' ',@str, 1) as spacePos UNION ALL SELECT spacePos + 1, CHARINDEX (' ',@str, spacePos + 1) FROM ProcessText WHERE spacePos > 0) SELECT startPos, spacePos, SUBSTRING(@str, startPos, spacePos - startPos) AS singleWord INTO #words FROM ProcessText WHERE spacePos <> 0 UNION ALL SELECT MAX(spacePos) + 1, 0, RIGHT(@str, LEN(@str) - MAX(spacePos)) FROM ProcessText OPTION(MAXRECURSION 0);-- deleting extra spaces (rows with empty string in singleWord column DELETE #words WHERE singleWord = '';-- restoring and formatting the original textDECLARE @str2 varchar(8000), @formatter int, @cnt int SELECT @str2 = '', @formatter = 9, @cnt = 0;SELECT @cnt = @cnt + 1, @str2 = CASE WHEN @cnt % @formatter = 0 THEN @str2 + singleWord + char(10) ELSE @str2 + singleWord + ' ' END FROM #wordsSELECT @str2DROP TABLE #wordsResults:------------------------------------------------------------------------------Benjamin Franklin (January 17, 1706?April 17, 1790) was oneof the most prominent of Founders and early politicalfigures and statesmen of the United States. Considered theearliest of the Founders, Franklin was noted for hisingenuity, diversity of interests, and wit, and was extraordinarilyinfluential in the development of the American Revolution despitenever holding national elective office. Born in Boston, Massachusetts,to a tallow-maker, Franklin became a newspaper editor, printer,and merchant in Philadelphia, Pennsylvania, becoming very wealthy. Hespent many years in England and published the famousPoor Richard's Almanac and Pennsylvania Gazette. He formed boththe first public lending library and volunteer fire departmentin America as well as the Junto, a politicaldiscussion club.

I used the recursive technique to convert and load text into the table. (“Fast Text Processing in SQL Server” explains that technique in detail.)

You also can experiment with the formatter in Listing 2. For example, if you change the @formatter value from 9 to, say 5, you will get a result where each row of the text consists of exactly five words:

Results: ----------------------------------------------Benjamin Franklin (January 17, 1706?April17, 1790) was one ofthe most prominent of Foundersand early political figures andstatesmen of the United States.Considered the earliest of theFounders, Franklin was noted forhis ingenuity, diversity of interests,. . . . . . . . . . . . . . . . . . . first public lending library andvolunteer fire department in Americaas well as the Junto,a political discussion club.

Now suppose you don’t want to format text by the number of the words. Instead, you want to have the rows to have a certain length (e.g., 50 characters in each row). This task is more complicated than the previous one, because you can’t make rows of exactly 50 characters without breaking words. However, with some approximation, formatting by number of characters works (see Listing 3).

Listing 3. Text Formatting by the Number of Characters in a Row

SET QUOTED_IDENTIFIER OFFGOSET NOCOUNT ONDECLARE @str varchar(2000)SELECT @str = "Benjamin Franklin (January 17, 1706?April 17, 1790) was one of the mostprominent of Founders and early political figures and statesmen of theUnited States. Considered the earliest of the Founders, Franklin wasnoted for his ingenuity, diversity of interests, and wit, and wasextraordinarily influential in the development of the American Revolutiondespite never holding national elective office. Born in Boston, Massachusetts,to a tallow-maker, Franklin became a newspaper editor, printer, and merchantin Philadelphia, Pennsylvania, becoming very wealthy. He spent many years inEngland and published the famous Poor Richard's Almanack and Pennsylvania Gazette.He formed both the first public lending library and volunteer fire departmentin America as well as the Junto, a political discussion club.";SELECT @str = REPLACE(@str, CHAR(13), ' ');SELECT @str = REPLACE(@str, CHAR(10), ' ');WITH ProcessText AS (SELECT 1 as startPos, CHARINDEX (' ',@str, 1) as spacePos UNION ALL SELECT spacePos + 1, CHARINDEX (' ',@str, spacePos + 1) FROM ProcessText WHERE spacePos > 0) SELECT startPos, spacePos, SUBSTRING(@str, startPos, spacePos - startPos) AS singleWord INTO #words FROM ProcessText WHERE spacePos <> 0 UNION ALLSELECT MAX(spacePos) + 1, 0, RIGHT(@str, LEN(@str) - MAX(spacePos)) FROM ProcessTextOPTION(MAXRECURSION 0);DELETE #words WHERE singleWord = '';DECLARE @str2 varchar(8000), @formatter int, @cnt int SELECT @str2 = '', @formatter = 50, @cnt = 1;SELECT @cnt = CASE WHEN @cnt > = 0 AND @cnt < @formatter THEN @cnt + len(singleWord) + 2 ELSE 1 END, @str2 = CASE WHEN @cnt = 1 THEN @str2 + singleWord + char(10) ELSE @str2 + singleWord + ' ' END FROM #wordsSELECT @str2DROP TABLE #wordsResults: -----------------------------------------------------------------Benjamin Franklin (January 17, 1706?April 17, 1790) was one of the most prominent of Founders andearly political figures and statesmen of the UnitedStates. Considered the earliest of the Founders, Franklinwas noted for his ingenuity, diversity of interests,and wit, and was extraordinarily influential inthe development of the American Revolution despitenever holding national elective office. Born inBoston, Massachusetts, to a tallow-maker, Franklin becamea newspaper editor, printer, and merchant in Philadelphia,Pennsylvania, becoming very wealthy. He spent manyyears in England and published the famous PoorRichard's Almanack and Pennsylvania Gazette. Heformed both the first public lending library andvolunteer fire department in America as well asthe Junto, a political discussion club.

The code in Listing 3 is very similar to the solution from Listing 2. However, it is quite difficult to make the code work with adequate approximation. I had to tune the final query, adding an addend 2 to the very end of the next line:

WHEN @cnt > = 0 AND @cnt < @formatter THEN @cnt + len(singleWord) + 2

That adjustment allowed me to count the free spaces between the words and some special formatting characters (like line feed), when I calculated the row length.

A More Complicated Task

Unlike the previous examples, which might look fun but not practical, the solution described in this section potentially is very useful. Suppose you work in a big software company with tens to hundreds of developers that are located in different departments, cities, or even countries. The developers have their own code-writing styles that can vary pretty widely. As a result, the source code for the company's products is hardly readable, and debugging it is very complicated.

To solve this problem, you start constructing naming and code-writing conventions. Unfortunately, they don't work well and you still have tons of code that break the conventions. So you decide to create some kind of black box program that receives the original source code as an input, formats it in accordance with rules that you defined, and then issues the formatted code as output.

For instance, you may want your program to identify keywords in the original listing (code) and start those keywords from a new line with certain indents. Now, really think about this task. It can have many, many rules and very complicated logic with hundreds of branches and conditions. Implementing such logic in your code will cost you a large number of work cycles. And even if you do it, any further change or new rule implementation will turn into a nightmare. Thus, you need to find a solution where the implementation code and the formatting rules are independent.

The following subsection discusses one possible solution.

The Keyword Identification Solution

Assume you have the following fragment of the code that you want to format:

IF @keyWord1 = ''BEGIN		IF EXISTS(SELECT keyWord1 FROM rules WHERE keyWord1 = @keyWord2)BEGINSELECT @str = CHAR(10) + @str + @keyWord2SELECT @keyWord1 = @keyWord2ENDELSEBEGINSELECT @str = @str + ' ' + @keyWord2ENDEND

First, you need to de-normalize the text by converting it and loading it into the table.I created a stored procedure spu_LoadTextTable (see Listing 4) for that purpose, but you can use a script or UDF.

Listing 4. De-Normalize Text; Create and Run spu_LoadTextTable

IF EXISTS (SELECT * FROM INFORMATION_SCHEMA.ROUTINES WHERE SPECIFIC_SCHEMA = N'dbo' AND SPECIFIC_NAME = N'spu_LoadTextTable')DROP PROCEDURE dbo.spu_LoadTextTableGOCREATE PROCEDURE dbo.spu_LoadTextTable @str varchar(8000)ASSET NOCOUNT ON-- preparation before splittingSELECT @str = REPLACE(@str, CHAR(13), ' ');SELECT @str = REPLACE(@str, CHAR(10), ' '); SELECT @str = REPLACE(@str, CHAR(9), ' '); -- split text using recursion techniqueWITH ProcessText AS (SELECT 1 as startPos, CHARINDEX (' ',@str, 1) as spacePos UNION ALLSELECT spacePos + 1, CHARINDEX (' ',@str, spacePos + 1) FROM ProcessText WHERE spacePos > 0) SELECT startPos, spacePos, SUBSTRING(@str, startPos, spacePos - startPos) AS singleWord INTO #words FROM ProcessText WHERE spacePos <> 0 UNION ALLSELECT MAX(spacePos) + 1, 0, RIGHT(@str, LEN(@str) - MAX(spacePos)) FROM ProcessTextOPTION(MAXRECURSION 0);-- deleting extra spaces (rows with empty string in singleWord column DELETE #words WHERE singleWord = '';IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].words') AND type in (N'U'))DROP TABLE wordsSELECT ROW_NUMBER() OVER(ORDER BY (SELECT 1)) AS wordID, singleWord INTO words FROM #words;SELECT * FROM words;GOSET QUOTED_IDENTIFIER OFF;GOEXEC spu_LoadTextTable "IF @keyWord1 = ''BEGINIF EXISTS(SELECT keyWord1 FROM rules WHERE keyWord1 = @keyWord2)BEGINSELECT @str = CHAR(10) + @str + @keyWord2SELECT @keyWord1 = @keyWord2ENDELSEBEGINSELECT @str = @str + ' ' + @keyWord2ENDEND"Result:wordID singleWord-------------------- -------------------1 IF2 @keyWord13 =4 ''5 BEGIN6 IF7 EXISTS(SELECT8 keyWord19 FROM10 rules11 WHERE12 keyWord1. . . . . . . . . . . . . . . . . . . .35 +36 '37 '38 +39 @keyWord240 END41 END

To deal with the rules, I created a rules table:

IF NOT EXISTS (SELECT * FROM sys.objects    WHERE object_id = OBJECT_ID(N'[dbo].rules') AND type in (N'U'))CREATE TABLE rules(   keyWord1 varchar(50) NULL,   keyWord2 varchar(50) NULL,   indent int DEFAULT 0)

Then I inserted the rules into the table:

INSERT INTO rules VALUES('IF', 'BEGIN', 0);INSERT INTO rules VALUES('BEGIN', 'IF', 3);INSERT INTO rules VALUES('IF', 'FROM', 3);INSERT INTO rules VALUES('FROM', 'WHERE', 0);INSERT INTO rules VALUES('WHERE', 'BEGIN', -3);INSERT INTO rules VALUES('BEGIN', 'SELECT', 3);INSERT INTO rules VALUES('SELECT', 'SELECT', 0);INSERT INTO rules VALUES('SELECT', 'END', -3);INSERT INTO rules VALUES('END', 'ELSE', 0);INSERT INTO rules VALUES('ELSE', 'BEGIN', 0);INSERT INTO rules VALUES('END', 'END', -3);

Here's how the rules break down: if keyWord1 comes after keyWord2, then keyWord2 has to start from a new line. The indent of the new line will be equal to the current indent value plus an indent value taken from the rule. For example, the last inserted rule (INSERT INTO rules VALUES('END', 'END', -3);) means if a found keyword 'END' has a predecessor keyword 'END', then a new line for 'END' has to start from the very last indent position minus 3.

Finally, I created a stored procedure spu_FormatText (again, it could be a script or UDF) that builds the result (formatted text) in accordance with the rules from the table (see Listing 5).

Listing 5. Formatting Text in Accordance with the Rules

IF EXISTS (SELECT * FROM INFORMATION_SCHEMA.ROUTINES WHERE SPECIFIC_SCHEMA = N'dbo' AND SPECIFIC_NAME = N'spu_FormatText')DROP PROCEDURE dbo.spu_FormatTextGOCREATE PROCEDURE dbo.spu_FormatText ASSET NOCOUNT ONDECLARE @str varchar(8000)DECLARE @i int, @m int, @keyWord1 varchar(50), @keyWord2 varchar(50), @spChar int;SELECT @i = 1, @m = 0, @keyWord1 = '', @keyWord2 = '', @spChar = 0, @str = '';WHILE (1=1)BEGIN SELECT @keyWord2 = singleWord FROM words WHERE wordID = @i IF @@ROWCOUNT = 0 BREAK IF @keyWord1 = '' BEGIN IF EXISTS(SELECT keyWord1 FROM rules WHERE keyWord1 = @keyWord2) BEGIN SELECT @str = CHAR(10) + @str + @keyWord2 SELECT @keyWord1 = @keyWord2 END ELSE SELECT @str = @str + ' ' + @keyWord2 END ELSE BEGIN IF EXISTS(SELECT keyWord1 FROM rules WHERE keyWord1 = @keyWord1 AND keyWord2 = @keyWord2) BEGIN SET @spChar = 1 END ELSE BEGIN SET @spChar = 0 END SELECT @m = @m + ISNULL(indent, 0) FROM rules WHERE keyWord1 = @keyWord1 AND keyWord2 = @keyWord2 IF @spChar = 1 BEGIN SELECT @str = @str + CHAR(10) + SPACE(@m) + @keyWord2 SELECT @keyWord1 = @keyWord2 END IF @spChar = 0 BEGIN SELECT @str = @str + ' ' + @keyWord2 END END SELECT @i = @i + 1ENDSELECT @strGO------------------------------------------------EXEC spu_FormatTextResult: ------------------------------------------------------IF @keyWord1 = ''BEGIN IF EXISTS(SELECT keyWord1 FROM rules WHERE keyWord1 = @keyWord2) BEGIN SELECT @str = CHAR(10) + @str + @keyWord2 SELECT @keyWord1 = @keyWord2 END ELSE BEGIN SELECT @str = @str + ' ' + @keyWord2 ENDEND

The beauty of the solution in Lisitng 5 is its independence. The solution is completely separated from the formatting rules, which are stored in the rules table. Such an approach allows you to change, add, or delete any rule without touching the code. For example, you can update the indent column by changing the starting position of the keyword. Alternatively, you can delete the rule from the table, in which case the deleted combination of two keywords won't be processed (remaining on the same line of the text). Also, you can add a new rule when your listing has a missing keyword.

Text Formatting to Suite Your Scenario

You can write very powerful and flexible text formatting solutions in SQL Server. In fact, you can apply the black box program in the previous section to any programming language, but you should consider it only as an idea, not a final product. Armed with the techniques in this article, creating the final product is up to you.

Share the Post:
Share on facebook
Share on twitter
Share on linkedin

Overview

Recent Articles: