dcsimg
LinkedIn
Twitter
RSS Feed
Download our iPhone app
TODAY'S HEADLINES  |   ARTICLE ARCHIVE  |   FORUMS  |   TIP BANK
Browse DevX
Sign up for e-mail newsletters from DevX


advertisement
 

Fast Text Processing in SQL Server : Page 2

Processing text or long strings usually reduces SQL to a prosaic procedural language. Learn a few techniques for facilitating speedy text processing in SQL.


advertisement

SQL Server 2005 Recursive Techniques

I'm a big fan of recursion and I always try to use it in my projects (see my MSDN article, "Recursion in T–SQL"). However, implementing recursion in SQL Server 2000 (SS2000) is not an easy, intuitive task. SS2000 allows only 32 nesting levels (calls), which makes the recursion implementation very tricky or even impossible in some situations.

This is why I was very glad to see SQL Server 2005 (SS2005) introduce a recursive query that uses common table expression (CTE). I consider that feature a great enhancement to Transact-SQL. As an example, consider how easily you can convert and load the following text (the poem "True Beauty" written by my daughter Anat Kozak) into a table using SS2005 recursion. It is a recursive query using CTE:


Listing 1. How to Convert by Words and Load Text into the Table


SET QUOTED_IDENTIFIER OFF GO DECLARE @str varchar(2000); SELECT @str = "A lonely moon about the sky, A gentle flower in the breeze, A giant cloud of smoky stars, Some sticky honey made by bees. The soft sweet voice of melody And moist warm air, which we all breathe Tremendous power of our minds Which can make everyone believe..." 1 SELECT @str = REPLACE(@str, CHAR(10), ' '); 2 SELECT @str = REPLACE(@str, CHAR(13), ' ' ); 3 WITH ProcessText AS 4 (SELECT 1 as startPos, CHARINDEX (' ',@str, 1) as spacePos 5 UNION ALL 6 SELECT spacePos + 1, CHARINDEX (' ',@str, spacePos + 1) 7 FROM ProcessText WHERE spacePos > 0) 8 SELECT startPos, 9 spacePos, 10 SUBSTRING(@str, startPos, spacePos - startPos) AS singleWord 11 INTO #words 12 FROM ProcessText WHERE spacePos <> 0 13 UNION ALL 14 SELECT MAX(spacePos) + 1, 0, RIGHT(@str, LEN(@str) - MAX(spacePos)) 15 FROM ProcessText 16 OPTION(MAXRECURSION 0);

The CTE query definition starts in line 4, where an anchor member is defined. The recursive member, referencing ProcessText (the CTE name) is defined in lines 6 and 7. The statement, located in lines 8-16, executes CTE, inserting the result into the temporary table.

The query recursively looks for an empty space between the words and extracts substring (word), located between the current and previous empty spaces. You should take special care of some control characters, such as line feed (Char(10)) and carriage return (Char(13)), unless you want them to participate in the processing and be counted as the characters. You can find an example of such control characters processing in lines 1 and 2 of the code.

Running SELECT * FROM #words will produce the following result:


startPos    spacePos    sepWord
----------- ----------- --------
1           2           A
3           9           lonely
10          14          moon
15          20          about
21          24          the
25          29          sky,
30          30          
31          32          A
33          39          gentle
40          46          flower
. . . . . . . . . . . . . . . . .

112         116         made
117         119         by
120         125         bees.

126         126         
127         127         
128         128         
129         132         The
. . . . . . . . . . . . . . .

244         248         make
249         257         everyone
258         0           believe...

As you can see, the code from Listing 1 did additional work: it calculated the starting position of each word and the position of each empty space from the very beginning of the text. And that's not all. You can get a lot of useful information about the original text. For example, you can determine the number of words in the text as follows:


SELECT COUNT(*) AS numWords
   FROM #words
   WHERE singleWord <>'';

Result:

numWords
-----------
48

You can determine the length of each word as follows:


SELECT LEN(singleWord) as wordLength,* 
   FROM #words;

-- Result:

wordLength  startPos    spacePos    singleWord
1           1           2           A
6           3           9           lonely
4           10          14          moon
5           15          20          about
3           21          24          the
4           25          29          sky,
0           30          30          
1           31          32          A
6           33          39          gentle
6           40          46          flower
. . . . . . . . . . . . . . . . . . . . . .

You can determine how often each distinct word has been used as follows:


SELECT COUNT(*) AS wordFrequency, singleWord
  FROM #words
  GROUP BY singleWord;

-- Result:

wordFrequency singleWord
9             
3             A
1             about
1             air,
1             all
1             And
1             bees.

1             believe...

1             breathe
1             breeze,
1             by
1             can
. . . . . . . . . . . .

1             sticky
1             sweet
3             The
1             Tremendous
1             voice
1             warm
1             we
2             Which

The result this last query produced includes a couple of interesting details:

  • The result has been implicitly sorted in alphabetical order on column singleWord, because the GROUP BY clause was using a SORT operation internally.
  • The number 9 in the first line of the result indicates the number of lines in the text, but not the number of spaces between the words.



Close Icon
Thanks for your registration, follow us on our social networks to keep up-to-date