Login | Register   
LinkedIn
Google+
Twitter
RSS Feed
Download our iPhone app
TODAY'S HEADLINES  |   ARTICLE ARCHIVE  |   FORUMS  |   TIP BANK
Browse DevX
Sign up for e-mail newsletters from DevX


advertisement
 

Fast Text Processing in SQL Server : Page 4

Processing text or long strings usually reduces SQL to a prosaic procedural language. Learn a few techniques for facilitating speedy text processing in SQL.


advertisement

SQL Server 2000 and SQL Server 2005 Techniques

The recursive queries using CTE are available only in SS2005, so to produce the same results in SS2000 as the techniques in the previous section did, you need to apply dynamic SQL. The following script (see Listing 3) demonstrates that technique:

Listing 3. How to Convert by Words and Load Text into the Table, Using Dynamic SQL

SET QUOTED_IDENTIFIER OFF
GO
SET NOCOUNT ON
DECLARE @str varchar(8000)
SELECT @str = "James Joyce (1882-1941) was an Irish writer and poet, and is
widely considered one of the most significant writers of the 20th century."

SELECT @str = REPLACE(@str, CHAR(13),'')
SELECT @str = REPLACE(@str, CHAR(10), '')
IF EXISTS(SELECT name FROM sysobjects WHERE name = 'dynWords' AND type = 'U')
DROP TABLE dynWords
CREATE TABLE dynWords(
wordID int identity(1,1) not null,
word varchar(30) not null)

SELECT @Str = 'INSERT INTO dynWords(word) SELECT A="' +
REPLACE(@str, ' ', '"UNION ALL SELECT"') + '"'
EXECUTE(@str);
SELECT * FROM dynWords;

Result:

wordID word
------ -------------
1 James
2 Joyce
3 (1882-1941)
4 was
. . . . . . . . . . . .

20 of
21 the
22 20th
23 century.

The logic in this example is very simple. You need to add the header "INSERT INTO #dynWords(word) SELECT A=" to the original text and then replace the spaces between the words with the phrase "UNION ALL SELECT". You will get the following string:



INSERT INTO #dynWords(word) SELECT A="James"UNION ALL SELECT"Joyce"… UNION ALL SELECT"century."

When you execute this string dynamically, the result will be loaded into the #dynWords table.

Now, let's try to determine:

  • The number of words in the text
  • The length of each word
  • The position of each empty space between the words

Listing 4 is the solution to accomplish this.

Listing 4. Processing Converted Text, Using SQL (Part 1)

-- 1. The number of words in the text SELECT COUNT(*) AS 'Total words in the text:' FROM dynWords WHERE word NOT IN ('', ' ', CHAR(13), CHAR(10)); Result: Total words in the text: ------------------------ 23 -- 2. The length of each word SELECT *, LEN(word) wordLength FROM dynWords; Result: wordID word wordLength ----------- ------------------------------ ----------- 1 James 5 2 Joyce 4 3 (1882-1941) 11 . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 the 3 22 20th 4 23 century. 8 -- 3. The numeric position of each empty space between the words PRINT 'The numeric position of each empty space between the words:' SELECT tbl2.*, tbl1.c1 as spacePos FROM (SELECT c1 = SUM(LEN(t1.word)+1), c2 = t2.wordID FROM dynWords AS t1 INNER JOIN dynWords AS t2 ON t1.wordID<=t2.wordID GROUP BY t2.wordID) tbl1 INNER JOIN (SELECT * FROM dynWords) tbl2 ON tbl1.c2 = tbl2.wordID Result: The numeric position of each empty space between the words: wordID word spacePos ----------- ------------------------------ ----------- 1 James 6 2 Joyce 11 3 (1882-1941) 23 4 was 27 . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 of 117 21 the 121 22 20th 126 23 century. 135

Queries 1 and 2 are self-explanatory. Query 3 is much more interesting, though it uses a pretty well known technique with self-joins. The aggregate function SUM(LEN(t1.word)+1) accumulates the number of letters (including the spaces) in the words preceding the current one. Using COUNT() instead of SUM(), you could get the number of words preceding the current one--in other words, the order number of each word in the text, which already is represented by the identity column wordID.

Now, let's proceed and find:

  • The word "and" and words that start with a letter "w"
  • The position of each phrase that starts with a letter "w"

Listing 5 is the solution to accomplish this.

Listing 5. Processing Converted Text, Using SQL (Part 2)

-- 1) Find words "and" and words, starting with a letter "w". SELECT wordID, word FROM dynWords WHERE word LIKE 'w%'OR word LIKE 'AND'; Result: wordID word ----------- -------------- 4 was 7 writer 8 and 10 and 12 widely 19 writers -- 2) The position of each phrase, starting with a letter "w". SET QUOTED_IDENTIFIER OFF GO SET NOCOUNT ON DECLARE @str varchar(8000) SELECT @str = "James Joyce (1882-1941) was an Irish writer and poet, and is widely considered one of the most significant writers of the 20th century." SELECT @str = REPLACE(@str, CHAR(13),' ') SELECT @str = REPLACE(@str, CHAR(10), '') IF EXISTS(SELECT name FROM sysobjects WHERE name = 'dynWords' AND type = 'U') DROP TABLE dynWords CREATE TABLE dynWords(wordID int identity(1,1) not null, word varchar(100) not null) SELECT @str = 'INSERT INTO dynWords(word) SELECT A="' + REPLACE(@str, ' w', '"UNION ALL SELECT"w') + '"'; EXECUTE(@str) PRINT 'The position of each phrase starting with "w":' SELECT tbl2.*, tbl1.c1 as startPos FROM (SELECT (SUM(LEN(t1.word)+1)+1) as c1, c2 = t2.wordID FROM dynWords AS t1 INNER JOIN dynWords AS t2 ON t1.wordID < t2.wordID GROUP BY t2.wordID) tbl1 INNER JOIN (SELECT * FROM dynWords) tbl2 ON tbl1.c2 = tbl2.wordID Result: The position of each phrase starting with "w": wordID word startPos ----------- ---------------------------------------------- --------- 2 was an Irish 24 3 writer and poet, and is 37 4 widely considered one of the most significant 61 5 writers of the 20th century. 107

SQL Server Speed Gap: 2005 vs. 2000

Text conversion and processing techniques are very powerful and flexible; you can use them in many projects. The techniques with dynamic SQL are more restrictive (a 8,000-character limit for the varchar data type or 4,000 for nvarchar) than the recursive techniques, which are faster, but you can use them only in SS2005.



Alex Kozak is a senior DBA/analyst working for SAP Canada. He has more than 15 years of database and programming experience. Microsoft has included some of his articles in the MSDN Library.
Comment and Contribute

 

 

 

 

 


(Maximum characters: 1200). You have 1200 characters left.

 

 

Sitemap