Fast Text Processing in SQL Server

Fast Text Processing in SQL Server

rocessing text or long strings in SQL has never been an easy task. SQL is a powerful language for performing fast operations on data sets, but when it comes to processing text or long strings, it’s usually reduced to a prosaic procedural language. This article shows a few techniques for facilitating speedy text processing in SQL. Although demonstrated in SQL Server, you can apply the underlying ideas to any RDBMS with only small adjustments. Also, no third-party tools, extended stored procedures, or user-defined functions or objects written in any programming language other than SQL (or Transact-SQL) are used.

Using these techniques you will be able to do the following and more without any loops:

  • Determine the number of the words in the text
  • Determine the length and position of each word in the text
  • Determine the number of occurrences of a letter (pattern) and their positions in the text
  • Determine the frequency of each distinct word or letter in the text
  • Eliminate a letter’s duplicates
  • Eliminate extra spaces between the words or between lines of text
  • Convert text according to a given format (e.g., define the length of lines in the text or implement more sophisticated formatting of the text)

Preliminary Transformations

SQL is a language dedicated to set-based processing. Text by nature requires one-by-one sequential processing, which is not the strongest feature of SQL. Hence, you can’t expect an improvement in text processing if you don’t change the layout of the text. In other words, you need to convert text into a structure that allows set-based manipulations.

Traditionally in relational databases, such structures were and continue to be tables. Therefore, to be able to process text using SQL, you need to put it into a table, where each word (or letter) will have a row value in a specific column.

The sections to follow show a few techniques that you can use for text conversion.

SQL Server 2005 Recursive Techniques

I’m a big fan of recursion and I always try to use it in my projects (see my MSDN article, “Recursion in T?SQL“).However, implementing recursion in SQL Server 2000 (SS2000) is not an easy, intuitive task. SS2000 allows only 32 nesting levels (calls), which makes the recursion implementation very tricky or even impossible in some situations.

This is why I was very glad to see SQL Server 2005 (SS2005) introduce a recursive query that uses common table expression (CTE). I consider that feature a great enhancement to Transact-SQL. As an example, consider how easily you can convert and load the following text (the poem “True Beauty” written by my daughter Anat Kozak) into a table using SS2005 recursion. It is a recursive query using CTE:

Listing 1. How to Convert by Words and Load Text into the Table

SET QUOTED_IDENTIFIER OFFGODECLARE @str varchar(2000); SELECT @str = "A lonely moon about the sky,A gentle flower in the breeze,A giant cloud of smoky stars,Some sticky honey made by bees.The soft sweet voice of melodyAnd moist warm air, which we all breatheTremendous power of our mindsWhich can make everyone believe..."1 SELECT @str = REPLACE(@str, CHAR(10), ' ');2 SELECT @str = REPLACE(@str, CHAR(13), ' ' );3 WITH ProcessText AS 4 (SELECT 1 as startPos, CHARINDEX (' ',@str, 1) as spacePos 5 UNION ALL 6 SELECT spacePos + 1, CHARINDEX (' ',@str, spacePos + 1) 7 FROM ProcessText WHERE spacePos > 0) 8 SELECT startPos, 9 spacePos, 10 SUBSTRING(@str, startPos, spacePos - startPos) AS singleWord 11 INTO #words12 FROM ProcessText WHERE spacePos <> 0 13 UNION ALL14 SELECT MAX(spacePos) + 1, 0, RIGHT(@str, LEN(@str) - MAX(spacePos))15 FROM ProcessText16 OPTION(MAXRECURSION 0);

The CTE query definition starts in line 4, where an anchor member is defined. The recursive member, referencing ProcessText (the CTE name) is defined in lines 6 and 7. The statement, located in lines 8-16, executes CTE, inserting the result into the temporary table.

The query recursively looks for an empty space between the words and extracts substring (word), located between the current and previous empty spaces. You should take special care of some control characters, such as line feed (Char(10)) and carriage return (Char(13)), unless you want them to participate in the processing and be counted as the characters. You can find an example of such control characters processing in lines 1 and 2 of the code.

Running SELECT * FROM #words will produce the following result:

startPos    spacePos    sepWord----------- ----------- --------1           2           A3           9           lonely10          14          moon15          20          about21          24          the25          29          sky,30          30          31          32          A33          39          gentle40          46          flower. . . . . . . . . . . . . . . . .112         116         made117         119         by120         125         bees.126         126         127         127         128         128         129         132         The. . . . . . . . . . . . . . .244         248         make249         257         everyone258         0           believe...

As you can see, the code from Listing 1 did additional work: it calculated the starting position of each word and the position of each empty space from the very beginning of the text. And that’s not all. You can get a lot of useful information about the original text. For example, you can determine the number of words in the text as follows:

SELECT COUNT(*) AS numWords   FROM #words   WHERE singleWord <>'';Result:numWords-----------48

You can determine the length of each word as follows:

SELECT LEN(singleWord) as wordLength,*    FROM #words;-- Result:wordLength  startPos    spacePos    singleWord1           1           2           A6           3           9           lonely4           10          14          moon5           15          20          about3           21          24          the4           25          29          sky,0           30          30          1           31          32          A6           33          39          gentle6           40          46          flower. . . . . . . . . . . . . . . . . . . . . .

You can determine how often each distinct word has been used as follows:

SELECT COUNT(*) AS wordFrequency, singleWord  FROM #words  GROUP BY singleWord;-- Result:wordFrequency singleWord9             3             A1             about1             air,1             all1             And1             bees.1             believe...1             breathe1             breeze,1             by1             can. . . . . . . . . . . .1             sticky1             sweet3             The1             Tremendous1             voice1             warm1             we2             Which

The result this last query produced includes a couple of interesting details:

  • The result has been implicitly sorted in alphabetical order on column singleWord, because the GROUP BY clause was using a SORT operation internally.
  • The number 9 in the first line of the result indicates the number of lines in the text, but not the number of spaces between the words.

SQL Server 2005 Recursive Techniques (cont’d)

Now, go back to the Listing 1. Replace lines 4 and 6 of the code with the following lines:

4 (SELECT 1 as startPos, CHARINDEX (' A',@str, 1) as  spacePos  . . . . .6  SELECT spacePos + 1, CHARINDEX (' A',@str, spacePos + 1) 

Running this updated script, Listing 1 will produce the following result:

startPos    spacePos    singleWord1           14          A lonely moon15          30          about the sky, 31          62          A gentle flower in the breeze, 63          160         A giant cloud of smoky stars,. . .sweet voice of melody161         175         And moist warm176         189         air, which we190         0           all breathe  Tremendous power. . . everyone believe...

At first glance, the result doesn’t make any sense; you just got all the phrases that start with a specific letter (A). But as you will see later, this option can be very useful when you want to format the text. Thus, when you split and load text into a table, you get easy access to each separate word, and that gives you a lot of flexibility and processing power.

But what if you need even more granularity? What if you need to work with each individual letter, instead of the whole word? Well, you can do that using the following script:

Listing 2. How to Convert by Letters and Load Text into the Table

SET QUOTED_IDENTIFIER OFFGODECLARE @str varchar(2000); SELECT @str = "A lonely moon about the sky,A gentle flower in the breeze,A giant cloud of smoky stars,Some sticky honey made by bees.The soft sweet voice of melodyAnd moist warm air, which we all breatheTremendous power of our mindsWhich can make everyone believe...";1 SELECT @str = REPLACE(@str, CHAR(10), ' ');2 SELECT @str = REPLACE(@str, CHAR(13), '');3 WITH ProcessText AS 4 (SELECT 1 as startPos, SUBSTRING (@str,1,1) as letter 5 UNION ALL 6 SELECT startPos + 1, SUBSTRING (@str,startPos+1, 1) 7 FROM ProcessText WHERE startPos <= len(@str))8 SELECT * INTO #letters9 FROM ProcessText 10 OPTION(MAXRECURSION 0);SELECT * FROM #letters;Result:startPos letter----------- ------1 A2 3 l4 o5 n6 e7 l8 y9 10 m11 o12 o13 n. . . . . . .249 250 b251 e252 l253 i254 e255 v256 e257 .258 .259 .260

The solution in Listing 2 is based on the same idea as the solution in Listing 1. The query recursively applies the SUBSTRING() function to the original string, incrementing the starting position by one on each recursion's step, and stops processing when the starting position reaches the end of the string (text).

The solution from Listing 2 gives you direct access to each letter in the text, which gives you additional flexibility and processing power. Now, you can easily determine the number of letters in the text:

SELECT COUNT(*) AS NumLetters   FROM #letters   WHERE letter <> '';Result:NumLetters-----------219

You can determine how often each distinct letter has been used:

SELECT COUNT(*) AS NumLetters,letter     FROM #letters   WHERE letter <> ''   GROUP BY letter;Result:NumLetters  letter----------- ------4           ,4           .14          a6           b6           c6           d32          e. . . . . . .13          t4           u3           v7           w8           y1           z

If you change Listing 2 slightly by replacing lines 4 and 6 with the following lines:

4 (SELECT 1 as startPos, SUBSTRING (@str,1,2) as letter    . . . . . . . . . . . .6 SELECT startPos + 2, SUBSTRING (@str,startPos+2, 2)

You will get the consequent pairs of letters in the result:

startPos    letter----------- ------1           A 3           lo5           ne7           ly9            m11          oo13          n 15           a17          bo. . . . . . . .

If you replace lines 4 and 6 with the following lines:

4 (SELECT 1 as startPos, SUBSTRING (@str,1,2) as letter    . . . . . . . . . . . .6 SELECT startPos + 2, SUBSTRING (@str,startPos+2, 2)

You will get the consequent pairs of letters with overlapping:

startPos    letter----------- ------1           A 2           lo3           on4           ne5           el6           ly7           y . . . . . . .

The preceding two cases allow text splitting by two letters (it actually can be any number of letters) and potentially are very useful for text formatting or finding duplicates or extra spaces in the text (more on that later).

SQL Server 2000 and SQL Server 2005 Techniques

The recursive queries using CTE are available only in SS2005, so to produce the same results in SS2000 as the techniques in the previous section did, you need to apply dynamic SQL. The following script (see Listing 3) demonstrates that technique:

Listing 3. How to Convert by Words and Load Text into the Table, Using Dynamic SQL

SET QUOTED_IDENTIFIER OFF
GO
SET NOCOUNT ON
DECLARE @str varchar(8000)
SELECT @str = "James Joyce (1882-1941) was an Irish writer and poet, and is
widely considered one of the most significant writers of the 20th century."

SELECT @str = REPLACE(@str, CHAR(13),'')
SELECT @str = REPLACE(@str, CHAR(10), '')
IF EXISTS(SELECT name FROM sysobjects WHERE name = 'dynWords' AND type = 'U')
DROP TABLE dynWords
CREATE TABLE dynWords(
wordID int identity(1,1) not null,
word varchar(30) not null)

SELECT @Str = 'INSERT INTO dynWords(word) SELECT A="' +
REPLACE(@str, ' ', '"UNION ALL SELECT"') + '"'
EXECUTE(@str);
SELECT * FROM dynWords;

Result:

wordID word
------ -------------
1 James
2 Joyce
3 (1882-1941)
4 was
. . . . . . . . . . . .

20 of
21 the
22 20th
23 century.

The logic in this example is very simple. You need to add the header "INSERT INTO #dynWords(word) SELECT A=" to the original text and then replace the spaces between the words with the phrase "UNION ALL SELECT". You will get the following string:

INSERT INTO #dynWords(word) SELECT A="James"UNION ALL SELECT"Joyce"? UNION ALL SELECT"century."

When you execute this string dynamically, the result will be loaded into the #dynWords table.

Now, let's try to determine:

  • The number of words in the text
  • The length of each word
  • The position of each empty space between the words

Listing 4 is the solution to accomplish this.

Listing 4. Processing Converted Text, Using SQL (Part 1)

-- 1. The number of words in the textSELECT COUNT(*) AS 'Total words in the text:' FROM dynWords WHERE word NOT IN ('', ' ', CHAR(13), CHAR(10));Result: Total words in the text: ------------------------ 23-- 2. The length of each wordSELECT *, LEN(word) wordLength FROM dynWords;Result:wordID word wordLength ----------- ------------------------------ ----------- 1 James 52 Joyce 43 (1882-1941) 11. . . . . . . . . . . . . . . . . . . . . . . . . . .21 the 322 20th 423 century. 8-- 3. The numeric position of each empty space between the wordsPRINT 'The numeric position of each empty space between the words:'SELECT tbl2.*, tbl1.c1 as spacePos FROM (SELECT c1 = SUM(LEN(t1.word)+1), c2 = t2.wordID FROM dynWords AS t1 INNER JOIN dynWords AS t2 ON t1.wordID<=t2.wordID GROUP BY t2.wordID) tbl1 INNER JOIN (SELECT * FROM dynWords) tbl2 ON tbl1.c2 = tbl2.wordID Result: The numeric position of each empty space between the words:wordID word spacePos ----------- ------------------------------ ----------- 1 James 62 Joyce 113 (1882-1941) 234 was 27. . . . . . . . . . . . . . . . . . . . . . . . . . .20 of 11721 the 12122 20th 12623 century. 135

Queries 1 and 2 are self-explanatory. Query 3 is much more interesting, though it uses a pretty well known technique with self-joins. The aggregate function SUM(LEN(t1.word)+1) accumulates the number of letters (including the spaces) in the words preceding the current one. Using COUNT() instead of SUM(), you could get the number of words preceding the current one--in other words, the order number of each word in the text, which already is represented by the identity column wordID.

Now, let's proceed and find:

  • The word "and" and words that start with a letter "w"
  • The position of each phrase that starts with a letter "w"

Listing 5 is the solution to accomplish this.

Listing 5. Processing Converted Text, Using SQL (Part 2)

-- 1) Find words "and" and words, starting with a letter "w".SELECT wordID, word FROM dynWords WHERE word LIKE 'w%'OR word LIKE 'AND';Result:wordID word ----------- --------------4 was7 writer8 and10 and12 widely19 writers-- 2) The position of each phrase, starting with a letter "w".SET QUOTED_IDENTIFIER OFFGOSET NOCOUNT ONDECLARE @str varchar(8000)SELECT @str = "James Joyce (1882-1941) was an Irish writer and poet, and iswidely considered one of the most significant writers of the 20th century."SELECT @str = REPLACE(@str, CHAR(13),' ')SELECT @str = REPLACE(@str, CHAR(10), '')IF EXISTS(SELECT name FROM sysobjects WHERE name = 'dynWords' AND type = 'U') DROP TABLE dynWordsCREATE TABLE dynWords(wordID int identity(1,1) not null, word varchar(100) not null)SELECT @str = 'INSERT INTO dynWords(word) SELECT A="' + REPLACE(@str, ' w', '"UNION ALL SELECT"w') + '"';EXECUTE(@str)PRINT 'The position of each phrase starting with "w":' SELECT tbl2.*, tbl1.c1 as startPos FROM (SELECT (SUM(LEN(t1.word)+1)+1) as c1, c2 = t2.wordID FROM dynWords AS t1 INNER JOIN dynWords AS t2 ON t1.wordID < t2.wordID GROUP BY t2.wordID) tbl1 INNER JOIN (SELECT * FROM dynWords) tbl2 ON tbl1.c2 = tbl2.wordIDResult:The position of each phrase starting with "w":wordID word startPos ----------- ---------------------------------------------- ---------2 was an Irish 243 writer and poet, and is 374 widely considered one of the most significant 615 writers of the 20th century. 107

SQL Server Speed Gap: 2005 vs. 2000

Text conversion and processing techniques are very powerful and flexible; you can use them in many projects. The techniques with dynamic SQL are more restrictive (a 8,000-character limit for the varchar data type or 4,000 for nvarchar) than the recursive techniques, which are faster, but you can use them only in SS2005.

devx-admin

devx-admin

Share the Post:
Poland Energy Future

Westinghouse Builds Polish Power Plant

Westinghouse Electric Company and Bechtel have come together to establish a formal partnership in order to design and construct Poland’s inaugural nuclear power plant at

EV Labor Market

EV Industry Hurting For Skilled Labor

The United Auto Workers strike has highlighted the anticipated change towards a future dominated by electric vehicles (EVs), a shift which numerous people think will

Soaring EV Quotas

Soaring EV Quotas Spark Battle Against Time

Automakers are still expected to meet stringent electric vehicle (EV) sales quotas, despite the delayed ban on new petrol and diesel cars. Starting January 2023,

Affordable Electric Revolution

Tesla Rivals Make Bold Moves

Tesla, a name synonymous with EVs, has consistently been at the forefront of the automotive industry’s electric revolution. The products that Elon Musk has developed

Poland Energy Future

Westinghouse Builds Polish Power Plant

Westinghouse Electric Company and Bechtel have come together to establish a formal partnership in order to design and construct Poland’s inaugural nuclear power plant at the Lubiatowo-Kopalino site in Pomerania.

EV Labor Market

EV Industry Hurting For Skilled Labor

The United Auto Workers strike has highlighted the anticipated change towards a future dominated by electric vehicles (EVs), a shift which numerous people think will result in job losses. However,

Soaring EV Quotas

Soaring EV Quotas Spark Battle Against Time

Automakers are still expected to meet stringent electric vehicle (EV) sales quotas, despite the delayed ban on new petrol and diesel cars. Starting January 2023, more than one-fifth of automobiles

Affordable Electric Revolution

Tesla Rivals Make Bold Moves

Tesla, a name synonymous with EVs, has consistently been at the forefront of the automotive industry’s electric revolution. The products that Elon Musk has developed are at the forefront because

Sunsets' Technique

Inside the Climate Battle: Make Sunsets’ Technique

On February 12, 2023, Luke Iseman and Andrew Song from the solar geoengineering firm Make Sunsets showcased their technique for injecting sulfur dioxide (SO₂) into the stratosphere as a means

AI Adherence Prediction

AI Algorithm Predicts Treatment Adherence

Swoop, a prominent consumer health data company, has unveiled a cutting-edge algorithm capable of predicting adherence to treatment in people with Multiple Sclerosis (MS) and other health conditions. Utilizing artificial

Personalized UX

Here’s Why You Need to Use JavaScript and Cookies

In today’s increasingly digital world, websites often rely on JavaScript and cookies to provide users with a more seamless and personalized browsing experience. These key components allow websites to display

Geoengineering Methods

Scientists Dimming the Sun: It’s a Good Thing

Scientists at the University of Bern have been exploring geoengineering methods that could potentially slow down the melting of the West Antarctic ice sheet by reducing sunlight exposure. Among these

why startups succeed

The Top Reasons Why Startups Succeed

Everyone hears the stories. Apple was started in a garage. Musk slept in a rented office space while he was creating PayPal with his brother. Facebook was coded by a

Bold Evolution

Intel’s Bold Comeback

Intel, a leading figure in the semiconductor industry, has underperformed in the stock market over the past five years, with shares dropping by 4% as opposed to the 176% return

Semiconductor market

Semiconductor Slump: Rebound on the Horizon

In recent years, the semiconductor sector has faced a slump due to decreasing PC and smartphone sales, especially in 2022 and 2023. Nonetheless, as 2024 approaches, the industry seems to

Elevated Content Deals

Elevate Your Content Creation with Amazing Deals

The latest Tech Deals cater to creators of different levels and budgets, featuring a variety of computer accessories and tools designed specifically for content creation. Enhance your technological setup with

Learn Web Security

An Easy Way to Learn Web Security

The Web Security Academy has recently introduced new educational courses designed to offer a comprehensible and straightforward journey through the intricate realm of web security. These carefully designed learning courses

Military Drones Revolution

Military Drones: New Mobile Command Centers

The Air Force Special Operations Command (AFSOC) is currently working on a pioneering project that aims to transform MQ-9 Reaper drones into mobile command centers to better manage smaller unmanned

Tech Partnership

US and Vietnam: The Next Tech Leaders?

The US and Vietnam have entered into a series of multi-billion-dollar business deals, marking a significant leap forward in their cooperation in vital sectors like artificial intelligence (AI), semiconductors, and

Huge Savings

Score Massive Savings on Portable Gaming

This week in tech bargains, a well-known firm has considerably reduced the price of its portable gaming device, cutting costs by as much as 20 percent, which matches the lowest

Cloudfare Protection

Unbreakable: Cloudflare One Data Protection Suite

Recently, Cloudflare introduced its One Data Protection Suite, an extensive collection of sophisticated security tools designed to protect data in various environments, including web, private, and SaaS applications. The suite

Drone Revolution

Cool Drone Tech Unveiled at London Event

At the DSEI defense event in London, Israeli defense firms exhibited cutting-edge drone technology featuring vertical-takeoff-and-landing (VTOL) abilities while launching two innovative systems that have already been acquired by clients.

2D Semiconductor Revolution

Disrupting Electronics with 2D Semiconductors

The rapid development in electronic devices has created an increasing demand for advanced semiconductors. While silicon has traditionally been the go-to material for such applications, it suffers from certain limitations.

Cisco Growth

Cisco Cuts Jobs To Optimize Growth

Tech giant Cisco Systems Inc. recently unveiled plans to reduce its workforce in two Californian cities, with the goal of optimizing the company’s cost structure. The company has decided to

FAA Authorization

FAA Approves Drone Deliveries

In a significant development for the US drone industry, drone delivery company Zipline has gained Federal Aviation Administration (FAA) authorization, permitting them to operate drones beyond the visual line of

Mortgage Rate Challenges

Prop-Tech Firms Face Mortgage Rate Challenges

The surge in mortgage rates and a subsequent decrease in home buying have presented challenges for prop-tech firms like Divvy Homes, a rent-to-own start-up company. With a previous valuation of

Lighthouse Updates

Microsoft 365 Lighthouse: Powerful Updates

Microsoft has introduced a new update to Microsoft 365 Lighthouse, which includes support for alerts and notifications. This update is designed to give Managed Service Providers (MSPs) increased control and

Website Lock

Mysterious Website Blockage Sparks Concern

Recently, visitors of a well-known resource website encountered a message blocking their access, resulting in disappointment and frustration among its users. While the reason for this limitation remains uncertain, specialists