Login     Register

        Contact Us     Search

Normalizing Data with the PARSE function

Jul 31

Written by: Charles Flock
7/31/2009 4:07 PM  RssIcon

In this article we discuss the ease with which a string of delimited data can be converted to normalized data by using the XLeratorDB PARSE function.
In this article we will look at normalizing delimited data using the PARSE and NUMWORDS functions in XLeratorDB. We have created some test data, which consists of a string delimited by semi-colons terminated by a carriage return and line feed. The string data contains an implicit format. The first element is a category and all the remaining elements are members of that category. In this particular case we have used categories and members based on the Harry Potter series by J.K. Rowling.
Each category will have a different number of members, so there is no easy way to create a table beforehand to accommodate the implicit structure of the data. Thus, there is no easy way to parse the data in Excel and import it into SQL Server. We will have to do that in the database.
To get started, we create a simple table to store the string data. We will call the table StringData and it will contain a single column called String.
CREATE TABLE StringData
(
      String varchar(max)
)
 
Next we will populate the table with data. We will use INSERT statements to do this, but we could have used the Import Data function in SMS, the BULK INSERT function in T-SQL, or even the bcp utility.
 
Next we will populate the table with data. We will use INSERT statements to do this, but we could have used the Import Data function in SMS, the BULK INSERT function in T-SQL, or even the bcp utility.
INSERT INTO StringData
SELECT 'Students; Harry Potter; Ronald Weasley; Hermione Granger; Ginerva Weasley; Neville Longbottom; Luna Lovegood'
UNION ALL
SELECT 'Professors; Albus Dumbledore; Severus Snape; Horace Slughorn; Rubeus Hagrid; Filius Flitwick; Minerva McGonagall; Poppy Pomfrey; Quirinius Quirrell; Pomona Sprout; Sybill Trelawney; Cuthbert Binns; Charity Burbage; Wilhelmina Grubbly-Plank; Rolanda Hooch; Aurora Sinistra; Septima Vector'
UNION ALL
SELECT 'Order of the Phoenix Members; Alastor Moody; Sirius Black; Minerva McGonagall; Remus Lupin; Aberforth Dumbledore; Albus Dumbledore; Arabella Figg; Mundungus Fletcher; Rubeus Hagrid; Peter Pettigrew; James Potter; Lily Potter; Severus Snape; Fleur Delacour; Kingsley Shacklebolt; Nymphodora Tonks; Arthur Weasley; Molly Weasley; Bill Weasley; Charlie Weasley'
UNION ALL
SELECT 'Death Eaters; Lord Voldemort; Antonin Dolohov; Bellatrix Lestrange; Lucious Malfoy'
UNION ALL
SELECT 'Places; Diagon Alley; Hogsmeade; Hogwarts School of Witchcraft and Wizardry; The Burrow; Knockturn Alley; Little Whinging; Azkabhan'
UNION ALL
SELECT 'Organizations; Order of the Phoenix; Death Eaters; Ministry of Magic; Dumbledore’s Army'
UNION ALL
SELECT 'Creatures; Hippogriffs; Werewolves; Basilisk; House-elves; Inferi; Dementors; Owls; Boggarts; Centaurs; Ghosts; Goblins; Thestrals'
UNION ALL
SELECT 'Objects; Wand; Broomstick; Pensieve; Time-Turner; Sorcerer’s Stone'
UNION ALL
SELECT 'Houses; Gryffindor; Hufflepuff; Ravenclaw; Slytherin'
UNION ALL
SELECT 'Subjects; Transfiguration; Defence Against the Dark Arts; Charms; Potions; Astromony; History of Magic; Herbology; Arithmancy; Ancient Runes; Divinations; Care of Magical Creatures; Muggle Studies; Flying; Apparition'
UNION ALL
SELECT 'Ghosts; The Bloody Baron; The Fat Friar; The Grey Lady; Nearly Headless Nick; Moaning Myrtle'
UNION ALL
SELECT 'Ministry of Magic; Arthur Weasely; Kingsley Shacklebolt; Nymphodora Tonks; Ludo Bagman; Barty Crouch, Sr.; John Dawlish; Cornelius Fudge; Bertha Jorkins; ; Rufus Scrimgeur; Pius Thicknesse; Dolores Umbridge; Percy Weasley'
UNION ALL
SELECT 'Headmasters; Albus Dumbledore; Dolores Umbridge; Severus Snape; Minerva McGonagall; Phineas Nigellus Black; Dexter Fortescue; Armando Dippet'
UNION ALL
SELECT 'Dragons; Antipodean Opaleye; Chinese Fireball; Common Welsh Green; Hebridean Black; Hungarian Horntail; Norwegian Ridgeback; Peruvian Vipertooth; Romanian Longhorn; Swedish Short-Snout; Ukranian Ironbelly'
 
 
Here is what the data look like after it has been inserted.
 
 
 
Let’s say we wanted to know all the categories for Albus Dumbledore. We could use the following SELECT statement
select wct.TRIM(wct.PARSE(String,'\;', 1)) as [Dumbledore Categories]
FROM StringData
where string like '%Albus Dumbledore%'
Which returns the following result:

This SQL retrieves the first word (words are delimited by semicolons) of any records containg ‘Albus Dumbledore’. As mentioned earlier, the data is structured so that the first word of each record represents the category for the remaining words in the record, which represent the members of the category.
We used 2 XLeratorDB functions in this statement. The PARSE function gets a specific ‘word’ in a delimited string. We have specified that we want the first word in the string, and that the string is delimited by semi-colons. Since the PARSE function allows you to specify multiple delimiters and it uses the semi-colon as its delimiter, we need to use the ‘\’ as the escape character. The TRIM function removes the leading and trailing spaces from the result of the PARSE function.
If we wanted to, we could use another XLeratorDB function, JOINSTR_q, to put all the categories for Albus Dumbledore into a single line seperated by a semi-colon.
Select wct.JOINSTR_q(
'; '
,NULL
,'select wct.TRIM(wct.PARSE(String,' + wct.quotes('\;') + ', 1))
FROM StringData
where string like ' + wct.QUOTES('%Albus Dumbledore%') + ' order by 1') as [Dumbledore Categories]
 Which returns the following result:

Now, we are going to create a list that shows categories and and members of the category in a normalized fashion. To do this, we are going to use a recursive CTE using the PARSE and TRIM functions.
with mycte (Category, Member, Wordno) as
(
      select wct.TRIM(wct.PARSE(String,'\;', 1))
      ,wct.TRIM(wct.PARSE(String,'\;', 2))
      ,2
      from StringData
      Union all
      select mycte.Category
      ,wct.TRIM(wct.PARSE(String,'\;', wordno + 1))
      ,mycte.wordno + 1
      from StringData, mycte
      where mycte.category = wct.PARSE(StringData.String,'\;', 1)
      and mycte.wordno < wct.numwords(StringData.String, '\;')
)
      Select category, member from mycte order by 1, 2
This creates a CTE with 126 rows and looks like this:

Since the CTE is created on the fly and only exists within the execution scope of the SELECT statement, we are going to put the CTE into a temporary table (though it could just as easily be a permanent table) and do some further data manipulation on it. This statement will create the temporary table and insert the date into it.
CREATE table HP#
(
      category varchar(max)
      ,member varchar(max)
)    
 
;with mycte (Category, Member, Wordno) as
(
      select wct.TRIM(wct.PARSE(String,'\;', 1))
      ,wct.TRIM(wct.PARSE(String,'\;', 2))
      ,2
      from StringData
      Union all
      select mycte.Category
      ,wct.TRIM(wct.PARSE(String,'\;', wordno + 1))
      ,mycte.wordno + 1
      from StringData, mycte
      where mycte.category = wct.TRIM(wct.PARSE(StringData.String,'\;', 1))
      and mycte.wordno < wct.numwords(StringData.String, '\;')
) insert into HP# select Category, Member from mycte
If we want to know all Death Eaters, for example, we would run the following SELECT statement.
select *
from HP#
where category = 'Death Eaters'
Which returns the following result:

We could also find out how many entries there are in each category:
select Category
,count(*) [Number of Members]
from HP#
group by Category
Which returns the following result:

Here, we will find all the members that are in more than one category, display the members alphabetically, and list all the categories of which they are members in a semi-colon delimited string.
select a.member,
wct.JOINSTR_q(
      '; ',
      NULL,
      'Select b.category
      from HP# b
      where b.member = ' + wct.QUOTES(a.member) + 'order by 1') as Categories
From HP# A
Group by member
HAVING count(*) > 1
Which returns the following result:

As you can see, creating normalized data from a delimited string by using XLeratorDB did not reuqire any stored procedures, cursors, declare or set statements, or while statements. Using the PARSE and JOINSTR functions we were able to take a string, normalize it, and repackage it using mostly SELECT statements and common table expressions.

 

 

Tags:
Categories:
Copyright 2008-2017 Westclintech LLC         Privacy Policy        Terms of Service