SAS has several pre-defined character as well as numeric functions, which are used on the variable values to generate some required outputs. Lets understand this with an example. A retail company wants to know about customer contact details based on their address. But the problem is they have single field data for customer address. Now an address may contain Name, Place, Telephone no., City, State, and PIN Code. One needs to seperate these values under different heads or variables.
This can be performed using various character functions in SAS.
CUST_ID ADD
110 AJAY SINGH SEC.105, NOIDA 988134532
112 BABITA KUMARI NAYI GALI DELHI 893324511
113 SATVINDER GURUGRAM 01244321912
Looking at above address data one might use various combination of SAS character or Numeric functions to get valuable information out of it. But before we write any code, lets first get introduced with SAS Character functions & SAS numeric functions.
Character Functions:
Below is a table mentioning some important character functions. I will explain some of them in detail.
Compress function: A compress function is used to remove blanks and a set of characters or punctuation or numerals, mentioned as in second argument. There is a third argument also, which is called as function modifier & produce different results as per requirement. A list of compress function modifier is mentioned below.
Compress function syntax is
COMPRESS ( STRING-NAME, ' TO BE REMOVED CHARACTERS', MODIFIER)
Lets apply compress function to above example & try to take out contact nos. from address.
/* First let's create a dataset CUST_DATA as mentioned in above example.*/
DATA CUST_DATA;
LENGTH ADD$ 50.;
INPUT CUST_ID ADD$ &;
DATALINES;
110 AJAY SINGH SEC.105, NOIDA 988134532
112 BABITA KUMARI NAYI GALI DELHI 893324511
113 SATVINDER GURUGRAM 01244321912
;
RUN;
DATA CUST_CONT;
SET CUST_DATA;
CONTACT = COMPRESS(ADD, '1234567890', 'k');
DROP ADD;
RUN;
We will get below dataset. Which has only numbers under variable "Contact".
For customer id 110 contact no. obtained is not correct. It has some part of address in it. We have to modify our code, so that it contains only contact no. Also a point to notice is that contact no. is still a character variable.(Notice, contact no. is left aligned.) We can easily convert a character string of numbers into a numeric variable using PUT function, and vice-versa using an INPUT function. I will describe the same in detail.
Looking at above address data one might use various combination of SAS character or Numeric functions to get valuable information out of it. But before we write any code, lets first get introduced with SAS Character functions & SAS numeric functions.
Character Functions:
Below is a table mentioning some important character functions. I will explain some of them in detail.
Sr. No. | Function | Application |
1 | Length (string-name) | Retruns length of a character string not counting trailing blank. |
2 | Lengthn (string-name) | Similar to Length function, except it returns, a '0', if a blank value. |
3 | Lengthc( string-name) | Return storage length of string. |
4 | Upcase (string) | Converts whole string to uppercase |
5 | Lowcase (string) | Converts whole string to lowcase |
6 | Propcase(string) | Converts whole string to sentance case |
7 | Compbl (string) | Converts 2 or more blank to single blank between two words |
8 | Compress (string) | removes blank or character specified |
9 | CAT(string1, string2) | It joins 2 string & output string length - 200, if variable length is not declared |
10 | CATS(string1, string2) | removes leading & trailing blank before joining 2 strings |
11 | CATX('seperator', string1, string2) | removes leading & trailing blank, & then put a seperator b/w 2 strings |
12 | Left(string) | removes leading blanks |
13 | Trim(string) | removes trailing blanks |
14 | Strip(string) | removes all blanks |
15 | Find(string, 'find-string', modifier) | Returns the position of find-string in string, if modifier is mentioned it will make string case irrelevant. |
16 | Findc(string, 'characters') | Returns the position of first individual character provided in the 'characters'. It looks for individual character unlike 'FIND' function which searches for string. |
17 | Findw(string, find-string,delimiter, modifier) | Finds word not string, rest is similar to Find function |
18 | Anydigit() | Position of first digit is returned, if not then 0 is returned |
19 | Anypunct() | Position of punctuation is returned, if not then 0 is returned |
20 | Anyspace() | Position of blank space is returned |
21 | Anyalpha() | Position of first alphabet is returned, if not then 0 is returned |
22 | Notdigit() | If argument contains anything other than digit, then position of letter which is not digit is returned. If all digit then '0' is returned. |
23 | Notalpha() | If argument contains anything other than alphabet, then position of letter which is not alphabet is returned. If all alphabet then '0' is returned. |
24 | Notalnum() | If argument contains anything other than alphanumeric, then position of letter which is not alphabet is returned. If all alphanumeric then '0' is returned. |
25 | Verify(string-name, 'valid characters') | Returns a '0' if any valid characters are there, & if not then position of first invalid character is returned. |
26 | substr(string-name, position, length) | Returns a part of string declared as first argument, having first character mentioned as second argument as position, & length of string as third argument |
27 | scan(string-name,word position, delim) | Used to extract word from a string, second argument tells which position word to extract. |
28 | compare(string1, string2, modifier) | Used to compare 2 string, retuns a 0 if both strings are matched. If a modifier ';' is mentioned, then longer string is stripped to smaller string length & then compared. |
29 | Translate(string,'to replace letter', 'replaced by letter') | It replaces letters in the string by the third argument letters. |
30 | Transword(string, replace word, replaced by word) | It replaces words in the string by the third argument words unlike Translate which replaces only letters. |
Compress function: A compress function is used to remove blanks and a set of characters or punctuation or numerals, mentioned as in second argument. There is a third argument also, which is called as function modifier & produce different results as per requirement. A list of compress function modifier is mentioned below.
Compress function syntax is
COMPRESS ( STRING-NAME, ' TO BE REMOVED CHARACTERS', MODIFIER)
Modifier | Action |
a | Remove all Characters |
d | Remove all digits |
i | ignore string case |
k | retain all items listed in second argument |
s | blank space, tab to be added in second argument |
p | Remove all Punctuations |
Lets apply compress function to above example & try to take out contact nos. from address.
/* First let's create a dataset CUST_DATA as mentioned in above example.*/
DATA CUST_DATA;
LENGTH ADD$ 50.;
INPUT CUST_ID ADD$ &;
DATALINES;
110 AJAY SINGH SEC.105, NOIDA 988134532
112 BABITA KUMARI NAYI GALI DELHI 893324511
113 SATVINDER GURUGRAM 01244321912
;
RUN;
DATA CUST_CONT;
SET CUST_DATA;
CONTACT = COMPRESS(ADD, '1234567890', 'k');
DROP ADD;
RUN;
We will get below dataset. Which has only numbers under variable "Contact".
For customer id 110 contact no. obtained is not correct. It has some part of address in it. We have to modify our code, so that it contains only contact no. Also a point to notice is that contact no. is still a character variable.(Notice, contact no. is left aligned.) We can easily convert a character string of numbers into a numeric variable using PUT function, and vice-versa using an INPUT function. I will describe the same in detail.
Suppose if we want to take out first name of customer from the address variable. we can do so by using a combination of character function.
First we are interested in finding the first space after the first name. Below mentioned code can give the same.
DATA SPACE;
SET CUST_DATA;
n = FINDC(ADD, ' ');
RUN;
Now we know the length of string that we need to take out from ADD variable which is the first name of customer.
Using a SUBSTR function we can take out the string of a particular length.
Again adding a substring function step in above code, mentioning the length of first names, i.e. n-1.
DATA SPACE;
SET CUST_DATA;
n = FINDC(ADD, ' ');
FIRST_NAME = SUBSTR(ADD, 1, n-1);
DROP n ADD;
RUN;
Below is the output for this program.
Remember you can combine & code for contact no. & first name in single program, Here i am writing seperate codes for better understanding of the codes.
Next we will try to take out city from the address field. We know city of residence for customer is just before his contact no. & will leverage this information to take out customer's current location from add variable.
Lets first remove all the trailing blanks from the Add observation, since all observations are of different length.
This can be achieved using LEFT & TRIM function.
ADD1 = LEFT(TRIM(ADD));
Now we will use a SCAN function to look for second word, but this time from right side.
CITY = SCAN(ADD, -2);
Combining all the information & codes above.
DATA CUST_DATA_REFINED;
SET CUST_DATA;
ADD1 = LEFT(TRIM(ADD));
n= FINDC(ADD, ' ');
FIRST_NAME = SUBSTR(ADD, 1, n-1);
CITY = SCAN(ADD1,-2);
DROP n ADD ADD1;
RUN;
We will have below dataset as output.
No comments:
Post a Comment