3) Searching with Regular Expressions

This Blog entry is from the Loading and Shaping section in Learn R.

The substr() function was previously used to test whether a name begins with the string "Richard".  However, substr() is a limited function: it extracts characters by position, so it assumes a certain amount of structure exists in the base string.  The grepl() function instead searches a character string using regular expressions rather than location-based arguments.  A regular expression is a sequence of symbols and characters that describes a pattern to search for within a longer piece of text.  Regular expressions can be quite complex, but they are extraordinarily powerful for string matching.
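As a rough illustration of how a pattern changes what grepl() matches, the sketch below tries a few common anchors. The Candidates vector and the result names are assumptions made up for this example and are not part of the procedure.

```r
# Illustrative names only; this vector is an assumption for the sketch.
Candidates <- c("Richard", "Sir Richard", "richard", "Richardo")

# "^" anchors the pattern to the start of the string.
StartsWith <- grepl("^Richard", Candidates)

# Without an anchor, the pattern may match anywhere in the string.
Contains <- grepl("Richard", Candidates)

# "$" anchors to the end, so "^Richard$" only matches the whole string.
Exact <- grepl("^Richard$", Candidates)

# ignore.case = TRUE makes the match case-insensitive.
AnyCase <- grepl("^richard", Candidates, ignore.case = TRUE)
```

Each call returns one logical value per element of Candidates, so the results can be compared side by side.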

This procedure replicates the substr() approach using regular expressions and the grepl() function, searching for any string that starts with "Richard" by anchoring the pattern with the ^ symbol:

NamesGrepl <- grepl("^Richard",NamesSubstr)

Run the line of script to console.

Write the NamesGrepl vector out to console by typing:

NamesGrepl

It can be observed that grepl() has returned a logical vector, with TRUE for each name string starting with "Richard".  To make this abstraction useful for machine learning, it is a simple matter of transforming it to a numeric vector by typing:

NamesGreplNumeric <- as.numeric(NamesGrepl)

Run the line of script to console.

Write out the NamesGreplNumeric vector by typing:

NamesGreplNumeric

Run the line of script to console.

It can be seen that this vector is now more appropriate for machine learning.  By nesting the functions, the procedure could be written more succinctly by typing:

NamesGrepNumericNested <- as.numeric(grepl("^Richard",Names))
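For reference, the whole procedure can be sketched end to end as follows. It reuses the Names vector from this section's examples and applies the pattern to Names directly rather than to NamesSubstr; the commented output is what R should print for that vector.

```r
# The Names vector used throughout this section.
Names <- c("Richard", "Robert", "Reinhard", "Raymond", "Richardino", "Richardo")

# TRUE wherever the name starts with "Richard".
NamesGrepl <- grepl("^Richard", Names)
print(NamesGrepl)
# [1]  TRUE FALSE FALSE FALSE  TRUE  TRUE

# Convert the logical vector to 1s and 0s.
NamesGreplNumeric <- as.numeric(NamesGrepl)
print(NamesGreplNumeric)
# [1] 1 0 0 0 1 1
```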

2) Extracting a substring from a string, testing logically and presenting for machine learning.

In Horizontal Abstraction, it is quite common to need to inspect a string of data for an occurrence (or pattern) and return a logical value that can be used in machine learning.

In this example, each string will be inspected and a 1 returned in the event that the string begins with "Richard".

Firstly, create a vector of name strings by typing:

Names <- c("Richard","Robert","Reinhard","Raymond","Richardino","Richardo")

Run the line of script to console.

Use the substr() function to create a vector of the first 7 characters of each value contained in the Names vector, by typing:

NamesSubstr <- substr(Names,1,7)

Write the NamesSubstr vector out to console by typing:

NamesSubstr

Run the line of script to console.

The question being posed is whether the first seven characters of each name are equal to "Richard".  To perform this evaluation, create a logical vector from the NamesSubstr vector by typing:

NamesSubstrLogical <- NamesSubstr == "Richard"

Notice how a double equals sign (==) is used for the comparison, which eliminates confusion between evaluation and assignment (<- or a single =).
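The distinction between assignment and comparison can be sketched as below. The FirstName and IsRichard names are hypothetical and are used only for this illustration.

```r
# Assignment: "<-" stores a value under a name.
FirstName <- "Richard"

# Comparison: "==" tests equality and returns a logical value.
IsRichard <- FirstName == "Richard"
print(IsRichard)
# [1] TRUE

# "==" is vectorised, so comparing a vector returns one result per element.
print(c("Richard", "Robert") == "Richard")
# [1]  TRUE FALSE
```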

Run the line of script that creates NamesSubstrLogical to console.

Write the logical vector out to console by typing:

NamesSubstrLogical

Run the line of script to console.

Logical TRUE or FALSE values are not always accepted directly by machine learning routines, which generally expect numeric input, and it follows that these values should be converted to numeric values using the as.numeric() function, by typing:

NamesSubstrLogicalNumeric <- as.numeric(NamesSubstrLogical)

Run the line of script to console.

Write the newly created vector to console by typing:

NamesSubstrLogicalNumeric

Run the line of script to console. A more concise line of script nesting the functions might be:

NamesSubstrLogicalNumericNested <- as.numeric(substr(Names,1,7) == "Richard")
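As a quick check, the sketch below runs the step-by-step and nested versions against the same Names vector and confirms that they agree. It is a minimal, self-contained restatement of the lines above rather than anything new.

```r
# The Names vector used throughout this section.
Names <- c("Richard", "Robert", "Reinhard", "Raymond", "Richardino", "Richardo")

# Step-by-step version, as in the procedure above.
NamesSubstr <- substr(Names, 1, 7)
NamesSubstrLogical <- NamesSubstr == "Richard"
NamesSubstrLogicalNumeric <- as.numeric(NamesSubstrLogical)

# Nested version.
NamesSubstrLogicalNumericNested <- as.numeric(substr(Names, 1, 7) == "Richard")

print(NamesSubstrLogicalNumericNested)
# [1] 1 0 0 0 1 1

# Both routes should produce the same vector.
identical(NamesSubstrLogicalNumeric, NamesSubstrLogicalNumericNested)
# [1] TRUE
```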

An alternative approach might be converting the logical vector to a factor.
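A minimal sketch of the factor route is shown below. The labels "NotRichard" and "Richard" and the NamesSubstrFactor name are assumptions chosen for illustration; any labels could be used.

```r
# The Names vector used throughout this section.
Names <- c("Richard", "Robert", "Reinhard", "Raymond", "Richardino", "Richardo")
NamesSubstrLogical <- substr(Names, 1, 7) == "Richard"

# Convert the logical vector to a two-level factor.
# The label names here are hypothetical.
NamesSubstrFactor <- factor(NamesSubstrLogical,
                            levels = c("FALSE", "TRUE"),
                            labels = c("NotRichard", "Richard"))

# Count how many names fall into each level.
table(NamesSubstrFactor)
```

One point to watch with this approach: calling as.numeric() on a factor returns the internal level codes (1 and 2 here), not 0 and 1, so the factor is best passed to functions that handle factors directly.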