2) Extracting a substring from a string, testing logically and presenting for machine learning.

This Blog entry is from the Loading and Shaping section in Learn R.

In Horizontal Abstraction, it is common to need to inspect a string of data for an occurrence (or pattern) and return a logical value that can be used in machine learning.

In this example, each string will be inspected and a 1 returned in the event that it begins with "Richard" (and a 0 otherwise).

Firstly, create a vector of name strings by typing:

Names <- c("Richard","Robert","Reinhard","Raymond","Richardino","Richardo")
a-script-creating-a-names-vector-of-string-type-in-r.png

Run the line of script to console:

creating-a-vector-in-the-r-console-containing-strings-which-represent-names.png

Use the substr() function to create a vector containing the first 7 characters of each value in the Names vector, by typing:

NamesSubstr <- substr(Names,1,7)
a-script-to-use-the-sybstr-function-in-r-on-a-vector-of-strings.png

Write the NamesSubstr vector out to console by typing:

NamesSubstr
a-script-on-newly-created-vector-from-the-results-of-using-the-substr-function-on-a-vector-of-names-in-r.png

Run the line of script to console:

writing-to-r-console-the-results-of-a-substr-function-filter-on-a-list-of-names.png
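As the console output is only available as a screenshot, the following minimal sketch repeats the step and shows the values that base R's substr() would be expected to return. Note that "Robert" comes back whole, since it has fewer than 7 characters, while "Reinhard" is cut to "Reinhar".

```r
# Recreate the Names vector from the start of the example
Names <- c("Richard","Robert","Reinhard","Raymond","Richardino","Richardo")

# substr() is vectorised: characters 1 to 7 are taken from every element
NamesSubstr <- substr(Names,1,7)

NamesSubstr
# [1] "Richard" "Robert"  "Reinhar" "Raymond" "Richard" "Richard"
```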

The question being posed is whether the first 7 characters of each name are equal to "Richard". To perform this evaluation, create a logical vector from the NamesSubstr vector by typing:

NamesSubstrLogical <- NamesSubstr == "Richard"
a script using the results of the substr function to match a name in r

Notice how a double equals sign (==) is used for the comparison, which avoids confusion with assignment (<- or a single equals sign).

Run the line of script to console:

a-script-using-the-results-of-the-substr-function-to-match-a-name-written-to-the-r-console.png

Write the logical vector out to console by typing:

NamesSubstrLogical
a-seperate-vector-was-created-as-the-result-of-substr-matching-and-will-be-written-out-r-script.png

Run the line of script to console:

It-can-be-seen-that-a-logical-vector-has-been-returned-because-of-equalty-in-r-console.png
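The comparison is applied to every element in turn, so the result has one TRUE or FALSE per name. A minimal sketch of the expected result follows; the last line, which uses the logical vector to pick out the matching names, is an illustrative extra rather than part of the original steps.

```r
Names <- c("Richard","Robert","Reinhard","Raymond","Richardino","Richardo")
NamesSubstr <- substr(Names,1,7)

# == compares each element with "Richard" and returns a logical vector
NamesSubstrLogical <- NamesSubstr == "Richard"

NamesSubstrLogical
# [1]  TRUE FALSE FALSE FALSE  TRUE  TRUE

# Illustrative extra: a logical vector can index the original vector
Names[NamesSubstrLogical]
# [1] "Richard"    "Richardino" "Richardo"
```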

The logical values TRUE and FALSE cannot always be used readily in machine learning, as many routines expect numeric input, and it follows that these values should be converted to numeric values (1 for TRUE, 0 for FALSE) using the as.numeric() function, by typing:

NamesSubstrLogicalNumeric <- as.numeric(NamesSubstrLogical)
a-script-turning-a-logical-vector-to-a-numeric-vector-in-r.png

Run the line of script to console:

conversion-from-logical-vector-to-a-numeric-vector-written-to-r-console.png
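To make the conversion concrete, here is a small sketch on a hand-typed logical vector (an assumption standing in for the NamesSubstrLogical vector built earlier). It shows how as.numeric() maps the values and checks the type before and after with class().

```r
# Assumed stand-in for the logical vector created in the previous step
NamesSubstrLogical <- c(TRUE,FALSE,FALSE,FALSE,TRUE,TRUE)
class(NamesSubstrLogical)
# [1] "logical"

# TRUE becomes 1 and FALSE becomes 0
NamesSubstrLogicalNumeric <- as.numeric(NamesSubstrLogical)
class(NamesSubstrLogicalNumeric)
# [1] "numeric"
```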

Write the newly created vector to console by typing:

NamesSubstrLogicalNumeric
a-script-to-write-out-a-numeric-vector-in-r.png

Run the line of script to console.

A more concise line of script nesting the functions might be:

NamesSubstrLogicalNumericNested <- as.numeric(substr(Names,1,7) == "Richard")
a-script-to-substr-function-creating-a-logical-vector-then-turning-it-into-numeric-in-r.png
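The nested line should give the same result as the step-by-step version. The sketch below shows the expected output and, as a swapped-in alternative that is not part of the original walkthrough, compares it with base R's startsWith() function (available from R 3.3.0), which tests the prefix directly without the need to count characters.

```r
Names <- c("Richard","Robert","Reinhard","Raymond","Richardino","Richardo")

# Nested version: substring, comparison and conversion in one line
NamesSubstrLogicalNumericNested <- as.numeric(substr(Names,1,7) == "Richard")
NamesSubstrLogicalNumericNested
# [1] 1 0 0 0 1 1

# Alternative (an assumption, not the original method): startsWith()
NamesStartsWithNumeric <- as.numeric(startsWith(Names,"Richard"))
identical(NamesSubstrLogicalNumericNested, NamesStartsWithNumeric)
# [1] TRUE
```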

An alternative approach might be converting the logical vector to a factor.
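A minimal sketch of the factor approach might look like the following, assuming a hand-typed logical vector in place of NamesSubstrLogical and the hypothetical name NamesSubstrFactor. One point to be aware of is that converting a factor with as.integer() returns the level codes (1 and 2 here) rather than 0 and 1.

```r
# Assumed stand-in for the logical vector created earlier
NamesSubstrLogical <- c(TRUE,FALSE,FALSE,FALSE,TRUE,TRUE)

# Fix the level order explicitly so FALSE is always the first level
NamesSubstrFactor <- factor(NamesSubstrLogical, levels = c(FALSE,TRUE))

levels(NamesSubstrFactor)
# [1] "FALSE" "TRUE"

# Level codes, not 0/1: FALSE is level 1 and TRUE is level 2
as.integer(NamesSubstrFactor)
# [1] 2 1 1 1 2 2
```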