stringr + regex = amazing

28. April 2016 data, programming, r
I had a data request come through that required me to break down submitted forms by topic of concern (i.e., why was the form submitted in the first place?).  I remembered from my earlier work with this database that the form fields were stored as key-value pairs in a single column.  That would have been fine, except that each form's string was different, because the application only stored the fields that were turned on (checked).

After searching the web, I came across a Stack Overflow question where Hadley Wickham mentioned using his stringr package.  This was great, except I had no real clue how to use regex. I had used it once, but that was in a DataCamp tutorial where I just had to copy and paste the expression in and hit submit.  That is when I ran across RegExr.com.  The site lets you paste your string into a text box and then build an expression while seeing what it matches.
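The same kind of check can be done in R before running a pattern over a whole column. What follows is only a rough sketch: the sample string is made up for illustration (the real form strings are not shown in this post), and the names sample_form and toc_pattern are placeholders of mine.

```r
library(stringr)

# Hypothetical form string, invented for illustration only;
# the real key-value layout in the database may look different.
sample_form <- '{"name":"Jane","toc":{"value":"2","subtopics":[{"id":"5"}]},"notes":"n/a"}'

# Start at the "toc" key and run through the closing }]} sequence
toc_pattern <- '"toc".+\\}\\]\\}'

# Does the pattern match at all?
has_toc <- str_detect(sample_form, toc_pattern)

# What exactly does it match?
matched <- str_extract(sample_form, toc_pattern)

print(has_toc)
print(matched)
```

Because .+ is greedy, the match runs to the last }]} in the string, so it is worth trying the pattern on a few different forms before trusting it.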

I was able to create an expression that, paired with the str_match() function from stringr, extracted the topic of concern selection from this gigantic string and added it to a new column.  In addition to the code below, I created six other for loops to extract the main subtopics from this extracted string.  Once that was complete, it was off to Excel to recode the extracted values to their actual values.

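library(stringr)
library(dplyr)

# Keep only the "NOC" forms
data2 <- data %>% filter(type == "NOC")

# Create the new column first, then pull the "toc" block out of each form string
data2$dataExtReason <- NA_character_
for (i in seq_len(nrow(data2))) {
  # str_match() returns a matrix; column 1 holds the full match (NA if none)
  data2$dataExtReason[i] <- str_match(data2$formdetails[i], '"toc".+\\}\\]\\}')[, 1]
}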
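Since stringr functions are vectorized over their input, the loop is not strictly needed. Here is a minimal sketch of the same extraction done on the whole column at once. The forms data frame and its contents are made up for illustration and only borrow the formdetails and dataExtReason column names from the code above.

```r
library(stringr)

# Hypothetical stand-in for data2; the form strings are invented
forms <- data.frame(
  id = 1:3,
  formdetails = c(
    '{"name":"A","toc":{"value":"2","subtopics":[{"id":"5"}]},"notes":"x"}',
    '{"name":"B","notes":"no topic selected"}',
    '{"toc":{"value":"4","subtopics":[{"id":"1"},{"id":"9"}]}}'
  ),
  stringsAsFactors = FALSE
)

toc_pattern <- '"toc".+\\}\\]\\}'

# One call handles every row; [, 1] keeps the full match from the result matrix
forms$dataExtReason <- str_match(forms$formdetails, toc_pattern)[, 1]

print(forms$dataExtReason)
```

Forms where the topic field was never turned on simply come back as NA, which makes them easy to count or filter out afterwards.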

Also published on Medium.

