‘urltools’ R package for the win!

20 January 2017. Tags: analytics, data, programming, r

Yesterday, we urgently needed to sort through 247,000 URLs to find parameter issues. I started on the task in R, using a few packages to strip the parameters out of the URLs. However, I found myself writing a lot of code to extract the parameters and then clean the extracted values, and time was against us. Then it occurred to me that I was probably not the first person to have to do this sort of work in R. I went to Google, typed “strip URL parameters”, and hit return. The first link, at least for me, was the urltools package. Everything I had been trying to code myself could be done with a few function calls.

A quick rundown of what I needed to do: check the scheme (http or https), check the domain name, check the path that a guest will be directed to on the site, and finally check each of the parameters for accuracy. The parameter extraction also covers parameters that the website cannot digest (improperly capitalized parameters are ignored by the site). Writing the script took only a few minutes after reading the documentation. With it, I broke the 247,000 URLs out into their respective columns and wrote them to a CSV file in about 3 seconds. Don’t get me wrong, there is nothing more gratifying to me than writing my own code and getting the results, but I also see no value in reinventing the wheel. If it’s there and does what I need it to do, I’m going to use it!
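To make the capitalization problem concrete, here is a minimal sketch of one way to flag parameter names that match an expected name only when case is ignored. It is deliberately written in base R so it runs without any packages, and it is not the script I used. The `expected` names, the example URLs, and the `query_names()` helper are all made up for illustration.

```r
# Hypothetical parameter names the site understands; substitute your own.
expected <- c("utm_source", "utm_medium", "cid")

# Hypothetical example URLs.
urls <- c(
  "https://www.example.com/offers?utm_source=email&utm_medium=cpc",
  "https://www.example.com/offers?UTM_Source=email&cid=42",
  "http://www.example.com/home"
)

# Return the parameter names found in one URL's query string.
query_names <- function(url) {
  if (!grepl("?", url, fixed = TRUE)) return(character(0))
  query <- sub("^[^?]*\\?", "", url)  # drop everything up to the first "?"
  query <- sub("#.*$", "", query)     # drop any trailing fragment
  pairs <- strsplit(query, "&", fixed = TRUE)[[1]]
  sub("=.*$", "", pairs)              # keep the name part of name=value
}

# Keep names that match an expected name only when case is ignored.
miscased <- lapply(urls, function(u) {
  found <- query_names(u)
  found[tolower(found) %in% tolower(expected) & !(found %in% expected)]
})

flagged <- data.frame(
  url = urls,
  miscased = vapply(miscased, paste, character(1), collapse = ", "),
  stringsAsFactors = FALSE
)

# Show only the URLs with at least one mis-capitalized parameter name.
print(flagged[flagged$miscased != "", ])
```

With a real data set, the `urls` vector would come from the landing page column, and the flagged rows are the ones worth sending back for correction.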

The package has many other useful functions and is, like most R packages, well documented with good examples.

Below is the script I used for my URL accuracy check. The only change is the parameter vector passed to param_get(), where placeholder names stand in for the real ones.

Happy parsing!

library(urltools)

# Read in the URLs; the LandingPage column holds the full URL strings
dat <- read.csv("LandingPageURLs.csv", stringsAsFactors = FALSE)

# Extract the scheme, domain, and path from the URLs
dat$scheme <- scheme(dat$LandingPage)
dat$domain <- domain(dat$LandingPage)
dat$path <- path(dat$LandingPage)

# Extract particular parameters from the URLs (placeholder names shown here)
x <- param_get(dat$LandingPage, c("parameter1", "parameter2", "parameter3", "etc"))

# Combine the original data set and the extracted parameters into a new data frame
datNew <- cbind(dat, x)

# Write the data frame to CSV
write.csv(datNew, "URL-Split.csv")
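The script above only splits the URLs into columns. The checks themselves can then be run on those columns. What follows is a rough sketch of that step, not the check I actually ran: the small `checked` data frame stands in for the split data so the example runs on its own, and `expected_domain`, `allowed_paths`, and the example URLs are hypothetical. It also assumes the path values carry no leading slash, so adjust that to match your own data.

```r
# Stand-in for the split data; in practice this would be datNew from the script.
checked <- data.frame(
  LandingPage = c(
    "https://www.example.com/offers",
    "ftp://www.example.com/offers",
    "http://example.org/home"
  ),
  scheme = c("https", "ftp", "http"),
  domain = c("www.example.com", "www.example.com", "example.org"),
  path = c("offers", "offers", "home"),
  stringsAsFactors = FALSE
)

# Hypothetical expectations; replace with the values for your own site.
expected_domain <- "www.example.com"
allowed_paths <- c("offers", "home")

# %in% returns FALSE rather than NA for missing values, so rows with an
# unparseable scheme, domain, or path are flagged instead of silently dropped.
checked$scheme_ok <- checked$scheme %in% c("http", "https")
checked$domain_ok <- tolower(checked$domain) %in% expected_domain
checked$path_ok <- checked$path %in% allowed_paths

# Keep every row that fails at least one check.
problems <- checked[!(checked$scheme_ok & checked$domain_ok & checked$path_ok), ]
print(problems[, c("LandingPage", "scheme_ok", "domain_ok", "path_ok")])
```

Keeping one logical column per check makes it easy to see at a glance which part of a flagged URL is wrong.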
