Extracting Data with an API and RStudio

30 July 2016 · data, r

Recently, I needed to extract analytics data from our provider before the last day of our contract. We were moving to a much more robust system, and they were being less than helpful and very unresponsive. That was when I remembered that they have an API available for extracting data. The API lets you select a view for the report (metrics and segments that I pre-configured) and select the reporting period (day, week, month, quarter, fiscal year). Since I wanted the data extracted by day, that meant I would need to run the report, or update the API URL, 943 times.

I pride myself on being an efficient person, and I knew that processing these files manually would waste a lot of time. While looking for a better way to handle all of these files, I remembered that R has a download.file() function, and that I could wrap it in a for loop that steps through each day and downloads that day's file. The code below is a refactored version of the original script I created. Version 1 was rough, like really rough. I did some things backwards, so be glad that you're not reading that version of the code.
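
The shape of that loop is simple. Here's a minimal sketch of the idea (with a placeholder URL, not the real service):

days = seq(as.Date("2013-12-01"), as.Date("2016-06-30"), by = "day")
for (i in seq_along(days)) {
    # Build a day-specific URL and save the response to a day-specific file
    download.file(paste0("https://example.com/report?date=", days[i]),
                  paste0("report-", days[i], ".csv"))
}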

The analytics company we were using structured their API using the following format.  Note: this is not a real URL, but the structure is similar.

https://api.thisparticularanalyticscorp.com/extract?userid=XXXX&view=pagecontent&format=CSV&date=D20130130&filename=D20130130.csv

The URL needed two sections (both dates) updated in order to get the service to return the CSV. To get the date field into the format the API expects (YYYYMMDD, as in D20130130 above), I used the gsub() function to remove the hyphens from R's default date format.

The gsub() function searches a string for matches to a specified pattern. When a match is found, it replaces the matched value with your replacement string.

gsub("stringtomatch", "replacementstring", stringtosearch)

The final bit, and probably the most important, was getting the data combined and into a usable format. This included adding new categorical variables to the data set. I knew I needed the date of the report and the fiscal year (Dec-Nov) added to make future analysis easier. Without adding the date to a new column, there could be hundreds of thousands of rows in the dataset, but no date on which to aggregate. Fiscal year is a bit tricky since my company's fiscal year starts in December and not January. I wrote about this in a previous post, but the lubridate package currently doesn't allow you to modify when your year starts, so most of the built-in functions that provide aggregation (month, week, quarter) are not accurate for this part of the script. To get around this, I wrote a simple ifelse() statement that adds 1 to the year of the report date if the month is December. Otherwise, ifelse() returns the current year of the report date.
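
In isolation, the fiscal year logic looks like this (using a sample December date rather than the loop variable):

library(lubridate)

d = as.Date("2013-12-15")
# December rolls forward into the next fiscal year
ifelse(month(d) == 12, year(d) + 1, year(d))
# [1] 2014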

Depending on the data being extracted and the responsiveness of the service providing the files, the entire process could take anywhere from a few minutes to several hours. When I know that the CSV files being returned are going to be massive, I just download the files by commenting out the read.csv and rbind statements. For instance, last night I needed to download 943 CSV files containing daily metrics for all of our web pages. Each file contained about 13,000 records and had an average file size of 4.3 MB. This would have resulted in a data frame of over 12.2 million rows. I would love to have done this, but my computer would have died. Due to the number of records that would be in the data frame, I just downloaded the files and will process them at a later time. Chances are high that we will only need a subset of the data for analysis. If we end up needing all 12M records, I'll have to find another solution.
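
When that later processing happens, something like the sketch below should work: read the saved files one at a time and keep only the rows of interest before combining them (the Pageviews column and the threshold are hypothetical, just to show the filtering step):

files = list.files(pattern = "^PageContents-.*\\.csv$")
subsets = lapply(files, function(f) {
    dat = read.csv(f, stringsAsFactors = FALSE)
    dat[dat$Pageviews > 100, ]  # hypothetical filter; keep only high-traffic rows
})
allDat = do.call(rbind, subsets)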

Full Code:

library(lubridate)

startTime = Sys.time()

# One entry for every day in the reporting window
days = seq(as.Date("2013-12-01"), as.Date("2016-06-30"), by = "day")

allDat = data.frame()
for (i in seq_along(days)) {
    # The API wants dates as YYYYMMDD, so strip the hyphens
    seqdate = gsub("-", "", days[i])

    # Splice the formatted date into both date fields of the URL
    url = paste(
             "https://api.thisparticularanalyticscorp.com/extract?userid=XXXX&view=pagecontent&format=CSV&date=D",
             seqdate,
             "&filename=D",
             seqdate,
             ".csv",
             sep = "")
    filename = paste("PageContents-", seqdate, ".csv", sep = "")

    download.file(url, filename)

    dat = read.csv(filename, stringsAsFactors = F)

    # Add the report date so the combined data can be aggregated by day
    dat$ReportDate = seqdate

    # Fiscal year runs Dec-Nov: December belongs to the next fiscal year
    dat$FiscalYear = ifelse(month(days[i]) == 12, year(days[i]) + 1, year(days[i]))

    allDat = rbind(allDat, dat)
}

endTime = Sys.time()

duration = endTime - startTime

duration
