Geocoding Addresses in R with ggmap

16. March 2016 data, programming, r 2

A few months ago, I needed to geocode over 1,600 addresses so that I could plot them on a map. The biggest problem was figuring out how to geocode these addresses efficiently. Several web services will geocode one address at a time, but that would have taken forever. Other websites offer to geocode the addresses for a price, but I needed to do it on the cheap.

After researching solutions for a few hours and going through several packages in R, I stumbled upon the ggmap package. This package contains a function that takes an address, uses Google to grab the latitude and longitude, and finally returns it to R. The “magic” comes in the form of a for loop that reads all of the addresses in one by one and stores the geocoded address back into the data frame. So, let’s get into it!

Script breakdown

First things first: if you don’t have the ggmap package installed, run install.packages("ggmap") and follow the onscreen prompts. Once it’s installed, fire up a new R script file in RStudio and load the package with library(ggmap). Also, Google restricts how many addresses you can geocode in a day. Check out the package documentation or run geocodeQueryCheck(userType = "free") in the console to see the current rate limits and what you’ve used.
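Recent versions of ggmap expect a Google API key to be registered before geocode() will talk to Google. The following is a minimal sketch of that setup step, not part of my original script. It assumes the key is stored in an environment variable; the name GOOGLE_MAPS_API_KEY is a made-up example, so use whatever name you prefer.

```r
# Sketch: register a Google API key for ggmap if one is available.
# GOOGLE_MAPS_API_KEY is a hypothetical environment variable name.
api_key <- Sys.getenv("GOOGLE_MAPS_API_KEY")
key_available <- nzchar(api_key)

if (key_available && requireNamespace("ggmap", quietly = TRUE)) {
  # register_google() stores the key for the current R session
  ggmap::register_google(key = api_key)
} else {
  message("ggmap is not installed or no API key was found; skipping registration.")
}
```

Keeping the key in an environment variable rather than in the script makes it harder to share it by accident along with the code.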

I wanted to know how long it took for the script to run, so I used a super basic way to calculate the processing time. I stored the current Sys.time() into a variable called startTime. At the end of the script, the processing time is calculated.

# Set the time the script fired
startTime <- Sys.time()

For my implementation, I wanted to select the file that contains the addresses. When run, the file.choose() function pops up a file chooser window. My address file contains about 30 variables (columns), one of which is the Site.Address column that contains the full street address.

Once the file has been selected, I read the CSV file into the origAddress variable. stringsAsFactors needs to be FALSE; otherwise the addresses will be imported as a factor instead of a character vector and the geocoding will not work. (Since R 4.0.0, FALSE is the default, but setting it explicitly does no harm.)

# Select the file from the file chooser
fileToLoad <- file.choose()

# Read in the CSV data and store it in a variable 
origAddress <- read.csv(fileToLoad, stringsAsFactors = FALSE)
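It is worth confirming that the address column really came in as character before sending anything to Google. Here is a small sketch of that check. It uses a made-up two-row file written to a temporary path instead of my real address file, and demoAddress is just an illustrative name.

```r
# Write a tiny, made-up address file to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Site.Name,Site.Address",
  "\"Site A\",\"1600 Pennsylvania Ave NW, Washington, DC\"",
  "\"Site B\",\"350 Fifth Ave, New York, NY\""
), tmp)

# Read it the same way as the real file
demoAddress <- read.csv(tmp, stringsAsFactors = FALSE)

# The column must exist and must be character, not factor
stopifnot("Site.Address" %in% names(demoAddress))
is.character(demoAddress$Site.Address)
```

If the last line prints FALSE for your own file, the column was read as something other than character and should be converted with as.character() before geocoding.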

 

The script initializes an empty data frame called geocoded, although the loop below writes its results straight into origAddress. Next is the meat of the script: the for loop. The loop runs once for each row in the origAddress dataset, which tells it when to stop. As part of the geocode function, I pass in the address at position i, ask for the latlona output (latitude, longitude, and address), and tell it to use google as my source. Once the geocode function returns the longitude, latitude, and the address it matched, I tell R to push this information into new columns of origAddress called lon, lat, and geoAddress. Once the for loop ends (after reaching the last row in the origAddress dataset), I tell R to write origAddress to a new CSV file called geocoded.csv in my working directory.

# Initialize the data frame
geocoded <- data.frame(stringsAsFactors = FALSE)

# Loop through the addresses to get the latitude and longitude of each address and add it to the
# origAddress data frame in new columns lat and lon
for(i in seq_len(nrow(origAddress)))
{
  # print("Working...")
  result <- geocode(origAddress$Site.Address[i], output = "latlona", source = "google")
  # Pick the columns by name rather than by position
  origAddress$lon[i] <- as.numeric(result$lon[1])
  origAddress$lat[i] <- as.numeric(result$lat[1])
  origAddress$geoAddress[i] <- as.character(result$address[1])
}

# Write a CSV file containing origAddress to the working directory
write.csv(origAddress, "geocoded.csv")
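As written, the loop stops if a lookup throws an error or comes back without the expected columns. One way to keep going is to wrap the call in tryCatch() and fall back to NA values. This is only a sketch: safe_geocode and stub_geocoder are names I made up for illustration, and the stub stands in for the real geocode() so the example runs without contacting Google.

```r
# Wrap a geocoder so a failed lookup yields NA values instead of stopping.
# `geocoder` defaults to ggmap::geocode, but any function that takes an
# address and returns lon, lat, and address columns will work.
safe_geocode <- function(address, geocoder = ggmap::geocode, ...) {
  empty <- data.frame(lon = NA_real_, lat = NA_real_,
                      address = NA_character_, stringsAsFactors = FALSE)
  tryCatch({
    result <- as.data.frame(geocoder(address, ...))
    has_cols <- all(c("lon", "lat", "address") %in% names(result))
    if (nrow(result) == 0 || !has_cols) empty else result[1, c("lon", "lat", "address")]
  }, error = function(e) {
    message("Skipping '", address, "': ", conditionMessage(e))
    empty
  })
}

# Offline stand-in for the real geocoder (hypothetical, demo only)
stub_geocoder <- function(address, ...) {
  if (!nzchar(address)) stop("empty address")
  data.frame(lon = -77.0365, lat = 38.8977, address = tolower(address),
             stringsAsFactors = FALSE)
}

# Made-up input: one usable address and one that makes the stub fail
demo <- data.frame(Site.Address = c("1600 Pennsylvania Ave NW, Washington, DC", ""),
                   stringsAsFactors = FALSE)
demo$lon <- NA_real_
demo$lat <- NA_real_
demo$geoAddress <- NA_character_

for (i in seq_len(nrow(demo))) {
  result <- safe_geocode(demo$Site.Address[i], geocoder = stub_geocoder)
  demo$lon[i] <- result$lon
  demo$lat[i] <- result$lat
  demo$geoAddress[i] <- result$address
}
demo
```

With the real service, the call inside the loop would presumably become safe_geocode(origAddress$Site.Address[i], output = "latlona", source = "google"). Rows that could not be geocoded keep NA in lon, lat, and geoAddress, so they are easy to find and retry later.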

 

The last part of the script stores the current Sys.time() into a new variable called endTime. Then, I used simple subtraction to calculate the time difference and report it to the console.

endTime <- Sys.time()

# calculate and print the time difference (processing time)
processingTime <- endTime - startTime
processingTime
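Subtracting two Sys.time() values lets R pick the units (seconds, minutes, hours) on its own, so the printed number can mean different things from run to run. If you want a fixed unit, difftime() accepts one explicitly. A short sketch, with Sys.sleep() standing in for the geocoding loop and startDemo, endDemo, and elapsedSecs as illustrative names:

```r
startDemo <- Sys.time()
Sys.sleep(0.1)  # stand-in for the geocoding loop
endDemo <- Sys.time()

# Ask for seconds so the number is comparable between runs
elapsedSecs <- as.numeric(difftime(endDemo, startDemo, units = "secs"))
cat(sprintf("Processing took %.2f seconds\n", elapsedSecs))
```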

 

Full Script that I created

library(ggmap)

# Set the time the script fired
startTime <- Sys.time()

# Select the file from the file chooser
fileToLoad <- file.choose()

# Read in the CSV data and store it in a variable 
origAddress <- read.csv(fileToLoad, stringsAsFactors = FALSE)

# Initialize the data frame
geocoded <- data.frame(stringsAsFactors = FALSE)

# Loop through the addresses to get the latitude and longitude of each address and add it to the
# origAddress data frame in new columns lat and lon
for(i in seq_len(nrow(origAddress)))
{
  # print("Working...")
  result <- geocode(origAddress$Site.Address[i], output = "latlona", source = "google")
  # Pick the columns by name rather than by position
  origAddress$lon[i] <- as.numeric(result$lon[1])
  origAddress$lat[i] <- as.numeric(result$lat[1])
  origAddress$geoAddress[i] <- as.character(result$address[1])
}

# Write a CSV file containing origAddress to the working directory
write.csv(origAddress, "geocoded.csv")

endTime <- Sys.time()

# calculate and print the time difference (processing time)
processingTime <- endTime - startTime
processingTime

 

At some point, I’ll put together a new post on how I combined the geocoding in R with another package to create the map below.

Map showing red dots identifying location


2 thoughts on “Geocoding Addresses in R with ggmap”

  • Steve on October 6, 2016

    Thanks for this. Really helped me out. One question though: the loop stops when a result is not found. How do I change the code to skip and keep going if there is no result for a given address?

    Thanks.
