Scraping Web Pages to Improve Search Engine Results

03. May 2017 · data, programming, r

Today, I found an issue on some pages that needed to be addressed to improve their search engine results on Google. Namely, the title tag was not being set, causing them to show up oddly in a Google search. I wasn’t sure how many pages were affected, but I needed to get the information to the SEO team for resolution. To do this, I created an R script that navigates to each web page, scrapes the HTML, extracts the needed elements from the page, and stores the results in a data frame.

There is an awesome R package called rvest, created by Hadley Wickham, which I find essential for web scraping in R. With this package, I am able to iterate through thousands of pages with just a few lines of code.
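To give a sense of how little code is involved: for a single page, pulling out the title is essentially a one-liner. A quick sketch (the URL below is just a placeholder, not one of the pages I was checking):

library(rvest)

# Read one page and pull the text of its <title> tag
read_html("https://example.com") %>%
  html_node("title") %>%
  html_text()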

Below is my script. I had all of my URLs in a CSV file, so I read the file in using the readr package (another Hadley Wickham package). Then, I used a for loop to iterate through all the page URLs, scrape the HTML, extract the title tag, and store the extracted text in a new column titled PAGE_TITLE. Finally, I wrote the data frame out to a CSV file.

library(readr)
library(rvest)
library(dplyr)

allPages = read_csv(...)
allPages$PAGE_TITLE = ""

for (i in 1:nrow(allPages)) {
  
  # Pull the URL out of the first column as a plain character value
  webAddy = allPages[[i, 1]]
  
  # html_node() returns a missing node when a page has no <title>,
  # so html_text() gives NA instead of stopping the loop
  pgTitle = read_html(webAddy) %>% html_node("title") %>% html_text()
  
  allPages$PAGE_TITLE[i] = pgTitle
}

write.csv(allPages, "allPages-PageTitles.csv", row.names = F)
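
Once the loop finishes, the problem pages fall out of a quick dplyr filter. A small sketch, assuming the loop above records a missing title as NA and an empty one as a blank string:

# Pages whose title is missing entirely or present but blank
missingTitles = allPages %>%
  filter(is.na(PAGE_TITLE) | trimws(PAGE_TITLE) == "")

nrow(missingTitles)  # how many pages need a title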

In a little over 30 minutes, I was able to scrape 2,146 pages and identify those without a title. Luckily, only a small percentage were missing one. This sure beats manually going through each page to find the problems. I’ll probably end up restructuring this script to also extract the meta description from each page and running it again; I’m now finding that meta descriptions are also not being set when the pages are created.

 

Update: After posting this, I did indeed modify the script to also capture the meta description content and identify which pages need to be updated. Below is the updated script. I ended up storing the parsed HTML in a variable so that I could pass it to html_node() for both the pgTitle and pgMetaDescr variables without requesting each page twice.

library(readr)
library(rvest)
library(dplyr)

allPages = read_csv(...)
allPages$PAGE_TITLE = ""
allPages$META_DESCR = ""

for (i in 1:nrow(allPages)) {
  
  # Pull the URL out of the first column as a plain character value
  webAddy = allPages[[i, 1]]
  
  # Read each page once and reuse the parsed HTML for both extractions
  pgHTML = read_html(webAddy)
  
  # html_node() returns a missing node when the element isn't there,
  # so html_text()/html_attr() give NA instead of stopping the loop
  pgTitle = pgHTML %>% html_node("title") %>% html_text()
  pgMetaDescr = pgHTML %>% html_node("meta[name=description]") %>% html_attr("content")
  
  allPages$PAGE_TITLE[i] = pgTitle
  allPages$META_DESCR[i] = pgMetaDescr
}

write.csv(allPages, "allPages-PageTitles.csv", row.names = F)
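
The same kind of filter flags the meta description gaps, or both problems at once (again just a sketch against the columns created above):

# Pages missing a title, a meta description, or both
needsSeoFix = allPages %>%
  filter(is.na(PAGE_TITLE) | trimws(PAGE_TITLE) == "" |
         is.na(META_DESCR) | trimws(META_DESCR) == "")

nrow(needsSeoFix)  # total pages to hand off to the SEO team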

 



