Guide To R For SEO
If you’ve heard of the R language and think that SEOs who use it are aliens, you’re not totally incorrect.
Originally designed for data scientists and statisticians, the R programming language has gained popularity in recent years, and the rationale is simple:
With R and its many SEO-friendly packages, you can automate tasks, extract and aggregate data via APIs, scrape web pages, cross-reference large files (keyword lists, for example), and perform text mining, machine learning, NLP, and semantic analysis.
But let’s be clear: R isn’t an SEO secret!
If you’ve ever wanted to build your own SEO tools or transition from classic empirical techniques to data-driven SEO, you’re on your way to becoming an extra-terrestrial as well.
What Exactly Is SEO?
Search Engine Optimization, or SEO, is the practice of increasing the quality and quantity of internet traffic from search engines to a website or web page.
What Exactly Is R?
R is a programming language and software environment that is free to use.
Why Would You Want To Use R For SEO?
R is a programming language that specialises in data mining, statistical and data analysis, and data visualisation, and it can also crawl websites. Basically, everything that can help you with SEO.
R is likewise extremely simple to read and write once you've grasped the principles. Even if you don't want to understand it thoroughly, you can crawl a website or extract Google Analytics data and export it to a CSV by copying and pasting three lines of code.
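To give a taste of how little code that takes, here is a hypothetical three-liner that builds a small keyword table and writes it to a CSV file (the keywords and volumes are invented purely for illustration):

```r
# A made-up keyword dataset, just to show the export workflow
keywords <- data.frame(keyword = c("seo", "r language"), volume = c(12000, 900))
# Write the dataframe to a CSV file in the working directory
write.csv(keywords, "keywords.csv", row.names = FALSE)
```

The same `write.csv()` call works on any dataframe, whether it comes from a crawl, an API, or a manual export.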
When Is It OK To Use R For SEO?
When dealing with large websites with thousands of pages, R comes in handy. I'm a great admirer of automation, but its use must be carefully assessed. There are a number of wonderful SEO tools available; R will never replace them, but it is a really welcome addition to your toolbox.
Where Should I Write The R?
Begin by downloading R, then install RStudio, the free and open-source IDE.
After installing RStudio, you can test the following R snippets directly in the console (bottom left panel) or copy and paste them into a new script (File > New File > R Script).
sessionInfo() #View environment information
getwd() #View the working directory
setwd("/Users/remi/dossier/") #Set the working directory
list.files() #See the contents of the directory
dir.create("foldername") #Create a folder in the directory
R comes with a plethora of packages (= "functionalities" to download). The full list is available on the CRAN website.
install.packages("packagename") #Install a package
install.packages(c("packageA", "packageB", "packageC")) #Install several packages at a time
#Install a list of packages only if they are not already installed
list.of.packages <- c("dplyr", "ggplot2", "tidyverse", "tidytext", "wordcloud", "wordcloud2", "gridExtra", "grid")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
library("packagename") #Load an installed package
?packagename #View package documentation
packageVersion("packagename") #Get the version of a package
detach("package:dplyr", unload=TRUE) #Force a package to close without closing RStudio
Here are some of the most useful R packages for SEO:
- dplyr: Working with data from a dataframe (filter, sort, select, summarize, etc)
- SEMrushR (French): Make use of the SEMrush API
- majesticR (French): Make use of the Majestic API
- kwClustersR: Group a collection of keywords.
- duplicateContentR (French): Determine a similarity score between two pages in order to detect duplicate content.
- text2vec: Extract n-grams.
- eVenn: Make Venn diagrams (useful for semantic audits)
- tm: Handle accents and stopwords
- ggplot2: Create graphs
- Shiny: Build a real-world application based on your scripts.
- searchConsoleR (French): Use the Google Search Console API
- httr: Perform GET, POST, PUT, and DELETE operations
- RCurl: For making requests that are more comprehensive than httr.
- XML: For parsing web documents.
- jsonlite: Obtain json
- googleAuthR: To manage Google authentication.
- googleAnalyticsR: Using the Google Analytics API
- searchConsoleR: Downloading data from the Google Search Console into R
- urltools: Perform URL processing
Manage Large Volumes Of Data
In every SEO project, data is used methodically, whether it comes from Screaming Frog, SEMrush, Search Console, your web analytics tool, or another source. It can be obtained directly through APIs or via manual exports.
Tips on how to process these datasets can be found in the sections that follow.
Save And Open A Dataset
mondataframe <- data.frame() #Create a dataframe (lets you mix numeric and text data)
merged <- merge(df1, df2) #Merge 2 dataframes on their common columns
#Open a TXT file
Fichier_txt <- read.table("filename.txt", header=FALSE, col.names = c("colname1", "colname2", "colname3"))
#Open an XLS (requires the readxl package)
Fichier_xls <- read_excel("cohorte.xls", sheet = 1, col_names = FALSE, col_types = NULL, skip = 1)
#Open a CSV
Fichier_csv <- read.csv2("df.csv", header = TRUE, sep=";", stringsAsFactors = FALSE)
#Save your dataset
write.csv(mydataset, "df.csv") #Create a csv
write.table(mydataset, "df.txt") #Create a txt
#Change column names
cnames <- c("keywords", "searchvolume", "competition", "CPC") #define names for the 4 columns of the dataframe
colnames(mydataset) <- cnames #assign the column names to the dataframe
Know The Dataset
object.size(dataset) #Get the object size in bytes
head(dataset) #See the first lines
tail(dataset) #See the last lines
colnames(dataset) #Know the names of the columns
apply(dataset, 2, function(x) length(unique(x))) #Know how many different values there are in each column of the dataset
summary(dataset) #Have a summary of each column of the dataset (minimum, median, average, maximum, etc.)
summary(dataset$column) #Same thing for a particular column
dim(dataset) #Dataset dimensions (number of columns and rows)
str(dataset) #More complete than dim() : Dataset dimensions + Data type in each column
which(dataset$column == "seo") #Find the rows of the "column" column that contain the value "seo"
Prioritize The Dplyr Package
dplyr is THE package to know. It lets you perform all kinds of operations on your datasets, including selecting, filtering, sorting, and grouping.
#Select columns and rows
select(df,colA,colB) #Select colA and colB in the df dataset
select(df, colA:colG) #Select from colA to colG
select(df, -colD) #Delete the column D
select(df, -(colC:colH)) #Delete a series of columns
slice(df, 18:23) #Select lines 18 to 23
#Create a filter
filter(df, country=="FR" & page=="home") #Filter the rows that contain FR (country column) and home (page column)
filter(df, country=="US" | country=="IN") #Filter rows whose country is US or IN
filter(df, size>100500, r_os=="linux-gnu") #Combine several conditions
filter(df, !is.na(r_version)) #Filter the rows whose r_version column is not empty
#Sort your data
arrange(keywordDF, volume) #Sort the dataset by the values of the volume column (ascending)
arrange(keywordDF, desc(volume)) #Sort the dataset in descending order
arrange(keywordDF, competition, volume) #Sort the data by several variables (several columns)
arrange(keywordDF, competition, desc(volume), adwordsCPC)
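These verbs become really powerful when chained together with the pipe. Here is a minimal sketch on an invented keyword dataframe (the keywords, volumes, and CPC values are made up for illustration): filter, then sort, then keep only two columns.

```r
library(dplyr)

# A small invented keyword dataset for illustration
keywordDF <- data.frame(
  keyword = c("buy shoes", "red shoes", "shoe repair", "free shoes"),
  volume  = c(5400, 880, 1300, 320),
  cpc     = c(1.2, 0.4, 0.9, 0.1)
)

# Chain the dplyr verbs with the pipe
top <- keywordDF %>%
  filter(volume > 500) %>%   # keep keywords with enough search volume
  arrange(desc(volume)) %>%  # highest volume first
  select(keyword, volume)    # drop the cpc column
```

Each verb takes a dataframe and returns a dataframe, which is what makes the chaining possible.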
Other Points To Consider
The following are commands we frequently use to run operations on large keyword datasets, such as SEMrush, Ranxplorer, or Screaming Frog exports.
These operations help us move faster when hunting for SEO opportunities.
For Screaming Frog exports, a few commands will let you count things like the number of URLs crawled, the number of empty cells in a column, and the number of URLs per status code.
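Those three counts can be done in base R. Here is a sketch on an invented mini crawl export (the URLs, status codes, and column names below are made up; adapt them to the actual headers of your Screaming Frog file):

```r
# An invented mini Screaming Frog export, for illustration
crawl <- data.frame(
  Address = c("https://site.com/", "https://site.com/a",
              "https://site.com/b", "https://site.com/old"),
  Status.Code = c(200, 200, 301, 404),
  Title = c("Home", NA, "Page B", NA)
)

urlcount <- nrow(crawl)              # number of URLs crawled
emptytitles <- sum(is.na(crawl$Title)) # number of empty cells in the Title column
bystatus <- table(crawl$Status.Code)   # number of URLs per status code
```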
#Convert a column to numeric format
keywords$Volume <- as.numeric(as.character(keywords$Volume))
#Add a column with a default value
keywordDF$newcolumn <- 1 #create a new column with the value 1
#Add a column whose value is based on an operation
mutate(keywordDF, TraficEstime = CTRranking * volume) #create a new column (TraficEstime) based on 2 others (CTRranking and volume)
mutate(keywordDF, realvolume = volume / 2)
#Split a dataset into several datasets
#Very useful to divide a list of keywords by theme
split(keywords, keywords$Thematique)
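For example, `split()` returns a named list with one dataframe per theme. A minimal sketch on an invented keyword list (keywords and theme labels are made up for illustration):

```r
# Invented keyword list with a theme column, for illustration
keywords <- data.frame(
  keyword = c("red shoes", "blue shoes", "winter coat", "rain coat"),
  theme   = c("shoes", "shoes", "coats", "coats")
)

# One dataframe per theme, returned as a named list
bytheme <- split(keywords, keywords$theme)
sapply(bytheme, nrow) # how many keywords in each theme
```

Each element of the list (e.g. `bytheme$shoes`) can then be exported or processed separately.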
Extracting Content From The Web (Web Scraping)
Making a crawler is a great way to rapidly get certain components of a web page. It can be used to monitor a competitor's website: its pricing strategy, content revisions, and so on.
You may use the following script to download an XML file, parse it, and get variables of interest to you. You’ll also learn how to turn it into a dataframe.
#1. Load packages
library(RCurl)
library(XML)
#2. Get the source code
url <- "https://www.w3schools.com/xml/simple.xml"
xml <- getURL(url, followlocation = TRUE, ssl.verifypeer = FALSE)
#3. Format the code and retrieve the root XML node
doc <- xmlParse(xml)
rootNode <- xmlRoot(doc)
#3.1 Save the source code in an html file in order to view it in its entirety
write(xml, file = "source.html")
#4. Get web page contents
xmlName(rootNode) #The name of the XML (1st node)
rootNode[[1]] #All content of the first node
rootNode[[1]][[1]] #The 1st element of the 1st node
xmlSApply(rootNode, xmlValue) #Remove the tags
xpathSApply(rootNode, "//name", xmlValue) #Some nodes with xPath
xpathSApply(rootNode, "/breakfast_menu//food[calories=900]", xmlValue) #Filter XML nodes by value (here recipes with 900 calories)
#5. Create a data frame or list
menusample <- xmlToDataFrame(doc)
menusample <- xmlToList(doc)
HTML Scraping
Retrieving all the links from a website or getting the list of its articles are just a few examples of what you can accomplish with the script below.
#1. Load packages
library(httr)
library(XML)
#2. Get the source code
url <- "https://remibacha.com"
request <- GET(url)
doc <- htmlParse(content(request, as = "text"), asText = TRUE)
#3. Get the title and count the number of characters
PageTitle <- xpathSApply(doc, "//title", xmlValue)
nchar(PageTitle) #number of characters in the title
#4. Get posts names
PostTitles <- data.frame(xpathSApply(doc, "//h2[@class='entry-title h1']", xmlValue))
PostTitles <- data.frame(xpathSApply(doc, "//h2", xmlValue))
#5. Retrieve all the links on the page and make a list of them
hrefs <- xpathSApply(doc, "//div/a", xmlGetAttr, 'href')
hrefs <- data.frame(matrix(unlist(hrefs), byrow=T))
#6. Retrieve links from the menu
liensmenu <- xpathSApply(doc, "//ul[@id='menu-menu']//a", xmlGetAttr, 'href')
liensmenu <- data.frame(matrix(unlist(liensmenu), byrow=T))
#7. Retrieve the status code and header
status_code(request) #status code of the response
header <- headers(request)
header <- data.frame(matrix(unlist(header), byrow=T))
Retrieve JSON Data
#1. Load the package
library(jsonlite)
#2. Get the JSON
jsonData <- fromJSON("https://api.github.com/users/jtleek/repos")
#3. Retrieve the names of all nodes
names(jsonData)
#4. Retrieve the names of all nodes in the "owner" node
names(jsonData$owner)
#5. Retrieve values from the login node
jsonData$owner$login
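Since the GitHub call above depends on network access, the same parsing logic can be demonstrated offline with a hand-written JSON string (the repository name below is invented; the login reuses the one from the URL above):

```r
library(jsonlite)

# A hand-written JSON string mimicking part of the GitHub API response
json <- '[{"name": "repoA", "owner": {"login": "jtleek"}}]'
repos <- fromJSON(json)

names(repos)      # top-level nodes become dataframe columns
repos$owner$login # nested "owner" node, accessible with $
```

jsonlite turns an array of JSON objects into a dataframe, with nested objects exposed as nested dataframe columns.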