General introduction to R

Authors and affiliations

Floriane Remy (CNRS, Univ. Bordeaux, MCC – UMR 5199 PACEA)

Frédéric Santos (CNRS, Univ. Bordeaux, MCC – UMR 5199 PACEA)

1 What is R?

R is primarily a statistical programming language, although it has become a more general-purpose language in recent years. It is open source, free, cross-platform software, developed by volunteers.

Although it can be used through various graphical interfaces such as R Commander (Fox 2017), we can only take full advantage of its power by writing R scripts.

2 First interactions with R

To get used to R, let's begin with simple arithmetic operations in the R console:

5 + 3
[1] 8
2^3
[1] 8

However, we usually interact with R using functions, as in the following example:

log(10)
[1] 2.302585

A function can have one or more (mandatory or optional) arguments that modify its behavior. For example, we can specify a given base in the calculation of a logarithm by using the argument “base”:

log(10, base = 2)
[1] 3.321928

In R, the different arguments of a function are therefore separated by a comma.
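
As a brief illustration of how arguments are passed, the sketch below calls log() three times with the same values, giving the base by name, by position, and with the named arguments in a different order. It relies only on base R; the value 8 is chosen arbitrarily.

```r
## Named argument: the base is stated explicitly
log(8, base = 2)
## Positional argument: the second value is matched to 'base'
log(8, 2)
## Named arguments can be given in any order
log(base = 2, x = 8)
```

All three calls return 3. Naming the arguments is usually the more readable choice as soon as a function has more than one or two of them.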

To find out which arguments a given R function accepts, just read its help page; for instance:

help(log)
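
For a quick reminder that does not open the help page, base R also provides args(), which displays the argument names and default values of a function. A minimal sketch:

```r
## Shortcut equivalent to help(log), to be typed in the console:
## ?log

## Show the argument names and default values of log():
args(log)
```

Here the output shows that log() takes two arguments, x and base, and that base defaults to exp(1), i.e. the natural logarithm.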

3 R and its packages

R comes natively with a limited collection of basic functions, allowing you to carry out the most common tasks (usual graphical representations, basic tests, etc.). To implement more advanced (or simply less common) methods, there are nearly 20,000 (as of May 2024) additional packages freely downloadable from CRAN.

It is only necessary to install them once (using the install.packages() function), but they must then be loaded (when you need them) each time the software is started (via the library() function). Here is an example:

## Install the R package geomorph:
install.packages("geomorph")
## Load this package for the current R session:
library(geomorph)
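
Since installation is needed only once, it can be convenient to check whether a package is already present before installing it. The sketch below is one possible way to do so; is_installed() is a hypothetical helper defined here for illustration, not a function provided by R.

```r
## Hypothetical helper: TRUE if the package is already installed
is_installed <- function(pkg) {
    requireNamespace(pkg, quietly = TRUE)
}

## The package 'stats' ships with R, so this returns TRUE:
is_installed("stats")

## Only suggest an installation if geomorph is missing:
if (!is_installed("geomorph")) {
    message("geomorph is missing: run install.packages('geomorph')")
}
```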

4 General workflow to use R

Generally speaking, we never write instructions in the R console (see footnote 1), but in a separate plain-text file, which is then called an R script (or source code file). An R script is thus a sequence of R statements stored in a plain-text file, whose extension should be .R.

If you are not used to programming, and have only used graphical interfaces so far, R introduces a new paradigm, in the sense that we no longer save results, but instructions allowing these results to be obtained:

“The source code is real. The objects are realizations of the source code” — Manual of Emacs Speaks Statistics

A script must remain “clean”: well organized, clear, understandable. It must be both complete (contain all necessary commands) and minimal (contain nothing superfluous).

For instance, in the RStudio IDE (see footnote 2), you can create a new R script via the menu File > New File > R Script. The keyboard shortcut Ctrl + Shift + N is another quick way to do it.

When writing an R script, it is useful (essential?) to enter comments, to explain in plain language at least the most technical parts of the code. The comment character in R is #:

sqrt(2) # this is the square root function
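
To make the idea of a script more concrete, here is a minimal sketch that writes a two-line script to a temporary file and then runs it with the base R function source(). The temporary file and the object name x are assumptions made for the illustration; in practice you would write the script in your editor and save it yourself.

```r
## Create a temporary plain-text file with the .R extension
script_path <- tempfile(fileext = ".R")

## Store two lines in it: a comment and an R statement
writeLines(
    c("## Compute the square root of 2", "x <- sqrt(2)"),
    script_path
)

## Execute every statement stored in the script
source(script_path)
print(x)
```

The script itself is what gets saved; the object x can be recreated at any time by running the script again.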

Now, let’s begin with a first concrete use case of R!

Figure 1

5 RthuR, GuineveRe et al.

We will work on a fictional dataset based on the TV show Kaamelott.

5.1 Creating R objects

First, let’s create objects with specific values. We say that we assign values to these objects, using the assignment operator <- (see footnote 3). You can then display the value stored in an object with the function print():

King <- 'Arthur'
Queen <- 'Guenievre'
print(King)
[1] "Arthur"

You can create an object from other pre-existing objects, so that:

RoyalCouple <- c('Arthur', 'Guenievre')
print(RoyalCouple)
[1] "Arthur"    "Guenievre"

is equivalent to:

RoyalCouple <- c(King, Queen)
print(RoyalCouple)
[1] "Arthur"    "Guenievre"

5.2 Vectors

Note that we used the function c() above to create vectors, i.e. one-dimensional, ordered sequences of values. The name “c” stands for “combine”, “concatenate” or “collection”. Since vectors are ordered, we can extract, say, the first element of a vector by using the following syntax:

print(RoyalCouple[1])
[1] "Arthur"
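
The same bracket syntax allows a few other common extractions. The sketch below, which uses only base R, recreates the vector so that it can be run on its own:

```r
RoyalCouple <- c('Arthur', 'Guenievre')

## Second element of the vector
RoyalCouple[2]
## Number of elements in the vector
length(RoyalCouple)
## A negative index drops an element: everything except the first
RoyalCouple[-1]
```

Note that indices start at 1 in R, not at 0 as in many other programming languages.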

Now, let’s create a vector of four characters of the TV show, along with their gender and their (not so) arbitrary scores in courage and intelligence:

characters <- c('Arthur', 'Bohort', 'Karadoc', 'Séli')
gender <- c('m', 'm', 'm', 'f')
courage <- c(500, 0, 100, 300)
intelligence <- c(500, 400, 100, 400)
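
Most R functions and operators work on whole vectors at once, element by element. As a small sketch using the vectors defined above (repeated here so that the example is self-contained):

```r
characters <- c('Arthur', 'Bohort', 'Karadoc', 'Séli')
courage <- c(500, 0, 100, 300)
intelligence <- c(500, 400, 100, 400)

## Average courage score over the four characters
mean(courage)
## Element-wise sum of the two scores
courage + intelligence
## Logical test applied to each element
courage >= 300
## Use the result of the test to pick the matching characters
characters[courage >= 300]
```

The last line returns the names of Arthur and Séli, the only two characters with a courage score of at least 300.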

5.3 Dataframes

We can combine these four vectors into a dataframe, which is the standard way in R to build a data structure with \(n\) individuals described by \(p\) variables:

kaamelott <- data.frame(
    characters,
    gender,
    courage,
    intelligence
)
print(kaamelott)
  characters gender courage intelligence
1     Arthur      m     500          500
2     Bohort      m       0          400
3    Karadoc      m     100          100
4       Séli      f     300          400
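
Once a dataframe exists, a few base R functions help to check its structure. The following sketch rebuilds the same dataframe (with the vectors written inline, so that it runs on its own) and inspects it:

```r
kaamelott <- data.frame(
    characters = c('Arthur', 'Bohort', 'Karadoc', 'Séli'),
    gender = c('m', 'm', 'm', 'f'),
    courage = c(500, 0, 100, 300),
    intelligence = c(500, 400, 100, 400)
)

## Number of rows (individuals) and columns (variables)
dim(kaamelott)
## Names of the variables
names(kaamelott)
## Extract one variable by name with the $ operator
kaamelott$courage
```

Here dim() returns 4 and 4, i.e. \(n = 4\) individuals and \(p = 4\) variables.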

In R, it’s easy to select a subset of individuals according to one or several conditions:

## Select only males:
subset(kaamelott, gender == "m")
  characters gender courage intelligence
1     Arthur      m     500          500
2     Bohort      m       0          400
3    Karadoc      m     100          100
## Select only brave people:
subset(kaamelott, courage >= 300)
  characters gender courage intelligence
1     Arthur      m     500          500
4       Séli      f     300          400
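
Several conditions can be combined with the logical operators & ("and") and | ("or"). The sketch below, self-contained and using only base R, illustrates both, along with an equivalent formulation based on bracket indexing:

```r
kaamelott <- data.frame(
    characters = c('Arthur', 'Bohort', 'Karadoc', 'Séli'),
    gender = c('m', 'm', 'm', 'f'),
    courage = c(500, 0, 100, 300),
    intelligence = c(500, 400, 100, 400)
)

## Males with a courage score of at least 100 (rows 1 and 3):
brave_males <- subset(kaamelott, gender == "m" & courage >= 100)
print(brave_males)

## Individuals who are either brave or clever (rows 1, 2 and 4):
brave_or_clever <- subset(kaamelott, courage >= 300 | intelligence >= 400)
print(brave_or_clever)

## Same selection as subset(kaamelott, courage >= 300), with brackets:
print(kaamelott[kaamelott$courage >= 300, ])
```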

Dataframes are 2-dimensional data structures: each element stored in a dataframe can be extracted by specifying the indices of its row and column. For instance:

## Display the value stored in row 1, column 3:
print(kaamelott[1, 3])
[1] 500

Note that if you only specify a column index and no row index, the whole column is displayed (see footnote 4):

## Display the whole 3rd column:
print(kaamelott[, 3])
[1] 500   0 100 300
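
Columns can also be referred to by name rather than by position, which tends to be more robust if the column order changes. A minimal self-contained sketch:

```r
kaamelott <- data.frame(
    characters = c('Arthur', 'Bohort', 'Karadoc', 'Séli'),
    gender = c('m', 'm', 'm', 'f'),
    courage = c(500, 0, 100, 300),
    intelligence = c(500, 400, 100, 400)
)

## Whole column, selected by its name instead of its index:
kaamelott[, "courage"]
## A single value: row 2 of the column "characters"
kaamelott[2, "characters"]
## Whole first row (the result is still a dataframe):
kaamelott[1, ]
```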

5.4 Arrays

Another data structure (see footnote 5) in R, which is very frequently used in morphometrics, is the array. Arrays can be seen as generalized matrices: in morphometrics, you’ll often find 3D arrays \(A=(a_{ijk})\) where each value \(a_{ijk}\) has three indices. This is particularly useful for representing landmark data, since \(a_{ijk}\) then corresponds to the \(j\)-th coordinate of the \(i\)-th landmark for the \(k\)-th individual. We will see examples of arrays later on.
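
As a first glimpse, here is a minimal sketch of a 3D array built with the base R function array(). The values 1 to 24 are purely fictional placeholders, standing for 4 landmarks in 2 dimensions recorded on 3 individuals; they are not real landmark coordinates.

```r
## Fictional data: 4 landmarks x 2 coordinates x 3 individuals
A <- array(1:24, dim = c(4, 2, 3))

## Dimensions of the array
dim(A)
## First coordinate (j = 1) of landmark 2 (i = 2) for individual 3 (k = 3)
A[2, 1, 3]
## All landmarks of individual 1: a matrix with 4 rows and 2 columns
A[, , 1]
```

Leaving an index empty selects everything along that dimension, exactly as for dataframes.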

6 Useful packages for doing morphometrics

We will use the following packages for subsequent chapters:

library(geomorph)
library(Morpho)
library(rgl)
library(Rvcg)
library(shapes)

The package {shapes} is the historical package for shape analysis in R, and is exhaustively documented in Dryden and Mardia (2016). However, {geomorph} and {Morpho} have a more “modern” design, and more helpers to make analyses easier. The choice of one package or another is purely a matter of taste.

Some other more general packages will also be used for multivariate statistical analysis and graphical representations:

library(factoextra)
library(FactoMineR)
library(ggpubr)

References

Dryden, Ian L., and Kanti V. Mardia. 2016. Statistical Shape Analysis, with Applications in R. 2nd ed. Wiley Series in Probability and Statistics. Chichester, UK: John Wiley & Sons, Ltd. https://doi.org/10.1002/9781119072492.
Fox, John. 2017. Using the R Commander: A Point-and-Click Interface for R. The R Series. Milton: CRC Press.

Footnotes

  1. Except for “one-shot” instructions you don’t want to keep track of, such as installing packages, asking for help, etc.

  2. There exist many other excellent interfaces to interact efficiently with R: Emacs ESS, JupyterLab, etc. RStudio is not the only way!

  3. You can also use = as the assignment operator, so that, at the top level of a script, x <- 2 and x = 2 are equivalent for R. However, the recommendation is to prefer <-.

  4. And similarly, of course, kaamelott[1, ] is a way to extract the whole first row of the dataframe.

  5. There are many other data structures in R, such as factors, matrices, lists and so on. We will not cover them here, but you can find a good description here.