Loading data

Authors

Floriane Remy, CNRS, Univ. Bordeaux, MCC – UMR 5199 PACEA

Frédéric Santos, CNRS, Univ. Bordeaux, MCC – UMR 5199 PACEA

1 Preamble: create your RStudio project

A Git repository was created to host some data files in various formats. As a first step, you will have to make a local copy of this repository on your computer, and turn it into an RStudio project.

  1. Download the whole content of the repo as a zip file, and unzip it in a given folder on your computer.
  2. Open RStudio, and open the menu File > New Project.
  3. Choose the type “Existing directory”, and create a project in the folder created in step 1.

2 Usual data formats in morphometrics

Usually, the data that we handle in morphometrics are not natively in a classic “tidy” format (one row per individual, one column per variable). They can also sometimes be in specific file formats (.dta, .xml, …).

2.1 Landmarks coordinates in text files

In the best and simplest case, the software you used to place 2D or 3D landmarks on your specimens will produce one plain text file (.csv or .txt) per individual, with one row per landmark and one column per coordinate (x, y and optionally z). The base R function read.table() is then a convenient way to load your data, as in the example below.

## Example of importing a txt file:
colb <- read.table(
  file = "./scallops/colb01.txt",
  header = FALSE,   # no column names
  sep = ",",        # column separator
  dec = "."         # decimal separator
)
head(colb)          # display the first rows of the file
           V1        V2        V3
1 -14.1384844 -17.58396  0.575319
2  -7.2604844 -26.83156 -1.190281
3   0.1155156 -26.66646 -0.728681
4   7.9187156 -26.67076 -1.044281
5  14.6561156 -18.05436  1.049719
6  23.2123156 -13.11266  0.581219
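
As a minimal sketch of what happens after such an import: read.table() returns a data frame, which can be converted to a numeric matrix with named coordinate columns. The file used here is a temporary toy file written by the example itself, with made-up coordinates; it is not one of the files from the repository.

```r
## Toy example: write a small landmark file, then read it back.
## The coordinates below are made up for illustration.
tmp <- tempfile(fileext = ".txt")
writeLines(c("1.5,2.0,0.1", "3.2,4.1,0.3", "5.0,6.3,0.2"), tmp)

lm_df <- read.table(file = tmp, header = FALSE, sep = ",", dec = ".")
lm_mat <- as.matrix(lm_df)              # data frame -> numeric matrix
colnames(lm_mat) <- c("x", "y", "z")    # explicit coordinate names
dim(lm_mat)                             # landmarks x coordinates
unlink(tmp)                             # remove the temporary file
```

Checking dim() right after loading is a cheap way to confirm that the separator and header settings match the file.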

2.2 Specific data formats

However, some data acquisition software will return non-standard file formats. A (non-exhaustive) list of examples is given below; the corresponding files can be explored in the formats folder of the Git repository.

2.2.1 DTA files

They are created (for instance) by the software Landmark. The data for all your individuals are generally stored in one single file. The function read.lmdta() from the R package {Morpho} can load this kind of file:

## Load a dta file:
dta <- Morpho::read.lmdta(file = "./formats/example_landmark.dta")
## Structure of the R object created:
str(dta)
List of 3
 $ arr    : num [1:11, 1:3, 1:26] 7.03 7.62 7.81 8.02 8.13 ...
  ..- attr(*, "dimnames")=List of 3
  .. ..$ : NULL
  .. ..$ : NULL
  .. ..$ : chr [1:26] "UMRright_PW0181_manuel" "UMRright_PW0181_semimanual" "UMRright_PW181_ISEM" "UMRright_PW0351" ...
 $ info   : chr [1:6] "1" "26L" "33" "1" ...
 $ idnames: chr [1:26] "UMRright_PW0181_manuel" "UMRright_PW0181_semimanual" "UMRright_PW181_ISEM" "UMRright_PW0351" ...
## Display the coordinates for the first individual:
print(dta$arr[, , 1])
          [,1]     [,2]     [,3]
 [1,] 7.025246 6.244688 16.52390
 [2,] 7.621362 6.192825 16.13980
 [3,] 7.812572 6.128182 15.04538
 [4,] 8.024401 6.092654 13.95987
 [5,] 8.131372 6.293088 13.17749
 [6,] 7.185867 5.709883 15.44929
 [7,] 7.459052 6.031806 14.60554
 [8,] 7.772063 6.196759 13.64813
 [9,] 7.354609 6.849706 15.77832
[10,] 7.607753 6.488539 14.73335
[11,] 7.882711 6.524845 13.81549

2.2.2 PLY files

Polygon files are created (for instance) by Avizo. They may represent either landmark data or surfaces. They can be loaded using the function read.ply() from the package {geomorph}, or using the function vcgImport() from the package {Rvcg}.

## Load a surface in PLY format and visualize it:
ply <- Rvcg::vcgImport(file = "./formats/example_avizo.ply", clean = TRUE)
rgl::shade3d(ply, col = "grey")

2.2.3 TPS files

They are created by the software tpsDIG2. Once again, the data for several individuals are generally stored in one single file. The function readland.tps() from the package {geomorph} (or the function readallTPS() from the package {Morpho}) allows you to load such files:

## Load a TPS file:
tps <- geomorph::readland.tps(file = "./formats/example_tpsdig.TPS",
                              specID = "imageID")

No curves detected; all points appear to be fixed landmarks.
## Display the coordinates of the first shape:
print(tps[, , 1])
          [,1]      [,2]
 [1,] 226.8553 162.15384
 [2,] 226.3458 142.35768
 [3,] 219.6500 136.06221
 [4,] 236.7169 112.19037
 [5,] 240.9746 112.04481
 [6,] 227.1100 138.24561
 [7,] 231.8407 137.48142
 [8,] 231.9862 142.10295
 [9,] 237.6267 144.43191
[10,] 231.4768 153.27468
[11,] 244.4316 144.94137
[12,] 245.4505 147.63423
[13,] 230.7126 162.40857
[14,] 102.9837  10.88061
[15,] 142.3577  10.58949

2.2.4 XML files

They are created for instance by the software Viewbox. The function read_viewbox() from the package {anthrostat} allows you to load such files:

## Load an XML file:
xml <- anthrostat::read_viewbox("./formats/example_viewbox.xml")
head(xml)
         x         y         z
0  39.3450 -324.3970  -23.4495
1  47.4408 -207.4721  -64.0599
2  39.8136 -169.3606  -18.9716
3  47.9278   24.9335 -173.3149
4 -48.6539 -245.8677  -19.3262
5 121.4894 -233.5452   -7.6302

There are many other classical file formats in morphometrics (e.g., JSON, NTS, …). You can find specific functions to load them in the R packages {geomorph} and {Morpho}.

3 Load several files at once

For DTA or TPS formats, for instance, we generally have several individuals in one single file, so that the corresponding R functions import all the individuals in one single instruction. In other cases, however, we have one file (CSV, TXT, etc.) per individual, so that we have to load a large set of data files. Obviously, if we have 80 files, we should not write 80 read.table() instructions manually!

Let’s inspect the sub-folder scallops:

files <- list.files("./scallops")
print(files)
[1] "colb01.txt"        "irradiansls13.txt" "subn01.txt"       
[4] "ZXkuhn01.txt"      "ZXkuhn03.txt"     

There are 5 TXT files in this folder. How can we load them with one single R instruction?

For CSV or TXT files, the function Morpho::read.csv.folder() (see footnote 1) loads all the files from a given directory at once, and returns an array. Of course, all files must have the same number of rows and columns, and have the same extension.

## Load all TXT files:
lmarray2 <- Morpho::read.csv.folder(
  folder = "./scallops/", # the folder to load
  x = 1:44,               # the rows to read
  y = 1:3,                # the columns to read
  header = FALSE,         # no column names in the files
  dec = ".",              # decimal point
  sep = ",",              # field separator
  pattern = "*.txt"       # pattern of the files to load
)
## Print the array:
print(lmarray2$arr)
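
For comparison, here is a hedged base R sketch of the same idea: list the files, read each one, and stack the resulting matrices into a three-dimensional array. It does not reproduce the return value of Morpho::read.csv.folder(); the folder, file names and coordinates are toy values created by the example itself.

```r
## Toy folder with three files of identical layout (made-up coordinates).
toy_dir <- file.path(tempdir(), "toy_landmarks")
dir.create(toy_dir, showWarnings = FALSE)
for (id in c("ind01", "ind02", "ind03")) {
  coords <- matrix(round(runif(12), 3), nrow = 4, ncol = 3)
  write.table(coords, file = file.path(toy_dir, paste0(id, ".txt")),
              sep = ",", row.names = FALSE, col.names = FALSE)
}

## List the files, then read each of them as a numeric matrix.
paths <- list.files(toy_dir, pattern = "\\.txt$", full.names = TRUE)
mats <- lapply(paths, function(p) {
  as.matrix(read.table(p, header = FALSE, sep = ",", dec = "."))
})

## Stack the matrices: landmarks x coordinates x individuals.
toy_array <- array(unlist(mats), dim = c(dim(mats[[1]]), length(mats)))
dimnames(toy_array)[[3]] <- tools::file_path_sans_ext(basename(paths))
dim(toy_array)
unlink(toy_dir, recursive = TRUE)       # remove the toy folder
```

This only works if every file has the same number of rows and columns, which is the same constraint as stated above.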

4 In practice

In this course, we will use the data contained in the file rats.TPS. This file stores the coordinates of 15 fixed landmarks (these are the first 15 rows of each individual) and 88 semilandmarks taken on rat mandibles, as shown in Figure 1.

Figure 1: Protocol for landmarks and semilandmarks.

Exercise
  1. (Optional) Open the file rats.TPS using a text editor, inspect it and try to understand its structure.
  2. Load this TPS file in R.
  3. How many individuals do we have in this file?

In this TPS file, the rows LM=103 indicate that we have 103 landmarks per individual. Each landmark has two coordinates \((x,y)\), so that we have an array of 103 rows and 2 columns per individual. The field ID indicates the identifier of each individual.
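
To make this structure concrete, the sketch below writes a tiny TPS-like file and counts the individuals by looking for the LM= rows with base R only. The file content (two specimens, three landmarks each, image names and identifiers) is made up for illustration and is much smaller than rats.TPS.

```r
## Toy TPS content: two specimens with three 2D landmarks each (made-up values).
toy_tps <- c(
  "LM=3", "0.1 0.2", "0.3 0.4", "0.5 0.6", "IMAGE=spec_a.jpg", "ID=0",
  "LM=3", "1.1 1.2", "1.3 1.4", "1.5 1.6", "IMAGE=spec_b.jpg", "ID=1"
)
tps_path <- tempfile(fileext = ".TPS")
writeLines(toy_tps, tps_path)

## Each individual starts with one "LM=" row giving its number of landmarks.
tps_lines <- readLines(tps_path)
lm_rows <- grep("^LM=", tps_lines, value = TRUE)
n_individuals <- length(lm_rows)
n_landmarks <- as.integer(sub("^LM=", "", lm_rows))
n_individuals
n_landmarks
unlink(tps_path)                        # remove the temporary file
```

Such a manual inspection is only a sanity check; a dedicated reader such as geomorph::readland.tps() remains the appropriate way to import the coordinates.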

To load the TPS file in R:

## Load TPS file:
rats <- geomorph::readland.tps(
  file = "./data/rats.TPS",
  specID = "imageID"
)

No curves detected; all points appear to be fixed landmarks.

As we can see, rats is an array:

class(rats)
[1] "array"

We can display the dimension of this array:

dim(rats)
[1] 103   2 159

This means that we have 159 individuals in total, each of them with 103 bi-dimensional landmarks.

Finally, we can display the coordinates of the first five landmarks for the first individual:

print(rats[1:5, , 1])
         [,1]     [,2]
[1,] 0.421344 0.985446
[2,] 0.754677 0.887040
[3,] 0.943173 1.049895
[4,] 1.208592 1.128897
[5,] 1.378377 1.145529
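
Since the first 15 rows of each individual are the fixed landmarks and the remaining 88 rows are the semilandmarks, both sets can be extracted by indexing the first dimension of the array. The sketch below uses a toy array of random values with the same layout (103 points, 2 coordinates) but only 4 hypothetical individuals.

```r
## Toy array with the layout described in the text: 103 points x 2 coordinates,
## here with 4 made-up individuals filled with random values.
toy <- array(rnorm(103 * 2 * 4), dim = c(103, 2, 4))

fixed <- toy[1:15, , , drop = FALSE]    # the 15 fixed landmarks
semi  <- toy[16:103, , , drop = FALSE]  # the 88 semilandmarks
dim(fixed)
dim(semi)
```

Using drop = FALSE keeps the result as a three-dimensional array even if a single individual or a single landmark is selected.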

5 Loading CSV files

Finally, we have some metadata (sex, age, weight, …) about the individuals. These data are stored in a CSV file, which is the recommended format for loading “usual” flat data in R. They can be loaded using the built-in read.csv2() function, which by default expects semicolon-separated fields and decimal commas:

## Load CSV file:
meta <- read.csv2(
  file = "./data/rats.csv",
  row.names = 1,
  stringsAsFactors = TRUE,
  na.strings = ""
)
## Summarise the dataframe:
summary(meta)
          Species                   Location  Season  
 Rattus rattus:159   Dry forest         :38   Dry:67  
                     Field              : 6   Wet:92  
                     Hygrophilous forest:51           
                     Mesophilic forest  :49           
                     Swamp forest       :15           
                                                      
     Weight        Body_length     Tail_length         Age         
 Min.   : 31.00   Min.   :112.0   Min.   :143.0   Min.   :  37.10  
 1st Qu.: 62.00   1st Qu.:145.0   1st Qu.:191.5   1st Qu.:  87.95  
 Median : 98.00   Median :160.0   Median :210.0   Median : 225.60  
 Mean   : 94.25   Mean   :161.3   Mean   :204.7   Mean   : 227.04  
 3rd Qu.:120.00   3rd Qu.:176.5   3rd Qu.:220.0   3rd Qu.: 306.10  
 Max.   :173.00   Max.   :205.0   Max.   :255.0   Max.   :1015.60  
       Age_class  Sex   
 Adult      :88   F:92  
 Juvenile   : 5   M:67  
 Older adult: 4         
 Sub-adult  :59         
 NA's       : 3         
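
As a small illustration of these import settings, the sketch below reads a toy file written on the fly, with semicolon separators, decimal commas and one empty cell. The column names and values are made up and much simpler than those of rats.csv.

```r
## Toy semicolon-separated file with decimal commas and one empty cell.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "ID;Weight;Age_class",
  "rat1;31,5;Juvenile",
  "rat2;98,0;Adult",
  "rat3;120,25;"
), csv_path)

toy_csv <- read.csv2(
  file = csv_path,
  row.names = 1,            # first column holds the identifiers
  stringsAsFactors = TRUE,  # character columns become factors
  na.strings = ""           # empty cells are read as missing values
)
str(toy_csv)
unlink(csv_path)            # remove the temporary file
```

With na.strings = "", the empty age class of the third individual is imported as NA rather than as an empty factor level.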
                        

One useful thing, for subsequent statistical analyses, may be to convert the still unordered factor Age_class into an ordered factor. By default, the levels (i.e., age classes) have no specific order and appear in alphabetical order in all plots, which is clearly not convenient here.

## Convert age classes into an ordered factor:
meta$Age_class <- factor(
  meta$Age_class,
  ordered = TRUE,
  levels = c("Juvenile", "Sub-adult", "Adult", "Older adult")
)
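
The effect of this conversion can be seen on a toy vector (the values below are made up, using the same four age classes): the default factor sorts its levels alphabetically, whereas the ordered factor follows the biological order and supports comparisons.

```r
## Toy vector of age classes (made-up values).
age <- c("Adult", "Juvenile", "Sub-adult", "Older adult", "Adult")

unordered_age <- factor(age)
levels(unordered_age)       # alphabetical order by default

ordered_age <- factor(
  age,
  ordered = TRUE,
  levels = c("Juvenile", "Sub-adult", "Adult", "Older adult")
)
levels(ordered_age)         # biological order
ordered_age < "Adult"       # comparisons now make sense
table(ordered_age)          # counts are listed in the chosen order
```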

As a (very strongly recommended!) good practice, we should make sure that this dataframe includes the same number of individuals as the TPS file, and that they are given in the same order.

## Check dimension of CSV file:
dim(meta)
[1] 159   9
## Check consistency of ordering between TPS and CSV:
all(dimnames(rats)[[3]] == rownames(meta))
[1] TRUE
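
If the identifiers of the landmark array and the row names of the metadata contain the same individuals but in a different order, one possible fix is to reorder the metadata with match(). The sketch below uses toy objects with hypothetical identifiers (rat_a, rat_b, rat_c), not the course data.

```r
## Toy array with three named individuals, and metadata in a different order.
toy_shapes <- array(
  0, dim = c(2, 2, 3),
  dimnames = list(NULL, NULL, c("rat_a", "rat_b", "rat_c"))
)
meta_toy <- data.frame(
  Sex = c("M", "F", "F"),
  row.names = c("rat_c", "rat_a", "rat_b")
)

ids <- dimnames(toy_shapes)[[3]]
stopifnot(setequal(ids, rownames(meta_toy)))   # same set of identifiers
## Reorder the rows of the metadata to follow the order of the array:
meta_toy <- meta_toy[match(ids, rownames(meta_toy)), , drop = FALSE]
identical(rownames(meta_toy), ids)
```

The setequal() check stops early if an individual is missing from either object, a case that reordering alone cannot repair.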

Footnotes

  1. I.e., the function read.csv.folder() from the package {Morpho}. In R, we can call a function with the general syntax package::function().