PR2 version 4.11.0
Eukref Ciliate Integration

Load the variables common to the different scripts and the necessary libraries

  source('PR2_init.R', echo=FALSE)
  knitr::opts_chunk$set(eval=TRUE)

1 Introduction

1.1 Main changes to PR2

  • Update Ciliate with database provided by Eukref (except for Spirotrichea that had been done before)

1.2 References used

  • Boscaro, V., Santoferrara, L.F., Zhang, Q., Gentekaki, E., Syberg-Olsen, M.J., del Campo, J. & Keeling, P.J. 2018. EukRef-Ciliophora: A manually curated, phylogeny-based database of small subunit rRNA gene sequences of ciliates. Environ. Microbiol.

1.3 Data used

Type Number of sequences Remark
Number of PR2 sequences updated. 4577
Number of PR2 new sequences added 2489
Number of PR2 sequences not updated and removed 652

1.4 Summary of the data

See the worksheet attached EukRef Ciliates 2018.xlsx

Sheet Contains To do
pr2_ciliates_not_updated PR2 sequences assigned as sequences that have not been annotated by EukRef These sequences will be removed from PR2
pr2_new_metadata Metadata for new sequences Need to add end for sequences that include ITS
pr2_update_taxo Summary of edited taxonomy with number of sequences OK
eukref_processed All changes made based on the edited taxonomy as well as to medata OK

1.5 Comments on the EukRef database

  • The EUKREF.tsv file contains some non-ASCII caracters that do not import well in Excel. Must use tsv importer instead. Please make sure that your file is encoded as UTF-8 and does not contain any non-ASCII caracters.

  • PR2 has only 8 levels, so I did the following selection which seems the most logical
    • kingdom = Taxo0, supergroup = Taxo2, division = Taxo3 , class = Taxo6, order= Taxo7, family = Taxo10, genus = Taxo12 (for clones only)
  • For cultures, I created the species and the genus fields based on the Eukref_name field. However for the genera that have been shown to be polyphyletic I used the Taxo12 field for the genus and Taxo12_sp. for the species field.

  • I changed some of the rank names to follow PR2.
    • I replaced the Taxon[i] by Taxon_i
    • No taxon name can appear in two different columns.
      • For example Ciliophora Armophorea Armophorea Armophorea Armophorea Armophorea_sp.
      • is replaced by
      • Ciliophora Armophorea Armophorea_X Armophorea_XX Armophorea_XXX Armophorea_XXX_sp
  • Protocruzia has been placed in Spirotrichea by Charles and Wei Ting and in Ciliophora_X by you… What is correct ? OK corrected to Ciliophora_X and included in update

  • If more than one ENVO keywords, separate by ; not by and or or

1.6 Small Problems found in the EukRef file

All these problems have been corrected in the EUKREF-CILIOPHORA_corrected_v2.0.tsv file

  • Some sequences do not have GenBank accession. They correspond to PDB sequences and have not been included since the species Tetrahymena thermophila has other sequences associated.
4BPE_A    Eukaryota    SAR    Alveolata    Ciliophora    Intramacronucleata    CONthreeP    Oligohymenophorea    Hymenostomatia    Tetrahymenida    Tetrahymenida    Tetrahymenida    Tetrahymenida    Tetrahymenida    Tetrahymena thermophila     
4BPN_A    Eukaryota    SAR    Alveolata    Ciliophora    Intramacronucleata    CONthreeP    Oligohymenophorea    Hymenostomatia    Tetrahymenida    Tetrahymenida    Tetrahymenida    Tetrahymenida    Tetrahymenida    Tetrahymena thermophila     
4BPO_A    Eukaryota    SAR    Alveolata    Ciliophora    Intramacronucleata    CONthreeP    Oligohymenophorea    Hymenostomatia    Tetrahymenida    Tetrahymenida    Tetrahymenida    Tetrahymenida    Tetrahymenida    Tetrahymena thermophila     
4BPP_A    Eukaryota    SAR    Alveolata    Ciliophora    Intramacronucleata    CONthreeP    Oligohymenophorea    Hymenostomatia    Tetrahymenida    Tetrahymenida    Tetrahymenida    Tetrahymenida    Tetrahymenida    Tetrahymena thermophila        
2XZM_A    Eukaryota    SAR    Alveolata    Ciliophora    Intramacronucleata    CONthreeP    Oligohymenophorea    Hymenostomatia    Tetrahymenida    Tetrahymenida    Tetrahymenida    Tetrahymenida    Tetrahymenida    Tetrahymena thermophila  
2XZN_A    Eukaryota    SAR    Alveolata    Ciliophora    Intramacronucleata    CONthreeP    Oligohymenophorea    Hymenostomatia    Tetrahymenida    Tetrahymenida    Tetrahymenida    Tetrahymenida    Tetrahymenida    Tetrahymena thermophila   
  • One sequence has a wrong name in the 0 column “Frontonia_lynni” (should be Eukaryota) OK corrected in Text file

    DQ190463.3    Frontonia_lynni    SAR    Alveolata    Ciliophora    Intramacronucleata    CONthreeP    Oligohymenophorea    Peniculia[1]    Peniculida    Peniculida    Frontoniidae[1]    Frontoniidae[1]    Frontoniidae[1]    Frontonia lynni  
  • I found one error in the taxonomy : All the Dysteriidae are in Cyrtophoria_8 but one is Cyrtophoria_7 OK corrected in Text file

LC031485    Eukaryota    Alveolata    Ciliophora    Phyllopharyngea    **Cyrtophoria_7**    Dysteriidae    Dysteriidae_X    Dysteriidae_X_sp.    SS 284

Should be ?

LC031485    Eukaryota    Alveolata    Ciliophora    Phyllopharyngea    **Cyrtophoria_8**    Dysteriidae    Dysteriidae_X    Dysteriidae_X_sp.    SS 284
  • You have different groups OLIGO1 to 10 that are in columns Taxo10 to Taxo12
  • However you have some other groups called OLIGO in Taxo12 and which have Operculariidae[1] in the Taxo10 column and that go up to OLIGO18. These groups are not mentionned in the Note file (ciliophora_notes.md). Are these OLIGO the same thing that the OLIGO that appear in the Taxo 1 to 12 OK corrected in text file

Lines with OLIGO in the Taxo10 to Taxo12

genbank_accession   Taxo0   Taxo1   Taxo2   Taxo3   Taxo4   Taxo5   Taxo6   Taxo7   Taxo8   Taxo9   Taxo10  Taxo11  Taxo12  eukref_name
HQ164099.1  Eukaryota   SAR Alveolata   Ciliophora  Intramacronucleata  CONthreeP   Oligohymenophorea   Peritrichia[2]  Sessilida   Sessilida   OLIGO2  OLIGO2  OLIGO2  PR-53
LC150080.1  Eukaryota   SAR Alveolata   Ciliophora  Intramacronucleata  CONthreeP   Oligohymenophorea   Peritrichia[2]  Sessilida   Sessilida   OLIGO2  OLIGO2  OLIGO2  AS_seed_33
LC150125.1  Eukaryota   SAR Alveolata   Ciliophora  Intramacronucleata  CONthreeP   Oligohymenophorea   Peritrichia[2]  Sessilida   Sessilida   OLIGO2  OLIGO2  OLIGO2  A21_ek_8
LC150193.1  Eukaryota   SAR Alveolata   Ciliophora  Intramacronucleata  CONthreeP   Oligohymenophorea   Peritrichia[2]  Sessilida   Sessilida   OLIGO2  OLIGO2  OLIGO2  B21_ps_6
LC150194.1  Eukaryota   SAR Alveolata   Ciliophora  Intramacronucleata  CONthreeP   Oligohymenophorea   Peritrichia[2]  Sessilida   Sessilida   OLIGO2  OLIGO2  OLIGO2  B21_ps_7

Lines with Operculariidae[1] in Taxo 11 and going up to OLIGO18

genbank_accession   Taxo0   Taxo1   Taxo2   Taxo3   Taxo4   Taxo5   Taxo6   Taxo7   Taxo8   Taxo9   Taxo10  Taxo11  Taxo12  eukref_name
    QCBT2ZC111
GU922676.1  Eukaryota   SAR Alveolata   Ciliophora  Intramacronucleata  CONthreeP   Oligohymenophorea   Peritrichia[2]  Sessilida   Sessilida   Operculariidae[1]   OLIGO10 OLIGO10 F5K2Q4C04H79Z6
GU922807.1  Eukaryota   SAR Alveolata   Ciliophora  Intramacronucleata  CONthreeP   Oligohymenophorea   Peritrichia[2]  Sessilida   Sessilida   Operculariidae[1]   OLIGO10 OLIGO10 F5K2Q4C04JDJJY
GU922677.1  Eukaryota   SAR Alveolata   Ciliophora  Intramacronucleata  CONthreeP   Oligohymenophorea   Peritrichia[2]  Sessilida   Sessilida   Operculariidae[1]   OLIGO11 OLIGO11 F5K2Q4C04JEMU6
GU923387.1  Eukaryota   SAR Alveolata   Ciliophora  Intramacronucleata  CONthreeP   Oligohymenophorea   Peritrichia[2]  Sessilida   Sessilida   Operculariidae[1]   OLIGO11 OLIGO11 F5K2Q4C04JEO8B
GU922798.1  Eukaryota   SAR Alveolata   Ciliophora  Intramacronucleata  CONthreeP   Oligohymenophorea   Peritrichia[2]  Sessilida   Sessilida   Operculariidae[1]   OLIGO12 OLIGO12 F5K2Q4C04I6ZU6
KY355481.1  Eukaryota   SAR Alveolata   Ciliophora  Intramacronucleata  CONthreeP   Oligohymenophorea   Peritrichia[2]  Sessilida   Sessilida   Operculariidae[1]   OLIGO12 OLIGO12 QCCA2ZA071
GU922902.1  Eukaryota   SAR Alveolata   Ciliophora  Intramacronucleata  CONthreeP   Oligohymenophorea   Peritrichia[2]  Sessilida   Sessilida   Operculariidae[1]   OLIGO13 OLIGO13 F5K2Q4C04IVM5V
GU922989.1  Eukaryota   SAR Alveolata   Ciliophora  Intramacronucleata  CONthreeP   Oligohymenophorea   Peritrichia[2]  Sessilida   Sessilida   Operculariidae[1]   OLIGO14 OLIGO14 F5K2Q4C04H6E06
GU923005.1  Eukaryota   SAR Alveolata   Ciliophora  Intramacronucleata  CONthreeP   Oligohymenophorea   Peritrichia[2]  Sessilida   Sessilida   Operculariidae[1]   OLIGO15 OLIGO15 F5K2Q4C04IA55J
GU923014.1  Eukaryota   SAR Alveolata   Ciliophora  Intramacronucleata  CONthreeP   Oligohymenophorea   Peritrichia[2]  Sessilida   Sessilida   Operculariidae[1]   OLIGO16 OLIGO16 F5K2Q4C04I4F5W
GU923146.1  Eukaryota   SAR Alveolata   Ciliophora  Intramacronucleata  CONthreeP   Oligohymenophorea   Peritrichia[2]  Sessilida   Sessilida   Operculariidae[1]   OLIGO17 OLIGO17 F5K2Q4C04IMU19
GU923168.1  Eukaryota   SAR Alveolata   Ciliophora  Intramacronucleata  CONthreeP   Oligohymenophorea   Peritrichia[2]  Sessilida   Sessilida   Operculariidae[1]   OLIGO18 OLIGO18 F5K2Q4C04IA8IG
GU919118.1  Eukaryota   SAR Alveolata   Ciliophora  Intramacronucleata  CONthreeP   Oligohymenophorea   Peritrichia[2]  Sessilida   Sessilida   Operculariidae[1]   OLIGO2  OLIGO2  F5K2Q4C04H224O
GU919396.1  Eukaryota   SAR Alveolata   Ciliophora  Intramacronucleata  CONthreeP   Oligohymenophorea   Peritrichia[2]  Sessilida   Sessilida   Operculariidae[1]   OLIGO2  OLIGO2  
  • You have also some lines

EF586159.1  Eukaryota   SAR Alveolata   Ciliophora  Intramacronucleata  CONthreeP   Oligohymenophorea   Peritrichia[2]  Sessilida   Sessilida   Operculariidae[1]   Operculariidae[1]   Operculariidae[1]   384-O6
AF401525.1  Eukaryota   SAR Alveolata   Ciliophora  Intramacronucleata  CONthreeP   Oligohymenophorea   Peritrichia[2]  Sessilida   Sessilida   Operculariidae[1]   Operculariidae[1]   Operculariidae[1]   Opercularia microdiscum
KU363257.1  Eukaryota   SAR Alveolata   Ciliophora  Intramacronucleata  CONthreeP   Oligohymenophorea   Peritrichia[2]  Sessilida   Sessilida   Operculariidae[1]   Operculariidae[1]   Operculariidae[1]   102operCul
KU363267.1  Eukaryota   SAR Alveolata   Ciliophora  Intramacronucleata  CONthreeP   Oligohymenophorea   Peritrichia[2]  Sessilida   Sessilida   Operculariidae[1]   Operculariidae[1]   Operculariidae[1]   121operCul
HM627238.1  Eukaryota   SAR Alveolata   Ciliophora  Intramacronucleata  CONthreeP   Oligohymenophorea   Peritrichia[2]  Sessilida   Sessilida   Operculariidae[1]   Operculariidae[1]   Operculariidae[1]   Opercularia allensi
HM627239.1  Eukaryota   SAR Alveolata   Ciliophora  Intramacronucleata  CONthreeP   Oligohymenophorea   Peritrichia[2]  Sessilida   Sessilida   Operculariidae[1]   Operculariidae[1]   Operculariidae[1]   Opercularia sp. 1 LRPU-2010

OK corrected in text file

  * area of moss-dominated vegetation   ENVO:01000890
  * cloaca  UBERON:0000162
  * stomach UBERON:0000945
  * hydrotermal vent    ENVO:00000215
  * caecum  UBERON:0001153
  * colon   UBERON:0001155
  * rhizosphere ENVO:00005801
  * ruminal fluid   ENVO:0010228 - Could not find this term (replace by stomach ?)

2 Read pr2

  • Use the PR2 on local computer
# Info for the local dabase
pr2_db <- db_info("pr2_local")

pr2_db_con <- db_connect(pr2_db)
pr2_main <- tbl(pr2_db_con, "pr2_main") %>% collect()
# pr2_main <- pr2_main %>% filter (is.na(removed_version))

pr2_seq <- tbl(pr2_db_con, "pr2_sequences") %>% collect()

pr2_taxo <- tbl(pr2_db_con, "pr2_taxonomy")
pr2_taxo <- pr2_taxo %>% filter(is.na(taxo_removed_version)) %>% collect()

pr2_metadata <- tbl(pr2_db_con, "pr2_metadata") %>% collect()

# Join the tables and keep only sequences that are not removed

pr2 <- left_join(pr2_main, pr2_taxo, by = c(species = "species"))
pr2 <- left_join(pr2, pr2_seq)
pr2 <- left_join(pr2, pr2_metadata)
db_disconnect(pr2_db_con)
[1] TRUE
pr2 <- pr2 %>% filter(is.na(removed_version))
pr2_sample_types <- pr2 %>% group_by(pr2_sample_type) %>% count()

3 Ciliate database - Eukref

  • Round 1 - 2018-07
  • Round 2 - 2018-10

3.1 Read the data and reformat

3.1.1 Read the data

  • Read the Tab delimited file because problems with strange characters and CR in some fields
  • The first line of the file is skipped, the second line contains the name of the fields to be used in PR2
  • The last column is empty and discarded (named “empty”)
  • Number of sequences = 11622
dir_pr2_update <- "../updates/2018 Ciliates Eukref"
pr2.env$editor <- "Eukref - Boscaro V."

full_path <- function(file_name) {
    str_c(dir_pr2_update, "/", file_name)
}

file_pr2_update_excel <- full_path("EukRef Ciliates 2018 2.0.xlsx")
# Import the tsv skipping the first line

pr2_update_raw <- read_tsv(full_path("EUKREF-CILIOPHORA_corrected_v2.0.tsv"), 
    col_names = TRUE, guess_max = 2e+05, na = c("", "NA"), skip = 1)

problems(pr2_update_raw)
# tibble [0 x 4]
# ... with 4 variables: row <int>, col <int>, expected <chr>, actual <chr>
spec(pr2_update_raw)
cols(
  genbank_accession = col_character(),
  Taxo0 = col_character(),
  Taxo1 = col_character(),
  Taxo2 = col_character(),
  Taxo3 = col_character(),
  Taxo4 = col_character(),
  Taxo5 = col_character(),
  Taxo6 = col_character(),
  Taxo7 = col_character(),
  Taxo8 = col_character(),
  Taxo9 = col_character(),
  Taxo10 = col_character(),
  Taxo11 = col_character(),
  Taxo12 = col_character(),
  eukref_name = col_character(),
  eukref_source = col_character(),
  eukref_env_material = col_character(),
  eukref_env_biome = col_character(),
  eukref_biotic_relationship = col_character(),
  eukref_specific_host = col_character(),
  eukref_geo_loc_name = col_character(),
  eukref_publication = col_character(),
  eukref_authors = col_character(),
  eukref_journal = col_character(),
  eukref_notes = col_character(),
  gb_taxonomy = col_character(),
  gb_environment = col_character(),
  gb_host = col_character(),
  gb_country = col_character(),
  gb_publication = col_character(),
  gb_authors = col_character(),
  gb_journal = col_character(),
  empty = col_character()
)
# Remove last column which is empty
pr2_update_raw <- select(pr2_update_raw, -empty)

str_c("Number of sequences : ", nrow(pr2_update_raw))
[1] "Number of sequences : 11620"
  • Rename the columns as a function of the column selected for taxonomy. The “genus” column (Taxo12) is not yet selected because it will be constructed latter.
  • Remove ending digits from the accession number
  • Filter out accession number that starts by a number (sequences not from GenBank but from PDB)
  • Number of sequences with valid GenBank number : 11616
pr2_update <- pr2_update_raw %>% rename(kingdom = Taxo0, supergroup = Taxo2, 
    division = Taxo3, class = Taxo6, order = Taxo7, family = Taxo10) %>% mutate(genbank_accession = str_replace(genbank_accession, 
    "[.][0-9]+", "")) %>% filter(str_detect(genbank_accession, "^[A-Z]+"))
str_c("Number of sequences with valid GenBank number : ", nrow(pr2_update))
[1] "Number of sequences with valid GenBank number : 11614"
  • Remove Spirotrichea
pr2_update <- pr2_update %>% filter(class != "Spirotrichea")
str_c("Number of sequences after removing Spirotrichea : ", nrow(pr2_update))
[1] "Number of sequences after removing Spirotrichea : 7060"

3.1.2 Build the genus and species field

eukref_name genus species Remark
Genus sp. taxo12 taxo12_sp. This case corresponds to clones…
Genus cf. species Genus Genus_species
Genus species Genus Genus_cf_species
other cases taxo12 taxo12_sp.

Operations must be done in this order because of not, then case #3 will also find Genus _sp. 1. Locate in column eukref_name “Genus sp.”
2. Locate in eukref_name “Genus cf. species” 3. Locate in eukref_name “Genus species” 4. For the rest using the Taxo12 column

# Select only relevant columns
pr2_taxo_update <- pr2_update %>% select_(.dots = c("genbank_accession", pr2.env$taxo_levels[1:6]), 
    "Taxo12", "eukref_name")

# Case 1
pr2_taxo_update1 <- pr2_taxo_update %>% filter(str_detect(eukref_name, pr2.env$genus_sp_regex)) %>% 
    mutate(genus = Taxo12, species = str_c(Taxo12, "_sp."))

pr2_taxo_update <- pr2_taxo_update %>% filter(!(genbank_accession %in% pr2_taxo_update1$genbank_accession))

str_c("Number of sequences Genus sp. : ", nrow(pr2_taxo_update1))
[1] "Number of sequences Genus sp. : 495"
# Case 2
pr2_taxo_update2 <- pr2_taxo_update %>% filter(str_detect(eukref_name, pr2.env$genus_cf_regex)) %>% 
    mutate(species = str_extract(eukref_name, pr2.env$genus_cf_regex)) %>% mutate(genus = str_replace(species, 
    pr2.env$genus_cf_regex, "\\1")) %>% mutate(species = str_replace(species, 
    pr2.env$genus_cf_regex, "\\1_cf._\\2"))

pr2_taxo_update <- pr2_taxo_update %>% filter(!(genbank_accession %in% pr2_taxo_update2$genbank_accession))

str_c("Number of sequences Genus cf. : ", nrow(pr2_taxo_update2))
[1] "Number of sequences Genus cf. : 8"
# Case 3
pr2_taxo_update3 <- pr2_taxo_update %>% filter(str_detect(eukref_name, pr2.env$genus_species_regex)) %>% 
    mutate(species = str_extract(eukref_name, pr2.env$genus_species_regex)) %>% 
    mutate(genus = str_replace(species, pr2.env$genus_species_regex, "\\1")) %>% 
    mutate(species = str_replace(species, pr2.env$genus_species_regex, "\\1_\\2"))

pr2_taxo_update <- pr2_taxo_update %>% filter(!(genbank_accession %in% pr2_taxo_update3$genbank_accession))

str_c("Number of sequences Genus species. : ", nrow(pr2_taxo_update3))
[1] "Number of sequences Genus species. : 1315"
# Case 4
pr2_taxo_update4 <- pr2_taxo_update %>% mutate(genus = Taxo12, species = str_c(Taxo12, 
    "_sp."))
str_c("Number of sequences with no species info : ", nrow(pr2_taxo_update4))
[1] "Number of sequences with no species info : 5242"
# Merge the four data frames

pr2_taxo_update <- bind_rows(pr2_taxo_update1, pr2_taxo_update2, pr2_taxo_update3, 
    pr2_taxo_update4) %>% select_(.dots = c("genbank_accession", pr2.env$taxo_levels[1:8]), 
    "Taxo12", "eukref_name")

# write_tsv(pr2_taxo_update, full_path'junk.txt'))

3.1.3 Correct specific cases

  • Modify the genus and species for genus that are split as Genus_1 and Genus_2
  • Modify the genus and species that are split into 2 different lineages upstream. Two genera are created _1 and _2
  • One error in the order “Cyrtophoria_7” instead of" Cyrtophoria_8"
  • Remove all the different “Neobalantidium_coli_1, _2 etc…"
  • One family “Schizocaryidae” (single sequence) changed to “Philasterida”
  • “AB074079” -> “Chilodonella_sp.
  • When a taxon has a subscript [1] and there are not taxa with [2] then remove the [1]

Note the following problem has been corrected in the file itself (problem of OLIGO1 that was OLIGO2… 13)

  • Family Operculariidae and genus containing OLIGO, then the genus is named with the family followed by _X
  • idem for species
# List of genera that are split between 2 lineages but not named [1] and [2]
# genus_split <- 'Triadinium|Zoothamnium'

pr2_taxo_update <- pr2_taxo_update %>% mutate_all(funs(str_replace(., "Metopidae\\[1\\]", 
    "Metopidae"))) %>% mutate_all(funs(str_replace(., "Metopus\\[1\\]", "Metopus"))) %>% 
    mutate_all(funs(str_replace(., "Nyctotheridae\\[1\\]", "Nyctotheridae"))) %>% 
    mutate_all(funs(str_replace(., "Spathidiida\\[1\\]", "Spathidiida"))) %>% 
    mutate_all(funs(str_replace(., "Spathidiidae\\[1\\]", "Spathidiidae"))) %>% 
    mutate_all(funs(str_replace(., "Blepharocorythidae\\[1\\]", "Blepharocorythidae"))) %>% 
    mutate_all(funs(str_replace(., "Operculariidae\\[1\\]", "Operculariidae"))) %>% 
    mutate_all(funs(str_replace(., "Zoothamniidae\\[1\\]", "Zoothamniidae"))) %>% 
    mutate_all(funs(str_replace(., "Pleuronematida\\[1\\]", "Pleuronematida"))) %>% 
    mutate_all(funs(str_replace(., "Peniculia\\[1\\]", "Peniculia"))) %>% mutate_all(funs(str_replace(., 
    "Balantidium\\[1\\]", "Balantidium"))) %>% mutate(order = case_when(str_detect(family, 
    "Dysteriidae") ~ "Cyrtophoria_8", TRUE ~ order), family = case_when(str_detect(family, 
    "Schizocaryidae") ~ "Philasterida", str_detect(genus, "Caenomorphidae") ~ 
    "Caenomorphidae", TRUE ~ family), genus = case_when(str_detect(genus, "Neobalantidium") ~ 
    "Neobalantidium", str_detect(Taxo12, "Dysteria\\[|Pleuronema\\[|Trichodina\\[|Entodinium\\[") ~ 
    Taxo12, str_detect(genus, "Tardigrada") ~ "Scuticociliatia_2_XX", str_detect(genus, 
    "Uncultured") ~ "Philasterida_X", str_detect(genus, "Arcuospathidium") ~ 
    case_when(str_detect(family, "Litostomatea") ~ "Arcuospathidium_1", TRUE ~ 
        "Arcuospathidium_2"), str_detect(genus, "Balantidium") ~ case_when(str_detect(family, 
    "Trichostomatia") ~ "Balantidium_1", TRUE ~ "Balantidium_2"), str_detect(genus, 
    "Cyclidium") ~ case_when(str_detect(family, "Oligohymenophorea") ~ "Cyclidium_1", 
    str_detect(family, "Pleuronematida") ~ "Cyclidium_2", TRUE ~ "Cyclidium_3"), 
    str_detect(genus, "Epispathidium") ~ case_when(str_detect(family, "Spathidiida") ~ 
        "Epispathidium_1", TRUE ~ "Epispathidium_2"), str_detect(genus, "Frontonia") ~ 
        case_when(str_detect(family, "Frontoniidae") ~ "Frontonia_1", TRUE ~ 
            "Frontonia_2"), str_detect(genus, "Metopus") ~ case_when(str_detect(family, 
        "Metopidae") ~ "Metopus_1", TRUE ~ "Metopus_2"), str_detect(genus, "Nyctotherus") ~ 
        case_when(str_detect(family, "Nyctotheridae") ~ "Nyctotherus_1", TRUE ~ 
            "Nyctotherus_2"), str_detect(genus, "Spathidium") ~ case_when(str_detect(family, 
        "Litostomatea") ~ "Spathidium_1", TRUE ~ "Spathidium_2"), str_detect(genus, 
        "Triadinium") ~ case_when(str_detect(family, "Buetschliidae") ~ "Triadinium_1", 
        TRUE ~ "Triadinium_2"), str_detect(genus, "Zoothamnium") ~ case_when(str_detect(family, 
        "Sessilida") ~ "Zoothamnium_1", TRUE ~ "Zoothamnium_2"), genbank_accession == 
        "AB074079" ~ "Chilodonella", genbank_accession == "HM140400" ~ family, 
    TRUE ~ genus), species = case_when(str_detect(Taxo12, "Dysteria\\[|Pleuronema\\[|Trichodina\\[|Entodinium\\[") ~ 
    str_c(Taxo12, "_sp."), str_detect(genus, "Caenomorphidae") ~ "Caenomorphidae", 
    genbank_accession == "AB074079" ~ "Chilodonella_sp.", str_detect(genus, 
        "Tardigrada") ~ "Scuticociliatia_2_XX_sp.", str_detect(genus, "Uncultured") ~ 
        "Philasterida_X_sp.", str_detect(species, "Spathidium_muscicola") ~ 
        case_when(str_detect(family, "Litostomatea") ~ "Spathidium_1_muscicola", 
            TRUE ~ "Spathidium_2_muscicola"), str_detect(species, "Frontonia_elegans") ~ 
        case_when(str_detect(family, "Frontoniidae") ~ "Frontonia_1_elegans", 
            TRUE ~ "Frontonia_2_elegans"), str_detect(species, "Triadinium_caudatum") ~ 
        case_when(str_detect(family, "Buetschliidae") ~ "Triadinium_1_caudatum", 
            TRUE ~ "Triadinium_2_caudatum"), TRUE ~ species))
# write_tsv(pr2_taxo_update, full_path'junk2.txt')

3.1.4 Correct taxonomy names to follow PR2 rules

  • Taxon[1] changed to Taxon_1
pr2_taxo_update <- pr2_taxo_update %>% mutate_all(funs(str_replace(., "\\[(\\d+)\\]", 
    "_\\1")))

3.1.5 Replace the taxons that are repeated across columns by Taxa_X, Taxa_XX

  • If ranks i and i+1 are similar then add _X to the taxonomy
pr2_taxo_update <- pr2_taxo_X(pr2_taxo_update, pr2.env$taxo_levels)

3.1.6 Merge back the assignation into the initial table

pr2_update_2 <- left_join(select(pr2_taxo_update, -Taxo12, -eukref_name), select(pr2_update, 
    genbank_accession, eukref_name:gb_journal))

3.2 Construct taxonomy and compare to existing taxonomy

3.2.1 Construct a taxonomy file

  • Based on the taxonomy analysis the data are corrected upstream until no more duplicates
# Construct the taxonomy table
pr2_taxo_update_summary <- pr2_taxo_update %>% group_by_(.dots = pr2.env$taxo_levels) %>% 
    summarise(sequence_number = n()) %>% ungroup() %>% arrange_(.dots = pr2.env$taxo_levels)

write_tsv(pr2_taxo_update_summary, full_path("Ciliates_taxo_summary.txt"))

# Check it for duplicate names and duplicate parents
pr2_taxo_check(select(pr2_taxo_update_summary, kingdom:species), dir_pr2_update)

# Extract the sequence lines for which species are duplicates from the
# pr2_taxo file
duplicate_species <- read_tsv(full_path("taxo_species_duplicates.txt"))
pr2_taxo_duplicate_species <- filter(pr2_taxo_update, species %in% duplicate_species$species)
write_tsv(pr2_taxo_duplicate_species, full_path("Ciliates_duplicate_species.txt"))

3.2.2 Compare the new taxonomy to what was existing before

# Compare to existing taxonomy
pr2_update_taxo_compared <- left_join(pr2_taxo_update_summary, pr2_taxo, by = c(species = "species")) %>% 
    rename_(kingdom = "kingdom.x", supergroup = "supergroup.x", division = "division.x", 
        class = "class.x", order = "order.x", family = "family.x", genus = "genus.x") %>% 
    arrange(class, order, family, genus, species)

write_tsv(pr2_update_taxo_compared, full_path("Ciliates_taxo_compared.txt"), 
    na = "")

pr2_taxo_updated <- pr2_update_taxo_compared %>% filter(!is.na(taxo_id))

pr2_taxo_added <- pr2_update_taxo_compared %>% filter(is.na(taxo_id))

pr2_taxo_not_updated <- pr2_taxo %>% filter((division == "Ciliophora") & (class != 
    "Spirotrichea") & !(species %in% pr2_taxo_update_summary$species))

write_tsv(pr2_taxo_not_updated, full_path("Ciliates_taxo_not_updated.txt"), 
    na = "")

str_c("Ciliates: number of taxa updated : ", nrow(pr2_taxo_updated))
[1] "Ciliates: number of taxa updated : 839"
str_c("Ciliates: number of taxa added : ", nrow(pr2_taxo_added))
[1] "Ciliates: number of taxa added : 1"
str_c("Ciliates: number of taxa not updated (removed) : ", nrow(pr2_taxo_not_updated))
[1] "Ciliates: number of taxa not updated (removed) : 100"

3.2.3 Verify any problem with the full pr2 taxonomy

pr2_taxo_full_check <- pr2_taxo %>% select(kingdom:species) %>% # filter(!((division == 'Ciliophora') & (class != 'Spirotrichea'))) %>%
bind_rows(select(pr2_taxo_update_summary, kingdom:species))
pr2_taxo_check(pr2_taxo_full_check, dir_taxo = "C:/Users/vaulot/Desktop")

3.3 Metadata

3.3.1 Clean up the metadata

  • Replace _ENVOxxx_ by (ENVOxxx)
  • Keep only one keyword (and/or)
  • Put everything in lower case except for ENVO
pr2_update_2 <- pr2_update_2 %>% mutate(eukref_env_material_corrected = str_replace(eukref_env_material, 
    "_(.+)_", "\\(\\1\\)")) %>% mutate(eukref_env_material_corrected = str_replace(eukref_env_material_corrected, 
    "and .+$|or .+$", "")) %>% mutate(eukref_env_material_corrected = str_to_lower(eukref_env_material_corrected)) %>% 
    mutate(eukref_env_material_corrected = str_replace(eukref_env_material_corrected, 
        "envo", "ENVO"))

pr2_update_2 <- pr2_update_2 %>% mutate(eukref_env_biome_corrected = str_replace(eukref_env_biome, 
    "_(.+)_", "\\(\\1\\)")) %>% mutate(eukref_env_biome_corrected = str_replace(eukref_env_biome_corrected, 
    "([a-z]{1})$", "\\1(ENVO missing)")) %>% mutate(eukref_env_biome_corrected = str_to_lower(eukref_env_biome_corrected)) %>% 
    mutate(eukref_env_biome_corrected = str_replace(eukref_env_biome_corrected, 
        "envo", "ENVO"))

pr2_update_2 <- pr2_update_2 %>% mutate(eukref_biotic_relationship_corrected = str_to_lower(eukref_biotic_relationship)) %>% 
    mutate(eukref_biotic_relationship_corrected = str_replace(eukref_biotic_relationship_corrected, 
        "free living", "free-living"))

3.3.2 Summarize fields

eukref_env_materials <- pr2_update_2 %>% group_by(eukref_env_material, eukref_env_material_corrected) %>% 
    count() %>% mutate(env_material = str_extract(eukref_env_material_corrected, 
    "^[a-z0-9- /()]+\\(ENVO"), ENVO_id = str_extract(eukref_env_material_corrected, 
    "\\(ENVO.+\\)")) %>% mutate(env_material = str_replace(env_material, "\\(ENVO", 
    ""), ENVO_id = str_replace_all(ENVO_id, "[()]", ""))

write_tsv(eukref_env_materials, full_path("eukref_env_material.txt"), na = "")

eukref_env_biomes <- pr2_update_2 %>% group_by(eukref_env_biome, eukref_env_biome_corrected) %>% 
    count() %>% mutate(env_biome = str_extract(eukref_env_biome_corrected, "^[a-z0-9- /()]+\\(ENVO"), 
    ENVO_id = str_extract(eukref_env_biome_corrected, "\\(ENVO.+\\)")) %>% mutate(env_biome = str_replace(env_biome, 
    "\\(ENVO", ""), ENVO_id = str_replace_all(ENVO_id, "[()]", ""))

write_tsv(eukref_env_biomes, full_path("eukref_env_biomes.txt"), na = "")


eukref_biotic_relationships <- pr2_update_2 %>% group_by(eukref_biotic_relationship_corrected) %>% 
    count()

eukref_source <- pr2_update_2 %>% group_by(eukref_source) %>% count()

3.3.3 Finalize fields

pr2_update_2 <- pr2_update_2 %>% mutate(eukref_env_material_pr2 = str_extract(eukref_env_material_corrected, 
    "^[a-z0-9- /()]+\\(ENVO")) %>% mutate(eukref_env_material_pr2 = str_replace(eukref_env_material_pr2, 
    "\\(ENVO", ""))

pr2_update_2 <- pr2_update_2 %>% mutate(eukref_env_biome_pr2 = str_extract(eukref_env_biome_corrected, 
    "^[a-z0-9- /()]+\\(ENVO")) %>% mutate(eukref_env_biome_pr2 = str_replace(eukref_env_biome_pr2, 
    "\\(ENVO", ""))

pr2_update_2 <- pr2_update_2 %>% mutate(eukref_biotic_relationship_pr2 = eukref_biotic_relationship_corrected)

3.3.4 Write the final Eukref table

write_tsv(pr2_update_2, full_path("Ciliates_Eukref_processed.txt"))

3.4 Check sequences against PR2

3.4.1 Sequences already in PR2

# Extract lines with accession number already in PR2, then one need just to
# update the species and add the eukref metadata
pr2_update_old <- pr2_update_2 %>% filter(genbank_accession %in% pr2$genbank_accession)
str_c("Number of sequences to update in PR2 : ", nrow(pr2_update_old))
[1] "Number of sequences to update in PR2 : 7028"
pr2_update_old_Spirotrichea <- pr2 %>% filter(genbank_accession %in% pr2_update_old$genbank_accession) %>% 
    filter(class == "Spirotrichea")
str_c("Number of sequences to update in PR2 that belonged before to Spirotrichea : ", 
    nrow(pr2_update_old_Spirotrichea))
[1] "Number of sequences to update in PR2 that belonged before to Spirotrichea : 0"
select(pr2_update_old_Spirotrichea, genbank_accession, class:genus, species)
# A tibble: 0 x 6
# ... with 6 variables: genbank_accession <chr>, class <chr>, order <chr>,
#   family <chr>, genus <chr>, species <chr>
pr2_update_old_main <- select(pr2_update_old, genbank_accession, species)
  • For the metadata use only the eukref fields.
  • For the sequences for which sample_type exists, keep the data, if not update with eukref_sourcefield.
pr2_update_old_metadata <- pr2_update_old %>% select(genbank_accession, eukref_name, 
    eukref_source, eukref_env_material = eukref_env_material_pr2, eukref_env_biome = eukref_env_biome_pr2, 
    eukref_biotic_relationship = eukref_biotic_relationship_pr2, eukref_specific_host, 
    eukref_geo_loc_name, eukref_publication, eukref_authors, eukref_journal, 
    eukref_notes, gb_country, gb_publication, gb_authors, gb_journal) %>% left_join(select(pr2, 
    genbank_accession, pr2_sample_type)) %>% mutate(pr2_sample_type = case_when(!(is.na(pr2_sample_type)) ~ 
    pr2_sample_type, TRUE ~ str_to_lower(eukref_source)))

3.4.2 Ciliates sequences from PR2 not updated by EukRef

These sequences are removed for the time being from PR2

pr2_ciliates <- pr2 %>% filter((division == "Ciliophora") & (class != "Spirotrichea"))

pr2_ciliates_not_updated <- pr2_ciliates %>% filter(!(genbank_accession %in% 
    pr2_update_2$genbank_accession))

str_c("Number of PR2 sequences not updated : ", nrow(pr2_ciliates_not_updated))
[1] "Number of PR2 sequences not updated : 0"
write_tsv(pr2_ciliates_not_updated, full_path("Ciliates_pr2_not_updated.txt"), 
    na = "")


# Are any of these sequences found in the Spirotrichea ?

pr2_wrongly_assigned <- pr2_ciliates_not_updated %>% filter((genbank_accession %in% 
    pr2_update_raw$genbank_accession))

str_c("Number of PR2 sequences that should be moved to Spirotrichea : ", nrow(pr2_wrongly_assigned))
[1] "Number of PR2 sequences that should be moved to Spirotrichea : 0"

3.4.3 New sequences not found in PR2

pr2_update_new <- pr2_update_2 %>% filter(!(genbank_accession %in% pr2$genbank_accession))
str_c("Number of new sequences : ", nrow(pr2_update_new))
[1] "Number of new sequences : 32"

3.4.3.1 Download the sequences and parse them

pr2_update_new_metadata <- genbank_download_parse(pr2_update_new$genbank_accession, 
    pr2.env$genbank_directory)

write_tsv(pr2_update_new_metadata, full_path("Ciliates_new_metadata.txt"), na = "")
knitr::kable(head(pr2_update_new_metadata, 5))

3.4.3.2 Extract features

Features that can be used to determine the start and end of the 18S - This is stored in a file.

pr2_update_new_features <- genbank_features(pr2_update_new$genbank_accession, 
    pr2.env$genbank_directory)
write_tsv(pr2_update_new_features, full_path("Ciliates_new_features.txt"), na = "")
knitr::kable(head(pr2_update_new_features, 3))

3.4.3.3 Look manually for sequence start and end for sequences that contain ITS

  • Export the sequence which contain ITS
  • Import into Geneious
  • Align and look for probe GTGAACCTGCRGAWGGATCA which marks the end of the 18S
  • Update manually start and end in the pr2_new_metadata sheet
pr2_update_new_metadata <- read_excel(full_path("EukRef Ciliates 2018 version 2.0.xlsx"), 
    sheet = "pr2_new_metadata", na = "NA", col_names = TRUE, guess_max = 2e+05)

pr2_update_new_ITS <- pr2_update_new_metadata %>% filter(is.na(end))
fasta_write(select(pr2_update_new_ITS, seq_name = genbank_accession, sequence), 
    file_name = full_path("sequences_with_ITS.fasta"), compress = FALSE, taxo_include = FALSE)

3.4.3.4 Reload the metadata sheet

Now the start and end are OK

pr2_update_new_metadata <- read_excel(full_path("EukRef Ciliates 2018 version 2.0.xlsx"), 
    sheet = "pr2_new_metadata", na = "NA", col_names = TRUE, guess_max = 2e+05)

3.4.3.5 Merge the accession, taxonomy, sequence and metadata

pr2_update_new <- left_join(select(pr2_update_new, -gb_host, -gb_environment, 
    -gb_country), pr2_update_new_metadata)

3.4.3.6 Extract sequences and get length and number of ambiguities

# Extract the subsequence if the whole sequence is not to be used
index <- !is.na(pr2_update_new$start)
pr2_update_new$sequence[index] <- str_sub(pr2_update_new$sequence[index], pr2_update_new$start[index], 
    pr2_update_new$end[index])
pr2_update_new$sequence_length_old <- pr2_update_new$sequence_length

# Update sequence length and ambiguities for all sequences since some may
# have been shortened
pr2_update_new$sequence_length <- str_length(pr2_update_new$sequence)
pr2_update_new$ambiguities <- str_count(pr2_update_new$sequence, pattern = pr2.env$ambig_regex)

# Remove sequences that are shorter than minimum lenght (500) and those that
# have too many ambiguities
pr2_update_new_rejected <- pr2_update_new %>% filter((sequence_length < pr2.env$sequence_length_min) | 
    (ambiguities > pr2.env$ambiguities_max))
write_tsv(pr2_update_new_rejected, full_path("Ciliates_new_rejected.txt"), na = "")

str_c("Number of sequences rejected (length <500 or ambuiguities > 20): ", nrow(pr2_update_new_rejected))
[1] "Number of sequences rejected (length <500 or ambuiguities > 20): 11"
select(pr2_update_new_rejected, genbank_accession, sequence_length, ambiguities)
# A tibble: 11 x 3
   genbank_accession sequence_length ambiguities
   <chr>                       <int>       <int>
 1 KJ755359                     1220          21
 2 Z29441                       1516          42
 3 AB684393                      322           0
 4 AF070701                     1688          65
 5 AF300284                     1633          21
 6 GU922412                      464           0
 7 HM799981                     1698         100
 8 FJ032688                     1856          30
 9 LC031485                      541          21
10 GU922866                      485           0
11 AJ862469                      461           4
pr2_update_new <- pr2_update_new %>% filter((sequence_length >= pr2.env$sequence_length_min) & 
    (ambiguities <= pr2.env$ambiguities_max))

3.4.3.7 Create PR2 fields

pr2_update_new$label <- "U"
pr2_update_new <- pr2_update_new %>% mutate(pr2_accession = paste0(genbank_accession, 
    ".", start, ".", end, "_", label))
pr2_update_new$reference_sequence = NA


write_tsv(pr2_update_new, full_path("Ciliates pr2_update_new.txt"), na = "")

4 Update the PR2 database

4.1 Update new taxo

Use three data frames * pr2_taxo_updated : 573 lines edited but 661 records edited (if a genus is edited all from the same genus have been edited) * pr2_taxo_added : 267 * pr2_taxo_not_updated : 101 - This is done manually in order not to remove taxa that have been updated

pr2_taxo_updated <- pr2_taxo_update(pr2_taxo_updated, method = "update")
pr2_taxo_added <- pr2_taxo_update(pr2_taxo_added, method = "add")

4.2 Re-assign the old PR2 sequences and add the eukref metadata

  • 4556 sequences re-assigned (corresponding to 4562 entries)
  • 4549 metadata updated (number of rows is 4568, so some rows correspond to the same Genbank sequence ??)

Use two data frames * pr2_update_old_main

pr2_update_old_main <- pr2_sequence_reassign(pr2_update_old_main)
  • pr2_update_old_metadata

[1] "genbank_accession" "eukref_name" "eukref_source" "eukref_env_material" "eukref_env_biome" "eukref_biotic_relationship" [7] "eukref_specific_host" "eukref_geo_loc_name" "eukref_publication" "eukref_authors" "eukref_journal" "eukref_notes" [13] "gb_country" "gb_publication" "gb_authors" "gb_journal" "pr2_sample_type"

pr2_update_old_metadata <- pr2_update_old_metadata %>% mutate(query = str_c("UPDATE pr2_metadata", 
    " SET ", db_update_field_string("eukref_name", eukref_name), db_update_field_string("eukref_source", 
        eukref_source), db_update_field_string("eukref_env_material", eukref_env_material), 
    db_update_field_string("eukref_env_biome", eukref_env_biome), db_update_field_string("eukref_biotic_relationship", 
        eukref_biotic_relationship), db_update_field_string("eukref_specific_host", 
        eukref_specific_host), db_update_field_string("eukref_publication", 
        eukref_publication), db_update_field_string("eukref_authors", eukref_authors), 
    db_update_field_string("eukref_journal", eukref_journal), db_update_field_string("eukref_notes", 
        eukref_notes), db_update_field_string("gb_country", gb_country), db_update_field_string("gb_publication", 
        gb_publication), db_update_field_string("gb_authors", gb_authors), db_update_field_string("gb_journal", 
        gb_journal), db_update_field_string("pr2_sample_type", pr2_sample_type, 
        append_comma = FALSE), " WHERE genbank_accession = ", db_sql_escape(genbank_accession)))

write_tsv(data.frame(pr2_update_old_metadata$query), full_path("pr2_update_query.txt"), 
    na = "")

pr2_update_old_metadata$query_result <- db_execute_query_vector(db_info("pr2_local"), 
    pr2_update_old_metadata$query)

# Find if there are some sequences not in the metadata table (none)
pr2_metatadata_missing <- pr2_update_old_metadata %>% filter(!(genbank_accession %in% 
    pr2_metadata$genbank_accession))

# find How many distinct genbank sequences : 4550 (4568 lines though)
nrow(distinct(data.frame(pr2_update_old_metadata$genbank_accession)))

4.3 Remove sequences that are not annotated by EukRef

Use data frame * pr2_ciliates_not_updated: 652

pr2_ciliates_not_updated <- pr2_ciliates_not_updated %>% mutate(query = str_c("UPDATE pr2_main", 
    " SET ", db_update_field_string("removed_version", pr2.env$version), db_update_field_string("edited_remark", 
        "Ciliophora sequences not annotated by Eukref"), db_update_field_string("edited_by", 
        pr2.env$editor, append_comma = FALSE), " WHERE genbank_accession = ", 
    db_sql_escape(genbank_accession)))
pr2_ciliates_not_updated$query_result <- db_execute_query_vector(db_info("pr2_local"), 
    pr2_ciliates_not_updated$query)

4.4 Add the new sequences

Use data frame * pr2_update_new: 2478 lines added

# Convert to ASCII all columns
pr2_update_new <- pr2_update_new %>% mutate_if(is.character, ~iconv(., "latin1", 
    "ASCII", sub = "?"))

pr2_main_add <- pr2_update_new %>% select(pr2_accession, genbank_accession, 
    start, end, label, species) %>% mutate(added_version = pr2.env$version, 
    edited_by = pr2.env$editor)

pr2_sequences_add <- pr2_update_new %>% select(pr2_accession, sequence, sequence_length, 
    ambiguities)

pr2_metadata_add <- pr2_update_new %>% select(genbank_accession, starts_with("gb_"), 
    starts_with("pr2_"), starts_with("eukref_")) %>% mutate(eukref_env_material = eukref_env_material_pr2, 
    eukref_env_biome = eukref_env_biome_pr2, eukref_biotic_relationship = eukref_biotic_relationship_pr2) %>% 
    select(-pr2_accession)

# Write to files that are then uploaded into the datbase (easier for error
# tracking)

write_tsv(pr2_main_add, full_path("pr2_main_add.txt"), na = "")
write_tsv(pr2_sequences_add, full_path("pr2_sequences_add.txt"), na = "")
write_tsv(pr2_metadata_add, full_path("pr2_metadata_add.txt"), na = "")

4.5 Mark sequences that are reference

  • 1740 sequences
pr2_reference <- read_excel(full_path("EukRef Ciliates 2018 version 2.0.xlsx"), 
    sheet = "Eukref reference sequences", na = "NA", col_names = TRUE, guess_max = 2e+05)
pr2_reference <- pr2_reference %>% mutate(query = str_c("UPDATE pr2_main", " SET ", 
    db_update_field_string("reference_sequence", 1, append_comma = FALSE), " WHERE genbank_accession = ", 
    db_sql_escape(genbank_accession)))

pr2_reference$query_result <- db_execute_query_vector(db_info("pr2_local"), 
    pr2_reference$query)

4.6 Final checks

Reload the PR2 database

4.6.1 Taxonomy

pr2_taxo_check(pr2_taxo, dir_taxo = "C:/Users/vaulot/Desktop")

4.6.2 Full data

pr2_taxo_missing <- pr2 %>% filter(is.na(division))


pr2_ciliophora_not_updated <- pr2 %>% filter(division == "Ciliophora") %>% filter(!str_detect(edited_version, 
    "4.7|4.11"))

Daniel Vaulot

28 10 2018