PR2 version 4.11.0
Eukref Ciliate Integration
PR2 version 4.11.0
Eukref Ciliate Integration
Load the variables common to the different scripts and the necessary libraries
source('PR2_init.R', echo=FALSE)
knitr::opts_chunk$set(eval=TRUE)
1 Introduction
1.1 Main changes to PR2
- Update Ciliate with database provided by Eukref (except for Spirotrichea that had been done before)
1.2 References used
- Boscaro, V., Santoferrara, L.F., Zhang, Q., Gentekaki, E., Syberg-Olsen, M.J., del Campo, J. & Keeling, P.J. 2018. EukRef-Ciliophora: A manually curated, phylogeny-based database of small subunit rRNA gene sequences of ciliates. Environ. Microbiol.
1.3 Data used
- Latest version of : https://github.com/eukref/curation/blob/master/EukRef_Ciliophora/EUKREF-CILIOPHORA.tsv
- Spirotrichea will not be updated because a recent update has been done by C. Bachy and W. Ting
Type | Number of sequences | Remark |
---|---|---|
Number of PR2 sequences updated. | 4577 | |
Number of PR2 new sequences added | 2489 | |
Number of PR2 sequences not updated and removed | 652 |
1.4 Summary of the data
See the worksheet attached EukRef Ciliates 2018.xlsx
Sheet | Contains | To do |
---|---|---|
pr2_ciliates_not_updated | PR2 sequences assigned as sequences that have not been annotated by EukRef | These sequences will be removed from PR2 |
pr2_new_metadata | Metadata for new sequences | Need to add end for sequences that include ITS |
pr2_update_taxo | Summary of edited taxonomy with number of sequences | OK |
eukref_processed | All changes made based on the edited taxonomy as well as to medata | OK |
1.6 Small Problems found in the EukRef file
All these problems have been corrected in the EUKREF-CILIOPHORA_corrected_v2.0.tsv file
- Some sequences do not have GenBank accession. They correspond to PDB sequences and have not been included since the species Tetrahymena thermophila has other sequences associated.
4BPE_A Eukaryota SAR Alveolata Ciliophora Intramacronucleata CONthreeP Oligohymenophorea Hymenostomatia Tetrahymenida Tetrahymenida Tetrahymenida Tetrahymenida Tetrahymenida Tetrahymena thermophila
4BPN_A Eukaryota SAR Alveolata Ciliophora Intramacronucleata CONthreeP Oligohymenophorea Hymenostomatia Tetrahymenida Tetrahymenida Tetrahymenida Tetrahymenida Tetrahymenida Tetrahymena thermophila
4BPO_A Eukaryota SAR Alveolata Ciliophora Intramacronucleata CONthreeP Oligohymenophorea Hymenostomatia Tetrahymenida Tetrahymenida Tetrahymenida Tetrahymenida Tetrahymenida Tetrahymena thermophila
4BPP_A Eukaryota SAR Alveolata Ciliophora Intramacronucleata CONthreeP Oligohymenophorea Hymenostomatia Tetrahymenida Tetrahymenida Tetrahymenida Tetrahymenida Tetrahymenida Tetrahymena thermophila
2XZM_A Eukaryota SAR Alveolata Ciliophora Intramacronucleata CONthreeP Oligohymenophorea Hymenostomatia Tetrahymenida Tetrahymenida Tetrahymenida Tetrahymenida Tetrahymenida Tetrahymena thermophila
2XZN_A Eukaryota SAR Alveolata Ciliophora Intramacronucleata CONthreeP Oligohymenophorea Hymenostomatia Tetrahymenida Tetrahymenida Tetrahymenida Tetrahymenida Tetrahymenida Tetrahymena thermophila
One sequence has a wrong name in the 0 column “Frontonia_lynni” (should be Eukaryota) OK corrected in Text file
DQ190463.3 Frontonia_lynni SAR Alveolata Ciliophora Intramacronucleata CONthreeP Oligohymenophorea Peniculia[1] Peniculida Peniculida Frontoniidae[1] Frontoniidae[1] Frontoniidae[1] Frontonia lynni
I found one error in the taxonomy : All the Dysteriidae are in Cyrtophoria_8 but one is Cyrtophoria_7 OK corrected in Text file
LC031485 Eukaryota Alveolata Ciliophora Phyllopharyngea **Cyrtophoria_7** Dysteriidae Dysteriidae_X Dysteriidae_X_sp. SS 284
Should be ?
LC031485 Eukaryota Alveolata Ciliophora Phyllopharyngea **Cyrtophoria_8** Dysteriidae Dysteriidae_X Dysteriidae_X_sp. SS 284
- You have different groups OLIGO1 to 10 that are in columns Taxo10 to Taxo12
- However you have some other groups called OLIGO in Taxo12 and which have Operculariidae[1] in the Taxo10 column and that go up to OLIGO18. These groups are not mentionned in the Note file (ciliophora_notes.md). Are these OLIGO the same thing that the OLIGO that appear in the Taxo 1 to 12 OK corrected in text file
Lines with OLIGO in the Taxo10 to Taxo12
genbank_accession Taxo0 Taxo1 Taxo2 Taxo3 Taxo4 Taxo5 Taxo6 Taxo7 Taxo8 Taxo9 Taxo10 Taxo11 Taxo12 eukref_name
HQ164099.1 Eukaryota SAR Alveolata Ciliophora Intramacronucleata CONthreeP Oligohymenophorea Peritrichia[2] Sessilida Sessilida OLIGO2 OLIGO2 OLIGO2 PR-53
LC150080.1 Eukaryota SAR Alveolata Ciliophora Intramacronucleata CONthreeP Oligohymenophorea Peritrichia[2] Sessilida Sessilida OLIGO2 OLIGO2 OLIGO2 AS_seed_33
LC150125.1 Eukaryota SAR Alveolata Ciliophora Intramacronucleata CONthreeP Oligohymenophorea Peritrichia[2] Sessilida Sessilida OLIGO2 OLIGO2 OLIGO2 A21_ek_8
LC150193.1 Eukaryota SAR Alveolata Ciliophora Intramacronucleata CONthreeP Oligohymenophorea Peritrichia[2] Sessilida Sessilida OLIGO2 OLIGO2 OLIGO2 B21_ps_6
LC150194.1 Eukaryota SAR Alveolata Ciliophora Intramacronucleata CONthreeP Oligohymenophorea Peritrichia[2] Sessilida Sessilida OLIGO2 OLIGO2 OLIGO2 B21_ps_7
Lines with Operculariidae[1] in Taxo 11 and going up to OLIGO18
genbank_accession Taxo0 Taxo1 Taxo2 Taxo3 Taxo4 Taxo5 Taxo6 Taxo7 Taxo8 Taxo9 Taxo10 Taxo11 Taxo12 eukref_name
QCBT2ZC111
GU922676.1 Eukaryota SAR Alveolata Ciliophora Intramacronucleata CONthreeP Oligohymenophorea Peritrichia[2] Sessilida Sessilida Operculariidae[1] OLIGO10 OLIGO10 F5K2Q4C04H79Z6
GU922807.1 Eukaryota SAR Alveolata Ciliophora Intramacronucleata CONthreeP Oligohymenophorea Peritrichia[2] Sessilida Sessilida Operculariidae[1] OLIGO10 OLIGO10 F5K2Q4C04JDJJY
GU922677.1 Eukaryota SAR Alveolata Ciliophora Intramacronucleata CONthreeP Oligohymenophorea Peritrichia[2] Sessilida Sessilida Operculariidae[1] OLIGO11 OLIGO11 F5K2Q4C04JEMU6
GU923387.1 Eukaryota SAR Alveolata Ciliophora Intramacronucleata CONthreeP Oligohymenophorea Peritrichia[2] Sessilida Sessilida Operculariidae[1] OLIGO11 OLIGO11 F5K2Q4C04JEO8B
GU922798.1 Eukaryota SAR Alveolata Ciliophora Intramacronucleata CONthreeP Oligohymenophorea Peritrichia[2] Sessilida Sessilida Operculariidae[1] OLIGO12 OLIGO12 F5K2Q4C04I6ZU6
KY355481.1 Eukaryota SAR Alveolata Ciliophora Intramacronucleata CONthreeP Oligohymenophorea Peritrichia[2] Sessilida Sessilida Operculariidae[1] OLIGO12 OLIGO12 QCCA2ZA071
GU922902.1 Eukaryota SAR Alveolata Ciliophora Intramacronucleata CONthreeP Oligohymenophorea Peritrichia[2] Sessilida Sessilida Operculariidae[1] OLIGO13 OLIGO13 F5K2Q4C04IVM5V
GU922989.1 Eukaryota SAR Alveolata Ciliophora Intramacronucleata CONthreeP Oligohymenophorea Peritrichia[2] Sessilida Sessilida Operculariidae[1] OLIGO14 OLIGO14 F5K2Q4C04H6E06
GU923005.1 Eukaryota SAR Alveolata Ciliophora Intramacronucleata CONthreeP Oligohymenophorea Peritrichia[2] Sessilida Sessilida Operculariidae[1] OLIGO15 OLIGO15 F5K2Q4C04IA55J
GU923014.1 Eukaryota SAR Alveolata Ciliophora Intramacronucleata CONthreeP Oligohymenophorea Peritrichia[2] Sessilida Sessilida Operculariidae[1] OLIGO16 OLIGO16 F5K2Q4C04I4F5W
GU923146.1 Eukaryota SAR Alveolata Ciliophora Intramacronucleata CONthreeP Oligohymenophorea Peritrichia[2] Sessilida Sessilida Operculariidae[1] OLIGO17 OLIGO17 F5K2Q4C04IMU19
GU923168.1 Eukaryota SAR Alveolata Ciliophora Intramacronucleata CONthreeP Oligohymenophorea Peritrichia[2] Sessilida Sessilida Operculariidae[1] OLIGO18 OLIGO18 F5K2Q4C04IA8IG
GU919118.1 Eukaryota SAR Alveolata Ciliophora Intramacronucleata CONthreeP Oligohymenophorea Peritrichia[2] Sessilida Sessilida Operculariidae[1] OLIGO2 OLIGO2 F5K2Q4C04H224O
GU919396.1 Eukaryota SAR Alveolata Ciliophora Intramacronucleata CONthreeP Oligohymenophorea Peritrichia[2] Sessilida Sessilida Operculariidae[1] OLIGO2 OLIGO2
- You have also some lines
EF586159.1 Eukaryota SAR Alveolata Ciliophora Intramacronucleata CONthreeP Oligohymenophorea Peritrichia[2] Sessilida Sessilida Operculariidae[1] Operculariidae[1] Operculariidae[1] 384-O6
AF401525.1 Eukaryota SAR Alveolata Ciliophora Intramacronucleata CONthreeP Oligohymenophorea Peritrichia[2] Sessilida Sessilida Operculariidae[1] Operculariidae[1] Operculariidae[1] Opercularia microdiscum
KU363257.1 Eukaryota SAR Alveolata Ciliophora Intramacronucleata CONthreeP Oligohymenophorea Peritrichia[2] Sessilida Sessilida Operculariidae[1] Operculariidae[1] Operculariidae[1] 102operCul
KU363267.1 Eukaryota SAR Alveolata Ciliophora Intramacronucleata CONthreeP Oligohymenophorea Peritrichia[2] Sessilida Sessilida Operculariidae[1] Operculariidae[1] Operculariidae[1] 121operCul
HM627238.1 Eukaryota SAR Alveolata Ciliophora Intramacronucleata CONthreeP Oligohymenophorea Peritrichia[2] Sessilida Sessilida Operculariidae[1] Operculariidae[1] Operculariidae[1] Opercularia allensi
HM627239.1 Eukaryota SAR Alveolata Ciliophora Intramacronucleata CONthreeP Oligohymenophorea Peritrichia[2] Sessilida Sessilida Operculariidae[1] Operculariidae[1] Operculariidae[1] Opercularia sp. 1 LRPU-2010
- Accession numbers appearing more than once
- GU819455 : 2 lines
- HE655310 : 2 lines
- I noticed some errors in the ENVO keywords see:
OK corrected in text file
* area of moss-dominated vegetation ENVO:01000890
* cloaca UBERON:0000162
* stomach UBERON:0000945
* hydrotermal vent ENVO:00000215
* caecum UBERON:0001153
* colon UBERON:0001155
* rhizosphere ENVO:00005801
* ruminal fluid ENVO:0010228 - Could not find this term (replace by stomach ?)
2 Read pr2
- Use the PR2 on local computer
# Info for the local dabase
pr2_db <- db_info("pr2_local")
pr2_db_con <- db_connect(pr2_db)
pr2_main <- tbl(pr2_db_con, "pr2_main") %>% collect()
# pr2_main <- pr2_main %>% filter (is.na(removed_version))
pr2_seq <- tbl(pr2_db_con, "pr2_sequences") %>% collect()
pr2_taxo <- tbl(pr2_db_con, "pr2_taxonomy")
pr2_taxo <- pr2_taxo %>% filter(is.na(taxo_removed_version)) %>% collect()
pr2_metadata <- tbl(pr2_db_con, "pr2_metadata") %>% collect()
# Join the tables and keep only sequences that are not removed
pr2 <- left_join(pr2_main, pr2_taxo, by = c(species = "species"))
pr2 <- left_join(pr2, pr2_seq)
pr2 <- left_join(pr2, pr2_metadata)
db_disconnect(pr2_db_con)
[1] TRUE
pr2 <- pr2 %>% filter(is.na(removed_version))
pr2_sample_types <- pr2 %>% group_by(pr2_sample_type) %>% count()
3 Ciliate database - Eukref
- Round 1 - 2018-07
- Round 2 - 2018-10
3.1 Read the data and reformat
3.1.1 Read the data
- Read the Tab delimited file because problems with strange characters and CR in some fields
- The first line of the file is skipped, the second line contains the name of the fields to be used in PR2
- The last column is empty and discarded (named “empty”)
- Number of sequences = 11622
dir_pr2_update <- "../updates/2018 Ciliates Eukref"
pr2.env$editor <- "Eukref - Boscaro V."
full_path <- function(file_name) {
str_c(dir_pr2_update, "/", file_name)
}
file_pr2_update_excel <- full_path("EukRef Ciliates 2018 2.0.xlsx")
# Import the tsv skipping the first line
pr2_update_raw <- read_tsv(full_path("EUKREF-CILIOPHORA_corrected_v2.0.tsv"),
col_names = TRUE, guess_max = 2e+05, na = c("", "NA"), skip = 1)
problems(pr2_update_raw)
# tibble [0 x 4]
# ... with 4 variables: row <int>, col <int>, expected <chr>, actual <chr>
spec(pr2_update_raw)
cols(
genbank_accession = col_character(),
Taxo0 = col_character(),
Taxo1 = col_character(),
Taxo2 = col_character(),
Taxo3 = col_character(),
Taxo4 = col_character(),
Taxo5 = col_character(),
Taxo6 = col_character(),
Taxo7 = col_character(),
Taxo8 = col_character(),
Taxo9 = col_character(),
Taxo10 = col_character(),
Taxo11 = col_character(),
Taxo12 = col_character(),
eukref_name = col_character(),
eukref_source = col_character(),
eukref_env_material = col_character(),
eukref_env_biome = col_character(),
eukref_biotic_relationship = col_character(),
eukref_specific_host = col_character(),
eukref_geo_loc_name = col_character(),
eukref_publication = col_character(),
eukref_authors = col_character(),
eukref_journal = col_character(),
eukref_notes = col_character(),
gb_taxonomy = col_character(),
gb_environment = col_character(),
gb_host = col_character(),
gb_country = col_character(),
gb_publication = col_character(),
gb_authors = col_character(),
gb_journal = col_character(),
empty = col_character()
)
# Remove last column which is empty
pr2_update_raw <- select(pr2_update_raw, -empty)
str_c("Number of sequences : ", nrow(pr2_update_raw))
[1] "Number of sequences : 11620"
- Rename the columns as a function of the column selected for taxonomy. The “genus” column (Taxo12) is not yet selected because it will be constructed latter.
- Remove ending digits from the accession number
- Filter out accession number that starts by a number (sequences not from GenBank but from PDB)
- Number of sequences with valid GenBank number : 11616
pr2_update <- pr2_update_raw %>% rename(kingdom = Taxo0, supergroup = Taxo2,
division = Taxo3, class = Taxo6, order = Taxo7, family = Taxo10) %>% mutate(genbank_accession = str_replace(genbank_accession,
"[.][0-9]+", "")) %>% filter(str_detect(genbank_accession, "^[A-Z]+"))
str_c("Number of sequences with valid GenBank number : ", nrow(pr2_update))
[1] "Number of sequences with valid GenBank number : 11614"
- Remove Spirotrichea
pr2_update <- pr2_update %>% filter(class != "Spirotrichea")
str_c("Number of sequences after removing Spirotrichea : ", nrow(pr2_update))
[1] "Number of sequences after removing Spirotrichea : 7060"
3.1.2 Build the genus and species field
eukref_name | genus | species | Remark |
---|---|---|---|
Genus sp. | taxo12 | taxo12_sp. | This case corresponds to clones… |
Genus cf. species | Genus | Genus_species | |
Genus species | Genus | Genus_cf_species | |
other cases | taxo12 | taxo12_sp. |
Operations must be done in this order because of not, then case #3 will also find Genus _sp. 1. Locate in column eukref_name “Genus sp.”
2. Locate in eukref_name “Genus cf. species” 3. Locate in eukref_name “Genus species” 4. For the rest using the Taxo12 column
# Select only relevant columns
pr2_taxo_update <- pr2_update %>% select_(.dots = c("genbank_accession", pr2.env$taxo_levels[1:6]),
"Taxo12", "eukref_name")
# Case 1
pr2_taxo_update1 <- pr2_taxo_update %>% filter(str_detect(eukref_name, pr2.env$genus_sp_regex)) %>%
mutate(genus = Taxo12, species = str_c(Taxo12, "_sp."))
pr2_taxo_update <- pr2_taxo_update %>% filter(!(genbank_accession %in% pr2_taxo_update1$genbank_accession))
str_c("Number of sequences Genus sp. : ", nrow(pr2_taxo_update1))
[1] "Number of sequences Genus sp. : 495"
# Case 2
pr2_taxo_update2 <- pr2_taxo_update %>% filter(str_detect(eukref_name, pr2.env$genus_cf_regex)) %>%
mutate(species = str_extract(eukref_name, pr2.env$genus_cf_regex)) %>% mutate(genus = str_replace(species,
pr2.env$genus_cf_regex, "\\1")) %>% mutate(species = str_replace(species,
pr2.env$genus_cf_regex, "\\1_cf._\\2"))
pr2_taxo_update <- pr2_taxo_update %>% filter(!(genbank_accession %in% pr2_taxo_update2$genbank_accession))
str_c("Number of sequences Genus cf. : ", nrow(pr2_taxo_update2))
[1] "Number of sequences Genus cf. : 8"
# Case 3
pr2_taxo_update3 <- pr2_taxo_update %>% filter(str_detect(eukref_name, pr2.env$genus_species_regex)) %>%
mutate(species = str_extract(eukref_name, pr2.env$genus_species_regex)) %>%
mutate(genus = str_replace(species, pr2.env$genus_species_regex, "\\1")) %>%
mutate(species = str_replace(species, pr2.env$genus_species_regex, "\\1_\\2"))
pr2_taxo_update <- pr2_taxo_update %>% filter(!(genbank_accession %in% pr2_taxo_update3$genbank_accession))
str_c("Number of sequences Genus species. : ", nrow(pr2_taxo_update3))
[1] "Number of sequences Genus species. : 1315"
# Case 4
pr2_taxo_update4 <- pr2_taxo_update %>% mutate(genus = Taxo12, species = str_c(Taxo12,
"_sp."))
str_c("Number of sequences with no species info : ", nrow(pr2_taxo_update4))
[1] "Number of sequences with no species info : 5242"
# Merge the four data frames
pr2_taxo_update <- bind_rows(pr2_taxo_update1, pr2_taxo_update2, pr2_taxo_update3,
pr2_taxo_update4) %>% select_(.dots = c("genbank_accession", pr2.env$taxo_levels[1:8]),
"Taxo12", "eukref_name")
# write_tsv(pr2_taxo_update, full_path'junk.txt'))
3.1.3 Correct specific cases
- Modify the genus and species for genus that are split as Genus_1 and Genus_2
- Modify the genus and species that are split into 2 different lineages upstream. Two genera are created _1 and _2
- One error in the order “Cyrtophoria_7” instead of" Cyrtophoria_8"
- Remove all the different “Neobalantidium_coli_1, _2 etc…"
- One family “Schizocaryidae” (single sequence) changed to “Philasterida”
- “AB074079” -> “Chilodonella_sp.
- When a taxon has a subscript [1] and there are not taxa with [2] then remove the [1]
Note the following problem has been corrected in the file itself (problem of OLIGO1 that was OLIGO2… 13)
- Family Operculariidae and genus containing OLIGO, then the genus is named with the family followed by _X
- idem for species
# List of genera that are split between 2 lineages but not named [1] and [2]
# genus_split <- 'Triadinium|Zoothamnium'
pr2_taxo_update <- pr2_taxo_update %>% mutate_all(funs(str_replace(., "Metopidae\\[1\\]",
"Metopidae"))) %>% mutate_all(funs(str_replace(., "Metopus\\[1\\]", "Metopus"))) %>%
mutate_all(funs(str_replace(., "Nyctotheridae\\[1\\]", "Nyctotheridae"))) %>%
mutate_all(funs(str_replace(., "Spathidiida\\[1\\]", "Spathidiida"))) %>%
mutate_all(funs(str_replace(., "Spathidiidae\\[1\\]", "Spathidiidae"))) %>%
mutate_all(funs(str_replace(., "Blepharocorythidae\\[1\\]", "Blepharocorythidae"))) %>%
mutate_all(funs(str_replace(., "Operculariidae\\[1\\]", "Operculariidae"))) %>%
mutate_all(funs(str_replace(., "Zoothamniidae\\[1\\]", "Zoothamniidae"))) %>%
mutate_all(funs(str_replace(., "Pleuronematida\\[1\\]", "Pleuronematida"))) %>%
mutate_all(funs(str_replace(., "Peniculia\\[1\\]", "Peniculia"))) %>% mutate_all(funs(str_replace(.,
"Balantidium\\[1\\]", "Balantidium"))) %>% mutate(order = case_when(str_detect(family,
"Dysteriidae") ~ "Cyrtophoria_8", TRUE ~ order), family = case_when(str_detect(family,
"Schizocaryidae") ~ "Philasterida", str_detect(genus, "Caenomorphidae") ~
"Caenomorphidae", TRUE ~ family), genus = case_when(str_detect(genus, "Neobalantidium") ~
"Neobalantidium", str_detect(Taxo12, "Dysteria\\[|Pleuronema\\[|Trichodina\\[|Entodinium\\[") ~
Taxo12, str_detect(genus, "Tardigrada") ~ "Scuticociliatia_2_XX", str_detect(genus,
"Uncultured") ~ "Philasterida_X", str_detect(genus, "Arcuospathidium") ~
case_when(str_detect(family, "Litostomatea") ~ "Arcuospathidium_1", TRUE ~
"Arcuospathidium_2"), str_detect(genus, "Balantidium") ~ case_when(str_detect(family,
"Trichostomatia") ~ "Balantidium_1", TRUE ~ "Balantidium_2"), str_detect(genus,
"Cyclidium") ~ case_when(str_detect(family, "Oligohymenophorea") ~ "Cyclidium_1",
str_detect(family, "Pleuronematida") ~ "Cyclidium_2", TRUE ~ "Cyclidium_3"),
str_detect(genus, "Epispathidium") ~ case_when(str_detect(family, "Spathidiida") ~
"Epispathidium_1", TRUE ~ "Epispathidium_2"), str_detect(genus, "Frontonia") ~
case_when(str_detect(family, "Frontoniidae") ~ "Frontonia_1", TRUE ~
"Frontonia_2"), str_detect(genus, "Metopus") ~ case_when(str_detect(family,
"Metopidae") ~ "Metopus_1", TRUE ~ "Metopus_2"), str_detect(genus, "Nyctotherus") ~
case_when(str_detect(family, "Nyctotheridae") ~ "Nyctotherus_1", TRUE ~
"Nyctotherus_2"), str_detect(genus, "Spathidium") ~ case_when(str_detect(family,
"Litostomatea") ~ "Spathidium_1", TRUE ~ "Spathidium_2"), str_detect(genus,
"Triadinium") ~ case_when(str_detect(family, "Buetschliidae") ~ "Triadinium_1",
TRUE ~ "Triadinium_2"), str_detect(genus, "Zoothamnium") ~ case_when(str_detect(family,
"Sessilida") ~ "Zoothamnium_1", TRUE ~ "Zoothamnium_2"), genbank_accession ==
"AB074079" ~ "Chilodonella", genbank_accession == "HM140400" ~ family,
TRUE ~ genus), species = case_when(str_detect(Taxo12, "Dysteria\\[|Pleuronema\\[|Trichodina\\[|Entodinium\\[") ~
str_c(Taxo12, "_sp."), str_detect(genus, "Caenomorphidae") ~ "Caenomorphidae",
genbank_accession == "AB074079" ~ "Chilodonella_sp.", str_detect(genus,
"Tardigrada") ~ "Scuticociliatia_2_XX_sp.", str_detect(genus, "Uncultured") ~
"Philasterida_X_sp.", str_detect(species, "Spathidium_muscicola") ~
case_when(str_detect(family, "Litostomatea") ~ "Spathidium_1_muscicola",
TRUE ~ "Spathidium_2_muscicola"), str_detect(species, "Frontonia_elegans") ~
case_when(str_detect(family, "Frontoniidae") ~ "Frontonia_1_elegans",
TRUE ~ "Frontonia_2_elegans"), str_detect(species, "Triadinium_caudatum") ~
case_when(str_detect(family, "Buetschliidae") ~ "Triadinium_1_caudatum",
TRUE ~ "Triadinium_2_caudatum"), TRUE ~ species))
# write_tsv(pr2_taxo_update, full_path'junk2.txt')
3.1.4 Correct taxonomy names to follow PR2 rules
- Taxon[1] changed to Taxon_1
pr2_taxo_update <- pr2_taxo_update %>% mutate_all(funs(str_replace(., "\\[(\\d+)\\]",
"_\\1")))
3.1.5 Replace the taxons that are repeated across columns by Taxa_X, Taxa_XX
- If ranks i and i+1 are similar then add _X to the taxonomy
pr2_taxo_update <- pr2_taxo_X(pr2_taxo_update, pr2.env$taxo_levels)
3.1.6 Merge back the assignation into the initial table
pr2_update_2 <- left_join(select(pr2_taxo_update, -Taxo12, -eukref_name), select(pr2_update,
genbank_accession, eukref_name:gb_journal))
3.2 Construct taxonomy and compare to existing taxonomy
3.2.1 Construct a taxonomy file
- Based on the taxonomy analysis the data are corrected upstream until no more duplicates
# Construct the taxonomy table
pr2_taxo_update_summary <- pr2_taxo_update %>% group_by_(.dots = pr2.env$taxo_levels) %>%
summarise(sequence_number = n()) %>% ungroup() %>% arrange_(.dots = pr2.env$taxo_levels)
write_tsv(pr2_taxo_update_summary, full_path("Ciliates_taxo_summary.txt"))
# Check it for duplicate names and duplicate parents
pr2_taxo_check(select(pr2_taxo_update_summary, kingdom:species), dir_pr2_update)
# Extract the sequence lines for which species are duplicates from the
# pr2_taxo file
duplicate_species <- read_tsv(full_path("taxo_species_duplicates.txt"))
pr2_taxo_duplicate_species <- filter(pr2_taxo_update, species %in% duplicate_species$species)
write_tsv(pr2_taxo_duplicate_species, full_path("Ciliates_duplicate_species.txt"))
3.2.2 Compare the new taxonomy to what was existing before
# Compare to existing taxonomy
pr2_update_taxo_compared <- left_join(pr2_taxo_update_summary, pr2_taxo, by = c(species = "species")) %>%
rename_(kingdom = "kingdom.x", supergroup = "supergroup.x", division = "division.x",
class = "class.x", order = "order.x", family = "family.x", genus = "genus.x") %>%
arrange(class, order, family, genus, species)
write_tsv(pr2_update_taxo_compared, full_path("Ciliates_taxo_compared.txt"),
na = "")
pr2_taxo_updated <- pr2_update_taxo_compared %>% filter(!is.na(taxo_id))
pr2_taxo_added <- pr2_update_taxo_compared %>% filter(is.na(taxo_id))
pr2_taxo_not_updated <- pr2_taxo %>% filter((division == "Ciliophora") & (class !=
"Spirotrichea") & !(species %in% pr2_taxo_update_summary$species))
write_tsv(pr2_taxo_not_updated, full_path("Ciliates_taxo_not_updated.txt"),
na = "")
str_c("Ciliates: number of taxa updated : ", nrow(pr2_taxo_updated))
[1] "Ciliates: number of taxa updated : 839"
str_c("Ciliates: number of taxa added : ", nrow(pr2_taxo_added))
[1] "Ciliates: number of taxa added : 1"
str_c("Ciliates: number of taxa not updated (removed) : ", nrow(pr2_taxo_not_updated))
[1] "Ciliates: number of taxa not updated (removed) : 100"
3.2.3 Verify any problem with the full pr2 taxonomy
pr2_taxo_full_check <- pr2_taxo %>% select(kingdom:species) %>% # filter(!((division == 'Ciliophora') & (class != 'Spirotrichea'))) %>%
bind_rows(select(pr2_taxo_update_summary, kingdom:species))
pr2_taxo_check(pr2_taxo_full_check, dir_taxo = "C:/Users/vaulot/Desktop")
3.3 Metadata
3.3.1 Clean up the metadata
- Replace
_ENVOxxx_
by(ENVOxxx)
- Keep only one keyword (and/or)
- Put everything in lower case except for ENVO
pr2_update_2 <- pr2_update_2 %>% mutate(eukref_env_material_corrected = str_replace(eukref_env_material,
"_(.+)_", "\\(\\1\\)")) %>% mutate(eukref_env_material_corrected = str_replace(eukref_env_material_corrected,
"and .+$|or .+$", "")) %>% mutate(eukref_env_material_corrected = str_to_lower(eukref_env_material_corrected)) %>%
mutate(eukref_env_material_corrected = str_replace(eukref_env_material_corrected,
"envo", "ENVO"))
pr2_update_2 <- pr2_update_2 %>% mutate(eukref_env_biome_corrected = str_replace(eukref_env_biome,
"_(.+)_", "\\(\\1\\)")) %>% mutate(eukref_env_biome_corrected = str_replace(eukref_env_biome_corrected,
"([a-z]{1})$", "\\1(ENVO missing)")) %>% mutate(eukref_env_biome_corrected = str_to_lower(eukref_env_biome_corrected)) %>%
mutate(eukref_env_biome_corrected = str_replace(eukref_env_biome_corrected,
"envo", "ENVO"))
pr2_update_2 <- pr2_update_2 %>% mutate(eukref_biotic_relationship_corrected = str_to_lower(eukref_biotic_relationship)) %>%
mutate(eukref_biotic_relationship_corrected = str_replace(eukref_biotic_relationship_corrected,
"free living", "free-living"))
3.3.2 Summarize fields
eukref_env_materials <- pr2_update_2 %>% group_by(eukref_env_material, eukref_env_material_corrected) %>%
count() %>% mutate(env_material = str_extract(eukref_env_material_corrected,
"^[a-z0-9- /()]+\\(ENVO"), ENVO_id = str_extract(eukref_env_material_corrected,
"\\(ENVO.+\\)")) %>% mutate(env_material = str_replace(env_material, "\\(ENVO",
""), ENVO_id = str_replace_all(ENVO_id, "[()]", ""))
write_tsv(eukref_env_materials, full_path("eukref_env_material.txt"), na = "")
eukref_env_biomes <- pr2_update_2 %>% group_by(eukref_env_biome, eukref_env_biome_corrected) %>%
count() %>% mutate(env_biome = str_extract(eukref_env_biome_corrected, "^[a-z0-9- /()]+\\(ENVO"),
ENVO_id = str_extract(eukref_env_biome_corrected, "\\(ENVO.+\\)")) %>% mutate(env_biome = str_replace(env_biome,
"\\(ENVO", ""), ENVO_id = str_replace_all(ENVO_id, "[()]", ""))
write_tsv(eukref_env_biomes, full_path("eukref_env_biomes.txt"), na = "")
eukref_biotic_relationships <- pr2_update_2 %>% group_by(eukref_biotic_relationship_corrected) %>%
count()
eukref_source <- pr2_update_2 %>% group_by(eukref_source) %>% count()
3.3.3 Finalize fields
pr2_update_2 <- pr2_update_2 %>% mutate(eukref_env_material_pr2 = str_extract(eukref_env_material_corrected,
"^[a-z0-9- /()]+\\(ENVO")) %>% mutate(eukref_env_material_pr2 = str_replace(eukref_env_material_pr2,
"\\(ENVO", ""))
pr2_update_2 <- pr2_update_2 %>% mutate(eukref_env_biome_pr2 = str_extract(eukref_env_biome_corrected,
"^[a-z0-9- /()]+\\(ENVO")) %>% mutate(eukref_env_biome_pr2 = str_replace(eukref_env_biome_pr2,
"\\(ENVO", ""))
pr2_update_2 <- pr2_update_2 %>% mutate(eukref_biotic_relationship_pr2 = eukref_biotic_relationship_corrected)
3.3.4 Write the final Eukref table
write_tsv(pr2_update_2, full_path("Ciliates_Eukref_processed.txt"))
3.4 Check sequences against PR2
3.4.1 Sequences already in PR2
# Extract lines with accession number already in PR2, then one need just to
# update the species and add the eukref metadata
pr2_update_old <- pr2_update_2 %>% filter(genbank_accession %in% pr2$genbank_accession)
str_c("Number of sequences to update in PR2 : ", nrow(pr2_update_old))
[1] "Number of sequences to update in PR2 : 7028"
pr2_update_old_Spirotrichea <- pr2 %>% filter(genbank_accession %in% pr2_update_old$genbank_accession) %>%
filter(class == "Spirotrichea")
str_c("Number of sequences to update in PR2 that belonged before to Spirotrichea : ",
nrow(pr2_update_old_Spirotrichea))
[1] "Number of sequences to update in PR2 that belonged before to Spirotrichea : 0"
select(pr2_update_old_Spirotrichea, genbank_accession, class:genus, species)
# A tibble: 0 x 6
# ... with 6 variables: genbank_accession <chr>, class <chr>, order <chr>,
# family <chr>, genus <chr>, species <chr>
pr2_update_old_main <- select(pr2_update_old, genbank_accession, species)
- For the metadata use only the eukref fields.
- For the sequences for which sample_type exists, keep the data, if not update with
eukref_source
field.
pr2_update_old_metadata <- pr2_update_old %>% select(genbank_accession, eukref_name,
eukref_source, eukref_env_material = eukref_env_material_pr2, eukref_env_biome = eukref_env_biome_pr2,
eukref_biotic_relationship = eukref_biotic_relationship_pr2, eukref_specific_host,
eukref_geo_loc_name, eukref_publication, eukref_authors, eukref_journal,
eukref_notes, gb_country, gb_publication, gb_authors, gb_journal) %>% left_join(select(pr2,
genbank_accession, pr2_sample_type)) %>% mutate(pr2_sample_type = case_when(!(is.na(pr2_sample_type)) ~
pr2_sample_type, TRUE ~ str_to_lower(eukref_source)))
3.4.2 Ciliates sequences from PR2 not updated by EukRef
These sequences are removed for the time being from PR2
pr2_ciliates <- pr2 %>% filter((division == "Ciliophora") & (class != "Spirotrichea"))
pr2_ciliates_not_updated <- pr2_ciliates %>% filter(!(genbank_accession %in%
pr2_update_2$genbank_accession))
str_c("Number of PR2 sequences not updated : ", nrow(pr2_ciliates_not_updated))
[1] "Number of PR2 sequences not updated : 0"
write_tsv(pr2_ciliates_not_updated, full_path("Ciliates_pr2_not_updated.txt"),
na = "")
# Are any of these sequences found in the Spirotrichea ?
pr2_wrongly_assigned <- pr2_ciliates_not_updated %>% filter((genbank_accession %in%
pr2_update_raw$genbank_accession))
str_c("Number of PR2 sequences that should be moved to Spirotrichea : ", nrow(pr2_wrongly_assigned))
[1] "Number of PR2 sequences that should be moved to Spirotrichea : 0"
3.4.3 New sequences not found in PR2
pr2_update_new <- pr2_update_2 %>% filter(!(genbank_accession %in% pr2$genbank_accession))
str_c("Number of new sequences : ", nrow(pr2_update_new))
[1] "Number of new sequences : 32"
3.4.3.1 Download the sequences and parse them
pr2_update_new_metadata <- genbank_download_parse(pr2_update_new$genbank_accession,
pr2.env$genbank_directory)
write_tsv(pr2_update_new_metadata, full_path("Ciliates_new_metadata.txt"), na = "")
knitr::kable(head(pr2_update_new_metadata, 5))
3.4.3.2 Extract features
Features that can be used to determine the start and end of the 18S - This is stored in a file.
pr2_update_new_features <- genbank_features(pr2_update_new$genbank_accession,
pr2.env$genbank_directory)
write_tsv(pr2_update_new_features, full_path("Ciliates_new_features.txt"), na = "")
knitr::kable(head(pr2_update_new_features, 3))
3.4.3.3 Look manually for sequence start and end for sequences that contain ITS
- Export the sequence which contain ITS
- Import into Geneious
- Align and look for probe GTGAACCTGCRGAWGGATCA which marks the end of the 18S
- Update manually start and end in the pr2_new_metadata sheet
pr2_update_new_metadata <- read_excel(full_path("EukRef Ciliates 2018 version 2.0.xlsx"),
sheet = "pr2_new_metadata", na = "NA", col_names = TRUE, guess_max = 2e+05)
pr2_update_new_ITS <- pr2_update_new_metadata %>% filter(is.na(end))
fasta_write(select(pr2_update_new_ITS, seq_name = genbank_accession, sequence),
file_name = full_path("sequences_with_ITS.fasta"), compress = FALSE, taxo_include = FALSE)
3.4.3.4 Reload the metadata sheet
Now the start and end are OK
pr2_update_new_metadata <- read_excel(full_path("EukRef Ciliates 2018 version 2.0.xlsx"),
sheet = "pr2_new_metadata", na = "NA", col_names = TRUE, guess_max = 2e+05)
3.4.3.5 Merge the accession, taxonomy, sequence and metadata
pr2_update_new <- left_join(select(pr2_update_new, -gb_host, -gb_environment,
-gb_country), pr2_update_new_metadata)
3.4.3.6 Extract sequences and get length and number of ambiguities
# Extract the subsequence if the whole sequence is not to be used
index <- !is.na(pr2_update_new$start)
pr2_update_new$sequence[index] <- str_sub(pr2_update_new$sequence[index], pr2_update_new$start[index],
pr2_update_new$end[index])
pr2_update_new$sequence_length_old <- pr2_update_new$sequence_length
# Update sequence length and ambiguities for all sequences since some may
# have been shortened
pr2_update_new$sequence_length <- str_length(pr2_update_new$sequence)
pr2_update_new$ambiguities <- str_count(pr2_update_new$sequence, pattern = pr2.env$ambig_regex)
# Remove sequences that are shorter than minimum lenght (500) and those that
# have too many ambiguities
pr2_update_new_rejected <- pr2_update_new %>% filter((sequence_length < pr2.env$sequence_length_min) |
(ambiguities > pr2.env$ambiguities_max))
write_tsv(pr2_update_new_rejected, full_path("Ciliates_new_rejected.txt"), na = "")
str_c("Number of sequences rejected (length <500 or ambuiguities > 20): ", nrow(pr2_update_new_rejected))
[1] "Number of sequences rejected (length <500 or ambuiguities > 20): 11"
select(pr2_update_new_rejected, genbank_accession, sequence_length, ambiguities)
# A tibble: 11 x 3
genbank_accession sequence_length ambiguities
<chr> <int> <int>
1 KJ755359 1220 21
2 Z29441 1516 42
3 AB684393 322 0
4 AF070701 1688 65
5 AF300284 1633 21
6 GU922412 464 0
7 HM799981 1698 100
8 FJ032688 1856 30
9 LC031485 541 21
10 GU922866 485 0
11 AJ862469 461 4
pr2_update_new <- pr2_update_new %>% filter((sequence_length >= pr2.env$sequence_length_min) &
(ambiguities <= pr2.env$ambiguities_max))
3.4.3.7 Create PR2 fields
pr2_update_new$label <- "U"
pr2_update_new <- pr2_update_new %>% mutate(pr2_accession = paste0(genbank_accession,
".", start, ".", end, "_", label))
pr2_update_new$reference_sequence = NA
write_tsv(pr2_update_new, full_path("Ciliates pr2_update_new.txt"), na = "")
4 Update the PR2 database
4.1 Update new taxo
Use three data frames * pr2_taxo_updated : 573 lines edited but 661 records edited (if a genus is edited all from the same genus have been edited) * pr2_taxo_added : 267 * pr2_taxo_not_updated : 101 - This is done manually in order not to remove taxa that have been updated
pr2_taxo_updated <- pr2_taxo_update(pr2_taxo_updated, method = "update")
pr2_taxo_added <- pr2_taxo_update(pr2_taxo_added, method = "add")
4.2 Re-assign the old PR2 sequences and add the eukref metadata
- 4556 sequences re-assigned (corresponding to 4562 entries)
- 4549 metadata updated (number of rows is 4568, so some rows correspond to the same Genbank sequence ??)
Use two data frames * pr2_update_old_main
pr2_update_old_main <- pr2_sequence_reassign(pr2_update_old_main)
- pr2_update_old_metadata
[1] "genbank_accession" "eukref_name" "eukref_source" "eukref_env_material" "eukref_env_biome" "eukref_biotic_relationship" [7] "eukref_specific_host" "eukref_geo_loc_name" "eukref_publication" "eukref_authors" "eukref_journal" "eukref_notes" [13] "gb_country" "gb_publication" "gb_authors" "gb_journal" "pr2_sample_type"
pr2_update_old_metadata <- pr2_update_old_metadata %>% mutate(query = str_c("UPDATE pr2_metadata",
" SET ", db_update_field_string("eukref_name", eukref_name), db_update_field_string("eukref_source",
eukref_source), db_update_field_string("eukref_env_material", eukref_env_material),
db_update_field_string("eukref_env_biome", eukref_env_biome), db_update_field_string("eukref_biotic_relationship",
eukref_biotic_relationship), db_update_field_string("eukref_specific_host",
eukref_specific_host), db_update_field_string("eukref_publication",
eukref_publication), db_update_field_string("eukref_authors", eukref_authors),
db_update_field_string("eukref_journal", eukref_journal), db_update_field_string("eukref_notes",
eukref_notes), db_update_field_string("gb_country", gb_country), db_update_field_string("gb_publication",
gb_publication), db_update_field_string("gb_authors", gb_authors), db_update_field_string("gb_journal",
gb_journal), db_update_field_string("pr2_sample_type", pr2_sample_type,
append_comma = FALSE), " WHERE genbank_accession = ", db_sql_escape(genbank_accession)))
write_tsv(data.frame(pr2_update_old_metadata$query), full_path("pr2_update_query.txt"),
na = "")
pr2_update_old_metadata$query_result <- db_execute_query_vector(db_info("pr2_local"),
pr2_update_old_metadata$query)
# Find if there are some sequences not in the metadata table (none)
pr2_metatadata_missing <- pr2_update_old_metadata %>% filter(!(genbank_accession %in%
pr2_metadata$genbank_accession))
# find How many distinct genbank sequences : 4550 (4568 lines though)
nrow(distinct(data.frame(pr2_update_old_metadata$genbank_accession)))
4.3 Remove sequences that are not annotated by EukRef
Use data frame * pr2_ciliates_not_updated: 652
pr2_ciliates_not_updated <- pr2_ciliates_not_updated %>% mutate(query = str_c("UPDATE pr2_main",
" SET ", db_update_field_string("removed_version", pr2.env$version), db_update_field_string("edited_remark",
"Ciliophora sequences not annotated by Eukref"), db_update_field_string("edited_by",
pr2.env$editor, append_comma = FALSE), " WHERE genbank_accession = ",
db_sql_escape(genbank_accession)))
pr2_ciliates_not_updated$query_result <- db_execute_query_vector(db_info("pr2_local"),
pr2_ciliates_not_updated$query)
4.4 Add the new sequences
Use data frame * pr2_update_new: 2478 lines added
# Convert to ASCII all columns
pr2_update_new <- pr2_update_new %>% mutate_if(is.character, ~iconv(., "latin1",
"ASCII", sub = "?"))
pr2_main_add <- pr2_update_new %>% select(pr2_accession, genbank_accession,
start, end, label, species) %>% mutate(added_version = pr2.env$version,
edited_by = pr2.env$editor)
pr2_sequences_add <- pr2_update_new %>% select(pr2_accession, sequence, sequence_length,
ambiguities)
pr2_metadata_add <- pr2_update_new %>% select(genbank_accession, starts_with("gb_"),
starts_with("pr2_"), starts_with("eukref_")) %>% mutate(eukref_env_material = eukref_env_material_pr2,
eukref_env_biome = eukref_env_biome_pr2, eukref_biotic_relationship = eukref_biotic_relationship_pr2) %>%
select(-pr2_accession)
# Write to files that are then uploaded into the datbase (easier for error
# tracking)
write_tsv(pr2_main_add, full_path("pr2_main_add.txt"), na = "")
write_tsv(pr2_sequences_add, full_path("pr2_sequences_add.txt"), na = "")
write_tsv(pr2_metadata_add, full_path("pr2_metadata_add.txt"), na = "")
4.5 Mark sequences that are reference
- 1740 sequences
pr2_reference <- read_excel(full_path("EukRef Ciliates 2018 version 2.0.xlsx"),
sheet = "Eukref reference sequences", na = "NA", col_names = TRUE, guess_max = 2e+05)
pr2_reference <- pr2_reference %>% mutate(query = str_c("UPDATE pr2_main", " SET ",
db_update_field_string("reference_sequence", 1, append_comma = FALSE), " WHERE genbank_accession = ",
db_sql_escape(genbank_accession)))
pr2_reference$query_result <- db_execute_query_vector(db_info("pr2_local"),
pr2_reference$query)
4.6 Final checks
Reload the PR2 database
4.6.1 Taxonomy
pr2_taxo_check(pr2_taxo, dir_taxo = "C:/Users/vaulot/Desktop")
4.6.2 Full data
pr2_taxo_missing <- pr2 %>% filter(is.na(division))
pr2_ciliophora_not_updated <- pr2 %>% filter(division == "Ciliophora") %>% filter(!str_detect(edited_version,
"4.7|4.11"))
1.5 Comments on the EukRef database
The EUKREF.tsv file contains some non-ASCII caracters that do not import well in Excel. Must use tsv importer instead. Please make sure that your file is encoded as UTF-8 and does not contain any non-ASCII caracters.
For cultures, I created the species and the genus fields based on the Eukref_name field. However for the genera that have been shown to be polyphyletic I used the Taxo12 field for the genus and Taxo12_sp. for the species field.
Protocruzia has been placed in Spirotrichea by Charles and Wei Ting and in Ciliophora_X by you… What is correct ? OK corrected to Ciliophora_X and included in update
If more than one ENVO keywords, separate by ; not by and or or