PR2 version 4.13.0
Suessiales from J. del Campo
PR2 version 4.13.0
Suessiales from J. del Campo
1 Init
Load the variables common to the different scripts and the necessary libraries
2 Read pr2
- Use the PR2 from Google database
# Info for the local dabase
pr2_db <- db_info("pr2_google")
pr2_db_con <- db_connect(pr2_db)
pr2_main <- tbl(pr2_db_con, "pr2_main") %>% collect()
# pr2_main <- pr2_main %>% filter (is.na(removed_version))
pr2_seq <- tbl(pr2_db_con, "pr2_sequences")%>% collect()
pr2_taxo <- tbl(pr2_db_con, "pr2_taxonomy")
pr2_taxo <- pr2_taxo %>% filter (is.na(taxo_removed_version))%>% collect()
pr2_metadata <- tbl(pr2_db_con, "pr2_metadata")%>% collect()
# Join the tables and keep only sequences that are not removed
pr2 <- left_join(pr2_main, pr2_taxo, by = c("species"="species"))
pr2 <- left_join (pr2, pr2_seq)
## Joining, by = "pr2_accession"
## Joining, by = "genbank_accession"
## [1] TRUE
3 pr2sues_curation notes
The excel file has three columns
- Suessiales. This includes only two of the groups defined as Suessiales and that I know for sure that are monophyletic based on my trees. Find tree attached (ML, best tree of 1000 and 1000 Boostraps). Those are the Suessiaceae (paraphyletic) and the new Symbiodiniaceae (Monophyletic). There is a discrepancy between the two main papers I used for the curation: Janouškovec et al. 2017 PNAS consider all the Suessiales = Symbiodiniaceae, because they argue that all the extant species of the Suessiales are not related to the extint Suessia. However, LaJeneusse et al. 2018 CB they use Suessiales and within Suessiales they include Suessiaceae and Symbiodiniaceae. I keep the latest because currently all the community working on Symbiodinids they name the group Stmbidiniaceae, and that does not include Polarella, Pisdidodinium or any of the Suessiaceae.
Janouškovec, Jan, Gregory S. Gavelis, Fabien Burki, Donna Dinh, Tsvetan R. Bachvaroff, Sebastian G. Gornik, Kelley J. Bright, et al. 2017. “Major Transitions in Dinoflagellate Evolution Unveiled by Phylotranscriptomics.” Proceedings of the National Academy of Sciences 114 (2): E171–80. https://doi.org/10.1073/pnas.1614842114.
LaJeunesse, Todd C., John Everett Parkinson, Paul W. Gabrielson, Hae Jin Jeong, James Davis Reimer, Christian R. Voolstra, and Scott R. Santos. 2018. “Systematic Revision of Symbiodiniaceae Highlights the Antiquity and Diversity of Coral Endosymbionts.” Current Biology 28 (16): 2570-2580.e6. https://doi.org/10.1016/j.cub.2018.07.008.
Not Suessiales. This includes the Borghiellaceae, that by some are considered Suessiales and by others are not and a bunch of sequences previously named as Suessiales XXXX or missanotated as Polarella. Some of them might be related to the Borghiellaceae and some others belong to other dino groups based on a general tree of the dinos. I just renamed them as Dinos and XXXs. In the case of the Borghiellaceae I have not touch their affiliating to the Suessiales, because I am not sure what to do, as I mentioned I believe the Dinos need a phylogenetic based curation treatment. As a note, Fabien has showed me some dinos phylogenomic trees and these incongruencies are also present there as a result of imposing a morphologic criteria to the taxonomy ID. 18S might be wrong when placing dinos on a tree but genomes are unlikely to be wrong in my opinion.
Weird. One weird sequence that I do not know what to do with.
So, tab 1 is for sure legit and kosher, monophyletic and supported by other papers. Tab 2, the curations of the Borghiellaceae is also correct, the only issue is that I do not thing they belong to the same group as the rest of the Suessiales. Everything else is just a Dropbox that needs to be properly assign to a group but I have not time to deal with that.
4 Set up the files
5 Read the original data and reformat
5.1 Read the data
- Number of sequences = 511
pr2_update <- read_excel(file_pr2_update_excel, sheet = "all", guess_max=200000, na=c("", "-"))
str_c("Number of sequences : ", nrow(pr2_update))
## [1] "Number of sequences : 511"
5.2 Compare with sequences in PR2
str_c("Number of sequences of target group in pr2 : ",
nrow(filter(pr2_18S, order == target_group)))
## [1] "Number of sequences of target group in pr2 : 472"
str_c("Number of updated sequences that are active in PR2 : ",
nrow(filter(pr2_update, !(genbank_accession %in% pr2_18S$genbank_accession))))
## [1] "Number of updated sequences that are active in PR2 : 15"
str_c("Number of updated sequences that are not active in PR2 : ",
nrow(filter(pr2_update, (genbank_accession %in% pr2_18S_removed$genbank_accession))))
## [1] "Number of updated sequences that are not active in PR2 : 0"
## [1] "Sequences of PR2 that are misclassified - df "
## # A tibble: 2 x 2
## order n
## <chr> <int>
## 1 Dinophyceae_X 24
## 2 Suessiales 472
## [1] "Sequences from target group in PR2 that are not in update - df "
filter(pr2_18S, (order == target_group) &
!(genbank_accession %in% pr2_update$genbank_accession)) # %>%
## # A tibble: 0 x 97
## # ... with 97 variables: pr2_main_id <int>, pr2_accession <chr>,
## # genbank_accession <chr>, start <dbl>, end <dbl>, label <chr>, gene <chr>,
## # organelle <chr>, species <chr>, chimera <int>, chimera_remark <chr>,
## # reference_sequence <int>, added_version <chr>, removed_version <chr>,
## # edited_version <chr>, edited_by <chr>, edited_remark <chr>, remark <chr>,
## # taxo_id <int>, kingdom <chr>, supergroup <chr>, division <chr>,
## # class <chr>, order <chr>, family <chr>, genus <chr>,
## # taxon_trophic_mode <chr>, taxo_edited_version <chr>, taxo_edited_by <chr>,
## # taxo_removed_version <chr>, taxo_remark <chr>, reference <chr>,
## # seq_id <int>, sequence <chr>, sequence_length <int>, ambiguities <int>,
## # sequence_hash <chr>, gb_sequence <chr>, pr2_metadata_id <int>,
## # gb_date <chr>, gb_locus <chr>, gb_definition <chr>, gb_organism <chr>,
## # gb_organelle <chr>, gb_taxonomy <chr>, gb_strain <chr>,
## # gb_culture_collection <chr>, gb_clone <chr>, gb_isolate <chr>,
## # gb_isolation_source <chr>, gb_specimen_voucher <chr>, gb_host <chr>,
## # gb_collection_date <chr>, gb_environmental_sample <chr>, gb_country <chr>,
## # gb_lat_lon <chr>, gb_collected_by <chr>, gb_note <chr>,
## # gb_references <chr>, gb_publication <chr>, gb_authors <chr>,
## # gb_journal <chr>, pubmed_id <int>, eukref_name <chr>, eukref_source <chr>,
## # eukref_env_material <chr>, eukref_env_biome <chr>,
## # eukref_biotic_relationship <chr>, eukref_specific_host <chr>,
## # eukref_geo_loc_name <chr>, eukref_notes <chr>, pr2_sample_type <chr>,
## # pr2_sample_method <chr>, pr2_latitude <dbl>, pr2_longitude <dbl>,
## # pr2_ocean <chr>, pr2_sea <chr>, pr2_sea_lat <dbl>, pr2_sea_lon <dbl>,
## # pr2_continent <chr>, pr2_country <chr>, pr2_location <chr>,
## # pr2_location_geoname <chr>, pr2_location_geotype <chr>,
## # pr2_location_lat <dbl>, pr2_location_lon <dbl>, pr2_country_geocode <chr>,
## # pr2_country_lat <dbl>, pr2_country_lon <dbl>, pr2_sequence_origin <chr>,
## # pr2_size_fraction <chr>, pr2_size_fraction_min <dbl>,
## # pr2_size_fraction_max <dbl>, metadata_remark <chr>, junk_gb_authors <chr>,
## # junk_gb_publication <chr>, junk_gb_journal <chr>
# write_tsv(full_path("api_pr2_not_javier.tsv"), na="")
str_c("Number of PR2 discarded sequences in update : ",
nrow(filter(pr2_18S_removed, genbank_accession %in% pr2_update$genbank_accession)))
## [1] "Number of PR2 discarded sequences in update : 0"
## [1] "Sequences duplicated - df "
## # A tibble: 0 x 2
## # ... with 2 variables: genbank_accession <chr>, n <int>
## [1] "Sequences updated with 2 entries in PR2 (e.g. with and without introns) - df "
pr2_duplicated <- left_join(select(pr2_update, genbank_accession),
select(pr2_18S, genbank_accession, pr2_accession)) %>%
count(genbank_accession)
## Joining, by = "genbank_accession"
6 Taxonomy
6.1 Build and check
pr2_update_taxo <- pr2_update %>%
group_by_at(pr2.env$taxo_levels) %>%
count()
pr2_taxo_check(pr2_update_taxo, "../temp")
## Joining, by = "name"
pr2_update_taxo <- pr2_update_taxo %>%
rename_all(~ str_c(.,"_new" )) %>%
rename(species = species_new) %>%
left_join(pr2_taxo)
## Joining, by = "species"
6.2 Find taxa in PR2 that are not included in the update
7 New sequences
7.1 List of new sequences
- 15 new sequences
7.2 Run the script "
PR2 genbank dowmnload.R` in background
7.3 Read the metadata files and join by genbank accession
metafile <- list()
for (i in c(1:2)) {
metafile[[i]] <- read_tsv(full_path(str_c("pr2_gb_metadata_",i,".txt")),
col_type = cols(gb_collection_date = col_character())) %>%
mutate(gb_isolate=as.character(gb_isolate))
}
# The first file serves as a template for the following one when binding rows
gb_metadata <- reduce(metafile, bind_rows) %>%
distinct() %>%
rename(gb_sequence = sequence)
pr2_new <- left_join(pr2_new, gb_metadata, by = c("genbank_accession" = "genbank_accession"))
7.4 Read the features file and join
gb_features <- read_tsv(full_path("pr2_gb_features.txt")) %>%
filter(product == "18S ribosomal RNA") %>%
rename(start_feature = start, end_feature = end) %>%
distinct()
## Parsed with column specification:
## cols(
## seqnames = col_character(),
## start = col_double(),
## end = col_double(),
## width = col_double(),
## strand = col_character(),
## type = col_character(),
## gene = col_character(),
## product = col_character(),
## loctype = col_character(),
## genbank_accession = col_character(),
## note = col_character()
## )
8 Finalization
8.1 Taxo
8.2 Old sequences
pr2_update_old <- pr2_update %>%
filter((genbank_accession %in% pr2$genbank_accession))
pr2_main_old <- pr2_update_old %>%
select (genbank_accession, species) %>%
mutate(edited_version = "4.13.0", edited_by = "J. del Campo")
nrow(pr2_main_old)
## [1] 496
8.3 New sequences
8.3.1 Read new sequences once the corrections have been done
8.3.2 Add fields
pr2_new <- pr2_new %>%
# mutate(start=1, end=sequence_length,
# label = case_when (intron == "intron" ~ "G",
# TRUE ~"U")) %>%
mutate(sequence_length = str_length(sequence),
label = "U",
pr2_accession = str_c(genbank_accession,".",start,".", end,"_",label),
reference_sequence = NA,
gene="18S rRNA",
organelle = "nucleus",
ambiguities = str_count(sequence,pattern=pr2.env$ambig_regex)
)
pr2_new$sequence_hash = purrr::map_chr(pr2_new$sequence,digest::sha1)
8.3.3 Remove sequences with ambiguities or too short
filter(pr2_new, (sequence_length < pr2.env$sequence_length_min)|(ambiguities > pr2.env$ambiguities_max)) %>%
arrange(desc(sequence_length)) %>%
select(pr2_accession, class, genus, sequence_length,ambiguities, species,gb_definition)
## # A tibble: 0 x 7
## # ... with 7 variables: pr2_accession <chr>, class <chr>, genus <chr>,
## # sequence_length <int>, ambiguities <int>, species <chr>,
## # gb_definition <chr>
8.3.4 Check duplicate entries
## # A tibble: 0 x 2
## # ... with 2 variables: pr2_accession <chr>, n_seq <int>
8.3.5 Final files for uploading
## [1] " Final number of sequences added: 15"
## [1] "genbank_accession" "kingdom"
## [3] "supergroup" "division"
## [5] "class" "order"
## [7] "family" "genus"
## [9] "species" "sequence"
## [11] "gb_definition" "gb_organism"
## [13] "gb_organelle" "gb_strain"
## [15] "gb_isolate" "gb_clone"
## [17] "gb_specimen_voucher" "gb_collected_by"
## [19] "gb_lat_lon" "gb_note"
## [21] "gb_culture_collection" "gb_isolation_source"
## [23] "gb_host" "gb_environmental_sample"
## [25] "gb_collection_date" "gb_country"
## [27] "gb_date" "gb_locus"
## [29] "gb_sequence" "pr2_sample_type"
## [31] "sequence_length" "seqnames"
## [33] "start" "end"
## [35] "width" "strand"
## [37] "type" "gene"
## [39] "product" "loctype"
## [41] "note" "label"
## [43] "pr2_accession" "reference_sequence"
## [45] "organelle" "ambiguities"
## [47] "sequence_hash"
pr2_update_new <- pr2_new
pr2_main_new <- pr2_update_new %>%
select (pr2_accession, genbank_accession, start,end, label, species, organelle, gene) %>%
mutate(added_version = "4.13.0", edited_by = "J. del Campo")
nrow(pr2_main_new)
## [1] 15
pr2_sequences_new <- pr2_update_new %>%
select(pr2_accession, sequence, gb_sequence, sequence_length, ambiguities, sequence_hash)
nrow(pr2_sequences_new)
## [1] 15
pr2_metadata_new <- pr2_update_new %>%
select(genbank_accession, starts_with("gb_"),starts_with("pr2_"), starts_with("eukref_")) %>%
select( -pr2_accession, -gb_sequence) %>%
distinct()
nrow(pr2_metadata_new)
## [1] 15
9 Save everything to an Excel file
file_pr2_imports <- full_path("pr2_imports_final.xlsx")
onglets <- list("pr2_taxo_update" = pr2_taxo_updated,
# "pr2_taxo_remove" = pr2_taxo_remove,
"pr2_main_old" = pr2_main_old,
# "pr2_metadata_old" = pr2_metadata_old,
"pr2_main_new" = pr2_main_new,
"pr2_sequence_new" = pr2_sequences_new,
"pr2_metadata_new" = pr2_metadata_new)
write.xlsx(onglets, file_pr2_imports)