PR2 version 4.13.0
Suessiales from J. del Campo

3 pr2sues_curation notes

The Excel file has three tabs:

  1. Suessiales. This tab includes only the two groups defined as Suessiales that I know for sure are monophyletic based on my trees; see the attached tree (ML, best tree of 1000, with 1000 bootstraps). These are the Suessiaceae (paraphyletic) and the new Symbiodiniaceae (monophyletic). There is a discrepancy between the two main papers I used for the curation. Janouškovec et al. (2017, PNAS) treat all Suessiales as Symbiodiniaceae, arguing that none of the extant species of the Suessiales are related to the extinct Suessia. LaJeunesse et al. (2018, Current Biology), however, use Suessiales and include both Suessiaceae and Symbiodiniaceae within it. I follow the latter because the whole community working on symbiodinids currently calls the group Symbiodiniaceae, and that name does not include Polarella, Piscinoodinium or any of the Suessiaceae.
  • Janouškovec, Jan, Gregory S. Gavelis, Fabien Burki, Donna Dinh, Tsvetan R. Bachvaroff, Sebastian G. Gornik, Kelley J. Bright, et al. 2017. “Major Transitions in Dinoflagellate Evolution Unveiled by Phylotranscriptomics.” Proceedings of the National Academy of Sciences 114 (2): E171–80. https://doi.org/10.1073/pnas.1614842114.

  • LaJeunesse, Todd C., John Everett Parkinson, Paul W. Gabrielson, Hae Jin Jeong, James Davis Reimer, Christian R. Voolstra, and Scott R. Santos. 2018. “Systematic Revision of Symbiodiniaceae Highlights the Antiquity and Diversity of Coral Endosymbionts.” Current Biology 28 (16): 2570-2580.e6. https://doi.org/10.1016/j.cub.2018.07.008.

  2. Not Suessiales. This tab includes the Borghiellaceae, which some consider Suessiales and others do not, plus a number of sequences previously named Suessiales XXXX or misannotated as Polarella. Some of them might be related to the Borghiellaceae and others belong to other dinoflagellate groups, based on a general tree of the dinos. I just renamed them as Dinos and XXXs. For the Borghiellaceae I have not touched their affiliation to the Suessiales because I am not sure what to do; as I mentioned, I believe the dinos need a phylogeny-based curation. As a note, Fabien has shown me some dinoflagellate phylogenomic trees and these incongruences are also present there, as a result of imposing morphological criteria on the taxonomy ID. 18S might be wrong when placing dinos on a tree, but genomes are unlikely to be wrong in my opinion.

  3. Weird. One weird sequence that I do not know what to do with.

So, tab 1 is for sure legit and kosher: monophyletic and supported by other papers. In tab 2, the curation of the Borghiellaceae is also correct; the only issue is that I do not think they belong to the same group as the rest of the Suessiales. Everything else is just a catch-all that needs to be properly assigned to a group, but I do not have time to deal with that.

5 Read the original data and reformat

5.2 Compare with sequences in PR2

## [1] "Number of sequences of target group in pr2 : 472"
## [1] "Number of updated sequences that are active in PR2 : 15"
## [1] "Number of updated sequences that are not active in PR2 : 0"
## [1] "Sequences of PR2 that are misclassified - df "
## # A tibble: 2 x 2
##   order             n
##   <chr>         <int>
## 1 Dinophyceae_X    24
## 2 Suessiales      472
## [1] "Sequences from target group in PR2 that are not in update  - df "
## # A tibble: 0 x 97
## # ... with 97 variables: pr2_main_id <int>, pr2_accession <chr>,
## #   genbank_accession <chr>, start <dbl>, end <dbl>, label <chr>, gene <chr>,
## #   organelle <chr>, species <chr>, chimera <int>, chimera_remark <chr>,
## #   reference_sequence <int>, added_version <chr>, removed_version <chr>,
## #   edited_version <chr>, edited_by <chr>, edited_remark <chr>, remark <chr>,
## #   taxo_id <int>, kingdom <chr>, supergroup <chr>, division <chr>,
## #   class <chr>, order <chr>, family <chr>, genus <chr>,
## #   taxon_trophic_mode <chr>, taxo_edited_version <chr>, taxo_edited_by <chr>,
## #   taxo_removed_version <chr>, taxo_remark <chr>, reference <chr>,
## #   seq_id <int>, sequence <chr>, sequence_length <int>, ambiguities <int>,
## #   sequence_hash <chr>, gb_sequence <chr>, pr2_metadata_id <int>,
## #   gb_date <chr>, gb_locus <chr>, gb_definition <chr>, gb_organism <chr>,
## #   gb_organelle <chr>, gb_taxonomy <chr>, gb_strain <chr>,
## #   gb_culture_collection <chr>, gb_clone <chr>, gb_isolate <chr>,
## #   gb_isolation_source <chr>, gb_specimen_voucher <chr>, gb_host <chr>,
## #   gb_collection_date <chr>, gb_environmental_sample <chr>, gb_country <chr>,
## #   gb_lat_lon <chr>, gb_collected_by <chr>, gb_note <chr>,
## #   gb_references <chr>, gb_publication <chr>, gb_authors <chr>,
## #   gb_journal <chr>, pubmed_id <int>, eukref_name <chr>, eukref_source <chr>,
## #   eukref_env_material <chr>, eukref_env_biome <chr>,
## #   eukref_biotic_relationship <chr>, eukref_specific_host <chr>,
## #   eukref_geo_loc_name <chr>, eukref_notes <chr>, pr2_sample_type <chr>,
## #   pr2_sample_method <chr>, pr2_latitude <dbl>, pr2_longitude <dbl>,
## #   pr2_ocean <chr>, pr2_sea <chr>, pr2_sea_lat <dbl>, pr2_sea_lon <dbl>,
## #   pr2_continent <chr>, pr2_country <chr>, pr2_location <chr>,
## #   pr2_location_geoname <chr>, pr2_location_geotype <chr>,
## #   pr2_location_lat <dbl>, pr2_location_lon <dbl>, pr2_country_geocode <chr>,
## #   pr2_country_lat <dbl>, pr2_country_lon <dbl>, pr2_sequence_origin <chr>,
## #   pr2_size_fraction <chr>, pr2_size_fraction_min <dbl>,
## #   pr2_size_fraction_max <dbl>, metadata_remark <chr>, junk_gb_authors <chr>,
## #   junk_gb_publication <chr>, junk_gb_journal <chr>
## [1] "Number of PR2 discarded sequences in update : 0"
## [1] "Sequences duplicated  - df "
## # A tibble: 0 x 2
## # ... with 2 variables: genbank_accession <chr>, n <int>
## [1] "Sequences updated with 2 entries in PR2  (e.g. with and without introns) - df "
## Joining, by = "genbank_accession"
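
The comparison reported in this section boils down to a few set operations on GenBank accessions. The following is a minimal Python sketch of that logic, not the R code that produced this report. The function name `compare_update_with_pr2` and the toy accessions are hypothetical, and it assumes that a PR2 entry counts as "active" when its `removed_version` field is empty.

```python
def compare_update_with_pr2(pr2_rows, update_rows, target_order):
    """Summarise how a curated update overlaps with the PR2 table.

    pr2_rows and update_rows are lists of dicts keyed by column name.
    Assumption: a PR2 row is active when 'removed_version' is empty or missing.
    """
    active = {r["genbank_accession"]: r
              for r in pr2_rows if not r.get("removed_version")}
    removed_acc = {r["genbank_accession"]
                   for r in pr2_rows if r.get("removed_version")}
    active_acc = set(active)
    update_acc = {r["genbank_accession"] for r in update_rows}

    # Active PR2 accessions currently assigned to the target order
    target_acc = {acc for acc, r in active.items()
                  if r.get("order") == target_order}

    return {
        "n_target_in_pr2": len(target_acc),
        "n_update_active": len(update_acc & active_acc),
        # Includes accessions unknown to PR2 as well as removed ones
        "n_update_not_active": len(update_acc - active_acc),
        "n_update_discarded": len(update_acc & removed_acc),
        "target_not_in_update": sorted(target_acc - update_acc),
    }


# Toy data with made-up accessions, for illustration only
pr2 = [
    {"genbank_accession": "A1", "order": "Suessiales", "removed_version": ""},
    {"genbank_accession": "A2", "order": "Suessiales", "removed_version": ""},
    {"genbank_accession": "A3", "order": "Dinophyceae_X", "removed_version": ""},
    {"genbank_accession": "A4", "order": "Suessiales", "removed_version": "4.12.0"},
]
update = [{"genbank_accession": a} for a in ["A1", "A3", "A4", "A5"]]
summary = compare_update_with_pr2(pr2, update, "Suessiales")
```

Keying everything on `genbank_accession` keeps the sketch short, but it collapses entries that share an accession (for example with and without introns), which the report lists separately.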

7 New sequences

7.2 Run the script "PR2 genbank download.R" in background

7.4 Read the features file and join

## Parsed with column specification:
## cols(
##   seqnames = col_character(),
##   start = col_double(),
##   end = col_double(),
##   width = col_double(),
##   strand = col_character(),
##   type = col_character(),
##   gene = col_character(),
##   product = col_character(),
##   loctype = col_character(),
##   genbank_accession = col_character(),
##   note = col_character()
## )
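
As an illustration of the join step, here is a small Python sketch that attaches feature rows to the new sequences by `genbank_accession`. It is not the R code used for this report. The helper names `read_features` and `left_join_features` are hypothetical, the tab delimiter is an assumption about the features file, and the accessions are made up.

```python
import csv
import io


def read_features(text, delimiter="\t"):
    """Parse a delimited features table into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def left_join_features(sequences, features):
    """Left join on genbank_accession.

    Every sequence is kept. A sequence with several features yields one
    output row per feature; a sequence with none yields a single row
    without feature columns. On shared column names the sequence value wins.
    """
    by_accession = {}
    for feature in features:
        by_accession.setdefault(feature["genbank_accession"], []).append(feature)

    joined = []
    for seq in sequences:
        for feature in by_accession.get(seq["genbank_accession"], [{}]):
            joined.append({**feature, **seq})
    return joined


# Toy input: two features for one accession, none for the other
features_text = (
    "genbank_accession\ttype\tgene\n"
    "AB000001\trRNA\t18S\n"
    "AB000001\tintron\t\n"
)
new_sequences = [
    {"genbank_accession": "AB000001", "species": "sp1"},
    {"genbank_accession": "AB000002", "species": "sp2"},
]
joined = left_join_features(new_sequences, read_features(features_text))
```

Because a left join can multiply rows when an accession carries several features, the row count after joining is worth checking against the number of input sequences.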

8 Finalization

8.3 New sequences

8.3.4 Check duplicate entries

## # A tibble: 0 x 2
## # ... with 2 variables: pr2_accession <chr>, n_seq <int>
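
A duplicate check of this kind amounts to counting rows per `pr2_accession` and keeping the values seen more than once; an empty result, as above, means no duplicates. The Python sketch below illustrates the idea and is not the report's R code. The function name `count_duplicates` and the example accessions are hypothetical.

```python
from collections import Counter


def count_duplicates(rows, key="pr2_accession"):
    """Return {key value: number of rows} for values that occur more than once."""
    counts = Counter(row[key] for row in rows)
    return {value: n for value, n in counts.items() if n > 1}


# Toy rows with made-up accessions
rows = [
    {"pr2_accession": "AB000001.1.1750_U"},
    {"pr2_accession": "AB000002.1.1700_U"},
    {"pr2_accession": "AB000001.1.1750_U"},
]
duplicates = count_duplicates(rows)
no_duplicates = count_duplicates(rows[:2])
```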

8.3.5 Final files for uploading

## [1] " Final number of sequences added:  15"
##  [1] "genbank_accession"       "kingdom"                
##  [3] "supergroup"              "division"               
##  [5] "class"                   "order"                  
##  [7] "family"                  "genus"                  
##  [9] "species"                 "sequence"               
## [11] "gb_definition"           "gb_organism"            
## [13] "gb_organelle"            "gb_strain"              
## [15] "gb_isolate"              "gb_clone"               
## [17] "gb_specimen_voucher"     "gb_collected_by"        
## [19] "gb_lat_lon"              "gb_note"                
## [21] "gb_culture_collection"   "gb_isolation_source"    
## [23] "gb_host"                 "gb_environmental_sample"
## [25] "gb_collection_date"      "gb_country"             
## [27] "gb_date"                 "gb_locus"               
## [29] "gb_sequence"             "pr2_sample_type"        
## [31] "sequence_length"         "seqnames"               
## [33] "start"                   "end"                    
## [35] "width"                   "strand"                 
## [37] "type"                    "gene"                   
## [39] "product"                 "loctype"                
## [41] "note"                    "label"                  
## [43] "pr2_accession"           "reference_sequence"     
## [45] "organelle"               "ambiguities"            
## [47] "sequence_hash"
## [1] 15
## [1] 15
## [1] 15

Daniel Vaulot

11 05 2020