PR2 version 4.13.0
Thalassiosirales from L. Arsenieff

3 Reference

  • Arsenieff L., Le Gall F., Rigaut-Jalabert F., Mahé F., Sarno D., Gouhier L., Baudoux A-C., Simon N. 2020. Diversity and dynamics of relevant nanoplanktonic diatoms in the Western English Channel. The ISME Journal. DOI: 10.1038/s41396-020-0659-6.

5 Read the original data and reformat

5.2 Compare with sequences in PR2

[1] "Number of sequences of target group in pr2 : 255"
[1] "Number of updated sequences that are not present in PR2 : 19"
[1] "Number of updated sequences that are not active in PR2 : 2"
[1] "Sequences of PR2 that are misclassified - df "
# A tibble: 3 x 2
  genus             n
  <chr>         <int>
1 Guinardia        13
2 Minidiscus       13
3 Thalassiosira   229
[1] "Sequences from target group in PR2 that are not in update  - df "
# A tibble: 0 x 97
# ... with 97 variables: pr2_main_id <int>, pr2_accession <chr>,
#   genbank_accession <chr>, start <dbl>, end <dbl>, label <chr>, gene <chr>,
#   organelle <chr>, species <chr>, chimera <int>, chimera_remark <chr>,
#   reference_sequence <int>, added_version <chr>, removed_version <chr>,
#   edited_version <chr>, edited_by <chr>, edited_remark <chr>, remark <chr>,
#   taxo_id <int>, kingdom <chr>, supergroup <chr>, division <chr>,
#   class <chr>, order <chr>, family <chr>, genus <chr>,
#   taxon_trophic_mode <chr>, taxo_edited_version <chr>, taxo_edited_by <chr>,
#   taxo_removed_version <chr>, taxo_remark <chr>, reference <chr>,
#   seq_id <int>, sequence <chr>, sequence_length <int>, ambiguities <int>,
#   sequence_hash <chr>, gb_sequence <chr>, pr2_metadata_id <int>,
#   gb_date <chr>, gb_locus <chr>, gb_definition <chr>, gb_organism <chr>,
#   gb_organelle <chr>, gb_taxonomy <chr>, gb_strain <chr>,
#   gb_culture_collection <chr>, gb_clone <chr>, gb_isolate <chr>,
#   gb_isolation_source <chr>, gb_specimen_voucher <chr>, gb_host <chr>,
#   gb_collection_date <chr>, gb_environmental_sample <chr>, gb_country <chr>,
#   gb_lat_lon <chr>, gb_collected_by <chr>, gb_note <chr>,
#   gb_references <chr>, gb_publication <chr>, gb_authors <chr>,
#   gb_journal <chr>, pubmed_id <int>, eukref_name <chr>, eukref_source <chr>,
#   eukref_env_material <chr>, eukref_env_biome <chr>,
#   eukref_biotic_relationship <chr>, eukref_specific_host <chr>,
#   eukref_geo_loc_name <chr>, eukref_notes <chr>, pr2_sample_type <chr>,
#   pr2_sample_method <chr>, pr2_latitude <dbl>, pr2_longitude <dbl>,
#   pr2_ocean <chr>, pr2_sea <chr>, pr2_sea_lat <dbl>, pr2_sea_lon <dbl>,
#   pr2_continent <chr>, pr2_country <chr>, pr2_location <chr>,
#   pr2_location_geoname <chr>, pr2_location_geotype <chr>,
#   pr2_location_lat <dbl>, pr2_location_lon <dbl>, pr2_country_geocode <chr>,
#   pr2_country_lat <dbl>, pr2_country_lon <dbl>, pr2_sequence_origin <chr>,
#   pr2_size_fraction <chr>, pr2_size_fraction_min <dbl>,
#   pr2_size_fraction_max <dbl>, metadata_remark <chr>, junk_gb_authors <chr>,
#   junk_gb_publication <chr>, junk_gb_journal <chr>
[1] "Number of PR2 discarded sequences in update : 2"
[1] "Sequences duplicated  - df "
# A tibble: 0 x 2
# ... with 2 variables: genbank_accession <chr>, n <int>
[1] "Sequences updated with 2 entries in PR2  (e.g. with and without introns) - df "
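
The checks reported above can be reproduced in outline with a few set operations on accession numbers. The following is a minimal sketch, not the code that generated this report: the data frames `pr2` and `update`, the accession values, and the rule that a sequence is "not active" when `removed_version` is filled in are all assumptions made for illustration. Only the column names `genbank_accession`, `genus` and `removed_version` come from the output above. Base R is used to keep the sketch free of dependencies, although the tibble printouts suggest the report itself relies on the tidyverse.

```r
# Hypothetical toy data: accession numbers are invented for illustration.
pr2 <- data.frame(
  genbank_accession = c("AA000001", "AA000002", "AA000003", "AA000004"),
  genus             = c("Thalassiosira", "Thalassiosira", "Minidiscus", "Guinardia"),
  removed_version   = c(NA, NA, "4.12.0", NA),  # assumption: non-NA means "not active"
  stringsAsFactors  = FALSE
)
update <- data.frame(
  genbank_accession = c("AA000002", "AA000003", "AA000005", "AA000006"),
  stringsAsFactors  = FALSE
)

# Updated sequences that are not present in PR2
new_accessions <- setdiff(update$genbank_accession, pr2$genbank_accession)

# Updated sequences that exist in PR2 but are flagged as removed
inactive_in_pr2  <- pr2$genbank_accession[!is.na(pr2$removed_version)]
updated_inactive <- intersect(update$genbank_accession, inactive_in_pr2)

# Number of PR2 sequences per genus in the target group
genus_counts <- table(pr2$genus)

# Accessions that appear more than once in the update
duplicated_accessions <- update$genbank_accession[duplicated(update$genbank_accession)]

print(paste("Number of updated sequences that are not present in PR2 :",
            length(new_accessions)))
print(paste("Number of updated sequences that are not active in PR2 :",
            length(updated_inactive)))
```

An empty `duplicated_accessions` vector corresponds to the zero-row "Sequences duplicated" table above.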

8 Finalization

8.3 New sequences

8.3.4 Check duplicate entries

# A tibble: 0 x 2
# ... with 2 variables: pr2_accession <chr>, n_seq <int>
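
A duplicate check of this kind amounts to counting rows per `pr2_accession` and keeping any accession that occurs more than once. The sketch below illustrates the idea under stated assumptions: the helper `find_duplicate_accessions()` and the example accession strings are hypothetical, and only the column names `pr2_accession` and `n_seq` are taken from the output above.

```r
# Hypothetical helper: returns one row per pr2_accession that occurs
# more than once, with the number of occurrences in n_seq.
find_duplicate_accessions <- function(sequences) {
  counts   <- table(sequences$pr2_accession)
  repeated <- counts[counts > 1]
  data.frame(
    pr2_accession    = names(repeated),
    n_seq            = as.integer(repeated),
    stringsAsFactors = FALSE
  )
}

# Invented example data: one clean set and one with a repeated accession.
clean <- data.frame(
  pr2_accession    = c("AA000005.1.1700_U", "AA000006.1.1700_U"),
  stringsAsFactors = FALSE
)
with_repeat <- data.frame(
  pr2_accession    = c("AA000005.1.1700_U", "AA000005.1.1700_U", "AA000006.1.1700_U"),
  stringsAsFactors = FALSE
)

duplicates_clean  <- find_duplicate_accessions(clean)
duplicates_repeat <- find_duplicate_accessions(with_repeat)

print(nrow(duplicates_clean))
print(duplicates_repeat)
```

A result with zero rows, as in the table above, indicates that every new sequence has a unique `pr2_accession`.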

8.3.5 Final files for uploading

[1] " Final number of sequences added:  17"
 [1] "genbank_accession"       "start"                  
 [3] "end"                     "label"                  
 [5] "gene"                    "organelle"              
 [7] "kingdom"                 "supergroup"             
 [9] "division"                "class"                  
[11] "order"                   "family"                 
[13] "genus"                   "species"                
[15] "sequence"                "gb_definition"          
[17] "gb_organism"             "gb_organelle"           
[19] "gb_strain"               "gb_isolate"             
[21] "gb_clone"                "gb_specimen_voucher"    
[23] "gb_collected_by"         "gb_lat_lon"             
[25] "gb_note"                 "gb_culture_collection"  
[27] "gb_isolation_source"     "gb_host"                
[29] "gb_environmental_sample" "gb_collection_date"     
[31] "gb_country"              "gb_date"                
[33] "gb_locus"                "gb_sequence"            
[35] "pr2_sample_type"         "sequence_length"        
[37] "pr2_longitude"           "pr2_latitude"           
[39] "seqnames"                "start.y"                
[41] "end.y"                   "width"                  
[43] "strand"                  "type"                   
[45] "product"                 "loctype"                
[47] "pr2_accession"           "reference_sequence"     
[49] "ambiguities"             "sequence_hash"          
[1] 17
[1] 17
[1] 17

Daniel Vaulot

11 05 2020