PR2 version 4.13.0
Adding RCC sequences

4 RCC database

4.3 Reload the file after manual correction

4.4 Compare with sequences in PR2

## [1] "Number of updated sequences that are not present in PR2 : 1244"
## [1] "Number of updated sequences that are not active in PR2 : 0"
## [1] "Sequences duplicated  - df "
## [1] genbank_accession n                
## <0 rows> (or 0-length row.names)
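
The two counts above can be reproduced with filtering joins. The following is a minimal sketch on toy data, not the code used for this report: the table names `pr2` and `rcc_updated`, the accessions, and the use of a `removed_version` column to mark inactive entries are all assumptions made for illustration.

```r
library(dplyr)

# Toy stand-ins for the PR2 table and the updated RCC sequences (hypothetical data)
pr2 <- tibble(
  genbank_accession = c("AB000001", "AB000002", "AB000003"),
  removed_version   = c(NA, NA, "4.12.0")  # assumed marker for inactive entries
)
rcc_updated <- tibble(genbank_accession = c("AB000002", "AB000003", "AB000004"))

# Updated sequences with no matching accession in PR2
not_in_pr2 <- anti_join(rcc_updated, pr2, by = "genbank_accession")

# Updated sequences that exist in PR2 but are flagged as removed
not_active <- rcc_updated %>%
  inner_join(pr2, by = "genbank_accession") %>%
  filter(!is.na(removed_version))

print(paste("Number of updated sequences that are not present in PR2 :", nrow(not_in_pr2)))
print(paste("Number of updated sequences that are not active in PR2 :", nrow(not_active)))
```

`anti_join()` keeps only the rows of its first argument that have no match in the second, which is why it suits a "not present" check.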

5 Taxonomy

5.1 Build and check

## Warning: `group_by_()` is deprecated as of dplyr 0.7.0.
## Please use `group_by()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## Warning: `transmute_()` is deprecated as of dplyr 0.7.0.
## Please use `transmute()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## Joining, by = "name"
## Joining, by = "species"
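
The deprecation warnings above indicate that some code in the pipeline, possibly inside a dependency, still calls the underscore verbs `group_by_()` and `transmute_()`. A hedged sketch of the current equivalents follows, assuming dplyr 1.0.0 or later; the `taxo` table and its contents are invented for illustration. Passing `by =` explicitly to the join also silences the "Joining, by" messages.

```r
library(dplyr)

# Hypothetical taxonomy table
taxo <- tibble(
  genus   = c("Hemiselmis", "Hemiselmis", "Micromonas"),
  species = c("Hemiselmis_virescens", "Hemiselmis_rufescens", "Micromonas_pusilla")
)

group_col <- "genus"  # column name held as a string, the usual reason for group_by_()

# Replacement for group_by_(taxo, group_col)
counts <- taxo %>%
  group_by(across(all_of(group_col))) %>%
  summarise(n_species = n(), .groups = "drop")

# Replacement for transmute_(): refer to string-named columns through .data
labels <- taxo %>%
  transmute(label = paste(.data[["genus"]], .data[["species"]], sep = ";"))

# An explicit `by` avoids the "Joining, by = ..." message
joined <- left_join(taxo, counts, by = "genus")
```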

6 New sequences

6.2 Run the script PR2 genbank download.R in the background

6.4 Read the features file and join

## Parsed with column specification:
## cols(
##   seqnames = col_character(),
##   start = col_double(),
##   end = col_double(),
##   width = col_double(),
##   strand = col_character(),
##   type = col_character(),
##   product = col_character(),
##   loctype = col_character(),
##   genbank_accession = col_character(),
##   note = col_character()
## )
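
The column specification message above is printed by readr when column types are guessed. A minimal sketch of reading a features table with explicit types and joining it to the sequence metadata is shown below. The file contents, the `sequences` table, and the reduced set of columns are assumptions; only the column names are taken from the specification above.

```r
library(readr)
library(dplyr)

# Write a small hypothetical features file so the example is self-contained
features_file <- tempfile(fileext = ".tsv")
writeLines(c(
  "seqnames\tstart\tend\ttype\tproduct\tgenbank_accession",
  "seq1\t1\t1800\trRNA\t18S ribosomal RNA\tAB000001",
  "seq2\t5\t1750\trRNA\t18S ribosomal RNA\tAB000002"
), features_file)

# Declaring col_types makes the parse deterministic and suppresses the message
features <- read_tsv(features_file, col_types = cols(
  seqnames          = col_character(),
  start             = col_double(),
  end               = col_double(),
  type              = col_character(),
  product           = col_character(),
  genbank_accession = col_character()
))

# Hypothetical sequence metadata; AB000003 has no annotated feature
sequences <- tibble(
  genbank_accession = c("AB000001", "AB000002", "AB000003"),
  sequence_length   = c(1800, 1790, 1500)
)

# A left join keeps every sequence, with NA where no feature was found
joined <- left_join(sequences, features, by = "genbank_accession")
```

A left join is used here so that sequences without an annotated feature remain visible for manual inspection rather than being dropped silently.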

7 Finalization

7.2 New sequences

7.2.1 Read new sequences once the corrections have been done

  • Sequences with ITS region removed
  • Sequences with introns removed

7.2.3 Remove sequences that have too many ambiguities or are too short

## # A tibble: 1 x 5
##   pr2_accession  species   sequence_length ambiguities gb_definition            
##   <chr>          <chr>               <int>       <int> <chr>                    
## 1 MF179484.1.18~ Hemiselm~            1847          34 Hemiselmis virescens vou~
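
A filter of this kind can be sketched as follows. The thresholds `min_length` and `max_ambiguities`, the accessions, and the toy sequences are hypothetical; the report does not state the cutoffs actually applied. Ambiguities are counted here as any character other than A, C, G, T or U.

```r
library(dplyr)

min_length      <- 8  # hypothetical cutoff, chosen for the toy data
max_ambiguities <- 2  # hypothetical cutoff, chosen for the toy data

# Toy sequences with made-up identifiers
new_seqs <- tibble(
  pr2_accession = c("seq_a", "seq_b", "seq_c"),
  sequence      = c("ACGTACGTAC", "ACGTNNNNAC", "ACGT")
) %>%
  mutate(
    sequence_length = nchar(sequence),
    # Strip unambiguous bases; what is left are ambiguity codes
    ambiguities     = nchar(gsub("[ACGTU]", "", toupper(sequence)))
  )

# Sequences failing either criterion, listed for review before removal
removed <- filter(new_seqs, sequence_length < min_length | ambiguities > max_ambiguities)

# Everything else is kept
kept <- anti_join(new_seqs, removed, by = "pr2_accession")
```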

7.2.4 Check duplicate entries

## # A tibble: 0 x 2
## # ... with 2 variables: pr2_accession <chr>, n <int>
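
The empty two-column tibble above (`pr2_accession`, `n`) is the shape produced by counting accessions and keeping those seen more than once. A minimal sketch on invented data that deliberately contains one duplicate:

```r
library(dplyr)

# Hypothetical accessions; "seq_b" appears twice on purpose
final_seqs <- tibble(pr2_accession = c("seq_a", "seq_b", "seq_c", "seq_b"))

# Accessions occurring more than once; zero rows means no duplicates
duplicates <- final_seqs %>%
  count(pr2_accession) %>%
  filter(n > 1)

# One possible remedy: keep the first occurrence of each accession
deduplicated <- distinct(final_seqs, pr2_accession, .keep_all = TRUE)
```

Whether dropping later occurrences is appropriate depends on why the duplicate arose, so inspecting `duplicates` first is the safer order of operations.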

7.2.5 Final files for uploading

Final number of sequences: 1129

## [1] " Final number of sequences added:  1129"
##  [1] "genbank_accession"       "pr2_longitude"          
##  [3] "pr2_latitude"            "sampling_date"          
##  [5] "pr2_depth"               "pr2_sea"                
##  [7] "pr2_ocean"               "strain_name"            
##  [9] "rcc_id"                  "species"                
## [11] "gb_definition"           "gb_organism"            
## [13] "gb_organelle"            "gb_strain"              
## [15] "gb_isolate"              "gb_clone"               
## [17] "gb_specimen_voucher"     "gb_collected_by"        
## [19] "gb_lat_lon"              "gb_note"                
## [21] "gb_culture_collection"   "gb_isolation_source"    
## [23] "gb_host"                 "gb_environmental_sample"
## [25] "gb_collection_date"      "gb_country"             
## [27] "gb_date"                 "gb_locus"               
## [29] "gb_sequence"             "pr2_sample_type"        
## [31] "sequence_length"         "seqnames"               
## [33] "start"                   "end"                    
## [35] "start_feature"           "end_feature"            
## [37] "width"                   "strand"                 
## [39] "type"                    "product"                
## [41] "loctype"                 "note"                   
## [43] "sequence"                "label"                  
## [45] "pr2_accession"           "reference_sequence"     
## [47] "gene"                    "organelle"              
## [49] "ambiguities"             "sequence_hash"
## [1] 1129
## [1] 1129
## [1] 1129

Daniel Vaulot

22 05 2020