PR2 version 4.13.0
Adding GenBank sequences

4 Get sequence list from GenBank

4.1 Compare with sequences in PR2

## [1] "Number of updated sequences that are not present in PR2 : 12504"
## [1] "Number of updated sequences that are not active in PR2 : 6"
## [1] "Sequences duplicated  - df "
## # A tibble: 0 x 2
## # ... with 2 variables: genbank_accession <chr>, n <int>
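
The comparison reported above can be sketched in base R as follows. This is only an illustration under stated assumptions: the accessions are made up, the `removed` flag used to mark inactive PR2 entries is a hypothetical column name, and the report does not show the code that produced these counts.

```r
# Hypothetical data: accessions returned by a GenBank query
genbank <- data.frame(
  genbank_accession = c("AA000001", "AA000002", "AA000003", "AA000004"),
  stringsAsFactors = FALSE
)

# Hypothetical PR2 content; `removed` is an assumed flag for inactive entries
pr2 <- data.frame(
  genbank_accession = c("AA000002", "AA000004", "AA000009"),
  removed = c(FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# GenBank accessions that have no entry in PR2
in_pr2 <- genbank$genbank_accession %in% pr2$genbank_accession
not_in_pr2 <- genbank[!in_pr2, , drop = FALSE]

# GenBank accessions that are in PR2 but flagged as inactive
inactive <- pr2$genbank_accession[pr2$removed]
not_active <- genbank[genbank$genbank_accession %in% inactive, , drop = FALSE]

# Accessions listed more than once in the GenBank query result
counts <- table(genbank$genbank_accession)
duplicated_acc <- names(counts)[counts > 1]

print(paste("Not present in PR2:", nrow(not_in_pr2)))
print(paste("Not active in PR2:", nrow(not_active)))
```

An empty `duplicated_acc` corresponds to the empty tibble shown above: no accession appears twice in the list.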

6 Finalization

6.2 New sequences

6.2.1 Read new sequences once the corrections have been done

  • Sequences with ITS region removed
  • Sequences with introns removed

6.2.3 Remove sequences that are too short or contain too many ambiguities

## # A tibble: 27 x 4
##    pr2_accession   sequence_length ambiguities gb_definition                    
##    <chr>                     <dbl>       <int> <chr>                            
##  1 MG972345.1.287~            2875         103 Chaetoceros sp. Na12A3 isolate N~
##  2 MG972343.1.287~            2875         103 Chaetoceros sp. Na12A3 18S ribos~
##  3 MG972335.1.268~            2689         131 Chaetoceros diversus isolate Na3~
##  4 MG972337.1.262~            2621         130 Chaetoceros diversus isolate Na1~
##  5 MG972364.1.245~            2455          87 Chaetoceros sp. Na28A1 18S ribos~
##  6 KL780694.1.243~            2432         100 Manihot esculenta cultivar KU50 ~
##  7 MF497377.1.235~            2356         830 Phryganella paradoxa isolate KD1~
##  8 MG972365.1.220~            2204         120 Chaetoceros cf. vixvisibilis iso~
##  9 MN367945.1.211~            2115         140 Bembidion platyderoides voucher ~
## 10 MG972366.1.209~            2097          74 Chaetoceros cf. vixvisibilis iso~
## # ... with 17 more rows
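
A minimal sketch of this kind of filter, in base R, is given below. The thresholds `min_length` and `max_ambiguities` are hypothetical values chosen for the example, since the report does not state the cut-offs actually used, and the sequences and identifiers are made up. Here every character other than A, C, G or T is counted as an ambiguity.

```r
# Count characters other than A, C, G, T (either case) in each sequence
count_ambiguities <- function(sequences) {
  nchar(gsub("[ACGTacgt]", "", sequences))
}

# Hypothetical new sequences
new_seqs <- data.frame(
  pr2_accession = c("seq_a", "seq_b", "seq_c"),
  gb_sequence = c(
    paste(rep("ACGT", 150), collapse = ""),  # 600 bases, no ambiguity
    paste(rep("ACGN", 150), collapse = ""),  # 600 bases, 150 ambiguities
    "ACGTACGT"                               # 8 bases, too short
  ),
  stringsAsFactors = FALSE
)
new_seqs$sequence_length <- nchar(new_seqs$gb_sequence)
new_seqs$ambiguities <- count_ambiguities(new_seqs$gb_sequence)

# Hypothetical thresholds, not taken from the report
min_length <- 500
max_ambiguities <- 20

# Sequences to discard: too short or too many ambiguities
is_rejected <- new_seqs$sequence_length < min_length |
  new_seqs$ambiguities > max_ambiguities
rejected <- new_seqs[is_rejected, ]
kept <- new_seqs[!is_rejected, ]

print(rejected[, c("pr2_accession", "sequence_length", "ambiguities")])
```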

6.2.4 Check duplicate entries

## # A tibble: 0 x 2
## # ... with 2 variables: pr2_accession <chr>, n <int>
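
The duplicate check amounts to counting rows per `pr2_accession` and keeping the accessions that occur more than once. A small base R sketch with made-up identifiers (the report itself appears to use tibbles, so this is an equivalent rather than the original code):

```r
# Hypothetical table in which "seq_b" appears twice
new_seqs <- data.frame(
  pr2_accession = c("seq_a", "seq_b", "seq_b", "seq_c"),
  stringsAsFactors = FALSE
)

# Number of rows per accession, returned as columns pr2_accession and n
counts <- as.data.frame(
  table(pr2_accession = new_seqs$pr2_accession),
  responseName = "n",
  stringsAsFactors = FALSE
)

# Keep only accessions that occur more than once
duplicates <- counts[counts$n > 1, ]
print(duplicates)
```

A result with zero rows, as in the output above, means that every accession is unique.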

6.3 Get some statistics

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

6.3.1 Final files for uploading

Final number of sequences: 7032

## [1] " Final number of sequences added:  7032"
##  [1] "genbank_accession"       "gb_definition"          
##  [3] "gb_organism"             "gb_strain"              
##  [5] "gb_organelle"            "gb_isolate"             
##  [7] "gb_clone"                "gb_specimen_voucher"    
##  [9] "gb_collected_by"         "gb_lat_lon"             
## [11] "gb_note"                 "gb_culture_collection"  
## [13] "gb_isolation_source"     "gb_host"                
## [15] "gb_environmental_sample" "gb_collection_date"     
## [17] "gb_country"              "gb_date"                
## [19] "pr2_longitude"           "pr2_latitude"           
## [21] "pr2_sample_type"         "gb_sequence"            
## [23] "sequence_length"         "start"                  
## [25] "end"                     "label"                  
## [27] "pr2_accession"           "reference_sequence"     
## [29] "gene"                    "organelle"              
## [31] "ambiguities"             "sequence_hash"
## [1] 7032
## [1] 7032
## [1] 7032
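
The column list above includes `start`, `end`, `label` and `pr2_accession`. One way such an accession could be assembled from the other columns is sketched below. The format `<genbank_accession>.<start>.<end>_<label>` is an assumption made for this example and is not confirmed by the report; the helper `build_accession` and the data are hypothetical.

```r
# Hypothetical helper: the accession format is assumed, not taken from the report
build_accession <- function(genbank_accession, start, end, label) {
  paste0(genbank_accession, ".", start, ".", end, "_", label)
}

# Made-up entries; integer positions avoid scientific notation in the text
seqs <- data.frame(
  genbank_accession = c("AA000001", "AA000002"),
  start = c(1L, 25L),
  end = c(1800L, 1750L),
  label = c("U", "U"),
  stringsAsFactors = FALSE
)

seqs$pr2_accession <- build_accession(
  seqs$genbank_accession, seqs$start, seqs$end, seqs$label
)

# After building the accessions, check that each one is unique
stopifnot(!any(duplicated(seqs$pr2_accession)))
print(seqs$pr2_accession)
```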

Daniel Vaulot

27 05 2020