PR2 version 4.12.0
Part A - Plastid 16S

1 Initialize libraries

By default do not run any of the chunks when knitting
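
A minimal sketch of how chunk evaluation can be switched off by default, assuming the notebook uses a standard knitr setup chunk (the original setup chunk is not shown, so this is an illustration rather than the author's code):

```r
# Setup chunk: disable evaluation of every chunk by default when knitting.
# An individual chunk can still opt in with the chunk option eval=TRUE.
knitr::opts_chunk$set(eval = FALSE)
```

With this default, knitting only renders the text and code listings; each processing step is then run interactively or re-enabled chunk by chunk.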

Load the variables common to the different scripts and the necessary libraries

3 PhytoRef Euk

3.1 PhytoRef read files

Create a table from existing information

Worksheet PhytoRef.xlsx

  • phytoref : the final PhytoRef table that will be imported into PR2
  • 01-phytoref_download_2019 : contains the tab-delimited file available from the Roscoff PhytoRef site
  • 02-taxonomy_table_21_11_2014 : contains the taxonomy table from 2014 with some metadata
  • 04-phytoref_new_seq : contains new sequences deposited to GenBank that are not in the current PhytoRef

Fasta PhytoRef_with_taxonomy.fasta

  • Fasta file with taxonomy

3.2 Basic cleaning

3.2.1 Remove sequences in phytoref_01 not found in phytoref_02: 10 sequences

  • These are duplicated sequences
  • Removed
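
This kind of comparison between two tables can be sketched with dplyr joins. The tables below are toy stand-ins and the key column `phytoref_id` is a hypothetical name, since the real worksheet columns are not shown here:

```r
library(dplyr)

# Toy stand-ins for the two worksheets (hypothetical key column phytoref_id)
phytoref_01 <- tibble(phytoref_id = c("P1", "P2", "P3", "P4"))
phytoref_02 <- tibble(phytoref_id = c("P1", "P3"))

# Rows of phytoref_01 without a match in phytoref_02: the ones to remove
phytoref_01_only <- anti_join(phytoref_01, phytoref_02, by = "phytoref_id")

# Rows of phytoref_01 that have a match in phytoref_02: the ones to keep
phytoref_01_clean <- semi_join(phytoref_01, phytoref_02, by = "phytoref_id")
```

Inspecting `phytoref_01_only` before dropping it makes it easy to confirm that the removed rows really are duplicates.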

3.2.2 Sequences from Cyanobacteria: 184 sequences

3.2.4 Remove sequences with missing GenBank ID : 74

  • It seems that these sequences never made it to GenBank
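
A possible way to isolate and drop such rows, assuming the accession is stored in a column named `genbank_accession` (a hypothetical name) and that missing values appear either as `NA` or as empty strings; the accession shown is a made-up placeholder:

```r
library(dplyr)

# Toy table: one valid accession, one NA and one empty string
phytoref <- tibble(
  phytoref_id       = c("P1", "P2", "P3"),
  genbank_accession = c("ACC0001", NA, "")
)

# Keep a record of the sequences without a GenBank ID before removing them
phytoref_missing <- phytoref %>%
  filter(is.na(genbank_accession) | genbank_accession == "")

# Retain only the sequences with a usable GenBank ID
phytoref <- phytoref %>%
  filter(!is.na(genbank_accession), genbank_accession != "")
```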

3.6 Finalization

3.6.4 Find how many GenBank sequences have more than one entry - 739 (corresponds to plastid genomes)
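
Such a count can be obtained roughly as follows. This is a sketch on toy data; the column name `genbank_accession` and the accession values are assumptions:

```r
library(dplyr)

# Toy table in which two accessions occur more than once
phytoref <- tibble(
  genbank_accession = c("ACC1", "ACC1", "ACC2", "ACC3", "ACC3", "ACC3")
)

# One row per accession that has several entries, with its number of entries
genbank_multiple <- phytoref %>%
  count(genbank_accession) %>%
  filter(n > 1)

# Number of accessions with more than one entry
nrow(genbank_multiple)
```

Several entries per accession are expected when several 16S copies are extracted from the same plastid genome record. The same check applies to the Cyanobacteria table in section 4.4.4.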

4 PhytoRef Cyanos

4.4 Finalization

4.4.4 Find how many GenBank sequences have more than one entry - 0

5 Checking the taxonomy of the unknown sequences using Wang assigner - 927 sequences reassigned

5.2 Read the classifier file

Rules

  1. Sequences classified as Bacteria are removed from the database (manually): 9
  2. Sequences that have a bootstrap of at least 90 at species level are assigned to species
  3. Sequences that have a bootstrap of at least 90 at genus level are assigned to genus

For the rest, the assignment does not change.
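
The effect of these rules on individual sequences can be illustrated on a toy table. This is a sketch, not the pipeline code: all values are invented, and `taxo8` is assumed here to hold the current PR2 species label.

```r
library(dplyr)

min_boot <- 90

# Invented classifier output for four sequences
classified <- tibble(
  pr2_accession = c("seq1", "seq2", "seq3", "seq4"),
  taxo8         = rep("Unknown_sp.", 4),   # assumed: current PR2 species label
  kingdom       = c("Bacteria", "Eukaryota", "Eukaryota", "Eukaryota"),
  genus         = c("Genus_a", "Genus_b", "Genus_c", "Genus_d"),
  species       = c("Genus_a_x", "Genus_b_y", "Genus_c_z", "Genus_d_w"),
  genus_boot    = c(100, 98, 95, 60),
  species_boot  = c(100, 93, 70, 40)
)

reassigned <- classified %>%
  filter(kingdom != "Bacteria") %>%                      # rule 1
  mutate(species_new = case_when(
    species_boot >= min_boot ~ species,                  # rule 2
    genus_boot   >= min_boot ~ paste0(genus, "_sp."),    # rule 3
    TRUE                     ~ taxo8                     # otherwise unchanged
  ))
```

Here seq1 is dropped, seq2 is assigned to species, seq3 to genus only, and seq4 keeps its original label.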

Look also at sequences for which the PR2 assignment and the Wang assignment disagree…

# Minimum bootstrap value required to accept a genus or species assignment
min_boot <- 90

pr2_plastid_unknown_reassigned <- read_tsv(full_path("plastid_unknown.dada2.taxo") ) %>% 
  tidyr::separate(into= c("pr2_accession", "seq_label", str_c("taxo", 1:8) ), col=seq_name, sep = "[|]")

print("Sequence assigned to bacteria and removed from pr2")
pr2_plastid_unknown_reassigned %>%  
  filter(kingdom == "Bacteria")

pr2_plastid_unknown_reassigned <- pr2_plastid_unknown_reassigned %>%  
  filter(kingdom != "Bacteria")

print("Sequence contradiction at supergroup level")
pr2_plastid_unknown_reassigned %>%  
  filter(taxo2 != supergroup) %>% 
  select(pr2_accession, seq_label, taxo2, taxo8, supergroup, species, supergroup_boot, species_boot)

print("Sequence contradiction at division level")
pr2_plastid_unknown_reassigned %>%  
  filter((taxo3 != division)&(taxo2 == supergroup)) %>% 
  select(pr2_accession, seq_label, taxo3, taxo8, division, species, division_boot, species_boot)

print("Sequence contradiction at class level")
pr2_plastid_unknown_reassigned %>%  
  # Class differs while division and supergroup agree (mirrors the division-level check)
  filter((taxo4 != class)&(taxo3 == division)&(taxo2 == supergroup)) %>% 
  select(pr2_accession, seq_label, taxo4, taxo8, class, species, class_boot, species_boot)

pr2_plastid_unknown_reassigned <- pr2_plastid_unknown_reassigned %>%  
  filter(genus_boot >= min_boot) %>% 
  mutate(species_new = case_when(species_boot >= min_boot ~ species,
                                 genus_boot >= min_boot ~ str_c(genus,"_sp."),
                                 TRUE ~ NA_character_) ) %>%
  filter(!is.na(species_new)) 

pr2_plastid_unknown_reassigned %>% 
  select(pr2_accession, seq_label, taxo4, taxo7, taxo8, class, species_new, genus_boot, species_boot) %>% 
  write_tsv(full_path("phytoref_reassigned.tsv"), na="")

# Check new species that are not in pr2 and enter them manually

pr2_plastid_unknown_reassigned %>% filter(!(species_new %in% pr2$species)) %>% 
  count(species_new)
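
The `tidyr::separate()` call at the top of this code assumes that each sequence name is made of ten fields delimited by `|`. A small sketch with an invented header shows how the bracketed pattern `[|]` splits on a literal pipe (an unescaped `|` would be read as regex alternation):

```r
library(tidyr)

# Invented sequence name: accession, label, then eight taxonomic levels
toy <- data.frame(
  seq_name = "ACC1|label1|t1|t2|t3|t4|t5|t6|t7|t8",
  stringsAsFactors = FALSE
)

toy_split <- separate(
  toy,
  col  = seq_name,
  into = c("pr2_accession", "seq_label", paste0("taxo", 1:8)),
  sep  = "[|]"
)
```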

6 Save all the files

Daniel Vaulot

08 08 2019