PR2 version 4.13.0
Adding Silva sequences

6 New sequences

6.5 Run the script PR2 genbank download.R in background

7 Finalization

8 One more round of search for metadata

Note: these entries correspond to contigs from metagenomes that point to a single record. They have to be searched one by one (script #6).

8.1 Run the script on server (script_genbank_06.R)

  • Note that, for metagenomes, several pr2 entries can correspond to a single GenBank entry. This has to be updated in the pr2_main table.

8.2 Read the metadata files and join by genbank accession

# Read the metadata files one by one, skipping indices for which no file exists
metafile <- list()

for (i in 1:500) {
  file_name <- full_path(str_c("metadata_06/pr2_gb_metadata_6_", i, ".txt"))
  if (file.exists(file_name)) {
    # Read these columns as character so that their type is the same in every file
    metafile[[i]] <- read_tsv(file_name,
                              col_types = cols(gb_collection_date = col_character(),
                                               gb_strain = col_character(),
                                               gb_isolate = col_character(),
                                               gb_project_id = col_character(),
                                               gb_specimen_voucher = col_character(),
                                               gb_clone = col_character()),
                              guess_max = 10000)
  }
}

# The first file serves as a template for the following ones when binding rows
gb_metadata <- reduce(metafile, bind_rows) %>%
  filter(!is.na(genbank_accession_new)) %>% # Remove all entries for which we could not find a GenBank entry
  distinct()

write.xlsx(gb_metadata, full_path("metadata_06/pr2_gb_metadata_6.xlsx"))

# Some entries have the same genbank_accession and different pr2_accession
junk <- gb_metadata %>% 
  count(genbank_accession_old) %>% 
  filter(n > 1) %>%
  arrange(desc(n))
  
pr2_new_6_metadata <- left_join(pr2_new_6, gb_metadata, by = c("genbank_accession" = "genbank_accession_old")) %>% 
  rename(genbank_accession_old = genbank_accession) %>%
  filter(!is.na(genbank_accession_new))  # Remove all entries for which we could not find a GenBank entry 

pr2_main_6 <- pr2_new_6_metadata %>% 
  select(pr2_accession, genbank_accession_new) %>% 
  rename(genbank_accession = genbank_accession_new)

# Some entries have the same genbank_accession and different pr2_accession
junk <- pr2_main_6 %>% 
  count(genbank_accession) %>% 
  filter(n > 1) %>%
  arrange(desc(n))

pr2_metadata_6 <- pr2_new_6_metadata %>% 
  select(-pr2_accession, -genbank_accession_old) %>% 
  rename(genbank_accession = genbank_accession_new) %>% 
  filter(!(genbank_accession %in% pr2_metadata$genbank_accession)) %>% 
  distinct()
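
The join and split performed above can be illustrated on a toy data set. This is only a sketch: the accession values, the gb_organism column, and the names main_toy and metadata_toy are made up for illustration and do not come from the PR2 tables. It shows how two pr2 entries that point to the same new GenBank record give two rows in the main table but a single row in the metadata table.

```r
library(dplyr)

# Toy stand-ins for the real tables (all values are invented for illustration)
pr2_new_6 <- tibble(
  pr2_accession     = c("A1.1.100_U", "A2.1.100_U", "A3.1.100_U"),
  genbank_accession = c("A1", "A2", "A3")
)

gb_metadata <- tibble(
  genbank_accession_old = c("A1", "A2"),
  genbank_accession_new = c("META01", "META01"),  # two contigs, one GenBank record
  gb_organism           = c("uncultured eukaryote", "uncultured eukaryote")
)

# Same steps as above: join on the old accession, keep only matched entries
joined <- pr2_new_6 %>%
  left_join(gb_metadata, by = c("genbank_accession" = "genbank_accession_old")) %>%
  rename(genbank_accession_old = genbank_accession) %>%
  filter(!is.na(genbank_accession_new))  # A3 has no metadata and is dropped

# One row per pr2 entry, now carrying the new accession
main_toy <- joined %>%
  select(pr2_accession, genbank_accession = genbank_accession_new)

# One row per distinct GenBank record
metadata_toy <- joined %>%
  select(-pr2_accession, -genbank_accession_old) %>%
  rename(genbank_accession = genbank_accession_new) %>%
  distinct()

nrow(main_toy)      # 2
nrow(metadata_toy)  # 1
```

The difference between the two row counts is the same effect that makes pr2_main larger than pr2_metadata in the real data.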

8.3 Save everything to an Excel file

  • pr2_main: 70668 entries corresponding to 55512 updates. Replace the old genbank_accession with the new one. For metagenomes, some pr2_accession values correspond to the same GenBank entry, e.g. OBEP000000000, which corresponds to 44921 Silva sequences.
  • pr2_metadata: 14247 entries
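
No code is shown for this step. One possible way to write both tables to a single workbook is sketched below, assuming that write.xlsx comes from the openxlsx package (the document does not say which package is used). The toy tables, the sheet names, and the temporary output path are hypothetical; in practice pr2_main_6 and pr2_metadata_6 would take their place.

```r
library(openxlsx)

# Hypothetical toy tables standing in for pr2_main_6 and pr2_metadata_6
pr2_main_toy <- data.frame(
  pr2_accession     = c("X1", "X2"),
  genbank_accession = c("META01", "META01")
)
pr2_metadata_toy <- data.frame(
  genbank_accession = "META01",
  gb_organism       = "uncultured eukaryote"
)

out_file <- tempfile(fileext = ".xlsx")  # hypothetical output path

# With openxlsx, a named list of data frames is written as one worksheet per element
write.xlsx(list(pr2_main = pr2_main_toy, pr2_metadata = pr2_metadata_toy),
           file = out_file)

# List the worksheets that were created
getSheetNames(out_file)
```

Keeping the two tables as separate sheets of one file makes it easy to check the entry counts listed above before updating the database.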

Daniel Vaulot

02 06 2020