PR2 version 4.12.0
Apicomplexa from J. del Campo

1 Init
2 Read pr2
3 Apicomplexa - Round 1 - 2019-05
4 Apicomplexa - Round 3 - 2019-07-29
5 Final
- 5.1 Corrections
- 5.2 Actions

1 Init

Load the variables common to the different scripts and the necessary libraries

  knitr::opts_chunk$set(eval=FALSE, 
                        cache = TRUE)

  source('PR2_init.R', echo=FALSE)

2 Read pr2

Use the PR2 on local computer

# Info for the local dabase 
  pr2_db <- db_info("pr2_local")

  pr2_db_con <- db_connect(pr2_db)
  pr2_main <- tbl(pr2_db_con, "pr2_main") %>% collect()
  # pr2_main  <- pr2_main %>% filter (is.na(removed_version))

  pr2_seq <- tbl(pr2_db_con, "pr2_sequences")%>% collect()

  pr2_taxo <- tbl(pr2_db_con, "pr2_taxonomy")
  pr2_taxo  <- pr2_taxo %>% filter (is.na(taxo_removed_version))%>% collect()

  pr2_metadata <- tbl(pr2_db_con, "pr2_metadata")%>% collect()

# Join the tables and keep only sequences that are not removed

  pr2 <- left_join(pr2_main, pr2_taxo, by = c("species"="species"))
  pr2 <- left_join (pr2, pr2_seq)
  pr2 <- left_join (pr2, pr2_metadata)
  db_disconnect(pr2_db_con)
  
  pr2_18S  <- pr2 %>% filter (is.na(removed_version)) %>% 
    filter(gene == "18S_rRNA")
  pr2_18S_removed <- pr2 %>% filter (!is.na(removed_version)) %>% 
    filter(gene == "18S_rRNA")
  
# Remove pr2 to make sure that we do not use the full pr2
  remove(pr2)

3 Apicomplexa - Round 1 - 2019-05

3.1 Set up the files

dir_pr2_update <- "../updates/2018 Apicomplexa Javier"
  pr2.env$editor <- "Del Campo J."

  full_path <- function(file_name){str_c(dir_pr2_update,"/", file_name)}

  file_pr2_update_excel <- full_path("Apicomplexa.xlsx")  
# Import the tsv skipping the first line

3.2 Read the original data and reformat

3.2.1 Read the data

Number of sequences = 8392

pr2_update <- read_excel(file_pr2_update_excel, sheet = "original", guess_max=200000, na=c("", "-"))
  
  str_c("Number of sequences : ", nrow(pr2_update))

3.2.2 Compare with apicomplexa sequences in PR2

  str_c("Number of sequences Apicomplexa in pr2 : ", 
        nrow(filter(pr2_18S, division == "Apicomplexa")))

  str_c("Number of new sequences of Apicomplexa : ", 
        nrow(filter(pr2_update, !(genbank_accession %in% pr2_18S$genbank_accession))))

  str_c("Number of sequences of Apicomplexa misclassified in PR2 : ")
  filter(pr2_18S, (genbank_accession %in% pr2_update$genbank_accession)) %>% 
             count(division)  
  
  str_c("Number of sequences Apicomplexa sequences in PR2 that are not in Javier table : ") 
  filter(pr2_18S, (division == "Apicomplexa") & 
             !(genbank_accession %in% pr2_update$genbank_accession))  %>% 
  write_tsv(full_path("api_pr2_not_javier.tsv"), na="")
  
   str_c("Number of PR2 discared sequences in Javier table : ",
         nrow(filter(pr2_removed, genbank_accession %in% pr2_update$genbank_accession)))

3.2.3 Build the genus and species field

temp <- pr2_update %>% 
    mutate(genus = case_when(eukref_source == "Isolate" ~ str_remove(eukref_name,"_.+"),
                             TRUE ~ taxo8)) %>% 
    mutate(species = case_when(eukref_source == "Isolate" ~ str_replace(eukref_name, "sp", "sp."),
                             TRUE ~ str_c(taxo8,"_sp.")))

3.2.4 Reexport for editing in Excel (api_01)

write_tsv(temp, full_path("api_01.tsv"), na="")

3.3 Read the reformatted data api_02

The api_01.tsv file is imported into Excel into sheet api_02
Taxonomy is manually edited…
Columns taxo01-> taxo5 correspond to kingdom to order
family is either taxo6 or taxo7 depending on groups.
For genus which are split into different taxo8 level, I created clades based on taxo8 (e.g. for Babesia, Thelleria etc…)
species corresponds to what is in the eukref_name when possible except when this leads to incoherent taxonomy (e.g. the same species in 2 different families). In this case I kept the species for the clade from which it was defined and renamed the other species with “Genus_sp.”.

3.3.1 Read the updated data

pr2_update <- read_excel(file_pr2_update_excel, sheet = "api_02", guess_max=200000, na=c("", "-"))

3.3.2 Build taxonomy for api_02

pr2_update_taxo <- pr2_update %>% 
  group_by_at(pr2.env$taxo_levels) %>%  
  count() 

pr2_taxo_check(pr2_update_taxo, "../temp")

pr2_update_taxo <- pr2_update_taxo %>% 
  rename_all(funs(str_c(.,"_new" ))) %>%
  rename(species = species_new) %>% 
  left_join(pr2_taxo)

write_tsv(pr2_update_taxo, full_path("api_02_taxo.tsv"), na="")

3.3.3 Find taxa in PR2 that are not included in Javier taxonomy

124 taxa not included in Javier taxonomy

temp <- pr2_taxo %>% 
  filter(division == "Apicomplexa") %>% 
  filter(!(species %in% pr2_update_taxo$species))

write_tsv(temp, full_path("pr2_taxo_not_javier.tsv"), na="")

3.4 Download metadata for new sequences

2641 new sequences

pr2_genbank_new <- pr2_update %>% 
  filter(!(genbank_accession %in% pr2$genbank_accession)) %>% 
  select(genbank_accession)

3.5 Add new metadata

The metadata have been extracted previously by a batch file (PR2 genbank download.R): 2535 new metadata. * Metadata from 92 sequences could not be retrieved because they correspond to sequences that have been removed from GenBank

metadata_list <- list()
for (i in 1:10){
  metadata_list[[i]] <- read_tsv(full_path(str_c("pr2_gb_metadata_",i,".txt")))
}
metadata <- purrr::reduce(metadata_list, bind_rows)

pr2_metadata_new <- metadata %>% 
  select(genbank_accession, contains("gb_"), contains("pr2_")) %>% 
  write_tsv(full_path(str_c("pr2_metadata_new.txt")), na="")

4 Apicomplexa - Round 3 - 2019-07-29

From Javier

    I am done with the Apicomplexa.
    All the changes a made on your annotation have been highlighted in red, you will see that almost everything is red..., sadly the paper has been submitted and is under review but I will try my best to adequate the DB we are going to release with the paper with the one I am sending to you that I think it makes more sense.
    Apart from the changes in the taxonomy I have also looked at the seqeunces that are in PR2 but not in my DB. An dI have classified them in 5 categories.
    1. Missed and New. Sequences that are new, most o them and 20 that for some reason I missed.
    2. No 18S. Sequences that contain more than the 18S or genomes that I did not add to my DB.
    3. Quarantine. Sequences that not being in principle chimeras are too divergent or too weird to be assigned to the Apis. We need to build pretty general tree in order to place or discard them. Maybe by blast the best hit is an API but the second best hit is a plant, and both with low similarity.
    4. No apis. Sequences that I know for sure that are no apis, most of them come from a single study.
    5. Chimeras. Chimeras.
    
    BONUS: I don't know what to do with the sequences related to Toxoplasma (Toxoplasma 1 and Toxoplasma 2). Is pure chaos, any suggestion would be welcome.

4.1 Read updated file

4.1.1 File path

dir_pr2_update <- "../updates/2019 Apicomplexa Javier"
  pr2.env$editor <- "Del Campo J."

  full_path <- function(file_name){str_c(dir_pr2_update,"/", file_name)}

  file_pr2_update_excel <- full_path("Apicomplexa 3.0.xlsx")  
  file_pr2_imports_excel <- full_path("Apicomplexa 3.0 imports to PR2.xlsx")

4.1.2 Read the updated file and update all necessary fields

pr2_update <- read_excel(file_pr2_update_excel, sheet = "api_03", guess_max=200000, na=c("", "-"))
  
  str_c("Number of sequences : ", nrow(pr2_update))

4.1.3 Create all necessary fields

pr2_update <- pr2_update %>% 
  mutate(sequence_length_new = str_length(sequence),
         ambiguities = str_count(pr2_update$sequence,pattern=pr2.env$ambig_regex),
         label = "U",
         start = 1,
         end = sequence_length_new,
         pr2_accession = str_c(genbank_accession,".",start,".", end,"_",label),
         gene= "18S_rRNA",
         organelle = "nucleus",
         edited_version = "4.12.0",
         edited_by = "J. del Campo",
         removed_version = NA)

pr2_update$sequence_hash = purrr::map_chr(pr2_update$sequence,digest::sha1)

pr2_update_disagreement <- pr2_update %>% 
  filter(sequence_length != sequence_length_new)

write_tsv(pr2_update, "Apicomplexa 3.0.txt", na="")

4.2 Metadata

4.2.1 Get metadata for sequences which not yet metadata

pr2_genbank_new <- pr2_update %>% 
  filter(!(genbank_accession %in% pr2_metadata$genbank_accession)) %>% 
  select(genbank_accession)

4.2.2 Add new metadata

162 entries added
Of these 145 do not exist anymore in GenBank

metadata_list <- list()
for (i in 1:10){
  metadata_list[[i]] <- read_tsv(full_path(str_c("pr2_gb_metadata_03_",i,".txt")))
}
metadata <- purrr::reduce(metadata_list, bind_rows)

pr2_metadata_new <- metadata %>% 
  select(genbank_accession, contains("gb_"), contains("pr2_")) %>% 
  write_tsv(full_path(str_c("pr2_metadata_new_03.txt")), na="")

4.3 Filter out sequences that do not fit PR2 criteria

30 sequences rejected
Number of sequences accepted): 8362

  # Remove sequences that are 
  # * shorter than minimum length (500)
  # * more than 20 ambiguities
  # * more than 2N consecutive
    pr2_update_rejected <- pr2_update %>% 
      filter((sequence_length < pr2.env$sequence_length_min)|
             (ambiguities > pr2.env$ambiguities_max)|
              str_detect(sequence, "NNN")  )
    write_tsv( pr2_update_rejected, full_path("Apicomplexa_rejected.txt"), na="")
    
    str_c("Number of sequences rejected (length <500 or ambuiguities > 20 or more than 2 consecutive N): ", nrow(pr2_update_rejected))
    
    select(pr2_update_rejected, genbank_accession, sequence_length, ambiguities)
    
    pr2_update_accepted <- pr2_update %>% 
      filter(!(genbank_accession %in% pr2_update_rejected$genbank_accession))
    
    str_c("Number of sequences accepted): ", nrow(pr2_update_accepted))

4.4 Check if some sequences have same genbank

None

  genbank_duplicated <- pr2_update_accepted %>% 
    count(genbank_accession) %>% 
    filter(n>1)

4.5 Taxonomy

4.5.1 Build taxonomy file

pr2_update_taxo <- pr2_update_accepted %>% 
  group_by_at(pr2.env$taxo_levels) %>%  
  count() 

pr2_taxo_check(pr2_update_taxo, "../temp")

pr2_update_taxo <- pr2_update_taxo %>% 
  rename_all(~str_c(.,"_new" )) %>%
  rename(species = species_new) %>% 
  left_join(pr2_taxo) %>% 
  rename_at(vars(kingdom:genus), ~str_c(.,"_old" )) %>% 
  rename_all(~str_replace(., "_new", ""))

write_tsv(pr2_update_taxo, full_path("taxo_api_03.txt"), na="")

4.5.2 Find within missed or new sequences, taxa that are not in the new taxonomy

Number of sequences not in Javier table: 461

pr2_missed_new <- read_excel(file_pr2_imports_excel, sheet = "missing_new", 
                              guess_max=200000, na=c("", "-")) %>% 
  filter(is.na(status_javier)|(status_javier == "missed_new"))
  
  str_c("Number of sequences not in Javier table: ", nrow(pr2_missed_new))
  
taxo_missed_new <- pr2_missed_new %>% 
  group_by_at(pr2.env$taxo_levels) %>%  
  count() 

taxo_missed_new_not_updated <-  taxo_missed_new %>% 
  filter(!(species%in% pr2_update_taxo$species)) %>% 
  write_tsv(full_path("taxo_not_api_03.txt"), na="")

4.6 Add and updated sequences

4.6.1 New sequences that are not in PR2

Number of new sequences of Apicomplexa : 2619

pr2_updated_new <- pr2_update_accepted %>% 
  filter(!(genbank_accession %in% pr2$genbank_accession)) 

str_c("Number of new sequences of Apicomplexa : ", 
        nrow(pr2_genbank_new))

pr2_main_new <- pr2_updated_new %>% 
  select(genbank_accession, species, label:edited_by, -sequence_hash) 
pr2_sequence_new <- pr2_updated_new %>% 
  select(pr2_accession, starts_with("sequence"), ambiguities) 
pr2_metadata_new <- pr2_updated_new %>% 
  select(genbank_accession, starts_with("eukref_"), -eukref_name_2)

4.6.2 Old sequences that are in PR2

Number of old sequences of Apicomplexa updated: 5743

pr2_updated_old <- pr2_update_accepted %>% 
  filter((genbank_accession %in% pr2$genbank_accession)) 

str_c("Number of old sequences of Apicomplexa updated: ", 
        nrow(pr2_updated_old))

pr2_main_old <- pr2_updated_old %>% 
  select(genbank_accession, species, edited_version, edited_by) 
pr2_metadata_old <- pr2_updated_old %>% 
  select(genbank_accession, starts_with("eukref_"), -eukref_name_2)

4.6.3 Sequence in PR2 with same GenBank number

292 sequences corresponding to 146 accession (U and UC)

pr2_main_old_old <- pr2 %>% 
  filter(genbank_accession %in% pr2_main_old$genbank_accession)

genbank_duplicated <- pr2_main_old_old %>% 
    count(genbank_accession) %>% 
    filter(n>1)
  
  pr2_genbank_duplicated <- pr2 %>% 
    filter(genbank_accession %in% genbank_duplicated$genbank_accession) %>% 
    arrange(pr2_accession)

  str_c("Number of old sequences that have the same accession 2 by 2: ", 
        nrow(pr2_genbank_duplicated))

4.6.4 Save everything to an Excel file

  file_pr2_imports <-  full_path("pr2 imports 2.xlsx")
 onglets <- list("pr2_main_old" = pr2_main_old, 
                 "pr2_metadata_old" = pr2_metadata_old,
                 "pr2_main_new" = pr2_main_new, 
                 "pr2_sequence_new" = pr2_sequence_new,
                 "pr2_metadata_new" = pr2_metadata_new)
 openxlsx::write.xlsx(onglets, file_pr2_imports)

5 Final

 pr2_Apicomplexa <- pr2 %>% 
  filter(division == "Apicomplexa") %>% 
  select(pr2_accession:organelle, kingdom:genus, species, chimera:remark, 
         taxo_edited_version:metadata_remark,
        -sequence, -pr2_main_id, -taxo_id, -pr2_metadata_id, -seq_id) %>% 
  arrange(gene, organelle, division, class, order, family, genus, species)

 file_xls <-  full_path("Apicomplexa pr2 4.12.0.xlsx")
 onglets <- list("pr2_Apicomplexa" = pr2_Apicomplexa)
 openxlsx::write.xlsx(onglets, file_xls)

5.1 Corrections

Besnoitia_besnoiti split into 2 species because they belong to different Toxoplasma
- Besnoitia1_besnoiti
- Besnoitia2_besnoiti
Neospora_caninum split into 2 species because they belong to different Toxoplasma
- Neospora1_caninum
- Neospora2_caninum
Toxoplasma_gondii into 2 species because they belong to different Toxoplasma
- Toxoplasma1_gondii
- Toxoplasma2_gondii
For all the genera that have been split in a number of genera the species have also been given the same number (e.g. Babesia1_xxx).

5.2 Actions

Sequences not in Javier table
- Quarantine : removed from active PR2 database (they can be re-added latter)
- Not 18S: removed from active PR2 database (not that HQ219405, HG328237, HG328236, HG328235 seems to be OK)
- Not Apicomplexa: removed from active PR2 database
- Chimera: removed from active PR2 database and tagged as chimeras
- Missed/New: Kept in PR2 - species updated with new taxonomy
Apicoplast sequences updated with new taxonomy
Updates
- New: 2619
- Updated: 5889+239
- Removed: 89

PR2 version 4.12.0 Apicomplexa from J. del Campo

PR2 version 4.12.0 Apicomplexa from J. del Campo

1 Init

2 Read pr2

3 Apicomplexa - Round 1 - 2019-05

3.1 Set up the files

3.2 Read the original data and reformat

3.2.1 Read the data

3.2.2 Compare with apicomplexa sequences in PR2

3.2.3 Build the genus and species field

3.2.4 Reexport for editing in Excel (api_01)

3.3 Read the reformatted data api_02

3.3.1 Read the updated data

3.3.2 Build taxonomy for api_02

3.3.3 Find taxa in PR2 that are not included in Javier taxonomy

3.4 Download metadata for new sequences

3.5 Add new metadata

4 Apicomplexa - Round 3 - 2019-07-29

4.1 Read updated file

4.1.1 File path

4.1.2 Read the updated file and update all necessary fields

4.1.3 Create all necessary fields

4.2 Metadata

4.2.1 Get metadata for sequences which not yet metadata

4.2.2 Add new metadata

4.3 Filter out sequences that do not fit PR2 criteria

4.4 Check if some sequences have same genbank

4.5 Taxonomy

4.5.1 Build taxonomy file

4.5.2 Find within missed or new sequences, taxa that are not in the new taxonomy

4.6 Add and updated sequences

4.6.1 New sequences that are not in PR2

4.6.2 Old sequences that are in PR2

4.6.3 Sequence in PR2 with same GenBank number

4.6.4 Save everything to an Excel file

5 Final

5.1 Corrections

5.2 Actions

PR2 version 4.12.0
Apicomplexa from J. del Campo

PR2 version 4.12.0
Apicomplexa from J. del Campo