PR2 version 4.12.0
Apicomplexa from J. del Campo
PR2 version 4.12.0
Apicomplexa from J. del Campo
1 Init
Load the variables common to the different scripts and the necessary libraries
2 Read pr2
- Use the PR2 on local computer
# Info for the local dabase
pr2_db <- db_info("pr2_local")
pr2_db_con <- db_connect(pr2_db)
pr2_main <- tbl(pr2_db_con, "pr2_main") %>% collect()
# pr2_main <- pr2_main %>% filter (is.na(removed_version))
pr2_seq <- tbl(pr2_db_con, "pr2_sequences")%>% collect()
pr2_taxo <- tbl(pr2_db_con, "pr2_taxonomy")
pr2_taxo <- pr2_taxo %>% filter (is.na(taxo_removed_version))%>% collect()
pr2_metadata <- tbl(pr2_db_con, "pr2_metadata")%>% collect()
# Join the tables and keep only sequences that are not removed
pr2 <- left_join(pr2_main, pr2_taxo, by = c("species"="species"))
pr2 <- left_join (pr2, pr2_seq)
pr2 <- left_join (pr2, pr2_metadata)
db_disconnect(pr2_db_con)
pr2_18S <- pr2 %>% filter (is.na(removed_version)) %>%
filter(gene == "18S_rRNA")
pr2_18S_removed <- pr2 %>% filter (!is.na(removed_version)) %>%
filter(gene == "18S_rRNA")
# Remove pr2 to make sure that we do not use the full pr2
remove(pr2)
3 Apicomplexa - Round 1 - 2019-05
3.1 Set up the files
3.2 Read the original data and reformat
3.2.1 Read the data
- Number of sequences = 8392
3.2.2 Compare with apicomplexa sequences in PR2
str_c("Number of sequences Apicomplexa in pr2 : ",
nrow(filter(pr2_18S, division == "Apicomplexa")))
str_c("Number of new sequences of Apicomplexa : ",
nrow(filter(pr2_update, !(genbank_accession %in% pr2_18S$genbank_accession))))
str_c("Number of sequences of Apicomplexa misclassified in PR2 : ")
filter(pr2_18S, (genbank_accession %in% pr2_update$genbank_accession)) %>%
count(division)
str_c("Number of sequences Apicomplexa sequences in PR2 that are not in Javier table : ")
filter(pr2_18S, (division == "Apicomplexa") &
!(genbank_accession %in% pr2_update$genbank_accession)) %>%
write_tsv(full_path("api_pr2_not_javier.tsv"), na="")
str_c("Number of PR2 discared sequences in Javier table : ",
nrow(filter(pr2_removed, genbank_accession %in% pr2_update$genbank_accession)))
3.2.3 Build the genus and species field
3.3 Read the reformatted data api_02
- The
api_01.tsv
file is imported into Excel into sheet api_02 - Taxonomy is manually edited…
- Columns taxo01-> taxo5 correspond to kingdom to order
- family is either taxo6 or taxo7 depending on groups.
- For genus which are split into different taxo8 level, I created clades based on taxo8 (e.g. for Babesia, Thelleria etc…)
- species corresponds to what is in the eukref_name when possible except when this leads to incoherent taxonomy (e.g. the same species in 2 different families). In this case I kept the species for the clade from which it was defined and renamed the other species with “Genus_sp.”.
3.3.1 Read the updated data
3.3.2 Build taxonomy for api_02
pr2_update_taxo <- pr2_update %>%
group_by_at(pr2.env$taxo_levels) %>%
count()
pr2_taxo_check(pr2_update_taxo, "../temp")
pr2_update_taxo <- pr2_update_taxo %>%
rename_all(funs(str_c(.,"_new" ))) %>%
rename(species = species_new) %>%
left_join(pr2_taxo)
write_tsv(pr2_update_taxo, full_path("api_02_taxo.tsv"), na="")
3.3.3 Find taxa in PR2 that are not included in Javier taxonomy
- 124 taxa not included in Javier taxonomy
3.4 Download metadata for new sequences
- 2641 new sequences
3.5 Add new metadata
- The metadata have been extracted previously by a batch file (
PR2 genbank download.R
): 2535 new metadata. * Metadata from 92 sequences could not be retrieved because they correspond to sequences that have been removed from GenBank
metadata_list <- list()
for (i in 1:10){
metadata_list[[i]] <- read_tsv(full_path(str_c("pr2_gb_metadata_",i,".txt")))
}
metadata <- purrr::reduce(metadata_list, bind_rows)
pr2_metadata_new <- metadata %>%
select(genbank_accession, contains("gb_"), contains("pr2_")) %>%
write_tsv(full_path(str_c("pr2_metadata_new.txt")), na="")
4 Apicomplexa - Round 3 - 2019-07-29
From Javier
I am done with the Apicomplexa.
All the changes a made on your annotation have been highlighted in red, you will see that almost everything is red..., sadly the paper has been submitted and is under review but I will try my best to adequate the DB we are going to release with the paper with the one I am sending to you that I think it makes more sense.
Apart from the changes in the taxonomy I have also looked at the seqeunces that are in PR2 but not in my DB. An dI have classified them in 5 categories.
1. Missed and New. Sequences that are new, most o them and 20 that for some reason I missed.
2. No 18S. Sequences that contain more than the 18S or genomes that I did not add to my DB.
3. Quarantine. Sequences that not being in principle chimeras are too divergent or too weird to be assigned to the Apis. We need to build pretty general tree in order to place or discard them. Maybe by blast the best hit is an API but the second best hit is a plant, and both with low similarity.
4. No apis. Sequences that I know for sure that are no apis, most of them come from a single study.
5. Chimeras. Chimeras.
BONUS: I don't know what to do with the sequences related to Toxoplasma (Toxoplasma 1 and Toxoplasma 2). Is pure chaos, any suggestion would be welcome.
4.1 Read updated file
4.1.1 File path
4.1.2 Read the updated file and update all necessary fields
4.1.3 Create all necessary fields
pr2_update <- pr2_update %>%
mutate(sequence_length_new = str_length(sequence),
ambiguities = str_count(pr2_update$sequence,pattern=pr2.env$ambig_regex),
label = "U",
start = 1,
end = sequence_length_new,
pr2_accession = str_c(genbank_accession,".",start,".", end,"_",label),
gene= "18S_rRNA",
organelle = "nucleus",
edited_version = "4.12.0",
edited_by = "J. del Campo",
removed_version = NA)
pr2_update$sequence_hash = purrr::map_chr(pr2_update$sequence,digest::sha1)
pr2_update_disagreement <- pr2_update %>%
filter(sequence_length != sequence_length_new)
write_tsv(pr2_update, "Apicomplexa 3.0.txt", na="")
4.2 Metadata
4.2.1 Get metadata for sequences which not yet metadata
4.2.2 Add new metadata
- 162 entries added
- Of these 145 do not exist anymore in GenBank
metadata_list <- list()
for (i in 1:10){
metadata_list[[i]] <- read_tsv(full_path(str_c("pr2_gb_metadata_03_",i,".txt")))
}
metadata <- purrr::reduce(metadata_list, bind_rows)
pr2_metadata_new <- metadata %>%
select(genbank_accession, contains("gb_"), contains("pr2_")) %>%
write_tsv(full_path(str_c("pr2_metadata_new_03.txt")), na="")
4.3 Filter out sequences that do not fit PR2 criteria
- 30 sequences rejected
- Number of sequences accepted): 8362
# Remove sequences that are
# * shorter than minimum length (500)
# * more than 20 ambiguities
# * more than 2N consecutive
pr2_update_rejected <- pr2_update %>%
filter((sequence_length < pr2.env$sequence_length_min)|
(ambiguities > pr2.env$ambiguities_max)|
str_detect(sequence, "NNN") )
write_tsv( pr2_update_rejected, full_path("Apicomplexa_rejected.txt"), na="")
str_c("Number of sequences rejected (length <500 or ambuiguities > 20 or more than 2 consecutive N): ", nrow(pr2_update_rejected))
select(pr2_update_rejected, genbank_accession, sequence_length, ambiguities)
pr2_update_accepted <- pr2_update %>%
filter(!(genbank_accession %in% pr2_update_rejected$genbank_accession))
str_c("Number of sequences accepted): ", nrow(pr2_update_accepted))
4.4 Check if some sequences have same genbank
None
4.5 Taxonomy
4.5.1 Build taxonomy file
pr2_update_taxo <- pr2_update_accepted %>%
group_by_at(pr2.env$taxo_levels) %>%
count()
pr2_taxo_check(pr2_update_taxo, "../temp")
pr2_update_taxo <- pr2_update_taxo %>%
rename_all(~str_c(.,"_new" )) %>%
rename(species = species_new) %>%
left_join(pr2_taxo) %>%
rename_at(vars(kingdom:genus), ~str_c(.,"_old" )) %>%
rename_all(~str_replace(., "_new", ""))
write_tsv(pr2_update_taxo, full_path("taxo_api_03.txt"), na="")
4.5.2 Find within missed or new sequences, taxa that are not in the new taxonomy
- Number of sequences not in Javier table: 461
pr2_missed_new <- read_excel(file_pr2_imports_excel, sheet = "missing_new",
guess_max=200000, na=c("", "-")) %>%
filter(is.na(status_javier)|(status_javier == "missed_new"))
str_c("Number of sequences not in Javier table: ", nrow(pr2_missed_new))
taxo_missed_new <- pr2_missed_new %>%
group_by_at(pr2.env$taxo_levels) %>%
count()
taxo_missed_new_not_updated <- taxo_missed_new %>%
filter(!(species%in% pr2_update_taxo$species)) %>%
write_tsv(full_path("taxo_not_api_03.txt"), na="")
4.6 Add and updated sequences
4.6.1 New sequences that are not in PR2
- Number of new sequences of Apicomplexa : 2619
pr2_updated_new <- pr2_update_accepted %>%
filter(!(genbank_accession %in% pr2$genbank_accession))
str_c("Number of new sequences of Apicomplexa : ",
nrow(pr2_genbank_new))
pr2_main_new <- pr2_updated_new %>%
select(genbank_accession, species, label:edited_by, -sequence_hash)
pr2_sequence_new <- pr2_updated_new %>%
select(pr2_accession, starts_with("sequence"), ambiguities)
pr2_metadata_new <- pr2_updated_new %>%
select(genbank_accession, starts_with("eukref_"), -eukref_name_2)
4.6.2 Old sequences that are in PR2
- Number of old sequences of Apicomplexa updated: 5743
pr2_updated_old <- pr2_update_accepted %>%
filter((genbank_accession %in% pr2$genbank_accession))
str_c("Number of old sequences of Apicomplexa updated: ",
nrow(pr2_updated_old))
pr2_main_old <- pr2_updated_old %>%
select(genbank_accession, species, edited_version, edited_by)
pr2_metadata_old <- pr2_updated_old %>%
select(genbank_accession, starts_with("eukref_"), -eukref_name_2)
4.6.3 Sequence in PR2 with same GenBank number
- 292 sequences corresponding to 146 accession (U and UC)
pr2_main_old_old <- pr2 %>%
filter(genbank_accession %in% pr2_main_old$genbank_accession)
genbank_duplicated <- pr2_main_old_old %>%
count(genbank_accession) %>%
filter(n>1)
pr2_genbank_duplicated <- pr2 %>%
filter(genbank_accession %in% genbank_duplicated$genbank_accession) %>%
arrange(pr2_accession)
str_c("Number of old sequences that have the same accession 2 by 2: ",
nrow(pr2_genbank_duplicated))
4.6.4 Save everything to an Excel file
5 Final
pr2_Apicomplexa <- pr2 %>%
filter(division == "Apicomplexa") %>%
select(pr2_accession:organelle, kingdom:genus, species, chimera:remark,
taxo_edited_version:metadata_remark,
-sequence, -pr2_main_id, -taxo_id, -pr2_metadata_id, -seq_id) %>%
arrange(gene, organelle, division, class, order, family, genus, species)
file_xls <- full_path("Apicomplexa pr2 4.12.0.xlsx")
onglets <- list("pr2_Apicomplexa" = pr2_Apicomplexa)
openxlsx::write.xlsx(onglets, file_xls)
5.1 Corrections
- Besnoitia_besnoiti split into 2 species because they belong to different Toxoplasma
- Besnoitia1_besnoiti
- Besnoitia2_besnoiti
- Neospora_caninum split into 2 species because they belong to different Toxoplasma
- Neospora1_caninum
- Neospora2_caninum
- Toxoplasma_gondii into 2 species because they belong to different Toxoplasma
- Toxoplasma1_gondii
- Toxoplasma2_gondii
- For all the genera that have been split in a number of genera the species have also been given the same number (e.g. Babesia1_xxx).
5.2 Actions
- Sequences not in Javier table
- Quarantine : removed from active PR2 database (they can be re-added latter)
- Not 18S: removed from active PR2 database (not that HQ219405, HG328237, HG328236, HG328235 seems to be OK)
- Not Apicomplexa: removed from active PR2 database
- Chimera: removed from active PR2 database and tagged as chimeras
- Missed/New: Kept in PR2 - species updated with new taxonomy
- Apicoplast sequences updated with new taxonomy
- Updates
- New: 2619
- Updated: 5889+239
- Removed: 89