PR2: Curation of Radiolaria sequences

Miguel M. Sandin
02-07-2020
miguelmendezsandin@gmail.com

Dependencies

Optional


Step 1. Building a detailed reference alignment

I have used as reference phylogenetic frameworks the last morpho-molecular classifications of Acantharea (Decelle et al. 2012), Collodaria (Biard et al. 2015), Nassellaria (Sandin et al. 2019) and Spumellaria (Sandin et al. 2020). Following these phylogenetic works I have selected 2 to 4-6 representative sequences of each clade (both morphologically described and environmental) to build a reference alignment (in total 211 sequences), in order to gather the most important and detailed genetic diversity. Such sequences were manually selected due to their high quality and their length, comprising the full 18S rDNA gene (when available). As showed in the reference phylogenetic frameworks, the partial 28S (D1+D2) rDNA gene was also used and concatenated in order to increase phylogenetic signal.

1.1. Align sequences

Sequences were aligned with MAFFT using the L-INS-i algorithm (‘--localpair’) and 1000 iterative refinement cycles for high accuracy:

mafft --localpair --maxiterate 1000 FILE > FILE_aligned

1.2. Check alignment

Alignment was manually check and corrected in SeaView (version 4.6.1) in order to align analogous positions and check for possible missalignments.

1.3. Trimm alignment

Corrected alignment matrix was trimmed with trimal selecting a 30% threshold:

trimal -in FILE_alignedC -out FILE_alignedC_trimmed -gt 0.3

1.4. Phylogenetic assessment

Two (semi-replicate) analysis were used to look for agreement -or not- and resolve possible dubious or low statistically supported groups/patterns using RAxML:

1.4.1 GTR+CAT as substitution model and 1000 bootstraps (BS)

raxmlHPC-PTHREADS-SSE3 -T 8 -m GTRCAT -c 25 -p $RANDOM -x $(date +%s) -d -f a -N 1000 -n FILE_alignedC_trimmed_CAT -s FILE_alignedC_trimmed

1.4.2. GTR+GAMMA with 1000 BS

raxmlHPC-PTHREADS-SSE3 -T 8 -m GTRGAMMA -p $RANDOM -x $(date +%s) -d -f a -N $1000 -n FILE_alignedC_trimmed_GAMMA -s FILE_alignedC_trimmed

1.5. Re-ordering alignment

The resulting alignment was ordered following the phylogenetic tree in order to ease the manual correction of misalignments with the function ‘reorderFastaTree’ from the script ‘fastaFunctions.R’ (recommended to open it in R studio).
Steps 1.2 to 1.5 were repeated until the phylogenetic tree had a consistent topology and in agreement with previous studies (Decelle et al. 2012; Biard et al. 2015; Sandin et al. 2019; 2020). Phylogenetic trees were visualized in FigTree (version 1.4.3). Resulting alignment matrix was set as reference alignment for further steps.

Note: Steps 1.1 to 1.3 were carried out independently for 18S and 28S rDNA genes. Then both genes were concatenated for step 1.4 and further.


Step 2. Creating a reference backbone phylogeny

I continued adding the rest of publicly available and morphologically identified sequences detailed in the reference phylogenetic frameworks to the previous reference alignment (from step 1) using the function ‘--add’ from MAFFT and the FFT-NS-2 algorithm with a high gap opening penalty (‘--op 5’) as follows:

mafft --thread 2 --inputorder --op 5.0 --add FILE2 --6merpair --maxiterate 1000 FILE_alignedC > FILE2_aligned

And repeating steps 1.2 to 1.5 until consistent phylogenetic tree.


Step 3. Download all publicly available environmental sequences

Reference 18S rDNA gene sequences from step 2 were ‘blasted’ against NCBI to retrieve all publicly available environmental sequences of the 18S rDNA gene that belong to Radiolaria, as detailed in ‘Step 6’ of EukRef. Previous versions of PR2 (Guillou et al., 2013) were also blasted against NCBI in order to retrieve sequences not belonging to either Acantharea, Collodaria, Nassellaria or Spumellaria (i.e.; Rad-A, Rad-B, Rad-C or Radiolaria_X). Final dataset was checked in order to remove duplicates and other regions of the rDNA (except for the partial 28S rDNA gene that was left and concatenated, when available, to increase phylogenetic signal).


Step 4. Check for chimeras

Due to the difficulties in chimera detection, I have used two different algorithms implemented in:

4.1. mothur

mothur "#chimera.uchime(fasta=NEW, reference=REFERENCE)"

4.2. and vsearch

vsearch --uchime_ref NEW --db REFERENCE --nonchimeras NEW_noChim --borderline NEW_poChim --chimeras NEW_Chim

Using as REFERENCE sequences the complete PR2 v4.11.0 database and as NEW those sequences extracted from step 3. Sequences detected as chimeras by any of the two methods were automatically removed and no longer considered in downstream analysis.

4.3. Blasting

A third in-house method was also used in order to detect as many chimeric or problematic sequences as possible. Here I blasted against NCBI independently the first and the last 300 bp of each sequence and compared the results. If there were less than 20 exact matches among the first 100 matches, the sequence was considered as maybe chimeric. If there are no exact matches among the first 100 matches, or it matches only with itself, the sequence was detected as chimeric. For further details see the script ‘chimera_blast.R’ (recommended to open it in Rstudio). Sequences detected either as maybe chimeric or chimeric were noted and manually checked in downstream phylogenetic assessments.


Step 5. Annotating sequences

All environmental sequences resulting from step 4 were automatically annotated with vsearch taking as reference sequences those morphologically described:

vsearch --usearch_global ENVIRONMENTAL --db REFERENCE --blast6out OUTPUT --log OUTPUT_log

Step 6. Adding environmental sequences to the backbone phylogeny

Environmental annotated sequences from step 5 were added to the reference alignment created in step 2. These sequences were gradually added according to their ‘reliability’ so it is easier to identify chimeric, dubious* or bad quality sequences:

  1. Firstly, were added those environmental sequences retrieved in previous studies and phylogenetically placed in a reference phylogenetic framework (Decelle et al. 2012; Biard et al. 2015; Sandin et al. 2019; 2020).
  2. Secondly, were added these sequences that do not correspond within Acantharea, Collodaria, Nassellaria or Spumellaria (i.e.; Rad-A, Rad-B, Rad-C and Radiolaria_X).
  3. Thirdly, and lastly, the rest of the sequences retrieved.

*I considered a dubious sequence a sequence that is clearly different (either in pair-wise similarity or phylogenetically) from the others and with a lack of similar sequences. In other words, if a sequence is different from others but has 2 or (ideally) more similar sequences, and preferably from different studies, is not considered as dubious.

For every group of sequences that were added, the following steps were carried out:

6.1. Align sequences against previously created reference alignment

mafft --thread 2 --inputorder --op 5.0 --add FILEX --6merpair --maxiterate 1000 FILE_alignedC > FILEX_aligned

6.2. Trimming alignment

trimal -in FILEX_aligned -out FILEX_aligned_trimmed -gt 0.3

6.3. Phylogenetic assessment

Explore phylogenetic patterns, long branches, branches in unresolved positions, contrasting topology of previously resolved clades, …

6.3.1. RAxML GTR+CAT 100 BS

raxmlHPC-PTHREADS-SSE3 -T 16 -m GTRCAT -c 25 -p $RANDOM -x $(date +%s) -d -f a -N 100 -n FILEX_aligned_trimmed_CAT -s FILEX_aligned_trimmed

6.3.2. RAxML GTR+GAMMA 100 BS

raxmlHPC-PTHREADS-SSE3 -T 16 -m GTRGAMMA -p $RANDOM -x $(date +%s) -d -f a -N $100 -n FILEX_aligned_trimmed_GAMMA -s FILEX_aligned_trimmed

6.4. Reorder fasta

Reordering fasta file with the function ‘reorderFastaTree’ from ‘fastaFunctions.R’, taking as reference tree that of 6.3.1 (or 6.3.2).

6.5. Manual removing of problematic sequences

Manual and final identification of chimeric, dubious or bad quality sequences for its deletion in AliView (version 1.26). Repeat steps 6.3 to 6.5 until consistent phylogeny.

Note: in order to ease and speed step 6, after removing problematic sequences (6.5), I directly run the phylogenetic analysis (6.3), without realigning the raw sequences (6.1) and trimming the alignment (6.2). This is why I added a last confirmatory check:


Step 7. Last confirmatory check

All environmental sequences added in step 6 that passed the stringent criteria were (re-)aligned to the backbone phylogenetic framework from step 2 for a last phylogenetic assessment as described in steps 6.1 to 6.3. And, if needed, step 6.3 was repeated after step 6.4 and manual checking and correction of the alignment in AliView (version 1.26).


Step 8. Final taxonomic annotation and corrections

Final phylogenetic trees were used for the manual curation and correction of the taxonomic annotation of the enviromental sequences according to bootstrap support and phylogenetic relatedness to morphologically described sequences. Annotation was done according to the original description of:

Environmental clades (Rad-A, Rad-B, Rad-C and Radiolaria_X) were left as previously annotated, yet a more detailed taxonomy was applied when clear phylogenetic patterns (bootstrap ≥ 90 and consensus between GTR+GAMMA and GTR+CAT):


Summary and concluding remarks

In total, 4569 publicly available sequences associated to Radiolaria have been taxonomically curated and annotated (files/pr2_Radiolaria_MMS.xlsx). From these, 4556 are longer than 500bp and have less than 20 ambiguities (files/pr2_rads_updated.fasta) and 289 sequences are new regarding older PR2 versions (as of v4.11.0). Main changes are:

Here we built a backbone phylogenetic framework of all the described diversity of Radiolaria to which we have aligned the rest of publicly available environmental sequences, after removing problematic sequences (chimeras, dubious or bad quality). All this gathers fully reliable sequences representing the biggest part of the diversity of Radiolaria, that we can trust and use for phylogenetic reconstruction, phylogenetic placement of short reads or annotation of the still to come full length environmental rDNA sequences from Oxford Nanopore Technologies or Pac-Bio.

Lastly, I would like to mention that taxonomic hierarchy and levels may differ from one reference to another. For example in Cavalier-smith (2017) Polycystinea (gathering Spumellaria, Nassellaria and Collodaria) holds a “superclass” position along with Spasmaria (gathering Acantharea and Sticholonche). On the other side, in Adl et al. (2019) Polycystinea holds a “class” position along with Acantharea and Taxopodida. You may therefore select specific taxa according to the scope of your study bearing in mind these differences in taxonomic hierarchy and levels.


References