Hi all,
I have RNA-seq PE fastq reads for control and infected (at different time points) samples. The host species is lacking a reference genome. Thus, I will be doing a de novo transcriptome assembly and later differential gene expression analysis.
My questions:
- how to get the host only reads to be used for generating the transcriptome with trinity?
- how to specify the percentage of parasite reads in my samples?
In my initial trials, I downloaded and indexed the parasite genome and aligned my preprocessed reads (corrected, trimmed) to the index using HISAT2 but I noticed that the alignment rate is not logic (50-75%) given that the parasite used to infect the samples is very minute (in number and size compared to the host size).
Here are parts of the codes I used: - indexing after obtaining the splicing sites and list of exons files I ran the following:
hisat2-build -p 20 --ss $DATA_DIR/GCF_000699445.3_UoM_Shae.V3_genomic.ss --exon $DATA_DIR/GCF_000699445.3_UoM_Shae.V3_genomic.exon $DATA_DIR/GCF_000699445.3_UoM_Shae.V3_genomic.fna $OUT_DIR/GCF_000699445.3_UoM_Shae.V3_genomic.index
alignment
hisat2 -x $INDEX/GCF_000699445.3_UoM_Shae.V3_genomic.index --threads 20 --rna-strandness RF \
-1 "$DATA_DIR/${i}R1.fq.gz" -2 "$DATA_DIR/${i}R2.fq.gz" \
--bowtie2-dp --phred33 --very-sensitive \
-k 10 --end-to-end \
-S "$OUT_DIR/${i}.sam"
I was planing to use samtools to get the mapped (parasite) and unmapped reads (host) but I got confused with the high percentage of alignment. I appreciate also any suggestion for samtools flags to be used for the separation of host and parasite reads.
Thanks
Looks like the parasite is an eukaryote but what about the host? Unless you are working with some exotic species there is bound to be some data (may be for a related species if not the same) in the databases. What kind of prep was used for the library prep (poly-A capture, ribodepletion?).
I can expect some similarity but not that high!. Both are eukaryotic.The host is a snail mollusk and the parasite is a trematode. Poly-A capture was used for library prep.
You could try to change alignment parameters as suggested by @liorglic but that may not fix this issue instantly. The end result may likely stay similar. It is possible that for whatever reason the RNA from the parasite made it into libraries preferentially and that is what you have. Since the libraries are already made that can't be changed short of making another library.
Since these are short reads you could still be getting crossmapping of host RNA with the parasite so this analysis is going to be tricky.
Thanks @GenoMax for your comments. I also tried bbduk filtering using the parasite fasta file against the PE reads.
I am not sure if using bbduk is a vaild approach bu I got percentage of parasite reads in my samples that seems logic (~0.4-0.5%) but the problem is that there was no difference between controls and treatments samples. I have the same pattern in the replicates (n=7) of control and different time points of infections.
bbduk.sh
is not appropriate to separate out reads like this. You should usebbsplit.sh
for this application (use genome for the host that is close to the one you are using + genome of parasite) or evenbbmap.sh
with just parasite genome.There is a genome for a very similar host species that I can use but is this reliable? Do you have suggestions for the parameters to be used with bbsplit.sh with paired end reads?
It is better than having no reference at all. The similar it is the better the result will be.
Use the default parameters for
bbsplit.sh
to begin with. Pay attention to theambig2=
parameter since it decides what happens with reads that align to both genomes. You may want to keep reads that align to both genomes since you can't be sure which one they come from. If you choose to "toss" them then you may lose data. Userefstats=<file>
option and write results to a file and then post those stats here.Here is some info: BBSplit syntax for generating builds for the reference genome and how to call different builds.
Thanks @GenoMax for your patience and time. Is the following approach correct?
1- indexing the two genomes
2- classification of PE reads
Based on basename=out_%.fq I should get two files for host_reads.fq and parasite_reads.fq then I use reformat.sh to obtain r1 and r2, right? what about the clean reads? what should I do with reads that does not map to either genomes? it could be reads related to differential expression!
for the ambiguous reads, I want to keep those that map to each reference as shared reads added to the host genome. is ambig2=all the right option for that? ambig2=all Write a copy to the output for each reference to which it maps.
You don't need to pre-index the genomes unless you want to run this multiple times. Try the following. We don't want too many reads going into the "clean" files (that name was in ref to something else, you could call them "uninteresting", since they should not be from either of your genomes). Those could be assembled separately to see what you get later.
Set
maxindel=
to a number that you know the average intron size is for your organism. For eukaryotes it could be as large as 100k(b). You may need to specify memory explicitly by-XmxNNg
in case you run out of memory. You may need 30-60GB of RAM depending on size of your data and genome sizes.You can add
.gz
extension to all fastq files to keep files compressed. BBMap will be fine with that.Hi GenoMax I am still running the script:
I adjusted the maxindel=1900 which is the average intron length for both the snails (1.3-2 kb) and parasite (1.9kb).
Almost half of the sample are done now. Attached is a photo for sample stats file for one of the control replicates and one of the infected (6 hr post infection) replicates
.
I am still waiting for the all samples to be finished. But based on these stats it looks like even the control (not infected with the parasite) has some parasite reads. I assume that these may be shared reads with other contaminants within the control samples.
Now, how to move forward with these results?
For de novo transcriptome of host reads only should I merge the snail reads with the clean.fq.gz files (which does not map to either the snail or the parasite) because it may represent reads specific to my snail species and not shared with the closely related species that I used in the alignment!
How to calculate the % of parasite reads in my samples?
At least now this result is more in line with what your expectation was (more host and much less parasite). Looks like you will get plenty of reads here from all of your samples to pool and do a de novo assembly using
trinity
for the snail? The percentage of ambiguous reads is small enough that you probably can set them aside for now.I would suggest merging "clean" reads and assembling them separately to see what you get. At least in round 1. Looks like you are getting at least 20-25% in this category? Are snails/parasites cultivated aseptically, if not these may simply be contaminants that you can't use.
You seem to have much less of parasite reads (2500 odd compared to 30-40 M for the host). If you consider the "clean" reads to be also host then the % will drop even further.
Thanks GenoMax . The snails (live in freshwater) and parasites (develop in snails and rodents) are subject for contamination from water and during processing (infection and later on RNA extraction!).
These are also included in each of the snail and parasite reads, right? because I want to keep them in the snail reads.
But I am also concerned that there might be some reads specific to the species!
what is the purpose of assembling the clean reads separately?
to get the percentage of the parasite reads within each sample should it be:
N=number of unambiguous parasite reads / number of assigned snail reads + number of clean snail reads (from clean2_${i}.fq.gz)
Ah yes, you did
ambig2=all
so those reads would be in both bins. That number is small (~2500) compared to rest of the data so that should not cause big issues.Those reads were neither snail or parasite. While you hope there are reads specific for you species in there based on where the snails live (and how they were collected) there is a definite chance that the reads may be from contaminants. So keeping them separate, assembling and the blasting a few will give you an idea of what may be in there. Depending on what you find those may ultimately be non-usable or ambiguous with respect to the snail or the parasite. You may need to do additional experiments to prove that they came from parasite or snail.