Hi.
I'm analyzing RNA-seq data.
I did a mapping of RNA-seq data to the reference sequence (hg19) and got a bam file. To confirmed mapped reads, I saw the bam file by using IGV. It seems that there are many green reads in IGV (about 20-30% of mapped reads were green ).
I know that the green reads indicate tandem duplications or translocations. However, I don't understand why there were many green reads in my data set.
I'm worrying that this might be caused by my wrong operation for RNA-seq data. Could you give me advice about it?
My pipeline of RNA-seq was as below. (1) Trimmomatic exclude adapter sequences and low-quality bases from my fastq files. (2) Tophat2 mapped my reads to the reference sequence (hg19).
Thank you for your comment.
As you said, tophat 2 is an old method. So, I tried to apply HISAT2 into same fastq file. As a result, the bam file indicated similar result that there were many green reads.
To be honest, I've found some projects where a mix of TopHat2 and STAR alignments can be helpful, but I haven't yet found a project where HISAT2 had a unique advantage (and it seemed relatively similar to STAR, but without the additional files, like for splice junctions or gene fusion finding).
Perhaps more important to that example, I believe the problem with that particular study is the likelihood of encountering reads within the simulated range for T2 or T3. For example, you may find TopHat2 is less likely to align reads from another species, which you may find useful in certain circumstances.
In other words, I agree that trying an alternative aligner can be helpful for understanding your dataset better. However, I don't agree with the rationale that TopHat2 should be discredited because it is an "old" method (although I also see that "There is nothing wrong about mapping RNA-seq reads using TopHat 2" was part of the response). I have some various notes collected in this blog post, and I believe this is the Biostars thread where that particular benchmark paper has been previously discussed.
Also, I'm not sure about the answer to the original question (about the green reads). Maybe uploading an image can help get answers?
P.S. I believe I did recently encounter a possible advantage to using HISAT2: TopHat2 and STAR have similar performance on the clusters, when I allocate 4 cores and 8 GB of RAM, but I noticed my local computers had greater difficulty running STAR (although they could run HISAT2).
So, some people may benefit from the lower memory requirements of HISAT2 (versus STAR), and there may be other possible advantages as well (for certain projects).
I'm sorry for my late reply.
I've uploaded a part of image when my bam file was shown by IGV.
I feel that there are many green reads in the bam file.
If you notice something bout the image, please tell me.
Hi - I edited your response to show the alignment in the plot (I had to right-click on the image to open just the image in another window, and then provide that slightly different hyperlink.
I'm not sure how big a deal the green reads are, but it looks like the proportion of green reads is less than 50% (even though I agree there is still a non-trivial proportion).
However, I do see that the green color is for tandem duplications (when coloring by pair duplication), as described in this section of the IGV manual.
To be honest, I typically use single-end rather than paired-end reads for RNA-Seq analysis. So, I don't know how common this is for paired-end reads. It looks like the position of is usually consistent across the gene. If this is due to the fragment size being less than the read length, I am guessing that is why you discussed checking the impact of read trimming.
I have no idea if this will help, but how many reads are getting trimmed? What if you quality filter reads instead of trimming them? If the green pairs have lower quality scores, perhaps that at least helps you identify a feature that allows you to decrease the frequency of green pairs/reads.
Or, do you know what barcode was used for your samples? Is is possible these could be primer dimers (or if your barcode itself was a tandem repeat, like CAGCAG), or something that causes there to be more than one barcode sequence? The left and right side (outside the gene annotation) seem to have a higher proportion of green reads. What if you use
samtools view
to create a very small .sam file of reads within that region? Do they tend to be soft-clipped? Can you see the size and composition of the duplicated sequence (in those limited number of reads)?I'm also not sure if seeing the same sequence from a small fragment on your forward and reverse reads affects IGV coloring (interpreting the small fragment as a tandem repeat). I didn't think that was the case, but perhaps you can use RSeQC rna-fragment-size-py to get an idea of the range of fragment size in MYC?
Or, what if you use PEAR to merge the reads, and then get an idea of a) how many reads overlap and b) what is the insert size among the fragments that overlap (using FastQC on the merged reads). If 10-20% of your reads have genomic sequence that is smaller than the read length, that makes having the paired-end reads less effective, but I am just trying to think of all the possible ways to try understand your data better.
Thank you for your editing.
About the trimming and mapping, my commands are as below.
java -Xmx4g -jar trimmomatic-0.38.jar PE -threads 1 -phred33 -trimlog ${id}.trimlog \ ${id}_R1.fastq.gz ${id}_R2.fastq.gz \ ${id}_R1.paired.fastq.gz ${id}_R1.unpaired.fastq.gz \ ${id}_R2.paired.fastq.gz ${id}_R2.unpaired.fastq.gz \ LEADING:30 TRAILING:30 MINLEN:60 \ ILLUMINACLIP:adapters.fa:2:30:10
tophat -p 1 -G ${db_path}/Annotation/Archives/archive-2015-07-17-14-32-32/Genes/genes.gtf \ -o ${tophat_dir}/${id} ${db_path}/Sequence/Bowtie2Index/genome \ ${id}_R1.paired.fastq.gz ${id}_R2.paired.fastq.gz
If you noticed something, could you give me advices.
I don't know my barcodes. As above, I removed adopter sequence by using Trimmomatic. However, I didn't remove poly-A (or poly-T) sequence. Should I removed these sequences?
First, I think it may be good to look at the actual sequence for some of those green reads. If you look at slightly off-target sequence (zooming in using IGV), I think you can find an interval with mostly green reads. This is where I am recommending using
samtools view chr:start-stop
, as described in the samtools manual.Second, in terms of the trimming, I was more interested in two metrics:
1) The number / percent of sequences trimmed (and the amount trimmed)
Even if you don't know your adapter sequence, you may be able to use FastQC to guess your adapter sequence. For example, I could figure that out for some of my own samples using FastQC.
For example, this library had a smaller fragment size (with Nextera adapter):
https://github.com/cwarden45/Bastu_Cat_Genome/blob/master/Basepaws_Notes/Read_QC/README.md
compared to this other library (also with a Nextera adapter):
https://github.com/cwarden45/DTC_Scripts/blob/master/Nebula/README.md (please scroll towards bottom)
FastQC should also be able to give you a sense of whether you had a lot of polyA/polyT reads (in terms of over-represented sequences).
2) Your insert size (whether there is considerable overlap between your paired-end reads).
To be honest, I'm not sure what is causing the green reads. However, if your fragment size is often less than your read size, there is not as much of an advantage to using the paired-end reads. So, I'm not sure if this is crucial, but could you use PEAR to merge the paired-end reads. In a sense, not being able to merge the reads could be good, if your forward and reverse reads don't overlap. However, if they overlap more than you expect, this might be a way to find that out.
I don't think removing polyA should make a huge difference (I think polyA reads are less likely to be aligned via TopHat), but there is something that is causing those reads to be colored green in your alignments.
Thank you for your advices!
I'll carefully check sequence data from my bam files.