Hello! I have RNAseq data from samples containing 2 to 3 bacterial strains each. For the analysis I performed FastQC, trimmomatic to remove adapters and then Bowtie2 to align the reads to the reference genomes. I have one bowtie2 output file per sample with the paired ends and another one with the singles. I want to make one output file per reference genome with featurecounts for the paired ends on one side and the singles on the other.
For the paired end I run this command : featureCounts -T 24 -p -M -g Parent -a $x -o ${Output}/${genome}.txt ${Samples}/${genome}_paired
In output I get this:
I don't understand why in the running featurecounts part assign them in single end mode ?
when I look at one of the paired end input files I get this:
97320876 + 0 in total (QC-passed reads + QC-failed reads)
97320876 + 0 primary
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
591739 + 0 mapped (0.61% : N/A)
591739 + 0 primary mapped (0.61% : N/A)
97320876 + 0 paired in sequencing
48660438 + 0 read1
48660438 + 0 read2
499606 + 0 properly paired (0.51% : N/A)
518434 + 0 with itself and mate mapped
73305 + 0 singletons (0.08% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
You need to include
along with
to count the read pairs.
Thanks a lot for the answer, from now on featurecounts takes well into account that they are paired reads. On the other hand, the output results give me very high values of : Unassigned_Unmapped and Unassigned_NoFeatures. For example :
As I work with samples that contain several bacteria this does not really surprise me for the other reference annotation genomes as I have much less RNA corresponding to these strains in my samples. However, for this strain (which takes almost 70% of my samples) I also get low results which worries me more.
for the strain that represents about 70% of my samples I have :
and for another strain that represents like 1% I have :
I'm doing this script :
featureCounts -T 24 -p --countReadPairs -s 2 -g Parent -a $x -o ${Output}/${genome}.txt ${Samples}/${genome}_paired
I took the reference genome and annotation file from the same webpage on NCBI website, did the alignement with Bowtie2, converted sam to bam with samtools view and sorted the samples by read name with samtools sort.
If you are working with a mix of genomes then it is possible that your strain is only a small fraction of the total data.
You need to address what @swbarnes2 pointed out below.
featureCounts
only works with data that is aligned in first place. And if you have really poor alignment % then that needs to be addressed first.For the bacteria representing ~70% (i'm talking to 70% because I made qPCR on the samples) of my sample I have this :
The result of 591739 + 0 mapped (0.61% : N/A) is from a bacteria representing like 1% of the sample
I thought that since this bacteria mapped well (95 to 98% depending on the sample) I would have better results for the featurecounts output with it, more than 1 to 8% succesfully assigned alignements.
If you are trying to make a bam file of the reads aligned to a single reference, don't do it like that. Subset the bam you have to just the reads aligning where you want them. Though since there might be some similar regions between all your bacteria, this pipeline might not be appropriate for what you ae doing. Neither Bowtie nor FeatureCounts is very smart about reads which might align to multiple places.
Have you looked at the alignments in IGV or a similar browser? If your reads are multi-mapping then those are going to be not counted by default. You may also have a lot of rRNA in your data so that would be something else to check on.
I haven't looked at the alignment on IGV, I'm very new to it, but I'll try! In Bowtie2 output for this strain which is the most present in my samples I have from 1 to 7% of reads that align more than once.
At the sequencing platform there was depletion of rRNA so I assumed it was ok on that side but I will check. Do you check this with IGV too?
Most the data appears to align one time so that looks reasonable. Do you get more assignments if you try
-s 0
(non-stranded) option? Are you certain your libraries are "reverse stranded"?I get a little more successfully assigned alignements with the -s 0 option. I'm not sure, I sent my samples to Illumina NovaSeq 6000 sequencing. I will ask the sequencing platform to be sure. I don't understand why I have a good percentage of alignement for this strain (between 80 and 95%) but a low percentage of successfully assigned alignements (between 2 and 9%).