Hi all,
Through the RNA-seq analysis workflow using Linux,
Trimmomatic generates 4 out-put files; forward-paired.fq.gz, reverse-paired.fq.gz, and the 2 unpaired files.
As I read in several threads, Trimomatic is expected to;
- Remove the adapters and the low-quality reads.
- generates 2 paired-fq.gz files with the same read number.
FastQC reports for both paired.files (for each sample) indicate good quality, but I received different read counts for the two (paired) files. Downstream analysis might be affected, especially feature counts (This issue is likely behind the error; Paired-end reads are included, and The reads are assigned on the single-end mode, when I run featureCounts).
For example;
sample1-forward.paired-fq.gz (wc -l ) = 4457636
sample2-reverse.paired.fq.gz (wc -l) = 4580400
I used the command line below as explained in Trimmomaic manual;
java -jar $EBROOTTRIMMOMATIC/trimmomatic-0.39.jar PE samplef1.fastq.gz sampler1.fastq.gz samplef1.paired.fq.gz samplef1_unpaired.fq.gz sampler1_paired.fq.gz sampler1_unpaired.fq.gz ILLUMINACLIP:NexteraPE-PE.fa:2:30:10:2:True LEADING:3 TRAILING:3 MINLEN:36
So, what could be the extra reads? could you please suggest any solution to fix this issue?
Thank you in advance.
Thank you so much ATpoint, after decompressing the files, they turned out to be the correct match.
So if the read counts for both files are correct, what do you suggest the reason behind the error; Paired-end reads are included, and The reads are assigned on the single-end mode when I run featureCounts (subread). I used HISAT2 for mapping, samtools -n for sorting the converted bam files by name.
what error? and what software produced the error and what is the command line that you used for running is?
Please find below the details of the question you asked; thanks JV;
RNA-SEQ reads were mapped to a reference genome using HISAT2
Sam files were converted to bam and sorted (by name) using samtools sort -n function
When I run featurCounts using the bam-sorted.bam files as input, featurCounts (subread) using the command line;
featureCounts -p -B -C -t gene -g gene_id -a referenc-genome.gtf -o count.txt treatemnt1.sorted.bam treatemnt2.sorted.bam treatemnt3.sorted.bam treatemnt4.sorted.bam treatment5.sorted.bam control1.sorted.bam control2.sorted.bam control3.sorted.bam control4.sorted.bam control5.sorted.bam
featureCounts (Subread) showed the message below;
Paired-end reads are included, and The reads are assigned on the single-end mode
Is it a common message or this indicates that featureCounts is treating the paired-end reads as if they were single-end reads?
should not sorting bam files by name be more accurate as input for futureCounts?.Sorting by position/ by name resulted > same output (feature.counts.table).
Thank you
With
featureCounts -p
option you now also need to include the following.Thank you so much @GenoMax,
I am dealing with bacterial RNA-seq data, should I include the flag;
(-O) assigns reads to all their overlapping meta-features, as recommended in some discussions?
When I included it in featureCounts, it increased the % of the successfully assigned alignments.