Hi all,
I'd like to generate gene count files for downstream DESeq2 analysis and hierarchical clustering from my RNA-seq data but I'm not sure how to go about it with what I have already done.
I have 100bp PE RNA-seq data generated with the TruSeq Stranded mRNA Library Prep Kit. Because there were many overlapping reads in the paired data (median insert size was 200bp), we used PEAR to merge PE read overlaps into a SE fastq file. Reads which were not overlapping were retained in two fastq files, R1 and R2 for PE mapping. The three files were then mapped using STAR. It should be noted that PE files were very small compared to SE files.
For PEAR generated SE:
STAR \
--genomeDir /Resource/ \
--readFilesIn ../wz745.assembled.fastq \
--outFileNamePrefix ./wz745.assembled.fastq \
--outTmpDir ./wz745.assembled.fastq/TMP \
--readFilesCommand cat \
--runThreadN 3 \
--outSAMtype BAM SortedByCoordinate
For PEAR PE:
STAR \
--genomeDir /Resource/ \
--readFilesIn ../wz745.unassembled.forward.fastq ../wz745.unassembled.reverse.fastq \
--outFileNamePrefix ./wz745.unassembled.forward.fastq \
--outTmpDir ./wz745.unassembled.forward.fastq/TMP \
--readFilesCommand cat \
--runThreadN 3 \
--outSAMtype BAM SortedByCoordinate
The resulting bam files were merged using samtools 'merge' into a single bam.
Next, I wanted to use htseq-count to generate .tsv count files for each gene. I've done this but one file failed (of 53) to produce output and gave a mate record missing warning Warning: Mate records missing for 1 records; first such record:
This made me realize htseq-count probably doesn't support both paired-end and single-end data in the same file. So I separated the PE and SE reads as suggested previously in another thread. Error at the end of HT-seq Analysis and output.txt with 0 bytes in size
I then ran htseq-count for each individual file:
SE:
htseq-count \
--format bam \
--order pos \
--mode intersection-strict \
--stranded yes \
--minaqual 1 \
--type exon \
--idattr gene_id \
wz745.se.bam \
/Resource/Homo_sapiens.GRCh37.75.gtf > wz745.se.yes.tsv
PE:
htseq-count \
--format bam \
--order pos \
--mode intersection-strict \
--stranded reverse \
--minaqual 1 \
--type exon \
--idattr gene_id \
wz745.pe.bam \
/Resource/Homo_sapiens.GRCh37.75.gtf > wz745.pe.tsv
What I don't understand is when I run --stranded reverse
or --stranded yes
on the singular PE file I again get the mate record missing warning and a file of mostly 0 counts. This might be a problem with the PEAR output or a legitimate count since the input files are very small.
If I run --stranded yes
on the SE file, I get a file with very few genes with counts above 0. But if I run the --stranded reverse
on the SE file I get almost the same output as when I run --stranded reverse
on the merged file with SE and PE together. I'm not sure which flags I should be using or if it is even necessary to split out the files into SE and PE from PEAR input format. I was naively thinking individual files SE --stranded yes
and PE --stranded reverse
should give me all the counts per gene when added up, but if I do that, I have almost no counts. I'm definitely doing something wrong.
Any advice is appreciated
Tesa
Hi h.mon,
Thanks for your input. I will take a look at the STAR GeneCounts methodology. I will go back to the original fastqs and re-align with STAR without merging overlapping reads to avoid introducing bias.
Cheers,
Tesa