Question

Low assigned alignments

0

Entering edit mode

3.6 years ago

Simon Ahn ▴ 10

Basecalls performed using CASAVA version v1.8.2 Trimmed reads with fastx_quality_trimmer 0.0.13 with a quality treshhold of 18 and a length of 20 Aligned with Bowtie 2.1.0 and Tophat 2.0.10 using Gencode v19 junctions Samtools 0.1.19-44428cd to make a bam, sort, index Raw counts were generated using htseq_count 0.6.1 using the UCSC HG19 known gene transcripts from Illumina iGenomes edgeR 3.2.4 in R 3.1.0 were used to normalize counts between samples and to make comparisons between groups. Genome_build: hg19 Supplementary_files_format_and_content: Normalized reads in CPM for each detected gene along with differential gene expression log2 (fold change) and p-values for each time point

This is the data processing detail for a paper. I downloaded fastq files from this paper and used STAR alignment and featureCounts to get my read matrix with the reference of Homo_sapiens.GRCh38 from ENSEMBL. However, with featureCounts, only 3% reads were aligned. (This is single-end fastq file so I used -s 0, -s 1, -s 2 option for featureCounts but all results were bad) What went wrong with my pipeline? Is it related to reference genome that I used?

fastq-alignment RNAseq raw-count • 1.6k views

ADD COMMENT • link 3.6 years ago by Simon Ahn ▴ 10

0

Entering edit mode

Whatever version of the genome you are using, you should see comparable mapping rates and assigned rates. But, and here is the point, you must use the same genome build for reference genome and GFF/GTF annotation file. I am guessing this is the most likely explanation of what went wrong. If you are using ENSEMBL, download both, assembly and GFF file from the same source. Check also that both files use the same identifiers for chromosomes ("1" vs. "chr1" is the archetypical source of confusion).

ADD REPLY • link 3.6 years ago by Michael 55k

0

Entering edit mode

Thanks to your comment. But I downloaded same genome build and it worked fine with other dataset.

ADD REPLY • link 3.6 years ago by Simon Ahn ▴ 10

0

Entering edit mode

Possibly, if you could provide all the details, exact SRA id of dataset, complete download paths of assembly and gff (ftp://...), and all parameters used with STAR and featureCounts, that might become reproducible. I could offer to give it a short try tomorrow.

ADD REPLY • link 3.6 years ago by Michael 55k

0

Entering edit mode

I really appreciate your help. Here are some specific details.

fastq downloaded from SRA: SRP050036 using fasterq-dump (only used SRR1657054 for test)
ENSEMBL reference file created with STAR

FASTA: http://ftp.ensembl.org/pub/release-104/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz

GTF: http://ftp.ensembl.org/pub/release-104/gtf/homo_sapiens/Homo_sapiens.GRCh38.104.gtf.gz

STAR --runThreadN 64 --runMode genomeGenerate --genomeDir [output folder name] \ --genomeFastaFiles [Homo_sapiens.GRCh38.dna.toplevel.fa path] \ --sjdbGTFfile [Homo_sapiens.GRCh38.104.gtf path]
STAR aligned (version: 2.7.9a)

STAR --runThreadN 32 --genomeDir [ENSEMBL reference file path] \ --readFilesIn [SRR1657054.fastq path] \ --outSAMtype BAM SortedByCoordinate --quantMode GeneCounts \ --outFileNamePrefix [output file name]
Extract count matrix using featureCounts

featureCounts -T 10 -s 0 -a [Homo_sapiens.GRCh38.104.gtf path]\ -o [output file name] \ [SRR1657054Aligned.sortedByCoord.out.bam path]

(I tried -s 1 and -s 2 also, but didn't get better result)

ADD REPLY • link 3.6 years ago by Simon Ahn ▴ 10

score 1 · Answer 1 · 2021-10-24

From the text copied in your post it appears you are referring to this data set from GEO https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1550110

It sounds like you tried to run each -s option which specifies specifies single-end read strand specificity. You can directly check this with ReSQC infer-experiment.py http://rseqc.sourceforge.net/#infer-experiment-py

use some code like

infer_experiment.py -r /YourDIR/Homo_sapiens.GRCh38.104.bed -i /BAMDIR/YourBAMalignment.bam

however even if you had the reads in wrong direction with the -s 0 (unstranded) you should have gotten 40-50% mapped to features and then with either -s 1 or -s 2 giving a higher or lower % so like the comment above check that your reference genome and annotation file match. Perhaps even try another read count program like HTSeq.