Question

Error in raw count file of .sam after alignment on GRCh38

0

Entering edit mode

7.7 years ago

Mirza.Jawad • 0

Hello! I downloaded the indexes of GRCh38 human genome from ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/ and aligned my pair ended reads on it. after getting .bam, i converted it into .sam by using

**"samtools view -h -o output.sam input.bam"**

then i run

**"htseq-count output.sam Homo_sapiens.GRCh37.75.gtf "**

then I sorted the .sam file by using command

**"samtools sort -O sam -T 13.sam -o 13.sort.sam 13.sam"**

but it results like

...........................................................................................................................................................
Warning: Read SRR993713.6118094 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
Warning: Read SRR993713.122141 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
Warning: Read SRR993713.10023628 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
...........................................................................................................................................................

and then it shows the result with ensemble gene ids with 0 counts. please tell me whats wrong with it. Am i using the wrong gtf file. or the way of sorting my file is wrong?

RNA-Seq Assembly alignment sequencing • 2.3k views

ADD COMMENT • link updated 3.8 years ago by Biostar 20 • written 7.7 years ago by Mirza.Jawad • 0

2

Entering edit mode

.bam, i converted it into .sam

why ?

Haha, remember when you wrote SAM files to disk???

No neither do Ihttps://t.co/SDLiogrxvd pic.twitter.com/Q86rw2gwnp
— Mick Watson (@BioMickWatson) July 11, 2016

ADD REPLY • link 7.7 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

I learnt the pipeline from a person he said that u need to align then i have to get .sam formate but tophat gave me result in bam formate then i thought i should convert it into sam formate. what is the desired output formate of tophat tp get raw counts?

ADD REPLY • link 7.7 years ago by Mirza.Jawad • 0

0

Entering edit mode

You can use featureCounts to get a count matrix directly from your BAM files.

ADD REPLY • link 7.7 years ago by GenoMax 148k

0

Entering edit mode

Its not working at all :(

ADD REPLY • link 7.7 years ago by Mirza.Jawad • 0

0

Entering edit mode

It is not working. I spent 3 hours on it :(

ADD REPLY • link 7.7 years ago by Mirza.Jawad • 0

0

Entering edit mode

What is not working?

ADD REPLY • link 7.7 years ago by GenoMax 148k

2

Entering edit mode

Am i using the wrong gtf file.

Yes you are. As @Ram pointed out, why are you mixing genome builds. This is a surefire way of getting erroneous results. Consider yourself fortunate that you got 0 counts, otherwise you may have continued on your merry way with completely incorrect data.

ADD REPLY • link 7.7 years ago by GenoMax 148k

0

Entering edit mode

then whats the solution ?

ADD REPLY • link 7.7 years ago by Mirza.Jawad • 0

1

Entering edit mode

Use a GTF file for GRCh38 (since you used indexes for that build). Did you use Ensembl? The GTF files for that build can be found here.

ADD REPLY • link 7.7 years ago by GenoMax 148k

1

Entering edit mode

Why download the GRCh38 indexes and use them with the GRCh37 gtf file? Am I missing something here?

ADD REPLY • link 7.7 years ago by Ram 44k

score 3 · Accepted Answer · 2017-04-18

3

Entering edit mode

7.7 years ago

Matteo Schiavinato ★ 3.6k

When htseq-count shows that error, it means that some reads which were supposed to be paired end didn't have a mate. When you run htseq-count with paired end mode, it wants all the reads to have a mate, so you have to filter keeping only the proper pairs first (correct orientation and within insert size).

samtools view -b -h -f 0x2 input.bam > output.bam

This will account for those warnings about reads without mates.

If your final counts file doesn't have any counts inside, you could be using the wrong version of the GTF file (you have to use the one which has been generated on the assembly that you mapped the reads against). As @genomax2 said, "Use a GTF file for GRCh38 (since you used indexes for that build)".

ADD COMMENT • link 7.7 years ago by Matteo Schiavinato ★ 3.6k

0

Entering edit mode

I am getting sam result after using samtools view -b -h -f 0x2 input.bam > output.bam and after using htseq-count for Homo_sapiens.GRCh38.88.chr.gtf

ADD REPLY • link 7.7 years ago by Mirza.Jawad • 0

0

Entering edit mode

An important indicator of how good your results are is in the lines which have an underscore before ("_"), in the htseq-count output. Try grepping them, you'll see the reads that haven't been assigned. If that number is extremely high, there are still some problems.

ADD REPLY • link 7.7 years ago by Matteo Schiavinato ★ 3.6k