Ambiguous reads in htseq-count
1
0
Entering edit mode
5.7 years ago
kousi31 ▴ 100

Hi all, I have a dount in htseq-count output.

I used samtools to extract uniquely mapped properly paired reads and sorted by name using the command

samtools view -bS -h -q 40 -f 2 -F 512 ${i} | samtools sort -n -o ${i}_sorted.bam -@ 20

I used this output for htseq-count using the coomand

htseq-count -f bam -r name -s no -t exon -i ID ${i} GCF_003121395.1_UOA_WB_1_genomic.gff >${i}_htseqcnt.

eg. output from two files

file 1

__no_feature 1537976

__ambiguous 5157805

__too_low_aQual 0

__not_aligned 0

__alignment_not_unique 0

file2

__no_feature 3546477

__ambiguous 11880801

__too_low_aQual 0

__not_aligned 0

__alignment_not_unique 0

Mygff file is like this

gene 3503760 3523716 . - . ID=gene7;

mRNA 3503760 3523716 . - . ID=rna12;Parent=gene7;

exon 3515344 3515454 . - . ID=id60;Parent=rna12;

CDS ID=cds9;Parent=rna12

CDS ID=cds9;Parent=rna12

CDS ID=cds9;Parent=rna12

mRNA 3503760 3523659 . - . ID=rna13;Parent=gene7;

exon 3523589 3523659 . - . ID=id63;Parent=rna13;

My questions: why are there ambiguous reads when my input is properly paired reads (output from samtools)? My gff file doesn't have a gene_id option. can anyone please advice, where am I wrong?

RNA-Seq htseq-count • 4.2k views
ADD COMMENT
0
Entering edit mode
5.7 years ago

ambiguous points here to reads that can uniquely be assigned to a single genic feature, not that the mapping itself was/is ambiguous. This can of instance happen when genes overlap in their UTR region and a read aligned in that region will (depending on the counting mode) often be assigned as ambiguous.

see here for the explanation (and image explaining it)

If your gff file does not have the (default) gene_id you can tell to htseq-count to use a different tag (see in doc above to know how to)

now, looking at your output, it rather is an issue of indeed not providig the correct tag to use for the gene-assignment (try to use for instance 'ID' in your case)

ADD COMMENT
0
Entering edit mode

Thank you for pointing out the mistake in assigning the gene tag in "-i". The output I got is by using "-i ID", if you see my command above.

As per the manual: If the name of the attribute containing the gene ID for exon lines is not gene_id, use the --idattr. Often, its is, for example, Parent, GeneID or ID. Make sure it is the gene ID and not the exon ID.

Sorry I didn't paste my full gff file in the question

NC_037545.1 Gnomon gene 3503760 3523716 . - . ID=gene-NKIRAS1;Dbxref=GeneID:102394338;Name=NKIRAS1;gbkey=Gene;gene=NKIRAS1;gene_biotype=protein_coding

NC_037545.1 Gnomon mRNA 3503760 3523716 . - . ID=rna-XM_006074115.2;Parent=gene-NKIRAS1;Dbxref=GeneID:102394338,Genbank:XM_006074115.2;Name=XM_006074115.2;gbkey=mRNA;gene=NKIRAS1;model_evidence=Supporting evidence includes similarity to: 5 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 152 samples with support for all annotated introns;product=NFKB inhibitor interacting Ras like 1%2C transcript variant X1;transcript_id=XM_006074115.2

NC_037545.1 Gnomon exon 3523593 3523716 . - . ID=exon-XM_006074115.2-1;Parent=rna-XM_006074115.2;Dbxref=GeneID:102394338,Genbank:XM_006074115.2;gbkey=mRNA;gene=NKIRAS1;product=NFKB inhibitor interacting Ras like 1%2C transcript variant X1;transcript_id=XM_006074115.2

NC_037545.1 Gnomon exon 3515344 3515454 . - . ID=exon-XM_006074115.2-2;Parent=rna-XM_006074115.2;Dbxref=GeneID:102394338,Genbank:XM_006074115.2;gbkey=mRNA;gene=NKIRAS1;product=NFKB inhibitor interacting Ras like 1%2C transcript variant X1;transcript_id=XM_006074115.2

NC_037545.1 Gnomon exon 3509449 3509690 . - . ID=exon-XM_006074115.2-3;Parent=rna-XM_006074115.2;Dbxref=GeneID:102394338,Genbank:XM_006074115.2;gbkey=mRNA;gene=NKIRAS1;product=NFKB inhibitor interacting Ras like 1%2C transcript variant X1;transcript_id=XM_006074115.2

ID: is different for different exon

Parent: is same for exons within mRNA, but different for exons between mRNAs of same gene

gene: is same for all exons of a gene.

Is it correct to use "-i gene" ?? Link to my full gff file is this

Please suggest the correct attribute for "-i". Thank you

ADD REPLY
0
Entering edit mode

from this , yes I would say -i gene is the way to go. Does it not work?

ADD REPLY
0
Entering edit mode

I tried with -i gene and got this error

100000 GFF lines processed.

200000 GFF lines processed.

300000 GFF lines processed. .

.

.

1600000 GFF lines processed.

Error occured when processing GFF file (line 1688729 of file /DATA/kous/g_index/GCF_003121395.1_UOA_WB_1_genomic.gff): Feature id836346 does not contain a 'gene' attribute [Exception type: ValueError, raised in count.py:77]

Output of sed -n '1688729p' my_gff is

NC_006295.1 RefSeq exon 377 445 . + . ID=id836346;Parent=rna71515;gbkey=tRNA;product=tRNA-Phe

I found some of the lines (53 lines) with the seqname NC_006295.1 doesnot have a 'gene' attribute. Is there basic linux command to insert gene=unknown to these lines alone in the 9th column, with which I can edit the gff file slightly and use it?

command grep NC_006295.1 < my_gff | grep -v gene gives the lines without gene. Is there a way to insert "gene=unknown" to the 9th column for these lines in gff file?

Please give me suggestions.

ADD REPLY
0
Entering edit mode

it sounds like you are using a 'faulty' gff file. where did you get that from?

Looking at your extract it shows that this is a tRNA-exon. Most likley you are not interested in those , so you could filter your gff file and remove all tRNA lines from it

ADD REPLY
0
Entering edit mode

I downloaded gff file from ncbi with this link

Yes I am not interested in tRNAs. Should I filter all tRNAs or only the lines without 'gene' ?

ADD REPLY
0
Entering edit mode

yes, go ahead and filter all tRNA lines

ADD REPLY
0
Entering edit mode

Sorry. Please help me how to filter tRNA lines?

As the 3rd column has 3 features: gene, tRNA and exon. gbkey has variation between gene and tRNA. Some lines have gene_biotype=tRNA and some lines have product=tRNA instead.

In some cases like in tRNA-methyltransferase gene, gbkey=mRNA and product=tRNA-methyltransferase.

ADD REPLY
1
Entering edit mode
grep -v 'tRNA' <your gff file> > new_gff file

or

awk '$3 !~ /tRNA/' <your gff file> > new_gff file

from the top of my head and without knowing your gff file, so do check the output you obtain

ADD REPLY
0
Entering edit mode

Using grep -v 'tRNA' removes mRNA of genes like tRNA methyltransferase.
Using awk '$3 !~ /tRNA/' retains exons of tRNA.

Please check if this approach is correct.

write lines without 'gene'

grep -v gene my.gff > no_gene_gff

Check features of no_gene_gff

awk '{print $3}' no_gene_gff | sort | uniq -c

514
509 1
1 annotwriter
1 Bubalus
26169 cDNA_match
2 D_loop
24 exon
1 origin_of_replication
509 region
2 rRNA
22 tRNA

Since the lines without genes do not have mRNA as evident from above is it ok to use,

grep 'gene' my_gff file > gene_gff file

and use this gene_gff file for htseq count?

ADD REPLY
0
Entering edit mode

yes, you're absolutely correct on using the -v tRNA approach.

Hmm, ok , it's worth a try at least indeed. I'm just wondering if you might not miss things that way ... anyway, no way I can tell from here, so go ahead and give it a try I would say

ADD REPLY
0
Entering edit mode

Thank you for your suggestions and patient replies.

grep 'gene' my_gff file > gene_gff file and using gene_gff file greatly reduced my ambiguous reads. I hope I didn't miss anything, as awk '{print $3}' no_gene_gff did not report mRNAs.

ADD REPLY

Login before adding your answer.

Traffic: 1623 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6