Hi all, I have a dount in htseq-count output.
I used samtools to extract uniquely mapped properly paired reads and sorted by name using the command
samtools view -bS -h -q 40 -f 2 -F 512 ${i} | samtools sort -n -o ${i}_sorted.bam -@ 20
I used this output for htseq-count using the coomand
htseq-count -f bam -r name -s no -t exon -i ID ${i} GCF_003121395.1_UOA_WB_1_genomic.gff >${i}_htseqcnt.
eg. output from two files
file 1
__no_feature 1537976
__ambiguous 5157805
__too_low_aQual 0
__not_aligned 0
__alignment_not_unique 0
file2
__no_feature 3546477
__ambiguous 11880801
__too_low_aQual 0
__not_aligned 0
__alignment_not_unique 0
Mygff file is like this
gene 3503760 3523716 . - . ID=gene7;
mRNA 3503760 3523716 . - . ID=rna12;Parent=gene7;
exon 3515344 3515454 . - . ID=id60;Parent=rna12;
CDS ID=cds9;Parent=rna12
CDS ID=cds9;Parent=rna12
CDS ID=cds9;Parent=rna12
mRNA 3503760 3523659 . - . ID=rna13;Parent=gene7;
exon 3523589 3523659 . - . ID=id63;Parent=rna13;
My questions: why are there ambiguous reads when my input is properly paired reads (output from samtools)? My gff file doesn't have a gene_id option. can anyone please advice, where am I wrong?
Thank you for pointing out the mistake in assigning the gene tag in "-i". The output I got is by using "-i ID", if you see my command above.
As per the manual: If the name of the attribute containing the gene ID for exon lines is not gene_id, use the --idattr. Often, its is, for example, Parent, GeneID or ID. Make sure it is the gene ID and not the exon ID.
Sorry I didn't paste my full gff file in the question
NC_037545.1 Gnomon gene 3503760 3523716 . - . ID=gene-NKIRAS1;Dbxref=GeneID:102394338;Name=NKIRAS1;gbkey=Gene;gene=NKIRAS1;gene_biotype=protein_coding
NC_037545.1 Gnomon mRNA 3503760 3523716 . - . ID=rna-XM_006074115.2;Parent=gene-NKIRAS1;Dbxref=GeneID:102394338,Genbank:XM_006074115.2;Name=XM_006074115.2;gbkey=mRNA;gene=NKIRAS1;model_evidence=Supporting evidence includes similarity to: 5 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 152 samples with support for all annotated introns;product=NFKB inhibitor interacting Ras like 1%2C transcript variant X1;transcript_id=XM_006074115.2
NC_037545.1 Gnomon exon 3523593 3523716 . - . ID=exon-XM_006074115.2-1;Parent=rna-XM_006074115.2;Dbxref=GeneID:102394338,Genbank:XM_006074115.2;gbkey=mRNA;gene=NKIRAS1;product=NFKB inhibitor interacting Ras like 1%2C transcript variant X1;transcript_id=XM_006074115.2
NC_037545.1 Gnomon exon 3515344 3515454 . - . ID=exon-XM_006074115.2-2;Parent=rna-XM_006074115.2;Dbxref=GeneID:102394338,Genbank:XM_006074115.2;gbkey=mRNA;gene=NKIRAS1;product=NFKB inhibitor interacting Ras like 1%2C transcript variant X1;transcript_id=XM_006074115.2
NC_037545.1 Gnomon exon 3509449 3509690 . - . ID=exon-XM_006074115.2-3;Parent=rna-XM_006074115.2;Dbxref=GeneID:102394338,Genbank:XM_006074115.2;gbkey=mRNA;gene=NKIRAS1;product=NFKB inhibitor interacting Ras like 1%2C transcript variant X1;transcript_id=XM_006074115.2
ID: is different for different exon
Parent: is same for exons within mRNA, but different for exons between mRNAs of same gene
gene: is same for all exons of a gene.
Is it correct to use "-i gene" ?? Link to my full gff file is this
Please suggest the correct attribute for "-i". Thank you
from this , yes I would say
-i gene
is the way to go. Does it not work?I tried with -i gene and got this error
100000 GFF lines processed.
200000 GFF lines processed.
300000 GFF lines processed. .
.
.
1600000 GFF lines processed.
Error occured when processing GFF file (line 1688729 of file /DATA/kous/g_index/GCF_003121395.1_UOA_WB_1_genomic.gff): Feature id836346 does not contain a 'gene' attribute [Exception type: ValueError, raised in count.py:77]
Output of
sed -n '1688729p' my_gff
isNC_006295.1 RefSeq exon 377 445 . + . ID=id836346;Parent=rna71515;gbkey=tRNA;product=tRNA-Phe
I found some of the lines (53 lines) with the seqname NC_006295.1 doesnot have a 'gene' attribute. Is there basic linux command to insert gene=unknown to these lines alone in the 9th column, with which I can edit the gff file slightly and use it?
command
grep NC_006295.1 < my_gff | grep -v gene
gives the lines without gene. Is there a way to insert "gene=unknown" to the 9th column for these lines in gff file?Please give me suggestions.
it sounds like you are using a 'faulty' gff file. where did you get that from?
Looking at your extract it shows that this is a tRNA-exon. Most likley you are not interested in those , so you could filter your gff file and remove all tRNA lines from it
I downloaded gff file from ncbi with this link
Yes I am not interested in tRNAs. Should I filter all tRNAs or only the lines without 'gene' ?
yes, go ahead and filter all tRNA lines
Sorry. Please help me how to filter tRNA lines?
As the 3rd column has 3 features: gene, tRNA and exon. gbkey has variation between gene and tRNA. Some lines have gene_biotype=tRNA and some lines have product=tRNA instead.
In some cases like in tRNA-methyltransferase gene, gbkey=mRNA and product=tRNA-methyltransferase.
or
from the top of my head and without knowing your gff file, so do check the output you obtain
Using
grep -v 'tRNA'
removes mRNA of genes like tRNA methyltransferase.Using
awk '$3 !~ /tRNA/'
retains exons of tRNA.Please check if this approach is correct.
write lines without 'gene'
grep -v gene my.gff > no_gene_gff
Check features of no_gene_gff
awk '{print $3}' no_gene_gff | sort | uniq -c
514
509 1
1 annotwriter
1 Bubalus
26169 cDNA_match
2 D_loop
24 exon
1 origin_of_replication
509 region
2 rRNA
22 tRNA
Since the lines without genes do not have mRNA as evident from above is it ok to use,
grep 'gene' my_gff file > gene_gff file
and use this gene_gff file for htseq count?
yes, you're absolutely correct on using the -v tRNA approach.
Hmm, ok , it's worth a try at least indeed. I'm just wondering if you might not miss things that way ... anyway, no way I can tell from here, so go ahead and give it a try I would say
Thank you for your suggestions and patient replies.
grep 'gene' my_gff file > gene_gff file
and using gene_gff file greatly reduced my ambiguous reads. I hope I didn't miss anything, asawk '{print $3}' no_gene_gff
did not report mRNAs.