Why do I have * in my FASTA? g2gtools diploid
3 months ago
jon.klonowski ▴ 210

Hi, I am using a bespoke program (g2gtools extract) that extracts sequences from a FASTA based on a GTF and the features you want (in my case I am extracting all transcripts). My FASTA is an entire genome of a diploid organism customized to include all SNPs and INDELS the organism carries.

When I run this program, it returns an error: "Sequence contains non-DNA character '*' at position 393"

At first, I thought these might denote stop codons, but when I inspected my FASTA there were only 35 of them. On second thought, that doesn't make much sense anyway, because it's a genomic FASTA.

Thanks in advance!


UPDATE TO THE ISSUE

OK, so I found the issue. The patient VCF has overlapping INDELs and, although they are not denoted like this within the VCF, the * appears when I extract a set of them for one patient with bcftools query. I looked at the GATK documentation, and this is how overlapping INDELs are annotated: https://gatk.broadinstitute.org/hc/en-us/articles/360035531912-Spanning-or-overlapping-deletions-allele

My issue now is what to do to prevent overlapping INDELs from being reported in this way... I wonder if there is a way to prevent this annotation...

Original trio VCF - so 2 parents and the patient:

#CHROM  POS      REF     ALT     
chr1    154590147  CCG     C     
chr1    154590148  CG      C    
chr1    154590149  G       *      
chr1    154590149  G       C      

and then, if I just extract the patient:

#CHROM  POS         REF   ALT     GT  
chr1     154590148   CG  C      0|1 
chr1     154590149   G   *      1|0 
chr1     154590149   G   C      0|1 

after extracting the proband genotypes using bcftools query.
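For anyone who wants to list the affected records, here is a minimal Python sketch. It assumes whitespace-separated rows in the column order shown in the tables above (CHROM, POS, REF, ALT); it is not a full VCF parser, and `find_star_records` is a hypothetical helper name, not part of bcftools or g2gtools.

```python
def find_star_records(lines):
    """Return (chrom, pos, ref, alt) for rows whose ALT includes the '*' allele.

    Assumption: whitespace-separated columns in the order CHROM POS REF ALT,
    as in the tables posted above. Header lines starting with '#' are skipped.
    """
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, ref, alt = line.split()[:4]
        # ALT may hold several comma-separated alleles in unsplit records.
        if "*" in alt.split(","):
            hits.append((chrom, int(pos), ref, alt))
    return hits


rows = """\
#CHROM  POS      REF     ALT
chr1    154590147  CCG     C
chr1    154590148  CG      C
chr1    154590149  G       *
chr1    154590149  G       C
""".splitlines()

for hit in find_star_records(rows):
    print(hit)
```

On the four trio rows above, this flags only the chr1:154590149 G>* record.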

g2gtools RNAseq FASTA genome transcriptome • 753 views

"customized to include all SNPs and INDELS the organism carries."

Must be something related to this then.


I am using the same software, g2gtools, for the entire pipeline, so it would be weird if the output of one of its steps were incompatible with another...


Yeah, you're right. g2gtools does introduce the * into the FASTA. I am just not sure what they mean. I am going to try to see which SNPs or INDELs align with the location of the * in the FASTA.


Can you show me what g2gtools commands you are running? I've been using g2gtools a bit in the past few weeks.


Hey dsull! Check out my update to the question, please!

3 months ago

Search for non-ATGC characters in your reference:

grep -v '^>' in.fasta | tr -d 'ATGCatgcNn\n\r'

The output is: *YKY********R**RRRYWMWWYMKYYWYKKMYRRRWR***RYYBRRRRKWRS*RYY*RYWB***YYW*YKYRSKYWYYRYRYKYYMYYMRRRYWMWWYMKYYWYKKMYRRRWRRYYBRRRRKWRSRYSY*RYSYY*RYWYYKWM******MRRSKYWYYRYRYKYRYWYYKWMR****RYWBYY*WMRYRYYYMMRMMR*RWYRYRWYRY**YYRSWYYRSW

But the letters are all IUPAC standard and acceptable in a FASTA: https://www.bioinformatics.org/sms/iupac.html

Would the problem not be the 35 '*' characters, which are not IUPAC standard?
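To separate the IUPAC ambiguity codes from anything else in that output, a tally like the following could help. This is a minimal sketch: `tally_non_iupac` is a hypothetical name, and it assumes that only the sixteen IUPAC nucleotide letters count as valid (gap characters such as `-` or `.` would be reported as non-IUPAC here).

```python
from collections import Counter

# IUPAC nucleotide letters, including ambiguity codes.
# Assumption: gap symbols ('-' or '.') are deliberately NOT treated as valid.
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")


def tally_non_iupac(fasta_lines):
    """Count every sequence character that is not an IUPAC nucleotide letter."""
    counts = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            continue  # skip header lines
        for ch in line.strip().upper():
            if ch not in IUPAC_NUCLEOTIDES:
                counts[ch] += 1
    return counts


example = [">chr_demo", "ACGTRYKM*N", "WS**acgt"]
print(tally_non_iupac(example))
```

For a real file, the same function can be fed an open file handle, for example `tally_non_iupac(open("in.fasta"))`, where `in.fasta` is the placeholder file name used in the command above.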


Does g2gtools support non-ATGC characters?


Yes, but not asterisks, which is weird because it was their tool that introduced them in the first place :/ I am going to try to see which SNPs or INDELs align with the location of the *.
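One way to find where each * sits is sketched below in Python. `star_positions` is a hypothetical helper; it reports 1-based positions in the coordinates of the FASTA it is given. Because the customized genome already contains INDELs, those coordinates can be shifted relative to the reference positions in the VCF, so matching them to variants may need a further mapping step.

```python
def star_positions(fasta_lines, target="*"):
    """Yield (record_name, 1-based position) for every `target` character.

    Positions are relative to the start of each FASTA record, counted in
    the coordinates of this FASTA (not necessarily reference coordinates).
    """
    name, offset = None, 0
    for line in fasta_lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # New record: remember its header text and restart the count.
            name, offset = line[1:].strip(), 0
            continue
        for i, ch in enumerate(line, start=1):
            if ch == target:
                yield name, offset + i
        offset += len(line)


demo = [">chr_demo", "ACG*T", "AC*GT"]
for record, pos in star_positions(demo):
    print(record, pos)
```

On a real genome, pass an open file handle instead of the `demo` list.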


hey Pierre Lindenbaum! Check out my update to the question please! Let me know if you have experience/thoughts.


I found your response to a similar question ("bcftools norm resulting in '*' in alternate allele"), which seems to suggest that I can remove these variants. I need to think through whether I can still do this for a diploid (phased) organism... I am not sure whether these spanning/overlapping INDELs MUST be on the same allele, or whether they are annotated this way even if they are not on the same allele. Wouldn't removing the INDEL with the '*' change the result if the spanning/overlapping INDELs are on separate alleles?
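Before filtering, it may help to check which haplotype each '*' sits on. The sketch below is only an illustration under stated assumptions: rows are biallelic and laid out as CHROM POS REF ALT GT like the patient table above, genotypes are phased with `|`, and any row with REF longer than ALT is treated as a deletion that removes the bases after its first position. The function names are hypothetical.

```python
def parse_rows(lines):
    """Parse whitespace-separated CHROM POS REF ALT GT rows, skipping '#' lines."""
    rows = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, ref, alt, gt = line.split()[:5]
        rows.append({"chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt, "gt": gt})
    return rows


def alt_haplotypes(gt):
    """Indices of phased haplotypes carrying the ALT allele (biallelic rows)."""
    return {i for i, allele in enumerate(gt.split("|")) if allele not in ("0", ".")}


def explain_star_alleles(rows):
    """For each '*' row, list deletions covering it on the same / other haplotype."""
    report = []
    for star in (r for r in rows if r["alt"] == "*"):
        star_haps = alt_haplotypes(star["gt"])
        same, other = [], []
        for d in rows:
            is_deletion = d["alt"] != "*" and len(d["ref"]) > len(d["alt"])
            # Deleted bases run from POS+1 to POS+len(REF)-1.
            covers = (d["chrom"] == star["chrom"]
                      and d["pos"] < star["pos"] <= d["pos"] + len(d["ref"]) - 1)
            if is_deletion and covers:
                (same if alt_haplotypes(d["gt"]) & star_haps else other).append(d["pos"])
        report.append((star["pos"], sorted(star_haps), same, other))
    return report


patient = """\
#CHROM  POS         REF   ALT     GT
chr1     154590148   CG  C      0|1
chr1     154590149   G   *      1|0
chr1     154590149   G   C      0|1
""".splitlines()

for pos, haps, same, other in explain_star_alleles(parse_rows(patient)):
    print(pos, "star on haplotype(s)", haps, "same-hap deletions:", same, "other-hap:", other)
```

Each report entry gives the '*' position, the haplotype indices carrying it, and the start positions of covering deletions on the same and on the other haplotype. On the three patient rows as posted, the '*' is on the first haplotype and the only covering deletion in the extract (at 154590148) is on the second, so it seems worth comparing against the unsplit trio records before removing anything.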
