Consensus sequences producing stop codons when translated
1
0
Entering edit mode
9 months ago
hemr3 ▴ 10

Hi all!

I have created a consensus sequence from a human reference and a Neanderthal genome (to fill gaps - I know this will drastically reduce the power of my research but this is fine for now), to observe potential changes between Neanderthal and human genes. I did this using BCFTOOLS.

I made sure the human alignment and Neanderthal alignment matched.

bcftools consensus -f human_reffasta_chr[].fa.gz -o consen_chr[].fa chromosome_neanderthal.vcf.gz

I then cut the required exons using SAMTOOLS.

The coordinates of the exons were drawn from https://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi?REQUEST=GV&DATA=92322&BUILDS=ALLBUILDS for each gene.

samtools faidx consen_chr[].fa chr:pos-pos > gene_consen_exon_no.fa

                #cat-ting all the exons together for translation
cat ~/Research/*.fa > gene_total_exons.fa

However, when using the ENSEMBL translate tool online, the translations produced had stop codons within the amino acid sequence and/or did not have a methionine present in the beginning of the sequence; eg.

>124936775-124936906_1
VWLNSPIWTHLEKSPRAITIVETASITQSRG*SRTIGTAF*ETQ

Given the similarity between Neanderthals and humans in general, and the additional similarities created by generating a consensus sequence, I didn't expect this at all. This leads me to think that something has gone wrong in one of these steps. I triple-checked the exon coordinates, and they were correct. I ran BLAST, and the sequences had a high level of identity, but not to the degree that would be expected.

I'm not sure what I did wrong, and I would be grateful for some pointers!

bcftools ENSEMBL gene samtools fasta • 780 views
ADD COMMENT
0
Entering edit mode
9 months ago
Michael 55k

However, when using the ENSEMBL translate tool online, the translations produced had stop codons within the amino acid sequence and/or did not have a methionine present in the beginning of the sequence; eg.

This is to be expected. The exons are not equal to the coding sequence. You need to take into account the UTR's and extract the CDS sequence for any coding change.

I have created a consensus sequence from a human reference and a Neanderthal genome. ... I did this using BCFTOOLS.

Unlike the name, bcftools consensus (they should change the name of that function) doesn't compute a consensus sequence. It is an Alt-genome really, with all variant positions replaced by the ALT allele. I am not sure if I would take that approach. Possibly, using VEP or SnpEff might be better. You would need to convert the Neanderthal coordinates into hg19 coordinates though.

ADD COMMENT
0
Entering edit mode

Thank you so much! This makes a lot more sense. But if I'm going on the CCDS website (which says it only has the CDS regions) and am getting these results, how do I strictly get the protein coding regions only?

ADD REPLY
0
Entering edit mode

Have a look at this part of the web site. The database indeed contains only CDS. The whole Nucleotide sequence is the coding sequence, and when you translate it, you get the exact AA sequence starting with MANA... below. The black and blue nucleotide sequence parts both indicate exons, the color is just for marking alternating exons. The same is not the case if you extract the exon coordinates from another annotation e.g. in GFF format without removing UTR's or extract annotated CDS.

Does this clarify things?

enter image description here

ADD REPLY
0
Entering edit mode

Not entirely, sorry - and apologies if I'm not really understanding your explanation! If you look on the same page you have coordinates above with the title 'Chromosomal Locations for CCDS [etc]' (see below image). Would these be the CDS sequence? The exon coordinates are often longer than the nucleotide sequence in the image you showed. I used these coordinates from the table for each target gene and got the issues with the results above.

enter image description here

AKA the 45278966-45279184 exon is 218 places/coordinates long, but the corresponding nucleotide sequence in the CDS sequence data is only 54 nucleotides long. How does this compare?

enter image description here

How would you get the exact coordinates for the CDS sequence so I could extract it from a FASTA file?

Thank you so much for all your help!

ADD REPLY
1
Entering edit mode

You can download all the CDS sequences from the genome data directory directly. https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/ for the human RefSeq genome.

ADD REPLY

Login before adding your answer.

Traffic: 1976 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6