Hi all!
I have created a consensus sequence from a human reference and a Neanderthal genome (to fill gaps - I know this will drastically reduce the power of my research but this is fine for now), to observe potential changes between Neanderthal and human genes. I did this using BCFTOOLS.
I made sure the human alignment and Neanderthal alignment matched.
bcftools consensus -f human_reffasta_chr[].fa.gz -o consen_chr[].fa chromosome_neanderthal.vcf.gz
I then cut the required exons using SAMTOOLS.
The coordinates of the exons were drawn from https://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi?REQUEST=GV&DATA=92322&BUILDS=ALLBUILDS for each gene.
samtools faidx consen_chr[].fa chr:pos-pos > gene_consen_exon_no.fa
#cat-ting all the exons together for translation
cat ~/Research/*.fa > gene_total_exons.fa
However, when using the ENSEMBL translate tool online, the translations produced had stop codons within the amino acid sequence and/or did not have a methionine present in the beginning of the sequence; eg.
>124936775-124936906_1
VWLNSPIWTHLEKSPRAITIVETASITQSRG*SRTIGTAF*ETQ
Given the similarity between Neanderthals and humans in general, and the additional similarities created by generating a consensus sequence, I didn't expect this at all. This leads me to think that something has gone wrong in one of these steps. I triple-checked the exon coordinates, and they were correct. I ran BLAST, and the sequences had a high level of identity, but not to the degree that would be expected.
I'm not sure what I did wrong, and I would be grateful for some pointers!
Thank you so much! This makes a lot more sense. But if I'm going on the CCDS website (which says it only has the CDS regions) and am getting these results, how do I strictly get the protein coding regions only?
Have a look at this part of the web site. The database indeed contains only CDS. The whole Nucleotide sequence is the coding sequence, and when you translate it, you get the exact AA sequence starting with MANA... below. The black and blue nucleotide sequence parts both indicate exons, the color is just for marking alternating exons. The same is not the case if you extract the exon coordinates from another annotation e.g. in GFF format without removing UTR's or extract annotated CDS.
Does this clarify things?
Not entirely, sorry - and apologies if I'm not really understanding your explanation! If you look on the same page you have coordinates above with the title 'Chromosomal Locations for CCDS [etc]' (see below image). Would these be the CDS sequence? The exon coordinates are often longer than the nucleotide sequence in the image you showed. I used these coordinates from the table for each target gene and got the issues with the results above.
AKA the 45278966-45279184 exon is 218 places/coordinates long, but the corresponding nucleotide sequence in the CDS sequence data is only 54 nucleotides long. How does this compare?
How would you get the exact coordinates for the CDS sequence so I could extract it from a FASTA file?
Thank you so much for all your help!
You can download all the CDS sequences from the genome data directory directly. https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/ for the human RefSeq genome.