Get every variables sites between two aligned CDS sequences

0

Entering edit mode

6.8 years ago

maxime.policarpo ▴ 200

Hi,

I have a file containing aligned dna sequences for a gene with 3 species :

>species1
ATGCTGATTCGATTGCGATTAGCTAGCTG

>species2
ATGCTGTTTCGATTTTTATTAGCTAGCTG

>species3
ATGCTCCTTCGATTGCGATTAGTTAGCTG

I would like to recover every sites that are different between the species 1 and species 2 (and is it a synonymous or non synonymous mutation) and also know if this position have more likely beed mutated on species 1 or species 2 based on the species 3 sequence (so if 1=A, 2=C and 3=A, then the mutation arised in species 2).

I tried to write a script in python but it doesn’t make differences between synonymous and non synonymous mutations .. I would like to know if something more reliable and commonly used exist ? Here is my script :

seq1=Species1

seq2=Species2

seq3=Species3


  if seq3!="":

     for i in range(0,len(seq1)):

        if seq1[i]!=seq2[i]:
            if seq3[i]==seq1[i]:
                print("Mutation on seq2 : "+seq1[i]+"->"+seq2[i], i)
            elif seq3[i]==seq2[i]:
                print("Mutation on seq1 : "+seq2[i]+"->"+seq1[i], i)
            else :
                print("Non orientable mutation, seq1 : "+seq1[i]+" vs seq2 : "+seq2[i], i)

Thanks for the help,

Maxime Policarpo

alignment mutation synonymous non-synonymous • 4.0k views

ADD COMMENT • link updated 6.8 years ago by finswimmer 16k • written 6.8 years ago by maxime.policarpo ▴ 200

1

Entering edit mode

Hello Maxime,

Whether the mutation is synonymous or non synonymous depends on where the translation starts and where introns are located. Without knowing that it will be impossible to say something about this. Or are the sequences you have cDNA (which might be as they all starts with ATG)?

fin swimmer

ADD REPLY • link 6.8 years ago by finswimmer 16k

0

Entering edit mode

Hi,

Some of my genes are truncated so they don't all starts by an ATG but they are all in phase (by that i mean that the first letter of every sequences if the first base of the first codon) so there should not be any problem concerning reading frame !

Thanks

ADD REPLY • link 6.8 years ago by maxime.policarpo ▴ 200

1

Entering edit mode

Ok,

there should be a way to do this using BioPython. Could you please provide an example of how your output should like for the input example you gave above?

Is it possible that both, species1 on and species2, have different bases compared to species3 on the same position? Could all species have different bases on the same position?

fin swimmer

ADD REPLY • link 6.8 years ago by finswimmer 16k

0

Entering edit mode

Yes it is possible that the three species have a different base. In this case, i consider that this mutation is not orientable.

I think i can deal with every outputs you suggest. In my script i have an output that tells me the mutation and on which species it occured (A->T on species 2) but it dosn't really matter as long as i have the informations i want (synonymous or non synonymous + in which species)

Maxime

ADD REPLY • link 6.8 years ago by maxime.policarpo ▴ 200

1

Entering edit mode

One more question. If it's important to know whether it is a synonymous change or not, wouldn't it be better to compare the whole triplet rather the single bases?

ADD REPLY • link 6.8 years ago by finswimmer 16k

0

Entering edit mode

Yeah you are totally right, i aim to also study the impact of those mutations on the protein but for the moment, i am only interested in the number of those mutation rather than their impact on amino acids (in fact, my first aim was to calculate ka/ks on a set of genes, but i study two species that are so close that the task is impossible)

ADD REPLY • link 6.8 years ago by maxime.policarpo ▴ 200

1

Entering edit mode

Hm,

I really think you must think more about if what information you need and how your output should look like.

With BioPython it is easy to check whether a base change will also lead to aminoacid change. But one have to look at all bases of a triplet at once, as you can have more than one change in that triplet. E.g. the Ref is AAA and your species have TTT. If I only check each position without looking at the other I came to a false result (TAA != ATA != AAT != TTT).

fin swimmer

ADD REPLY • link 6.8 years ago by finswimmer 16k

0

Entering edit mode

Yeah i understand ! You are totally right, i think you are pointing the fact that telling if a mutation is synonymous or non-synonymous is very hard in the case of multiples mutations in the same codon ? Actually, i think that it would not be a problem in my datas, as two mutations in the same codon is almost impossible due to the divergence time between my species. (I believe that when calculating ka/ks ratios, a statistic model is used to deal with those problems right ? )

Here is my ideal output :

If bases of species 1 = species 2 --> print nothing If base of species 1 != species 2 --> print the base of species 1 and species 2 and species 3 and tell if it bring a amino acid change or not and if in the codon, multiple mutations arised, then don't print the non-synonymous/synonymous information

Thanks for you dedication to my problem !

ADD REPLY • link 6.8 years ago by maxime.policarpo ▴ 200

0

Entering edit mode

Ok,

I will write something in python later. I'm short in time now.

Are all the sequences in one fasta file and is the id always "species1", "species2" and "species3" as you show in your first post? Or how does your input look like?

BTW: I removed your identical answer post to keep this discussion in the right order.

fin swimmer

ADD REPLY • link 6.8 years ago by finswimmer 16k

0

Entering edit mode

Yeah thanks i didn't know how to delete it ^^. Yes my sequences are all in fasta with the same names !

>species1
ATG....
>species2
ATG...
> species3
ATG...

(And the sequences are already aligned)

Thanks,

Maxime

ADD REPLY • link updated 6.8 years ago by finswimmer 16k • written 6.8 years ago by maxime.policarpo ▴ 200

0

Entering edit mode

You can delete your own post, by clicking on "moderate" below your post.

fin swimmer

ADD REPLY • link 6.8 years ago by finswimmer 16k

2

Entering edit mode

6.8 years ago

finswimmer 16k

Hello maxime,

based on your criteria here is a python script that should work. The BioPython module needs to be installed. Run it like this:

$ python compare.py input.fasta > result.csv

	import sys
	from Bio import SeqIO


	def mismatch(seq1, seq2):
	index = []

	for i, (s1, s2) in enumerate(zip(seq1, seq2)):
	if s1 != s2:
	index.append(i)

	return index


	sequences = SeqIO.index(sys.argv[1], "fasta")
	result = []

	for i in range(0, len(sequences['species1'].seq), 3):
	codon_s1 = sequences['species1'].seq[i:i+3]
	codon_s2 = sequences['species2'].seq[i:i+3]
	codon_s3 = sequences['species3'].seq[i:i+3]

	if codon_s1 != codon_s2:
	effect = "."
	mismatches = mismatch(codon_s1, codon_s2)

	if len(mismatches) == 1:
	if codon_s1.translate() == codon_s3.translate():
	effect = "synonymous"
	else:
	effect = "non-synonymous"

	for m in mismatches:
	result.append((
	i+m+1,
	codon_s1[m],
	codon_s2[m],
	codon_s3[m],
	effect
	))

	print("#POS", "species1", "species2", "species3", "effect", sep="\t")

	for row in result:
	print("\t".join(map(str, row)))

view raw compare.py hosted with ❤ by GitHub

The input is looking like this:

$ cat input.fasta
>species1
ATGCTGATTCGATTGCGATTAGCTAGCTG
>species2
ATGCTGTTTCGATTTTTATTAGCTAGCTG
>species3
ATGCTCCTTCGATTGCGATTAGTTAGCTG

The output like this:

#POS  species1  species2  species3  effect
7     A         T         C         non-synonymous
15    G         T         G         synonymous
16    C         T         C         .
17    G         T         G         .

The position is 1-based. That means the first base in the sequence is also numbered with 1

fin swimmer

ADD COMMENT • link 6.8 years ago by finswimmer 16k

0

Entering edit mode

Thanks a lot for your help ! It is exactly what i was looking for !

ADD REPLY • link 6.8 years ago by maxime.policarpo ▴ 200

1

Entering edit mode

Fine if it helps.

But I'm still unsure whether this is what you need, as I don't really understand the goal of this.

In your example look at the output for position 15. It's only there because species 1 and 2 are different. The effect is calculated based on the base of species 1. As it's the same like in species 3 the "effect" is of course synonymous.

Doesn't it make more sense to print all positions where at least one species has a different base and calculate the effect for the species that had a different base in comparision to species 3?

fin swimmer

ADD REPLY • link 6.8 years ago by finswimmer 16k

0

Entering edit mode

Yeah but i think it is what your script is doing no ? The line " if codon_s1.translate() == codon_s3.translate():" take the species3 as reference to detect the mutation effect if understood well ?

ADD REPLY • link 6.8 years ago by maxime.policarpo ▴ 200

1

Entering edit mode

This line compares the translation of the codons of species1 und species3. But, to stay at line for POS 15, you haven't a base change in species1 and so the effect is of course "synonymous". Wouldn't it be more interesting to see whether there is in effect in species2 in this case?

ADD REPLY • link 6.8 years ago by finswimmer 16k

0

Entering edit mode

Yeah it would be more interesting in fact !

ADD REPLY • link 6.8 years ago by maxime.policarpo ▴ 200

1

Entering edit mode

Then let's redefine your request:

print all sites where the base of species1 and species2 disagree along with the bases on that site for all 3 species
for the species that disagree with the base in species3 check whether this leads to an amino acid change
in case there are multiple mutation within one codon do not check for amino acid change and print just the site along with the bases information

Right?

fin swimmer

ADD REPLY • link 6.8 years ago by finswimmer 16k

0

Entering edit mode

This would be great !

Maxime

ADD REPLY • link 6.8 years ago by maxime.policarpo ▴ 200

2

Entering edit mode

6.8 years ago

finswimmer 16k

Here the new script based on the new criterias:

	import sys
	from Bio import SeqIO


	def mismatch(seq1, seq2):
	index = []

	for i, (s1, s2) in enumerate(zip(seq1, seq2)):
	if s1 != s2:
	index.append(i)

	return index


	def get_effect(seq, pos, ref, mismatches):
	effect = "."

	if len(mismatches) > 1:
	effect = "*"
	elif pos in mismatches:
	if seq.translate() == ref.translate():
	effect = "synonymous"
	else:
	effect = "non-synonymous"

	return effect


	sequences = SeqIO.index(sys.argv[1], "fasta")
	result = []

	for i in range(0, len(sequences['species1'].seq), 3):
	codon_s1 = sequences['species1'].seq[i:i+3]
	codon_s2 = sequences['species2'].seq[i:i+3]
	codon_s3 = sequences['species3'].seq[i:i+3]

	if codon_s1 != codon_s2:
	mismatches_s1 = mismatch(codon_s1, codon_s3)
	mismatches_s2 = mismatch(codon_s2, codon_s3)

	pos = set(mismatches_s1 + mismatches_s2)

	for p in pos:
	effect_s1 = get_effect(codon_s1, p, codon_s3, mismatches_s1)
	effect_s2 = get_effect(codon_s2, p, codon_s3, mismatches_s2)

	result.append((
	i+p+1,
	codon_s3[p],
	codon_s1[p],
	effect_s1,
	codon_s2[p],
	effect_s2
	))

	print("#POS", "species3", "species1", "effect_s1", "species2", "effect_s2", sep="\t")

	for row in result:
	print("\t".join(map(str, row)))

view raw compare_v2.py hosted with ❤ by GitHub

Output now looks like this:

#POS  species3  species1  effect_s1       species2  effect_s2
7     C         A         non-synonymous  T         non-synonymous
15    G         G         .               T         non-synonymous
16    C         C         .               T         *
17    G         G         .               T         *

I've pulled species3 in front as this is the reference.

. means same base as reference and therefor effect wasn't calculated.

* means multiple bases are changed within the codon and therefore no effect was calculated.