Contig assembly task, errors
6 months ago
samRayne ▴ 20

Hello everyone,

I just completed a bioinformatics aptitude test where the question was quite open-ended, and I am not sure if I answered it correctly. I am trying to correct my mistakes and could use some help from the community to recognise them, as the wording is throwing me off.

Question 1:

The following short DNA sequences have all been derived from the same longer contiguous stretch of sequence (or "contig").

TGCATGATGG   
ATGCGCTGC  
ATGATGGATACCCC

Assemble them into the original contig

My answer:

ATGCGCTGCATGATGGATACCCC (Original Contig) 
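As a rough cross-check of the manual assembly above, here is a minimal greedy overlap sketch in Python. It is not how CAP3 or any real assembler works internally: the function names `overlap` and `greedy_assemble` and the 3-base minimum overlap are assumptions made for this toy example, and only the forward strand is considered.

```python
from itertools import permutations


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_k, best_a, best_b = 0, None, None
        for a, b in permutations(reads, 2):
            k = overlap(a, b, min_len)
            if k > best_k:
                best_k, best_a, best_b = k, a, b
        if best_k == 0:
            break  # nothing left that overlaps; keep separate contigs
        reads.remove(best_a)
        reads.remove(best_b)
        reads.append(best_a + best_b[best_k:])
    return reads


# The three reads from Question 1.
reads = ["TGCATGATGG", "ATGCGCTGC", "ATGATGGATACCCC"]
contigs = greedy_assemble(reads)
print(contigs)  # ['ATGCGCTGCATGATGGATACCCC']
```

The 7-base overlap (ATGATGG) is merged first and the 3-base overlap (TGC) second. Overlaps this short can easily occur by chance in real data, so a real assembler would demand much longer ones.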

Question 2:

These additional sequences have been derived from a larger contig that incorporates the original contig.

GGTCGCTTCGCGGCC
GCGGCCGCTAATCGGGG

Assemble them into the larger sequence.

My answer:

GGTCGCTTCGCGGCCGCTAATCGGGG (result of sequences)

Larger sequence (I was very confused about how to do this part and what the correct method was):

ATGCGCTGCATGATGGATACCCC + GGTCGCTTCGCGGCCGCTAATCGGGG

Answer:

ATGCGCTGCATGATGGATACCCCGGTCGCTTCGCGGCCGCTAATCGGGG

How does this task differ from the first task?

Task 1 involved taking a set of sequences extracted from a contig and assembling them back into that original contig, aligning 3 overlapping sequences to form 1 unique sequence composed of all three.

Task 2 uses a similar approach to build the 2nd contig from the additional sequences. It then involved joining the two contigs in the right order; I recognised that ATG is a start codon and placed the original contig first, on the assumption that the sequence is meant to be translated into protein.

Translate the DNA sequence into a protein sequence

DNA Sequence: ATG-CGC-TGC-ATG-ATG-GAT-ACC-CCG-GTC-GCT-TCG-CGG-CCG-CTA-ATC-GGG-G
Reading frame:  ATG CGC TGC ATG ATG GAT ACC CCG GTC GCT TCG CGG CCG CTA ATC GGG G
Translated Protein Sequence:
Met-Arg-Cys-Met-Met-Asp-Thr-Pro-Val-Ala-Ser-Arg-Pro-Leu-Ile-Gly
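The translation above can be reproduced with a small Python sketch. It assumes the standard genetic code and reading frame 0 (starting at the first ATG); the names `CODON_TABLE` and `translate` are made up for this example. The trailing single G is ignored because it does not form a complete codon.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(dna: str, frame: int = 0) -> str:
    """Translate complete codons only, starting at offset `frame`."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


contig_1 = "ATGCGCTGCATGATGGATACCCC"
contig_2 = "GGTCGCTTCGCGGCCGCTAATCGGGG"
protein = translate(contig_1 + contig_2)
print(protein)  # MRCMMDTPVASRPLIG
```

In one-letter code, MRCMMDTPVASRPLIG corresponds to Met-Arg-Cys-Met-Met-Asp-Thr-Pro-Val-Ala-Ser-Arg-Pro-Leu-Ile-Gly.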

What do you find that is notable?

The start codon "ATG" initiates translation, which continues until a stop codon is encountered. Notably, the sequence contains no in-frame stop codon (TAA, TAG, or TGA), which would mark the end of protein synthesis, and the final G is left over as an incomplete codon. This suggests that the sequence may be part of an open reading frame (ORF) rather than a complete one.
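One way to check the stop-codon observation is to scan all three forward reading frames, since a stop triplet only ends translation when it falls in the frame being read. This is a minimal sketch; `stop_positions` is a hypothetical helper name, and positions are 0-based.

```python
STOPS = {"TAA", "TAG", "TGA"}


def stop_positions(dna: str, frame: int):
    """0-based start positions of stop codons in the given forward frame."""
    return [i for i in range(frame, len(dna) - 2, 3) if dna[i:i + 3] in STOPS]


seq = "ATGCGCTGCATGATGGATACCCC" + "GGTCGCTTCGCGGCCGCTAATCGGGG"
stops_by_frame = {frame: stop_positions(seq, frame) for frame in range(3)}

for frame, positions in stops_by_frame.items():
    print(frame, [(p, seq[p:p + 3]) for p in positions])
# 0 []
# 1 [(10, 'TGA'), (40, 'TAA')]
# 2 []
```

So the triplets TGA and TAA do occur in the sequence, but only in frame 1; the frame that starts at the first ATG (frame 0) has no stop.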

I included the following, though I wasn't sure how relevant it was:

Sequence 1 has a GC content of approximately 56.52% (13 of 23 bases).

Sequence 2 has a higher GC content of approximately 73.08% (19 of 26 bases).

Therefore, the difference in GC content between the two sequences may indicate differences in their biological roles, such as their propensity for gene regulation, protein binding, or other molecular interactions.
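GC percentages on sequences this short shift noticeably if a single base is dropped, so it is worth recomputing them directly from the assembled strings. A minimal sketch, assuming "Sequence 1" and "Sequence 2" are the two assembled contigs; `gc_content` is a made-up helper name.

```python
def gc_content(seq: str) -> float:
    """Percentage of bases in `seq` that are G or C."""
    seq = seq.upper()
    gc = sum(base in "GC" for base in seq)
    return 100.0 * gc / len(seq)


contig_1 = "ATGCGCTGCATGATGGATACCCC"        # assembled in Question 1
contig_2 = "GGTCGCTTCGCGGCCGCTAATCGGGG"     # assembled in Question 2

gc_1 = gc_content(contig_1)
gc_2 = gc_content(contig_2)
print(f"contig 1: {len(contig_1)} bases, GC = {gc_1:.2f}%")
print(f"contig 2: {len(contig_2)} bases, GC = {gc_2:.2f}%")
```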

Conservation of Amino Acids: In the first sequence, there is a repetition of methionine (Met) and arginine (Arg) codons. Similarly, in the second sequence, there is a repetition of glycine (Gly) and arginine (Arg) codons. This repetition could imply certain functional motifs or domains within the protein sequences.

dna genetics contig assembly • 749 views
6 months ago

Feels a bit strange for an aptitude test, feels more like homework.

1) seems OK to me, just chuck them into CAP3 and see what falls out. Can also do that manually I guess.

2) it may have been that they wanted you to reverse complement GCGGCCGCTAATCGGGG so it 'connects' to/overlaps with the CCCC at the end of 1)'s answer. An old-school assembler might have done this too.

3) this one depends on 2) being done correctly. If you indeed have several start codons (3?) it could be that the ORF predictor messed up and shouldn't have been extending further 'to the left', meaning the correct protein could've been Met-Asp-Thr-Pro-Val-Ala-Ser-Arg-Pro-Leu-Ile-Gly. I've seen that happen with overlong mitochondrial genes getting an 'extra' start codon to the left.
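The reverse-complement idea in 2) can be checked by hand or with a short Python sketch like the one below. This is only an illustration of the suggestion, not a statement of what the test setters intended; `reverse_complement` and `overlap` are hypothetical helper names, and the 3-base minimum overlap is an arbitrary choice for such short reads.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Complement every base, then reverse the string."""
    return dna.translate(COMPLEMENT)[::-1]


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


contig_1 = "ATGCGCTGCATGATGGATACCCC"
read_a = "GGTCGCTTCGCGGCC"
read_b = "GCGGCCGCTAATCGGGG"

# Forward orientation: neither new read extends the original contig.
forward = (overlap(contig_1, read_a), overlap(contig_1, read_b))
print(forward)  # (0, 0)

# Reverse-complemented reads do chain onto the end of the contig.
rc_b = reverse_complement(read_b)  # CCCCGATTAGCGGCCGC
rc_a = reverse_complement(read_a)  # GGCCGCGAAGCGACC
k1 = overlap(contig_1, rc_b)       # 4 (CCCC)
k2 = overlap(rc_b, rc_a)           # 6 (GGCCGC)

larger = contig_1 + rc_b[k1:] + rc_a[k2:]
print(k1, k2, larger)
```

A 4-base overlap (CCCC) is weak evidence on its own, so this should be read as one plausible interpretation of the question rather than a confirmed answer.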


Hi Phillip,

Thanks for providing an answer and clearing up some confusion. Looking into what you said about older assemblers reverse complementing the 2nd sequence to accommodate the 1st sequence, can you provide an example of such software, and maybe some insight into why assembly has moved away from this method?

Thanks, Ricardo


Hi Ricardo! CAP3 is an 'older' overlap assembler from 1999 which is perfectly suited for these kinds of 'small' tasks. I last used it in my undergrad to assemble some ESTs...
Nowadays we use de Bruijn graph-based assemblers like MEGAHIT, SPAdes, and Velvet, which should work here too but are probably overkill.


Hi Phillip,

Thanks for sharing your knowledge. Funnily enough, this really was an aptitude test, but anyway your help has been extremely useful and I really do appreciate the time you've taken to read my post. I've created another post using your insights and looking into the working a bit more, just to clear up further confusion: "Genome Assembly task + Protein Translation, assignment advice on a question". Here it is if you would like to read it and maybe contribute further.

Thanks again, Ricardo
