Hello wonderful beings of bioinformatics!
I'm new to this world and could use some help.
My job is to run multiple sequence alignment on a large dataset. I am looking into the L1 family of genes and wanting to compare 7,525 elements of full length sequences. Each sequence is ~6,000 base pairs in length. I was hoping to create a giant visual output of 7,525 x 6,000. I have all of the sequences in a fasta file, but it seems to be too large to use in any of the multiple sequence alignment platforms. I've tried using mafft, clustal omega, and clustalw so far.
My complete fasta file is ~46,000kb. Is it really too big for processing?
I figured the easy solution was to split up the data, so I divided my 1 fasta file into 10 fasta files, all under 5000kb each (the maximum on clustalw). This produces an output, but it is not what I expected. I want to get an old-school looking output like:
chr10_35478-41492 --------AAATGGAATAGGAACAGCTCTGGTCTACA-GCTCCCAGCGTG
chr10_1221417-1227555 TTTATTATTATTATTATACTTTAAGTTTTAGGGTACATGTGCACAACGTG
* * *** ** * * * **** * * ** ****
(except much larger)
I can generate this output when I copy/paste a handful of samples, but when I use a bigger fasta file, there is no old-school visual output for comparison. I get this:
CLUSTAL 2.1 Multiple Sequence Alignments
Sequence type explicitly set to DNA
Sequence format is Pearson
Sequence 1: chr10_100406331-100412500 6169 bp
Sequence 2: chr10_100454491-100460565 6074 bp
Sequence 3: chr10_10069375-10075458 6083 bp
etc.
I've looked through the arguments and options, but am not understanding what it is I need to do to get the output I want.
The slogan for clustal omega got me excited: "the last alignment program you'll ever need", but the file size limit here is even less than clustalw. I'm confused, because it says it can "allow hundreds of thousands of sequences to be aligned in only a few hours". Am I using it wrong?
I think the most reasonable answer is probably to subset my data even further. However, then I'll have the problem of trying to merge it all back together to compare.
Overall, I am hoping to find out if 1) what I am doing wrong and stupid, 2) what parameters I need to generate the old-school clustal "plots", 3) what program I should be using, if I am using one that isn't a good fit for my purposes, 4) any suggestions on how to visualize this many regions of this length. I have spent almost 25 hours on this already, and probably should have asked for help sooner!
THANK YOU <3
Thank you much for your help!
I'm happy to hear my dataset isn't big enough to break the programs! I really appreciate the logical questioning about working with the data the way I have it. My boss is interested in looking at really subtle changes, which I imagine is why the request included so much redundancy. We'll see how far I get with this, I guess!
I was successful (I think?) using clustalo from the terminal. The GUIs all gave me size limit problems. I used:
resutling in
and got a giant output that looks mostly like:
About 4 million pages later (I kid, somewhat) I can see some of the dna:
but I am unsure how to interpret all of this. I have read that "*" means that the residues or nucleotides are identical in all sequences in the alignment, ":" means that conserved substitutions are observed, and "." means that semi-conserved substitutions have been observed.
but I have dashes! Does that mean these are all gaps? Is it something about my inputs? Does it have something to do with the output:
?
Any suggestions or directions are so appreciated! Thank you so much.
The '-' are gaps or 'indels' insertion/deletions. That means the alignment has worked. The next thing is to get the alignment inframe so you can translate it - that is certainly what your boss wants. However, this might prove difficult because alignment errors will result in frameshifts.
What you might want to do is supplement this with an amino acid alignment separately, because translating and getting everything inframe isn't always straight-forward.
Thanks for the thanks, but upvotes welcome.
Forgive me, I am not sure what "get the alignment inframe" means. What would supplementing with an amino acid alignment look like? Is it similar to what Mensur suggested below about trimming the non-conserved columns? Or does "translating" mean something different? TY!!
Yes this is different, it is possible that you are using amino acids already, but you definitely said 'DNA'.
So I mean is using the triplet codon table to convert the sequence to amino acids so you can examine the protein mutations in the gene. It absolutely the next step your supervisor will want.
The issue is you probably have 'junk' given all the gaps at the beginning of the alignment and Mansaur is suggesting a tool to remove these, to clean up the "noisy ends"
Are you saying you didn't believe my math when I told you that your output would be tens of thousands of pages?
Kidding aside, you got exactly the expected output. And yes, that is how it should look like, and how voluminous it should be when you have 7+ thousand sequences with 6 thousand letters each. I honestly don't know what kind of subtle changes one can see by studying this kind of output, but I think you have done the part your boss asked for.
One thing that may help is trimming out the non-conserved columns, which will at least somewhat reduce the output size. I still think it will be beyond what a person can reasonably analyze, but who knows. This trimming can be done using multiple programs, one of which is trimAl.
Hahaha, no doubt! Loving your sense of humor.
I am glad to hear that I got the exact expected output! I looked into trimming the non-conserved columns with trimAl, but am getting an error:
I searched for the error but didn't find any more information other than what the error already says. Do you have any idea what I might do here? Should this be its own post, perhaps?
It is difficult to know for sure without seeing the alignment - and please don't post it here. I am guessing that trimAl is reading only the part of sequence name up to the column, so many of your sequences end up having the same name (chr10, chr1, etc). If my guess is correct, this could be solved by making all sequence names unique.
I suggest you copy your alignment to
unique.aln
and run the following command:This will convert
chr10:100406331-100412500
intochr10_100406331-100412500
and similarly all otherchr10:
occurrences intochr10_
, which I think will make sequence names unique. The thing is that you will need to run the above command for all chromosome numbers, and then maybe the file will work with trimAl. If not, you may need to start a new thread with more details.I ran the perl command for each chromosome and checked that the name changed worked. However, I'm getting a similar error.
I understand it may be time to ask about this in a new post, but I thought I'd see if you had any suggestions regarding the "number of residues fixed by the alignment."
EDIT: I see that some of these names have additional colons, like chr10_GL383545v1_alt:13600-19716 , so I'll edit the perl line to swap : for _ after the word "alt" - right?
Right.
Ok, it worked! Thank you so much for your help.
Thank you very much for your help! Turns out I have 6 cores on my PC, and 12 on my laptop (that's probably normal, but it was surprising to me). I have updated this adventure in my post below. I'd love to hear any additional thoughts you have about it. Thanks again!
Beyond this you are into quite specialist approaches - you need to discuss your research objectives with your supervisor