Strore in separated fasta files fasta sequences with same locus
1
0
Entering edit mode
7.7 years ago
Amy • 0

Hi,

I'm currently working on alternative splicing and I have a fasta file in which, sometimes many reads taken as different genes come from the same locus due to alternative splicing. I need to group those sequences in separated single fasta files in order to align them further and get one single sequence per locus.

The fasta file looks like this:

>XM_010906229.1 PREDICTED: Elaeis guineensis probable strigolactone esterase D14 homolog (LOC105031930), mRNA
ATTTTTTTGAGCTAATAAACTCTCCAACCGCTATCAATATATAGTACCCCTACACATCCC
CTGCGGGGGTCACCCACATCATCATTATCATTTCACTCTCTCGTTTTGTCGGTGTGCCTG
CTTGCCATGTATAGGAACGTAAGGATCGTAGGGAATGGAGAGCAAACTGTAGTGCTCTCA
CATGGCTACGGTGGGAGCCAGTCGGTTTGGGACAAGGTGGTGCCGCACCTGTCTCAAAAG
>XM_010906230.1 PREDICTED: Elaeis guineensis leucine-rich repeat receptor-like serine/threonine/tyrosine-protein kinase SOBIR1 (LOC105031930), mRNA
CTTCCCATATGGATCCTTTCAATTCCTCTCCACCCTCCTCTCTCTCTACGCTTTCTAATC
ATCAACTCAGAACTAACGAAGGCCCAGCACCAACAAGACATCCCTCCATGGCCGATTTTC
TCTCTCATCCACTATGGGTCGGCGCTGCCATCTCTTTCTCCGTCGGCTTTGCCGTAGGTA
CCTTCATCTTCATCGTCTGGAAACTCGCCATCAGCCGCTGCCGTCGCATCCAAACCAACG
AAGAAGAACTCGCCAACACCCCCACCGTCTTCAGCCCCATGCTCAGGTCCAACCTCTCCT
>XM_010906231.1 PREDICTED: Elaeis guineensis leucine-rich repeat receptor-like serine/threonine/tyrosine-protein kinase SOBIR1 (LOC105031931), mRNA
AATCACCTCTAAATCAATTTCTCTTAAATTTTATGAGGACAATCAAGGAGAAAAAATAAT
GCATTAAGTACAAACATTCAAGTCTTCTTCAAGTAAGTATCATCGACAACAATCAGCTTG
TTAAGAAGCTCTTGGATTATCTAGTTGGACAACATAAATGCTATCTAAAATAATAATACG
GAAACCTATAAGACTTTCAAGTCGGGCTAAGGTGCTCTTCCTCATGTCTGACTGCCCCTC
CAGTTTTCTATAGTTGCATCCTTTTAACGTCAGGCCTATTATCAGGATTTTTTTTTCTTC

What I want is to store for example the two first sequence in a separated file from the initial fasta file if they come from the same location (LOC105031930). How can I do it ? Any inputs will be very appreciated, Thanks.

sequence RNA-Seq alignment fasta • 2.3k views
ADD COMMENT
1
Entering edit mode

Please give an examples what the fasta headers are and in which way you would like them to be. On the other hand, it seems to me that you needed the fasta file of genes, but you have downloaded the transcripts. Try dloading the gene-table from your source.

ADD REPLY
0
Entering edit mode

Examples are in the original post (not sure if they were added after you wrote this comment).

>XM_010906229.1 PREDICTED: Elaeis guineensis probable strigolactone esterase D14 homolog (LOC105031930), mRNA
>XM_010906230.1 PREDICTED: Elaeis guineensis leucine-rich repeat receptor-like serine/threonine/tyrosine-protein kinase SOBIR1 (LOC105031930), mRNA

Since these two sequences are from the same LOC#, OP wants them in a separate file.

ADD REPLY
0
Entering edit mode

they are added after

ADD REPLY
6
Entering edit mode
7.7 years ago

Maybe not super elegant but teaches you awk tricks. Redirecting printing to different files with variable file names is the best part. It is very basic and only appends to files, so if they are already in place, you might get into trouble as it does not overwrite them.

awk '/^>/{start=match($0,/\(/)+1;len=match($0,/\)/)-match($0,/\(/)-1; locus_id=substr($0,start,len); file_name=locus_id".fasta"; print $0 >> file_name}/^[^>]/{print $0 >> file_name}' input.fasta

/^>/ is finding fasta sequences headers and match helps extract loci names assuming they are between parenthesis () and no other parenthesis present. print uses >> to append to fariable file name defined in file_name=locus_id".fasta"

this will generate two fasta files in your case (or as many as there are loci in your whole data set).

ADD COMMENT
3
Entering edit mode

+1 for the fasta manipulation by awk. The code can be simplified even further as

awk ' /^>/ {match($0, /LOC[0-9]*/); locus_id=substr($0, RSTART, RLENGTH); print $0  >> locus_id".fasta" } /^[^>]/ {print $0 >> locus_id".fasta"}' input.fasta
ADD REPLY
0
Entering edit mode

Thank you for the help. Since the first lines of my script are written in perl is it possible to combine perl and this awk code together ?

ADD REPLY
0
Entering edit mode

If you are using perl then this could be done in the program itself. Or you could use system() call to call the code above.

ADD REPLY
0
Entering edit mode

Great then. I'm currently trying to tun the code from the command line. It seems to work. I'll keep you posted about the progress. Thank you all !

ADD REPLY
0
Entering edit mode

I agree, this is why I wrote that my version is not super elegant. I just wanted to make sure my solution works on different operating systems and with different loci IDs. For some reason awk on Mac OS does not support print $0 >> locus_id".fasta" but print $0 >> locus_id will work perfectly. Or /^[^>]/ can be replaced with !/^>/ and so on.

My general approach to awk is that it is not about speed or code readability and so on, but rather about ability to write one-liners quickly for fast and easy tasks like mentioned above. It is good for a small research project or for testing a new idea. For production scripts, tools and analysis there are flow, exceptions and error management needed and ability to build on top later, so using functions, services, python, C++, java, R, etc become a key. If we approach awk and other languages that way in the process of teaching bioinformatics, I think RSRART and RLENGTH are a bit overcomlicated conceptes that are hard to remember and use in other languages, while index, match, etc work similarly to awk's match basic functionality of returning the position of the sub-string within the string.

ADD REPLY

Login before adding your answer.

Traffic: 1894 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6