Matching fasta headers in a file and removing the entries that do not have a corresponding match
1
0
Entering edit mode
7.9 years ago

Hi, I have a file with 30,000 fasta entries from two different species, rn5 and mm10 (CDS alignment of the two species). I want to able to make sure that every fasta entry from rn5 has a corresponding partner from mm10. I know thats not true because I have 16,200 sequences from rn5 and only 15,000 sequences from mm10. So I want to be able to remove all stray rn5/mm10 sequences from the file. My file looks like

>NM_022953_rn5_1_38 66 0 2 chr1:268444735-268444931-
MALTPQRGSSSGLSRPELWLLLWAAAWRLGATACPALCTCTGTTVDCHGTGLQAIPKNIPRNTERL
>NM_022953_mm10_1_38 66 0 2 chr19:41743212-41743408-
MALTPQRGSSSGLSRPELWLLLWAAAWRLGATACPALCTCTGTTVDCHGTGLQAIPKNIPRNTERL
>NM_022953_rn5_2_38 24 2 2 chr1:268429604-268429675-
ELNGNNITRIHKNDFAGLKQLRVL
>NM_022953_mm10_2_38 24 2 2 chr19:41729055-41729126-
ELNGNNITRIHKNDFAGLKQLRVL
>NM_022953_rn5_3_38 24 2 2 chr1:268428022-268428093-
QLMENQIGAVERGAFDDMKELERL
>NM_022953_mm10_3_38 24 2 2 chr19:41727070-41727141-
QLMENQIGAVERGAFDDMKELERL
>NM_022953_rn5_4_38 24 2 2 chr1:268423192-268423263-
RLNRNQLQVLPELLFQNNQALSRL

The NM header lines start with '>'. I want to match the digits after rn5 and mm10 (eg:rn5_x_xx with mm10_x_xx ) and keep only those pairs that match and remove the pair of lines that do not have a corresponding match. Example: NM_022953_rn5_4_38 does not have a mm10 match and hence, remove the header line and the sequence line for this entry. The output will be all of the above lines sans this line. I also want to maintain the order of these lines. I really have no idea how to do this. I tried to solve this using awk and sed but have not been successful. I am very new to this and would appreciate help in solving this. Thank you.

fasta perl awk sed text processing • 2.8k views
ADD COMMENT
0
Entering edit mode

What is the data that you are trying to match?

ADD REPLY
2
Entering edit mode
7.9 years ago
  • get the uniq words for rn5 in the input.fasta
  • get the uniq words for mm10 in the input.fasta
  • combine both lists and get the words present in both list.
  • create a list of words with prefix _rn5 and _mm10 with awk
    cat <(cat input.fa | grep ">" | grep rn5_ | cut -d ' ' -f 1 | cut -d '_' -f 4-5  | sort| uniq) <(cat input.fa | grep ">" | grep mm10_ | cut -d ' ' -f 1 | cut -d '_' -f 4-5  | sort | uniq) | sort | uniq -d | awk '{printf("_rn5_%s \n_mm10_%s \n",$0,$0);}' > my.list
$ head my.list
    _rn5_1_38 
    _mm10_1_38 
    _rn5_2_38 
    _mm10_2_38 
    _rn5_3_38 
    _mm10_3_38 
 (...)

now you have 'my.list' , linearize the fasta, fgrep it with your list and convert it back to fasta:

    awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fa | grep -f my.list | tr "\t" "\n"
ADD COMMENT
0
Entering edit mode

Thank you! This works perfectly!

ADD REPLY
0
Entering edit mode

Actually, I am running into some problems when I use the original data file. This code works perfect with my example data file but somehow the last part is not working for the original file.

So, I get a my.list of unique entries and this list has 7844 entries (3922 for each species) and when in the last step we add the fasta lines for these entries the final output file should contain a total of 15686 lines. However, my final output file is turning out to be the same as the input.fa with 648210 lines. I am so confused why it isn't working with the original file.

ADD REPLY
0
Entering edit mode

check there is no empty line in 'my.list': it would match all the lines.

ADD REPLY
0
Entering edit mode

Thanks for responding. I removed empty lines if there were any with this sed -i '/^$/d' my.list It still does not help.

EDIT: I also tried with subsections of the my.list file and it seems to work with subsections. There seems to be something in that original my.list that's matching the whole file. I removed blank/empty lines and white spaces but non of that seems to work. Please help! :/

ADD REPLY
0
Entering edit mode

UPDATE: I think I figured out what the problem is. The unique entries are not based on rn5_x_xx but NMxx_rn5_x_xx. So I have more than one sequence for each entry in my.list. That was a total carelessness on my part to not have noticed that. Trying to now modify the cut command to make my.list with NMxx_rn5_x_xx entries.

ADD REPLY

Login before adding your answer.

Traffic: 1655 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6