Hi, I have a file with 30,000 FASTA entries from two different species, rn5 and mm10 (a CDS alignment of the two species). I want to make sure that every FASTA entry from rn5 has a corresponding partner from mm10. I know that is not currently the case, because I have 16,200 sequences from rn5 and only 15,000 sequences from mm10. So I want to remove all stray rn5/mm10 sequences from the file. My file looks like this:
>NM_022953_rn5_1_38 66 0 2 chr1:268444735-268444931-
MALTPQRGSSSGLSRPELWLLLWAAAWRLGATACPALCTCTGTTVDCHGTGLQAIPKNIPRNTERL
>NM_022953_mm10_1_38 66 0 2 chr19:41743212-41743408-
MALTPQRGSSSGLSRPELWLLLWAAAWRLGATACPALCTCTGTTVDCHGTGLQAIPKNIPRNTERL
>NM_022953_rn5_2_38 24 2 2 chr1:268429604-268429675-
ELNGNNITRIHKNDFAGLKQLRVL
>NM_022953_mm10_2_38 24 2 2 chr19:41729055-41729126-
ELNGNNITRIHKNDFAGLKQLRVL
>NM_022953_rn5_3_38 24 2 2 chr1:268428022-268428093-
QLMENQIGAVERGAFDDMKELERL
>NM_022953_mm10_3_38 24 2 2 chr19:41727070-41727141-
QLMENQIGAVERGAFDDMKELERL
>NM_022953_rn5_4_38 24 2 2 chr1:268423192-268423263-
RLNRNQLQVLPELLFQNNQALSRL
The NM header lines start with '>'. I want to match the digits after rn5 and mm10 (e.g. rn5_x_xx with mm10_x_xx), keep only the entries that have a matching partner in the other species, and remove the header line and sequence line of any entry that does not. Example: NM_022953_rn5_4_38 does not have an mm10 match, so its header line and sequence line should be removed. The output would be all of the above lines except that entry. I also want to maintain the order of the lines. I tried to solve this using awk and sed but have not been successful. I am very new to this and would appreciate help in solving it. Thank you.
What is the data that you are trying to match?
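One possible way to do the filtering described in the question is a two-pass approach, sketched here in Python rather than awk or sed. It assumes headers always have the form `>ACCESSION_SPECIES_x_xx ...` as in the sample, and that two entries are partners when both the accession (e.g. `NM_022953`) and the digits after the species name agree. Including the accession in the key is an assumption, made in case the same digits recur under different transcripts. The function names `read_records`, `pair_key` and `keep_paired` are made up for this illustration.

```python
import re
from collections import defaultdict

# Assumed header layout: >ACCESSION_SPECIES_x_xx followed by other fields.
HEADER = re.compile(r"^>(\S+)_(rn5|mm10)_(\d+_\d+)(?:\s|$)")


def read_records(lines):
    """Yield (header, sequence_lines) for each FASTA entry, in file order."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, seq
            header, seq = line, []
        elif header is not None and line:
            seq.append(line)
    if header is not None:
        yield header, seq


def pair_key(header):
    """Return ((accession, digits), species), or (None, None) if unparsable."""
    m = HEADER.match(header)
    if m is None:
        return None, None
    return (m.group(1), m.group(3)), m.group(2)


def keep_paired(lines):
    """Return the lines of all entries that have a partner in the other species."""
    records = list(read_records(lines))

    # Pass 1: note which species occur for each (accession, digits) key.
    species_seen = defaultdict(set)
    for header, _ in records:
        key, species = pair_key(header)
        if key is not None:
            species_seen[key].add(species)

    # Pass 2: emit entries in their original order if both species were seen.
    kept = []
    for header, seq in records:
        key, _ = pair_key(header)
        if key is not None and species_seen[key] == {"rn5", "mm10"}:
            kept.append(header)
            kept.extend(seq)
    return kept
```

To use it, pass an open file (or any list of lines) to `keep_paired` and write the returned lines to a new file. Entries whose header does not match the assumed pattern are dropped, and if a key occurs more than once for one species, all of its entries are kept as long as the other species is present. Both behaviours are choices of this sketch and may need adjusting for the real data.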