how to remove multiple redundent sequence id of protein
2
0
Entering edit mode
9.3 years ago

Hi everyone

I have a list of id in which many are redundant (not similar) like I have following id

Traes_1DL_9344DD117.1
Traes_1DL_9344DD117.2
Traes_1DL_9344DD117.3
Traes_1DL_9344DD117.4
Traes_1DL_BDDF1198A.1
Traes_1DL_BDDF1198A.2
Traes_1DL_BDDF1198A.3
Traes_1DL_BDDF1198A.4
Traes_1DL_BDDF1198A.5
Traes_1DL_C91636A55.1
Traes_1DL_C91636A55.2
Traes_1DL_C91636A55.3
Traes_1DL_C91636A55.4
Traes_1DL_C91636A55.5
Traes_1DL_C91636A55.7
Traes_1BL_43408C9B0.2
Traes_2AL_48B239DD6.1
Traes_2AL_48B239DD6.2
Traes_2AL_48B239DD6.3

Traes_2AL_48B239DD6.4
Traes_2AL_C00552444.1
Traes_2AL_C00552444.2
Traes_2AL_C00552444.3
Traes_2AS_146D16572.1
Traes_2AS_146D16572.2
Traes_2AS_146D16572.3
Traes_2AS_146D16572.4
Traes_2AS_146D16572.5

I am assuming that the id having similar words before point (.) but differ later like

Traes_2AS_146D16572.1
Traes_2AS_146D16572.2
Traes_2AS_146D16572.3
Traes_2AS_146D16572.4
Traes_2AS_146D16572.5

having only one significant id i.e Traes_2AS_146D16572.1, and I want to retain that only from rest. So how I can do it through programming?

In short, I want to extract only id's which have .1 in last from all among. And if .1 is not present but .2 or .3 or so on is present and they are unique means initial id is different then I also want to print them. like Traes_1BL_43408C9B0.2 is only present and I want to print them.

Thank you

protein-sequence perl • 2.7k views
ADD COMMENT
1
Entering edit mode

the answer is

grep '\.1$' input.txt
ADD REPLY
1
Entering edit mode

what if there is only version .2 of the entry in the input?

ADD REPLY
1
Entering edit mode

he said "i want to extract only id's which have .1 in last from all among. "

ADD REPLY
0
Entering edit mode

Thanks nterhoeven,

I have put this as a separate question, but the moderator closed it telling me to search not put as a new question.

The grep gives a pattern,so for 10,000 or so I cannot do grep always for individual, that's y I was searching for awk command as I mentioned earlier.

Any other suggestions are welcome...

thanks a lot
RV

ADD REPLY
0
Entering edit mode

you can use the --file option with grep to search for multiple pattern, but this can be quite slow, if your fasta file is large.

A faster solution would be a perl script using a hash and the "exists" function.

BTW: you should use answers only if you are answering the question on the top. Other stuff (discussions, etc) should be done in comments

ADD REPLY
0
Entering edit mode

Ok Thanks nterhoeven |

ADD REPLY
0
Entering edit mode
9.3 years ago
nterhoeven ▴ 120

Edits:

An edit after discussing in the comments of this answer to give a clear solution:

Scenario 1:

You only need the part before the "."

cat input.txt | cut -d"." -f1 | sort | uniq

will go through the IDs, remove the part after the "." and make the list unique

cat input.txt | cut -d"." -f1 | sort | uniq -c

will additionally show how often the ID was seen.

Scenario 2:

You want to keep the part after the "."

cat input.txt | sort | perl -ne 'print $_ unless $_=~/^$patt\.\d+$/; ($patt)=$_=~/(^\S+)\./'

This will read the list of IDs, sort them and print an ID, if the part before the "." has not been seen before.

You should try all three commands I posted here and maybe make some changes to your test data (removing the ID with ".1", etc) to see how they work and then decide, which command suits you the best.


Original answer:

This should work for you

perl -ne 'print $_ unless $_=~/^$patt\.\d+$/; ($patt)=$_=~/(^\S+)\./'
echo "Traes_1DL_9344DD117.1
Traes_1DL_9344DD117.2
Traes_1DL_9344DD117.3
Traes_1DL_9344DD117.4
Traes_1DL_BDDF1198A.1
Traes_1DL_BDDF1198A.2
Traes_1DL_BDDF1198A.3
Traes_1DL_BDDF1198A.4
Traes_1DL_BDDF1198A.5
Traes_1DL_C91636A55.1
Traes_1DL_C91636A55.2
Traes_1DL_C91636A55.3
Traes_1DL_C91636A55.4
Traes_1DL_C91636A55.5
Traes_1DL_C91636A55.7
Traes_2AL_48B239DD6.1
Traes_2AL_48B239DD6.2
Traes_2AL_48B239DD6.3
Traes_2AL_48B239DD6.4
Traes_2AL_C00552444.1
Traes_2AL_C00552444.2
Traes_2AL_C00552444.3
Traes_2AS_146D16572.1
Traes_2AS_146D16572.2
Traes_2AS_146D16572.3
Traes_2AS_146D16572.4
Traes_2AS_146D16572.5" > input.txt

cat input.txt | perl -ne 'print $_ unless $_=~/^$patt\.\d+$/; ($patt)=$_=~/(^\S+)\./'

Traes_1DL_9344DD117.1
Traes_1DL_BDDF1198A.1
Traes_1DL_C91636A55.1
Traes_2AL_48B239DD6.1
Traes_2AL_C00552444.1
Traes_2AS_146D16572.1
ADD COMMENT
0
Entering edit mode

Let me know, if I should explain the perl command

ADD REPLY
0
Entering edit mode

Thank you sir, it works.

But what I have to do if .1 is not available so in this case I have to take .2 as well, so how I can extract .1 and if .1 is not there then .2, or others unique in single command.

ADD REPLY
0
Entering edit mode

Do you need .1 or .2 as well or just the base name? like Traes_2AS_146D16572

ADD REPLY
0
Entering edit mode

Thank you sir, I think the way I want to get id is not possible (but here is my elaborate question- sir at the time of annotation they are named id

Traes_2AS_146D16572.1
Traes_2AS_146D16572.2
Traes_2AS_146D16572.3
Traes_2AS_146D16572.4

like this and I assumed that the id .1 is important, but if .1 is not present then I can take .2 like Traes_1BL_43408C9B0.2 is present but Traes_1BL_43408C9B0.1 is not present so I can take Traes_1BL_43408C9B0.2 as well.) but it seems it is not possible if it is possible then please help me) thank you for your explanation.. sir can I get unique initial id means before "." like

Traes_1DL_9344DD117
Traes_1DL_BDDF1198A
Traes_1DL_C91636A55
Traes_2AL_48B239DD6
Traes_2AL_C00552444
Traes_2AS_146D16572
Traes_1BL_43408C9B0

so I can get idea and count this also

Thank u once again

ADD REPLY
1
Entering edit mode

if you only need the part before the ".", you can use

cut -d"." -f1 | sort | uniq

adding the -c option to unique will give you the number of occurrences.

ADD REPLY
0
Entering edit mode

OK sir. Thank you very much.

ADD REPLY
0
Entering edit mode

The command I posted takes care of this.

Here is an explanation how it works:

perl -ne  #start reading the input file line by line
'print $_ unless $_=~/^$patt\.\d+$/;    #print the line, if the part before the . does not match the the content of $patt
($patt)=$_=~/(^\S+)\./'    #set $patt to the current string before the .

So, it always prints the first occurence of an ID and skips all following. Note, that this only works if the input is sorted so you might want to run it as

cat input.txt | sort | perl -ne ...
ADD REPLY
0
Entering edit mode

I edited my answer to include some of the comments made here

ADD REPLY
0
Entering edit mode
9.3 years ago

Hi,

I am stuck in same kind of problem, as a newbie in bioinformatics, I am trying to extract the fasta sequences from a file_2 with similar ids from another list file_1

File_1.fasta

>comp148_c0_seq1
>comp169_c0_seq1
>comp258_c0_seq1
>comp285_c0_seq1
>comp350_c0_seq1
>comp424_c0_seq1
>comp783_c0_seq1
>comp1089_c0_seq1

File_2.fasta

>comp6_c1_seq1  -1      22      237
MAILRFMDSWVVGVNVCGKRPRRFVDPINMIRETIIRVHVRPFGVWISIICLIISLTSQCWKEWRRLLIRRF
>comp11_c0_seq1 -2      35      358
MKEDDKVIEDDEKAEGSKGDIQKEEPGADDETEESNKLIGDNQGKDEANADEDDPQNEETIDKSEENKQREEQQQITLLHFIGRSFASLLKNLLKKTCPSAAEGNNYY
>comp42_c0_seq1 +3      114     305
MPSTLKFLLIYESHWYDKNSEKLVNEFLSLLAHCTQLRYMPILLEDYDLLKLIEEKNTRQFDKI
>comp43_c0_seq1 -2      38      298
MSSSNWAFYIGVSNGHVHDNLLVCKAPNCYCFPTRMDHCYIGGTQHHFEWPTDAISRPQWNGIGDTLGCGILLNPKNELAIFFTANG
>comp48_c0_seq1 +3      18      242
MLYKSLINSKSLRGKTPAEVVNMFANDGQRIFDAVTFAPLVLIGPLVLVGGLIYLLRVIGPVSLLAVSVFLIFDF
>comp53_c0_seq1 -1      55      312
MIGNRLRVKRDKVTLKMEISHCLHSQIIGRGGRNTQKIMRDTGCHIHFPDSNKCLTNPVTMPQQAKNDQVSISGCAKDVEKAREML
>comp56_c0_seq1 -2      110     379
MNNARLNAEINELHAAIHANVHYGRPFKPSHISMNKSQATDRSDNNVCGQLATIDNKNENDHDNDNDDNEANDETRERRRFTVADYMPGG
>comp56_c1_seq1 -1      52      408
MWIWSGPIFSTTILFGHISPFLKPGWRRRAYFCPNFPRRLRVYWATTALLSLLIITFLLGRHFFLGVPSQLTTSPEAIFPSTLGLDCGNVTLSFTAAANDEKVPSNQTLAADKVVASFA
>comp72_c0_seq1 +2      14      271
MYALTWNGLMELKLGADTCPDVNVNWEVFGERGLKSISLFAVADKVFIFSTPNELLVYDRGSATISQFPIASPPLKTLLAVSQSSQ
>comp74_c0_seq1 +3      96      305
MVIPFIFLKAQTIQLEAKDESNFCQNRVDLLAHEVEVRVRIGKRQKRALASSAGCCSCGRGPMGERGAPG
>comp79_c0_seq1 -1      31      213
MASAMFCCQVCMLLSSAYHIFGCHSPNRRKRWLRADLFGVSAGLIGLYLSGLYTSFYHFPV

I used awk:

diff <(cat file_1 | grep ">" | sort) <(cat file_2 | grep ">" | sort) | grep "^<" | awk -F\> '{print $2}'

I m just getting ids only, I need sequences to corresponding ID also.

Can anyone suggest awk or perl code for the same?

Thanks
RV

ADD COMMENT
0
Entering edit mode

Hi Ruchi, welcome to Biostars.

It is better to open a new question for this, but here is a quick answer:

Assuming, your fasta file has no empty lines and only one line per sequence, you can use the -A option of grep.

This allows you to display the lines following a match. For example:

grep -A 1 'comp72_c0_seq1' File_2.fasta

will return:

>comp72_c0_seq1 +2      14      271
MYALTWNGLMELKLGADTCPDVNVNWEVFGERGLKSISLFAVADKVFIFSTPNELLVYDRGSATISQFPIASPPLKTLLAVSQSSQ

You can grep for a list of Patterns with the --file option (check 'man grep' for details)

ADD REPLY

Login before adding your answer.

Traffic: 1780 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6