Hi,
I would like to filter sequences on the Unix command line, with grep or BBmap, based on a list of names stored in a separate file. The list contains only part of each sequence name, not the full name.
The name list looks like:
cre
cln
pab
pde
pta
ppt
smo
atr
seu
pgi
cca
han
The sequences look like this (the names from my list are at the beginning, as the first three characters):
>cel-let-7-5p MIMAT0000001 Caenorhabditis elegans let-7-5p
UGAGGUAGUAGGUUGUAUAGUU
>cel-let-7-3p MIMAT0015091 Caenorhabditis elegans let-7-3p
CUAUGCAAUUUUCUACCUUACC
>cel-lin-4-5p MIMAT0000002 Caenorhabditis elegans lin-4-5p
UCCCUGAGACCUCAAGUGUGA
>pad-lin-4-3p MIMAT0015092 Caenorhabditis elegans lin-4-3p
ACACCUGGGCUCUCCGGGUACC
>pad-miR-1-5p MIMAT0020301 Caenorhabditis elegans miR-1-5p
CAUACUUCCUUACAUGCCCAUA
>cel-miR-1-3p MIMAT0000003 Caenorhabditis elegans miR-1-3p
UGGAAUGUAAAGAAGUAUGUA
>cel-miR-2-5p MIMAT0020302 Caenorhabditis elegans miR-2-5p
CAUCAAAGCGGUGGUUGAUGUG
>cca-miR-2-3p MIMAT0000004 Caenorhabditis elegans miR-2-3p
UAUCACAGCCAGCUUUGAUGUGC
>cca-miR-34-5p MIMAT0000005 Caenorhabditis elegans miR-34-5p
AGGCAGUGUGGUUAGCUGGUUG
My BBmap code is the following:
./bbmap/filterbyname.sh in=mature.fa out=filtered.fa include=t names=names.txt substring
I have no idea how to do this with grep.
The problem is that this command also keeps the wrong sequences, because I don't know how to tell it to match only where the name is at the beginning of the sequence name. Maybe grep would be better?
Please help.
Best wishes,
thend
Adding a - to your input name list might already help in your example (putting a > in front will help even more). Also, the substring parameter requires a value, no? I would set it to substring=name
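If you want to try the same idea with grep, here is a minimal sketch. It assumes every header has the form >name-... and that each sequence sits on exactly one line, as in your example. The demo_* file names are made up for illustration; substitute your own names.txt and mature.fa.

```shell
# Demo inputs (hypothetical file names, records copied from the example above)
printf '%s\n' cca pab > demo_names.txt
printf '%s\n' \
  '>cel-let-7-5p MIMAT0000001 Caenorhabditis elegans let-7-5p' \
  'UGAGGUAGUAGGUUGUAUAGUU' \
  '>cca-miR-2-3p MIMAT0000004 Caenorhabditis elegans miR-2-3p' \
  'UAUCACAGCCAGCUUUGAUGUGC' \
  '>pad-miR-1-5p MIMAT0020301 Caenorhabditis elegans miR-1-5p' \
  'CAUACUUCCUUACAUGCCCAUA' > demo_mature.fa

# Turn each name into an anchored pattern: "cca" becomes "^>cca-"
sed -e 's/^/^>/' -e 's/$/-/' demo_names.txt > demo_patterns.txt

# Print each matching header plus the one line after it (-A 1),
# then drop the "--" separators grep puts between non-adjacent matches
grep -A 1 -f demo_patterns.txt demo_mature.fa | grep -v '^--$' > demo_filtered.fa

cat demo_filtered.fa
```

The ^ anchors the match to the start of the line, so a name that only appears later in the header (for example in the species description) is not picked up.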
Dear lieven.sterck,
The original command does not require '-', please look:
I also tried substring=name, but unfortunately it doesn't change the outcome.
I meant adding the dash to your names in the name list (not to the options), just like you did with your grep approach below ;-)
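A grep-based filter that prints a fixed number of lines after each header would cut off sequences wrapped over several lines. An awk sketch that compares the text between > and the first - against the name list exactly, and keeps every line up to the next header, might look like this. The demo_* file names and the wrapped sequence are assumptions for illustration only, not part of your data.

```shell
# Demo inputs (hypothetical file names; the cca sequence is wrapped on purpose)
printf '%s\n' cca pab > demo_names.txt
printf '%s\n' \
  '>cel-let-7-5p MIMAT0000001 Caenorhabditis elegans let-7-5p' \
  'UGAGGUAGUAGGUUGUAUAGUU' \
  '>cca-miR-2-3p MIMAT0000004 Caenorhabditis elegans miR-2-3p' \
  'UAUCACAGCC' \
  'AGCUUUGAUGUGC' \
  '>pad-miR-1-5p MIMAT0020301 Caenorhabditis elegans miR-1-5p' \
  'CAUACUUCCUUACAUGCCCAUA' > demo_mature.fa

awk '
  # First file: remember every non-empty name from the list
  NR == FNR { if (NF) keep[$1] = 1; next }
  # Header line: take the text between ">" and the first "-" and look it up
  /^>/      { split(substr($0, 2), part, "-"); printing = (part[1] in keep) }
  # Print headers and sequence lines while the current record is wanted
  printing
' demo_names.txt demo_mature.fa > demo_filtered_awk.fa

cat demo_filtered_awk.fa
```

Because the prefix is compared as a whole string, cca matches >cca-miR-2-3p, while a similar but different prefix such as pad is not matched by pab.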