Question

gene's ID extraction

7.1 years ago

paraskevopou ▴ 20

Dear all, I have two text files: file1.txt with the full IDs (e.g. TRINITY_DN14306_c0_g3_i4) and file2.txt with the IDs of interest, which lack the isoform information (e.g. TRINITY_DN14306_c0_g3). file1.txt has 40,000 records, while file2.txt has 5,000. I would like to extract those 5,000 from file1.txt along with their isoform information. I used the following command, but the output I get is empty.

while read line; do grep -e "${line}_" file1.txt; done < file2.txt > out.txt

Any suggestions will be helpful. Thanks a lot in advance! Sofia

rna-seq • 1.3k views

ADD COMMENT • link 7.1 years ago by paraskevopou ▴ 20

Please validate or comment on your previous questions:

If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted.

ADD REPLY • link 7.1 years ago by Pierre Lindenbaum 166k

however the isoform information is missing (e.g. TRINITY_DN14306_c0_g3)

Please provide a sample of each file.

ADD REPLY • link 7.1 years ago by Pierre Lindenbaum 166k

Thanks a lot for the comments. Here are my files. I want to extract from file1.txt only the names that are present in file2.txt. However, file1.txt also contains the isoform information (_i*), which I need in my output file as well.

file1.txt

TRINITY_DN12874_c0_g1_i1
TRINITY_DN12795_c0_g1_i2
TRINITY_DN12248_c0_g1_i1
TRINITY_DN12868_c0_g1_i1
TRINITY_DN12866_c0_g1_i1
TRINITY_DN12817_c1_g1_i1
TRINITY_DN12775_c1_g2_i2
TRINITY_DN12829_c0_g1_i1
TRINITY_DN12736_c0_g1_i1
TRINITY_DN12865_c0_g1_i1

file2.txt

TRINITY_DN12874_c0_g1
TRINITY_DN12248_c0_g1
TRINITY_DN12866_c0_g1
TRINITY_DN12817_c1_g1

ADD REPLY • link 7.1 years ago by paraskevopou ▴ 20
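
With the sample files above, the loop from the question should in principle find matches, so one possible explanation for the empty output (an assumption, not something confirmed in the thread) is that file2.txt has Windows-style CRLF line endings, which leave a carriage return between the ID and the appended underscore. The sketch below recreates the sample data in a scratch directory, writes file2.txt with CRLF endings on purpose, and then builds a pattern file with the carriage returns removed. The names workdir, patterns.txt and loop_hits are illustrative only.

```shell
#!/usr/bin/env bash
# Sketch only: file names follow the post; the CRLF endings are an assumption.

workdir=$(mktemp -d)   # scratch directory so no real file is overwritten

# file1.txt: full IDs with isoform suffix (sample from the post).
printf '%s\n' \
  TRINITY_DN12874_c0_g1_i1 TRINITY_DN12795_c0_g1_i2 \
  TRINITY_DN12248_c0_g1_i1 TRINITY_DN12868_c0_g1_i1 \
  TRINITY_DN12866_c0_g1_i1 TRINITY_DN12817_c1_g1_i1 \
  TRINITY_DN12775_c1_g2_i2 TRINITY_DN12829_c0_g1_i1 \
  TRINITY_DN12736_c0_g1_i1 TRINITY_DN12865_c0_g1_i1 \
  > "$workdir/file1.txt"

# file2.txt: gene-level IDs, written with CRLF endings to mimic the assumed problem.
printf '%s\r\n' \
  TRINITY_DN12874_c0_g1 TRINITY_DN12248_c0_g1 \
  TRINITY_DN12866_c0_g1 TRINITY_DN12817_c1_g1 \
  > "$workdir/file2.txt"

# The loop from the question (with read -r): each pattern ends in "<CR>_",
# which never occurs in file1.txt, so nothing is printed.
loop_hits=$(while read -r line; do
  grep -e "${line}_" "$workdir/file1.txt"
done < "$workdir/file2.txt" | wc -l)

# Build one pattern per gene ID: drop carriage returns and empty lines,
# then append "_i" so that g1 cannot match g10, g11, and so on.
tr -d '\r' < "$workdir/file2.txt" \
  | sed -e '/^$/d' -e 's/$/_i/' > "$workdir/patterns.txt"

# -F treats the patterns as fixed strings; -f reads them from a file,
# so file1.txt is scanned once instead of once per ID.
grep -F -f "$workdir/patterns.txt" "$workdir/file1.txt" > "$workdir/out.txt"

echo "matches from the loop: $((loop_hits))"
cat "$workdir/out.txt"
```

If the real file2.txt turns out to have Unix line endings, the tr step is harmless and the cause of the empty output lies elsewhere; running `file file2.txt` or `cat -A file2.txt` (GNU coreutils) is one way to check.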

-e is meant for specifying multiple patterns, whereas you have only one per iteration. I am not sure whether this matters, though.

ADD REPLY • link 7.1 years ago by grant.hovhannisyan ★ 2.6k
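
As a quick check of the point about -e, the following minimal sketch compares a single pattern given with and without -e, and then passes two patterns in one call. The file ids.txt and the variable names are made up for illustration; the IDs are taken from the sample in the thread.

```shell
#!/usr/bin/env bash
# Sketch: -e with a single pattern behaves like a plain pattern argument.

tmp=$(mktemp -d)   # scratch directory (illustrative)
printf '%s\n' TRINITY_DN12874_c0_g1_i1 TRINITY_DN12795_c0_g1_i2 > "$tmp/ids.txt"

# Same single pattern, once with -e and once as a bare argument.
with_e=$(grep -e "TRINITY_DN12874_c0_g1_" "$tmp/ids.txt")
without_e=$(grep "TRINITY_DN12874_c0_g1_" "$tmp/ids.txt")

# Repeating -e supplies several patterns; a line matches if any of them matches.
# -c prints the number of matching lines.
both=$(grep -c -e "DN12874_" -e "DN12795_" "$tmp/ids.txt")

echo "with -e:    $with_e"
echo "without -e: $without_e"
echo "lines matching either pattern: $both"
```

So for one pattern per loop iteration, the presence of -e should not by itself explain an empty result.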

It is helpful if you have extended regular expressions, such as capture groups or even character classes in certain instances. I've rarely ever gone wrong with using -e when it's not really needed. I've found that expected behavior is seen more often with -e (or -P) than without. Plus, it's easier to build on.

ADD REPLY • link 7.1 years ago by Ram 45k
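
For reference, in POSIX and GNU grep the lowercase -e introduces a pattern, while the uppercase -E is the option that switches to extended regular expressions. A small sketch of the difference, using IDs from the sample above (the variable names are illustrative):

```shell
#!/usr/bin/env bash
# Sketch: lowercase -e names a pattern; uppercase -E selects extended regex syntax.

ids=$(printf '%s\n' \
  TRINITY_DN12874_c0_g1_i1 TRINITY_DN12817_c1_g1_i1 TRINITY_DN12775_c1_g2_i2)

# Default (basic) syntax: an unescaped "|" is a literal character,
# so this pattern matches none of the IDs. grep exits non-zero on no match,
# hence the "|| true".
bre_hits=$(printf '%s\n' "$ids" | grep -c -e 'c0_g1|c1_g2' || true)

# With -E, "|" means alternation: the first and third IDs match.
ere_hits=$(printf '%s\n' "$ids" | grep -c -E -e 'c0_g1|c1_g2')

# -E also allows "+" unescaped; the anchors pin the gene ID and any isoform number.
anchored=$(printf '%s\n' "$ids" | grep -E '^TRINITY_DN12874_c0_g1_i[0-9]+$')

echo "basic: $bre_hits, extended: $ere_hits"
echo "$anchored"
```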