Hello!
I want to extract a selected set of entries from a large data file. I have a list named "contig.list" that looks like this:
Contig_339241_4
Contig_1004621_3
Contig_1666_1
Contig_836268_32
Contig_1479_10
Contig_640297_1
Contig_365838_1
..
I want to pull the rows for these entries out of a big table named "function.tax.ranks" that looks like this:
Contig_339241_4 Taxonomy
Contig_339241_41 Taxonomy
Contig_339241_47 Taxonomy
Contig_1004621_3 Taxonomy
Contig_1004621_30 Taxonomy
Contig_1004621_39 Taxonomy
Contig_1666_1 Taxonomy
Contig_836268_32 Taxonomy
Contig_1479_10 Taxonomy
Contig_1479_100 Taxonomy
Contig_1479_100 Taxonomy
Contig_1479_107 Taxonomy
Contig_640297_1 Taxonomy
Contig_365838_1 Taxonomy
Contig_365838_16 Taxonomy
Contig_365838_17 Taxonomy
..
The resulting output should be:
Contig_339241_4 Taxonomy
Contig_1004621_3 Taxonomy
Contig_1666_1 Taxonomy
Contig_836268_32 Taxonomy
Contig_1479_10 Taxonomy
Contig_640297_1 Taxonomy
Contig_365838_1 Taxonomy
I have tried:
grep -f contig.list function.tax.ranks > contig_taxa.txt
The problem is that the match doesn't stop at the last digit of the ID; grep also returns every ID that merely starts with the same characters. For example, my list contains only "Contig_339241_4", but the output also includes "Contig_339241_41" and "Contig_339241_47" (basically every entry matching Contig_339241_4[0-9]). How can I fix it?
Thank you very much in advance!
Regards, PSP
GenoMax, thanks a lot!
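
For reference, here is a minimal sketch of two ways the prefix-matching problem described above could be avoided. It assumes a grep that supports -w (GNU and BSD grep do) and that the ID is the first whitespace-separated column of the table. The two small files created at the top are cut-down stand-ins built from the sample rows in the post, and the output name contig_taxa_awk.txt is made up for illustration; run it in a scratch directory so the real files are not overwritten.

```shell
# Small stand-ins for the real files (sample rows taken from the post).
printf '%s\n' Contig_339241_4 Contig_1666_1 Contig_1479_10 > contig.list
printf '%s\n' \
  'Contig_339241_4 Taxonomy' \
  'Contig_339241_41 Taxonomy' \
  'Contig_1666_1 Taxonomy' \
  'Contig_1479_10 Taxonomy' \
  'Contig_1479_100 Taxonomy' > function.tax.ranks

# Option 1: grep with whole-word matching.
# -F treats each ID as a fixed string rather than a regular expression;
# -w requires the match to be a whole word, so Contig_339241_4 no longer
# matches inside Contig_339241_41 (the trailing "1" is a word character).
grep -w -F -f contig.list function.tax.ranks > contig_taxa.txt

# Option 2: awk with an exact comparison on the first column only.
# While reading the first file (NR == FNR), store each ID as an array key;
# for the second file, print rows whose first field is one of those keys.
awk 'NR == FNR { ids[$1]; next } $1 in ids' contig.list function.tax.ranks > contig_taxa_awk.txt
```

The grep variant would still match an ID that appears as a whole word in some other column, whereas the awk variant only ever compares the first column, which may be the safer choice if the taxonomy field can contain contig names.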