I am working on SNP files, and I am trying to search for a list of SNPs in another file from the command line.
I have been using this line:
$ grep -w -f {file with list of SNPs as rsIDs in a single column .txt} {file being searched with a larger number of SNPs .txt} > {file with the found SNPs .txt}
The thing is this worked fine yesterday, but today it has been returning all the SNPs in the larger file that are in millions and I do not want that. I just wanted the selected SNPs with their values.
The thing is this worked fine yesterday, but today it has been returning all the SNPs
With the same set of files? That is unlikely if nothing has changed with the files.
If not, then there is something wrong with one of the files you are using. My guess would be the input rsID file. Did that file come from a Windows machine?
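One way to check whether the rsID list picked up Windows line endings is sketched below. The file names are hypothetical stand-ins, and the toy list is created only so the commands can be run as-is. A carriage return at the end of each rsID is invisible in most editors, but it becomes part of every pattern that grep reads from the list.

```shell
# Toy rsID list written with Windows (CRLF) line endings; file names are hypothetical.
demo=$(mktemp -d)
printf 'rs123\r\nrs789\r\n' > "$demo/rsID_list.txt"

# Count the lines that contain a carriage return; anything above 0 suggests CRLF endings.
cr=$(printf '\r')
crlf_lines=$(grep -c "$cr" "$demo/rsID_list.txt")
echo "lines with a carriage return: $crlf_lines"

# Strip the carriage returns into a cleaned copy and pass that copy to grep -f instead.
tr -d '\r' < "$demo/rsID_list.txt" > "$demo/rsID_list.unix.txt"
```

If the count is 0 for your real list, line endings are probably not the cause and the list contents themselves are worth inspecting.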
You should ideally be using bcftools instead of grep for this:
today it has been returning all the SNPs in the larger file that are in millions and I do not want that. I just wanted the selected SNPs with their values.
It sounds to me like you may have looped through the larger file with the contents of the smaller file.
I agree with Pierre that bcftools is a good solution here, but it is also useful to understand why this may be happening using the solution you tried.
Essentially, what you need to do is load the first file into memory, into a data object like a set.
You can then loop through the larger, multicolumn file only a single time, and check each entry in each line only once.
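The load-then-scan idea can be sketched with awk, which stays close to the shell workflow from the question. This is a minimal sketch under two assumptions that are not stated in the thread: the rsID sits in the first whitespace-separated column of the larger file, and the file names and toy contents are hypothetical.

```shell
# Toy inputs; names and contents are hypothetical stand-ins for the real files.
demo=$(mktemp -d)
printf 'rs123\nrs789\n' > "$demo/rsID_list.txt"
printf 'rs123 A G 0.12\nrs456 C T 0.40\nrs789 G A 0.07\n' > "$demo/all_snps.txt"

# First file (NR == FNR): store each rsID as a key of the in-memory array "wanted".
# Second file: read each line once and print it if column 1 is a stored key.
awk 'NR == FNR { wanted[$1]; next } $1 in wanted' \
    "$demo/rsID_list.txt" "$demo/all_snps.txt" > "$demo/found_snps.txt"

cat "$demo/found_snps.txt"
```

Because the lookup is an exact comparison on one column rather than a regular expression search, a stray . in the list would only match a line whose first column is literally a dot.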
Yes, I have tried to go through the files manually and check every single rsID, but with a large file that becomes difficult (especially since I am new to this and don't have the tools to shortcut many tasks). But yes, you are absolutely right.
May I ask when you say loop, what do you mean by it?
With the same set of files? That is unlikely if nothing has changed with the files.
If not then there is something wrong with one of the files you are using. My guess would likely be the input rsID files. Did that file come from a windows machine?
You should ideally be using bcftools instead of grep for this:
Thank you for the suggestion!
May I ask, What does bcftools do, and is it solely for genetic data? Also, does it need to be setup in the terminal?
Check that there is no blank line, or a line containing just a dot (.), in your rsID_list.txt.
Furthermore, you should use grep -F -w -f to prevent the patterns from being interpreted as regular expressions.
Finally, you should use a tool like bcftools to extract those variants.
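The effect of a lone dot in the list, and of adding -F, can be seen on a toy example. The file names and contents below are hypothetical, and the behaviour described in the comments is that of GNU grep.

```shell
# Toy inputs; the second line of the list is a stray "." (names are hypothetical).
demo=$(mktemp -d)
printf 'rs123\n.\nrs789\n' > "$demo/rsID_list.txt"
printf 'rs123 A G\nrs456 C T\nrs789 G A\n' > "$demo/all_snps.txt"

# Count suspicious list entries: empty lines or lines made only of dots.
suspicious=$(grep -c '^\.*$' "$demo/rsID_list.txt")

# Without -F the "." is a regex meaning "any character"; combined with -w it
# matches any one-letter word such as an allele, so every line is selected.
without_F=$(grep -c -w -f "$demo/rsID_list.txt" "$demo/all_snps.txt")

# With -F every pattern is a literal string, so only the listed rsIDs match.
grep -F -w -f "$demo/rsID_list.txt" "$demo/all_snps.txt" > "$demo/found_snps.txt"

echo "suspicious list lines: $suspicious; lines matched without -F: $without_F"
cat "$demo/found_snps.txt"
```

Here -f (lowercase) names the file that holds the patterns, while -F (uppercase) changes how those patterns are interpreted, so the two options are used together rather than in place of each other.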
Thank you for the suggestion!
Is there a difference between the -F and -f options?
And what would a . do when running the commands?