A very rookie Question
16 months ago

Hi,

I am working with SNP files, and I am trying to search for a list of SNPs in another file from the command line.

I have been using this line:

$ grep -w -f {file with list of snps in rsid in a single column .txt} {file being searched with a larger number of SNPs .txt} > {file with the found Snps .txt}

This worked fine yesterday, but today it returns every SNP in the larger file, which has millions of entries. I do not want that; I only want the selected SNPs with their values.

What should I do in this case?

Thank you

genetics grep • 1.1k views

The thing is this worked fine yesterday, but today it has been returning all the SNPs

With the same set of files? That is unlikely if nothing has changed in the files.

Otherwise, there is probably something wrong with one of the files you are using. My guess would be the input rsID file. Did that file come from a Windows machine?
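One way to check the Windows line-ending theory is to count carriage returns in the list and, if any turn up, write a cleaned copy. This is only a sketch: the file names and the two demo rsIDs are made up for illustration.

```shell
# Demo data (hypothetical): an rsID list saved with Windows (CRLF) line endings.
workdir=$(mktemp -d)
printf 'rs123\r\nrs456\r\n' > "$workdir/rsID_list.txt"

# A literal carriage-return character, built portably.
cr=$(printf '\r')

# Count lines that contain a carriage return; 0 means plain Unix line endings.
crlf_lines=$(grep -c "$cr" "$workdir/rsID_list.txt" || true)
echo "lines with a carriage return: $crlf_lines"

# Strip the carriage returns into a new file and check again.
tr -d '\r' < "$workdir/rsID_list.txt" > "$workdir/rsID_list.unix.txt"
clean_lines=$(grep -c "$cr" "$workdir/rsID_list.unix.txt" || true)
echo "after cleaning: $clean_lines"
```

If the first count is above zero, the cleaned copy is the one to pass to grep -f.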

You should ideally be using bcftools instead of grep for this:

bcftools view -i'ID=@rsID_list.txt' in.vcf

Thank you for the suggestion!

May I ask what bcftools does, and whether it is solely for genetic data? Also, does it need to be set up in the terminal?


The thing is this worked fine yesterday, but today it has been returning all the SNPs in the larger file that are in millions and I do not want that.

Check that there is no blank line, or a line containing just a dot (.), in your rsID_list.txt.
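A quick way to run that check, sketched on toy data. The file names, the rsIDs, and the three-column layout of the larger file are all assumptions made for illustration.

```shell
workdir=$(mktemp -d)
# Hypothetical list containing a blank line and a lone dot.
printf 'rs1\n\n.\nrs3\n' > "$workdir/rsID_list.txt"
# Hypothetical larger file: rsID, allele, value (tab-separated).
printf 'rs1\tA\t0.1\nrs2\tC\t0.2\nrs3\tG\t0.3\n' > "$workdir/all_snps.txt"

# Count blank (or whitespace-only) lines in the list.
blank_lines=$(grep -c -E '^[[:space:]]*$' "$workdir/rsID_list.txt" || true)
# Count lines that are exactly a single dot.
dot_lines=$(grep -c -x -F '.' "$workdir/rsID_list.txt" || true)
echo "blank: $blank_lines, dot-only: $dot_lines"

# Keep only lines that are neither blank nor a lone dot.
grep -v -E '^[[:space:]]*\.?[[:space:]]*$' "$workdir/rsID_list.txt" \
    > "$workdir/rsID_list.clean.txt"

# Re-run the search with the cleaned list.
found=$(grep -w -f "$workdir/rsID_list.clean.txt" "$workdir/all_snps.txt")
echo "$found"
```

If either count is above zero, the cleaned copy is the safer input for the search.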

Furthermore, you should use grep -F -w -f so that the patterns are treated as fixed strings rather than regular expressions.

Finally, you should use a tool like bcftools to extract those variants.
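To see what -F changes, here is a small illustration on toy data (the file name and contents are made up). Without -F, grep reads . as the regular-expression wildcard for any single character; with -F it is just a literal dot. The lowercase -f is unrelated: it tells grep to read its patterns from a file.

```shell
workdir=$(mktemp -d)
# Hypothetical three-line file; only the last line contains a literal dot.
printf 'rs1 A\nrs2 C\nrs3 0.5\n' > "$workdir/demo.txt"

# As a regular expression, "." matches any character, so every non-empty line counts.
regex_hits=$(grep -c '.' "$workdir/demo.txt" || true)
# With -F the pattern is a fixed string, so only lines with a real dot count.
fixed_hits=$(grep -c -F '.' "$workdir/demo.txt" || true)
echo "regex: $regex_hits, fixed string: $fixed_hits"
```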


Thank you for the suggestion!

Is there a difference between the -F and -f options?

What would a . do when running the command?

16 months ago
LauferVA 4.5k

Hi doctor ahelwa,

Regarding this portion:

today it has been returning all the SNPs in the larger file that are in millions and I do not want that. I just wanted the selected SNPs with their values.

It sounds to me like you may have looped through the larger file with the contents of the smaller file.

I agree with Pierre that bcftools is a good solution here, but it is also useful to understand why this may be happening using the solution you tried.

Essentially, what you need to do is load the first file (the rsID list) into memory, in a data structure such as a set.

You can then loop through the larger, multi-column file a single time and check each line against that set only once.
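The idea above can be sketched with awk, which is available in most terminals. This assumes the rsID sits in the first column of the larger file and that the list is not empty; the file names and toy data are made up.

```shell
workdir=$(mktemp -d)
# Hypothetical inputs: a short rsID list and a larger tab-separated SNP file.
printf 'rs1\nrs3\n' > "$workdir/rsID_list.txt"
printf 'rs1\tA\t0.1\nrs2\tC\t0.2\nrs3\tG\t0.3\n' > "$workdir/all_snps.txt"

# First file (NR==FNR): store each rsID as a key in the array "ids" (the set).
# Second file: print a line only if its first column is one of the stored keys.
selected=$(awk 'NR==FNR { ids[$1]; next } $1 in ids' \
    "$workdir/rsID_list.txt" "$workdir/all_snps.txt")
echo "$selected"
```

A "loop" here simply means going through the file one line at a time, which awk does on its own. Because the comparison is an exact match on the first column, a blank line or a stray dot in the list cannot match every line.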

VAL


Hi LauferVA,

Thank you for your suggestion!

Yes, I have tried to go through the files manually and check every single rsID, but with a large file it becomes difficult (especially since I am new to this and do not yet have the tools to shortcut many tasks). But yes, you are absolutely right.

May I ask what you mean by "loop"?
