Quick Search For Overlaps With A Mask File
11.1 years ago
diviya.smith ▴ 60

Hi there,

I am interested in computing some statistics for a subset of the genome. I have compiled a list of 28M sites, and I am now checking whether each site passed the 1000 Genomes filters. To do this, I am using samtools to look up each position in the mask file (in FASTA format):

samtools faidx maskfile chr_pos

However, for 28M sites this is very slow. Does anyone have any suggestions to speed up the search? I am fairly good at Perl programming, but I don't think the standard search methods I am familiar with will significantly speed things up. Any suggestions would be very helpful.
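(For anyone hitting the same bottleneck: most of the cost is launching samtools once per site. One alternative is to read the mask FASTA into memory once and index it directly. A minimal Python sketch, assuming the mask fits in RAM and assuming 'P' marks a passing base — check the mask file's README for the actual character codes; function names are illustrative:)

```python
# Sketch: load each chromosome of the mask FASTA once, then do O(1)
# in-memory lookups instead of invoking `samtools faidx` per site.
# ASSUMPTION: 'P' denotes a passing base; verify against the mask's README.

def load_mask(path):
    """Read a FASTA mask file into a {chrom: sequence} dict."""
    masks, name, chunks = {}, None, []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip()
            if line.startswith(">"):
                if name is not None:
                    masks[name] = "".join(chunks)
                # FASTA header: keep only the first word as the chromosome name
                name, chunks = line[1:].split()[0], []
            else:
                chunks.append(line)
    if name is not None:
        masks[name] = "".join(chunks)
    return masks

def site_passes(masks, chrom, pos):
    """pos is 1-based, matching `samtools faidx maskfile chrom:pos-pos`."""
    return masks[chrom][pos - 1] == "P"
```

A whole-genome mask is on the order of 3 GB of sequence, so this trades memory for avoiding 28M process launches.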

Also, as step 2, I need to look up the results in a table, for which I am currently using Perl hashes, but again this is too slow for such a large amount of data. I would be most grateful for any suggestions to improve the speed.

-Diviya

search • 2.5k views

What are the attributes of these 28M sites? How are you matching them against the genomes? Are you trying to match based on some interval range?

11.1 years ago

You could always try ANNOVAR's filtering option, which would give you the 1000 Genomes population frequencies, if by "checking if this site passed the 1000 genomes filters" you mean that you want to know whether the 1000 Genomes Project detected those sites as variants:

annotate_variation.pl -filter -dbtype 1000g2012apr_eur -buildver hg19 YOUR_FILE humandb/

Pardon my ignorance, but can you please explain how this will help with the question posed above? This is purely for my own knowledge!


I think this is an attempt to be more specific about "this site passed the 1000 genomes filters".


Thanks Michael, you got it. I was just suggesting to filter (or highlight) 1000 Genomes variants, if the original 28M sites are to be queried for known variation.

11.1 years ago
Michael 55k

You should calculate range overlaps between your query regions and the 1000 Genomes regions; that is the most efficient way to approach this. To do so, convert both files to ranges in either GFF or BED format and use bedtools or Bioconductor/IRanges to calculate the overlaps. See What is the quickest algorithm for range overlap?

To do this, you need to address the following challenges:

  • Is the mask file the correct and optimal way of getting all the sites that "passed the 1000 genomes filters", and what does that mean exactly?
  • How is this fact encoded in the mask file, and how can it be converted to BED or GFF? It would certainly help if you showed us where you got that file.
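The core idea behind these overlap tools can be sketched as follows: sort the mask intervals once, then binary-search each query position. A minimal Python sketch, assuming half-open BED-style [start, end) intervals that have already been merged so they do not overlap one another (function names are illustrative, not any real tool's API):

```python
# Sketch of the range-overlap lookup behind tools like bedtools intersect.
# ASSUMPTION: intervals on one chromosome are non-overlapping
# (e.g. after merging), half-open [start, end) as in BED.
import bisect

def build_index(intervals):
    """intervals: list of (start, end) pairs on one chromosome, any order."""
    ivs = sorted(intervals)
    starts = [s for s, _ in ivs]
    return starts, ivs

def site_in_mask(starts, ivs, pos):
    """Return True if the 0-based position pos falls inside any interval."""
    # Find the rightmost interval whose start is <= pos, then check its end.
    i = bisect.bisect_right(starts, pos) - 1
    return i >= 0 and ivs[i][0] <= pos < ivs[i][1]
```

With 28M query sites this makes each lookup O(log n) in memory, instead of spawning a separate samtools process per site.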
11.1 years ago
diviya.smith ▴ 60

Sorry, perhaps I was not very clear. I am interested in extracting all the CG sites across the genome (~28M) and checking whether these sites passed the 1000 Genomes filters (information that is included in a mask file in FASTA format). I am currently using samtools to pull the sites out of the FASTA file, but the search is very slow. Any suggestions on how I can improve the search time?


This should be a comment, not an answer. Anyway, I would go with Michael's suggestion: get the 1000 Genomes masked regions in BED format and intersect them with your sites.


Thanks all for the suggestions. This is very helpful.
