Getting the number of SNPs in some ranges
5
0
Entering edit mode
5.6 years ago
zizigolu ★ 4.3k

Hi,

I have called copy number and I have 2 files (I have shared link of my files ); One contains some ranges

> head(cndata[,c(2,3,4)])
     Chromosome   Start      End
4470          1   51479   817980
4471          1  818499  1136753
4472          1 1138735  2558308
4473          1 2558740  5724264
4474          1 5724940  5749083
4475          1 5749226 12529544
> 

[REDACTED DROPBOX LINK]

I have another file like below

              Chromosome         Position
rs62635286       1                 51479   
rs75454623       1                 114930
rs806731         1                 30923

[REDACTED DROPBOX LINK]

How I can count the number of SNP in each range? For example based on the second table how many SNP are in range of 51479 to 817980?

LINUX CNV wgs R • 4.7k views
ADD COMMENT
0
Entering edit mode

I'm not so comfortable with R, but I would probably do this with either bedtools (if you have bed files of ranges and positions) or with bcftools (if you have a vcf file for your SNPs and a bed file or ranges).

ADD REPLY
0
Entering edit mode

I don't have bed file :(

ADD REPLY
0
Entering edit mode

There are a few filetypes in bioinformatics which are used A LOT, such as bed, vcf, sam/bam, gff, fasta/fastq. Always try to work with these formats, since there are so many options and tools which you can use to get your stuff done. Rolling your own solution in a scripting language like R/Python is definitely possible and you'll learn some coding while you're doing that, but is unlikely the fastest thing you can do to get your problem solved.

ADD REPLY
0
Entering edit mode

We need to "merge with overlap/range" then "group by count", searching these terms should help you solve the problem.

ADD REPLY
0
Entering edit mode

Hello F!

It appears that your post has been cross-posted to another site: https://bioinformatics.stackexchange.com/questions/8531/extracting-the-number-snp-in-each-range

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLY
0
Entering edit mode

Instead of tracing my activities in different forums, you guys please help me :(

ADD REPLY
3
Entering edit mode

You do not help your self. I have lost count of the number of questions you have asked in recent days/weeks which show absolutely no attempt at the task, nor willingess to learn.

We are not here to write code for you.

ADD REPLY
1
Entering edit mode

We did help you. There are solutions in this thread. What's wrong with those?

ADD REPLY
0
Entering edit mode

My brain not working nowadays :( too complicated solutions

ADD REPLY
3
Entering edit mode

We're glad to put you on the track to a solution, but we're not going to do your work. I estimate that the solution described here would take you 15 minutes of concentration, maximally. By the way, come say hi on slack if you have the time.

ADD REPLY
1
Entering edit mode

Soooo...you just expect others to do your work for you?

ADD REPLY
1
Entering edit mode
5.6 years ago
zizigolu ★ 4.3k
snp_table <- read.table('WTSI-OESO_121_1pre.copynumber.txt')
genomic_ranges <- read.table('WTSI-OESO_121_1pre.copynumber.caveman.txt.csv', sep = ',', col.names = c('Chromosome', 'Start', 'End'), stringsAsFactors = F)

how_many_in_range <- function(coords){
    coords = as.vector(coords)
    sum(snp_table$Chromosome == coords[1] & snp_table$Position > as.numeric(coords[2]) & snp_table$Position < as.numeric(coords[3]))
}

genomic_ranges$number_of_snps <- apply(genomic_ranges, 1, how_many_in_range)
ADD COMMENT
3
Entering edit mode

Thanks for posting a home-grown solution that worked for you! Do note that the other replies suggest solutions that employ widely accepted and optimized tools for exactly the kind of question you were addressing. It's usually more efficient in the long-run to familiarize oneself with standard tools of the field rather than asking other to re-invent a script, as small as it may be.

ADD REPLY
1
Entering edit mode

Try to verify the other solutions and accept ones that work. Also, it is good practice to link to the original source when you copy-paste code in open forums.

This solution is from Bioinfo SE: https://bioinformatics.stackexchange.com/a/8532/650

ADD REPLY
5
Entering edit mode
5.6 years ago

You have a .csv file with chromosome, start, end coordinates. You can change that into a .bed file pretty easily. Same with your SNP csv file.

Convert those csv files to bed and then use bedtools intersect.

ADD COMMENT
0
Entering edit mode

Just in case you're not sure this is what BED looks like. All you need to do is reorder your columns. You can do it in Excel.

ADD REPLY
0
Entering edit mode

Just to be complete, the definition of the BED file format simply ask for exactly those three columns, separated by tabs. To turn your second file into a BED file, you would obviously need to add a start column. Note that BED files' coordinates are zero-based, half-open coordinates (explained, for example, here), so it may help precision if you found out which convention was used for the files you have to make the conversion exact.

ADD REPLY
4
Entering edit mode
5.6 years ago
ATpoint 85k

Save GRanges as BED ( How To Write Data In A Granges Object To A Bed File. ) and then use BEDtools intersect. Have a look at the -c and -wa -wb options. This should not be a problem.

ADD COMMENT
3
Entering edit mode
5.6 years ago

Convert your SNPs from VCF to BED with vcf2bed, and then map them with bedmap --count to count SNPs that overlap the ad-hoc region:

$ vcf2bed < snps.vcf > snps.bed
$ echo -e '1\t51479\t817980' | bedmap --echo --count --delim '\t' - snps.bed > answer.bed

The count will be in the fourth column of answer.bed.

ADD COMMENT
0
Entering edit mode
2.2 years ago
Afrooz • 0

you can use

library(GenomicRanges)
countOverlaps(cndata, snps)
ADD COMMENT

Login before adding your answer.

Traffic: 2143 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6