I want to write a script that filters a pileup file from a site lists file.
As an input I get a reference genome, pileup and site lists files.
Example of an output for this script:
Pileup File :
NC_000001.10 11456 A 0 * *
NC_000001.10 11457 C 0 * *
NC_000001.10 11458 T 0 * *
NC_000002.11 11460 G 1 , E
NC_000002.11 13914 G 1 , E
NC_000003.11 10503990 C 1 , E
Reference File (After using grep by '>'):
>NC_000001.10 Homo sapiens chromosome 1, GRCh37.p13 Primary Assembly
>NT_113878.1 Homo sapiens chromosome 1 unlocalized genomic contig, GRCh37.p13 Primary Assembly
>NT_167207.1 Homo sapiens chromosome 1 unlocalized genomic contig, GRCh37.p13 Primary Assembly
>NC_000002.11 Homo sapiens chromosome 2, GRCh37.p13 Primary Assembly
>NC_000003.11 Homo sapiens chromosome 3, GRCh37.p13 Primary Assembly
Lists File:
chr1 11456 C1orf186 - intronic intronic no no N N N
chr1 11457 C1orf186 - intronic intronic no no N N N
chr2 13914 intergenic - intergenic intergenic no no N N N
chr7 30504355 NOD1 - intronic intronic no no N N N
chr3 10503990 SSX2IP - Syn Gln->Gln no no N N N
chr1 13131320 MEF2A + intronic intronic no no N N N
Output File:
NC_000001.10 11456 A 0 * *
NC_000001.10 11457 C 0 * *
NC_000002.11 13914 G 1 , E
NC_000003.11 10503990 C 1 , E
So for every refseq in the pileup file I would convert it to the matching chromose<id> in which is found in the reference file, and then overlap it with my lists file so I would get the lines with the same chromose<id> and position.
The problem is that while there are many refseq coordinates that belong to chr1 for example, alignment to each of them will give a position, with respect to the specific coordinate. But on the list there is a full, single, chr1.
I need to be able to compare between:
* a pileup file that was created based on a refseq reference genome
* a file with editing sites that was downloaded from RADAR (we may give the file)
Maybe there is a more appropriate reference genome file or that there is a way to convert the refseq positions according to the list
convert your pileup to bed using awk and then use bedtools intersect
Could you please elaborate on how exactly I would need to use bedtools intersect?