Tab Delimited Files Comparison
1
0
Entering edit mode
11.7 years ago
Raghav ▴ 100

I have two files, file1 contain, information regarding gene start positions and gene end positions with their chromosome locations and file 2 contain variant information in particular chromosome location it may be in CDS/5'UTRs/3'UTRs or some where else, I want to compare file2 (variant information) to file1 (complete gen information) and want to filter only those gene from file1 that have variation according to file2. for example

file2

Don-0    1    288    A    G    3    1    1    1   
Don-0    1    291    T    A    5    2    1    1
Don-0    1    303    T    C    18    3    1    1
Don-0    1    310    C    G    14    2    1    1
Don-0    1    317    C    T    23    4    1    1
Don-0    1    331    A    T    32    6    1    1
Don-0    1    344    A    C    32    9    1    1
Don-0    1    352    G    C    32    7    1    1
Don-0    1    365    C    G    32    6    1    1
Don-0    1    **4000**    A    G    25    6    1    1
`

file1

chromosome    1    30427671    .    .    .    ID=Chr1;Name=Chr1    
**gene            3631    5899    .**    +    .    ID=AT1G01010;Note=protein_coding_gene;Name=AT1G01010    
mRNA    3631    5899    .    +    .    ID=AT1G01010.1;Parent=AT1G01010;Name=AT1G01010.1;Index=1    
protein    3760    5630    .    +    .    ID=AT1G01010.1-Protein;Name=AT1G01010.1;Derives_from=AT1G01010.1    
exon    3631    3913    .    +    .    Parent=AT1G01010.1    
five_prime_UTR    3631    3759    .    +    .    Parent=AT1G01010.1    
CDS    3760    3913    .    +    0    Parent=AT1G01010.1,AT1G01010.1-Protein;    
exon    3996    4276    .    +    .    Parent=AT1G01010.1    
CDS    3996    4276    .    +    2    Parent=AT1G01010.1,AT1G01010.1-Protein;

We want to filter only gene which contains variant position from file2 like

 **gene            3631    5899    .** **4000**    **A    G**

Need your valuable comments and suggestions

bash • 3.0k views
ADD COMMENT
1
Entering edit mode

I'm sure you can find many answers on this site for your question. Search for "bedtools" please.

ADD REPLY
2
Entering edit mode
11.7 years ago

Can you get your file1 data into UCSC BED format? The file1 data looks almost like GFF.

If you can work this into GFF, it can be converted into sorted BED with BEDOPS gff2bed and BEDOPS sort-bed (and optionally filtered for genes with grep):

$ gff2bed < file1 \
    | grep gene - \
    > file1.genes.bed

Once in BED format, it can be mapped against a modified, sorted file2 with BEDOPS bedmap and BEDOPS sort-bed (assuming Don-0 is the same as Chr1).

First, modify file2 to make sure it uses the same chromosome ID as in file1 and is in correct sort order:

$ awk '{res="Chr1"; for(i=2;i<=NF;i++){res=res"\t"$i}; print res}' file2 \
    | sort-bed - \
    > file2.mod.bed

The bedmap call reports elements of file2 that overlap each file1 element:

$ bedmap --echo --echo-map --delim '\t' file1.genes.bed file2.mod.bed \
    > answer.bed

In answer.bed, you will get each file1 element, along with any qualifying file2 elements which overlap the file1 element by one or more bases. For example (taking liberties with your data as shown, in order to demonstrate a possible result):

...
Chr1    3631    5899    {rest of file1 element...}    Chr1    1    4000    A    G    {rest of file2 element...}
...

It looks like you might first need to work your data into a standard format that can be easily used with tools of this sort.

ADD COMMENT

Login before adding your answer.

Traffic: 1791 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6