Question

How to update the fileds in bedfiles (or dataframes) based on overlaps from between intervals of two bedfiles (dataframes)?

1

Entering edit mode

7.7 years ago

kirannbishwa01 ★ 1.6k

In the following data:

data01 =

contig  start    end    haplotype_block 
2   5207    5867    1856
2   155667    155670    2816
2   67910    68022  2
2   68464    68483  3
2   118146    118237    132
2   118938    119559    1157

data02 =

contig    start   last    feature gene_id gene_name   transcript_id
2   5262    5496    exon    scaffold_200003.1   CP5 scaffold_200003.1
2   5579    5750    exon    scaffold_200003.1   CP5 scaffold_200003.1
2   5856    6032    exon    scaffold_200003.1   CP5 scaffold_200003.1
2   6115    6198    exon    scaffold_200003.1   CP5 scaffold_200003.1
2   916 1201    exon    scaffold_200001.1   NA  scaffold_200001.1
2   614 789 exon    scaffold_200001.1   NA  scaffold_200001.1
2   171 435 exon    scaffold_200001.1   NA  scaffold_200001.1
2   2677    2806    exon    scaffold_200002.1   NA  scaffold_200002.1
2   2899    3125    exon    scaffold_200002.1   NA  scaffold_200002.1

Problem:

I want to compare the ranges (start - end) from these two data frames.
If the ranges overlap I want to transfer the gene_id and gene_name values from data02 to to a new column in the data01.

I tried (using pandas):

data01['gene_id'] = ""
data01['gene_name'] = ""

data01['gene_id'] = data01['gene_id'].\
apply(lambda x: data02['gene_id']\
        if range(data01['start'], data01['end'])\
           <= range(data02['start'], data02['last']) else 'NA')

How can I improve this code? I am currently sticking to pandas, but if the problem is better addressed using dictionary I am open to it. But, please explain the process, I am open to learning rather than just getting an answer.

I also tried using bedtools intersect, rather than merge:

bedtools intersect -a data01.txt -b data02.txt -sorted > merged_d1and2.txt

The result merged_d1and2.txt is the interval values from data01 broken (or updated) into more interval based on overlap from data02. But, I simply want to create new field (gene_id and gene_name) in data01 and update based on overlaps from data02.

Can some one suggest me an improved pythonic or bedtools approach?

thanks.

Thanks,

bedtools merge python gene_features update • 2.5k views

ADD COMMENT • link 7.7 years ago by kirannbishwa01 ★ 1.6k

score 1 · Answer 1 · 2017-04-18

1

Entering edit mode

7.7 years ago

Alex Reynolds 36k

You can use awk and bedmap to solve this:

$ sort-bed data02.bed | awk 'BEGIN{OFS="\t";}{print $1,$2,$3,$5"_"$6,$4,$7;}' - | bedmap --echo --echo-map-id --delim '\t' <(sort-bed data01.bed) - > answer.bed

ADD COMMENT • link 7.7 years ago by Alex Reynolds 36k

0

Entering edit mode

Hi Alex, thanks for the answer. Can you add a little bit of explanation, so I can follow and make if any other changes are desired. Thanks

ADD REPLY • link 7.7 years ago by kirannbishwa01 ★ 1.6k

0

Entering edit mode

I am getting the following error message: Genomic end coordinate is less than (or equal to) start coordinate.

ADD REPLY • link 7.7 years ago by kirannbishwa01 ★ 1.6k

0

Entering edit mode

You have one or more lines in a BED file, where the end coordinate is less than, or equal to, the start coordinate.

If you're following the UCSC convention of a half-open, 0-indexed BED format, this means this is a not valid BED file.

Use awk to find out where you have such a line and clean it up, either by adjusting the start coordinate or by removing the line, e.g.:

$ awk '{if ($2 == $3) { $2--; } print $0; }' bad.bed | sort-bed - > good.bed

Or to filter out bad BED elements:

$ awk '{if ($2 < $3) { print $0; } }' bad.bed | sort-bed - > good.bed

ADD REPLY • link 7.7 years ago by Alex Reynolds 36k

0

Entering edit mode

But, isn't the UCSC method of 0-indexed BED format the appropriate one.

ADD REPLY • link 7.7 years ago by kirannbishwa01 ★ 1.6k

0

Entering edit mode

That's correct: this error message means you do not have a BED file with valid coordinates. Your BED file is not following convention.

ADD REPLY • link 7.7 years ago by Alex Reynolds 36k

0

Entering edit mode

I don't understand what you mean my my bed file are not following convension. I am going to check how start vs. end compare for each line in my bed files first.

ADD REPLY • link 7.7 years ago by kirannbishwa01 ★ 1.6k

0

Entering edit mode

I was able to run the script after removing the lines that had same start and end position. But, still output has some problems. Btw, why are you using 'sort-bed' rather than bedtools. Thanks.

ADD REPLY • link 7.7 years ago by kirannbishwa01 ★ 1.6k

0

Entering edit mode

Hi, Alex

Why is there $5"_"$6 in the code? and are the values $4,$7 after that redundant. I would like some explanation, since the output is not exactly the same as I expected.

Thanks,

ADD REPLY • link 7.7 years ago by kirannbishwa01 ★ 1.6k

0

Entering edit mode

You want those two fields in the output. That merges them and puts them into the ID column, where they can be easily mapped with bedmap. The other two fields are not used, so you can leave the command as it is or remove them — up to you.

ADD REPLY • link 7.7 years ago by Alex Reynolds 36k

0

Entering edit mode

For some reason the gene_id and gene_name are connected by an underscore in the output files. like scaffold_200003.1_CP5. And in some places gene_id are connected with gene_id.

ADD REPLY • link 7.7 years ago by kirannbishwa01 ★ 1.6k

0

Entering edit mode

See: $5"_"$6. It might help to learn a little basic awk or explore what each step of the commands is doing.

ADD REPLY • link 7.7 years ago by Alex Reynolds 36k

0

Entering edit mode

I know little bit of awk and reading it. But, why is there the underscore in the output. I think its not fruitful to me to read the whole documentation. But, thaanks anyway.

ADD REPLY • link 7.7 years ago by kirannbishwa01 ★ 1.6k

0

Entering edit mode

We want to transfer only the 5th and 6th field from data02 to data01. But, why print all the fields?? I read the awk documentation but can't find clue to 1) why we need to do {print $1,$2,$3,$5"_"$6,$4,$7} and 2) the rationale behind $5"_"$6. There might be something with bedmap.

ADD REPLY • link 7.7 years ago by kirannbishwa01 ★ 1.6k

0

Entering edit mode

You don't have to print all the fields. But if you want to map the ID field, you need an ID field containing the 5th and 6th fields, which is why we use $5"_"$6. You can also map the entire element and not concatenate the 5th and 6th fields, but then you get the interval and everything else. The goal here to minimize the amount of work you have to do to clean up the output. Hope this helps!

ADD REPLY • link 7.7 years ago by Alex Reynolds 36k

0

Entering edit mode

If that's the point then why in the ouput geneid and geneName are merged together. The output is coming as values of gene_id_gene_name

ADD REPLY • link 7.7 years ago by kirannbishwa01 ★ 1.6k

0

Entering edit mode

Because they need to be in the ID field in order to be mapped as an ID? You can replace the underscore with a space character, if you want.

ADD REPLY • link 7.7 years ago by Alex Reynolds 36k

score 0 · Answer 2 · 2017-04-18

0

Entering edit mode

7.7 years ago

Alex Reynolds 36k

Use bedmap --echo --echo-map-id --delim '\t' and pipe to an awk statement if you need the ID in a different column. The docs do a good job of explaining things.

ADD COMMENT • link 7.7 years ago by Alex Reynolds 36k