In the following data:
data01 =
contig start end haplotype_block
2 5207 5867 1856
2 155667 155670 2816
2 67910 68022 2
2 68464 68483 3
2 118146 118237 132
2 118938 119559 1157
data02 =
contig start last feature gene_id gene_name transcript_id
2 5262 5496 exon scaffold_200003.1 CP5 scaffold_200003.1
2 5579 5750 exon scaffold_200003.1 CP5 scaffold_200003.1
2 5856 6032 exon scaffold_200003.1 CP5 scaffold_200003.1
2 6115 6198 exon scaffold_200003.1 CP5 scaffold_200003.1
2 916 1201 exon scaffold_200001.1 NA scaffold_200001.1
2 614 789 exon scaffold_200001.1 NA scaffold_200001.1
2 171 435 exon scaffold_200001.1 NA scaffold_200001.1
2 2677 2806 exon scaffold_200002.1 NA scaffold_200002.1
2 2899 3125 exon scaffold_200002.1 NA scaffold_200002.1
Problem:
- I want to compare the ranges (start - end) from these two data frames.
- If the ranges overlap I want to transfer the
gene_id
andgene_name
values from data02 to to a new column in the data01.
I tried (using pandas):
data01['gene_id'] = ""
data01['gene_name'] = ""
data01['gene_id'] = data01['gene_id'].\
apply(lambda x: data02['gene_id']\
if range(data01['start'], data01['end'])\
<= range(data02['start'], data02['last']) else 'NA')
How can I improve this code? I am currently sticking to pandas, but if the problem is better addressed using dictionary I am open to it. But, please explain the process, I am open to learning rather than just getting an answer.
I also tried using bedtools
intersect, rather than merge:
bedtools intersect -a data01.txt -b data02.txt -sorted > merged_d1and2.txt
The result merged_d1and2.txt is the interval values from data01 broken (or updated) into more interval based on overlap from data02. But, I simply want to create new field (gene_id and gene_name) in data01 and update based on overlaps from data02.
Can some one suggest me an improved pythonic or bedtools approach?
thanks.
Thanks,
Hi Alex, thanks for the answer. Can you add a little bit of explanation, so I can follow and make if any other changes are desired. Thanks
I am getting the following error message:
Genomic end coordinate is less than (or equal to) start coordinate.
You have one or more lines in a BED file, where the end coordinate is less than, or equal to, the start coordinate.
If you're following the UCSC convention of a half-open, 0-indexed BED format, this means this is a not valid BED file.
Use
awk
to find out where you have such a line and clean it up, either by adjusting the start coordinate or by removing the line, e.g.:Or to filter out bad BED elements:
But, isn't the UCSC method of 0-indexed BED format the appropriate one.
That's correct: this error message means you do not have a BED file with valid coordinates. Your BED file is not following convention.
I don't understand what you mean my my bed file are not following convension. I am going to check how start vs. end compare for each line in my bed files first.
I was able to run the script after removing the lines that had same start and end position. But, still output has some problems. Btw, why are you using 'sort-bed' rather than
bedtools
. Thanks.Hi, Alex
Why is there
$5"_"$6
in the code? and are the values$4,$7
after that redundant. I would like some explanation, since the output is not exactly the same as I expected.Thanks,
You want those two fields in the output. That merges them and puts them into the ID column, where they can be easily mapped with
bedmap
. The other two fields are not used, so you can leave the command as it is or remove them — up to you.For some reason the
gene_id
andgene_name
are connected by an underscore in the output files. likescaffold_200003.1_CP5
. And in some placesgene_id
are connected withgene_id
.See:
$5"_"$6
. It might help to learn a little basicawk
or explore what each step of the commands is doing.I know little bit of awk and reading it. But, why is there the underscore in the output. I think its not fruitful to me to read the whole documentation. But, thaanks anyway.
We want to transfer only the 5th and 6th field from data02 to data01. But, why print all the fields?? I read the awk documentation but can't find clue to 1) why we need to do
{print $1,$2,$3,$5"_"$6,$4,$7}
and 2) the rationale behind$5"_"$6
. There might be something withbedmap
.You don't have to print all the fields. But if you want to map the ID field, you need an ID field containing the 5th and 6th fields, which is why we use
$5"_"$6
. You can also map the entire element and not concatenate the 5th and 6th fields, but then you get the interval and everything else. The goal here to minimize the amount of work you have to do to clean up the output. Hope this helps!If that's the point then why in the ouput
geneid and geneName
are merged together. The output is coming as values ofgene_id_gene_name
Because they need to be in the ID field in order to be mapped as an ID? You can replace the underscore with a space character, if you want.