I have data that looks like this:
Chr Start End Strand log.fc PeakID Annotation Detailed Annotation Distance to TSS Gene Name Gene Description Gene Type
chr3R 28540016 28540264 - 1.646677294 S2_cell_5ug_1 5 UTR (NM_176575, exon 1 of 18) 5 UTR (NM_176575, exon 1 of 18) 184 Ppn Papilin protein-coding
chrX 3692952 3693300 + 1.641962533 S2_cell_5ug_2 5 UTR (NM_166984, exon 2 of 8) 5 UTR (NM_166984, exon 2 of 8) 3763 Mnt Mnt protein-coding
chr2L 9177079 9177677 + 1.607698006 S2_cell_5ug_3 5 UTR (NM_001201819, exon 2 of 14) 5 UTR (NM_001201819, exon 2 of 14) 1020 tai taiman protein-coding
chr2R 23870524 23870872 + 1.5819139 S2_cell_5ug_4 5 UTR (NM_079109, exon 1 of 3) 5 UTR (NM_079109, exon 1 of 3) 425 ken ken and barbie protein-coding
chr3L 539884 540582 - 1.569726505 S2_cell_5ug_5 5 UTR (NM_138193) 5 UTR (NM_138193) 381 klar klarsicht protein-coding
So it is tab-separated, but some columns contain spaces.
I would like to be able to compare these ranges to other files of the same type and structure, but without being limited by the number of files. These are the steps I would like to take:
- Make an index column from the first file that contains all the genomic ranges
- Check against another file if the genomic ranges overlap
- If they do, append the data
- If they don't, add the range to the 'index' column
- Leave empty columns for the first set of data and append the second set.
So if I had a second file like so:
Chr Start End Strand log.fc PeakID Annotation Detailed Annotation Distance to TSS Gene Name Gene Description Gene Type
chr3R 28540016 28540314 - 0.171281417 m6ace_S2_peaks_2039 5 UTR (NM_176575, exon 1 of 18) 5 UTR (NM_176575, exon 1 of 18) 43872 Ppn Papilin protein-coding
chr2L 9177129 9177677 + 0.399838989 m6ace_S2_peaks_53 5 UTR (NM_001201819, exon 2 of 14) 5 UTR (NM_001201819, exon 2 of 14) 34242 tai taiman protein-coding
chr2R 23870474 23870922 + 0.238601528 m6ace_S2_peaks_875 5 UTR (NM_079109, exon 1 of 3) 5 UTR (NM_079109, exon 1 of 3) 37785 ken ken and barbie protein-coding
chr3L 440634 441032 - 0.256231658 m6ace_S2_peaks_678 5 UTR (NM_001103992, exon 1 of 4) 5 UTR (NM_001103992, exon 1 of 4) 38067 klar klarsicht protein-coding
chr3L 391975 392173 - 0.110795274 m6ace_S2_peaks_3280 3UTR (NR_124740) 3UTR (NR_124740) 38065 trh trachealess protein-coding
Here is an example of how it would look:
Index_Chr Index_Start Index_End Chr Start End Strand log.fc PeakID Annotation Detailed Annotation Distance to TSS Gene Name Gene Description Gene Type Chr Start End Strand log.fc PeakID Annotation Detailed Annotation Distance to TSS Gene Name Gene Description Gene Type
chr3R 28540016 28540264 chr3R 28540016 28540264 - 1.646677294 S2_cell_5ug_1 5 UTR (NM_176575, exon 1 of 18) 5 UTR (NM_176575, exon 1 of 18) 184 Ppn Papilin protein-coding chr3R 28540016 28540314 - 0.171281417 m6ace_S2_peaks_2039 5 UTR (NM_176575, exon 1 of 18) 5 UTR (NM_176575, exon 1 of 18) 43872 Ppn Papilin protein-coding
chrX 3692952 3693300 chrX 3692952 3693300 + 1.641962533 S2_cell_5ug_2 5 UTR (NM_166984, exon 2 of 8) 5 UTR (NM_166984, exon 2 of 8) 3763 Mnt Mnt protein-coding
chr2L 9177079 9177677 chr2L 9177079 9177677 + 1.607698006 S2_cell_5ug_3 5 UTR (NM_001201819, exon 2 of 14) 5 UTR (NM_001201819, exon 2 of 14) 1020 tai taiman protein-coding chr2L 9177129 9177677 + 0.399838989 m6ace_S2_peaks_53 5 UTR (NM_001201819, exon 2 of 14) 5 UTR (NM_001201819, exon 2 of 14) 34242 tai taiman protein-coding
chr2R 23870524 23870872 chr2R 23870524 23870872 + 1.5819139 S2_cell_5ug_4 5 UTR (NM_079109, exon 1 of 3) 5 UTR (NM_079109, exon 1 of 3) 425 ken ken and barbie protein-coding chr2R 23870474 23870922 + 0.238601528 m6ace_S2_peaks_875 5 UTR (NM_079109, exon 1 of 3) 5 UTR (NM_079109, exon 1 of 3) 37785 ken ken and barbie protein-coding
chr3L 539884 540582 chr3L 539884 540582 - 1.569726505 S2_cell_5ug_5 5 UTR (NM_138193) 5 UTR (NM_138193) 381 klar klarsicht protein-coding chr3L 440634 441032 - 0.256231658 m6ace_S2_peaks_678 5 UTR (NM_001103992, exon 1 of 4) 5 UTR (NM_001103992, exon 1 of 4) 38067 klar klarsicht protein-coding
chr3L 391975 392173 chr3L 391975 392173 - 0.110795274 m6ace_S2_peaks_3280 3UTR (NR_124740) 3UTR (NR_124740) 38065 trh trachealess protein-coding
Again, the most important thing would be being able to do this again and again with a third, fourth, fifth file, and so on.
I have been trying to figure this out with GenomicRanges in R, but honestly I'm in over my head. If anyone can help, it would be much appreciated.
These don't actually do what I want: the first is similar to GRanges reduce (I think) but does not output overlaps in the different datasets, and the second does not indicate when ranges are overlapping; it just combines all the data.
It might help to go with simpler examples and to look at the options to bedops and bedmap, which can help with calculating or aggregating specific values from typical BED columns.
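As a simpler starting point, the steps listed in the question (build an index of ranges from the first file, then for each further file either attach a row to an overlapping index range or add its range as a new index entry) could be sketched in plain Python. This is only a rough sketch under some assumptions: the function names `overlaps`, `read_peaks` and `merge_peak_sets` are made up for illustration, "overlap" is taken to mean sharing at least one base on the same chromosome with BED-style half-open coordinates, strand is ignored, and a row is attached to the first index range it overlaps. The example data is cut down to four columns from the question's files.

```python
import csv
import io


def overlaps(a, b):
    """True if two (chrom, start, end) ranges share at least one base.

    Assumption: half-open, BED-style coordinates; strand is ignored.
    """
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def read_peaks(handle):
    """Read a tab-separated peak table with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def merge_peak_sets(peak_sets):
    """Build an index of ranges and align each file's rows against it.

    Returns a list of [index_range, slots], where slots[i] holds the rows
    from file i that overlap index_range (an empty list means no overlap).
    """
    merged = []
    for file_no, rows in enumerate(peak_sets):
        for row in rows:
            rng = (row["Chr"], int(row["Start"]), int(row["End"]))
            for index_range, slots in merged:
                if overlaps(index_range, rng):
                    # Attach the row to the first overlapping index range.
                    slots[file_no].append(row)
                    break
            else:
                # No overlap with any existing index range: add a new one.
                slots = [[] for _ in peak_sets]
                slots[file_no].append(row)
                merged.append([rng, slots])
    return merged


# Cut-down versions of the two example files (hypothetical subset of columns).
file1 = io.StringIO(
    "Chr\tStart\tEnd\tPeakID\n"
    "chr3R\t28540016\t28540264\tS2_cell_5ug_1\n"
    "chrX\t3692952\t3693300\tS2_cell_5ug_2\n"
    "chr3L\t539884\t540582\tS2_cell_5ug_5\n"
)
file2 = io.StringIO(
    "Chr\tStart\tEnd\tPeakID\n"
    "chr3R\t28540016\t28540314\tm6ace_S2_peaks_2039\n"
    "chr3L\t440634\t441032\tm6ace_S2_peaks_678\n"
)

merged = merge_peak_sets([read_peaks(file1), read_peaks(file2)])
for index_range, slots in merged:
    # One column of peak IDs per input file; "." marks an empty slot.
    ids = [",".join(r["PeakID"] for r in s) or "." for s in slots]
    print(*index_range, *ids, sep="\t")
```

Because the function takes a list of peak sets, a third or fourth file is just another element in that list. Note that with a strict overlap test like this one, the two chr3L klar peaks (539884-540582 and 440634-441032) end up on separate rows, since their coordinates do not overlap even though they are annotated to the same gene. The linear scan over the index is fine for a few thousand peaks; for larger inputs, sorting by chromosome and start (or using bedops/bedmap or GenomicRanges) would be the more usual route.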