Question

Intersecting ChIP/RNA-seq data

0

Entering edit mode

9.2 years ago

Constantine ▴ 290

Hello I'm trying to intersect 1 RNA-seq with 1 ChIP-seq file.

My ChIP-seq file looks like this:

chrX    118904350       118904500
chrX    169994450       169994600
chrX    170004550       170004700
chrX    170005250       170005400
chr13   44869500        44869750
chr13   112578250       112578400
chr12   41777950        41778100

whereas my RNA-seq file looks like this:

Chrom                                         Start
chr1                         3214482;3421702;3670552
chr1                                 3647309;3658847
chr1                                         3680155
chr1                        4290846;4343507;4351910;4352202;4360200;4409170
chr1                        4490928;4493100;4493772;4495136;4496291
chr1                        4773198;4777525;4782568;4783951;4785573
                                                  End Strand Length  Symbol
                              3216968;3421901;3671498      -   3634    Xkr4
                                      3650509;3658904      -   3259 Gm19938
                                              3681788      +   1634 Gm10568
      4293012;4350091;4352081;4352837;4360314;4409241      -   9747     Rp1
              4492668;4493466;4493863;4495942;4496413      -   3130   Sox17
              4776801;4777648;4782733;4784105;4785726      -   4203  Mrpl15

Bedtools intersect doesn't seem to be working and I get an error message regarding the RNA-seq file. I think it happens because more than 1 genomic region for Start and End is assigned for each gene. Any idea how to collapse the ";" and end up with something equivalent to my ChIP-seq file?

Thanks

RNA-Seq ChIP-Seq • 3.1k views

ADD COMMENT • link updated 2.2 years ago by Ram 44k • written 9.2 years ago by Constantine ▴ 290

2

Entering edit mode

Your RNAseq file is a hacked version of a bed12 file, so just download the real bed12 file and use that.

ADD REPLY • link 9.2 years ago by Devon Ryan 104k

0

Entering edit mode

What do you mean by hacked? Plus I didn't download the file from anywhere. I generated everything - This is my DE file

ADD REPLY • link 9.2 years ago by Constantine ▴ 290

0

Entering edit mode

"Hacked" means "modified" in this context. If you generated everything then just generate it again in a more useful format. You can certainly convert it to something more useful with awk (as nicely demonstrated by Joseph Pearson ), but in the long term it's simpler to stick to more standardized formats.

ADD REPLY • link 9.2 years ago by Devon Ryan 104k

0

Entering edit mode

Link doesn't work

Not sure how to reformat the file. I add the whole annotation file right in R right after I call featureCounts and this is the final output

GeneID                           Chr
37935    497097                chr1;chr1;chr1
54747 100503874                     chr1;chr1
46311 100038431                          chr1
4885      19888 chr1;chr1;chr1;chr1;chr1;chr1
5313      20671      chr1;chr1;chr1;chr1;chr1
7047      27395      chr1;chr1;chr1;chr1;chr1
                                                Start
37935                         3214482;3421702;3670552
54747                                 3647309;3658847
46311                                         3680155
4885  4290846;4343507;4351910;4352202;4360200;4409170
5313          4490928;4493100;4493772;4495136;4496291
7047          4773198;4777525;4782568;4783951;4785573
                                                  End      Strand Length
37935                         3216968;3421901;3671498       -;-;-   3634
54747                                 3650509;3658904         -;-   3259
46311                                         3681788           +   1634
4885  4293012;4350091;4352081;4352837;4360314;4409241 -;-;-;-;-;-   9747
5313          4492668;4493466;4493863;4495942;4496413   -;-;-;-;-   3130
7047          4776801;4777648;4782733;4784105;4785726   -;-;-;-;-   4203
       Symbol   type_of_gene
37935    Xkr4 protein-coding
54747 Gm19938          ncRNA
46311 Gm10568        unknown
4885      Rp1 protein-coding
5313    Sox17 protein-coding
7047   Mrpl15 protein-coding

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 9.2 years ago by Constantine ▴ 290

1

Entering edit mode

Fixed the link, that was just a biostars bug. It would have been helpful had you said you were using featureCounts. Either write a script to parse this into a modified BED format or just intersect the ChIP results with a BED file containing gene coordinates, cut out the gene name, and grep the appropriate lines from the output you posted.

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 9.2 years ago by Devon Ryan 104k

0

Entering edit mode

This is what I use for the Chrom column

Ann$Chr <- unlist(lapply(strsplit(Ann$Chr, ";"), function(x) paste(unique(x), collapse = "|")))

but I have no clue how to reformat the Start/End

ADD REPLY • link 9.2 years ago by Constantine ▴ 290

Ram · Accepted Answer · 2015-10-05

5

Entering edit mode

9.2 years ago

Joseph Pearson ▴ 480

Try this awk conversion:

awk 'NR>1 {chrom=$1;arraySize=split($2,starts,";");split($3,ends,";");symbol=$6;for (i=1;i<=arraySize;i++){print chrom,starts[i],ends[i],symbol"_"i,"\n"}}' yourRNAseqFile.txt

edit: Sorry, to clarify, this will convert your "DE" file to a BED4 format.

ADD COMMENT • link 9.2 years ago by Joseph Pearson ▴ 480

0

Entering edit mode

Assuming that your actual RNA-seq file is column-delimited, and that column 1 is chromosome, column 2 is starts, column 3 is ends, column 4 is strand, column 5 is length, and column 6 is symbol, the one-liner will spit out paired start-end coordinates (one per line) for each gene.

The script skips over the first line (I'm assuming you will have the column labels there), and it uniquely labels the identifier (column 4 of the output) by appending the position in the semicolon-delimited list for starts and ends.

ADD REPLY • link 9.2 years ago by Joseph Pearson ▴ 480

0

Entering edit mode

Thanks for the fast reply. I'm afraid the code doesn't work :/

My file is tab-delimited and has a header

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 9.2 years ago by Constantine ▴ 290

0

Entering edit mode

Sorry, my response assumed that the example file that you posted had the chromosome, start(s), end(s) and other info on a single line; it looks like the output is actually spread over multiple lines? As Devon Ryan pointed out, it will be easiest to get your output file from featureCounts into a format like BED, or just get a list of your genes of interest, with BED-type (or gff-type) annotations.

ADD REPLY • link 9.2 years ago by Joseph Pearson ▴ 480

0

Entering edit mode

If you can get your output into the following format:

Chrom    Start    End    Strand    Length    Symbol
chr1    3214482;3421702;3670552    3216968;3421901;3671498    -    3634    Xkr4
chr1    3647309;3658847    3650509;3658904    -    3259    Gm19938
chr1    3680155    3681788    +    1634    Gm10568
chr1    4290846;4343507;4351910;4352202;4360200;4409170    4293012;4350091;4352081;4352837;4360314;4409241    -    9747    Rp1
chr1    4490928;4493100;4493772;4495136;4496291    4492668;4493466;4493863;4495942;4496413    -    3130    Sox17
chr1    4773198;4777525;4782568;4783951;4785573    4776801;4777648;4782733;4784105;4785726    -    4203    Mrpl15

Then my awk one-liner should work. If you are working with featureCounts in R, there should be a way to write.table the output, after some toying with rbind (cbind?). Again, Devon's suggestion of just getting annotations from a bed file (or gff file) for the genes of interest (as identified by featureCounts) is probably the headache-free way to go.

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 9.2 years ago by Joseph Pearson ▴ 480

0

Entering edit mode

Actually it worked....I was using a comma delimited file and didn't notice.

Thank you very much

ADD REPLY • link 9.2 years ago by Constantine ▴ 290

0

Entering edit mode

Fantastic, glad I could help.

ADD REPLY • link 9.2 years ago by Joseph Pearson ▴ 480