I called peaks using MACS2 and now would like to use my narrowPeak file to do featureCounts. As far as I can tell, featureCOunts only uses SAF or GFF formats. Does anyone know an easy way to convert between a bed and these file types for Dm3?
I called peaks using MACS2 and now would like to use my narrowPeak file to do featureCounts. As far as I can tell, featureCOunts only uses SAF or GFF formats. Does anyone know an easy way to convert between a bed and these file types for Dm3?
awk 'OFS="\t" {print $1"."$2"."$3, $1, $2, $3, "."}' in.narrowPeak > out.saf
The first column is the identifier. This is then either any gene name or in case of genomic regions simply the concatenated genomic coordinates, e.g. separated by dot, such as chr1.100.1000000
. For genomic regions I always set strand to .
, for genes one can set the strand where the gene is located.
I recently developed bed2gff to quickly convert .bed files to a gff3 format, a tool written in Rust. Could be of help here!
Option 1 :GenomeTools
Option 2: Go to galaxy oqtans page here and on the left pane there is "GFF Toolkit" which has BED_to_GFF3 converter .
Option 3: A long route. From here make use of bedToGenePred followed by genePredToGtf to get a gtf file at first. Then you can make use of the lot of tools that does a gtf to gff3 conversion.
I am mentioning multiple options because depending on whether you bed file is 6 column or 12 column or so, some may not be applicable.
I believe that SAF format use 1-based coordinates that are closed on both ends. Here is how I got this conclusion.
First make some toy data.
$ cat genome.fa
>chr1
AATTCCGGAAAATTTTCCCCGGGGAAAAAAAAAAAAAAAAAACCCCCCCCCCCCCCCCCCCCCCCCCCCC
$ cat reads.fa
>q1
AAAATTTTCCCCGGGGAAAAAAAAAAAAAAAAAACC
Map reads to the genome:
$ STAR --runMode genomeGenerate --genomeDir test_star --genomeFastaFiles genome.fa --genomeSAindexNbases 2
$ STAR --genomeDir test_star --readFilesIn reads.fa --outFileNamePrefix reads_
$ cat reads_Aligned.out.sam
@HD VN:1.4
@SQ SN:chr1 LN:70
@PG ID:STAR PN:STAR VN:2.7.10a CL:STAR --genomeDir test_star --readFilesIn reads.fa --outFileNamePrefix reads_
@CO user command line: STAR --genomeDir test_star --readFilesIn reads.fa --outFileNamePrefix reads_
q1 0 chr1 9 255 36M * 0 0 AAAATTTTCCCCGGGGAAAAAAAAAAAAAAAAAACC * NH:i:1 HI:i:1 AS:i:35 nM:i:0
The read was mapped to 9-44 nucleotides of chr1:
$ bedtools bamtobed -i reads_Aligned.out.sam
chr1 8 44 q1 255 +
Now we test whether several SAF intervals have overlapping reads.
$ featureCounts -s1 -a f1.saf -F SAF -o cnts1.txt reads_Aligned.out.sam
$ featureCounts -s1 -a f2.saf -F SAF -o cnts2.txt reads_Aligned.out.sam
$ featureCounts -s1 -a f3.saf -F SAF -o cnts3.txt reads_Aligned.out.sam
$ featureCounts -s1 -a f4.saf -F SAF -o cnts4.txt reads_Aligned.out.sam
$ cat f1.saf cnts1.txt
r1 chr1 5 8 +
# Program:featureCounts v2.0.1; Command:"featureCounts" "-s1" "-a" "f1.saf" "-F" "SAF" "-o" "cnts1.txt" "reads_Aligned.out.sam"
Geneid Chr Start End Strand Length reads_Aligned.out.sam
r1 chr1 5 8 + 4 0
$ cat f2.saf cnts2.txt
r1 chr1 5 9 +
# Program:featureCounts v2.0.1; Command:"featureCounts" "-s1" "-a" "f2.saf" "-F" "SAF" "-o" "cnts2.txt" "reads_Aligned.out.sam"
Geneid Chr Start End Strand Length reads_Aligned.out.sam
r1 chr1 5 9 + 5 1
$ cat f3.saf cnts3.txt
r1 chr1 44 50 +
# Program:featureCounts v2.0.1; Command:"featureCounts" "-s1" "-a" "f3.saf" "-F" "SAF" "-o" "cnts3.txt" "reads_Aligned.out.sam"
Geneid Chr Start End Strand Length reads_Aligned.out.sam
r1 chr1 44 50 + 7 1
$ cat f4.saf cnts4.txt
r1 chr1 45 50 +
# Program:featureCounts v2.0.1; Command:"featureCounts" "-s1" "-a" "f4.saf" "-F" "SAF" "-o" "cnts4.txt" "reads_Aligned.out.sam"
Geneid Chr Start End Strand Length reads_Aligned.out.sam
r1 chr1 45 50 + 6 0
We can see that the SAF 5-9 and SAF 44-50 have overlapping reads while SAF 5-8 and SAF 45-50 not. Therefore, the start and end in SAF are both 1-based and inclusive as in GTF format. Therefore, when converting bed to SAF, the bed start coordinates should plus 1 and bed end coordinates need no change.
I recently wrote this Bash script to do exactly as the OP asked for it. Hope it helps someone.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
I am wondering whether $2 should +1 because the narrowPeak is 0-based. But Actually I rarely find the description about SAF coordinate system :).
In 6.2.2. of https://bioconductor.org/packages/release/bioc/vignettes/Rsubread/inst/doc/SubreadUsersGuide.pdf it says that SAF format is inclusive for both start and end coordinate. I interpret this as "leave narrowPeak as it is". I noted that if you use e.g.
bedtools makeWindows
to make windows across the whole genome, and if you then transform to SAF and add 1 to start you will not have 100% reads assigned to that SAF but only 99.x% because that one start nucleotide is missing. WIthout modification to that BED you assign 100%. Not sure if this is a correct statement, I am always a bit confused with these different coordinate systems. I also never understood why the featureCounts developers could not simply use BED format instead of this SAF format.Thanks for your reply :). I will see it.
This SAF is 1-based coordinate system, yes, it should add 1. I have checked the example gene listed in the featureCounts manual.