Question

Converting from BED to SAF/GFF

3

Entering edit mode

8.4 years ago

ccag ▴ 60

I called peaks using MACS2 and now would like to use my narrowPeak file to do featureCounts. As far as I can tell, featureCOunts only uses SAF or GFF formats. Does anyone know an easy way to convert between a bed and these file types for Dm3?

bed gff saf featureCount • 18k views

ADD COMMENT • link updated 18 months ago by drperaltaxp • 0 • written 8.4 years ago by ccag ▴ 60

score 10 · Answer 1 · 2018-06-08

10

Entering edit mode

6.9 years ago

ATpoint 88k

awk 'OFS="\t" {print $1"."$2"."$3, $1, $2, $3, "."}' in.narrowPeak > out.saf

The first column is the identifier. This is then either any gene name or in case of genomic regions simply the concatenated genomic coordinates, e.g. separated by dot, such as chr1.100.1000000. For genomic regions I always set strand to ., for genes one can set the strand where the gene is located.

ADD COMMENT • link 5.2 years ago by ATpoint 88k

0

Entering edit mode

I am wondering whether $2 should +1 because the narrowPeak is 0-based. But Actually I rarely find the description about SAF coordinate system :).

ADD REPLY • link 5.2 years ago by shangguandong1996 ▴ 30

2

Entering edit mode

In 6.2.2. of https://bioconductor.org/packages/release/bioc/vignettes/Rsubread/inst/doc/SubreadUsersGuide.pdf it says that SAF format is inclusive for both start and end coordinate. I interpret this as "leave narrowPeak as it is". I noted that if you use e.g. bedtools makeWindows to make windows across the whole genome, and if you then transform to SAF and add 1 to start you will not have 100% reads assigned to that SAF but only 99.x% because that one start nucleotide is missing. WIthout modification to that BED you assign 100%. Not sure if this is a correct statement, I am always a bit confused with these different coordinate systems. I also never understood why the featureCounts developers could not simply use BED format instead of this SAF format.

ADD REPLY • link 5.2 years ago by ATpoint 88k

0

Entering edit mode

Thanks for your reply :). I will see it.

ADD REPLY • link 5.2 years ago by shangguandong1996 ▴ 30

1

Entering edit mode

This SAF is 1-based coordinate system, yes, it should add 1. I have checked the example gene listed in the featureCounts manual.

ADD REPLY • link 4.0 years ago by shengxin1015 ▴ 10

score 1 · Answer 2 · 2023-10-03

1

Entering edit mode

19 months ago

alejandrogzi ▴ 150

I recently developed bed2gff to quickly convert .bed files to a gff3 format, a tool written in Rust. Could be of help here!

ADD COMMENT • link 19 months ago by alejandrogzi ▴ 150

0

Entering edit mode

It's recommended to make a Tool post with usage examples that you can link to.

ADD REPLY • link 19 months ago by ATpoint 88k

0

Entering edit mode

Thanks ATpoint, I will work on that!

ADD REPLY • link 19 months ago by alejandrogzi ▴ 150

score 0 · Answer 3 · 2016-12-23

Option 1 :GenomeTools

Option 2: Go to galaxy oqtans page here and on the left pane there is "GFF Toolkit" which has BED_to_GFF3 converter .

Option 3: A long route. From here make use of bedToGenePred followed by genePredToGtf to get a gtf file at first. Then you can make use of the lot of tools that does a gtf to gff3 conversion.

I am mentioning multiple options because depending on whether you bed file is 6 column or 12 column or so, some may not be applicable.

score 0 · Answer 4 · 2022-11-08

I believe that SAF format use 1-based coordinates that are closed on both ends. Here is how I got this conclusion.

First make some toy data.

$ cat genome.fa
>chr1
AATTCCGGAAAATTTTCCCCGGGGAAAAAAAAAAAAAAAAAACCCCCCCCCCCCCCCCCCCCCCCCCCCC

$ cat reads.fa
>q1
AAAATTTTCCCCGGGGAAAAAAAAAAAAAAAAAACC

Map reads to the genome:

$ STAR --runMode genomeGenerate --genomeDir test_star --genomeFastaFiles genome.fa --genomeSAindexNbases 2
$ STAR --genomeDir test_star --readFilesIn reads.fa --outFileNamePrefix reads_

$ cat reads_Aligned.out.sam
@HD VN:1.4
@SQ SN:chr1 LN:70
@PG ID:STAR PN:STAR VN:2.7.10a  CL:STAR   --genomeDir test_star   --readFilesIn reads.fa      --outFileNamePrefix reads_
@CO user command line: STAR --genomeDir test_star --readFilesIn reads.fa --outFileNamePrefix reads_
q1  0   chr1    9   255 36M *   0   0   AAAATTTTCCCCGGGGAAAAAAAAAAAAAAAAAACC    *   NH:i:1  HI:i:1  AS:i:35 nM:i:0

The read was mapped to 9-44 nucleotides of chr1:

$ bedtools bamtobed -i reads_Aligned.out.sam
chr1    8   44  q1  255 +

Now we test whether several SAF intervals have overlapping reads.

$ featureCounts -s1 -a f1.saf -F SAF -o cnts1.txt reads_Aligned.out.sam
$ featureCounts -s1 -a f2.saf -F SAF -o cnts2.txt reads_Aligned.out.sam
$ featureCounts -s1 -a f3.saf -F SAF -o cnts3.txt reads_Aligned.out.sam
$ featureCounts -s1 -a f4.saf -F SAF -o cnts4.txt reads_Aligned.out.sam

$ cat f1.saf cnts1.txt
r1  chr1    5   8   +
# Program:featureCounts v2.0.1; Command:"featureCounts" "-s1" "-a" "f1.saf" "-F" "SAF" "-o" "cnts1.txt" "reads_Aligned.out.sam"
Geneid  Chr Start   End Strand  Length  reads_Aligned.out.sam
r1  chr1    5   8   +   4   0

$ cat f2.saf cnts2.txt
r1  chr1    5   9   +
# Program:featureCounts v2.0.1; Command:"featureCounts" "-s1" "-a" "f2.saf" "-F" "SAF" "-o" "cnts2.txt" "reads_Aligned.out.sam"
Geneid  Chr Start   End Strand  Length  reads_Aligned.out.sam
r1  chr1    5   9   +   5   1

$ cat f3.saf cnts3.txt
r1  chr1    44  50  +
# Program:featureCounts v2.0.1; Command:"featureCounts" "-s1" "-a" "f3.saf" "-F" "SAF" "-o" "cnts3.txt" "reads_Aligned.out.sam"
Geneid  Chr Start   End Strand  Length  reads_Aligned.out.sam
r1  chr1    44  50  +   7   1

$ cat f4.saf cnts4.txt
r1  chr1    45  50  +
# Program:featureCounts v2.0.1; Command:"featureCounts" "-s1" "-a" "f4.saf" "-F" "SAF" "-o" "cnts4.txt" "reads_Aligned.out.sam"
Geneid  Chr Start   End Strand  Length  reads_Aligned.out.sam
r1  chr1    45  50  +   6   0

We can see that the SAF 5-9 and SAF 44-50 have overlapping reads while SAF 5-8 and SAF 45-50 not. Therefore, the start and end in SAF are both 1-based and inclusive as in GTF format. Therefore, when converting bed to SAF, the bed start coordinates should plus 1 and bed end coordinates need no change.

score 0 · Answer 5 · 2023-10-25

0

Entering edit mode

18 months ago

drperaltaxp • 0

I recently wrote this Bash script to do exactly as the OP asked for it. Hope it helps someone.

https://github.com/UBMI-IFC/bed2gff/tree/main

ADD COMMENT • link 18 months ago by drperaltaxp • 0