Question

How to create a SAF file from a text file for featureCounts

0

3.6 years ago

pt.taklifi ▴ 60

Hello everyone,

I have some ATAC-seq BAM files and a reference peak set (genomic regions) in text format. I am trying to get a count matrix from the BAM files, i.e. for each BAM file I want to calculate how many reads fall in each peak. I am trying to use featureCounts for this, which needs annotations in SAF or GTF format. 1) In my case, is the annotation the reference peak set? 2) If so, how do I convert my peak set to SAF or GTF format?

Here are a few lines of the peak set

seqnames    start   end name    score   annotation  percentGC
chr1    906012  906513  ACC_10  7.171192997 Intron  0.612774451
chr2    112541661   112542162   ACC_10008   22.03057903 Promoter    0.55489022
chr1    21673421    21673922    ACC_1001    6.459954383 Distal  0.508982036
chr2    112584205   112584706   ACC_10013   43.20855549 Promoter    0.586826347
chr2    112596243   112596744   ACC_10016   5.428209077 Intron  0.491017964
chr1    21725692    21726193    ACC_1002    5.201272875 Intron  0.405189621

featureCounts ATAC-seq • 6.0k views

ADD COMMENT • link updated 3 months ago by QX ▴ 60 • written 3.6 years ago by pt.taklifi ▴ 60

score 4 · Accepted Answer · 2021-05-02

4

3.6 years ago

ATpoint 85k

Converting from BED to SAF/GFF

If your file has a header, you need to skip the first line; in awk that is done with the condition NR>1. SAF is a tab-delimited format with the columns GeneID, Chr, Start, End and Strand. Convert your peak file to SAF, use it as the annotation, and then run something like:

featureCounts -a your.saf -F SAF -o counts.txt *.bam
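As an illustration of the conversion step, here is a minimal sketch (not part of the original answer). It assumes the peak file is whitespace-delimited with one header line and the column order shown in the question (seqnames, start, end, name, ...). The function name peaks_to_saf and the input file name are hypothetical. SAF start coordinates are, as far as I know, 1-based, so if your peaks use 0-based BED-style starts you would need to add 1 to the start column.

```shell
# peaks_to_saf FILE: print a SAF table for a peak file with a header line.
# Assumed input columns: seqnames start end name ... (as in the question).
peaks_to_saf() {
    awk 'BEGIN {OFS="\t"; print "GeneID", "Chr", "Start", "End", "Strand"}
         # NR>1 skips the header line of the input.
         # Peaks have no strand; "+" is a placeholder that is ignored
         # with the default unstranded counting (-s 0).
         NR>1  {print $4, $1, $2, $3, "+"}' "$1"
}

# Usage (input file name is an assumption):
#   peaks_to_saf PanCancer_PeakSet.txt > PanCancer_PeakSet.saf
```

The peak name (column 4) is used as GeneID here; any identifier that is unique per region would do.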

ADD COMMENT • link 3.6 years ago by ATpoint 85k

0

thank you very much.

ADD REPLY • link 3.6 years ago by pt.taklifi ▴ 60

0

One follow-up question: my BAM files are not in the same directory, so I created a text file, acc_bamFiles.txt, containing all BAM file locations. Here are a few lines:

SRR10984460/bam/SRR10984460.dedup.bam
SRR10984461/bam/SRR10984461.dedup.bam
SRR10984462/bam/SRR10984462.dedup.bam

However, when I try:

featureCounts -a PanCancer_PeakSet.saf -F SAF -o counts.txt acc_bamFiles.txt

I get this error message:

ERROR: invalid parameter: 'acc_bamFiles.txt'

So should I just run featureCounts for each BAM file separately and then combine the count matrices, or is there an easier way?

ADD REPLY • link 3.6 years ago by pt.taklifi ▴ 60

3

I do not think featureCounts accepts a text file listing the BAMs. I usually create symbolic links in the directory where the SAF file is. Say you are in the SAF directory: run ln -s /path/to/bam . for each BAM and then use *.bam.
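The symlink approach can be scripted from the existing list file. The following is a sketch, not the exact commands used in this reply: the helper name link_bams is hypothetical, and it assumes one BAM path per line, unique BAM file names, and that realpath (GNU coreutils) is available.

```shell
# link_bams LIST [DEST]: symlink every BAM listed in LIST (one path per line)
# into DEST (default: the current directory).
link_bams() {
    list=$1
    dest=${2:-.}
    while IFS= read -r bam; do
        [ -n "$bam" ] || continue              # skip empty lines
        # Resolve to an absolute path so the link works from any directory.
        ln -s "$(realpath "$bam")" "$dest/"
    done < "$list"
}

# Usage, run from the directory that the relative paths in the list refer to:
#   link_bams acc_bamFiles.txt .
```

After linking, *.bam in the featureCounts call picks up the links. Since featureCounts takes BAM files as ordinary command-line arguments, letting the shell expand the list, for example featureCounts -a PanCancer_PeakSet.saf -F SAF -o counts.txt $(cat acc_bamFiles.txt), should also work as long as the paths contain no spaces.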

ADD REPLY • link 3.6 years ago by ATpoint 85k

0

thank you very much!

ADD REPLY • link 3.6 years ago by pt.taklifi ▴ 60

0

Hi ATpoint ,

In my data, MACS2 returns multiple peaks per region:

chr     start   end     length  abs_summit      pileup  -log10(pvalue)  fold_enrichment -log10(qvalue)  name
1       826531  828140  1610    826808  1730    1328.65 14.7141 1326.59 peak_all_peak_1a
1       826531  828140  1610    827539  4563    5338.88 38.7957 5336.28 peak_all_peak_1b
1       826531  828140  1610    827986  698     292.385 5.94175 290.704 peak_all_peak_1c
1       831302  831572  271     831439  329     57.8574 2.80512 56.4633 peak_all_peak_2
1       832037  832505  469     832340  290     41.2257 2.47361 39.8811 peak_all_peak_3
1       844126  846414  2289    844345  297     44.053  2.53311 42.6991 peak_all_peak_4a
1       844126  846414  2289    844868  886     448.757 7.53982 446.986 peak_all_peak_4b

featureCounts then returns what I believe is the sum of read counts for each region:

1.826531.828140 1;1;1   826531;826531;826531    828140;828140;828140    +;+;+   1610    323     485     506     419     193     275     264     441     479     390     488     548     266     383     471     527     417     283     445     470     606     612     575     471
1.831302.831572 1       831302  831572  +       271     5       6       13      11      7       8       5       9       15      14      11      11      7       10      10      12      6       7       8       13      13      9       16      15
1.832037.832505 1       832037  832505  +       469     10      9       16      22      6       5       7       23      24      33      31      28      8       13      15      15      14      7       15      9       19      22      27      19
1.844126.846414 1;1;1;1;1;1     844126;844126;844126;844126;844126;844126       846414;846414;846414;846414;846414;846414       +;+;+;+;+;+     2289    170     271     242     253     106     195     144     82      68      94      84      87      148     197     226     225     185     134     174     260     254     291     122     113

Do you know if this is a 'correct' way to pool all the peaks of the same region before performing differential expression, or are there better options?

ADD REPLY • link 3 months ago by QX ▴ 60