8.2 years ago
dr.genetics
I've run a DNA-seq data file with featureCounts and got the following (c is my featureCounts return value)
> head(cbind(c$counts, c$annotation));
GACTCCTCAATGTC.sam GeneID
DDX11L1 3 DDX11L1
WASH7P 3 WASH7P
FAM138A 0 FAM138A
FAM138F 0 FAM138F
OR4F5 0 OR4F5
LOC729737 4 LOC729737
Chr
DDX11L1 chr1;chr1;chr1
WASH7P chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1
FAM138A chr1;chr1;chr1;chr19;chr19;chr19
FAM138F chr1;chr1;chr1;chr19;chr19;chr19
OR4F5 chr1
LOC729737 chr1;chr1;chr1
Start
DDX11L1 11874;12613;13221
WASH7P 14362;14970;15796;16607;16858;17233;17606;17915;18268;24738;29321
FAM138A 34611;35277;35721;76220;76886;77330
FAM138F 34611;35277;35721;76220;76886;77330
OR4F5 69091
LOC729737 134773;139790;140075
End
DDX11L1 12227;12721;14409
WASH7P 14829;15038;15947;16765;17055;17368;17742;18061;18366;24891;29370
FAM138A 35174;35481;36081;76783;77090;77690
FAM138F 35174;35481;36081;76783;77090;77690
OR4F5 70008
LOC729737 139696;139847;140566
Strand Length
DDX11L1 +;+;+ 1652
WASH7P -;-;-;-;-;-;-;-;-;-;- 1769
FAM138A -;-;-;-;-;- 2260
FAM138F -;-;-;-;-;- 2260
OR4F5 + 918
LOC729737 -;-;- 5474
But I am a little confused about the results:
How come the count of WASH7P is only 3? It looks like there are 11 segments mapped to the gene.
Why does FAM138A have a count of 0? I understand the gene is located on two chromosomes, chr1 and chr19, but it has 3 counts on each chromosome.
OR4F5 has one read, and the segment spans exactly from the TSS to the TES? I guess I am misunderstanding something here.
Thanks.
Those are the start and end coordinates of the exons annotated for each gene (including exons from alternative transcripts), not mapped reads. Perfectly normal output if you ask me.
I'm not sure I understand what you are trying to achieve. This is DNA sequencing, but with what aim?
The thread "How to map BAM/SAM files to genes with abundance levels?" has the background on this. It is not 100% clear to me what dr.genetics wants to count or find depth for.
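A quick way to convince yourself that the Chr/Start/End columns describe annotation intervals rather than reads: treat each semicolon-separated Start/End pair as an inclusive interval and add up the bases covered. The sketch below is plain Python, not part of featureCounts, and the helper name `covered_length` is made up for illustration. With the values pasted above it reproduces the Length column.

```python
def covered_length(chroms, starts, ends):
    """Total non-overlapping bases covered by semicolon-separated intervals.

    Coordinates are treated as 1-based and inclusive, as in the output above.
    """
    by_chrom = {}
    for chrom, start, end in zip(chroms.split(";"), starts.split(";"), ends.split(";")):
        by_chrom.setdefault(chrom, []).append((int(start), int(end)))

    total = 0
    for intervals in by_chrom.values():
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:
                # Overlapping intervals are merged so no base is counted twice.
                cur_end = max(cur_end, end)
            else:
                total += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        total += cur_end - cur_start + 1
    return total


# Values copied from the featureCounts annotation shown in the question.
ddx11l1 = covered_length("chr1;chr1;chr1", "11874;12613;13221", "12227;12721;14409")
wash7p = covered_length(
    "chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1",
    "14362;14970;15796;16607;16858;17233;17606;17915;18268;24738;29321",
    "14829;15038;15947;16765;17055;17368;17742;18061;18366;24891;29370",
)
print(ddx11l1, wash7p)  # 1652 1769
```

So the 11 segments listed for WASH7P are its 11 annotated intervals, and the count of 3 is the number of reads assigned to the gene as a whole.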
Our experiments generate double-strand breaks (DSBs), and we are trying to see if there are any hotspots of DSBs. The DNA-seq technique we used (GUIDE-seq) captures the sequences with DSBs as the 5' or 3' ends. The more often we see a DNA fragment, the more likely it is that a DSB occurs at one of its ends. So we are looking for the abundance of DNA fragments and thus the frequency of DSBs at specific genomic loci.
In other words, I am looking for the abundance of DNA fragments in the BAM/SAM files and the relationship of such DNA fragments to genes.
I assume the "count" of 3, 0, etc. means the count of DNA fragments seen in the BAM/SAM file mapped within a particular gene? If so, it does not have enough resolution because we are also interested in where exactly the DSBs are.
Interesting experiment, and you would have gotten helpful replies far more quickly if you had told us that earlier ;)
featureCounts performs the counting per feature (in this case per gene), so there are indeed 3 reads counted for WASH7P. But the resolution of your data is probably better than that.
I would suggest thinking in the direction of ChIP-seq experiments and clustering those reads to find hotspots.
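To make the clustering idea concrete, here is a minimal sketch in Python. It is a simple stand-in for illustration, not a ChIP-seq peak caller: it groups candidate break sites that lie within `max_gap` bases of each other on the same chromosome. The function name, the `max_gap` default and the example sites are all made up.

```python
from collections import defaultdict


def cluster_positions(sites, max_gap=10):
    """Group (chrom, pos) sites into clusters.

    Consecutive positions on the same chromosome that are at most
    max_gap bases apart end up in the same cluster.
    Returns a list of (chrom, first_pos, last_pos, n_sites).
    """
    by_chrom = defaultdict(list)
    for chrom, pos in sites:
        by_chrom[chrom].append(pos)

    clusters = []
    for chrom in sorted(by_chrom):
        positions = sorted(by_chrom[chrom])
        first = last = positions[0]
        n_sites = 1
        for pos in positions[1:]:
            if pos - last <= max_gap:
                last = pos
                n_sites += 1
            else:
                clusters.append((chrom, first, last, n_sites))
                first = last = pos
                n_sites = 1
        clusters.append((chrom, first, last, n_sites))
    return clusters


# Made-up example sites, one entry per observed fragment end.
example_sites = [
    ("chr1", 14365), ("chr1", 14362), ("chr1", 14370),
    ("chr1", 29321), ("chr2", 500),
]
hotspots = cluster_positions(example_sites, max_gap=10)
for chrom, first, last, n_sites in hotspots:
    print(chrom, first, last, n_sites)
```

Clusters with many sites would be the hotspot candidates. A dedicated peak caller would additionally model background, which this sketch does not attempt.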
Great. thanks.
Can I just use the chr, start, end info in the SAM file for each sequence? Are there complications such as overlapping, reverse complement, etc.? If not, it seems that I can simply use such info directly?
That would work, yes, if you find a sensible way to aggregate those.
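On the reverse-complement point, one thing to keep in mind is that the POS field in SAM is always the leftmost aligned base, so for a read on the reverse strand the 5' end sits at the right-hand side of the alignment. A rough sketch of one way to pull out 5' ends and tally them, using only the standard SAM columns (FLAG, RNAME, POS, CIGAR). The function name and the example records are made up, and whether the 5' end is the right proxy for the break site depends on your library prep, so treat this as an assumption.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def five_prime_end(sam_line):
    """Return (chrom, position, strand) of the read's 5' end, or None if skipped."""
    fields = sam_line.rstrip("\n").split("\t")
    flag, chrom, pos, cigar = int(fields[1]), fields[2], int(fields[3]), fields[5]
    # Skip unmapped (0x4), secondary (0x100) and supplementary (0x800) records.
    if flag & 0x4 or flag & 0x100 or flag & 0x800 or cigar == "*":
        return None
    ref_len = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)
    if flag & 0x10:
        # Reverse strand: the 5' end is the rightmost aligned base.
        return chrom, pos + ref_len - 1, "-"
    return chrom, pos, "+"


# Made-up SAM records for illustration only.
records = [
    "r1\t0\tchr1\t14362\t60\t50M\t*\t0\t0\t*\t*",
    "r2\t16\tchr1\t14311\t60\t20M2D30M\t*\t0\t0\t*\t*",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
ends = [five_prime_end(line) for line in records]
site_counts = Counter(site for site in ends if site is not None)
for site, n in sorted(site_counts.items()):
    print(site, n)
```

Soft-clipped bases do not consume the reference, so a read clipped at its 5' end will report the first aligned base rather than the true fragment end. For paired-end data you would also need to decide which mate carries the break.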