Hello all,
I have a differential expression analysis pipeline that goes - 1) alignment with HISAT2, 2) featureCounts, 3) Differential expression analysis with DESeq2. I have a list of DE genes from the last step, and now I need to extract the sequences correlating to these genes. The format of my DE genes is 000000F.g00.t0. However, when I try to extract the sequences from my reference genome (which I used in HISAT2), I see that the gene IDs are in the format 000000F.
When I check the SAM files, the IDs are still in the 000000F format. It's only in the featureCounts output that the 000000F.g00.t0 format. So my questions are:
1) Why does featureCounts alter my IDs?
2) How can I get the sequences related to my DE genes? /do I have to remove the .00.t0 and just use the original ID?
Thanks in advance!
It sounds like you have a name mismatch between your reference and annotation files.
featureCounts
is probably taking the ID's that are in your GTF files. I am actually surprised that this is working (i.e. you have counts?).Can you show us example lines from your reference and annotation files?
A possible explanation
and
If you have
000000F.g00.t0
id's in your reference (and presumably annotation) then it is HISAT2 that is truncating them000000F
in your alignment files. Assuming you are referring to column 3 in SAM record.Reference example lines:
Gene ID examples (
grep '^>' genome.fa | head
):Annotation example lines:
I do have 000000F.g00.t0 IDs in my annotation (in column 9, not column 1), and 000000F IDs in my reference.
Also, example SAM lines:
FeatureCounts was run with -g set to Parent, which seems to explain the ID format disparity.
For this example
ID = 000000F.g139.t1
so featureCounts is simply using that as gene ID.Since this is annotated as a transcript with a start/stop (4099240 4100215 ) on negative strand you can extract that interval from
000000F
using examples here: Extract User Defined Region From An Fasta File (and similar threads)