Initial post title : How does cufflinks find the strand of a novel transcript?
I am using cufflinks to create a RABT assembly of a genome.
I have my newly created merged.gtf file.
Most of the new transcripts found by cufflinks are present on both strands. I mean that very similar transcripts (in terms of sequence) are reported twice in the GTF, once with strand + and once with strand -.
How does cufflinks find the strand of each novel transcript? If it doesn't know, is there a way to report "unknown" and write only one transcript instead of both?
EDIT :
I found that the strand of my transcripts was determined by the XS field of my SAM input.
I also found that my data is unstranded but that I chose a stranded mode during alignment, which explains why I end up with transcripts on both strands.
I would like to rerun my alignment in unstranded mode and run Cufflinks with the library type set to unstranded. But Cufflinks requires an XS field in the SAM for the spliced alignments.
How can I get the strand (XS field), given that my data is unstranded?
Why does cufflinks require a value in the XS field only for spliced alignments?
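To see how many alignments are affected by the XS requirement, the SAM file can be scanned for spliced records (CIGAR containing an `N` operation) and checked for an `XS:A:+` or `XS:A:-` tag. The sketch below is only an illustration in plain Python: the function names and the example records are made up for this post, and it reads SAM text lines rather than BAM. A plausible reason only spliced reads need the tag is that, for unstranded data, the intron boundary is the one place where an aligner can guess the transcript strand, but that is an assumption here, not something checked against the Cufflinks source.

```python
# Hypothetical example records (not real data): header, spliced with XS,
# spliced without XS, unspliced, unmapped.
EXAMPLE_SAM = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t0\tchr1\t100\t60\t20M500N30M\t*\t0\t0\t*\t*\tNH:i:1\tXS:A:+",
    "r2\t16\tchr1\t200\t60\t25M300N25M\t*\t0\t0\t*\t*\tNH:i:1",
    "r3\t0\tchr1\t900\t60\t50M\t*\t0\t0\t*\t*\tNH:i:1",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]


def is_spliced(cigar):
    """True if the CIGAR has an N operation (reference skip, i.e. an intron)."""
    return "N" in cigar


def xs_strand(fields):
    """Return the value of the XS:A: optional field, or None if it is absent."""
    for tag in fields[11:]:  # optional fields start at column 12
        if tag.startswith("XS:A:"):
            return tag[5:]
    return None


def summarize(sam_lines):
    """Count mapped alignments by spliced status and presence of XS."""
    counts = {"spliced_with_xs": 0, "spliced_without_xs": 0, "unspliced": 0}
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete SAM record
        if int(fields[1]) & 0x4:
            continue  # read is unmapped
        if not is_spliced(fields[5]):
            counts["unspliced"] += 1
        elif xs_strand(fields) in ("+", "-"):
            counts["spliced_with_xs"] += 1
        else:
            counts["spliced_without_xs"] += 1
    return counts


if __name__ == "__main__":
    print(summarize(EXAMPLE_SAM))
```

On a real file the same function could be fed the lines of an uncompressed SAM (for example `summarize(open("aln.sam"))`, where `aln.sam` is a placeholder name). A non-zero `spliced_without_xs` count would point to the records Cufflinks complains about.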
EDIT 2 :
Aligner used : Hisat2
Prokaryote / Bacterial species?
When aligning, Bowtie/TopHat will attempt to align each read as it appears in your input file. If it doesn't align, it will see if the reverse complement of your read aligns. In this way, it can infer the original strand (+/-, plus/minus, coding/non-coding, sense/anti-sense) from which each read derived. If a read does not align, it is then not providing any information on strand, and, thus, there is no 'in between' level where we have a read and don't also have strand information.
You can choose 'unstranded' in TopHat, in which case strand orientation will not be reported and reads are instead piled up indiscriminately over each genomic locus, whether they are sense or anti-sense.
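The "try the read, then its reverse complement" step described above ends up recorded in the SAM FLAG: bit 0x10 is set when the read aligned as its reverse complement. A minimal Python sketch of both ideas follows; the helper names are made up for illustration. Note that this is the orientation of the read relative to the reference, which for an unstranded library is not by itself the transcript strand stored in the XS tag.

```python
# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def alignment_orientation(flag):
    """Return '+' or '-' for a mapped read, or None if the read is unmapped.

    Uses the standard SAM FLAG bits: 0x4 = unmapped, 0x10 = read aligned
    as its reverse complement.
    """
    if flag & 0x4:
        return None
    return "-" if flag & 0x10 else "+"


if __name__ == "__main__":
    print(reverse_complement("ATGC"))
    for flag in (0, 16, 83, 4):
        print(flag, alignment_orientation(flag))
```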
That was helpful, thanks, I just edited my post
Which aligner did you use for the unstranded alignment?
I used Hisat2 for the unstranded alignment