Question

How to get an XS field in SAM with unstranded data?

1

7.1 years ago

corend ▴ 70

Initial post title: How does Cufflinks find the strand of a novel transcript?

I am using cufflinks to create a RABT assembly of a genome.

I have my newly created merged.gtf file.

Most of the new transcripts found by Cufflinks are present on both strands: nearly identical transcripts (in terms of sequence) are reported twice in the GTF, once with strand + and once with strand -.

How does Cufflinks determine the strand of each novel transcript? If it cannot tell, is there a way to report "unknown" and write a single transcript instead of both?

EDIT :

I found that the strand of my transcripts was determined by the XS field of my SAM input.

I also found that my data is unstranded but that I had chosen a stranded mode during alignment, which explains why I end up with transcripts on both strands.

I would like to rerun my alignment in unstranded mode and run Cufflinks with the unstranded library type. But Cufflinks requires an XS field in the SAM for spliced alignments.

How can I get the strand (XS field) given that my data is unstranded?

Why does Cufflinks require a value in the XS field only for spliced alignments?
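To see which alignments are affected, here is a minimal sketch that lists spliced alignments (CIGAR containing an N operation) with no XS:A tag. It assumes a plain-text, tab-separated SAM; the function name and the toy records are hypothetical and only for illustration.

```python
def spliced_without_xs(sam_lines):
    """Return read names of spliced alignments that carry no XS:A strand tag."""
    missing = []
    for line in sam_lines:
        if line.startswith("@"):        # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:            # not a complete alignment record
            continue
        cigar = fields[5]               # column 6 holds the CIGAR string
        tags = fields[11:]              # optional fields start at column 12
        is_spliced = "N" in cigar       # N = skipped reference region (intron)
        has_xs = any(tag.startswith("XS:A:") for tag in tags)
        if is_spliced and not has_xs:
            missing.append(fields[0])
    return missing


# Toy records (hypothetical), not taken from a real alignment.
toy_sam = [
    "@HD\tVN:1.0",
    "r1\t0\tchr1\t100\t60\t50M\t*\t0\t0\tACGT\tIIII\tNH:i:1",
    "r2\t0\tchr1\t100\t60\t20M500N30M\t*\t0\t0\tACGT\tIIII\tNH:i:1",
    "r3\t0\tchr1\t100\t60\t20M500N30M\t*\t0\t0\tACGT\tIIII\tXS:A:+",
]
print(spliced_without_xs(toy_sam))  # ['r2']
```

On a real file, the same function can be fed an open file handle, e.g. `spliced_without_xs(open("aligned.sam"))`, where the file name is a placeholder.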

EDIT 2 : Aligner used : Hisat2

cufflinks RNA-Seq alignment • 2.4k views

ADD COMMENT • link updated 7.1 years ago by Friederike 9.0k • written 7.1 years ago by corend ▴ 70

1

Prokaryote / Bacterial species?

When aligning, Bowtie/TopHat will attempt to align each read as it appears in your input file. If it doesn't align, it will see if the reverse complement of your read aligns. In this way, it can infer the original strand (+/-, plus/minus, coding/non-coding, sense/anti-sense) from which each read derived. If a read does not align, it is then not providing any information on strand, and, thus, there is no 'in between' level where we have a read and don't also have strand information.

You can choose 'unstranded' in TopHat, in which case strand orientation is not reported and reads are instead piled up over each genomic locus regardless of whether they are sense or antisense.

ADD REPLY • link 7.1 years ago by Kevin Blighe 88k

0

That was helpful, thanks, I just edited my post

ADD REPLY • link 7.1 years ago by corend ▴ 70

0

Which aligner did you use for the unstranded alignment?

ADD REPLY • link 7.1 years ago by Friederike 9.0k

0

I used Hisat2 for the unstranded alignment

ADD REPLY • link 7.1 years ago by corend ▴ 70

score 1 · Answer 1 · 2017-11-30

1

7.1 years ago

Friederike 9.0k

If you're using HISAT2, it seems that you need to set the --dta flag. Disclaimer: I haven't used HISAT2 myself.
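As a rough sketch of what such a run could look like, the snippet below only builds the HISAT2 command line and prints it, without executing anything. It assumes paired-end reads; the index and file names are placeholders. It uses --dta-cufflinks, the Cufflinks-oriented variant of --dta, and leaves out --rna-strandness, since HISAT2 treats the library as unstranded by default.

```python
import shlex


def hisat2_command(index, reads_1, reads_2, out_sam, for_cufflinks=True):
    """Build (but do not run) a HISAT2 command for an unstranded library."""
    cmd = ["hisat2", "-x", index, "-1", reads_1, "-2", reads_2, "-S", out_sam]
    # --dta tailors alignments for transcript assemblers in general;
    # --dta-cufflinks additionally writes XS:A:[+-] on spliced alignments.
    cmd.append("--dta-cufflinks" if for_cufflinks else "--dta")
    # No --rna-strandness option is added: the default is unstranded.
    return cmd


# Placeholder names, assumed for illustration only.
cmd = hisat2_command("genome_index", "reads_1.fastq", "reads_2.fastq", "aligned.sam")
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list passed to `subprocess.run(cmd, check=True)` once the paths are real.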

Why does Cufflinks require a value in the XS field only for spliced alignments?

Presumably (and based on the comments in the reference in my first line) because it has become a convention for spliced-read aligners to store information that is valuable to transcript assemblers in the XS field. In general, XS is one of the optional, only loosely defined tags in SAM files, which is why you'll see all sorts of values there, including the strand (TopHat's choice) or the suboptimal alignment score (BWA).
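One reason a strand can be assigned to spliced reads even in unstranded data is that the intron itself carries a strand signal: the dinucleotides at its boundaries (GT...AG for a + strand transcript, which reads as CT...AC on the reference when the transcript is on the - strand). The following is a minimal sketch of that idea, not the actual implementation of any aligner; the function name, the 0-based half-open coordinates and the toy sequences are assumptions made for illustration.

```python
# Intron boundary motifs as read on the reference (+) strand.
PLUS_MOTIFS = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}
# The same motifs reverse-complemented, i.e. a transcript on the - strand.
MINUS_MOTIFS = {("CT", "AC"), ("CT", "GC"), ("GT", "AT")}


def strand_from_intron(ref_seq, intron_start, intron_end):
    """Guess the transcript strand from an intron's boundary dinucleotides.

    Coordinates are 0-based, half-open: the intron is ref_seq[start:end].
    Returns "+", "-", or None when the motif is not recognised.
    """
    first_two = ref_seq[intron_start:intron_start + 2].upper()
    last_two = ref_seq[intron_end - 2:intron_end].upper()
    motif = (first_two, last_two)
    if motif in PLUS_MOTIFS:
        return "+"
    if motif in MINUS_MOTIFS:
        return "-"
    return None


# Toy reference sequences (hypothetical): exon, intron, exon.
plus_ref = "AAAA" + "GTCCCCAG" + "TTTT"
minus_ref = "AAAA" + "CTGGGGAC" + "TTTT"
print(strand_from_intron(plus_ref, 4, 12))   # +
print(strand_from_intron(minus_ref, 4, 12))  # -
```

An unspliced read has no intron to inspect, so in unstranded data there is nothing comparable to infer its strand from.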

ADD COMMENT • link 7.1 years ago by Friederike 9.0k

0

Thanks a lot, I'll try this option. I see that there is also a dta-cufflinks option; it could be what I am looking for.

Still, I don't understand why Cufflinks requires strand information when my data doesn't contain any. Also, if I use the unstranded library type, why would Cufflinks need a strand?

PS: I see here that dta could change the number of aligned reads; I'll see whether that is really a problem.

ADD REPLY • link 7.1 years ago by corend ▴ 70

1

My understanding was not that HISAT stores strand information in the XS tag, but rather some details about the spliced alignment.

ADD REPLY • link 7.1 years ago by Friederike 9.0k