Correlation of read count values from STAR and Kallisto

1

Kallisto has been highly anticipated, even though it hasn't been published yet. The comparisons you are making are very important and pretty much needed for validation. It doesn't help if Kallisto is 1000x faster but the counts do not correlate well. But the counting function in STAR is also quite new, so I would compare the counts to another method, e.g. htseq-count or easyRNASeq in R, and see how that works out.

ADD REPLY • link 9.1 years ago by Michael 55k

0

Thanks for the reply Michael,

I'm trying to use RSEM downstream of STAR. I don't want to take a naive read count approach (like HTSeq/featureCounts/easyRNASeq). I'm looking forward to carrying out isoform-based DE calling (EBSeq downstream of RSEM+STAR and Sleuth downstream of Kallisto).

ADD REPLY • link 9.1 years ago by parashar.dhapola ▴ 160

1

I haven't yet tried out Kallisto, but I wonder how much of an effect this will really have on downstream DE analysis. Just because the Kallisto counts are really different from STAR's doesn't necessarily mean the downstream DE will be vastly different, unless you are trying to perform DE on genes within one sample, which I don't think is very valid anyway. Whatever biases are introduced by Kallisto or STAR might be consistent among the samples you are comparing, or they might not.

I recommend trying this on two datasets, performing the DE, and then maybe looking at the correlation of fold-changes.

If that shows good correlation, then I think the differences will probably just come down to the genes with many multi-mapped reads (conserved domains, isoforms...) and how Kallisto and STAR deal with that.
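A minimal sketch of the fold-change comparison suggested above, in plain Python. The function names, the pseudocount of 1 and the toy counts are all assumptions made for illustration; in practice the fold changes would come from whichever DE tool is used for each pipeline.

```python
import math


def log2_fold_changes(cond_a, cond_b, pseudocount=1.0):
    """Per-gene log2(B / A) from two dicts mapping gene -> mean count.

    The pseudocount (an assumption here) avoids division by zero.
    """
    shared = sorted(set(cond_a) & set(cond_b))
    return {
        g: math.log2((cond_b[g] + pseudocount) / (cond_a[g] + pseudocount))
        for g in shared
    }


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fold_change_correlation(lfc_1, lfc_2):
    """Correlate fold changes over the genes present in both pipelines."""
    genes = sorted(set(lfc_1) & set(lfc_2))
    return pearson([lfc_1[g] for g in genes], [lfc_2[g] for g in genes])


# Toy numbers, made up for illustration only.
star_lfc = log2_fold_changes(
    {"g1": 10, "g2": 100, "g3": 50}, {"g1": 40, "g2": 90, "g3": 10}
)
kallisto_lfc = log2_fold_changes(
    {"g1": 14, "g2": 80, "g3": 61}, {"g1": 52, "g2": 85, "g3": 9}
)
print(fold_change_correlation(star_lfc, kallisto_lfc))
```

If the two pipelines disagree on raw counts but agree on fold changes, the correlation printed here stays high even though a count-vs-count scatter plot looks poor.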

ADD REPLY • link updated 2.7 years ago by Ram 45k • written 9.1 years ago by Damian Kao 16k

0

Dear Damian,

Thanks for your reply.

I have replicates and other samples in this dataset. I'll share the results of those too. However, I'm not so sure about using any DE tool to make the comparison. For example, DESeq is known to not play very well with Kallisto output. Hence, the compatibility of STAR/Kallisto with the DE tool might itself introduce some biases. Nevertheless, it is worth a try and I will surely get back to you with the results.

ADD REPLY • link 9.1 years ago by parashar.dhapola ▴ 160

0

I would do a comparison of STAR's GeneCounts vs a dedicated DE tool like edgeR or DESeq on STAR's output. They should be the same, right, given the same input data? But since I've seen so many Kallisto vs Salmon vs TopHat comparisons and no one has ever mentioned a difference in distribution before, I would suspect STAR's GeneCounts over Kallisto (as much as I love STAR).

Great post though - and thank you for taking the time to show us this graphic :)

ADD REPLY • link 9.1 years ago by John 13k

1

As the developer of Sailfish and Salmon, I've done quite a bit of comparison against STAR counts at the gene level. While you will see (sometimes systematic) differences, I've never seen anything this stark. Further, given the similarities between Sailfish and Kallisto, by transitivity, I wouldn't expect to see such a tremendous difference between those methods. Could you provide a bit more detail about how you've computed these results? That is, what transcriptome did you use for Kallisto, how did you aggregate the counts to the gene level, etc.? Typically, we see (Spearman) correlations in the high 0.8s to the mid 0.9s between Sailfish or Salmon and STAR at the gene level; I'd expect something similar from Kallisto.
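For readers who want to reproduce this kind of check, here is a rough sketch of summing transcript-level counts to genes and computing a Spearman correlation in plain Python. The function names and the `tx2gene` mapping are hypothetical, and this is not how Sailfish, Salmon or Kallisto aggregate internally.

```python
import math
from collections import defaultdict


def aggregate_to_genes(transcript_counts, tx2gene):
    """Sum transcript-level counts per gene; unmapped transcripts are skipped."""
    gene_counts = defaultdict(float)
    for tx, count in transcript_counts.items():
        gene = tx2gene.get(tx)
        if gene is not None:
            gene_counts[gene] += count
    return dict(gene_counts)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


def gene_level_spearman(gene_counts_1, gene_counts_2):
    """Compare two gene-level count tables over their shared genes."""
    genes = sorted(set(gene_counts_1) & set(gene_counts_2))
    return spearman(
        [gene_counts_1[g] for g in genes], [gene_counts_2[g] for g in genes]
    )
```

Restricting the comparison to genes present in both tables matters here: if the two tools were run against different annotations, the shared set shrinks and that is itself a useful diagnostic.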

ADD REPLY • link updated 2.7 years ago by Ram 45k • written 9.1 years ago by Rob 7.1k

0

Dear Rob,

I obtained annotation data from Ensembl release 83 (FTP Link).

I wrote a small piece of code (it was overkill because I tried to make a more generic GTF parser, but it did its job right). You can review the code here:

	import sys
	from itertools import groupby


	def read_fasta(fasta_name):
	    """Yield (header, sequence) pairs from a FASTA file."""
	    with open(fasta_name) as fh:
	        # groupby alternates between runs of header lines and sequence lines.
	        faiter = (f[1] for f in groupby(fh, lambda line: line[0] == ">"))
	        for head in faiter:
	            # The full header line (minus '>') is used as the key.
	            head = next(head)[1:].strip()
	            s = "".join(s.strip() for s in next(faiter))
	            yield head, s


	class GenericFeature(object):
	    """A gene, transcript or exon record with a link to its parent."""

	    def __init__(self, name, uid, chrom, start, end, strand, parent, feat_type, misc):
	        self.chrom = chrom
	        self.start = start
	        self.end = end
	        self.strand = strand
	        self.parent = parent
	        self.children = []
	        self.type = feat_type
	        self.name = name
	        self.id = uid
	        self.info = misc

	    def set_child(self, child_id):
	        self.children.append(child_id)


	class GTF(object):
	    def __init__(self, gtf_file):
	        self.file = gtf_file
	        self.genes = {}
	        self.transcripts = {}
	        self.exons = {}
	        self.summary = {}
	        self.file_parser()
	        self.make_children()

	    def lazy_reader(self):
	        """Yield the tab-split columns of each non-comment GTF line."""
	        with open(self.file) as h:
	            for l in h:
	                if l[0] != "#" and len(l) > 20:
	                    c = l.rstrip('\n').split('\t')
	                    yield c

	    def file_parser(self):
	        gtf_stream = self.lazy_reader()
	        for row in gtf_stream:
	            # Parse the attribute column: 'key "value"; key "value"; ...'
	            anno = {x.split(' ')[0]: x.split(' ')[1].rstrip(';').strip('"') for x in row[-1].split('; ')}
	            if row[2] == "gene":
	                if 'gene_name' not in anno:
	                    anno['gene_name'] = None
	                feature = GenericFeature(anno['gene_name'], anno['gene_id'], row[0], row[3], row[4],
	                                         row[6], None, 'gene', {'gene_biotype': anno['gene_biotype']})
	                self.genes[anno['gene_id']] = feature
	            elif row[2] == 'transcript':
	                if 'transcript_name' not in anno:
	                    anno['transcript_name'] = None
	                feature = GenericFeature(anno['transcript_name'], anno['transcript_id'], row[0],
	                                         row[3], row[4], row[6], anno['gene_id'], 'transcript',
	                                         {'transcript_biotype': anno['transcript_biotype']})
	                self.transcripts[anno['transcript_id']] = feature
	            elif row[2] == 'exon':
	                # Exons are keyed by exon_id, so an exon shared by several
	                # transcripts keeps only the last transcript as its parent.
	                feature = GenericFeature(None, anno['exon_id'], row[0],
	                                         row[3], row[4], row[6], anno['transcript_id'], 'exon',
	                                         {'exon_number': anno['exon_number']})
	                self.exons[anno['exon_id']] = feature

	    def summarize(self):
	        if self.summary == {}:
	            self.summary = {
	                'genes': len(self.genes),
	                'transcripts': len(self.transcripts),
	                'exons': len(self.exons)
	            }
	        print(self.summary)

	    def make_children(self):
	        """Attach exon ids to transcripts and transcript ids to genes."""
	        for exon in self.exons.values():
	            self.transcripts[exon.parent].children.append(exon.id)
	        for transcript in self.transcripts.values():
	            self.genes[transcript.parent].children.append(transcript.id)

	    def make_transcriptome_json(self):
	        chrom_wise = {}
	        for gene in self.genes.values():
	            for transcript_id in gene.children:
	                transcript = self.transcripts[transcript_id]
	                coords = []
	                for exon_id in transcript.children:
	                    exon = self.exons[exon_id]
	                    coords.append([int(exon.start), int(exon.end)])
	                if transcript.chrom not in chrom_wise:
	                    chrom_wise[transcript.chrom] = []
	                chrom_wise[transcript.chrom].append({
	                    'transcript_id': transcript.id,
	                    'transcript.name': transcript.name,
	                    'gene.id': gene.id,
	                    'gene.name': gene.name,
	                    'seq_coords': coords,
	                    'strand': transcript.strand
	                })
	        return chrom_wise

	    def make_genome_dict(self, fasta_file):
	        genome_dict = {}
	        for h, s in read_fasta(fasta_file):
	            genome_dict[h] = s
	        return genome_dict

	    def make_transcriptome_fasta(self, genome_file, out_fasta):
	        genome_dict = self.make_genome_dict(genome_file)
	        chrom_wise_info = self.make_transcriptome_json()
	        with open(out_fasta, 'w') as out:
	            for chrom in chrom_wise_info:
	                seq = genome_dict[chrom]
	                for t in chrom_wise_info[chrom]:
	                    t_seq = []
	                    # Exon slices are joined in stored order, using the GTF
	                    # coordinates directly as slice bounds; 'strand' is not used.
	                    for coord in t['seq_coords']:
	                        t_seq.append(seq[coord[0]:coord[1]])
	                    out.write(">%s\n%s\n" % (t['transcript_id'], "".join(t_seq)))
	        return True


	if __name__ == "__main__":
	    gtf_file = sys.argv[1]
	    gtf = GTF(gtf_file)
	    gtf.summarize()
	    gtf.make_transcriptome_fasta('./resource/genome.fa', './resource/transcripts.fasta')

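Two points are worth checking when a transcriptome FASTA is built from a GTF by hand: GTF coordinates are 1-based and inclusive, and a transcript on the minus strand is the reverse complement of the genomic sequence. The sketch below shows one way to handle both. The function names are hypothetical and it is not a drop-in replacement for the parser above.

```python
# Complement table for DNA, including N and lowercase (soft-masked) bases.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def spliced_sequence(chrom_seq, exons, strand):
    """Build a transcript sequence from exon coordinates.

    exons  -- iterable of (start, end) pairs, 1-based inclusive as in GTF
    strand -- '+' or '-'
    """
    # Sort exons by genomic position, then convert each 1-based inclusive
    # interval to a 0-based half-open Python slice: [start - 1, end).
    pieces = [chrom_seq[start - 1:end] for start, end in sorted(exons)]
    forward = "".join(pieces)
    # Minus-strand transcripts read right to left on the opposite strand.
    return reverse_complement(forward) if strand == "-" else forward


# Toy chromosome, made up for illustration.
chrom = "AAACCCGGGTTT"
print(spliced_sequence(chrom, [(1, 3), (7, 9)], "+"))
print(spliced_sequence(chrom, [(7, 9), (1, 3)], "-"))
```

Comparing a few sequences built this way against the corresponding entries in the Ensembl cDNA file is a quick sanity check before indexing.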

I'm trying to see how Salmon performs on this dataset next. I welcome your further comments.

ADD REPLY • link updated 6.6 years ago by Ram 45k • written 9.1 years ago by parashar.dhapola ▴ 160

0

Interesting; any reason to not use the existing cdna file? ~~Is the experiment you're sequencing public?~~ (I see this is in the original post ;P). I'll be interested to take a look.

ADD REPLY • link updated 6.6 years ago by Ram 45k • written 9.1 years ago by Rob 7.1k

1

Check out the latest update. It is quite interesting: Salmon actually has very high correlation with STAR+RSEM (r=0.98). You were definitely right about your tests. But this brings me back to the original question: what's wrong with the read count values from Kallisto?

ADD REPLY • link 9.1 years ago by parashar.dhapola ▴ 160

1

Thanks for the reply, John,

As you can see from the updated post, STAR's counting strategy is not so bad after all. But this is nowhere near conclusive. I wish to see how Salmon performs here.

ADD REPLY • link 9.1 years ago by parashar.dhapola ▴ 160

0

Hello parashar.dhapola!

We believe that this post does not fit the main topic of this site.

Post closed

For this reason we have closed your question. This allows us to keep the site focused on the topics that the community can help with.

If you disagree please tell us why in a reply below, we'll be happy to talk about it.

Cheers!

ADD REPLY • link updated 6.6 years ago by Ram 45k • written 9.1 years ago by parashar.dhapola ▴ 160

0

Why did you close this (and apparently remove most of the content)?

ADD REPLY • link 9.1 years ago by Devon Ryan 105k

0

I was wondering the same thing. Was the post closed by the original poster, or, considering the fairly cryptic final message, the admins? And, yes, what happened to all of the content?!

Update: ~~For posterity (or in case the OP wants to re-start the discussion) --- here is the content of the main post at the time it was closed:~~ (see below; since it's not clear exactly why the original post was closed and all of the content removed, I'm removing the below unless OP requests it).

ADD REPLY • link updated 2.7 years ago by Ram 45k • written 9.1 years ago by Rob 7.1k

1

re-opening.

ADD REPLY • link 9.1 years ago by Pierre Lindenbaum 165k

0

If the thread author wants to remove it, that's OK, no? I know it's not exactly great, but I would have thought their wishes would be most important.

Moreover, they are probably closing it because they discovered that the weird Kallisto result was due to a little user error (a different annotation file or input file being used, etc.) and just wanted the whole thing closed so as to not waste anyone else's time. Speaking as an expert on the subject, they were probably embarrassed....

ADD REPLY • link 9.1 years ago by John 13k

2

I agree, but in that case, OP should say so (e.g. via an update at the top of the message with the fixed result or some such). The last message before the post was originally closed is very cryptic, and suggests that it was closed by mods (even though that doesn't seem to be the case). EDIT: In light of John's interpretation, I'm removing the content of the original post from my status until / unless OP requests it be put back.

ADD REPLY • link updated 2.7 years ago by Ram 45k • written 9.1 years ago by Rob 7.1k

1

Agreed. If you're out there parashar, please let us know what happened so we can help other users in the future who run into the same issue :)
User-errors are far more common (and difficult to identify) than program errors - so dissecting them is non-trivial :)

ADD REPLY • link 9.1 years ago by John 13k

0

The author of the closing message is the OP. We do allow people to close their own posts, but once closed they cannot reopen them; only mods can. He might have been playing around with the options...

ADD REPLY • link 9.1 years ago by Istvan Albert 102k