Question

Snp Discovery With Rnaseq And Reference

3

Entering edit mode

13.6 years ago

Leszek 4.2k

Hi,

I've data from several RNAseq experiments. I've mappend the reads onto reference and want to screen for SNPs with this. I'm able to catch them by visualising the alignment in IGV, but I cannot find them with samtools:

  samtools mpileup -ugf gem/$s.fa gem/${s}.bam | bcftools view -bvcg - > gem/${s}.raw.bcf
  bcftools view gem/${s}.raw.bcf | vcfutils.pl varFilter -D 10000 > gem/${s}.flt.vcf

I think it may be due to variable coverage and low depth for RNAseq. Could you recommend me some other tools that might be used with RNAseq reads?

UPDATE/SOLUTION:

I've wrote small script solving the problem. It incorporates only coverage and frequency of to call heterozygous sites. Hope someone else will benefit from this.

Regards,

rna snp snp reference • 6.5k views

ADD COMMENT • link updated 13.3 years ago by lh3 33k • written 13.6 years ago by Leszek 4.2k

1

Entering edit mode

I was trying your script, But got following error. I have no idea about python.

cat test_place/accepted_hits.mpileup | python heterozygosity.py -o test_place/het_test -f 0.2 -d 10

Traceback (most recent call last): File "heterozygosity.py", line 105, in <module> main( sys.argv[1:] ) File "heterozygosity.py", line 101, in main parse_mpileup( parameters["fpath"],parameters["out_base"],parameters["minDepth"],parameters["minFreq"] ) File "heterozygosity.py", line 53, in parse_mpileup contig,pos,base,cov,alg,quals=line.split('\t')#; contig,pos,base,cov,alg,quals ValueError: need more than 1 value to unpack

Could you help me ?

ADD REPLY • link 11.8 years ago by Gvj ▴ 470

1

Entering edit mode

something is wrong with you mpileup file. each line should contain 6 columns separated by tabs. check that.

ADD REPLY • link 11.8 years ago by Leszek 4.2k

0

Entering edit mode

It has 6 columns; head testplace/acceptedhits.mpileup

contig_10 86 C 1 ^S. C

contig_10 87 A 2 G^SG C@

contig_10 88 C 2 .. CC

contig_10 89 C 2 .. FC

contig_10 90 A 2 .. FF

I have "><"values in 5th Column, What does it mean ? From the pileup format I can't find a explanation for this.

ADD REPLY • link 11.8 years ago by Gvj ▴ 470

0

Entering edit mode

I know what may be the problem. The script is updated. Let me know if it works.

ADD REPLY • link 11.8 years ago by Leszek 4.2k

0

Entering edit mode

Thanks for your time. But unfortunately it doesn't work. This time every line is printed as an STD error and nothing inside output file.

Err: wrong line: contig10, 86, C, 1, ^S., C Err: wrong line: contig10, 87, A, 2, G^SG, C@ Err: wrong line: contig_10, 88, C, 2, .., CC

ADD REPLY • link 11.8 years ago by Gvj ▴ 470

score 1 · Answer 1 · 2011-04-14

I don't know if that helps, it's more a wild guess, but did you try to lower the -D 10000 parameter. I don't fully understand what it does yet, but the documentation says:

The -D option of varFilter controls the maximum read depth, which should be adjusted to about twice the average read depth. One may consider to add -C50 to mpileup if mapping quality is overestimated for reads containing excessive mismatches. Applying this option usually helps BWA-short but may not other mappers.

This one sentence is the only documentation I found on this. But for your example that would mean you had an average genome base coverage of 5000x. That would really surprise me, so how high is your read depth in relation to the genome? Also, even if you had extremely high coverage locally, you are right with your assumption that the coverage is more variable in RNA-seq than in DNA sequencing. One solution might be to remove the largest coverage outliers (e.g. ribosomal RNA, other stable RNA genes) to calculate the average coverage to adjust the -D parameter. I'd try this.

score 1 · Answer 2 · 2011-04-15

I'm not sure if it helps, as I haven't looked at the tools/pipeline/resources that are available from the methods described in this paper:

Haplotype and isoform specific expression estimation using multi-mapping RNA-seq reads

If you look at Figure 1, you can see that part of their pipeline (if you follow from Start(b)) involves calling genotypes from rna-seq data.

Maybe you can get a hold of their tools(?).

score 1 · Answer 3 · 2011-04-15

1

Entering edit mode

13.6 years ago

lh3 33k

You may try to add "-A" to mpileup. If fail, you may implement a naive SNP caller based on the pileup output.

It would be more helpful if you have the link to a screenshot or something.

ADD COMMENT • link 13.6 years ago by lh3 33k