I am trying to extract read sequences from a cram file. It would be easiest with pysam but this is not working for me on my work environment in DNAnexus RAP for some reason.
Is there an alternative way of printing and saving just the read sequences from a cram file with samtools e.g. to an array or text file? I want just the sequence, not any other information associated like cigar strings
Additionally, is there any way to filter the reads, e.g. according to mapping quality before displaying them?
Thank you
Thank you, I've converted them to FASTQ files, though they are now much larger.
I was planning on trying to use the read mapping quality field in the cram files to specifically extract the reads which were of low quality - is there any way for me to do that with samtools, and then I could convert them to FASTQ?
You can use the '-F' parameter in sambamba view for filtering, for example, -F "mapping_quality < 10" to obtain reads with mapping quality lower than 10. It is worth noting that versions of sambamba >= 0.8 no longer support the CRAM format, while versions < 0.8 do support the CRAM format.
Try:
samtools view -u -e 'mapq<10' in.cram | samtools fasta -
The main samtools.1 man page has a section "FILTER EXPRESSIONS" describing the syntax. http://www.htslib.org/doc/samtools.html#FILTER_EXPRESSIONS.
It's worth noting that if your read data is paired, then any filtering is problematic as it'll break pairs up.
Thank you! Sorry I just saw this, is there an alternative for paired read data? I ran this
but it is running indefinitely for several hours, much longer than just viewing the cram file itself. I'm guessing it's because my data might be paired? I wasn't sure