Small SAM Examples
3
0
9.4 years ago
jdimatteo ▴ 80

Hello, please help me find/create small SAM files (e.g. with fewer than 10 reads) to help me:

  1. better understand the SAM file format, and
  2. test bamliquidator (which I helped to develop, but note that my background is in Computer Science, not Biology).

For example, to test handling of duplicate reads I manually typed up this example based on the SAM spec (tabs not preserved):

@SQ    SN:chr1    LN:50
read1    16    chr1    1    255    50M    *    0    0    ATTTAAAAATTAATTTAATGCTTGGCTAAATCTTAATTACATATATAATT    <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<    NM:i:0
read1    1032    chr1    1    255    50M    *    0    0    ATTTAAAAATTAATTTAATGCTTGGCTAAATCTTAATTACATATATAATT    <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<    NM:i:0
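
Since the tabs were lost in the listing above, here is a minimal sketch that writes the same two records with real tab separators using printf. The file name dup_example.sam is an arbitrary choice for this sketch, and the quality string is simply regenerated as one '<' per base.

```shell
#!/bin/sh
# Write the two-record example with real tab characters between fields.
# dup_example.sam is an arbitrary file name chosen for this sketch.
seq=ATTTAAAAATTAATTTAATGCTTGGCTAAATCTTAATTACATATATAATT
qual=$(printf '%50s' '' | tr ' ' '<')   # 50 '<' characters, one per base

{
  printf '@SQ\tSN:chr1\tLN:50\n'
  # Fields: QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL [tags]
  printf 'read1\t%s\tchr1\t1\t255\t50M\t*\t0\t0\t%s\t%s\tNM:i:0\n' 16   "$seq" "$qual"
  printf 'read1\t%s\tchr1\t1\t255\t50M\t*\t0\t0\t%s\t%s\tNM:i:0\n' 1032 "$seq" "$qual"
} > dup_example.sam

# Sanity check: print the number of tab-separated fields on each alignment line
# (11 mandatory fields plus the one optional NM tag).
awk -F'\t' '!/^@/ { print NF }' dup_example.sam
```

Generating the file this way avoids editors silently converting tabs to spaces, which SAM parsers generally reject.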

I hope this is correct, but having lots of tiny SAM examples could help me more quickly and confidently understand the SAM spec so I can generate better test cases.

I am about to create test cases with paired-end reads with gaps, and before doing so I hope I might find more tiny SAM examples and/or better resources given my background.
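
As one possible starting point for such a test case, here is a hand-built sketch of a single read pair, assuming "gaps" means an unsequenced stretch between the two mates. Everything in it is hypothetical: the read name pair1, the file name pair_example.sam, and the placeholder sequences. No NM/MD tags are included because no reference sequence is assumed.

```shell
#!/bin/sh
# Hypothetical hand-built pair on a 50 bp reference: mate 1 covers 1-20 (forward),
# mate 2 covers 31-50 (reverse), leaving positions 21-30 unsequenced between them.
qual=$(printf '%20s' '' | tr ' ' '<')   # 20 '<' characters, one per base

{
  printf '@SQ\tSN:chr1\tLN:50\n'
  # flag 99  = 1 + 2 + 32 + 64  (paired, proper pair, mate reverse, first in pair)
  # RNEXT "=" means same reference; PNEXT is the mate's POS; TLEN = 50 - 1 + 1
  printf 'pair1\t99\tchr1\t1\t255\t20M\t=\t31\t50\t%s\t%s\n' ACGTACGTACGTACGTACGT "$qual"
  # flag 147 = 1 + 2 + 16 + 128 (paired, proper pair, reverse, second in pair)
  # The rightmost segment carries the negative TLEN.
  printf 'pair1\t147\tchr1\t31\t255\t20M\t=\t1\t-50\t%s\t%s\n' TTGCATTGCATTGCATTGCA "$qual"
} > pair_example.sam

cat pair_example.sam
```

If "gaps" instead means gapped alignments, the CIGAR strings would need D or N operations, which this sketch does not cover.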

I have found this link useful for manually creating SAM files: http://genome.ucsc.edu/goldenPath/help/bam.html

Thanks

SAM • 18k views
ADD COMMENT
0

It's unfortunate that the SAM spec is always named SAMv1.pdf, even though the actual specification changes frequently.

(edited - I'm not really sure what I was trying to say).

Anyway, yes, using a random read generator and aligning a small number of reads to a reference is a good way to explore the SAM format, considering that the optional tags are poorly documented.

ADD REPLY
0

I'm currently experimenting with wgsim while following this helpful tutorial: http://biobits.org/samtools_primer.html

This online utility to decode/encode a SAM flag to/from plain English is helpful: http://broadinstitute.github.io/picard/explain-flags.html
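
For offline use, the same decoding can be sketched in a few lines of shell. The bit values come from the SAM spec; the function name decode_flag and the lowercase labels are my own shorthand, not official names.

```shell
#!/bin/sh
# Decode a SAM FLAG integer into its individual bits.
# decode_flag is a hypothetical helper name used only in this sketch.
decode_flag() {
  flag=$1
  out=""
  # bit value : label, in ascending bit order
  for entry in \
    1:paired 2:proper_pair 4:unmapped 8:mate_unmapped \
    16:reverse 32:mate_reverse 64:first_in_pair 128:second_in_pair \
    256:secondary 512:qc_fail 1024:duplicate 2048:supplementary
  do
    bit=${entry%%:*}
    name=${entry#*:}
    # Test whether this bit is set in the flag
    if [ $(( flag & bit )) -ne 0 ]; then
      out="${out:+$out,}$name"
    fi
  done
  printf '%s\n' "${out:-none}"
}

decode_flag 16     # reverse
decode_flag 1040   # reverse,duplicate
decode_flag 99     # paired,proper_pair,mate_reverse,first_in_pair
```

For the example in the question, decode_flag 1032 prints mate_unmapped,duplicate, since 1032 = 1024 + 8; if the second record was meant as a reverse-strand duplicate of the first, 1040 (1024 + 16) would be the matching value.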

Perhaps a read generator is the most pragmatic way of generating small correct SAM files for testing purposes.

ADD REPLY
7
9.4 years ago
jdimatteo ▴ 80

Probably the best thing to do for making small SAM examples is to simulate a small number of reads.

This is a nice overview, including how to generate reads using wgsim: http://biobits.org/samtools_primer.html

Here are sample steps to generate a single read pair from hg19:

  1. download the hg19 reference genome (here the 1000 Genomes b37 build of GRCh37, which names chromosomes 20 rather than chr20), e.g.

    wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.fai
    wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz
    gunzip human_g1k_v37.fasta.gz
    
  2. extract a single chromosome and build a bowtie2 index for it, e.g.

    samtools faidx human_g1k_v37.fasta 20 > human_g1k_v37_chr20.fasta
    bowtie2-build human_g1k_v37_chr20.fasta homo_chr20
    
  3. simulate a small sample of reads, e.g. a single read pair (-N 1):

    wgsim -N 1 human_g1k_v37_chr20.fasta single.read1.fq single.read2.fq > wgsim.out
    
  4. align the reads to generate the SAM, e.g.

    bowtie2 homo_chr20 -1 single.read1.fq -2 single.read2.fq -S single_pair.sam
    
  5. convert the SAM to BAM

    samtools view -b -S -o single_pair.bam single_pair.sam
    
  6. sort and index it

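    # samtools >= 1.3 syntax; older releases used: samtools sort single_pair.bam single_pair.sorted
    samtools sort -o single_pair.sorted.bam single_pair.bam
    samtools index single_pair.sorted.bam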
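
To read the resulting single_pair.sam by eye, it can help to print only the columns that describe the pairing. This is a minimal awk sketch that needs no extra tools; the function name sam_summary is hypothetical.

```shell
#!/bin/sh
# Print a compact per-record summary of a SAM file.
# sam_summary is a hypothetical helper name used only in this sketch.
sam_summary() {
  awk -F'\t' '
    /^@/ { next }   # skip header lines
    # columns: 1 QNAME, 2 FLAG, 3 RNAME, 4 POS, 6 CIGAR, 7 RNEXT, 8 PNEXT, 9 TLEN
    { printf "%s flag=%s %s:%s cigar=%s mate=%s:%s tlen=%s\n", $1, $2, $3, $4, $6, $7, $8, $9 }
  ' "$1"
}

# Only run against the file from step 4 if it exists.
if [ -f single_pair.sam ]; then
  sam_summary single_pair.sam
fi
```

For a properly paired result you would expect two lines with the same read name, mate fields pointing at each other, and TLEN values of equal magnitude and opposite sign.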
    

If you modify the simulated reads and/or create them from scratch, these are useful resources:

ADD COMMENT
0

After reading the comments and other answers, this is what I ended up doing, so I figured I'd share the explicit steps I took in case it helps someone else.

ADD REPLY
7
9.4 years ago

The samtools repository has small example files under samtools/examples: https://github.com/samtools/samtools/tree/develop/examples

* toy.sam
* ex1.sam.gz
..
ADD COMMENT
1
9.4 years ago

You can use the following links to download real BAM files for 1) human: http://www.1000genomes.org/data and 2) mouse: ftp://ftp-mouse.sanger.ac.uk/REL-1410-BAM/. For RNA-seq files, you can download whole-brain transcriptome data for various mouse strains here: ftp://ftp-mouse.sanger.ac.uk/current_rna_bams

The specifications for SAM format can be downloaded using this link: https://samtools.github.io/hts-specs/SAMv1.pdf

I would suggest going through the source code of existing tools that process SAM/BAM files. For example, Picard (http://broadinstitute.github.io/picard/) has a feature to validate the SAM/BAM format. You can go through the code to get an idea of what information it looks for when validating a file.
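
To give a flavour of what such validation involves, here is a minimal awk sketch of two rules from the SAM spec: every alignment line has at least 11 tab-separated fields, and the lengths of the M/I/S/=/X CIGAR operations sum to the length of SEQ. It is not how Picard implements validation and covers only a tiny fraction of what a real validator checks; the function name check_sam is hypothetical.

```shell
#!/bin/sh
# Check two basic SAM rules; exit status is non-zero if any record fails.
# check_sam is a hypothetical helper name used only in this sketch.
check_sam() {
  awk -F'\t' '
    /^@/ { next }   # skip header lines
    NF < 11 { printf "line %d: only %d fields\n", NR, NF; bad = 1; next }
    $6 != "*" && $10 != "*" {
      cigar = $6; qlen = 0
      # Consume the CIGAR one <length><op> pair at a time
      while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
        n  = substr(cigar, 1, RLENGTH - 1) + 0
        op = substr(cigar, RLENGTH, 1)
        if (op ~ /[MIS=X]/) qlen += n   # operations that consume query bases
        cigar = substr(cigar, RLENGTH + 1)
      }
      # Leftover text means the CIGAR was malformed
      if (cigar != "" || qlen != length($10)) {
        printf "line %d: CIGAR %s does not match SEQ length %d\n", NR, $6, length($10)
        bad = 1
      }
    }
    END { exit bad + 0 }
  ' "$1"
}

# Run on a file given as the first argument, if any.
if [ -n "${1:-}" ]; then
  check_sam "$1"
fi
```

A hand-typed record whose CIGAR says 50M but whose SEQ has 49 bases would be reported by the second rule, which is an easy mistake to make when writing test files manually.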

ADD COMMENT
0

Thanks, picard-tools ValidateSamFile seems helpful.

Please note, however, that I am specifically looking for small SAM files that are easy to understand manually for testing / unit testing, not real/large BAM files. (Real/large BAM files seem abundantly available, unlike small SAM test files accompanied by explanations of their meaning.)

ADD REPLY
