Question

Keep the paired reads whose length less than 400,000 bp

0

Entering edit mode

5.1 years ago

ek699 ▴ 10

Hi, I am very new to bioinformatics and I am struggling with filtering out my bam file from STAR alignment.

Previously, I filtered out the original bam file by keeping only chr1-19, X, Y, and M and also by keeping only properly paired reads.

And the current bam file head looks like:

A00521:147:H2H22DSXY:1:1262:16640:7294  99      chr1    3176591 0       101M    =       3176597 107     CCCAGAATGCATTCCTGAACTCTTCACCCCAGAATGCATTCCTGAACTCCTCACCCTAGAGTTAGAACCCTCCCAACTAAAAACTGTTCCAAGAACATTTT    FFFFFFFFFFFFF:FFFFFFFFF,FFFFFF::FFFFFFFF:FFFFFFFF:FFFFFFF:FFFFFFFFFFF:FFF:FFFFFFFFFF,FFF,FFFFFFFFFF:F    NH:i:7  HI:i:1  AS:i:192        nM:i:4
A00521:147:H2H22DSXY:1:1262:16640:7294  147     chr1    3176597 0       101M    =       3176591 -107    ATGCATTCCTGAACTCTTCACCCCAGAATGCATTCCTGAACTCCTCACCCTAGAGTTAGAACCCTCCCAACTAAAAACTGTTCCAAGAACATTTTTGAGAT    FFFFFF:FF::FFF,F::FFFFFFFFF:FFFFFFFFFFF:F,FF:FFFFFF:FFF:FFFF:FFFFFFFF:FFF,FFFFFFFFFFFF,FFFFFFFFFFFFFF    NH:i:7  HI:i:1  AS:i:192        nM:i:4
A00521:147:H2H22DSXY:1:2417:15835:21026 355     chr1    3501034 0       100M    =       3501059 126     GCCCATATCTTCAAGGCTTTTCCCCACCTTCTCCTCTATAAGTTTCAGTGTCTCTGGTTTTATGTGAAGTTCCTTGATCCACTTAGATTTGACCTTAGTA     FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFF     NH:i:5  HI:i:3  AS:i:199        nM:i:0
A00521:147:H2H22DSXY:1:2417:15835:21026 403     chr1    3501059 0       101M    =       3501034 -126    ACCTTCTCCTCTATAAGTTTCAGTGTCTCTGGTTTTATGTGAAGTTCCTTGATCCACTTAGATTTGACCTTAGTACAAGGAGATAAGTATGGATCGATTCG    FFF::FFFFFFFFF:FFFFFFFF:FFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFF:F    NH:i:5  HI:i:3  AS:i:199        nM:i:0

And the current bam file flagstat looks like:

36466002 + 0 in total (QC-passed reads + QC-failed reads)
2598452 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
36466002 + 0 mapped (100.00% : N/A)
33867550 + 0 paired in sequencing
16933775 + 0 read1
16933775 + 0 read2
33867550 + 0 properly paired (100.00% : N/A)
33867550 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

From here, how can I filter out the current bam file by the length between each pair of reads less than 400,000 bp??

I am pretty lost here, so I am thankful for any help.. Thank you so much.

RNA-Seq alignment bam sam filter • 2.1k views

ADD COMMENT • link 5.1 years ago by ek699 ▴ 10

0

Entering edit mode

do you mean 400bp or 400Kbp ? (the later sounds rather strange though)

what did you align: RNA or DNA reads ?

ADD REPLY • link 5.1 years ago by lieven.sterck 15k

0

Entering edit mode

I aligned RNA-Seq. Based on a given task, 400Kbp is correct though.. The distance between each pair of reads should be less than 400Kbp...

ADD REPLY • link 5.1 years ago by ek699 ▴ 10

0

Entering edit mode

Does STAR allow for reads to map that far apart? With bbmap one would need to set maxindel=400000 for it to allow for reads to map that far apart.

ADD REPLY • link 5.1 years ago by GenoMax 151k

0

Entering edit mode

I used STAR aligner for bulk RNA-seq alignment. 400bp makes sense more?? The only instruction I got is to filter out the bam file so that the distance between each pair of reads should be less than 400,000 bp.. If it were 400bp, how can I proceed further? I didn't add any special option for the value 400,000 bp when I aligned it (mouse genome, GRCm38) though.. When I used STAR, I only kept the default options.

ADD REPLY • link 5.1 years ago by ek699 ▴ 10

0

Entering edit mode

The default for relevant options in STAR appear to be

--alignIntronMax  
default:0
    maximum intron size, if 0, max intron size will be determined by(2ˆwinBinNbits)*winAnchorDistNbins
--alignMatesGapMax  
default:0
    maximum gap between two mates, if 0, max intron gap will be determined by(2ˆwinBinNbits)*winAnchorDistNbins

If you are truly looking for reads that come from exons that are 400kb apart then you may need to re-run STAR with those two options set to a value higher than 400kb.

ADD REPLY • link 5.1 years ago by GenoMax 151k

0

Entering edit mode

I resolved problem with filtering by TLEN

ADD REPLY • link 5.1 years ago by ek699 ▴ 10

0

Entering edit mode

Ah, I think you're referring to the insert size. Check out this post on filtering a bam file on insert size. Also you can use the bbmap tools to look at a histogram of your insert size for some diagnostics

ADD REPLY • link 5.1 years ago by andrew.j.skelton73 6.6k

0

Entering edit mode

maybe you should also state what is it that you are trying to achieve

right now you seem to assume that filtering by 400Kb will take you this goal, frankly it does not sound right.

I can't quite think of a RNA-Seq analysis where filtering for an insert size under 400Kb would make sense.

And to answer your question, if all you want is to filter properly paired alignments, that information is already present in your BAM file in the flag column, for example 99:

samtools flags 99
0x63    99      PAIRED,PROPER_PAIR,MREVERSE,READ1

to keep the proper pairs you could filter with:

samtools view -b -f 2 -F 4 input.bam > output.bam

ADD REPLY • link 5.1 years ago by Istvan Albert 102k