Extracting sequence, not reads, by position from bam
1
1
Entering edit mode
6.3 years ago
geneware ▴ 10

Hello all,

I want to extract sequence from a bam file,

suppose that I have a bam file "normal.bam",

I learned that I can use

samtools view -b normal.bam 2:129208172-129208196 > extract.bam

and then

samtools bam2fq extract.bam | seqtk seq -A > extract.fa

However, extract.fa contain the reads, not the sequence at the specific position. like in the following format

>ERR424937.5101065/1
CTCTGTCTCTGTCTCTCTCCCTCTCCCTTCTCCCCTCACCCTCTCCTCTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTATGC

I want to index the extract.fa file and use the faxid command like the following

samtools faidx extract.fa
samtools faidx extract chr2:129208172-129208196

But as you can see, the extract.fa does not contain the position information, so I can't really do that.

Can anybody help?

Thanks, Gene

sequence • 4.0k views
ADD COMMENT
0
Entering edit mode

However, extract.fa contain the reads, not the sequence at the specific position.

it's not clear to me.

ADD REPLY
0
Entering edit mode

Thanks Pierre, the position 129208172-129208196 has 25 bases, the reads fetched by samtools bam2fq is longer than that. So I think the sequence that I am interested should be a sub sequence of that?

I am a novice please correct me if I am wrong.

Thanks

ADD REPLY
0
Entering edit mode
6.3 years ago

using bioalcidaejdk: http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html

get the read in the region , loop over the cigar string, build a new sequence.

final String CONTIG="RF01";
final int START = 100;
final int END = 150;
stream().
    filter(R->!R.getReadUnmappedFlag()).
    filter(R->R.getContig().equals(CONTIG) && !(R.getStart()>END || R.getEnd()<START)).
    forEach(R->{
        final Cigar cigar = R.getCigar();
        if(cigar==null || cigar.isEmpty()) return;
        int refpos=R.getUnclippedStart();
        int readpos=0;
        final String bases = R.getReadString();
        final StringBuilder sb = new StringBuilder();
        for(final CigarElement ce:cigar)
            {
            final CigarOperator op = ce.getOperator();
            switch(op)
                {
                case P: break;
                case D: case N:
                    {
                    refpos+= ce.getLength();
                    break;
                    }
                case H:
                    {
                    for(int i=0;i< ce.getLength();++i)
                        {
                        if(refpos>=START && refpos<=END) sb.append("N");
                        refpos++;
                        readpos++;
                        }
                    break;
                    }
                case X:case EQ:case M:case S:
                    {
                    for(int i=0;i< ce.getLength();++i)
                        {
                        if(refpos>=START && refpos<=END) sb.append(bases.charAt(readpos));
                        refpos++;
                        readpos++;
                        }
                    break;
                    }
                case  I:
                    {
                    for(int i=0;i< ce.getLength();++i)
                        {
                        if(refpos>=START && refpos<=END) sb.append(bases.charAt(readpos));
                        readpos++;
                        }
                    break;
                    }
                default: throw new IllegalStateException();
                }
            }
        if(sb.length()==0) return;
        out.println(">"+R.getReadName()+":"+R.getStart()+"-"+R.getEnd());
        out.println(sb);
    });

example:

/samtools view -b src/test/resources/S1.bam  "RF01:100-150" |\
java -jar dist/bioalcidaejdk.jar -F BAM -f biostar.code 



>RF01_44_622_1:0:0_1:0:0_3a:44-113
TATTCTTCCAATAG
>RF01_44_499_0:0:0_3:0:0_7b:44-113
TATTCTTCCAATAG
>RF01_67_565_0:0:0_2:0:0_67:67-136
TATTCTTCCAATAGTGAATTAGAGAATAGATGTATTG
>RF01_94_620_1:0:0_2:0:0_15:94-163
TATTCTTCCAATAGTGAATTAGAGAATAGATGTATTGAATTTCATTCTAAA
>RF01_102_665_1:0:0_1:0:0_71:102-171
TTCTTCCAATAGTGAATTAGAGAATAGATGTCTTGAATTTCATTCTAAA
>RF01_110_504_2:0:0_1:0:0_5d:110-179
ATAGTGAATTAGATAATAGATGTATTGAATTTCCTTCTAAA
>RF01_121_598_1:0:0_3:0:0_6e:121-190
GAGAATAGATGTATTGAATTTCATTCTAAA
ADD COMMENT
0
Entering edit mode

Thank you so much Pierre. didn't realize that it can be so complicated. The manipulation of CIGAR strings is illuminating, never thought of that.

But I am still now sure how to interpret the results from the example, at which positions are those sequences in the example output?

I think it's a good idea to let you know what I am trying to do because I may have gone a long way to achieve a simple task. Here is my ultimate goal:

I know some functional mutations and their positions, all I want to do is to extract the *mers (eg. 9mers) around it from a aligned bam file, and translate these polymers into peptide, ideally in fasta format, so that I can make some functional predictions.

The above question that you helped solving is one step towards my goal, but there may be shortcuts.

Do you have experience in this and can you advise?

Thanks,

Gene

ADD REPLY
0
Entering edit mode

at which positions are those sequences in the example output?

it's the two last number RF01_44_622_1:0:0_1:0:0_3a:44-113 was mapped on 44-113

I think it's a good idea to let you know what I am trying to do be

it sounds like a XY problem... http://xyproblem.info/

I know some functional mutations and their positions, all I want to do is to extract the *mers (eg. 9mers) around it from a aligned bam file, and

why not just using snpeff ?

ADD REPLY
0
Entering edit mode

As far as I understand the code above, it will return the reference sequence, right? In case of a Deletion it would be great to output "-" instead of the reference base. Could you adapt your code accordingly?

ADD REPLY

Login before adding your answer.

Traffic: 2032 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6