Hello All,
I have aligned the query sequence with the human genome. I am interested in looking at exons of a specific gene. When I look at the IGV I can see several reads that are mapping to the gene of interest across all the exons. I am confused how can I get the read length of each read which is aligning to exon 1, exon 2 and so on.
I tried to extract the reads giving coordinates of exons using samtools. However, when i converted the resulting bam file to fasta and tried to calculate the read length this gives me the original read length and not the read length of read matching to the region I specified.
I used the following command:
For extracting the reads from original bam file for the region of interest:
samtools view -h in.bam "chr3:region1-region2" > exon.bam
Converting bam to fasta :
samtools bam2fq exon.bam | seqtk seq -A > exon.fasta
for counting the lengths of reads from fasta file:
awk '/^>/ {if (seqlen) print seqlen;print;seqlen=0;next} {seqlen+=length($0)}END{print seqlen}' exon.fasta
Am I doing anything wrong here or missing anything totally? Thank you.
Some reads are going to extend outside these boundaries. I am not sure what exactly you are trying to do? What does "tried to calculate the read length" mean?
When I tried to calculate read length of the reads which I got from ::
samtools view -h in.bam "chr3:region1-region2" > exon.bam (chr3:region1-region2 is one of the exons)
This gave me read lengths of original reads and the reads that align with the region I specified.
I am trying to get the length of reads that align with exon 1, exon 2 and so on.
Still not clear. Individual reads are all going to be more or less the same length (subject to trimmed lengths). Are you trying to see what is the length of soft clipping during alignments?
Note: Are these Nanopore/PacBio reads? Again you may have left that important bit out.
So you want to find the part of read that aligns to Exon 1 and then Exon 2 etc?
These are nanopore reads. I am trying to see what is the length of the reads that mapped to region I specified.
If you are only interested in length of the part that is aligned then you could look at the CIGAR's and flags/tags in your alignments.
Thank you. How can I identify from cigar? The cigar remains the same for the read across all the exons.
If you have the CIGAR fields in front of you, the read length is column 6.
Can you post a read that maps to multiple exons from your SAM file?
Use this tool to visualize the alignment: sam2pairwise: a tool to visualize SAM entries as pairwise alignments