extracting the soft clipped seq only from a sam file
3
4
Entering edit mode
9.9 years ago
Varun Gupta ★ 1.3k

HI Everyone,

I have a sam file something like this

SRR959756.117   272     RNU6-170P       141     9       23S21M  *       0       0       TAGAGACGGGGTCTCGCTATGTTGCTCAGGCTGGAGTGCAGTGG
SRR959756.117   272     RNU6-171P       139     9       24S20M  *       0       0       TAGAGACGGGGTCTCGCTATGTTGCTCAGGCTGGAGTGCAGTGG
SRR959756.1740  272     RNU2-65P        349     14      2S23M19S        *       0       0       CAGCTGGGATTACAGGCATGAGCCACCACGCCTGGCACCCAGCT
SRR959756.117   16      RNU6-1295P      110     9       19S25M  *       0       0       TAGAGACGGGGTCTCGCTATGTTGCTCAGGCTGGAGTGCAGTGG
SRR959756.212   256     RNU2-41P        38      1       25M19S  *       0       0       ATCTGTTCTTATCAGTTTAATATCTGATACGTCCTCTATCCGAG

(I am only showing the first 10 columns of it). I want to extract the soft clipped sequence and make a fasta file with header as >column1_column2_column3_column6. How can I do this. I am a little confused with the strand/flag column and hence having difficulty interpreting how to extract the soft clipped seq. Also where the CIGAR STRING is 2S23M19S (as in 3rd case), I just want to extract the 19S part of the seq. How can I do this.

Hope to hear soon

Regards

Varun

sam bam soft clip • 19k views
ADD COMMENT
3
Entering edit mode
9.9 years ago

I actually wrote a tool to do something similar that you could base yours off of. It's the extractSoftclipped.c program in this github repo. It extracts soft-clipped regions from alignments in a BAM or SAM file into a fastq formatted file and uses the aligned position in the header line, but it'd be easy enough to change all of that. That program also allows filtering by MAPQ and soft-clipped region length (in the case of multiple soft-clipped regions, it'll use every region above whatever the length threshold is).

ADD COMMENT
1
Entering edit mode

Hi Devon,

Your script works fine. I made a dummy bam file to test

@HD    VN:1.0    SO:unsorted
@SQ    SN:RNU6-170P    LN:304
@SQ    SN:RNU6-1295P    LN:303
@PG    ID:bowtie2    PN:bowtie2    VN:2.0.0-beta7
SRR959756.117    16    RNU6-1295P    110    9    19S25M    *    0    0    TAGAGACGGGGTCTCGCTATGTTGCTCAGGCTGGAGTGCAGTGG    VW[XY__c_f]^a^^a[dad[YYZ\V]]b]YZa\Y^[^^afaaa    AS:i:50    XS:i:42    XN:i:0    XM:i:0    XO:i:0    XG:i:0    NM:i:0    MD:Z:25    YT:Z:UU
SRR959756.117    272    RNU6-170P    141    9    23S21M    *    0    0    TAGAGACGGGGTCTCGCTATGTTGCTCAGGCTGGAGTGCAGTGG    VW[XY__c_f]^a^^a[dad[YYZ\V]]b]YZa\Y^[^^afaaa    AS:i:42    XS:i:42    XN:i:0    XM:i:0    XO:i:0    XG:i:0    NM:i:0    MD:Z:21    YT:Z:UU

and ran the tool as

extractSoftclipped test.bam > fest.fq.gz

and my fastq looks like

@RNU6-1295P:109
TAGAGACGGGGTCTCGCTA
+
VW[XY__c_f]^a^^a[da
@RNU6-170P:140
TAGAGACGGGGTCTCGCTATGTT
+
VW[XY__c_f]^a^^a[dad[YY

I just want read name and flag info in the header. Is it possible for you to modify it?

Is it this line in your c file where I have to modify it

fprintf(of, "@%s:%"PRId32"\n", hdr->target_name[b->core.tid], pos);
ADD REPLY
1
Entering edit mode

Yup, you would want something like:

fprintf(of, ">%s_%i_%i\n", bam_get_qname(b), b->core.tid, b->core.flag);

That won't add the CIGAR, which actually requires a loop since it's not stored as a string (I can show you how to add this if you really want it). You can also comment out (Just add // to the beginning) lines 26-29 so you get a fasta file instead (I already changed the @ to a > above so the headers will be correct).

ADD REPLY
0
Entering edit mode

Hi Devon,

If I also wanted to include the position pos which you already had in your original code , the new code should look like this

fprintf(of, ">%s_%s_%i_%"PRId32"\n", bam_get_qname(b), hdr->target_name[b->core.tid], b->core.flag,pos+1);

Also It would be nice if you can show me how to add the CIGAR string in the header itself. Also may I ask , is this all coming from pysam and C together?

ADD REPLY
0
Entering edit mode

That looks correct for adding position. I'm glad you caught the tid mistake!

To insert a CIGAR string (we'll add it to the end), you'd do something like:

Note lines 23-24, which are the main changes. I've not tested this, but something along those lines should work.

ADD REPLY
0
Entering edit mode

BTW, this is all pure C. There are other programs in that repository written in python, but this isn't one of them.

ADD REPLY
0
Entering edit mode

Hi Devon,

Can you also add a feature for which end does the softclip reads comes from (5 Prime or 3 Prime) ? This one is very important for my downstream analysis and I've no idea about C...

Thank you!

ADD REPLY
1
Entering edit mode

Hi Devon,

I am interested in extractSoftclipped C script, which unfortunately I can't run. I downloaded the extractSoftclipped.c and Makefile files to my project directory. The next step I am trying to compile the C code with make function. So I run the following:

make extractSoftclipped

And I get:

make -C htslib
make: *** htslib: No such file or directory. Stop.
make: *** [htslib] Error 2

If I compile with:

gcc extractSoftclipped.c

I get following error:

extractSoftclipped.c:1:24: fatal error: htslib/sam.h: No such file or directory
compilation terminated

Could you please help me with this issue, as I have never done this type of work before and just don't know how ELSE to tackle the problem. Probably I am running the script from the wrong directory,

I would appreciate your help.
Thanks

ADD REPLY
1
Entering edit mode

you need to install the C htslib library

ADD REPLY
1
Entering edit mode

What Pierre said, though you can just git submodule update --init while in the directory with that source code. I tend to put htslib in as a submodule on everything that uses it.

ADD REPLY
0
Entering edit mode

Thank you Pierre Lindenbaum and Devon Ryan for prompt respond,

we have all libraries at the right location, but for some reasons I get the same error message. Unfortunately, I have never been dealing with such stuff so it's very difficult for me to debug the problem. However, I assume that it can't find a path for unknown to me reasons. Is it possible to get pre-compiled version of that particular script, because I am interest in extracting soft clipped reads.

Or if its not difficult for you could you navigate me in debugging

Thanks once again,

ADD REPLY
1
Entering edit mode

If you just type the following it should work:

git clone https://github.com/dpryan79/SE-MEI.git
cd SE-MEI
git submodule update --init
make
ADD REPLY
0
Entering edit mode

Hi, IrK, sorry for bothering you. Have you solved this problem? It will be very appreciated that you can give some advice.

ADD REPLY
0
Entering edit mode

HI Devon,

No idea about C. I will still look at the code to make out some thing from it..

ADD REPLY
0
Entering edit mode

You might wanna check out Devon's solution. His perspective is more relevant than mine - I see the sequence as a string and do not make any assumptions on the features, which is not a good approach at times.

ADD REPLY
0
Entering edit mode

I tried out the script. And it seems to work. Every once in a while I get a read that is peculiar. Probably, my expectation is off. This is one of the peculiar softclipped parts:

@T:2468
GTAAAACAATTTAAAA
+
JJJJJJJJJJJJJJJJ

I looked for a read, mapping on chromosome T at position 2469 in the file that I fed into the extractSoftclipped.c, yet there is none like it. This is my expectation. But maybe this is only true for reads that are softclipped at front? These are the reads, that fall into the the region of position T : 2468:

HWI-ST999:188:C49E6ACXX:8:1102:18487:21113  147 T   2443    41  23M1D3M1D25M    =   2362    -134 TTACCAAATTATAAACAAATAAGTCAATTTAATATTTTATATACGAAATCA    GIGJJJJJJJJJJJIJJJJJJJJJJJJJJJJJJJIJJJHHHHHFFFFFCCC AS:i:78 XN:i:0  XM:i:1  XO:i:2  XG:i:2  NM:i:3  MD:Z:23^T3^T7T17    YS:i:78 YT:Z:CP

HWI-ST999:188:C49E6ACXX:8:1101:12833:68128  163 T   2458    41  8M1D3M1D40M =   2563    156 CAAATAAGTCAATTTAATATTTTATATACGAAATCAATTAAATCTTAATCA B@?DDEDDBFHFHJJJIJJJIJJIJJGGJIJJIIIIJJIJJJIHIIIJGGF AS:i:78 XN:i:0  XM:i:1  XO:i:2  XG:i:2  NM:i:3  MD:Z:8^T3^T7T32 YS:i:87 YT:Z:CP

HWI-ST999:188:C49E6ACXX:8:1101:7229:39427   133 T   2471    0   *   =   2471    0   CGCCATTTATATATGTTTTTGTTCCTGATGAGCCACATAACTTATAGATTT <@?=DD;DAHDB4CE:@F>B<?EHIBC4FHCG*:ED><9DGEDD9HBF>FD YT:Z:UP

HWI-ST999:188:C49E6ACXX:8:1101:7229:39427   89  T   2471    36  7S44M   =   2471    0   TAAGTCAATTTAATATTTTATATACAAAATCAATTAAATCTTAATCACATA <GHHB<IGAGGDIGGGJJIJIIFEIFJIGJIIIGIJIGGHHDHFEDDF@@@ AS:i:74 XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:7T10G25    YT:Z:UP

HWI-ST999:188:C49E6ACXX:8:1101:15690:30209  163 T   2484    36  49M2S   =   2752    318 TATACGAAATCAATTAAATATTAATCACATAAATATTAGCACAAAAAAATA CCCFFFFFHHHHHJJJJJJJJJJJJJJIJJIJJIJJIJJIJJJIJJJJJIH AS:i:74 XN:i:0  XM:i:3  XO:i:0  XG:i:0  NM:i:3  MD:Z:19C13A4A10 YS:i:76 YT:Z:CP

I am sorry to bother you with this.

ADD REPLY
1
Entering edit mode
9.9 years ago

I wrote a tool named SamExtractClip but I haven't much used it.

input is a SAM/BAM, output is a FASTQ

$ curl -L -s "https://raw.githubusercontent.com/samtools/samtools/develop/test/dat/mpileup.1.sam" | java -jar dist/samextractclip.jar 2> /dev/null 
@ERR013140.3521432/1:0:17:1:99
AGAGGTCCCCAACTTCTTTGCA
+
@AEDGBHIIIIIFJGIKHGHIJ
@ERR156632.12704932/2:0:17:1:163
TGGAGAAGGGGACAAGAGGTCCCCAACTTCTTTGCA
+
BFAFGFEIGFEFHHEIDKJGHHHJIIE=@KKGGKJG
@ERR156632.9601178/1:0:17:1:99
CTATGACAGGGAGGTCATGTGCAGGCTGGAGAAGGGGACAAGAGGTCCCCAACTTCTTTGCA
+
DEEEIIHHKIJILKHLHIKEKHHMKLKKJGKKKKLKLFIHEKIKL=KLJLKIILHKMH9LJJ
@ERR162872.21706338/1:0:17:1:99
CTTCTTTGCA
+
BHBFH<​EIFG
@ERR243039.1049231/2;0;17;1;163;43S57M;5'
TGCAGGCTGGAGAAGGGGACAAGAGGTCCCCAACTTCTTTGCA
+
AEEFIFHIJDGIGIJHHIAGGGLGJIEJHJHHFIJGJJDFJIG
ADD COMMENT
0
Entering edit mode

Hi Pierre

I am still working on it to download. Meanwhile I have a doubt. So if I have a CIGAR STRING as 2S23M19S then will it output only the 19S part of the seq because I see -m option which has a default of 5.

Also installing the tool with this command

git clone "https://github.com/lindenb/jvarkit.git"

gave me this error

Cloning into jvarkit...
warning: templates not found /usr/local/share/git-core/templates
error: SSL certificate problem, verify that the CA cert is OK. Details:
error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed while accessing https://github.com/lindenb/jvarkit.git/info/refs

fatal: HTTP request failed
ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode

Hi Pierre

I downloaded the tool and ran it. I used this as my dummy sam file

@HD    VN:1.0    SO:unsorted
@SQ    SN:RNU6-170P    LN:304
@SQ    SN:RNU6-1295P    LN:303
@PG    ID:bowtie2    PN:bowtie2    VN:2.0.0-beta7
SRR959756.117    16    RNU6-1295P    110    9    19S25M    *    0    0    TAGAGACGGGGTCTCGCTATGTTGCTCAGGCTGGAGTGCAGTGG    VW[XY__c_f]^a^^a[dad[YYZ\V]]b]YZa\Y^[^^afaaa    AS:i:50    XS:i:42    XN:i:0    XM:i:0    XO:i:0    XG:i:0    NM:i:0    MD:Z:25    YT:Z:UU
SRR959756.117    272    RNU6-170P    141    9    23S21M    *    0    0    TAGAGACGGGGTCTCGCTATGTTGCTCAGGCTGGAGTGCAGTGG    VW[XY__c_f]^a^^a[dad[YYZ\V]]b]YZa\Y^[^^afaaa    AS:i:42    XS:i:42    XN:i:0    XM:i:0    XO:i:0    XG:i:0    NM:i:0    MD:Z:21    YT:Z:UU

But I get only this output

@SRR959756.117:0:RNU6-1295P:110:16
TAGAGACGGGGTCTCGCTA
+
VW[XY__c_f]^a^^a[da

Why not the other read??

Also for some alignments, my quality string , 11th column is a *. Would that matter??

ADD REPLY
0
Entering edit mode
9.9 years ago
Ram 44k

awk should do it:

awk -F "\t" '$6 ~ /19S$/ { print ">"$1"_"$2"_"$3"_"$6"\n"substr($10,27,19); next } { print ">"$1"_"$2"_"$3"_"$6"\n"$10 }'

Assumptions:

  1. All sequences are of equal length
  2. All CIGAR strings have 19S at the end when appropriate
ADD COMMENT
0
Entering edit mode

Hi Ram

This gives me this error

awk: $6 ~ /19S$/ { print ">"$1"_"$2"_"$3"_"$6\nsubstr($10,44,8)} { print ">"$1"_"$2"_"$3"_"$6\n$10 }  
awk:                                         ^ backslash not last character on line
ADD REPLY
1
Entering edit mode

You need to surround the \n with quotes:

awk -F "\t" '$6 ~ /19S$/ { print ">"$1"_"$2"_"$3"_"$6"\n"substr($10,44,8)} { print ">"$1"_"$2"_"$3"_"$6"\n"$10 }'
ADD REPLY
0
Entering edit mode

Hmmm. I'll try and tweak it.

ADD REPLY
0
Entering edit mode

I made a wrong assumption. How do you pick the 19S part of a sequence? Can you demonstrate with the 3rd case as your example?

ADD REPLY
0
Entering edit mode

It's CCACGCCTGGCACCCAGCT, the last 19 bases. Aside from hard-clipped bases, soft-clipped bases will always be on the start/end of reads.

ADD REPLY
0
Entering edit mode

Hi Devon I agree with you but doesn't flag info matter. Like read mapping to forward or the reverse strand? Like in the third case read maps to the reverse strand.

ADD REPLY
1
Entering edit mode

The flag tells you the orientation of the original read relative to what you're seeing. The soft-clipped region of the original read is AGCTGGGTGCCAGGCGTGG, which is just the reverse complement of what I wrote above. Whether you care to reverse complement or not depends on what you want to do next.

ADD REPLY
0
Entering edit mode

So basically i want to extract all these soft clip sequences and then map them to hg19 genome. So tools like bowtie will automatically align the read to forward or reverse strand, but AGCTGGGTGCCAGGCGTGG is basically the actual 5p part of the read SRR959756.1740 from the fastq file

@SRR959756.1740 HWI-EAS222_0038:7:1:14629:1494 length=50
<code>AGCTGGGTGCCAGGCGTGG</code>TGGCTCATGCCTGTAATCCCAGCTG
+SRR959756.1740 HWI-EAS222_0038:7:1:14629:1494 length=50
gggggggfgddfdefcafcZ^``\adbddfgdfdfdfbfafffg
ADD REPLY
0
Entering edit mode

Right, the github repo has tools that are basically doing something very similar to what you're talking about (they can be used to find novel mobile element insertions using soft-clipped single-end reads). If your end goal is to find either large deletions or chromosomal breakpoints then tracking orientation might make life simpler (it'd be easier to locate the exact breakpoint). The original tool I wrote wasn't intended for those purposes, so I didn't bother. Regardless, you'd get the same alignment of the soft-clipped region either way, just the flag would be different.

ADD REPLY
0
Entering edit mode

if that's the case, add another comparison on the strand field and reverse the string based on the strand.

ADD REPLY
0
Entering edit mode

Reverse complement, not reverse :)

ADD REPLY
0
Entering edit mode

Ah, of course. I'm good with awk, bad with biology, it would seem :)

ADD REPLY
0
Entering edit mode

I suspect I've just made that exact mistake a few more times than you, give it time :)

ADD REPLY
0
Entering edit mode

Thank you for the reassurance, Devon :)

ADD REPLY
0
Entering edit mode

I am interested in any soft clipped thing which is equal to 19 or more than that like 19S,20S,21S,22S etc. The assumption for 19S and more(20S,21S etc) is random. Also is your code giving read seq for 23S as shown in the example?

ADD REPLY
0
Entering edit mode

If you're interested in all >=19S, don't compare to strings ending in 19S, compare to strings containing the regex "(19|[2-9][0-9])S"

ADD REPLY

Login before adding your answer.

Traffic: 2546 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6