From primers sequence to Bed file
4
0
Entering edit mode
16 months ago
ManuelDB ▴ 110

I have the sequence of some primers and I need to obtain the Start and End coordinates of these primers.

I have been given two files

p1.fa

>TP53_02
TTGGAAGTGTCTCATGCTG
>TP53_03
ATGGGACTGACTTTCTGCTC
>TP53_04a
tctgactgctcttttcaccc >TP53_04b
ACCAGCAGCTCCTACACC
>TP53_05
TGCCCTGACTTTCAACTCTG
>TP53_06
CCCAGGCCTCTGATTCC

p2.f

>TP53_02
GGCCTGCCCTTCCAATG
>TP53_03
CCAGCCCAACCCTTGTC
>TP53_04a
AGGGACAGAAGATGACAGGG
>TP53_04b
GGCCAGGCATTGAAGTC
>TP53_05
AGCCCTGTCGTCTCTCCAG
>TP53_06
AGGGCCACTGACAACCAC

I also have the code as shown below

#!/bin/bash
set -euo pipefail


module load bwa
module load bedtools
module load samtools

# map primers to reference sequence
echo "RUNNING BWA MEM"

bwa mem \
-k 8 \
./Homo_sapiens_assembly38.fasta \
p1.fa \
p2.fa | \
samtools view \
-b > out.bam

samtools view -h -o out.sam out.bam


# convert BAM to bedpe format
echo "Converting BAM to bedpe format"

bedtools \
bamtobed \
-i out.bam \
-bedpe > primers.txt

echo "END"

The output of BWA is

[M::bwa_idx_load_from_disk] read 3171 ALT contigs
[M::process] read 36 sequences (732 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 2, 0, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] skip orientation FR as there are not enough pairs
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 36 reads in 0.081 CPU sec, 0.080 real sec
[main] Version: 0.7.17-r1188

and the BAM/SAM obtained doesn't show any mapping

@PG     ID:samtools.1   PN:samtools     PP:samtools     VN:1.16.1       CL:samtools view -h -o out.sam out.bam
TP53_02 77      *       0       0       *       *       0       0       TTGGAAGTGTCTCATGCTG     *       AS:i:0  XS:i:0
TP53_02 141     *       0       0       *       *       0       0       GGCCTGCCCTTCCAATG       *       AS:i:0  XS:i:0
TP53_03 77      *       0       0       *       *       0       0       ATGGGACTGACTTTCTGCTC    *       AS:i:0  XS:i:0
TP53_03 141     *       0       0       *       *       0       0       CCAGCCCAACCCTTGTC       *       AS:i:0  XS:i:0
TP53_04a        77      *       0       0       *       *       0       0       TCTGACTGCTCTTTTCACCC    *       AS:i:0  XS:i:0
TP53_04a        141     *       0       0       *       *       0       0       AGGGACAGAAGATGACAGGG    *       AS:i:0  XS:i:0
TP53_04b        77      *       0       0       *       *       0       0       ACCAGCAGCTCCTACACC      *       AS:i:0  XS:i:0
TP53_04b        141     *       0       0       *       *       0       0       GGCCAGGCATTGAAGTC       *       AS:i:0  XS:i:0
TP53_05 77      *       0       0       *       *       0       0       TGCCCTGACTTTCAACTCTG    *       AS:i:0  XS:i:0
TP53_05 141     *       0       0       *       *       0       0       AGCCCTGTCGTCTCTCCAG     *       AS:i:0  XS:i:0
TP53_06 77      *       0       0       *       *       0       0       CCCAGGCCTCTGATTCC       *       AS:i:0  XS:i:0
TP53_06 141     *       0       0       *       *       0       0       AGGGCCACTGACAACCAC      *       AS:i:0  XS:i:0
TP53_07 77      *       0       0       *       *       0       0       GCCACAGGTCTCCCCAAG      *       AS:i:0  XS:i:0
TP53_07 141     *       0       0       *       *       0       0       GTGTGCAGGGTGGCAAGTG     *       AS:i:0  XS:i:0
TP53_08 77      *       0       0       *       *       0       0       AGGACCTGATTTCCTTACTG    *       AS:i:0  XS:i:0
TP53_08 141     *       0       0       *       *       0       0       ATCTGAGGCATAACTGCAC     *       AS:i:0  XS:i:0
TP53_09 77      *       0       0       *       *       0       0       CAGTTATGCCTCAGATTCAC    *       AS:i:0  XS:i:0
TP53_09 141     *       0       0       *       *       0       0       CCACTTGATAAGAGGTCCC     *       AS:i:0  XS:i:0
TP53_10 77      *       0       0       *       *       0       0       GTACTGTGTATATACTTACTTCTCC       *       AS:i:0  XS:i:0
TP53_10 141     *       0       0       *       *       0       0       TGGCCAGGAAGGGGCTGAG     *       AS:i:0  XS:i:0
TP53_11 77      *       0       0       *       *       0       0       TCTCACTCATGTGATGTCATC   *       AS:i:0  XS:i:0
TP53_11 141     *       0       0       *       *       0       0       CTGTCAGTGGGGAACAAGAA    *       AS:i:0  XS:i:0
FLT3_TKD1N676   77      *       0       0       *       *       0       0       CTGGGAAGCCACGAGAATATTG  *       AS:i:0  XS:i:0
FLT3_TKD1N676   141     *       0       0       *       *       0       0       GTTACCTGACAGTGTGCACG    *       AS:i:0  XS:i:0
FLT3_TKD1F691   77      *       0       0       *       *       0       0       GAAACTGTAACTATTTCAGGACCA        *       AS:i:0  XS:i:0
FLT3_TKD1F691   141     *       0       0       *       *       0       0       TTGAGAAGATCACCATAGCAAC  *       AS:i:0  XS:i:0
FLT3_TKD2       77      *       0       0       *       *       0       0       AAAGTGGTGAAGATATGTGACT  *       AS:i:0  XS:i:0
FLT3_TKD2       141     *       0       0       *       *       0       0       TCACATTGCCCCTGACAAC     *       AS:i:0  XS:i:0
NPM1_860        77      *       0       0       *       *       0       0       ATGTCTATGAAGTGTTGTGGTTCC        *       AS:i:0  XS:i:0
NPM1_860        141     *       0       0       *       *       0       0       TGTTACAGAAATGAAATAAGACGG        *       AS:i:0  XS:i:0
IDH2_R140       77      *       0       0       *       *       0       0       ATGTGGAAAAGTCCCAATGGAAC *       AS:i:0  XS:i:0
IDH2_R140       141     *       0       0       *       *       0       0       TGTTTTTGCAGATGATGGGCTC  *       AS:i:0  XS:i:0
IDH2_R172       77      *       0       0       *       *       0       0       GACCAAGCCCATCACCATTG    *       AS:i:0  XS:i:0
IDH2_R172       141     *       0       0       *       *       0       0       GCCTACCTGGTCGCCATGG     *       AS:i:0  XS:i:0
IDH1_R132       77      *       0       0       *       *       0       0       GTCTTCAGAGAAGCCATTATCTGC        *       AS:i:0  XS:i:0
IDH1_R132       141     *       0       0       *       *       0       0       CAAAATCACATTATTGCCAACATGA       *       AS:i:0  XS:i:0

and consequently, the bed file doesn't show anything.

.       -1      -1      .       -1      -1      TP53_02 0       .       .
.       -1      -1      .       -1      -1      TP53_03 0       .       .
.       -1      -1      .       -1      -1      TP53_04a        0       .       .
.       -1      -1      .       -1      -1      TP53_04b        0       .       .
.       -1      -1      .       -1      -1      TP53_05 0       .       .
.       -1      -1      .       -1      -1      TP53_06 0       .       .
.       -1      -1      .       -1      -1      TP53_07 0       .       .
.       -1      -1      .       -1      -1      TP53_08 0       .       .
.       -1      -1      .       -1      -1      TP53_09 0       .       .
.       -1      -1      .       -1      -1      TP53_10 0       .       .
.       -1      -1      .       -1      -1      TP53_11 0       .       .
.       -1      -1      .       -1      -1      FLT3_TKD1N676   0       .       .
.       -1      -1      .       -1      -1      FLT3_TKD1F691   0       .       .
.       -1      -1      .       -1      -1      FLT3_TKD2       0       .       .
.       -1      -1      .       -1      -1      NPM1_860        0       .       .
.       -1      -1      .       -1      -1      IDH2_R140       0       .       .
.       -1      -1      .       -1      -1      IDH2_R172       0       .       .
.       -1      -1      .       -1      -1      IDH1_R132       0       .       .

What I am doing wrong?? It should work because I used to do this task before with the same code and input (I believe) and that worked.

bedtools bwa • 2.1k views
ADD COMMENT
1
Entering edit mode

This is the perfect use case for UCSC blat: https://genome.ucsc.edu/cgi-bin/hgBlat?command=start

   ACTIONS      QUERY    SCORE START   END QSIZE IDENTITY  CHROM  STRAND  START       END   SPAN
------------------------------------------------------------------------------------------------
browser details TP53_03     20     1    20    20   100.0%  chr17  -     7676415   7676434     20
browser details TP53_04a    20     1    20    20   100.0%  chr17  -     7676281   7676300     20
ADD REPLY
2
Entering edit mode
16 months ago
rfran010 ★ 1.3k

Perhaps you need to adjust the minimum output score to account for the small read sizes.

-T INT        minimum score to output [30]
ADD COMMENT
0
Entering edit mode

I though the -k 8 is already doing that. What -T value would you recommend???

ADD REPLY
0
Entering edit mode

That should be the seed size. From my understanding, that means to even consider an alignment there needs to be a stretch of 8 bp that perfectly align. Then, if the rest of the alignment contains mismatches, those are factored into alignment score using the scoring matrix.

ADD REPLY
0
Entering edit mode

I have read the documentation and I beliebe this doesn't help becaouse this is to limite the output by filtering out alignments with lower than the -T value

ADD REPLY
0
Entering edit mode

Default value appears to be 30. If your primers are less than 30 bp, this seems impossible unless you adjust the scoring scheme. If the alignment scores lower than this output filter, it will be reported as unaligned.

ADD REPLY
0
Entering edit mode

That worked!! I have tried with -T 20 and many of the sequences have now an START and END. This makes me think the following question: I though that as these sequences are amplicon, this means that there are perfect match with the start and end of the DNA we want to sequence. If I need to reduce the score to be able to map these sequences, this means that there is not a perfect match. Or the low score is only due to the sequence is very short resulting in a low score.

ADD REPLY
0
Entering edit mode

Right, scoring depends on the aligner. In this case bwa-mem grants +1 per match and penalties (negative points) for different mismatches or gaps. So if you only have 20bp, even if its perfectly aligned, you only can score 20. But if there was a mismatch (19/20 bp aligned), you might score 16. It seems like you may have primers shorter than 20bp, so it would be useful to decrease the -T score even more. At least to the minimum primer length, if you still want to use bwa-mem.

The amplicon (the product of PCR with the primer pair) is technically unknown until the primers are mapped and is often not accounted for in the score, but instead used as a filter. E.g. if the primers map more than 2,000 bp away, the aligner may reject this alignment as valid since the assumption is the fragment is not that large. So, after mapping, the distance between the primers is the amplicon, wherever in the genome that is.

Or the low score is only due to the sequence is very short resulting in a low score.

In your case, yes. You can adjust other parameters to make sure the matches are 100% match, as suggested by ATpoint.

Also note, there is a difference between alignment score and mapping score.

ADD REPLY
0
Entering edit mode

Great explanation. Thank you so much!

ADD REPLY
3
Entering edit mode
16 months ago
tbayer ▴ 50

You could also use usearch's search_oligodb command for this, however you would have to convert its output to bed format or define it using the user defined fields. usearch also has a search_pcr command if you actually need amplicon sequences not just primer matches.

ADD COMMENT
2
Entering edit mode
16 months ago
ATpoint 85k

I usually do this with bowtie2, setting all mismatch and penalties to a large number to only keep perfect end-to-end matches:

atpoint@DESKTOP-O7JK99H:~$ cat genome.fa
>chrX
TTGGAAGTGTCTCATGCTGATGGGACTGACTTTCTGCTCtctgactgctcttttcacccACCAGCAGCTCCTACACCTGCCCTGACTTTCAACTCTGCCCAGGCCTCTGATTCC

atpoint@DESKTOP-O7JK99H:~$ cat primers.fa
>TP53_02
TTGGAAGTGTCTCATGCTG
>TP53_03
ATGGGACTGACTTTCTGCTC
>TP53_04a
tctgactgctcttttcaccc
>TP53_04b
ACCAGCAGCTCCTACACC
>TP53_05
TGCCCTGACTTTCAACTCTG
>TP53_06
CCCAGGCCTCTGATTCC

atpoint@DESKTOP-O7JK99H:~$ bowtie2 -x genome --end-to-end --rdg 10000,10000 --rfg 10000,10000 -f -U primers.fa 2> /dev/null | samtools view -q 42 -b | bedtools bamtobed -i -
chrX    0       19      TP53_02 42      +
chrX    19      39      TP53_03 42      +
chrX    39      59      TP53_04a        42      +
chrX    59      77      TP53_04b        42      +
chrX    77      97      TP53_05 42      +
chrX    97      114     TP53_06 42      +

Alignment via bowtie2, then samtools to filter for perfect mapping score (42 is the highest in bowtie2), and then convert to bed with bedtools. All done without any custom awk/sed-fu. You need to index the genome first of course.

ADD COMMENT
0
Entering edit mode

Thanks you so much. I am working on your answer. I will come back to you soon.

ADD REPLY
1
Entering edit mode
16 months ago
ManuelDB ▴ 110

Just to explain what I have done. I have finally used BWA MEM reducing the T parameters to 10 and all primers seems to be mapped correctly.

Thank you so much all for helping to deal with this task and helping to improve :)

ADD COMMENT

Login before adding your answer.

Traffic: 2781 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6