Map Fasta to Fasta + Bed file
4.7 years ago

I've been thinking a lot about this problem and I don't know how to solve it. I have a FASTA file of microRNA precursors and a BED file with the coordinates of these precursors in a genome.

FASTA file of microRNA precursors:

>LQNS02278089.1_34108
CGGTCGTGATGGGAGCAAATTTGAACAATTAAATAGCAAATTGCACTCGTCCCGGCCTGC
>LQNS02278089.1_34106
CGGTCGTGATTGGTGCAACTTGGGTCACTTAACCGCCAATTGCACTGATCCCGGCCTGC
>LQNS02278089.1_34110
CGGCCGGTATGAGGGCAAATCAATTTCTGTATAAATGACGAATTGCACTCGTCCCGGCCTTC

BED file of precursors:

LQNS02278089.1  848170  848230  LQNS02278089.1_34108    2249659.3   -   848170  848230  0,0,255
LQNS02278089.1  847652  847711  LQNS02278089.1_34106    1566285.5   -   847652  847711  0,0,255
LQNS02278089.1  848490  848552  LQNS02278089.1_34110    882643.1    -   848490  848552  0,0,255

I also have a FASTA file of mature microRNA sequences that I would like to map to the precursor FASTA file, to obtain another BED file for the mature microRNAs.

FASTA file of mature microRNAs:

>LQNS02278089.1_34108
AATTGCACTCGTCCCGGCCTGC
>LQNS02278089.1_34106
AATTGCACTGATCCCGGCCTGC
>LQNS02278089.1_34110
AATTGCACTCGTCCCGGCCTTC

Any help or recommendation would be greatly appreciated!

alignment fasta bed • 876 views
4.7 years ago

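Hi, assuming that column 4 of the precursors BED file is the lookup key, the following awk command will work. For each header in your mature microRNAs FASTA file, it prints the precursor BED line whose column 4 matches that header. It prints non-matches too (as "NA - <header>"), so that you can verify what was and was not matched. The output follows the order of the FASTA headers in your mature microRNAs FASTA file. Note that gensub() is a GNU awk (gawk) extension, so the command needs gawk rather than mawk or another awk.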

Tested on Ubuntu 16.04:

cat precursos.bed
LQNS02278089.1  848170  848230  LQNS02278089.1_34108    2249659.3   -   848170  848230  0,0,255
LQN_negcontrol  1   34  LQN_negcontrol  123.6   -   1   34  0,0,255
LQNS02278089.1  847652  847711  LQNS02278089.1_34106    1566285.5   -   847652  847711  0,0,255
LQNS02278089.1  848490  848552  LQNS02278089.1_34110    882643.1    -   848490  848552  0,0,255

cat mature_mir.fasta
>LQNS02278089.1_34108
AATTGCACTCGTCCCGGCCTGC
>LQNS02278090_34110
AATTGCACTCGTCCCGGCCTTC
>LQNS02278089.1_34106
AATTGCACTGATCCCGGCCTGC
>LQNS02278089.1_34110
AATTGCACTCGTCCCGGCCTTC
>LQNS02278089.1_34119
AATTGCACTCGTCCCGGCCTTC

awk 'FNR==NR {arr[$4]=$0; next} /^>/{lookup = gensub(/^>/, "", "g", $0); if (arr[lookup]) {print arr[lookup]} else {print "NA - "lookup}}' FS="\t" precursos.bed mature_mir.fasta

LQNS02278089.1  848170  848230  LQNS02278089.1_34108    2249659.3   -   848170  848230  0,0,255
NA - LQNS02278090_34110
LQNS02278089.1  847652  847711  LQNS02278089.1_34106    1566285.5   -   847652  847711  0,0,255
LQNS02278089.1  848490  848552  LQNS02278089.1_34110    882643.1    -   848490  848552  0,0,255
NA - LQNS02278089.1_34119

...or, tidied (same command broken over multiple lines):

awk 'FNR==NR {arr[$4]=$0; next}
  /^>/{lookup = gensub(/^>/, "", "g", $0);
    if (arr[lookup]) {print arr[lookup]}
      else {print "NA - "lookup}}' FS="\t" precursos.bed mature_mir.fasta
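
If gawk is not available, the same lookup can probably be written without gensub(). The following is only a sketch: it uses substr() to strip the leading ">" and an explicit "in" test, and it builds two tiny input files in a temporary directory purely for illustration (the file contents are a subset of the example data above).

```shell
# Build small example inputs in a temporary directory (illustration only).
tmp=$(mktemp -d)
printf 'LQNS02278089.1\t848170\t848230\tLQNS02278089.1_34108\t2249659.3\t-\t848170\t848230\t0,0,255\n' > "$tmp/precursos.bed"
printf '>LQNS02278089.1_34108\nAATTGCACTCGTCCCGGCCTGC\n>LQNS02278089.1_34119\nAATTGCACTCGTCCCGGCCTTC\n' > "$tmp/mature_mir.fasta"

# First file (BED): index every line by column 4.
# Second file (FASTA): for each header, print the stored BED line or an NA marker.
lookup_out=$(awk -F '\t' '
  FNR == NR { arr[$4] = $0; next }
  /^>/ {
    id = substr($0, 2)          # header without the leading ">"
    if (id in arr) {
      print arr[id]
    } else {
      print "NA - " id
    }
  }
' "$tmp/precursos.bed" "$tmp/mature_mir.fasta")

printf '%s\n' "$lookup_out"
rm -rf "$tmp"
```

The "in" test also avoids creating empty array entries for headers that have no match.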

Kevin
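
The lookup above returns the precursor's BED line. If the goal is a BED file with the coordinates of the mature sequences themselves, one possible extension is to find each mature sequence inside its precursor with index() and shift the precursor coordinates by that offset. This is a sketch under assumptions that are not stated in the thread: the precursor FASTA is given in transcript orientation (already reverse-complemented for "-" strand entries), the mature sequence occurs exactly once in its precursor, headers match column 4 exactly, and the score column is filled with a placeholder 0. The temporary file names are hypothetical.

```shell
# Example inputs taken from the question (first 6 BED columns only).
tmp=$(mktemp -d)
printf 'LQNS02278089.1\t848170\t848230\tLQNS02278089.1_34108\t2249659.3\t-\n' >  "$tmp/prec.bed"
printf 'LQNS02278089.1\t847652\t847711\tLQNS02278089.1_34106\t1566285.5\t-\n' >> "$tmp/prec.bed"
printf 'LQNS02278089.1\t848490\t848552\tLQNS02278089.1_34110\t882643.1\t-\n'  >> "$tmp/prec.bed"
cat > "$tmp/prec.fasta" <<'EOF'
>LQNS02278089.1_34108
CGGTCGTGATGGGAGCAAATTTGAACAATTAAATAGCAAATTGCACTCGTCCCGGCCTGC
>LQNS02278089.1_34106
CGGTCGTGATTGGTGCAACTTGGGTCACTTAACCGCCAATTGCACTGATCCCGGCCTGC
>LQNS02278089.1_34110
CGGCCGGTATGAGGGCAAATCAATTTCTGTATAAATGACGAATTGCACTCGTCCCGGCCTTC
EOF
cat > "$tmp/mature.fasta" <<'EOF'
>LQNS02278089.1_34108
AATTGCACTCGTCCCGGCCTGC
>LQNS02278089.1_34106
AATTGCACTGATCCCGGCCTGC
>LQNS02278089.1_34110
AATTGCACTCGTCCCGGCCTTC
EOF

# pass=1: precursor BED, pass=2: precursor FASTA, pass=3: mature FASTA.
mature_bed=$(awk -F '\t' -v OFS='\t' '
  pass == 1 { chrom[$4] = $1; start[$4] = $2; end[$4] = $3; strand[$4] = $6; next }
  /^>/      { id = substr($0, 2); next }
  pass == 2 { prec[id] = prec[id] $0; next }      # join wrapped sequence lines
  pass == 3 { if (!(id in mat)) order[++n] = id; mat[id] = mat[id] $0 }
  END {
    for (i = 1; i <= n; i++) {
      id = order[i]
      pos = 0
      if ((id in prec) && (id in chrom)) pos = index(prec[id], mat[id])
      if (pos == 0) { print "NA - " id; continue }
      off = pos - 1                               # 0-based offset in the precursor
      len = length(mat[id])
      # Assumption: "-" strand precursors are reverse-complemented in the FASTA,
      # so offsets count back from the BED end coordinate.
      if (strand[id] == "-") s = end[id] - off - len
      else                   s = start[id] + off
      print chrom[id], s, s + len, id, 0, strand[id]   # 0 = placeholder score
    }
  }
' pass=1 "$tmp/prec.bed" pass=2 "$tmp/prec.fasta" pass=3 "$tmp/mature.fasta")

printf '%s\n' "$mature_bed"
rm -rf "$tmp"
```

With the three entries from the question, each mature sequence sits at the 3' end of its precursor, so on the "-" strand the mature intervals start at the precursor start: 848170-848192, 847652-847674 and 848490-848512. If the precursor FASTA was instead extracted in genome (plus-strand) orientation, the strand branch would need to be dropped or the mature sequence reverse-complemented first.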
