reference fasta creation
2
0
Entering edit mode
3.6 years ago
geneart$$ ▴ 50

Hi all,

I am hoping to map my NGS reads onto a reference file I have created using multiple short reads, where each read is represented in the reference file in a fasta format. I intend using bowtie or any other short read aligner.

Generally I have mapped to a reference file has one fasta sequence of the genome and here I plan to do something different. I am concerned at many levels and hence posting the question here even before I do my mapping, seeking NGS experts' opinions: thinking out loud here...

  1. Mapping short reads onto multiple short sequences, would I be able to decode which reads mapped to which fasta sequence in my reference file properly?
  2. How computationally intensive would this process be, given short read mapping onto short reads reference?
  3. Are there better ways to do this ?
  4. Any other better aligner than bowtie for short reads?

Thankyou and appreciate your time!

reference mapping fasta • 1.9k views
ADD COMMENT
0
Entering edit mode

I am hoping to map my NGS reads onto a reference file I have created using multiple short reads, where each read is represented in the reference file in a fasta format.

That is a very uncommon approach. I have never seen anything like this. I have no idea why you would want to do it this way. Can you elaborate why you think this is the best solution?

ADD REPLY
0
Entering edit mode

Short answer is it should be possible to do this. Instead of a small number of references now you will have hundread of thousands of references depending on number of entries in your reference. The fasta header of that read will show up in your SAM/BAM as reference. You may get a massive amount of secondary alignments (depending on how similar reads are in this reference).

What exactly are you trying to do? Do you simply want to see if your reads are represented in the reference? Since you are trying to align short reads to other short reads it may be best to use bowtie v.1.x that does non-gapped alignments.

ADD REPLY
0
Entering edit mode

Thankyou Genomax, Yes ,exactly ! I want to see if my reads are represented in the short read reference file I got from someone else. Will try bowtie v.1.x .Appreciate your discussion and suggestion.

WouterDeCoster, I understand this is unusual and I have never done this way before. I always have mapped to a genome reference using bowtie or BWA mem. Hence wanted to get other opinions before I waste my time .

ADD REPLY
0
Entering edit mode

When everyone here thinks your approach is ...unorthodox, it's probably best to stop and ask a new question outlining exactly what goal X you are trying to do, instead of asking about how to do approach Y, because approach Y might not be a good way to get to X.

ADD REPLY
0
Entering edit mode

It does seem unorthodox and there are times when some cannot reveal the details due to constraints that are not under their control. The idea of discussing glitches and workflows and seeking help on good/bad bioinformatics questions in general opens up conversations that can help oneself and others to learn something new as well, even though this may not exactly be a solution or it may not seem logical. Always there is an option to help with one can or totally ignore :) But i do appreciate your comment and thankyou for your time.

ADD REPLY
3
Entering edit mode
3.6 years ago
GenoMax 147k

I want to see if my reads are represented in the short read reference file I got from someone else.

Use bbduk.sh from BBMap suite in filter mode to identify reads that match your reference.

$ more ref.fa 
>ref1
AGCAATATTTTCACGAGCGAGTAACTTTTGAACTAACTGCTCAATTTTGTTGCTTTCATTATGGTCGTGGCATGCGGTAAGTGGGGTAGATATATTAGAG
>ref2
GTGCATTGGGGAGCAACCAGTCAGGATATTTTAGACACAGCTTGTATTTTGCAATGCCGTGATGCGTTGAATATTGTGGAAGGTCAACTTCAACAGTGCT

$ more toy.fq
@SYN_0_3183191_3183290_268_+_3191191_1_._NZ_CP015121.1_Acinetobacter_baumannii_strain_ab736_chromosome,_complete_genome&0_3183359_3183458_268_-_3191359_1_._NZ_CP01512
GCCAAAAATACTTTAGCCAATCAAACTCAGTCAATTGCTCAAAAACAAGCGGATATTGTGGCTGCGCAGGCTAAAGTAGAACAAGTTAAGGCTCAATATG
+
7;6:685;=?;<=<>A56>9=7CBC><@9D5AC58BB=<:595>C;:5;9<=A>;<=;A9=6@>;>6:=C97:7>:B;B=::55A5B;9<=5?<56=555
@SYN_1_1681048_1681147_207_+_1689048_1_._NZ_CP015121.1_Acinetobacter_baumannii_strain_ab736_chromosome,_complete_genome&1_1681155_1681254_207_-_1689155_1_._NZ_CP01512
GTGCATTGGGGAGCAACCAGTCAGGATATTTTAGACACAGCTTGTATTTTGCAATGCCGTGATGCGTTGAATATTGTGGAAGGTCAACTTCAACAGTGCT
+
8A7>97>==CBA@A>@AAB?ADD=@AAA@EB<@8D8E9F<CA>==DC;F@F?A=CD@@8?9B88@><;>8?7=ABC=:>9C>>>B<7><7?9<<9797<@
@SYN_2_2800746_2800845_190_-_2808746_1_._NZ_CP015121.1_Acinetobacter_baumannii_strain_ab736_chromosome,_complete_genome&2_2800656_2800755_190_+_2808656_1_._NZ_CP01512
AGCAATATTTTCACGAGCGAGTAACTTTTGAACTAACTGCTCAATTTTGTTGCTTTCATTATGGTCGTGGCATGCGGTAAGTGGGGTAGATATATTAGAG
+
;87;?6?6AB?CF>>A@B>=E@CB=8<?=??@9B?;D?@FA?=>D>;7A;@>C=@?D=?>;?A7>>?;:?>?F>?:==A;@A=>6=669C6@A;<=<666
@SYN_3_3453929_3454028_154_+_3461929_1_._NZ_CP015121.1_Acinetobacter_baumannii_strain_ab736_chromosome,_complete_genome&3_3453983_3454082_154_-_3461983_1_._NZ_CP01512
CTTCACTGGCGCGAGACATTACCTGAGCAACAGTATCGTACTTCGACTCTTTATCAGCGCGAAGTTGAACTGTCGGTTTAACCGCACCTTGACCTGCTTC
+
=:=<;?8@@ABBCBBAA9A?DD>AC@A?>>CAB@@A@BA@@AA@A>A=B?B?>8?@>=B@?C?B?>AD?A=????@D>?AB>AB@DBBB?=@?>9=95;7
@SYN_4_3005690_3005789_109_+_3013690_1_._NZ_CP015121.1_Acinetobacter_baumannii_strain_ab736_chromosome,_complete_genome&4_3005699_3005798_109_-_3013699_1_._NZ_CP01512
ATGCACTGGTTGCTCGCGGAATTTTGGACCATGCCACTGTGCGTTCACCTCTGCTTCCACTTGCAGATGGTGCTGAGCAAGAAATTCATGATGCAATGCG
+
7675;?@?>?A??B<@?<C?BAB@???B?>?>@;=?=>B=<>=;>?==>?5=?9<;C>;@=7<?@<>4>>7;9<;;<><6@8<=?B;A>B=33<:69743
@SYN_5_1010750_1010849_248_-_1018750_1_._NZ_CP015121.1_Acinetobacter_baumannii_strain_ab736_chromosome,_complete_genome&5_1010602_1010701_248_+_1018602_1_._NZ_CP01512
TCACGGCGGAAAGCCAAGACTGAGTAATTACCCTTTTGTTTAGCATATTTCTCAATAAGTTTTGTTGTCGCTTTTAGTCCATCAATTGCAATAATGACAT
+
9:A9BA?@CHHE9H?EF@:AFA>GE?;B@A?C<DGDCGA@=A==C<?BCBE999;B?>C9<@@=B=GAHD?B@BDCF=@D:=@<BH??9;9?B9999;99
@SYN_6_3513088_3513187_181_-_3521088_1_._NZ_CP015121.1_Acinetobacter_baumannii_strain_ab736_chromosome,_complete_genome&6_3513007_3513106_181_+_3521007_1_._NZ_CP01512
TTCATCAATAGCAGCTCGGACTTCTTCACCAGCAATAACGGCAATAGGTTTTACAAGATTGATGCTATGAGCGAGAACCTTACCGACTTGAGATGTATTT
+
89<>;==@A?C?@B@CA@@?=C@A@ABC:CBC@AC@A6@@?BCC??BB@@A@?@@AA@@?>ACA@BAC@AA@CAB>?9@A@?>BBB@A:BA==>8<?:<:
@SYN_7_1608059_1608158_230_-_1616059_1_._NZ_CP015121.1_Acinetobacter_baumannii_strain_ab736_chromosome,_complete_genome&7_1607929_1608028_230_+_1615929_1_._NZ_CP01512
TCCCATGTCCATTACAGTCGGTTGTACCATTGCCATCAGATATTGCGGTATAACCGCTTAACACACGTCCTGAAAACTCTTGGTGGCTAGATAATATACC
+
9:>=<=@?;B@DC?@@?@B@BA@>@?A@=AACB@:A?>@B@=AC@BA?C;@@@<?B@:B??@@B>?A<:<?:C>A4=4>5>?@?<<=<B>D=?89;=4;5

Run bbduk.sh to select reads that match your reference. You can play with parameters to allow for partial matches etc (if you need to).
Note: Total removed actually enumerate reads that match the reference.

$ bbduk.sh -Xmx2g in=toy.fq ref=ref.fa outm=stdout.fq k=23 

Version 38.87

0.028 seconds.
Initial:
Memory: max=2058m, total=2058m, free=2015m, used=43m

Added 156 kmers; time:  0.026 seconds.
Memory: max=2058m, total=2058m, free=1983m, used=75m

Input is being processed as unpaired
Started output streams: 0.008 seconds.
@SYN_1_1681048_1681147_207_+_1689048_1_._NZ_CP015121.1_Acinetobacter_baumannii_strain_ab736_chromosome,_complete_genome&1_1681155_1681254_207_-_1689155_1_._NZ_CP01512
GTGCATTGGGGAGCAACCAGTCAGGATATTTTAGACACAGCTTGTATTTTGCAATGCCGTGATGCGTTGAATATTGTGGAAGGTCAACTTCAACAGTGCT
+
8A7>97>==CBA@A>@AAB?ADD=@AAA@EB<@8D8E9F<CA>==DC;F@F?A=CD@@8?9B88@><;>8?7=ABC=:>9C>>>B<7><7?9<<9797<@
@SYN_2_2800746_2800845_190_-_2808746_1_._NZ_CP015121.1_Acinetobacter_baumannii_strain_ab736_chromosome,_complete_genome&2_2800656_2800755_190_+_2808656_1_._NZ_CP01512
AGCAATATTTTCACGAGCGAGTAACTTTTGAACTAACTGCTCAATTTTGTTGCTTTCATTATGGTCGTGGCATGCGGTAAGTGGGGTAGATATATTAGAG
+
;87;?6?6AB?CF>>A@B>=E@CB=8<?=??@9B?;D?@FA?=>D>;7A;@>C=@?D=?>;?A7>>?;:?>?F>?:==A;@A=>6=669C6@A;<=<666
Processing time:                0.007 seconds.

Input:                          8 reads                 800 bases.
Contaminants:                   2 reads (25.00%)        200 bases (25.00%)
Total Removed:                  2 reads (25.00%)        200 bases (25.00%)
Result:                         6 reads (75.00%)        600 bases (75.00%)

Time:                           0.042 seconds.
Reads Processed:           8    0.19k reads/sec
Bases Processed:         800    0.02m bases/sec
ADD COMMENT
0
Entering edit mode

Thankyou GenoMax ! This is worth a try and I appreciate your time ! Will update my post soon on the outcome.

ADD REPLY
0
Entering edit mode
3.6 years ago

You can use minimap2 to do all against all mapping, with output in PAF format. This approach is used in all vs all mapping before long read error correction.

But as others have said, you're barking up the wrong tree here.

ADD COMMENT
0
Entering edit mode

Thankyou for your comment. Appreciate your input!

ADD REPLY

Login before adding your answer.

Traffic: 1882 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6