Entering edit mode
5.8 years ago
sullis02
▴
40
I have an old Sanger fasta file and its associated qual file, as well as a 'clip' file that shows where to trim low-quality sequence for each fasta sequence. e.g. for the first record in the fasta file here are the corresponding data in each file:
fasta:
>gnl|ti|713904976 1047095293492
CCCCAGAAGCCCCGCGCTTTTTCCATTTCACGACGTTGTAAAACGACGGCCAGTGAATTGTAATACGACT
CACTATAGGGCGAATTCGAGCTCGGTACCCGGGGATCCCACGTACGCTTCATCATCGCCCTTTTCAATTT
CTGCATATTCCGATGGCTTGTATGAGACAAACGCATTCATTTTCCAGCTTTTATCCAAAGCAAGTTACTC
AACTCCATATTATTCAAGTTTGATCCACTCCAGGGCCATCTTTTCTCGATGGACAACAGGGGACATCATT
GATGGATCGGATGACGGAGGTGCTTCTGTTTCTGCAAAGAAAACTCTGCAAAAAGACGTAAATGAAACAG
GAGGAATCTTCAGAGGAACAAGAAGTTACATGGACTTCCCATTTGCAAACTACAGAGATACTTCCAACTC
GGTTTTGAACCTCGGACCAAGTGAATCC
qual:
>gnl|ti|713904976 1047095293492
9 9 9 8 6 9 6 7 10 10 10 10 9 9 7 9 7 10 10 9
9 10 9 9 9 9 9 9 12 12 10 9 9 9 9 12 21 12 13 17
20 21 20 21 28 25 24 30 21 25 21 21 23 23 31 28 25 28 23 23
29 23 44 33 33 29 34 34 26 29 32 32 32 32 40 40 38 29 45 35
28 23 29 29 22 22 22 26 32 29 32 29 29 29 30 30 33 38 33 25
25 35 20 20 22 20 20 29 29 27 24 22 30 24 27 24 34 25 34 38
29 31 29 23 24 24 23 32 32 40 39 39 27 19 21 14 22 22 22 45
40 26 44 38 28 28 26 40 32 34 36 40 36 40 39 27 21 23 21 39
39 32 28 32 40 26 32 26 25 35 34 32 40 40 28 36 40 38 39 39
36 32 29 28 28 28 28 29 45 40 36 40 34 29 32 24 18 18 23 27
32 36 47 47 38 33 25 30 24 30 20 19 24 24 27 28 27 26 18 29
25 23 27 27 21 15 22 23 30 29 35 38 40 31 34 24 21 26 23 18
17 19 19 22 28 24 24 26 36 36 28 40 32 40 28 23 32 22 24 22
22 22 22 29 29 26 39 40 29 32 24 28 29 29 40 36 32 28 29 24
24 24 20 28 31 40 38 38 34 40 38 27 39 24 18 22 22 24 16 26
23 39 40 28 28 49 49 32 27 27 23 28 29 27 44 34 29 25 20 27
27 29 23 23 39 32 49 40 28 25 28 39 28 28 23 39 32 32 40 28
23 27 27 23 27 36 39 17 17 26 49 49 32 39 39 39 28 28 19 18
21 14 13 18 11 18 27 25 23 17 23 25 27 23 24 17 23 23 23 21
39 27 25 32 32 36 32 18 16 12 23 19 20 29 26 28 21 26 26 26
27 29 23 16 12 25 18 23 24 24 18 23 27 23 28 27 23 13 15 13
13 14 13 20 11 15 25 23 27 24 25 27 24 19 23 27 23 27 23 20
18 27 27 27 23 10 15 14
clip:
TI CLIP_LEFT CLIP_RIGHT
713904976 112 445
I'd like to generate a fastq file consisting of the trimmed fasta sequences + their qual scores. (Or, generate the fastq file of the whole fasta records first, then trim it using the clip data as instructions. )
What tools can do that?
Thank you very much!!
Oops, wrote to soon. Have you tried your script with input files that contain more than a single fasta/clip/qual record? With a single record as input (as in your example) it works fine. With normal fasta /qual/clip files (which contain numerous records) it doesn't work it reports ' clip id doesn't match sequence id, printing sequence without modifications' and then concatenates all the fasta and qual text into giant globs.
e.g the input fasta here has two records, as do the qual and clip files
The script is intended to work on one sequence per file, to use with a multifasta, the data needs to be saved in memory in a Hash.