Question

How to count the nucleotide frequency of paired-end FASTA files

0

5.8 years ago

Learner ▴ 280

I have read the post "Counting nucleotide frequency using perl script", which is very interesting. However, the fastq files are not in this format, and I would like to know if there is any available code for paired-end sequencing.

genomics perl • 2.2k views

ADD COMMENT • link updated 5.8 years ago by h.mon 35k • written 5.8 years ago by Learner ▴ 280

1

You need to be a little more specific:

What exactly do you want to achieve?
What type of data do you have?
Why do you think the script is not doing what you need, i.e. do you have an error message or example output that illustrates why you think it's failing you?

ADD REPLY • link 5.8 years ago by Friederike 9.0k

0

@Friederike

So cliché, but I'll try to answer.

What exactly do you want to achieve? To get the nucleotide frequency.
What type of data do you have? Paired-end fastq.
Why do you think the script is not doing what you need? I think you need to know what the structure of a paired-end file is!

Sorry, but your questions are not useful whatsoever.

ADD REPLY • link 5.8 years ago by Learner ▴ 280

0

Well, the original script you linked to was for fastq files, and paired-end sequencing data is typically just dumped into two separate fastq files, so I did not understand why that script should not work. I admit I did not fully read the title of your post, only the content, which started with "This script is interesting". We might be able to point you to better solutions if you told us, for example, whether you need a specific table output or a plot (such as this), and so on. Anyway, I shall refrain from further unhelpful comments :)

ADD REPLY • link 5.8 years ago by Friederike 9.0k

0

@Friederike comments are always good, as long as one does not send another on a wild goose chase or just blah blah. I try to be very precise in my questions so that I don't waste people's time, and sorry if I am not perfect. Yes, there are two FASTA files, not fastq! It is just a matter of the structure of the data, which I am trying to figure out.

ADD REPLY • link 5.8 years ago by Learner ▴ 280

1

the fastq files are not in this format

That is easily fixed using reformat.sh from the BBMap suite.

reformat.sh in=R1.fq.gz out=R1.fa

On a serious note you could also look at this: https://digibio.blogspot.com/2017/12/nucleotide-base-frequency-per-read-and.html

ADD REPLY • link 5.8 years ago by GenoMax 147k

0

@genomax can you paste reformat.sh here, please? Or is it this bash file? https://github.com/BioInfoTools/BBMap/blob/master/sh/reformat.sh

ADD REPLY • link 5.8 years ago by Learner ▴ 280

1

Official BBMap repo is located here. Reformat is a versatile utility that does a ton of other things. You can find a guide here. It is part of a much bigger BBTools package.

ADD REPLY • link 5.8 years ago by GenoMax 147k

0

@genomax, do you know what exactly it does to each fastq? I mean, I should do that for both the forward and the reverse file, right?

Thanks

ADD REPLY • link 5.8 years ago by Learner ▴ 280

0

If you were planning to use the fasta perl script linked in your original post, the reformat command above will convert your fastq format reads into fasta format. You will have to do it for both the F and R reads.
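
For anyone wondering what the conversion amounts to, here is a minimal sketch in Python (not Perl, and not how reformat.sh itself is implemented). It assumes strict 4-line fastq records; the function name `fastq_to_fasta` is made up for illustration. Each record keeps its header (with `@` swapped for `>`) and its sequence, and drops the `+` line and the qualities.

```python
def fastq_to_fasta(lines):
    """Yield FASTA lines ('>name', then sequence) from 4-line FASTQ records.

    Assumption: every record is exactly 4 lines (header, sequence, '+', qualities).
    """
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        if i % 4 == 0:
            yield ">" + line[1:]  # '@name' becomes '>name'
        elif i % 4 == 1:
            yield line            # sequence line is kept unchanged
        # lines 3 and 4 of each record ('+' and qualities) are dropped


# Small in-memory example; a real run would pass an open file handle instead.
example = ["@read1/1\n", "ACGT\n", "+\n", "IIII\n"]
for fasta_line in fastq_to_fasta(example):
    print(fasta_line)
```

The same function would be called once on the forward file and once on the reverse file, writing each result to its own fasta file.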

ADD REPLY • link 5.8 years ago by GenoMax 147k

1

Are your fastqs paired in the same order already? Do you have entries where one of the pair (forward or reverse) is missing?
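
One way to check both things is to compare the read names of the two files. The sketch below is in Python rather than Perl and rests on a few assumptions: strict 4-line fastq records, read names that end at the first whitespace, and an optional `/1` or `/2` mate suffix. The names `read_names` and `compare_pairs` are made up for illustration.

```python
def read_names(lines):
    """Yield read names from 4-line FASTQ records, without a /1 or /2 mate suffix."""
    for i, line in enumerate(lines):
        if i % 4 == 0:  # header line of each record
            fields = line.strip()[1:].split()  # drop '@', split off any comment
            name = fields[0] if fields else ""
            if name.endswith(("/1", "/2")):
                name = name[:-2]
            yield name


def compare_pairs(r1_lines, r2_lines):
    """Report whether both files list the same reads in the same order."""
    r1 = list(read_names(r1_lines))
    r2 = list(read_names(r2_lines))
    return {
        "same_order": r1 == r2,
        "only_in_r1": sorted(set(r1) - set(r2)),  # forward reads with no mate
        "only_in_r2": sorted(set(r2) - set(r1)),  # reverse reads with no mate
    }


# Small in-memory example: read 'b' has no reverse mate.
r1 = ["@a/1", "ACGT", "+", "IIII", "@b/1", "TTGG", "+", "IIII"]
r2 = ["@a/2", "CCAA", "+", "IIII"]
print(compare_pairs(r1, r2))
```

This holds all read names in memory, which is fine for a quick check but may be too much for very large files.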

ADD REPLY • link 5.8 years ago by Damian Kao 16k

0

@Damian Kao great point, how do I check for those? Wow, thanks for such a comment.

ADD REPLY • link 5.8 years ago by Learner ▴ 280

score 1 · Answer 1 · 2019-02-21

1

5.8 years ago

h.mon 35k

To parse fastq files, you may use MAPK's original idea of parsing every four lines:

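$count++;               # one more line read from the fastq file
if ($count == 4) {      # numeric comparison: a full 4-line record has been read
    ...                 # process the completed record here
    $count = 0;         # reset the counter for the next record
}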

Although the fastq format does not strictly mandate 4 lines per record (header, sequence, a "+" line that may repeat the header, qualities), the 4-line layout is so widely adopted that it is nearly a standard.

To parse paired-end sequencing data, just open, parse and close the R1 file, then open, parse and close the R2 file, then output the overall results.
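
As a rough illustration of that two-pass approach, here is a minimal sketch in Python rather than Perl. It assumes strict 4-line fastq records and counts every character on the sequence line, including N. The names `count_bases` and `frequencies` are made up for illustration, and this is not MAPK's script.

```python
from collections import Counter


def count_bases(lines, counts=None):
    """Add the bases on every sequence line (2nd line of each 4-line record) to counts."""
    counts = Counter() if counts is None else counts
    for i, line in enumerate(lines):
        if i % 4 == 1:  # sequence line of the record
            counts.update(line.strip().upper())
    return counts


def frequencies(counts):
    """Turn raw base counts into fractions of all counted bases."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {base: n / total for base, n in sorted(counts.items())}


# Small in-memory example standing in for the R1 and R2 files.
r1 = ["@read1/1", "ACGT", "+", "IIII"]
r2 = ["@read1/2", "AANN", "+", "IIII"]
counts = count_bases(r1)          # first pass: forward reads
counts = count_bases(r2, counts)  # second pass: reverse reads, same counter
print(frequencies(counts))
```

With real data, each list would be replaced by an open file handle, for example `with open("R1.fastq") as fh: count_bases(fh, counts)` (the file name is a placeholder), or `gzip.open(path, "rt")` for compressed files.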

ADD COMMENT • link 5.8 years ago by h.mon 35k

0

@h.mon can you please direct me to the right place (MAPK's original idea)?

ADD REPLY • link 5.8 years ago by Learner ▴ 280

0

What do you mean? I pointed you to the original parsing idea; it is the code snippet above.

ADD REPLY • link 5.8 years ago by h.mon 35k

0

@h.mon when I click on his name, it takes me to all of his questions, not to his original idea.

ADD REPLY • link 5.8 years ago by Learner ▴ 280

0

His original idea can be found at the post you linked: Counting nucleotide frequency using perl script

ADD REPLY • link 5.8 years ago by h.mon 35k