Entering edit mode
8.5 years ago
cristina_sabiers
▴
110
First at all thanks for all the support and help...to a newbee
Following a tutorial I did manage to generate my sam file, the problem I get when I want to convert to bam, I cant, so I have been reading the best is to run validatesamfile, so I did
But at this point I get really lost ;_; whats the main reason my sam file is bad? Im biology not a programmer and is the first time I do this and learn on my own, so whatever help or link to tutorial will be grateful for me...
Thanks!!!
java -jar picard.jar ValidateSamFile \
> I=out.sam \
> MODE=SUMMARY
[Thu Jun 23 18:17:20 CEST 2016] picard.sam.ValidateSamFile INPUT=out.sam MODE=SUMMARY MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true INDEX_VALIDATION_STRINGENCY=EXHAUSTIVE IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Thu Jun 23 18:17:20 CEST 2016] Executing as cri@cri-To-be-filled-by-O-E-M on Linux 4.4.0-21-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_91-8u91-b14-0ubuntu4~16.04.1-b14; Picard version: 2.4.1(7c4d36e011df1aec4689b51efcada44e92d1817f_1464389670) JdkDeflater
[Thu Jun 23 18:17:20 CEST 2016] picard.sam.ValidateSamFile done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=494403584
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" java.lang.IllegalArgumentException: Invalid fastq character:
at htsjdk.samtools.SAMUtils.fastqToPhred(SAMUtils.java:407)
at htsjdk.samtools.SAMUtils.fastqToPhred(SAMUtils.java:385)
at htsjdk.samtools.SAMRecord.setBaseQualityString(SAMRecord.java:277)
at htsjdk.samtools.SAMLineParser.parseLine(SAMLineParser.java:326)
at htsjdk.samtools.SAMTextReader$RecordIterator.parseLine(SAMTextReader.java:248)
at htsjdk.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:236)
at htsjdk.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:212)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:545)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:519)
at htsjdk.samtools.SamFileValidator.validateSamRecordsAndQualityFormat(SamFileValidator.java:269)
at htsjdk.samtools.SamFileValidator.validateSamFile(SamFileValidator.java:200)
at htsjdk.samtools.SamFileValidator.validateSamFileSummary(SamFileValidator.java:128)
at picard.sam.ValidateSamFile.doWork(ValidateSamFile.java:187)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:209)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)
Can you backtrack a bit and include commands that you used for the alignment and conversion to BAM? What software were you using for that part?
Thanks genomax!
I did this
single paired ION_EX..:
Did you replace
<ID>
<LIBRARY_NAME>
and<SAMPLE_NAME>
with real names (my guess is you did not)?<OPTION>
is used in unix to indicate that a real name/string should be used in that location.What does the single paired ION_EX mean?
Did you replace <library_name> and <sample_name> with real names (my guess is you did not)? No I didnt ;_;
What does the single paired ION_EX mean? Exome Ion Torrent, just a note for myself :)
That may be one of the issues. Can you show us the output of
head -30 out.sam
?ok, I will try to do that step right, thanks a lot
mmm by the output of head -30 you mean like this? with this code?Im sorry... Im really newbee just 2 weeks working on this (before crashing my head to learn linux and install all the tools)
That looks good except I don't see any quality values. Can we also look at
head -6 input.fastq
output?head -n 6 input.fastq > head2.csv
Hope this is right....I know...I still need to learn a lot
There is something strange with the first record (if the copy/paste has preserved the content). There appear to be more nucleotides than Q-scores. Do you see that for later records as well?
I don't know. Are you allowed to have RG tags with "<"?
Hopefully @cristina is re-running the samse step with real names (since this is Ion data should not be very large files) so that should address the "<" issue. The original error was referring to invalid fastq characters (which was what I was going after first).
Thanks! Sadly bc Im new I need to wait 5 hours to next post ;_;
As genomax told I re-running the samse step, I tried again to export to bam, but doesnt work.
It appears that your original data file is malformed (number of bases =/= number of Q-scores). Can you run fastqValidator and see how bad the problem is? Can you get another copy of the original data file in case the one you have is corrupt?
Finally I managed to run fastqvalidator..
I think the report looks not so good, is any way to fix all of that?
Thanks!!!
Finished processing /home/cri/Desktop/hg19/input.fastq with 181246044 lines containing 45311511 sequences. There were a total of 112895 errors. Returning: 1 : FASTQ_INVALID
Those are just short reads than the default (because by default fastqvalidator seems to be looking for a minimum of 10 bases, you can probably change that on the command line, take a look at
fastqvalidator -h
).I am surprised that there was no flag for the example you had posted above where there were fewer Q-scores than bases in that particular read.
Not sure if that is saying that one example you posted is the only invalid sequence. Can you delete that and see if
Thanks Geomax, i was checking in tutorial how and I checked twice the file with different commands
I have been reading when you get the file from illumina, Ion ...etc there exist different programs to check the quality and filter...like PRINSEQ , I always need to do that process before I aligning the sequences to the human genome if I get like in this case FASTQ_INVALID??
First
./fastQValidator --file /home/cri/Desktop/hg19/input.fastq --minReadLen 40 --baseComposition --avgQual --disableSeqIDCheck --quiet --params ON --maxErrors -1
Overall Average Phred Quality = 21.74 (I only put the result....this forum dont allow me write so many characters)
Second ./fastQValidator --file /home/cri/Desktop/hg19/input.fastq --minReadLen 40
ERROR on Line 11: Raw Sequence is shorter than the min read length: 9 < 40
ERROR on Line 15: Raw Sequence is shorter than the min read length: 13 < 40
ERROR on Line 203: Raw Sequence is shorter than the min read length: 31 < 40
ERROR on Line 1707: Raw Sequence is shorter than the min read length: 38 < 40
ERROR on Line 1711: Raw Sequence is shorter than the min read length: 39 < 40
ERROR on Line 1715: Raw Sequence is shorter than the min read length: 29 < 40
ERROR on Line 1719: Raw Sequence is shorter than the min read length: 22 < 40
ERROR on Line 1731: Raw Sequence is shorter than the min read length: 9 < 40
ERROR on Line 3459: Raw Sequence is shorter than the min read length: 38 < 40
ERROR on Line 3475: Raw Sequence is shorter than the min read length: 35 < 40
ERROR on Line 3483: Raw Sequence is shorter than the min read length: 33 < 40
ERROR on Line 3607: Raw Sequence is shorter than the min read length: 16 < 40
ERROR on Line 3611: Raw Sequence is shorter than the min read length: 10 < 40
ERROR on Line 3615: Raw Sequence is shorter than the min read length: 10 < 40
ERROR on Line 3627: Raw Sequence is shorter than the min read length: 23 < 40
ERROR on Line 3959: Raw Sequence is shorter than the min read length: 37 < 40
ERROR on Line 3979: Raw Sequence is shorter than the min read length: 12 < 40
ERROR on Line 4123: Raw Sequence is shorter than the min read length: 34 < 40
ERROR on Line 4231: Raw Sequence is shorter than the min read length: 32 < 40
ERROR on Line 4647: Raw Sequence is shorter than the min read length: 12 < 40
Finished processing /home/cri/Desktop/hg19/input.fastq with 181246044 lines containing 45311511 sequences. There were a total of 2340507 errors. Returning: 1 : FASTQ_INVALID
It's throwing an error because it's getting an ascii value in the quality column either < 33 or > 126. How was this SAM file made? Does it look reasonable if you scroll through it a bit?
Thanks Devon, due Im in spain I write with bit delay :P
I followed this tutorial
http://seqanswers.com/wiki/How-to/exome_analysis
My steps were these to get my sam file
I have few fastq files to use (mostly are 16-20 gb), yesterday night I tried the same steps with another fastq file in this case I get like same results, probably I do something wrong :(
I tried to convert my sam file to bam with two diferents paths...so i got
FIRST this shorter way
SECOND WAY. from the tutorial..seems even worst, think I need to fix more things than just the sam file O_o
Can you post the first few alignments in the SAM file (i.e., skip the lines starting with
@
)?BTW,
samtools view -Sbo output.bam input.sam
is the modern version of the samtools command you wrote. I'm not entirely sure thatsamtools import
works anymore, the command was renamed years ago.@Devon: @cristina has a fastq file with at least one sequence where the Q-scores are less than the bases. That example is in this thread if you go up.
Ah, there's the problem!
ok...so If I understand right my main problem is I got a fastq file with less quality scores than nucleotides
I have been checking how delete that part, and found can use fastx-toolkit, I don't know if I do right at all
mmm which value I should use 48?
I can use something else where I copy and select the part I want to delete?
Thanks a lot for the help!
Best option would be to use
reformat.sh
from BBMap suite to remove the broken reads. You will use it like (no spaces between the = other words)You will then need to redo the alignments.
thanks a lot!!! :)))) I own you a big coffe!!!
Afther doing reformat have same quality scores than nucleotides :)
Now I do aligment once more...in my pc will take around 5 hours, so I m looking forward for the results and hope I can continue and learn how to call snps and generate an excel with predictors. One of my main goals is to generate a report with de novo mutations... well... step by step :P
Since you have BBMap installed on your machine you can look into using
bbmap.sh
an alternative for bwa for alignments. You can go from your fastq file directly to an aligned BAM (as long as you have samtools available on your machine).I tried with
reformat.sh in=/path_to/your.fastq out=/path_to/fixed.fastq tossbrokenreads=t
Seems didn't work, after I did alignment and got same error, I should have check twice the fastq before doing it. At first sight look like had same quality scores than nucleotids... but it isnt. :(
That fastq record looks ok to me.
Are there extra blank lines in your fastq file after each line (I removed them when I reformatted the post above)?
Can you try
bbmap.sh
to do an alignment?I added this blank spaces on pour pose to see better, when i copy and paste the file appear all in one line
you are right are same quality scores, I count them now....stupid of me, at my excell looks like much more less...(quality fonts are smaller)
I will try with bbmap.sh ! Thanks
Did you try @Devon's suggestion for current samtools command to covert sam to bam? Did the following produce same error?
If you do decide to map with
bbmap.sh
then as long as samtools is available in your $PATH then BBMap can directly create bam files (saving you a step). Be sure to use:yes, I tried devon advice, didnt work.
mmm nice to know the shortcut...Im too late, I have used :
my cpu fan run fast XD need buy one that make less noisy, hope will not burn out o_O
THANKS!
Is the mapping running now? Once done, try the sam conversion to bam on the file produced by BBMap.
yes...still running I asume will take 6 hours like did with the other tools, hope this time will be fine!....we will see tomorrow, time for me to go to sleep...zzz...
thanks! :)
Perhaps BBMap is running with a single thread (I don't remember if it uses all cores by default or just one). Next time add
threads=N
(N cores in your processor) option to your command to speed things up a bit.gosh need do all again, my father did shut down the pc..... ;;;;;____;;;;;;
Can't help you with that one :-D
XD sadly no...
Now I put this code, so I hope this time I get the file !!!
./bbmap.sh ref=/media/cri/CRIS_DATA/hg19.fa in=/media/cri/CRIS_DATA/fixed.fastq out=/media/cri/CRIS_DATA/out.bam nodisk threads=4
Just by curisosity...
how long time takes bbmap to give my bam file? Is still working with 100% my 4 cores, and use 23 gb ram...and until now, 6 hours has process a part of bam, 2.3 gb....thats normal?
How big is going to suposse to be my bam file if the fastq is 15 gb?
thanks
As long as the job is running (and the output file is increasing in size) leave it alone. Hard to tell what sized BAM you will end up with at the end.
ok, thanks :)
thought was some kind of average...is bit bad I cant see % of running on this kind of jobbs...
You could figure out how many reads have been aligned by looking in the BAM but since all 4 cores on your machine are busy aligning the reads let us not try that now.
I GOT MY FIRST BAM FILE!!!! YES YES YES !!!! ^^
Think took around 10 hours...Hope my file is fine and I can use it !!
Now I sort and index it with this code
Bbmap do it too?
If everything go fines I hope I can visualize it at IGV :)
Thanks a lot genomax!
Assuming you are using the latest samtools the commands would be slightly different
With a large BAM file it may take a couple of hours to sort. So be patient.
oh thanks... I note it..for me worked with the other code...now I see my bam file at IGV ^^ next step will be call snps....probably I will end crying XD
I asked to the person who provided me my fastq file and the bam as well, why my bam file is just 5 gb and their bam file is 17-20 gb ....he says depends all on number of alignments...
thanks!
Sorted BAM files compress better so if your BAM is sorted them it may look smaller than the other one. Can you post the lines from your BBMap alignment stats to see how well it did? This is ion data and as such more prone to errors.
Those strange characters look like an artifact of copy/paste so let us ignore those for the moment.