Hello all,
Please find assignment #4 attached here: MCB_5472_Assignment_4. This week we will be starting to work with command line BLAST, using it to answer some interesting questions about our genomes from assignment #3. We will also be continuing to develop our perl skills, especially using arrays to parse the BLAST output and hashes to tabulate it into something useful. This assignment is due before class on Wednesday Feb 19/14.
The corresponding lecture can be found here: MCB5472_Assignment_4_lecture_Feb-12-14 (pptx) MCB5472_Assignment_4_lecture_Feb-12-14 (pdf)
hey guys,
If you also have trouble using the -outfmt 7 flag, here is the fact I just found out: If you just copy the syntax mentioned in the assignment notes, it probably won’t work because the double quotation marks are not the correct form for terminal. You’ll only need to type the quotation marks by yourself and copy the other words, and then you get the results!
Thanks Peihua!
Jonathan
Hi Dr. Klassen,
For the second part of the assignment, we do blast agains the genome itself to find paralogs. When we are analyzing the results, shall we get rid of all the blast results with 100% identity, or shall we just remove those pairs whose query id and subject id are the same? In another words, does 100% identity necessarily suggests that the two genes being blasted are the same?
Good question Peihua. A sequence will always have 100% identity to itself but potentially not over its entire length because of low complexity masking. Also consider: a sequence blasted to itself will have 100% identity but the opposite logic will not always be true, i.e., different sequences can also be 100% identical. Comparing names is safer for this reason.
Hello guys,
For the second part of the assignment, I am having trouble setting the evalue and the minimum alignment length requirement to 70%. Do I specify the evalue by typing in the command line -evalue 1e^-5? When I did this, I got an error saying command not found. I know this sounds like a trivial question, but the struggle is real. Any help is appreciated!
No question too trivial! I believe the problem is that it should be “1e-5” instead of “1e^-5”. Sorry, no way for me to test this right now.
When I specify an e-vaulue cutoff i usually put it as a decimal “.00001” and this works for me. Although, it might be harder to use if you were trying to do 1e^-30 or something else where you would need a lot of zeros.
Thanks guys! I will try these suggestions tomorrow and report my results!
Hi yall,
I’ve hit a bit of a wall on the second part of the assignment. So far I’ve been unable to get my BLAST results into a format that I can work with. At first I figured I would need to store each line of my results in an array and then split each of those lines by “\t”, but I was unable to find a way to do this. I then noticed the suggested “@blastline = split $line, “\t”” and attempted to use this while my results file was being read line by line. I messed around with this code (for longer than I’d like to admit) but as of yet, the program has either returned a tab (literally a blank space) or an fun error saying:
Invalid [] range “L-2” in regex; marked by <– HERE in m/# Query: gi|410480140|ref|YP_006767686.1| chromosomal replication initiation protein [Escherichia coli O104:H4 str. 2009EL-2 <– HERE 050]
/ at jamesmcg_4pt2.pl line 10, line 2.
Anyone have any luck getting their results into “tabular output format”? Thanks!
James, it looks to me like something is weird with the quotation marks. Try typing in the command that I gave you instead of copying (if that is what you did). Remember that things get weird coming from word processors, powerpoint and pdfs. It looks like perl is trying to interpret the match as a regular expression, not text. I think the quotation marks are forcing it this way. Hope that helps, else let me know.
Hi Prof. Klassen,
Oh yea, that might be because I put quotation marks around the code in my message. I stripped my code down and tired typing the command in again and got the same error. When I use the following code, adding the bit about the line starting with “gi”, the program prints only a tab.
while ($line = ){
if ($line =~ /^gi/){
@blastline = split $line, “\t”;
}
else{
print “”;
}
}
print @blastline;
Maybe I’m missing something obvious? Is your code meant to be used at this step or should I incorporate it into some sort of loop? Thanks!
Actually, I’ve already figured out how to store my BLAST results in a big array in which each element was created by splitting with “\t”. I thought we needed to keep the results organized in lines however, so I dismissed this bit of code, figuring it wasn’t very useful. Reevaluating it though, it seems this may be a useful first step way to go about things. I’ll give it a shot in the morning. Thanks for your time!
Ack! Error in my presentation! Should read: @blastline = split “\t”, $line; I have corrected this in the presentation now – I’m very sorry!
You have another problem though: because you redefine @blastline every time you go through the loop, all that’s kept is the last line. You probably want to process this array each time through the loop.
How do we specify the minimum alignment length to be 70%?
Nevermind, I think I figured it out.
Oh wow, I knew it had to be somethings silly. Anyway, what I hope is a good idea came to me right before I fell asleep last night, so I’m going to give it a shot. Thanks!
Hey yall,
Oops, accidentally hit tab there. Anyway, if anyone is having trouble print the final, tabulated hash table in a way that makes sense, Data::Dumper is useful. I found out about it here:
http://stackoverflow.com/questions/1162245/how-can-i-print-the-contents-of-a-hash-in-perl
So when BLASTing the fnn file against it self I got a whole but of the follow error:
Warning: (1431.1) CFastaReader: Ignoring invalid residue & at line 75086, position 0
The character after residue varied. Will this effect my results?
Thanks!
Had to ask SeqAnswers.com this one. Looks like you somehow have an “&” character in your fasta file that BLAST does not like. I suggest that you try the “tail” command in the terminal to check the bottom few lines of the file. You may have to download it again – maybe the download was interrupted before complete? Really I’m just guessing here – bottom line is that it’s probably corrupted somehow.
Yep, there’s some ugliness down there. For example:
>gi|410485110|ref|NC_018651.1|:c108169-107834 Escherichia coli O104:H4 str. 2009EL-2050 plasmid p09EL50, complete sequence
Luckily, I think its just in the plasmids. I’ll try downloading them differently and running everything again. Thanks for the help.
Wow, weird, the beginning of that line appeared completely differently when I pasted it from the cluster into this text window…
Fixed it!
Huzzah!