Assignment #6

Posted on February 26, 2014March 21, 2014 by Jonathan Klassen

Hello all,

Please find assignment #6 here: MCB_5472_Assignment_6. Note that this assignment is due in TWO WEEKS, not next week.

The presentation is here:

MCB5472_Assignment_6_lecture_Feb-26-14 (pptx)

MCB5472_Assignment_6_lecture_Feb-26-14 (pdf)

23 thoughts on “Assignment #6”

Cera Fisher says:

February 27, 2014 at 10:37 am

I had some trouble getting the hmmscan command to work. At first, it seemed to be a problem with my muscle alignment– I got an error that said something like “missing # STOCKHOLM header” which meant that hmmscan was not properly detecting the format of the muscle alignment. That seemed to resolve on its own and I’m not sure why. (I did try changing the muscle alignment format to a couple of different things before going back to the default.)

The second problem I had was with getting hmmscan to properly put the output in my out file. Here is the command I ended up using:
“hmmscan -o /dev/null –tblout hmmscan.out both.hmm drafts.faa”
where the -o option dumps the on-screen output to the garbage because I didn’t want to deal with it while testing my loop, and –tblout directions the output to my hmmscan.out file. This worked like a charm. One could also direct the -o option to some other file (maybe “hmmscan.screen.out”) if one wanted to be able to look at the human-readable version of the output.
Cera Fisher says:

February 27, 2014 at 10:48 am

Oops, that was totally a known bug. Should have checked my email first. 🙂
Jonathan Klassen says:

February 27, 2014 at 10:49 am

Hi Cera – Peihua also brought that error to my attention yesterday. Turns out the cluster had an older version of HMMER3 installed that did not detect the input file format properly. I worked with the Biotech cluster folks and this was fixed as of ~9pm last night. If you are still getting this error please let me know and I will investigate further. They also requested that you use a –cores 1 flag when running jobs on the cluster because their new version multithreads by default and this can cause trouble of the cluster under high usage.

Thanks for pointing out my syntax error for hmmscan: I should have had “–tblout” instead of “-o”. I have uploaded the correction.
Sumaira says:

March 4, 2014 at 10:25 am

Hey guys,

I don’t know how far along you guys are with assignment but one of things that threw me off while running my script on cluster is that seqret split my files in to yp.acessionnumber.fasta. When I ran my script on cluster, it told me it couldn’t find my files and this was because of lowercase “yp” and my script did not have file names with the fasta extension. Shout out to Professor Klassen who figured this out!
Becca Newcomer says:

March 6, 2014 at 10:44 am

While running my perl script, for steps 6-8, I receive this error:

“Error: Looks like Becca_both.hmm is already pressed (.h3i file present, anyway):
Delete old hmmpress indices first”

I first tried running the script with die, and it works successfully, but when I try to run the script against all the files that error shows up ~2 minutes into the run.

My script is as follows:

foreach my $filename (@table) {

system “cat $file1 $file2 > Becca_both.faa”;
system “muscle -in Becca_both.faa -out Becca_both.muscle.faa”;
system “hmmbuild Becca_both.hmm Becca_both.muscle.faa”;
system “hmmpress Becca_both.hmm”;
system “hmmscan -o/dev/null –tblout Becca_hmmscan.out Becca_both.hmm Drafts.faa”;
}

If I delete .h3i follow as it recommends, the script will continue to run, but I feel as if I shouldn’t have to be sitting here deleting the file as it appears. Also if I run the script using die, and then re-run without having deleted the h3i file, the same error message will appear.

Should I insert something into my loop to delete the file at the end of each run?
Jonathan Klassen says:

March 6, 2014 at 1:06 pm

Deleting the files each time is exactly what you should do. i.e., “system rm *.h3*”
James says:

March 6, 2014 at 7:07 pm

Hi yall,
So we are supposed to calculate the number of hits present in each draft genome? I could see how to do this, except the table generated with hmmscan isn’t returning any info on the genome the hit subject came from. For example:

mult.muscle – gi|429772785|ref|ZP_19304803.1| – 1.4e-12 33.3 8.3 1.4e-12 33.3 8.3 2.5 2 1 0 2 2 2 1 –

I guess the fourth spot is suppose to contain the accession number…but it doesn’t. Has anyone else run into this problem? Thanks.
Jonathan says:

March 6, 2014 at 10:01 pm

Hi James – The question does not specify “per genome” so you don’t need to worry about this. Just count the number of hits that you get above whatever e-value threshold you decide to use to consider the protein an ortholog.

FYI probably the simplest way to count orthologs in an individual genome would be to run the hmmscan vs. one genome at a time.
James says:

March 7, 2014 at 4:20 pm

Ah, yes, that seems nicer. Thank you!
Alicia says:

March 9, 2014 at 5:23 pm

I’m having some trouble getting my code to parse through the hmmscan output file. Psuedocode looks something like this:

open INFILE RBH.out

while my $line = INFILE {
Regular expressions
system (Step 5)
system (Step 6)
system (Step 7)
system (Step 8) which outputs table.out file
while my $line = table.out {
count orthologs }
}

How do I get my second while loop to open table.out each time it goes through the loop? Stating it as an infile doesn’t work because it doesn’t get made until I’m inside the while loop.
Rob says:

March 9, 2014 at 6:10 pm

you could always define it within the loop or create another loop within the loop
Alicia says:

March 9, 2014 at 6:50 pm

How simple. Thanks Rob!
Jonathan Klassen says:

March 9, 2014 at 9:17 pm

Alicia, also be careful because you used $line twice for two different things. That could bite you if you are not careful. You might want to consider also opening table.out as a separate infile to parse it – as written I don’t think that it will work.
Alicia says:

March 10, 2014 at 6:37 pm

So… I’m getting the same number of orthologs in the draft genomes for every single sequence. My both.faa files change each time through the loop, but the table output looks completely identical every single time. In the table file there’s no accession number or anything to identify the sequence.

Anyone else having this problem?
Jonathan says:

March 10, 2014 at 8:58 pm

Alicia, did you try manually deleting all of the intermediate output files and seeing what happens? It is possible that the one of the intermediate steps (e.g., hmmbuild) is not working but because there are .hmm output files from a previous iteration the script still runs.

Another way to troubleshoot is to run a few loop iterations while keeping all of the intermediate output files. The simplest way is to include a $counter variable that increments each time through the loop. Then give all of your intermediate files names like “both.$counter.hmm”. You can include a line like “die if ($counter == 10)” to stop after the tenth time through the loop.

The same principle applies as before: always know everything that your script is doing. If in doubt, print everything – including your intermediate output files.
Becca Newcomer says:

March 11, 2014 at 1:05 pm

Hey guys!

Just wondering as to how long some of your scripts are running. I’m up to 13 hours, and hopefully done soon. Anyone around there?
Jonathan Klassen says:

March 11, 2014 at 1:18 pm

Seems a bit long to me. Are you getting the output that you expect? Another possibility is that everyone is overloading the cluster, having all left the assignment until the last minute. Or those darn EEB folks are at their Bayesian analyses again! 😉
Becca Newcomer says:

March 11, 2014 at 1:35 pm

Yup, I’m getting the output I expect. But I am running a foreach loop, which at this point I think might’ve been a mistake. Oh well. I’ll just blame EEB, and not my script.
Jonathan Klassen says:

March 11, 2014 at 1:41 pm

Ha!
Greg says:

March 16, 2014 at 7:38 pm

Alright so I ran my script a number a times and a number of ways, but my count in my second while loop always comes back adding to my previous number. My script reads like this:
my $count = 0;
while ( $filecount = ) {
$filecount =~ /^both(.*)\|\s[-]\s+(0|\d+[.]\d+e[-+]\d+)\s(.+)/;
my $evalue = $2;
if ($evalue <= evalue threshold) {
$count ++;
Then my output comes back something like –
14
28
42
56
70
84
98
Any ideas on why my $count will not reinitialize as a zero every loop through?
Thanks!
Jonathan Klassen says:

March 16, 2014 at 8:02 pm

No idea, really. I recommend printing $count both every time through the central loop and every time through the hmmscan parsing loop, including some marker that you are at that part of the code (e.g., print “loop starts\n”; print “in the if\n” etc.). That’s the best way to tell what exactly is changing each time and why it is not being initialized. Let me know what happens and I’ll try to help further.
Greg says:

March 17, 2014 at 4:01 pm

So when I ran it and asked to print in the locations you mentioned –
It printed something like the following-
loop starts
loop starts
loop starts
loop start ……
14
28
142 …….
Jonathan Klassen says:

March 17, 2014 at 4:14 pm

So according to that you never actually enter your “if” statement. But count increments still so your print statements either can’t be in the right place or $count is being added too elsewhere in your script. I guess you could be also printing something other than $count, conceivably.

Also FYI perl can accept evalues as numbers, so you don’t necessarily need the fancy regular expression.

Also also FYI a handier way to deal with this file more like NCBI’s output would be @array = split /\s+/, $filecount; your evalue would then be $array[2], I think from your regex.

Comments are closed.