Assignment #5

Hello all,

This week’s assignment has two parts. In the first, we will do one last experiment with our E.coli dataset, finding reciprocal BLAST hit orthologous proteins between the two complete genomes and counting the conserved and unique proteins in each genome. An additional aspect of this exercise is to reinforce the utility of hashes. The second half of this exercise is a fairly brief comparison of different BLAST applications showing the power of PSI-BLAST for finding distant homologs.

Assignment: MCB_5472_Assignment_5

Lecture: MCB5472_Assignment_5_lecture_Feb-19-14 (pdf) MCB5472_Assignment_5_lecture_Feb-19-14 (ppt)


19 thoughts on “Assignment #5

  1. Hey guys,

    So I am having some trouble with the assignment. I have all my best hits stored in the hash for both blast hits. What I am having trouble is actually using the hash table to find the orthologs.

    This is what I have so far:

    Open (INFILE, “blast1”)
    while ($line=INFILE) {
    $match = includes only those hits where query length was at least 70% of the alignment length.
    $match1 = includes only those hits from $match where subject length was at least 70% of the alignment length.
    $besthit = includes only those hits from $match1 where % identity was at least 80%.
    @array3 = split “\t”, $besthit;
    %hash 1= (
    $array3[0] => “$array3[1]”); # this hash contains the query id (keys) and the subject id (values)
    }

    Open (INFILE1, “blast2”)
    while ($line1=INFILE1) {
    #repeat the same process
    $besthit2 = includes only those hits where alignment length is 70% of subject and query length and % identity at least 80%.
    @array7 = split “\t”, $besthi2t;
    %hash2 = (
    $array7[0] => “$array7[1]”); # this hash contains the query id (keys) and the subject id (values) from blast 2
    }

    If $hash1{$array3[0]} eq $hash2{$array7[1]}
    {print $hash1{$array3[0]} $hash2{$array7[1]}, “\n”;

    The last two lines are the problem. I am trying to say if my keys from hash1 match my values from hash2, print those keys and values. However, I keep getting a syntax errors, so somewhere in my code, the logic is faulty. Should I be doing this all in one while loop?

    1. Sumaria, this looks like pseudocode so we can’t help troubleshoot your syntax errors from this. I’ll point out though that the @array3 and @array7 values called after the input “while” loops are only going to contain only the values from the last time through the “while” loops. Remember: use “print”, which will make these sorts of issues apparent.

  2. Also, is anyone else having trouble running psiblast on the cluster. I was trying to make pssm of the query against the NCBI database, and I figured it would take a long time, so I left it running all night and it still wasn’t done. Figured I was doing something wrong. This is what I typed at command line:

    psiblast -db nr -query t4.faa -num_iterations 5 -num_threads 2 -inclusions_ethresh 1e-5 -out blast.out -out_pssm faa1.psm -com_based_stats 1

    Any help is appreciated!

  3. Hello Professor Klassen,

    I know that I have to stay inside the while to get all the data, so right now I have a while loop within a while loop (for reading the second blast results). If I type the following code inside the while loop, terminal seems to get stuck in an infinite loop:

    if ($hash{keys} eq $hash2{values})
    {print $hash{keys} , $hash2{values} , “\n”;
    }

  4. Don’t see how the “while” loops above are nested, but I’ll believe you! How long that evaluation you gave takes will depend on what “keys” and “values” are, and how many times you are running it. Remember that you don’t have to loop through all of the keys in both hashes – you can loop through one set and then evaluate if they are in the other set directly, i.e., you don’t have to loop through both hashes, just one and evaluate if the inverted key/value pair is in the second hash each time.

    Again, PRINT!!!!!! You should be checking “keys”, “$hash{keys}”, “values”, “$hash2{values}” and probably more. My guess is that this is not actually an infinite loop but instead your loops not doing what you expect. I’ll also point out that if you loop through 5000 hits in genome #1 by 5000 in genomes #2 you are having to do 25,000,000 comparisons and I would expect it to take a while! Printing will also tell you how long you need to wait for the script to finish.

    And a hint: if your code is taking forever to run, at the very least there is a better way to do it.

  5. I’m having issues getting only the top hit to show up in my hash. I’m using next if ($hash_table{keys}) but I have a feeling I’m not doing it right. Any advice?

    while (my $line = ) { #reads infile line by line

    push (my @blast_table, $line); #push each line into an array
    @blast_table = split “\t”, $line; #split array by tabs
    #print “$blast_table[0]\n”;
    #print “$blast_table[1]\n”;

    my %hash_table = ();

    next if ($hash_table{keys});

    %hash_table = (
    $blast_table[0] => $blast_table[1]);

    # print “$hash_table{$blast_table[0]}\n”; #check for values

    my @keys = keys %hash_table;
    print “@keys\n”; #check for keys
    }

  6. Hi Alicia – Not entirely sure I know everything that’s happening here, e.g., you only indicate 1 of 2 infiles. I guess just trying to set up the first hash? It looks like you are pushing all of the lines into a single array instead of converting them to arrays one by one. Have you printed out $blast_table[0] and $blast_table[1] and are they what you expect? Your “next” statement looks pretty good to me, except that you need to specify an actual key value (e.g., $blast_table[0]) instead of the generic “keys”. Helps?

  7. As I’m running my PSIblast I am getting these errors:

    Warning: lcl|Query_1 gi|281306660|ref|YP_003345466.1| hypothetical protein SBWP25_0001 [Pseudomonas phage phi-2]: Warning: Frequency ratios for PSSM are all zeros, frequency ratios for BLOSUM62 will be used during traceback in composition based statistics
    Warning: lcl|Query_4 gi|281306663|ref|YP_003345469.1| hypothetical protein SBWP25_0004 [Pseudomonas phage phi-2]: Warning: Frequency ratios for PSSM are all zeros, frequency ratios for BLOSUM62 will be used during traceback in composition based statistics
    Warning: lcl|Query_5 gi|281306664|ref|YP_003345470.1| hypothetical protein SBWP25_0005 [Pseudomonas phage phi-2]: Warning: Frequency ratios for PSSM are all zeros, frequency ratios for BLOSUM62 will be used during traceback in composition based statistics
    Warning: lcl|Query_6 gi|281306665|ref|YP_003345471.1| hypothetical protein SBWP25_0006 [Pseudomonas phage phi-2]: Warning: Frequency ratios for PSSM are all zeros, frequency ratios for BLOSUM62 will be used during traceback in composition based statistics
    Warning: lcl|Query_7 gi|281306666|ref|YP_003345472.1| hypothetical phage membrane protein [Pseudomonas phage phi-2]: Warning: Frequency ratios for PSSM are all zeros, frequency ratios for BLOSUM62 will be used during traceback in composition based statistics
    Warning: lcl|Query_10 gi|281306669|ref|YP_003345475.1| hypothetical protein SBWP25_0010 [Pseudomonas phage phi-2]: Warning: Frequency ratios for PSSM are all zeros, frequency ratios for BLOSUM62 will be used during traceback in composition based statistics
    Warning: lcl|Query_11 gi|281306670|ref|YP_003345476.1| hypothetical protein SBWP25_0011 [Pseudomonas phage phi-2]: Warning: Frequency ratios for PSSM are all zeros, frequency ratios for BLOSUM62 will be used during traceback in composition based statistics
    Warning: lcl|Query_14 gi|281306673|ref|YP_003345479.1| hypothetical protein SBWP25_0014 [Pseudomonas phage phi-2]: Warning: Frequency ratios for PSSM are all zeros, frequency ratios for BLOSUM62 will be used during traceback in composition based statistics

    I think the script is still running, but I’m not sure what this means…
    Any help is appreciated.

  8. Hi Amy,

    Are you getting these errors when creating the pssm or when actually running the psiblast? I think we only need one specific protein sequence, so the phage coat protein sequence for example is the only one you would use to create your pssm and then use that pssm to run your psiblast. You actually have to go through the NCBI web site to find these specific protein sequence! Hope that helps and hopefully I am not wrong about this!

  9. Hi Amy – Sumaira’s got it right: you only need to BLAST one protein vs your genome of choice, ideally one of the types listed in the assignment or something similar that might be there in multiple copies. Regarding the error, I think it means that the PSI-BLAST finds no hits (based on the only mention of this error in SeqAnswers). This would be consistent with these being hypothetical proteins. I think that if you stick with a more common protein this won’t be an issue.

  10. Hi Professor Klassen,

    So I ran a blastp of a putative coat protein from a phage which I know can infect E. coli strain. My output file is empty though, and I think it just didn’t find any hits between my e.coli strain and the phage coat protein. Should I just explain this in my assignment or should I choose a different molecular parasite?

  11. Hi Sumaira – In that case I recommend picking a different parasite. Maybe search the E.coli genome for the term “transposase” if you are having trouble.

  12. That you got from the same genome? Something is wrong with either your query sequence, your reference, or your command.

  13. This depends entirely on the parameters you ran it with. Possibilities include: very permissive thresholds generating lots of hits, lots of iterations, using >1 query sequence. (I’ve seen several people trying to do the latter and it not finishing overnight.) Try “top” or “ps” in the terminal to see if your process is still running. You can also see if new output is being generated by using “tail” or “ls -l” multiple times to see if anything is changing.

    Having said all that, yes it can take a while. I think Peter recommended in class that if it was taking > 5hr there was probably something wrong.

Comments are closed.