Assignment #3

Hello all,

As you have likely heard, UConn is closed tomorrow and so we will not be having class.  There will be no make up lecture, however I am posting a new assignment building on the perl techniques you have already learned and applying them to think about some of the genome quality issues that we discussed on Monday.  The new assignment will be due before class on Wednesday Feb 12/14 and can be found here: MCB_5472_Assignment_3

Questions #1 and #2 from Assignment #2 will be due at 1pm on Thursday Feb 6/14 to accommodate UConn’s closure tomorrow.  Question #3 from this assignment has been moved to Assignment #3.

As always, please post your questions here or email myself or Peter directly.  Please note that if you question has been answered on the website already we will just refer you there, so you may as well make it your first stop!


27 thoughts on “Assignment #3

  1. Hey guys,

    So I am working on the first part of the third assignment (translating the sequence), and again I am having the same problem of not receiving anything in my out (this time though I am working in the correct directory). This is my code:

    #!/usr/bin/perl
    open (INFILE, “sumaira2.out”);
    open (OUTFILE3, “>>sumaira3.out”);

    while ($line=){
    $codon = $codon.$line;
    @array = split “”,$codon;
    }

    for ($count = 0; $count “F”, TTC => “F”, TTA => “L”, TTG => “L”,
    TCT => “S”, TCC => “S”, TCA => “S”, TCG => “S”,
    TAT => “Y”, TAC => “Y”, TAA => “STOP”, TAG => “STOP”,
    TGT => “C”, TGC => “C”, TGA => “STOP”, TGG => “W”,
    CTT => “L”, CTC => “L”, CTA => “L”, CTG => “L”,
    CCT => “P”, CCC => “P”, CCA => “P”, CCG => “P”,
    CAT => “H”, CAC => “H”, CAA => “Q”, CAG => “Q”,
    CGT => “R”, CGC => “R”, CGA => “R”, CGG => “R”,
    ATT => “I”, ATC => “I”, ATA => “I”, ATG => “M”,
    ACT => “T”, ACC => “T”, ACA => “T”, ACG => “T”,
    AAT => “N”, AAC => “N”, AAA => “K”, AAG => “K”,
    AGT => “S”, AGC => “S”, AGA => “R”, AGG => “R”,
    GTT => “V”, GTC => “V”, GTA => “V”, GTG => “V”,
    GCT => “A”, GCC => “A”, GCA => “A”, GCG => “A”,
    GAT => “D”, GAC => “D”, GAA => “E”, GAG => “E”,
    GGT => “G”, GGC => “G”, GGA => “G”, GGG => “G”,
    );

    if(exists $aacode{$codon}) {
    return $aacode{$codon};
    }else{
    print STDERR “Bad codon \”$codon\”!!\n”;
    exit;
    }
    $aminoacid = $aminoacid.$codon;

    print OUTFILE3 $aminoacid;

    I think I am having trouble executing the hash table. I checked the Beginning Perl for Bioinformatics but they were using hash table via sub commands. Terminal is printing the STDER “bad codon” and it goes on to print the entire infile to the screen.

  2. It seem though some of my code got messed up in the process of copy and paste. Lets try this again.

    #!/usr/bin/perl
    open (INFILE, “sumaira2.out”);
    open (OUTFILE3, “>>sumaira3.out”);

    while ($line=){
    $codon = $codon.$line;
    @array = split “”,$codon;
    }

    for ($count = 0; $count “F”, TTC => “F”, TTA => “L”, TTG => “L”,
    TCT => “S”, TCC => “S”, TCA => “S”, TCG => “S”,
    TAT => “Y”, TAC => “Y”, TAA => “STOP”, TAG => “STOP”,
    TGT => “C”, TGC => “C”, TGA => “STOP”, TGG => “W”,
    CTT => “L”, CTC => “L”, CTA => “L”, CTG => “L”,
    CCT => “P”, CCC => “P”, CCA => “P”, CCG => “P”,
    CAT => “H”, CAC => “H”, CAA => “Q”, CAG => “Q”,
    CGT => “R”, CGC => “R”, CGA => “R”, CGG => “R”,
    ATT => “I”, ATC => “I”, ATA => “I”, ATG => “M”,
    ACT => “T”, ACC => “T”, ACA => “T”, ACG => “T”,
    AAT => “N”, AAC => “N”, AAA => “K”, AAG => “K”,
    AGT => “S”, AGC => “S”, AGA => “R”, AGG => “R”,
    GTT => “V”, GTC => “V”, GTA => “V”, GTG => “V”,
    GCT => “A”, GCC => “A”, GCA => “A”, GCG => “A”,
    GAT => “D”, GAC => “D”, GAA => “E”, GAG => “E”,
    GGT => “G”, GGC => “G”, GGA => “G”, GGG => “G”,
    );

    if(exists $aacode{$codon}) {
    return $aacode{$codon};
    }else{
    print STDERR “Bad codon \”$codon\”!!\n”;
    exit;
    }
    $aminoacid = $aminoacid.$codon;

    print OUTFILE3 $aminoacid;

  3. Hi Sumaira – A bunch of interleaved problems here; you’re going to have to start at the beginning and troubleshoot them one by one. Once you hit more specific headaches we can better work through them here.

    Troubleshooting tip #1: if something is not behaving as you expect, get your perl script to print out each value every time the script does something to change it. Then you can see what is changing when and adjust your script accordingly. E.g., print @array to ensure that there is something in it and that it contains a coding sequence. E.g., print $codon before you evaluate $aacode{$codon} etc. (FYI I don’t think $codon contains what you think it does).

    Troubleshooting tip #2: write as little of your script as possible before testing it. If your problem is on line 6 (which it could very well be), only looking at OUTFILE3 at line 20-something is not going to be informative. Any number of things could have (and did) go wrong before that point.

    Troubleshooting tip #3: the Korf book talks about using “use warnings” and “use strict” – these are more complicated but often provide hints about where your problems start. Of course, they are very “strict” and can cause lots of grief to beginners, which is why I have not taught them to you up to this point. So there is a trade-off but there you are.

    Specific to you: I suggest going back to the Assignment #2 lecture notes and looking at how to define hashes and for loops. There are several problems here.

  4. $codon does contain what I think it does, but for some reason part of that code keeps getting cut out when I post it on the comments section. I did get it to work, however, I am not using $count instead I am using $i. It works fine, the only problem is in the output file, it gives me the entire amino acid sequence in one line which doesn’t seem the most effective way to read it.

  5. I’d have to see your actual code then. As your wrote it above, neither you hash nor your for loop will work, and $codon will be the entire sequence string, not a subset of it split in @array.

    Regarding output in a single line, this will be true if you do not include some sort of new line character in the print statement. But in reality this is not a big deal. Remember that there is a difference between “computer-readable” and “human-readable”.

  6. Hi all/Prof. Klassen,
    So, I’m a bit unsure of the directions for part 2 of this assignment. By “download genome”, is it meant that we should download the complete nucleotide sequence of the organism, as stored in the .fna files on the FTP site (for the complete genomes)? And for proteins, should we download the .faa file? Sorry if this question is a little trivial, but all these different formats are somewhat intimidating .

  7. Oh, I think I figured it out…we’re supposed to figure it out. Does anyone see a way to download all the info we need without crawling through all the FTP directories and downloading each file individually? I’ll let yall know if I do.

  8. Hi James – You’ve got it right on both counts! For what it’s worth I spend some time trying to find a simple way to do this yesterday and…failed utterly. The good news is that there’s only 14 files total (fna and faa) that you need for the complete genomes (chromosomes and plasmids), and that for the drafts all fna and faa files are each in a single tarball for each genome. So it’s a small enough task to do manually (which I did myself, took me 15min or so). Truthfully, I’ve always found getting data from NCBI to be somewhat of a headache no matter what.

    If you are feeling super keen, it is possible to download entire folders from NCBI via the terminal. The command is “ftp http://ftp.ncbi.nlm.nih.gov“, user=anonymous pwd=your email. You can then navigate through folders using “cd” and “ls” just like a normal terminal. “get [filename]” downloads individual files onto your computer and “mget” downloads entire folders or other sets of multiple files. Be warned that this whole set up doesn’t always play nice, which is why I didn’t include it in the instructions; ncftp is another ftp program I’ve also successfully used. You don’t need to know any of this, but you asked!

    Happy hunting!

  9. Hi Prof. Klassen,
    Oh yeah, that wasn’t so bad. It took me about 15 minuets to get everything downloaded and organized. Also, I managed to change the name of my working directory to “ftp” with your suggested command, which was exciting, but then the terminal responded to all my subsequent commands by telling me I wasn’t connected, so I just resorted to doing it manually. Anyway, thanks for the help.

  10. Yep, I’ve found FTP flaky too. Although I did once write an interface via perl that I’ve been happy with. Glad you got it all in the end.

  11. Hey guys!

    For anyone who’s having trouble doing a ‘for’ loop, I suggest trying the command ‘unpack’. It’ll take your scalar sequence, and convert into an array, in which each variable is a group of 3 genes. Hope this helps!

  12. Thanks Becca – that’s a new one for me but looks like it should work. Terrifying documentation though. TMTOWTDI – a perl motto! (“There’s More Than One Way To Do It”)

  13. Sorry all, there was a syntax error in the last note of Assignment #3. Correct version is now attached.

  14. Hey guys,

    So, I am on the last part of the assignment and I am absolutely stuck when we have to determine the genome size. I was hoping that by creating an array that splits all the character in a string, and then just printing the scalar array would give me the number of bases in the genome. However, it seems as though it is only counting the character in the very last line of the infile. Maybe I am approaching the problem wrong. Any help is appreciated.

  15. Hey Sumaira and everyone else,

    First and foremost i finally figured out how to read these comments (momentous achievement i know):

    In regards to what you have said:
    how are you dealing with whitespace characters? it might help to get rid of all of them or to undefine the standard record separator,
    the command for that is:

    undef $/;
    chomp $/;

    The chomp will remove it after it has been undefined.
    This can allow you to deal with entire infiles that may have been entered as an array as a string:
    I dont really know if i am getting your problem exactly but it seems like you might run into issues dealing with whitespace characters that may exist within the file. I dont know if you have any regular expressions but it could be that as well. I also do not know how you sill be printing the length im assuming just printing the length of the array you have created with single nucleotides in each position?

    Also if anyone is looking for a good reference for regex commands here is one below:
    http://www.cs.tut.fi/~jkorpela/perl/regexp.html

    Also here is an intro to regex commands for anyone who finds it useful:
    http://www.zytrax.com/tech/web/regex.htm

  16. Hi Sumaira – Sounds like the approach is right, but the execution is problematic. Remember to use “print” to check exactly what all of you variables are doing. It sounds to me like your array does not contain all of the sequence like you think it does.

    And sorry for the delayed response – I was moving this weekend and only got internet back now.

    Good luck!

  17. Thanks for the feedback guys. Using the print command after each line of code was super helpful in terms of letting me know whether specific lines of code was functioning properly or not. It also turns out that the length command as Jonathan mentioned in assignment 3 comes pretty handy.

  18. Is anyone else having trouble concatenating files?

    I’m using the command cat *.faa > all.faa as suggested but it only prints the first file into my all.faa file

    All my files are in the right place. I’m kind of stumped.

  19. Mysterious. Are you sure that your input files are in the same folder and end with “.faa”? And not “.fna”? (in this case)

  20. Can you please attach the output of of “ls” in your directory and a copy of you exact “cat” command?

  21. Alicias-MacBook-Pro:Desktop MacBookPro$ ls
    02030.faa.tgz AMVR01000015.faa AMVR01000031.faa
    02030.fna.tgz AMVR01000016.faa AMVR01000032.faa
    AMVR01000001.faa AMVR01000017.faa AMVR01000033.faa
    AMVR01000002.faa AMVR01000018.faa AMVR01000034.faa
    AMVR01000003.faa AMVR01000019.faa AMVR01000035.faa
    AMVR01000004.faa AMVR01000020.faa AMVR01000036.faa
    AMVR01000005.faa AMVR01000021.faa AMVR01000037.faa
    AMVR01000006.faa AMVR01000022.faa AMVR01000038.faa
    AMVR01000007.faa AMVR01000023.faa AMVR01000039.faa
    AMVR01000008.faa AMVR01000024.faa AMVR01000040.faa
    AMVR01000009.faa AMVR01000025.faa AMVR01000041.faa
    AMVR01000010.faa AMVR01000026.faa AMVR01000042.faa
    AMVR01000011.faa AMVR01000027.faa AMVR01000043.faa
    AMVR01000012.faa AMVR01000028.faa
    AMVR01000013.faa AMVR01000029.faa
    AMVR01000014.faa AMVR01000030.faa
    Alicias-MacBook-Pro:Desktop MacBookPro$ cat *.faa > all.faa

    >gi|429361479|gb|EKY98133.1| transposase insH for insertion sequence element IS5Y, partial [Escherichia coli O104:H4 str. 11-02030]
    MSHQLTFADSEFSTKRRQTRKEIFLSRMEQILPWQNMTAVIEPFYPKAGNGRRPYPLE
    >gi|429361480|gb|EKY98134.1| hypothetical protein C212_00002 [Escherichia coli O104:H4 str. 11-02030]
    MREISDHMLDFVKRGMNLNGLPREGANKFLI
    >gi|429361481|gb|EKY98135.1| hypothetical protein C212_00003 [Escherichia coli O104:H4 str. 11-02030]
    MYLTKKIIISMMFILPSAAFSSDPPPLQQSLEKTTYFSIGMNGFIGYQSEGEKLYTHILTLDNPEEIFKN
    IIKNRKSTKESKIYAACGLYYLNVENIESLFNKNDKQEYVSVLRGDILTKIKLNDILNSVIINGCNTKLI
    SEHK
    >gi|429361482|gb|EKY98136.1| hypothetical protein C212_00004 [Escherichia coli O104:H4 str. 11-02030]
    MNCDNNHRNEEFIVTFDKGNKQDNSRRKHDNFPIEVESSVEQETHCITNNKSASGIVTHDYDADYICGCG
    EIMCPGCGHDL
    >gi|429361483|gb|EKY98137.1| hypothetical protein C212_00005 [Escherichia coli O104:H4 str. 11-02030]
    MIKDNSKADDSADRLSFTETLFVAHSILPDLTQTKTKENDCYLLFKKSVILSI
    >gi|429361484|gb|EKY98138.1| microcin H47 immunity protein mchI [Escherichia coli O104:H4 str. 11-02030]
    MSYKKLSQLTAIFSLSITILLVSLSSLRIVGEGNSYVDVFLSFIIFLGFIELIHGIRRILVWSGWKNGS
    >gi|429361485|gb|EKY98139.1| hypothetical protein C212_00007 [Escherichia coli O104:H4 str. 11-02030]

  22. Looks like it should have worked to me. What makes you think that it didn’t? Try “ls -l” to show file sizes. “all.faa” should be larger than the other .faa files.

  23. It is showing that the file is bigger than the rest. I guess I assumed when entered less all.faa to look at all the files it would show me all of the files and not just the first one. Should I just assumed that all of them are in there?

  24. I think it probably will show all of the files, but you did not scroll down far enough. Note that each of those .faa files contains all of the proteins predicted on that contig, so each .faa will have multiple protein sequences in fasta format. If you want to see what is at the end of each file, use the UNIX “tail” command. “tail all.faa” should look the same as “tall AMVR01000043.faa”, I think.

  25. Did anyone find Escherichia coli O104:H4 str. LB226692_2.0 ? I found a version of it, but it is not the 2.0 version.

  26. Hi Greg – There is an entry for this strain on the FTP site, and you can tell that it’s version 2.0 from the VERSION field in its gbk file. I believe you can get both versions through the normal website too.

Comments are closed.