Except where otherwise noted, this and all course materials for CS 112 are licensed under Attribution-NonCommercial-ShareAlike CC BY-NC-SA held by the Trustees of the University of Illinois (University of Illinois at Chicago).

Learning objectives:

  • Working with strings.
  • Slicing strings.
  • Basic functions.
  • Working with GenBank.
  • Understanding the connection between DNA, mRNA, and proteins.

Sequences in GenBank

On your computer, use a web browser to access GenBank: http://www.ncbi.nlm.nih.gov/genbank. Once there, find a nucleotide sequence for the human coagulation factor IX, sometimes called the “Christmas factor” (F9) gene. In other words, find a DNA sequence for the gene that encodes the coagulation factor IX protein. This is found by using the search area at the top of the GenBank web page. You are looking for a specific “accession”—a sequence submission record—with the accession ID: NG_007994.

To summarize, we need to find the nucleotide sequence for the human coagulation factor IX, sometimes called the “Christmas factor” (F9) gene. To do this we:

  1. Use a web browser to access GenBank: http://www.ncbi.nlm.nih.gov/genbank
  2. Use the search area at the top of the GenBank web page
  3. Make sure we are searching for a Nucleotide (select Nucleotide using the drop down menu).
  4. Enter the accession ID: NG_007994 in the search field
  5. Click search
  6. Verify the page we go to specifies NCBI Reference Sequence: NG_007994.1 just under the main title.

Structure of Eukaryotic Genes

Eukaryotic genes (like F9) are composed of messenger RNA (mRNA)-coding sequences called exons (expressed portions of the DNA sequence) and intervening sequences called introns (the name emphasizes their intervening role). The gene is first transcribed into pre-mRNA. Intron sequences in pre-mRNA are non-coding and are removed after transcription, before the mRNA is translated into protein. The exons are then joined together (concatenated) to form mature mRNA. The process of removing introns and reconnecting exons is called ‘splicing.’ Mature mRNA is composed of a coding sequence (CDS) and untranslated regions (UTRs) at the 5′ and 3′ ends. The coding sequence is the portion of mRNA that codes for amino acids; it is made up of codons, each three nucleotides long.
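To make splicing concrete, here is a minimal sketch on a made-up pre-mRNA string. The sequence, the intron, and the exon positions are all invented for illustration (they are not taken from F9), and untranslated regions are ignored to keep the example short.

```python
# Toy pre-mRNA: uppercase letters are exons, lowercase letters are a made-up intron.
pre_mrna = "AUGGCC" + "guaaguuuag" + "UUUUAA"

# Hypothetical exon positions, written as Python slices (0-based, end not included).
exon_1 = pre_mrna[0:6]
exon_2 = pre_mrna[16:22]

# Splicing: leave out the intron and concatenate the exons.
mature_mrna = exon_1 + exon_2

# Codons are consecutive groups of three bases.
first_codon = mature_mrna[0:3]
second_codon = mature_mrna[3:6]

print(mature_mrna)   # AUGGCCUUUUAA
print(first_codon)   # AUG
print(second_codon)  # GCC
```

The same two operations, slicing and concatenation, are all that is needed to assemble a coding sequence from a real gene record.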

The amino acid coding portions (CDS), along with other gene features, are annotated on the left side of the description in GenBank records. For example, you will see something similar to this in the annotations for the F9 gene:

CDS         join(5030..5117,11275..11438)

The actual line on the GenBank page will be much longer (i.e. containing more than just the ranges for two exons) but the first two ranges match exactly what is given above.

The word join in a GenBank record is analogous to a function in Python. It is an instruction to slice out and join (concatenate) the segments separated by commas within parentheses. The resulting string represents the amino acid coding sequence (CDS). Assuming we have the entire F9 gene sequence stored in a variable F9, the example above could be written in Python as:

cds = F9[5029:5117] + F9[11274:11438]

Caution: Python indexes start at 0, but GenBank annotations start at 1. Notice how the coordinates differ between the GenBank record example and the Python code above: each start coordinate is reduced by one, while each end coordinate is unchanged, because a GenBank range includes its end position and a Python slice stops just before its end index. Failing to adjust indexes correctly is a common mistake in computer science, and the resulting bugs are known as off-by-one errors. While seemingly trivial, these errors may have serious consequences.
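One way to avoid repeating the index adjustment by hand is to put it in a small helper. The sketch below is only an illustration: the function name genbank_range, the toy sequence, and the annotation join(2..4,7..9) are all made up and are not part of the F9 record or the lab template.

```python
def genbank_range(seq, start, end):
    """Return the bases at GenBank positions start..end (1-based, inclusive)."""
    # Shift the start down by one; keep the end, because a Python slice
    # stops just before its end index.
    return seq[start - 1:end]


# Toy sequence with positions 1..10 in GenBank terms.
toy = "ACGTACGTAC"

# Equivalent of a made-up annotation: join(2..4,7..9)
piece = genbank_range(toy, 2, 4) + genbank_range(toy, 7, 9)

print(piece)       # CGTGTA
print(len(piece))  # 6
```

A quick sanity check is the length: a GenBank range start..end covers end - start + 1 bases, so 2..4 and 7..9 contribute three bases each.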

Assignment Description

  1. Write a function named extract_f9_cds which has one parameter, F9, the F9 gene sequence. The goal of this function is to extract the coding regions from the F9 gene sequence (provided in the template), concatenate them, and return the resulting string. Hint: You can confirm your program is functioning correctly by clicking on the CDS annotation in GenBank. This will highlight the relevant parts of the sequence, which should match your output.
  2. Write a function named get_max_possible_codons which has one parameter seq and returns the maximum number of codons this DNA sequence would contain if it were wholly composed of coding regions. Remember that each codon is made up of 3 nucleotide bases.
  3. Write a function named get_gc_percent which has one parameter seq. The goal of this function is to compute the proportion of G and C bases (characters) in seq to the total number of bases (characters) in seq. The returned value should be of type float in the range between 0.0 and 100.0 (as a percentage, not a fraction). To do this, use the string method count( ) to determine the number of ‘G’ bases and the number of ‘C’ bases.
  4. Write a function named get_coding_ratio which has two parameters seq and cds. The goal of this function is to calculate the proportion of coding nucleotides to total nucleotides in the entire sequence. In other words: of the total number of nucleotides in the gene (seq), what is the proportion that codes for amino acids (cds)? Remember that a ratio will be a value of type ‘float’ in the range between 0.0 and 1.0.
  5. Write a function named print_seq_info which has two parameters seq and cds. This function should use the functions you wrote for problems 1 through 4 and print a correctly formatted summary:
    Sequence length: ...
    Coding sequence length: ...
    Number of possible codons: ...
    Number of actual codons: ...
    First 4 codons of the coding sequence: ...
    Ratio of Coding NT to Total NT: ...
    GC percent of the entire sequence: ...
    GC percent of the coding sequence: ...

    • The Sequence length: output should use the built-in len( ) function with the ‘seq’ parameter.
    • The Coding sequence length: output should use the built-in len( ) function with the ‘cds’ parameter.
    • The Number of possible codons: output should use your get_max_possible_codons( ) function with the ‘seq’ parameter.
    • The Number of actual codons: output should use your get_max_possible_codons( ) function with the ‘cds’ parameter.
    • The First 4 codons of the coding sequence: output should use slicing with the ‘cds’ parameter.
    • The Ratio of Coding NT to Total NT: output should use the get_coding_ratio( ) function with both the ‘seq’ and ‘cds’ parameters.
    • The GC percent of the entire sequence: output should use the get_gc_percent( ) function with the ‘seq’ parameter.
    • The GC percent of the coding sequence: output should use the get_gc_percent( ) function with the ‘cds’ parameter.
  6. Write a few sentences explaining what this gene is and what its protein does, state the name of a disease caused by a variant (mutation) at the F9 gene, and describe one such disease-causing variant. Hint: look in the right panel on GenBank, or use the web. (You can write your answer in the same file as your Python code by commenting out the text. The starter code for Lab 3 already has a place for this near the top of the file.)
  7. Make sure you are writing your code using Good Programming Style. Aspects of Good Programming Style include (but are not limited to):
    • File Header Comment/docstring at the beginning of the file to describe the purpose of the program
    • File Header Comment/docstring at the beginning of the file to give information about the programmer/author of the program
    • Function Comments/docstrings to describe the purpose of EACH function
    • Using meaningful variable names
    • In-line comments/docstrings where needed
    • Blank lines to separate sections of your code
    • Proper use of indentation and consistent depth of indentation
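
As a small illustration of the style points in item 7 and of the string tools mentioned in items 2 through 5, here is a sketch on a made-up 13-base sequence. It is not a solution to any of the problems: the function name count_base and the toy sequence are hypothetical, and the print layout is not the required summary format.

```python
"""Toy demonstration of string tools on a short, made-up DNA sequence.

Author: (your name here)
"""


def count_base(seq, base):
    """Return the number of times base occurs in seq."""
    return seq.count(base)


# A made-up 13-base sequence, not taken from the F9 record.
toy = "ATGGCCTTTTAAG"

# count( ) tallies occurrences of a character.
g_count = count_base(toy, "G")

# Integer division (//) discards the remainder, so 13 bases hold 4 whole triplets.
whole_triplets = len(toy) // 3

# Slicing picks out the first two triplets (6 bases).
first_two_triplets = toy[0:6]

# Ordinary division (/) always returns a float.
g_fraction = g_count / len(toy)

print("G count:", g_count)                        # G count: 3
print("Whole triplets:", whole_triplets)          # Whole triplets: 4
print("First two triplets:", first_two_triplets)  # First two triplets: ATGGCC
print("G fraction:", g_fraction)
```

Note the difference between // and /: the first gives a whole number suited to counting complete codons, while the second gives a float suited to ratios and percentages.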