Day3

File handling
Karin Lagesen

karin.lagesen@bio.uio.no

Homework
● ATCurve.py
● take an input string from the user
● check if the sequence only contains DNA – if
not, prompt for new sequence.
● calculate a running average of AT content
along the sequence. Window size should be
3, and the step size should be 1. Print one
value per line.
● Note: you need to include several runtime
examples to show that all parts of the code
work.

ATCurve.py - thinking
● Take input from user:
● raw_input
● Check for the presence of !ATCG
● use sets – very easy
● Calculate AT – window = 3, step = 1
● iterate over string in slices of three

ATCurve.py
# The variable valid records whether the string is OK or not.
valid = False
while not valid:
    # Prompt the user for input using raw_input() and store it in a string,
    # then convert all characters to uppercase
    test_string = raw_input("Enter string: ")
    upper_string = test_string.upper()

    # Figure out if anything other than A, T, G and C is present
    dnaset = set("ATGC")
    upper_string_set = set(upper_string)

    if len(upper_string_set - dnaset) > 0:
        print "Non-DNA present in your string, try again"
    else:
        valid = True

if valid:
    # Window size 3, step size 1: the last window starts at len - 3
    for i in range(0, len(upper_string) - 3 + 1, 1):
        at_sum = 0.0
        # count(letter, start, end) counts within the slice [start:end]
        at_sum += upper_string.count("A", i, i + 3)
        at_sum += upper_string.count("T", i, i + 3)
        # Print the 1-based window position and the AT fraction
        print i + 1, at_sum / 3

Homework
● CodonFrequency.py
● take an input string from the user
● if the sequence only contains DNA
– find a start codon in your string
– if startcodon is present
● count the occurrences of each three-mer from start
codon and onwards
● print the results

CodonFrequency.py - thinking
● First part – same as earlier
● Find start codon: locate index of AUG
● Note, can simplify and find ATG
● If start codon is found:
● create dictionary
● for slice of three in input[StartCodon:]:
– get codon
– if codon is in dict:
● add to count
– if not:
● create key-value pair in dict

CodonFrequency.py
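input = raw_input("Type a piece of DNA here: ")

# Anything left after removing A, T, G and C is not DNA
if len(set(input) - set("ATGC")) > 0:
    print "Not a valid DNA sequence"
else:
    # find() returns the index of the first ATG, or -1 if there is none
    atg = input.find("ATG")
    if atg == -1:
        print "Start codon not found"
    else:
        codondict = {}
        # Step through the string in slices of three, starting at the start codon.
        # With len(input) - 3 as the stop value, a codon that ends exactly at the
        # last letter of the string is not visited.
        for i in xrange(atg, len(input) - 3, 3):
            codon = input[i:i+3]
            if codon not in codondict:
                codondict[codon] = 1
            else:
                codondict[codon] += 1

        for codon in codondict:
            print codon, codondict[codon]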

CodonFrequency.py w/ stop codon
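input = raw_input("Type a piece of DNA here: ")

# Anything left after removing A, T, G and C is not DNA
if len(set(input) - set("ATGC")) > 0:
    print "Not a valid DNA sequence"
else:
    atg = input.find("ATG")
    if atg == -1:
        print "Start codon not found"
    else:
        codondict = {}
        # With len(input) - 3 as the stop value, a codon that ends exactly at the
        # last letter of the string is not visited.
        for i in xrange(atg, len(input) - 3, 3):
            codon = input[i:i+3]
            # The stop codons are written with RNA letters here. The input has
            # been checked to contain only A, T, G and C, so this test never
            # matches; for DNA input the stop codons are TAG, TAA and TGA.
            if codon in ['UAG', 'UAA', 'UGA']:
                break
            elif codon not in codondict:
                codondict[codon] = 1
            else:
                codondict[codon] += 1

        for codon in codondict:
            print codon, codondict[codon]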

Results

[karinlag@freebee]/projects/temporary/cees-python-course/Karin% python CodonFrequency2.py
Type a piece of DNA here: ATGATTATTTAAATG
ATG 1
ATT 2
TAA 1
[karinlag@freebee]/projects/temporary/cees-python-course/Karin% python CodonFrequency2.py
Type a piece of DNA here: ATGATTATTTAAATGT
ATG 2
ATT 2
TAA 1
[karinlag@freebee]/projects/temporary/cees-python-course/Karin%
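
A minimal sketch of the same counting logic wrapped in a function, under two assumptions that differ from the scripts above: the stop codons are written with DNA letters (TAG, TAA, TGA), and a codon that ends exactly at the end of the string is included. The function name count_codons is hypothetical, and the code is written so that it runs under Python 3 as well as Python 2.

```python
def count_codons(dna):
    """Count codons from the first ATG up to, but not including, the first stop codon."""
    stop_codons = ("TAG", "TAA", "TGA")  # assumption: DNA spelling of the stop codons
    counts = {}
    start = dna.find("ATG")
    if start == -1:
        return counts  # no start codon: nothing to count
    # len(dna) - 2 is one past the last index where a complete codon can start
    for i in range(start, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if codon in stop_codons:
            break
        counts[codon] = counts.get(codon, 0) + 1
    return counts

result = count_codons("ATGATTATTTAAATG")
for codon in sorted(result):
    print("%s %d" % (codon, result[codon]))
```

With this variant the loop stops at TAA, so only the codons before it are counted.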

Working with files
● Reading – get info into your program
● Parsing – processing file contents
● Writing – get info out of your program

Reading and writing
● Three-step process
● Open file
– create file handle – reference to file
● Read or write to file
● Close file
– will be closed automatically at program end, but
it is bad form not to close it yourself

Opening files
● Opening modes:
● "r" - read file
● "w" - write file
● "a" - append to end of file
● fh = open("filename", "mode")
● fh = filehandle, reference to a file, NOT the
file itself
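
A small sketch of the three modes in action. The file name example.txt is made up, and the file is created in a temporary directory so that nothing real is overwritten; the tempfile and os modules are only used for that purpose.

```python
import os
import tempfile

# Hypothetical file name inside a temporary directory
path = os.path.join(tempfile.mkdtemp(), "example.txt")

fh = open(path, "w")      # "w" creates the file, or empties it if it already exists
fh.write("first line\n")
fh.close()

fh = open(path, "a")      # "a" keeps the existing contents and writes at the end
fh.write("second line\n")
fh.close()

fh = open(path, "r")      # "r" opens the file for reading only
contents = fh.read()
fh.close()

print(contents)
```

Opening the same file with "w" a second time would have discarded the first line, which is why "a" is used for the second write.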

Reading a file
● Three ways to read
● read([n]) - n = bytes to read, default is all
● readline() - read one line, incl. newline
● readlines() - read file into a list, one element
per line, including newline

Reading example
● Log on to freebee, and go to your area
● do cp ../Karin/fastafile.fsa .
● open python
>>> fh = open("fastafile.fsa", "r")
>>> fh
<open file 'fastafile.fsa', mode 'r' at 0x...>

● Q: what does the response mean?

Read example
● Use all three methods to read the file. Print
the results.
● read
● readlines
● readline
● Q: what happens after you have read the
file?
● Q: What is the difference between the
three?

Read example
>>> withread = fh.read()
>>> withread
'>This is the description line\nATGCGCTTAGGATCGATAGCGATTTAGA\nTTAGCGGA\n'
>>> withreadlines = fh.readlines()
>>> withreadlines
[]
>>> fh = open("fastafile.fsa", "r")
>>> withreadlines = fh.readlines()
>>> withreadlines
['>This is the description line\n', 'ATGCGCTTAGGATCGATAGCGATTTAGA\n', 'TTAGCGGA\n']
>>> fh = open("fastafile.fsa", "r")
>>> withreadline = fh.readline()
>>> withreadline
'>This is the description line\n'
>>>
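
The same comparison can be sketched as a script. It writes a stand-in for fastafile.fsa to a temporary directory (an assumption, with the contents taken from the session above) and opens the file anew before each method, because a file handle that has reached the end of the file has nothing more to return.

```python
import os
import tempfile

# Stand-in for fastafile.fsa, written to a temporary directory
path = os.path.join(tempfile.mkdtemp(), "fastafile.fsa")
fh = open(path, "w")
fh.write(">This is the description line\nATGCGCTTAGGATCGATAGCGATTTAGA\nTTAGCGGA\n")
fh.close()

fh = open(path, "r")
whole_file = fh.read()        # one string holding the entire file
leftover = fh.read()          # the handle is now at the end: empty string
fh.close()

fh = open(path, "r")
first_line = fh.readline()    # one line, newline included
fh.close()

fh = open(path, "r")
all_lines = fh.readlines()    # list with one string per line
fh.close()

print(repr(leftover))
print(repr(first_line))
print(len(all_lines))
```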

Parsing
● Getting information out of a file
● Commonly used string methods
● split([character]) – default is whitespace
● replace("in string", "put into instead")
● "string character".join(list)
– joins all elements in the list with string
character as a separator
– common construction: ''.join(list)
● slicing
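
A short sketch of these string methods on made-up data. The tab-separated line is a hypothetical example and is not taken from any course file.

```python
line = "ATG\tM\tMethionine\n"             # hypothetical tab-separated line

fields = line.split("\t")                 # split on tabs; the newline stays on the last field
no_newline = line.replace("\n", "")       # remove the newline
words = "a  b \n c".split()               # no argument: split on any run of whitespace
joined = "".join(["ATG", "CGC", "TTA"])   # join with an empty separator
dashed = "-".join(["ATG", "CGC", "TTA"])  # join with "-" as separator
first_codon = joined[0:3]                 # slicing: characters 0, 1 and 2

print(fields)
print(words)
print(joined)
print(dashed)
print(first_codon)
```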

Type conversions
● Everything that comes on the command
line or from a file is a string
● Conversions:
● int(X)
– a string with a decimal point cannot be converted directly
– floats are truncated toward zero
● float(X)
● str(X)
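
A few conversions as a sketch. The values are made up; the try/except block is only there to show that int() rejects a string containing a decimal point.

```python
window = int("3")               # string to integer
fraction = float("0.25")        # string to float
label = str(42)                 # number to string
positive = int(2.7)             # float to integer: 2
negative = int(-2.7)            # float to integer: -2

try:
    int("2.7")                  # a string with a decimal point raises ValueError
    converted = True
except ValueError:
    converted = False

via_float = int(float("2.7"))   # convert to float first, then to integer

print(window + 1)
print(label + "!")
print(converted)
```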

Parsing example
● Continue using fastafile.fsa
● Print only the description line to screen
● Print the whole DNA string
>>> firstline = fh.readline()
>>> print firstline[1:-1]
This is the description line
>>> sequence = ''
>>> for line in fh:
...     sequence += line.replace("\n", "")
...
>>> print sequence
ATGCGCTTAGGATCGATAGCGATTTAGA
>>>

Accepting input from command line
● Need to be able to specify file name on
command line
● Command line parameters stored in list
called sys.argv – program name is 0
● Usage:
● python pythonscript.py arg1 arg2 arg3....
● In script:
● at the top of the file, write import sys
● arg1 = sys.argv[1]
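
A sketch of how a script might pick its arguments out of the list. The function parse_arguments is hypothetical; it takes the list as a parameter so that it can be tried without a real command line, and in a script it would be called with sys.argv after import sys.

```python
def parse_arguments(argv):
    """Return (filename, window size) from an argv-style list, or None if too short."""
    # argv[0] is the program name, so the real arguments start at index 1
    if len(argv) < 3:
        return None
    filename = argv[1]
    windowsize = int(argv[2])   # command line values arrive as strings
    return filename, windowsize

# Corresponds to: python ATCurve2.py fastafile.fsa 3
print(parse_arguments(["ATCurve2.py", "fastafile.fsa", "3"]))
```

Checking the length of the list first avoids an IndexError when the user forgets an argument.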

Batch example
● Read fastafile.fsa with all three methods
● Per method, print method, name and
sequence
● Remember to close the file at the end!

Batch example
import sys
filename = sys.argv[1]

# using readline()
fh = open(filename, "r")
firstline = fh.readline()
name = firstline[1:-1]
sequence = ''
for line in fh:
    sequence += line.replace("\n", "")
print "Readline", name, sequence

# using readlines()
fh.seek(0)  # go back to the start of the file before reading again
inputlines = fh.readlines()
name = inputlines[0][1:-1]
sequence = ''
for line in inputlines[1:]:
    sequence += line.replace("\n", "")
print "Readlines", name, sequence

# using read()
fh.seek(0)  # go back to the start of the file once more
inputlines = fh.read()
name = inputlines.split("\n")[0][1:-1]
sequence = "".join(inputlines.split("\n")[1:])
print "Read", name, sequence

fh.close()

Classroom exercise
● Modify ATCurve.py script so that it accepts
the following input on the command line:
● fasta filename
● window size
● Let the user input an alternate filename if it
contains !ATGC
● Print results to screen

ATCurve2.py
import sys
# Define filename and window size
filename = sys.argv[1]
windowsize = int(sys.argv[2])

# The variable valid records whether the sequence is OK or not.
valid = False
while not valid:
    # Read the file: first line is the name, the rest is sequence
    fh = open(filename, "r")
    inputlines = fh.readlines()
    fh.close()
    name = inputlines[0][1:-1]
    sequence = ''
    for line in inputlines[1:]:
        sequence += line.replace("\n", "")
    upper_string = sequence.upper()

    # Figure out if anything other than A, T, G and C is present
    dnaset = set("ATGC")
    upper_string_set = set(upper_string)

    if len(upper_string_set - dnaset) > 0:
        print "Non-DNA present in your file, try again"
        filename = raw_input("Type in filename: ")
    else:
        valid = True

if valid:
    for i in range(0, len(upper_string) - windowsize + 1, 1):
        at_sum = 0.0
        at_sum += upper_string.count("A", i, i + windowsize)
        at_sum += upper_string.count("T", i, i + windowsize)
        print i + 1, at_sum / windowsize

Writing to files
● Same procedure as for reading
● Open file, mode is "w" or "a"
● fh.write(string)
– Note: one single string
– No newlines are added
● fh.close()
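
A small sketch of writing numbers to a file. The file name output.txt and the values are made up, and the file goes into a temporary directory. Because write() takes one string and adds no newline, each number is converted with str() and "\n" is appended by hand.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "output.txt")   # hypothetical output file

values = [0.33, 0.67, 1.0]
fh = open(path, "w")
position = 1
for value in values:
    # Build one single string per line, ending in a newline
    fh.write(str(position) + " " + str(value) + "\n")
    position += 1
fh.close()

# Read the file back to check what was written
fh = open(path, "r")
written = fh.read()
fh.close()
print(written)
```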

ATCurve3.py
● Modify previous script so that you have the
following on the command line
● fasta filename for input file
● window size
● output file
● Output should be in the format
● number, AT content
● number is the 1-based position of the first
nucleotide in the window

ATCurve3.py

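import sys
# Define filenames and window size
filename = sys.argv[1]
windowsize = int(sys.argv[2])
outputfile = sys.argv[3]

# Read and validate the sequence as in ATCurve2.py;
# this sets upper_string and valid

if valid:
    fh = open(outputfile, "w")
    for i in range(0, len(upper_string) - windowsize + 1, 1):
        at_sum = 0.0
        at_sum += upper_string.count("A", i, i + windowsize)
        at_sum += upper_string.count("T", i, i + windowsize)
        # write() needs one string and adds no newline itself
        fh.write(str(i + 1) + " " + str(at_sum / windowsize) + "\n")
    fh.close()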

Homework: TranslateProtein.py
● Input files are in
/projects/temporary/cees-python-course/Karin
● translationtable.txt - tab separated
● dna31.fsa
● Script should:
● Open the translationtable.txt file and read it into a
dictionary
● Open the dna31.fsa file and read the contents.
● Translate the DNA into protein using the dictionary
● Print the translation in fasta format to the file
TranslateProtein.fsa. Each protein line should be 60
characters long.
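
Two building blocks that may be useful here, as a sketch only. It assumes that each line of the table holds a codon and an amino acid separated by a tab; the actual layout of translationtable.txt is not shown on these slides. The function names and the two-line table are made up, and the translation step itself is left out.

```python
def table_from_lines(lines):
    """Build a codon -> amino acid dictionary from tab-separated lines."""
    table = {}
    for line in lines:
        fields = line.replace("\n", "").split("\t")
        if len(fields) >= 2:            # skip empty or incomplete lines
            table[fields[0]] = fields[1]
    return table

def wrap(sequence, width=60):
    """Split a string into pieces of at most width characters."""
    pieces = []
    for i in range(0, len(sequence), width):
        pieces.append(sequence[i:i + width])
    return pieces

table = table_from_lines(["ATG\tM\n", "TTT\tF\n"])   # made-up two-line table
print(table["ATG"])
for piece in wrap("M" * 130):
    print(len(piece))
```

The lines returned by readlines() on the table file could be passed straight to table_from_lines, and each piece from wrap can be written to the output file followed by "\n".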
