26/03/2018

Etherpad

Why Are We Here?

  • To learn basic concepts of programming (in Python)
  • How to solve problems in your research by…
    • Building scripts
    • Automating tasks
  • Mechanics of manipulating data
    • File I/O
    • Data structures

XKCD

How Are We Doing This?

Using the Python language

  • we need something ;)
  • free, well-documented, and cross-platform
  • large academic userbase
  • many libraries for specialist work

we won’t be covering the entire language

No, I mean “how are we doing this?”

Text editor

  • the more usual way to write code
  • edit-save-execute cycle

Jupyter notebook

  • interactive notebook-based interface
  • good for data exploration, prototyping, and teaching
  • not so good for writing scripts/‘production code’

Do I need to use Python afterwards?

  • No. ;)
    • The lesson is general, it’s just taught in Python
    • The principles are the same in nearly all languages
    • If your colleagues/field have settled on another language, maybe learn that
    • (language wars are unproductive… ;) )

What are we doing?

Analysing and visualising experimental data

  • Effectiveness of a new treatment for arthritis
  • Several patients, recording inflammation on each day
  • Tabular (comma-separated) data

We’re going to get the computer to do this for us

  • Why not just do it by hand?
  • AUTOMATION, REUSE, SHARING

01. Setup

Setting Up - 1

Before we begin…

  • make a neat working environment in the terminal
  • obtain data
cd ~/Desktop
mkdir python-novice-inflammation
cd python-novice-inflammation

LIVE DEMO

Setting Up - 2

Before we begin…

  • make a neat working environment
  • obtain data
cp 2018-03-29-standrews/lessons/python/files/python-novice-inflammation-data.zip ./
unzip python-novice-inflammation-data.zip
cp 2018-03-29-standrews/lessons/python/files/python-novice-inflammation-code.zip ./
unzip python-novice-inflammation-code.zip

(you can download files via Etherpad: http://pad.software-carpentry.org/2018-03-29-standrews)

LIVE DEMO

02. Getting Started

Python in the terminal

We start the Python console by executing the command python

$ python
Python 3.6.3 |Anaconda custom (64-bit)| (default, Oct  6 2017, 12:04:38) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> 

LIVE DEMO

Python REPL

  • Python’s console is a read-evaluate-print loop (REPL), just like the shell
>>> 3 + 5
8
>>> 12 / 7
1.7142857142857142
>>> 2 ** 16
65536
>>> 15 % 4
3
>>> (2 + 4) * (3 - 7)
-24

LIVE DEMO

My first variable

  • To do interesting things, we want persistent values
  • variables are like named boxes
  • data goes in the box
  • when we use the name of the box, we mean what’s in the box

Creating a variable

  • To assign a value use the equals sign: =
  • The variable name/box label goes on the left, and the data item goes on the right
  • Character strings, or just strings, are enclosed in quotes
>>> name = "Samia"
>>> name
'Samia'
>>> print(name)
Samia

LIVE DEMO

Working with variables

weight_kg = 55
print(weight_kg)
2.2 * weight_kg
print("weight in pounds", 2.2 * weight_kg)
weight_kg = 57.5
print("weight in kilograms is now:", weight_kg)
weight_lb = 2.2 * weight_kg
print('weight in kilograms:', weight_kg, 'and in pounds:', weight_lb)
weight_kg = 100
print('weight in kilograms:', weight_kg, 'and in pounds:', weight_lb)

LIVE DEMO

Exercise 01 (5min)

What are the values in mass and age after the following code is executed?

mass = 47.5
age = 122
mass = mass * 2.0
age = age - 20
  1. mass == 47.5, age == 122
  2. mass == 95.0, age == 102
  3. mass == 47.5, age == 102
  4. mass == 95.0, age == 122

Exercise 02 (5min)

What does the following code print out?

first, second = 'Grace', 'Hopper'
third, fourth = second, first
print(third, fourth)
  1. Hopper Grace
  2. Grace Hopper
  3. "Grace Hopper"
  4. "Hopper Grace"

03. Data Analysis

Examine the data

  • Inspect a data file using the shell
$ head data/inflammation-01.csv 
0,0,1,3,1,2,4,7,8,3,3,3,10,5,7,4,7,7,12,18,6,13,11,11,7,7,4,6,8,8,4,4,5,7,3,4,2,3,0,0
0,1,2,1,2,1,3,2,2,6,10,11,5,9,4,4,7,16,8,6,18,4,12,5,12,7,11,5,11,3,3,5,4,4,5,5,1,1,0,1
0,1,1,3,3,2,6,2,5,9,5,7,4,5,4,15,5,11,9,10,19,14,12,17,7,12,11,7,4,2,10,5,4,2,2,3,2,2,1,1
0,0,2,0,4,2,2,1,6,7,10,7,9,13,8,8,15,10,10,7,17,4,4,7,6,15,6,4,9,11,3,5,6,3,3,4,2,3,2,1
0,1,1,3,3,1,3,5,2,4,4,7,6,5,3,10,8,10,6,17,9,14,9,7,13,9,12,6,7,7,9,6,3,2,2,4,2,0,1,1
0,0,1,2,2,4,2,1,6,4,7,6,6,9,9,15,4,16,18,12,12,5,18,9,5,3,10,3,12,7,8,4,7,3,5,4,4,3,2,1
0,0,2,2,4,2,2,5,5,8,6,5,11,9,4,13,5,12,10,6,9,17,15,8,9,3,13,7,8,2,8,8,4,2,3,5,4,1,1,1
0,0,1,2,3,1,2,3,5,3,7,8,8,5,10,9,15,11,18,19,20,8,5,13,15,10,6,10,6,7,4,9,3,5,2,5,3,2,2,1
0,0,0,3,1,5,6,5,5,8,2,4,11,12,10,11,9,10,17,11,6,16,12,6,8,14,6,13,10,11,4,6,4,7,6,3,2,1,0,0
0,1,1,2,1,3,5,3,5,8,6,8,12,5,13,6,13,8,16,8,18,15,16,14,12,7,3,8,9,11,2,5,4,5,1,4,1,2,0,0
  • To load this data in Python, we’ll use the numpy library

We want to produce summary information about inflammation by patient and by day

Python libraries

  • Python contains many powerful, general tools
  • Specialised tools are contained in libraries or packages
  • We call on libraries/packages, when needed
  • Packages are loaded with import
  • Packages are shared via repositories, e.g. PyPI and conda
>>> import numpy

LIVE DEMO

Load data from file

  • numpy provides a function loadtxt() to load tabular data:
numpy.loadtxt(fname='data/inflammation-01.csv', delimiter=',')
  • dotted notation tells us loadtxt() belongs to numpy
  • fname: an argument expecting the path to a file
  • delimiter: an argument expecting the character that separates columns
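As a small illustration of these two arguments, the sketch below loads comma-separated text from an in-memory buffer rather than a file on disk. The io.StringIO buffer and the sample values are assumptions made up for this example; the workshop data itself lives in data/inflammation-01.csv.

```python
import io
import numpy

# Hypothetical two-row sample standing in for a CSV file on disk
sample = io.StringIO("0,1,2\n3,4,5\n")

# fname accepts a path or an open file-like object;
# delimiter is the character that separates the columns
small = numpy.loadtxt(fname=sample, delimiter=',')
print(small.shape)
print(small)
```

Each line of text becomes one row of the array, and each comma-separated value becomes one column.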

Loaded data

>>> numpy.loadtxt(fname='data/inflammation-01.csv', delimiter=',')
array([[ 0.,  0.,  1., ...,  3.,  0.,  0.],
       [ 0.,  1.,  2., ...,  1.,  0.,  1.],
       [ 0.,  1.,  1., ...,  2.,  1.,  1.],
       ..., 
       [ 0.,  1.,  1., ...,  1.,  1.,  1.],
       [ 0.,  0.,  0., ...,  0.,  2.,  0.],
       [ 0.,  0.,  1., ...,  1.,  1.,  0.]])
  • The matrix is truncated to fit the screen
  • ... indicate missing rows or columns
  • Trailing zeros after the decimal point are not shown (1 == 1. == 1.0)

Assign the matrix to a variable called data

LIVE DEMO

What is our data?

>>> type(data)
<class 'numpy.ndarray'>

LIVE DEMO

Members and attributes

  • Creating the array created information, too
  • Info stored in members or attributes that belong to data
  • data.<attribute> e.g. data.shape
>>> print(data.dtype)
float64
>>> print(data.shape)
(60, 40)

LIVE DEMO

Indexing arrays

  • We often work with subsets of data
    • individual rows (patients)
    • individual columns (days)
  • Counting of array elements starts at zero, not at one.
>>> print('first value in data:', data[0, 0])
first value in data: 0.0
>>> print('middle value in data:', data[30, 20])
middle value in data: 13.0

LIVE DEMO

Slicing arrays

  • To get a range of data from the array, index with [ and specify start and end indices
  • 0:4 means start at zero and go up to but not including 4
    • 0, 1, 2, 3
  • Define start and end separated by : (colon).
>>> print(data[0:4, 0:10])
[[ 0.  0.  1.  3.  1.  2.  4.  7.  8.  3.]
 [ 0.  1.  2.  1.  2.  1.  3.  2.  2.  6.]
 [ 0.  1.  1.  3.  3.  2.  6.  2.  5.  9.]
 [ 0.  0.  2.  0.  4.  2.  2.  1.  6.  7.]]

LIVE DEMO

More slices, please!

  • If you don’t specify a start, Python assumes the first element
  • If you don’t specify an end, Python assumes the last element
>>> small = data[:3, 36:]
>>> print('small is:\n', small)

QUESTION: What would : on its own indicate?

LIVE DEMO

Exercise 03 (5min)

We can take slices of any series, not just arrays.

>>> element = 'oxygen'
>>> print('first three characters:', element[0:3])
first three characters: oxy

What is the value of element[:4]?

  1. oxyg
  2. gen
  3. oxy
  4. en

Array operations

  • arrays know how to perform operations on their values
  • +, -, *, /, etc. are elementwise
>>> doubledata = data * 2.0
>>> print("original:\n", data[:3, 36:])
original:
 [[ 2.  3.  0.  0.]
 [ 1.  1.  0.  1.]
 [ 2.  2.  1.  1.]]
>>> print("doubledata:\n", doubledata[:3, 36:])
doubledata:
 [[ 4.  6.  0.  0.]
 [ 2.  2.  0.  2.]
 [ 4.  4.  2.  2.]]

LIVE DEMO

numpy functions

  • numpy provides functions to operate on arrays
>>> print(numpy.mean(data))
6.14875
>>> print(data.mean())
6.14875
>>> maxval = numpy.max(data)
>>> print('maximum inflammation:', maxval)
maximum inflammation: 20.0
>>> minval = data.min()
>>> print('minimum inflammation:', minval)
minimum inflammation: 0.0
  • By default, these give summaries of the whole array.

LIVE DEMO

Summary for one patient

  • We want to summarise inflammation by patient

Extract a single row, or operate directly on a row

>>> patient_0 = data[0, :] # temporary variable
>>> print('maximum inflammation for patient 0:', patient_0.max())
maximum inflammation for patient 0: 18.0
>>> print('maximum inflammation for patient 0:', numpy.max(data[0, :]))
maximum inflammation for patient 0: 18.0
>>> print('maximum inflammation for patient 2:', numpy.max(data[2, :]))
maximum inflammation for patient 2: 19.0

LIVE DEMO

Summary for all patients

  • What if we need maximum inflammation for each patient or average inflammation on each day?
  • One line per patient/per day?

Tedious, and prone to errors/typos. There is an easier way to do this…

numpy operations on axes

  • numpy functions take an axis= parameter: axis=0 works down the rows (one result per column), axis=1 works across the columns (one result per row)
>>> print(numpy.max(data, axis=1))    # max by patient
[ 18.  18.  19.  17.  17.  18.  17.  20.  17.  18.  18.  18.  17.  16.  17.
  18.  19.  19.  17.  19.  19.  16.  17.  15.  17.  17.  18.  17.  20.  17.
  16.  19.  15.  15.  19.  17.  16.  17.  19.  16.  18.  19.  16.  19.  18.
  16.  19.  15.  16.  18.  14.  20.  17.  15.  17.  16.  17.  19.  18.  18.]
>>> print(data.mean(axis=0))          # mean by day
[  0.           0.45         1.11666667   1.75         2.43333333   3.15
   3.8          3.88333333   5.23333333   5.51666667   5.95         5.9
   8.35         7.73333333   8.36666667   9.5          9.58333333
  10.63333333  11.56666667  12.35        13.25        11.96666667
  11.03333333  10.16666667  10.           8.66666667   9.15         7.25
   7.33333333   6.58333333   6.06666667   5.95         5.11666667   3.6
   3.3          3.56666667   2.48333333   1.5          1.13333333
   0.56666667]
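The effect of axis= is easier to see on a table small enough to check by hand. The sketch below uses a made-up 2 x 3 array (an assumption for illustration, not the workshop data), with rows standing in for patients and columns for days.

```python
import numpy

# Made-up table: 2 rows (patients) by 3 columns (days)
tiny = numpy.array([[1.0, 5.0, 3.0],
                    [4.0, 2.0, 6.0]])

per_patient = numpy.max(tiny, axis=1)   # collapses the columns: one value per row
per_day = numpy.mean(tiny, axis=0)      # collapses the rows: one value per column

print(per_patient)
print(per_day)
```

The result for axis=1 has as many entries as there are rows, and the result for axis=0 has as many entries as there are columns.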

LIVE DEMO

04. Visualisation

Visualisation

Graphics package: matplotlib

matplotlib is the de facto standard/base plotting library in Python

>>> import matplotlib.pyplot

LIVE DEMO

matplotlib.pyplot.imshow()

matplotlib.pyplot.imshow() renders matrix values as an image

>>> image = matplotlib.pyplot.imshow(data)
>>> matplotlib.pyplot.show()
  • small values are dark blue, large values are yellow
  • inflammation rises and falls over a 40-day period

matplotlib.pyplot.plot()

  • matplotlib.pyplot.plot() renders a line graph

We want to plot the average inflammation level on each day

>>> ave_inflammation = numpy.mean(data, axis=0)
>>> ave_plot = matplotlib.pyplot.plot(ave_inflammation)
>>> matplotlib.pyplot.show()

QUESTION: does this look reasonable?

Investigating data

  • The plot of .mean() looks artificial
  • Look at other statistics to gain insight
>>> max_plot = matplotlib.pyplot.plot(numpy.max(data, axis=0))
>>> matplotlib.pyplot.show()
>>> min_plot = matplotlib.pyplot.plot(numpy.min(data, axis=0))
>>> matplotlib.pyplot.show()

QUESTION: does this look reasonable?

Exercise 04 (5min)

Can you create a plot showing the standard deviation (numpy.std()) of the inflammation data for each day across all patients?

Figures and subplots

We can put all three plots into a single figure

  • create a figure (fig) with fig = matplotlib.pyplot.figure()
  • add subplots to fig with ax = fig.add_subplot()
  • set labels on a subplot with ax.set_ylabel()
  • plot data to a subplot with ax.plot()
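The four steps above can be sketched as follows. This is a minimal sketch under stated assumptions: the small data array is made up so that the code runs without the CSV file, the axis labels are illustrative, and the "Agg" backend is selected only so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window is opened
import matplotlib.pyplot
import numpy

# Made-up stand-in for the inflammation data: 4 patients, 5 days
data = numpy.array([[0.0, 1.0, 3.0, 2.0, 0.0],
                    [0.0, 2.0, 4.0, 1.0, 1.0],
                    [0.0, 1.0, 5.0, 3.0, 0.0],
                    [0.0, 2.0, 2.0, 2.0, 1.0]])

# Create the figure that will hold the subplots
fig = matplotlib.pyplot.figure(figsize=(10.0, 3.0))

# add_subplot(rows, columns, position): one row of three plots, side by side
ax1 = fig.add_subplot(1, 3, 1)
ax2 = fig.add_subplot(1, 3, 2)
ax3 = fig.add_subplot(1, 3, 3)

# Label each subplot and plot one summary statistic per day on it
ax1.set_ylabel('average')
ax1.plot(numpy.mean(data, axis=0))

ax2.set_ylabel('max')
ax2.plot(numpy.max(data, axis=0))

ax3.set_ylabel('min')
ax3.plot(numpy.min(data, axis=0))

fig.tight_layout()
```

With an interactive backend, calling matplotlib.pyplot.show() afterwards displays the figure.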

LIVE DEMO

Exercise 05 (5min)

Can you modify your script to display the three graphs on top of one another, instead of side by side?

Save your new script as exercise_05.py

05. for loops

Motivation

  • We wrote some code that plots values of interest from a single dataset

  • BUT we’re soon going to receive dozens of datasets to plot
  • So we need to make the computer iterate over the data

for loops

Spelling Bee

  • Suppose we wanted to spell a word, one letter at a time
word = "lead"
print(word[0])
print(word[1])
print(word[2])
print(word[3])

QUESTION: Why is this not a good approach?

LIVE DEMO

for loops

  • for loops perform actions for every item in a collection
>>> word = "lead"
>>> for char in word:
...     print(char)
... 
l
e
a
d

LIVE DEMO

for loop syntax

for element in collection:
    <do things with element>
  • The for loop statement ends in a colon, :
  • The code block is indented consistently: four spaces by convention (a tab also works, but never mix the two)
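A short sketch of the syntax in use. The list of readings is made up for illustration.

```python
# Made-up inflammation readings for one patient
readings = [0, 3, 7, 4, 1]

total = 0
for value in readings:       # the for statement ends in a colon
    total = total + value    # indented lines form the body of the loop

print('total:', total)       # not indented, so it runs once, after the loop
```

Indentation is the only thing that marks where the loop body ends, so a line's indentation decides whether it runs on every cycle or once at the end.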

Counting with a for loop

Values defined outside a loop can be modified in the loop

>>> length = 0
>>> for vowel in 'aeiou':
...     length = length + 1
... 
>>> print("There are", length, "vowels") 

QUESTION: What output does this program give you?

LIVE DEMO

for loop variables

  • The loop variable is updated on each cycle
  • It keeps its value when the loop is finished
>>> letter = "z"
>>> print(letter)
z
>>> for letter in "abc":
...     print(letter)
... 
>>> print("after the loop, letter is:", letter)

LIVE DEMO

range()

range() is a Python function that creates a sequence of numbers

  • It returns a range type that can be iterated over in a loop
>>> seq = range(3)
>>> print("Range is:", seq)
>>> for val in seq:
...     print(val)
>>> seq = range(2, 5)
>>> seq = range(3, 10, 3)
>>> seq = range(10, 0, -1)
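Printing a range object shows only its description, not its contents. One way to see what each call above produces is to expand it with list(), as in this sketch:

```python
# list() expands a range so its contents can be seen at once
up_to_three = list(range(3))          # stop only
two_to_five = list(range(2, 5))       # start, stop
in_threes = list(range(3, 10, 3))     # start, stop, step
countdown = list(range(10, 0, -1))    # a negative step counts down

print(up_to_three)
print(two_to_five)
print(in_threes)
print(countdown)
```

In every case the stop value itself is left out, just as with slices.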

LIVE DEMO

Exercise 06 (5min)

Can you write a loop that takes a string, e.g. Newton, and produces a new string with the characters in reverse order, e.g. notweN?

HINTS

  1. You can “add” strings, e.g. ab + cd
  2. An empty string can be created with mystr = ""

06. lists

Lists

  • lists are a built-in Python datatype
  • Denoted by square brackets, comma-separated
    • iterable lists of values
    • indexed and sliced like arrays
>>> odds = [1, 3, 5, 7]
>>> print("odds are:", odds)
odds are: [1, 3, 5, 7]
>>> print('first and last:', odds[0], odds[-1])
first and last: 1 7
>>> for number in odds:
...     print(number)

LIVE DEMO

Mutability

  • lists, like strings, are sequences
  • BUT list elements can be changed: lists are mutable
  • strings are not mutable
>>> names = ["Curie", "Darwing", "Turing"] # typo in Darwin's name
>>> print("names is originally:", names)
names is originally: ['Curie', 'Darwing', 'Turing']
>>> names[1] = 'Darwin'    # correct the name
>>> print('final value of names:', names)
final value of names: ['Curie', 'Darwin', 'Turing']
>>> name = "darwin"
>>> name[0] = "D"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment

LIVE DEMO

Changer danger

There are risks to modifying lists in-place

>>> my_list = [1, 2, 3, 4]
>>> your_list = my_list
>>> print("my list:", my_list)
my list: [1, 2, 3, 4]
>>> my_list[1] = 0
>>> print("your list:", your_list)

QUESTION: What is the value of your_list?

LIVE DEMO

list copies

  • To avoid this kind of effect:
    • make a copy of a list by slicing it or using the list() function
    • new_list = old_list[:]
>>> my_list = [1, 2, 3, 4]           # original list
>>> your_list = my_list[:]           # copy 1
>>> your_other_list = list(my_list)  # copy 2
>>> print("my_list:", my_list)
my_list: [1, 2, 3, 4]
>>> my_list[1] = 0                   # change element
>>> print("my_list:", my_list)
my_list: [1, 0, 3, 4]
>>> print("your_list:", your_list)
your_list: [1, 2, 3, 4]
>>> print("your_other_list:", your_other_list)
your_other_list: [1, 2, 3, 4]

LIVE DEMO

list functions

  • lists are Python objects and have useful functions (methods)
>>> print(odds)
[1, 3, 5, 7]
>>> odds.append(9)
>>> print("odds after adding a value:", odds)
odds after adding a value: [1, 3, 5, 7, 9]
>>> odds.reverse()
>>> print("odds after reversing the list:", odds)
odds after reversing the list: [9, 7, 5, 3, 1]
>>> odds.pop()
1
>>> print("odds after popping:", odds)
odds after popping: [9, 7, 5, 3]

LIVE DEMO

Overloading

Overloading refers to an operator (e.g. +) having more than one meaning, depending on the thing it operates on.

  • For numbers, + means add
  • For lists, + means concatenate
>>> vowels = ['a', 'e', 'i', 'o', 'u']
>>> vowels_welsh = ['a', 'e', 'i', 'o', 'u', 'w', 'y']
>>> print(vowels + vowels_welsh)
['a', 'e', 'i', 'o', 'u', 'a', 'e', 'i', 'o', 'u', 'w', 'y']
>>> counts = [2, 4, 6, 8, 10]
>>> repeats = counts * 2
>>> print(repeats)
[2, 4, 6, 8, 10, 2, 4, 6, 8, 10]

QUESTION: What does ‘multiplication’ (*) do for lists?

LIVE DEMO

07. Making choices

Conditionals

  • We often want to do <something> if some condition is true
  • To do this, we can use an if statement:
if <condition>:
    <executed if condition is True>
>>> num = 37
>>> if num > 100:
...     print('greater')
... 
>>> num = 149
>>> if num > 100:
...     print('greater')
... 
greater

LIVE DEMO

if-else statements

  • An if statement executes code if the condition evaluates as true
    • But what if the condition evaluates as false?
if <condition>:
    <executed if condition is True>
else:
    <executed if condition is not True>
>>> num = 37
>>> if num > 100:
...     print('greater')
... else:
...     print('not greater')
... 
not greater

LIVE DEMO

Conditional logic

if-elif-else

  • We can chain tests together using elif (else if)
if <condition1>:
    <executed if condition1 is True>
elif <condition2>:
    <executed if condition2 is True and condition1 is not True>
else:
    <executed if no conditions True>
>>> num = -3
>>> if num > 0:
...     print(num, "is positive")
... elif num == 0:
...     print(num, "is zero")
... else:
...     print(num, "is negative")
... 
-3 is negative

LIVE DEMO

Combining conditions

Conditions can be combined using Boolean Logic

  • Operators include and, or and not
>>> if (4 > 0) or (2 > 0):
...     print('at least one part is true')
... else:
...     print('both parts are false')
... 
at least one part is true
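The other two operators work the same way. In this sketch the variable names and values are hypothetical, chosen only to show and and not together.

```python
temperature = 38.2      # hypothetical reading in degrees Celsius
on_medication = False   # hypothetical flag

# 'and' is true only when both parts are true; 'not' inverts a condition
if (temperature > 37.5) and (not on_medication):
    message = 'fever, untreated'
else:
    message = 'no action needed'

print(message)
```

The brackets are optional here, but they make the grouping of the conditions explicit.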

LIVE DEMO

Exercise 07 (5min)

What is the result of executing the code below?

>>> if 4 > 5:
...     print('A')
... elif 4 == 5:
...     print('B')
... elif 4 < 5:
...     print('C')
  1. A
  2. B
  3. C
  4. B and C

More operators

Two useful condition operators are == (equality) and in (membership)

>>> print(1 == 1)
True
>>> print(1 ==2)
False
>>> print('a' in 'toast')
True
>>> print('b' in 'toast')
False
>>> print(1 in [1, 2, 3])
True
>>> print(1 in range(3))
True

LIVE DEMO

08. Analysing multiple files

Analysing multiple files

  • We have several files of inflammation study data
    • We want to visualise/analyse each of them
    • We know how to load, visualise, loop over, and make decisions on the data

We will write a new script to do this:

  • analyse_files.py
$ nano analyse_files.py

BUT we need to know how to interact with the filesystem to get filenames

The os module

The os module allows interaction with the filesystem

import matplotlib.pyplot
import numpy as np
import os

LIVE DEMO

os.listdir()

The os.listdir() function lists the contents of a directory

  • The list can be filtered with a for loop or list comprehension
  • Our data is in the data directory
# Get a list of inflammation data files
files = []
for fname in os.listdir('data'):
    if 'inflammation' in fname:
        files.append(fname)
print("Inflammation data files:", files)
$ python analyse_files.py

BUT something’s not quite right…

LIVE DEMO

os.path.join()

  • The os.listdir() function only returns filenames, not the path (relative or absolute)

os.path.join() builds a path from directory and filenames, suitable for the underlying OS

files = []
for fname in os.listdir('data'):
    if 'inflammation' in fname:
        files.append(os.path.join('data', fname))
print("Inflammation data files:", files)
$ python analyse_files.py
Inflammation data files: ['data/inflammation-05.csv', …]
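The same filter can also be written as a list comprehension. The sketch below creates a temporary directory with made-up filenames, an assumption that lets it run without the workshop data. It also sorts the result, because os.listdir() returns names in arbitrary order.

```python
import os
import tempfile

# Build a throwaway directory with made-up filenames so the sketch runs anywhere
with tempfile.TemporaryDirectory() as datadir:
    for name in ['inflammation-02.csv', 'small-01.csv', 'inflammation-01.csv']:
        open(os.path.join(datadir, name), 'w').close()

    # Same filter as the for loop, written as a list comprehension
    files = [os.path.join(datadir, fname)
             for fname in os.listdir(datadir)
             if 'inflammation' in fname]
    files.sort()  # put the paths in a predictable order

# Show only the filenames, since the temporary directory name varies
basenames = [os.path.basename(path) for path in files]
print("Inflammation data files:", basenames)
```

In analyse_files.py the directory would simply be 'data', as in the loop above.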

LIVE DEMO

Visualising the data

Now we have all the tools we need to

  • load all the inflammation data files
  • visualise the mean, minimum and maximum values in an array of plots.
  • list of paths to the data files with os
  • load data from a file with np.loadtxt()
  • calculate summary statistics with np.mean(), np.max(), etc.
  • create figures with matplotlib
  • create arrays of figures with .add_subplot()

Visualisation code

We’re going to build the rest of this script together

$ nano analyse_files.py
$ python analyse_files.py 
Analysing data/inflammation-05.csv
Writing image to data/inflammation-05.png
Analysing data/inflammation-11.csv
Writing image to data/inflammation-11.png
[…]

LIVE DEMO

Checking Data

There are two suspicious features to some of the datasets

  1. The maximum values rose and fell as straight lines
  2. The minimum values are consistently zero

We’ll use if statements to test for these conditions and give a warning

Suspicious maxima

Is day zero value 0, and day 20 value 20?

$ nano analyse_files.py
    # Test for suspicious maxima
    if np.max(data, axis=0)[0] == 0 and np.max(data, axis=0)[20] == 20:
        print("Suspicious-looking maxima!")
$ python analyse_files.py

LIVE DEMO

Suspicious minima

Are all the minima zero? (do they sum to zero?)

$ nano analyse_files.py
    # Test for suspicious maxima
    if np.max(data, axis=0)[0] == 0 and np.max(data, axis=0)[20] == 20:
        print("Suspicious-looking maxima!")
    elif np.sum(data.min(axis=0)) == 0:
        print('Minima sum to zero!')
$ python analyse_files.py

LIVE DEMO

Being tidy

If everything’s OK, let’s be reassuring

$ nano analyse_files.py
    # Test for suspicious maxima
    if np.max(data, axis=0)[0] == 0 and np.max(data, axis=0)[20] == 20:
        print("Suspicious-looking maxima!")
    elif np.sum(data.min(axis=0)) == 0:
        print('Minima sum to zero!')
    else:
        print('Seems OK!')
$ python analyse_files.py

LIVE DEMO

09. Conclusions (Part 1)

Learning Outcomes

  • variables
  • data types: numpy.arrays, lists, strings, numbers
  • file IO: loading data, listing files, manipulating filenames
  • calculating statistics
  • plotting data: plots and subplots
  • program flow: loops and conditionals
  • automating multiple analyses
  • Python scripts: edit-save-execute

Well done!

Building Programs With Python (Part 2)

Etherpad

Why are we here?

  • To learn basic concepts of programming (in Python)
  • How to solve problems in your research by…
    • Building scripts
    • Automating tasks
  • Good coding practice
    • Functions
    • Defensive programming

XKCD

What are we doing?

Analysing experimental data

  • Effectiveness of a new treatment for arthritis
  • Several patients, recording inflammation on each day
  • Tabular (comma-separated) data

We’re going to improve our code

  • automation, reuse, sharing
  • functions, documentation
  • defensive programming

Setting up

Before we begin…

return to our neat working environment

$ cd ~/Desktop
$ cd python-novice-inflammation

10. Jupyter notebooks

Starting Jupyter

At the command-line, start Jupyter notebook:

jupyter notebook

LIVE DEMO

Jupyter landing page

LIVE DEMO

Create a new notebook

LIVE DEMO

My first notebook

  • Give your notebook a name (functions)

LIVE DEMO

Cell types

Jupyter documents are composed of cells

  • A Jupyter cell can have one of several types

Change the first cell to Markdown

LIVE DEMO

Markdown text

Markdown allows us to enter formatted text.

Execute a cell with Shift + Enter

LIVE DEMO

Entering code

Python code can be entered directly into a code cell

Execute a cell with Shift + Enter

LIVE DEMO

11. Functions

Motivation

  • We have code to plot values of interest from multiple datasets

BUT the code is long and complicated

  • It’s not flexible enough to deal with thousands of files
  • We can’t modify it easily

SO we will package our code for reuse: FUNCTIONS

What is a function?

Functions in code work like mathematical functions

\[y = f(x)\]

  • \(f()\) is the function
  • \(x\) is an input (or inputs)
  • \(y\) is the returned value, or output(s)

  • The output \(y\) depends in some way on the value of \(x\) - defined by \(f()\).

Not all functions in code take an input, or produce a usable output, but the principle is generally the same.
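For example, in the sketch below one function takes no input and returns nothing useful, while the other maps an input to an output. The names greet and double are made up for illustration.

```python
def greet():
    """Takes no input; prints a message and returns nothing useful."""
    print('Hello!')

def double(x):
    """Takes one input and returns an output that depends on it."""
    return 2 * x

result = greet()     # a function with no return statement returns None
y = double(21)
print(result, y)
```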

My first function

fahr_to_kelvin() to convert Fahrenheit to Kelvin

\[f(x) = ((x - 32) \times \frac{5}{9}) + 273.15\]
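The function itself is written during the live demo. A minimal sketch that follows the formula above might look like this; the parameter name temp and the docstring are assumptions.

```python
def fahr_to_kelvin(temp):
    """Convert a temperature from degrees Fahrenheit to Kelvin."""
    return ((temp - 32) * (5 / 9)) + 273.15

freezing = fahr_to_kelvin(32)
boiling = fahr_to_kelvin(212)
print('freezing point of water:', freezing)
print('boiling point of water:', boiling)
```

The def line names the function and its input, and return hands the computed value back to the caller.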

LIVE DEMO

Calling the function

  • Calling fahr_to_kelvin() in the notebook is the same as calling any other function
print('freezing point of water:', fahr_to_kelvin(32))
print('boiling point of water:', fahr_to_kelvin(212))

LIVE DEMO

Create a new function

Create a new function in your notebook, and call it.

def kelvin_to_celsius(temp):
    return temp - 273.15
print('freezing point of water', kelvin_to_celsius(273.15))

LIVE DEMO

Composing functions

Composing Python functions works the same way as for mathematical functions: \(y = f(g(x))\)

  • We could convert F (temp_f) to C (temp_c) by executing the code:
temp_c = kelvin_to_celsius(fahr_to_kelvin(temp_f))

LIVE DEMO

New functions from old

We can wrap this composed function inside a new function:

fahr_to_celsius:

def fahr_to_celsius(temp_f):
    return kelvin_to_celsius(fahr_to_kelvin(temp_f))
print('freezing point of water in Celsius:', fahr_to_celsius(32.0))

This is how programs are built:

combining small bits into larger bits until the behaviour we want is obtained

LIVE DEMO

Exercise 08 (10min)

Can you write a function called outer() that:

  • takes a single string argument
  • returns a string comprising only the first and last characters of the input, e.g.
print(outer("helium"))
hm

Function scope

Variables defined within a function, including parameters, are not ‘visible’ outside the function

  • This is called function scope
a = "Hello"

def my_fn(a):
    a = "Goodbye"

my_fn(a)  
print(a)

LIVE DEMO

Exercise 09 (5min)

What would be printed if you ran the code below?

a, b = 3, 7

def swap(a, b):
    temp = a
    a = b
    b = temp

swap(a, b)
print(b, a)
  1. 7 3
  2. 3 7
  3. 3 3
  4. 7 7

12. Refactoring

Tidying Up

Now we can write functions!

Let’s make the inflammation analysis easier to reuse: one function per operation

  • Open the analyse_files.py script from the first lesson

What operations should be put into functions?

The code is divisible into two sections

  1. check the data for problems
  2. plot the data

detect_problems()

  • We noticed that some data was questionable
  • This function spots problems with the data
    • Call the function after loading, before plotting
def detect_problems(data):
    if np.max(data, axis=0)[0] == 0 and np.max(data, axis=0)[20] == 20:
        print('Suspicious looking maxima!')
    elif np.sum(data.min(axis=0)) == 0:
        print('Minima add up to zero!')
    else:
        print('Seems OK!')

LIVE DEMO

plot_data()

We’ll write a function called plot_data() that plots the data to file

def plot_data(data, fname):
    # create figure and three axes
    fig = plt.figure(figsize=(10.0, 3.0))
    [...]

LIVE DEMO

Code reuse

Our code is now much more readable

  • Loop over the files, load data, detect_problems() and plot_data()
# Analyse each file in turn
for fname in files:
    print("Analysing", fname)

    # load data
    data = np.loadtxt(fname=fname, delimiter=',')

    # identify problems in the data
    detect_problems(data)

    # plot image in file
    imgname = fname[:-4] + '.png'
    plot_data(data, imgname)        
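The slice fname[:-4] assumes that every filename ends in a four-character extension such as .csv. A possible alternative, sketched here and not part of the original script, is os.path.splitext(), which splits off the extension whatever its length.

```python
import os

fname = 'data/inflammation-05.csv'

# splitext() separates the extension from the rest of the path
root, ext = os.path.splitext(fname)
imgname = root + '.png'
print(imgname)
```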

Good code pays off

Why should I bother?

  • After 6 months, the referee report arrives and you need to rerun experiments
  • Another student is continuing the project
  • Some random person reads your article and asks for the code
  • Helps spot errors quickly
  • Clarifies structure in your mind as well as in the code
  • Saves you time in the long run! (“Future You” will back this up)

13. Command-line programs

Learning objectives

How can I write Python programs that will work like Unix command-line tools?

  • Use the values of command-line arguments in a program.
  • Handle flags and files separately in a command-line program.
  • Read data from standard input in a program so that it can be used in a pipeline (with pipes: |)

The sys module

sys is a Python module for interacting with the Python interpreter and the environment it runs in

Open a new file called sys_version.py in your editor

$ nano sys_version.py
import sys
print('version is', sys.version)
$ python sys_version.py 
version is 3.6.3 |Anaconda custom (64-bit)| (default, Oct  6 2017, 12:04:38) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]

LIVE DEMO

sys.argv

sys.argv is a variable that contains the command-line arguments used to call our script

Open a new file called sys_argv.py in your editor

$ nano sys_argv.py
import sys
print('sys.argv is', sys.argv)
$ python sys_argv.py 
sys.argv is ['sys_argv.py']
$ python sys_argv.py item1 item2 somefile.txt
sys.argv is ['sys_argv.py', 'item1', 'item2', 'somefile.txt']

LIVE DEMO

Building a new script

We’re going to build a script that reports readings from data files

$ python readings.py mydata.csv
  • We will make it take options --min, --max, --mean
    • The script will report one of these
$ python readings.py --min mydata.csv
  • We will make it handle multiple files
$ python readings.py --min mydata.csv myotherdata.csv
  • We will make it take STDIN so we can use it with pipes
$ cat mydata.csv | python readings.py --min

Starting the framework

We start with a script that doesn’t do all that

  • We’ll build features in one-by-one
$ nano readings.py
import sys
import numpy

def main():
    script = sys.argv[0]
    filename = sys.argv[1]
    data = numpy.loadtxt(filename, delimiter=',')
    for m in numpy.mean(data, axis=1):
        print(m)

LIVE DEMO

Calling a script

There’s a way to tell if a Python file is being run as a script

  • If we use this, we can use the same file as:
    • a module (import readings)
    • a script ($ python readings.py)
  • Python sets __name__ to '__main__' only when the file is run as a script

We run main() only if the file is run as a script

if __name__ == '__main__':
    main()

Add this to readings.py and run the script

LIVE DEMO

Handling multiple files

We want to be able to analyse multiple files with one command

NOTE: wildcards are expanded by the shell before Python starts

$ ls data/small-*
data/small-01.csv  data/small-02.csv  data/small-03.csv
$ python sys_argv.py data/small-*
sys.argv is ['sys_argv.py', 'data/small-01.csv', 'data/small-02.csv', 'data/small-03.csv']
  • All arguments from index 1 onwards are filenames
def main():
    script = sys.argv[0]
    for filename in sys.argv[1:]:
        print(filename)
        data = numpy.loadtxt(filename, delimiter=',')
        for m in numpy.mean(data, axis=1):
            print(m)

Handling flags

We want to use --min, --max, --mean to tell the script what to calculate

$ python readings.py --max myfile.csv

The flag will be sys.argv[1], so filenames are sys.argv[2:]

  • We should check that flags are valid
def main():
    script = sys.argv[0]
    action = sys.argv[1]
    filenames = sys.argv[2:]
    if action not in ['--min', '--mean', '--max']:
        print('Action is not one of --min, --mean, or --max: ' + action)
        sys.exit(1)
    for f in filenames:
        process(f, action)

Add process()

We split the script into two functions for readability

  • The process() function prints the summarised data
def process(filename, action):
    data = numpy.loadtxt(filename, delimiter=',')

    if action == '--min':
        values = numpy.min(data, axis=1)
    elif action == '--mean':
        values = numpy.mean(data, axis=1)
    elif action == '--max':
        values = numpy.max(data, axis=1)

    for m in values:
        print(m)

LIVE DEMO

Using STDIN

The final change will let us use STDIN if no file is specified

  • sys.stdin is a file-like object connected to the script's standard input; numpy.loadtxt can read from it just as it reads from a filename
    if len(filenames) == 0:
        process(sys.stdin, action)
    else:
        for f in filenames:
            process(f, action)
$ python readings.py --max < data/small-01.csv

LIVE DEMO
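
The idea that a function can read from any file-like object, not only a named file, can be tried without the terminal. This is a sketch that uses only the standard library: row_maxima() is a hypothetical stand-in for process(), and io.StringIO stands in for sys.stdin.

```python
import io


def row_maxima(stream):
    """Return the maximum of each comma-separated row read from stream."""
    maxima = []
    for line in stream:
        line = line.strip()
        if line:  # skip blank lines
            maxima.append(max(float(field) for field in line.split(',')))
    return maxima


# io.StringIO stands in for sys.stdin here; both can be read line by line.
fake_stdin = io.StringIO("0,1,2\n3,0,1\n")
print(row_maxima(fake_stdin))
```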

14. Testing and documentation

Motivation

  • Once written, functions are reused
  • Functions might be reused without further checks
  • When functions are written:
    • test for correctness
    • document their function
  • Example: centring a numerical array

Create a new notebook

  • Call it testing

centre()

  • Add the function
import numpy as np

def centre(data, desired):
    return (data - np.mean(data)) + desired

LIVE DEMO

Test datasets

  • We could try centre() on real data
    • but we don’t know the answer!

Use numpy to create an artificial dataset

z = np.zeros((2, 2))
print(centre(z, 3.0))

LIVE DEMO

Real data

Try the function on real data…

data = np.loadtxt(fname='data/inflammation-01.csv', delimiter=',')
print(centre(data, 0))
  • But how do we know it worked?

LIVE DEMO

Check properties

  • We can check properties of the original and centred data
    • mean, min, max, std
centred = centre(data, 0)
print('original min, mean, and max are:', np.min(data), np.mean(data), np.max(data))
print('min, mean, and max of centred data are:', np.min(centred),
      np.mean(centred), np.max(centred))
print('std dev before and after:', np.std(data), np.std(centred))

LIVE DEMO
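
Checks like these can also be written so that the computer does the comparison. The sketch below uses a small made-up array in place of the inflammation data, and np.allclose to allow for floating-point rounding:

```python
import numpy as np


def centre(data, desired):
    return (data - np.mean(data)) + desired


# Small made-up array standing in for the inflammation data.
data = np.array([[0.0, 1.0, 3.0], [2.0, 4.0, 8.0]])
centred = centre(data, 0)

# Floating-point rounding means the mean is rarely exactly 0 on real data,
# so compare within a tolerance rather than with ==.
assert np.allclose(np.mean(centred), 0.0)
# Shifting every value by the same amount should not change the spread.
assert np.allclose(np.std(data), np.std(centred))
```

If either property did not hold, the cell would stop with an AssertionError instead of leaving the numbers to be compared by eye.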

Documenting functions

  • Writing comments in the code (using the hash #) is a good thing
  • Python provides for docstrings
    • These go after the function definition
    • Hook into Python’s help system
def centre(data, desired):
    """Returns the array in data, recentred around the desired value."""
    return (data - np.mean(data)) + desired

help(centre)

LIVE DEMO

Default arguments

  • The centre() function requires two arguments
  • We can specify a default argument in how we define the function
def centre(data, desired=0.0):
    """Returns the array in data, recentered around the desired value.
    
    Example: centre([1, 2, 3], 0) => [-1, 0, 1]
    """
    return (data - np.mean(data)) + desired
centre(data, 0.0)
centre(data, desired=0.0)
centre(data)

LIVE DEMO
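
A short sketch confirming that the three call styles give the same result, using made-up values rather than the inflammation data:

```python
import numpy as np


def centre(data, desired=0.0):
    """Returns the array in data, recentred around the desired value."""
    return (data - np.mean(data)) + desired


data = np.array([1.0, 2.0, 3.0])  # made-up values for illustration

# All three calls use desired=0.0, so they give the same result.
explicit = centre(data, 0.0)
keyword = centre(data, desired=0.0)
default = centre(data)
print(np.array_equal(explicit, keyword) and np.array_equal(keyword, default))

# The docstring is stored on the function and is what help() displays.
print(centre.__doc__)
```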

Exercise 10 (10min)

Can you write a function called rescale() that:

  • takes an array as input
  • returns an array with values scaled in the range [0.0, 1.0]
  • has an informative docstring

  • HINT: If L and H are the lowest and highest values in the original array, then the replacement for a value v should be (v-L) / (H-L).

15. Errors and Exceptions

Create a new notebook

Errors

Programming n. - the process of making errors and correcting them until the code works

  • All programmers make errors
  • Identifying, fixing, and coping with errors is a valuable skill

Traceback

Python tries to tell you what has gone wrong by providing a traceback

def favourite_ice_cream():
    ice_creams = [
        "chocolate",
        "vanilla",
        "strawberry"
    ]
    print(ice_creams[3])
favourite_ice_cream()

LIVE DEMO

Anatomy of a traceback

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-1-b0e1f9b712d6> in <module>()
      8     print(ice_creams[3])
      9 
---> 10 favourite_ice_cream()

<ipython-input-1-b0e1f9b712d6> in favourite_ice_cream()
      6         "strawberry"
      7     ]
----> 8     print(ice_creams[3])
      9 
     10 favourite_ice_cream()

IndexError: list index out of range
  • (mostly, you can just look at the last couple of levels)

LIVE DEMO
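
The error named on the last line of the traceback can also be handled in code. A minimal sketch, assuming a hypothetical variation of favourite_ice_cream() that takes the index as a parameter:

```python
def favourite_ice_cream(index):
    ice_creams = ["chocolate", "vanilla", "strawberry"]
    return ice_creams[index]


try:
    print(favourite_ice_cream(3))
except IndexError as error:
    # Valid indices for a three-item list are 0, 1, and 2.
    print("Caught an IndexError:", error)
```

Catching the exception is only sensible when the program can do something useful about it; otherwise the traceback is usually the more helpful outcome.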

Syntax errors

  • Runtime errors occur when the code is valid Python but does something illegal while it runs (like the IndexError above)
  • Syntax errors occur when the code is not understandable as Python
def some_function()
    msg = "hello, world!"
    print(msg)
     return msg

LIVE DEMO

Syntax traceback

  File "<ipython-input-3-dbf32ad5d3e8>", line 1
    def some_function()
                       ^
SyntaxError: invalid syntax

LIVE DEMO

Fixed?

def some_function():
    msg = "hello, world!"
    print(msg)
     return msg

LIVE DEMO

Not quite

  File "<ipython-input-4-e169556d667b>", line 4
    return msg
    ^
IndentationError: unexpected indent

LIVE DEMO
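
Both problems are found when Python reads the code, before any of it runs. The sketch below illustrates this with the built-in compile() function, which parses source text without executing it. The helper check_syntax() and the shortened function bodies are assumptions for illustration:

```python
missing_colon = (
    "def some_function()\n"
    "    msg = 'hello, world!'\n"
    "    return msg\n"
)
extra_indent = (
    "def some_function():\n"
    "    msg = 'hello, world!'\n"
    "     return msg\n"
)


def check_syntax(source):
    """Return the name of the syntax error in source, or 'ok' if it compiles."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError as error:  # IndentationError is a subclass of SyntaxError
        return type(error).__name__
    return "ok"


print(check_syntax(missing_colon))
print(check_syntax(extra_indent))
```

Because IndentationError is a kind of SyntaxError, a single except clause covers both cases.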

Name errors

  • NameErrors occur when a variable is not defined in scope
    • (often due to a typo!)
print(a)

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-5-c5a4f3535135> in <module>()
----> 1 print(a)

NameError: name 'a' is not defined

LIVE DEMO

Index Errors

  • Trying to access an element of a collection that does not exist gives an IndexError
letters = ['a', 'b']
print("Letter #1 is", letters[0])
print("Letter #2 is", letters[1])
print("Letter #3 is", letters[2])

Letter #1 is a
Letter #2 is b
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-9-62bced7460d2> in <module>()
      2 print("Letter #1 is", letters[0])
      3 print("Letter #2 is", letters[1])
----> 4 print("Letter #3 is", letters[2])

IndexError: list index out of range

LIVE DEMO

Exercise 11 (10min)

  • Can you read the code below, and (without running it) identify what the errors are?
  • Can you fix all the errors so the code prints abbabbabba?
for number in range(10):
    # use a if the number is a multiple of 3, otherwise use b
    if (Number % 3) = 0:
        message = message + a
    else:
        message = message + "b"
print(message)

16. Defensive programming

(Un)readable code

What does this function do?

def s(p):
    a = 0
    for v in p:
        a += v
    m = a / len(p)
    d = 0
    for v in p:
        d += (v - m) * (v - m)
    return numpy.sqrt(d / (len(p) - 1))

Readable code

What does this function do?

def std_dev(sample):
    sample_sum = 0
    for value in sample:
        sample_sum += value

    sample_mean = sample_sum / len(sample)

    sum_squared_devs = 0
    for value in sample:
        sum_squared_devs += (value - sample_mean) * (value - sample_mean)

    return numpy.sqrt(sum_squared_devs / (len(sample) - 1))

First line of defence: sensible naming, style and documentation
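
With readable names it becomes clear that the function computes a sample standard deviation, which makes it possible to check it against an independent implementation. The sketch below compares it with statistics.stdev from the standard library; math.sqrt replaces numpy.sqrt only to keep the example free of third-party imports, and the sample values are made up.

```python
import math
import statistics


def std_dev(sample):
    sample_sum = 0
    for value in sample:
        sample_sum += value

    sample_mean = sample_sum / len(sample)

    sum_squared_devs = 0
    for value in sample:
        sum_squared_devs += (value - sample_mean) * (value - sample_mean)

    # math.sqrt is used here in place of numpy.sqrt
    return math.sqrt(sum_squared_devs / (len(sample) - 1))


sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up values
assert math.isclose(std_dev(sample), statistics.stdev(sample))
```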

Create a new notebook

Defensive programming

We’ve focused on the basics of building code: variables, loops, functions, etc.

  • We’ve not focused on whether the code is ‘correct’
  • Defensive programming is expecting your code to have mistakes, and guarding against them

Write code that checks its own operation

  • This is good practice
    • speeds up software development
    • helps ensure that your code is doing what you intend

Assertions

  • Assertions are a Pythonic way to see if code runs correctly
    • 10-20% of the Firefox source code is checks on the rest of the code!
  • We assert that a condition is True
    • If it’s True, the code may be correct and execution continues
    • If it’s False, the code is not correct: Python raises an AssertionError and stops
assert <condition>, "Some text describing the problem"

Example assertion

numbers = [1.5, 2.3, 0.7, -0.001, 4.4]
total = 0.0
for n in numbers:
    assert n > 0.0, 'Data should only contain positive values'
    total += n
print('total is:', total)

QUESTION: What does this assertion do?

LIVE DEMO
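
To see the effect without the program stopping, the loop can be wrapped in try/except. This wrapper is only for illustration; in normal use the AssertionError is left to halt the program.

```python
numbers = [1.5, 2.3, 0.7, -0.001, 4.4]
total = 0.0
try:
    for n in numbers:
        assert n > 0.0, 'Data should only contain positive values'
        total += n
except AssertionError as error:
    # The loop stops at -0.001, so 4.4 is never added.
    print('AssertionError:', error)
print('total so far:', total)
```

Note that assert statements are skipped entirely when Python is started with the -O option, so they should not be the only safeguard for things that must always be checked.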

When do we use assertions?

  • precondition - must be true at the start of an operation
  • postcondition - guaranteed to be true when the operation completes
  • invariant - always true at a particular point in the code
def normalise_rectangle(rect):
    """Normalises a rectangle to the origin, longest axis 1.0 units."""
    x0, y0, x1, y1 = rect
    
    dx = x1 - x0
    dy = y1 - y0
    
    if dx > dy:
        scaled = float(dx) / dy
        upper_x, upper_y = 1.0, scaled
    else:
        scaled = float(dx) / dy
        upper_x, upper_y = scaled, 1.0
        
    return (0, 0, upper_x, upper_y)

Preconditions

Preconditions must be true at the start of an operation or function

  • Here, we want to ensure that rect has four values
def normalise_rectangle(rect):
    """Normalises a rectangle to the origin, longest axis 1.0 units."""
    assert len(rect) == 4, "Rectangle must have four co-ordinates"
    x0, y0, x1, y1 = rect
    [...]

LIVE DEMO

Postconditions

Postconditions must be true at the end of an operation or function.

  • Here, we want to assert that the upper x and y values are greater than 0 and no larger than 1
def normalise_rectangle(rect):
    """Normalises a rectangle to the origin, longest axis 1.0 units."""
    [...]

    assert 0 < upper_x <= 1.0, "Calculated upper x-coordinate invalid"
    assert 0 < upper_y <= 1.0, "Calculated upper y-coordinate invalid"    
        
    return (0, 0, upper_x, upper_y)

LIVE DEMO
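
Putting the fragments together gives a version that can be run end to end. This is a sketch assembled from the slides above, with the body of the function left as shown there; the two test rectangles are made-up inputs.

```python
def normalise_rectangle(rect):
    """Normalises a rectangle to the origin, longest axis 1.0 units."""
    assert len(rect) == 4, "Rectangle must have four co-ordinates"
    x0, y0, x1, y1 = rect

    dx = x1 - x0
    dy = y1 - y0

    if dx > dy:
        scaled = float(dx) / dy
        upper_x, upper_y = 1.0, scaled
    else:
        scaled = float(dx) / dy
        upper_x, upper_y = scaled, 1.0

    assert 0 < upper_x <= 1.0, "Calculated upper x-coordinate invalid"
    assert 0 < upper_y <= 1.0, "Calculated upper y-coordinate invalid"

    return (0, 0, upper_x, upper_y)


# A tall rectangle passes both postconditions.
print(normalise_rectangle((0.0, 0.0, 1.0, 5.0)))

# A wide rectangle trips the y postcondition, because the first branch
# computes dx / dy (5.0 here) where dy / dx is needed.
try:
    normalise_rectangle((0.0, 0.0, 5.0, 1.0))
except AssertionError as error:
    print("AssertionError:", error)
```

The postcondition does not say how to fix the calculation, but it does stop a wrong result from being returned silently.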

Notes on assertions

Assertions help understand programs

  • assertions declare what the program should be doing
  • assertions help the person reading the program check their understanding of the code against what the code actually requires

Fail early, fail often

  • Turn bugs into assertions or tests: if you’ve made the mistake once, you might make it again