2-3/5/2019

Etherpad

Please do use the course Etherpad:

  • Communal notes: share your understanding, and benefit from others
  • Ask questions: get detailed answers with links and examples
  • A record/reference for after the course
  • All slides and tutor notes are linked:
    • on the course home page
    • on the course etherpad

Why Are We Here?

  • To learn basic concepts of programming (in Python)
  • How to solve problems in your research by…
    • Making a computer do it
    • Building scripts
    • Automating tasks
  • Mechanics of manipulating data
    • File I/O
    • Data structures

XKCD: The best use of your time

How Are We Doing This?

Using the Python language

  • we need something ;)
  • free, well-documented, and cross-platform
  • large academic userbase
  • many libraries for specialist work

we won’t be covering the entire language

XKCD: Python

No, I mean “how are we doing this?”

Text editor

  • the more usual way to write code
  • edit-save-execute cycle

Jupyter notebook

  • interactive notebook-based interface
  • good for data exploration, prototyping, and teaching
  • not so good for writing scripts/‘production code’

Do I need to use Python afterwards?

  • No. ;)
    • The lesson is general, it’s just taught in Python
    • The principles are the same in nearly all languages
    • If your colleagues/field have settled on another language (or languages), maybe learn that (instead or as well)
    • (language wars are unproductive… ;) )

What are we doing?

Analysing and visualising experimental data

  • Analysing the effectiveness of a new treatment for arthritis
  • Several patients, recording level of inflammation on each day
  • Tabular (comma-separated) data

We’re going to get the computer to do this for us

  • Why not just do it by hand?
  • AUTOMATION, REUSE, SHARING

01. Setup

Setting Up - 1

Before we begin…

  • make a neat working environment using the terminal
  • obtain data
cd ~/Desktop        # or any convenient location
mkdir pni           # (python novice inflammation)
cd pni

LIVE DEMO

Setting up - 2

Before we begin…

  • make a neat working environment
  • obtain data
cp 2019-05-02-standrews/lessons/python/files/python-novice-inflammation-data.zip ./
unzip python-novice-inflammation-data.zip
cp 2019-05-02-standrews/lessons/python/files/python-novice-inflammation-code.zip ./
unzip python-novice-inflammation-code.zip

(you can download files via Etherpad: http://pad.software-carpentry.org/2019-05-02-standrews)

LIVE DEMO

02. Getting Started

Python in the terminal

We start the Python console by executing the command python

$ python
Python 3.6.3 |Anaconda custom (64-bit)| (default, Oct  6 2017, 12:04:38) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> 

LIVE DEMO

Python REPL

  • Python’s console is a read-evaluate-print loop (REPL), just like the shell
>>> 3 + 5
8
>>> 12 / 7
1.7142857142857142
>>> 2 ** 16
65536

LIVE DEMO

My first variable

  • To do interesting things, we want persistent values
  • variables are like named boxes
  • data goes in the box
  • when we use the name of the box, we mean what’s in the box

Creating a variable

  • To assign a value use the equals sign: =
  • The variable name/box label goes on the left, and the data item goes on the right
  • Character strings, or just strings, are enclosed in quotes
>>> name = "Samia"
>>> name
'Samia'
>>> print(name)
Samia

LIVE DEMO

Working with variables

weight_kg = 55
print(weight_kg)
2.2 * weight_kg
print("weight in pounds", 2.2 * weight_kg)

LIVE DEMO
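
A small sketch of a point that is easy to miss here: a variable keeps the value it was given when it was assigned, so changing weight_kg later does not update anything calculated from it earlier. The name weight_lb is an assumption introduced only for this illustration.

```python
# Conversion factor used on the slide: 1 kg is roughly 2.2 lb
weight_kg = 55
weight_lb = 2.2 * weight_kg      # calculated once, from the current value of weight_kg

weight_kg = 100                  # put a new value in the weight_kg "box"

# weight_lb still holds the value calculated from 55, not from 100
print("weight in kg is now:", weight_kg)
print("weight in pounds is still:", weight_lb)
```

To bring weight_lb up to date, the calculation has to be run again after the reassignment.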

Exercise 01 (1min)

What are the values in mass and age after the following code is executed?

mass = 47.5
age = 122
mass = mass * 2.0
age = age - 20
  1. mass == 47.5, age == 122
  2. mass == 95.0, age == 102
  3. mass == 47.5, age == 102
  4. mass == 95.0, age == 122

Exercise 02 (1min)

What does the following code print out?

first, second = 'Grace', 'Hopper'
third, fourth = second, first
print(third, fourth)
  1. Hopper Grace
  2. Grace Hopper
  3. "Grace Hopper"
  4. "Hopper Grace"

03. Data Analysis

Examine the data

LIVE DEMO

We want to produce summary information about inflammation by patient and by day

Python libraries

  • Python contains many powerful, general tools
  • Specialised tools are contained in libraries or packages
  • We call on libraries/packages, when needed
  • Packages are loaded with import
  • Packages are shared via repositories, e.g. PyPI and conda
  • To load our data in Python, we’ll use the numpy library
>>> import numpy

LIVE DEMO

Load data from file

  • numpy provides a function loadtxt() to load tabular data:
numpy.loadtxt(fname='data/inflammation-01.csv', delimiter=',')
  • dotted notation tells us loadtxt() belongs to numpy
  • fname: an argument expecting the path to a file
  • delimiter: an argument expecting the character that separates columns

Loaded data

>>> numpy.loadtxt(fname='data/inflammation-01.csv', delimiter=',')
array([[ 0.,  0.,  1., ...,  3.,  0.,  0.],
       [ 0.,  1.,  2., ...,  1.,  0.,  1.],
       [ 0.,  1.,  1., ...,  2.,  1.,  1.],
       ..., 
       [ 0.,  1.,  1., ...,  1.,  1.,  1.],
       [ 0.,  0.,  0., ...,  0.,  2.,  0.],
       [ 0.,  0.,  1., ...,  1.,  1.,  0.]])
  • The matrix is truncated to fit the screen
  • ... indicates omitted rows or columns
  • If there are no significant digits after the decimal point, they are not shown (1 == 1. == 1.0)

Assign the matrix to a variable called data

LIVE DEMO

What is our data?

>>> type(data)
>>> data.dtype

LIVE DEMO

Members and attributes

  • Creating the array also created information about it
  • Info stored in attributes (or members) that belong to data
  • data.<attribute> e.g. data.shape
>>> print(data.dtype)
>>> print(data.shape)

LIVE DEMO

Indexing arrays

  • We often work with subsets of data
    • individual rows (patients)
    • individual columns (days)
  • Counting of array elements starts at zero, not at one.
>>> print('first value in data:', data[0, 0])
first value in data: 0.0
>>> print('middle value in data:', data[30, 20])
middle value in data: 13.0

LIVE DEMO

Slicing arrays

  • To get a range of data from the array, index with [] and specify start and end indices
  • 0:4 means start at zero and go up to but not including 4
    • 0, 1, 2, 3
  • Define start and end separated by : (colon).
>>> print(data[0:4, 0:10])

LIVE DEMO

Another slice, please!

  • Don’t specify start, Python assumes the first element
  • Don’t specify end, Python goes up to and including the last element
>>> small = data[:3, 36:]
>>> print(small)

QUESTION: What would : on its own indicate?

LIVE DEMO
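
The indexing and slicing rules above are easier to check on an array small enough to read in full. A minimal sketch, assuming numpy is installed; small_data is a made-up 3x4 array standing in for the inflammation data.

```python
import numpy

# Hypothetical 3x4 array standing in for the inflammation data:
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]
small_data = numpy.arange(12).reshape(3, 4)

first = small_data[0, 0]        # single element: row 0, column 0
block = small_data[0:2, 1:3]    # rows 0 and 1, columns 1 and 2 (end index excluded)
tail = small_data[:2, 2:]       # from the first row up to row 1, from column 2 to the end

print(first)
print(block)
print(tail)
```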

Exercise 03 (1min)

We can take slices of any sequence, not just arrays.

>>> element = 'oxygen'

What is the value of element[:4]?

  1. oxyg
  2. gen
  3. oxy
  4. en

Array operations

  • arrays know how to perform operations on their values
  • +, -, *, /, etc. are elementwise
>>> doubledata = data * 2.0
>>> print(data[:3, 36:])
[[ 2.  3.  0.  0.]
 [ 1.  1.  0.  1.]
 [ 2.  2.  1.  1.]]
>>> print(doubledata[:3, 36:])
[[ 4.  6.  0.  0.]
 [ 2.  2.  0.  2.]
 [ 4.  4.  2.  2.]]

LIVE DEMO

numpy functions

  • numpy provides functions to operate on arrays
>>> print(numpy.mean(data))
6.14875
>>> maxval = numpy.max(data)
>>> print('maximum inflammation:', maxval)
maximum inflammation: 20.0
>>> minval = data.min()
>>> print('minimum inflammation:', minval)
minimum inflammation: 0.0
  • By default, these give summaries of the whole array.

LIVE DEMO

Summary for one patient

  • We want to summarise inflammation by patient

Extract a single row, or operate directly on a row

>>> patient_0 = data[0, :] # temporary variable
>>> print('maximum inflammation for patient 0:', numpy.max(data[0, :]))
maximum inflammation for patient 0: 18.0

LIVE DEMO

Summary for all patients

  • What if we need data for each patient or average on each day?
  • numpy functions take an axis= parameter: axis=0 gives one result per column (day), axis=1 gives one result per row (patient)
>>> print(numpy.max(data, axis=1))    # max by patient
>>> print(data.mean(axis=0))          # mean by day

LIVE DEMO
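
The axis= parameter is a common source of confusion, so here is a sketch on a tiny array where the answers can be checked by eye. The array toy and the result names are assumptions made for illustration; rows play the role of patients and columns the role of days.

```python
import numpy

# Hypothetical data: 2 patients (rows) x 3 days (columns)
toy = numpy.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 9.0]])

max_per_patient = numpy.max(toy, axis=1)   # work along each row: one value per patient
mean_per_day = numpy.mean(toy, axis=0)     # work down each column: one value per day

print(max_per_patient)   # two entries, one per row
print(mean_per_day)      # three entries, one per column
```

One way to remember it: the axis you name is the one that disappears from the result.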

04. Visualisation

Visualisation

Graphics package: matplotlib

matplotlib is the de facto standard/base plotting library in Python

>>> import matplotlib.pyplot

LIVE DEMO

matplotlib.pyplot.imshow()

matplotlib.pyplot.imshow() renders matrix values as an image

>>> image = matplotlib.pyplot.imshow(data)
>>> matplotlib.pyplot.show()
  • small values are dark blue, large values are yellow
  • inflammation rises and falls over a 40-day period

matplotlib.pyplot.plot()

  • matplotlib.pyplot.plot() renders a line graph

We want to plot the average inflammation level on each day

>>> ave_inflammation = numpy.mean(data, axis=0)
>>> ave_plot = matplotlib.pyplot.plot(ave_inflammation)
>>> matplotlib.pyplot.show()

QUESTION: does this look reasonable?

Investigating data

  • The plot of .mean() looks artificial
  • Look at other statistics to gain insight
>>> max_plot = matplotlib.pyplot.plot(numpy.max(data, axis=0))
>>> matplotlib.pyplot.show()
>>> min_plot = matplotlib.pyplot.plot(numpy.min(data, axis=0))
>>> matplotlib.pyplot.show()

QUESTION: does this look reasonable?

Exercise 04 (5min)

Can you create a plot showing the standard deviation (numpy.std()) of the inflammation data for each day across all patients?

Figures and subplots

We can put all three plots into a single figure

  • create a figure (fig) with fig = matplotlib.pyplot.figure()
  • add subplots to fig with ax = fig.add_subplot()
  • set labels on a subplot with ax.set_ylabel()
  • plot data to a subplot with ax.plot()

LIVE DEMO

Exercise 05 (5min)

Can you modify your script to display the three graphs on top of one another, instead of side by side?

Save your new script as exercise_05.py

05. for loops

Motivation

  • We wrote some code that plots values of interest from a single dataset

  • BUT we’re soon going to receive dozens of datasets to plot
  • So we need to make the computer iterate over the data

for loops

Spelling Bee

  • Suppose we wanted to spell a word out with one letter on each line
word = "lead"
print(word[0])
print(word[1])
print(word[2])
print(word[3])

QUESTION: Why is this not a good approach?

LIVE DEMO

for loops

  • for loops perform actions for every item in a collection
>>> word = "lead"
>>> for char in word:
...     print(char)
... 
l
e
a
d

QUESTION: Why is this better?

LIVE DEMO

for loop syntax

for variable in collection:
    <do things using variable>
  • The for loop statement ends in a colon, :
  • The code block is indented: four spaces by convention (a tab also works, but never mix tabs and spaces)

Counting with a for loop

Values defined outside a loop can be modified in the loop

>>> length = 0
>>> for vowel in 'aeiou':
...     length = length + 1
... 
>>> print("There are", length, "vowels") 

QUESTION: What output does this program give you?

LIVE DEMO

for loop variables

  • The loop variable is updated on each cycle
  • It keeps its value when the loop is finished
>>> letter = "z"
>>> print(letter)
z
>>> for letter in "abc":
...     print(letter)
... 
>>> print("after the loop, letter is:", letter)

LIVE DEMO

range()

range() is a Python function that creates a sequence of numbers

  • It returns a range type that can be iterated over in a loop
>>> seq = range(3)
>>> print("Range is:", seq)
>>> for val in seq:
...     print(val)
>>> seq = range(2, 5)
>>> seq = range(3, 10, 3)
>>> seq = range(10, 0, -1)

LIVE DEMO
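
Printing a range only shows its description, not the numbers it will produce. A small sketch, using the same calls as above, that wraps each range in list() so the values become visible; the variable names are assumptions.

```python
# range() is lazy: wrap it in list() to see the numbers it will produce
count_up = list(range(3))            # start defaults to 0; the stop value is excluded
from_two = list(range(2, 5))         # start, stop
in_threes = list(range(3, 10, 3))    # start, stop, step
count_down = list(range(10, 0, -1))  # a negative step counts down; stop is still excluded

print(count_up)
print(from_two)
print(in_threes)
print(count_down)
```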

Exercise 06 (5min)

Can you write a loop that takes a string, e.g. Newton, and produces a new string with the characters in reverse order, e.g. notweN?

HINTS

  1. You can “add” strings, e.g. "ab" + "cd"
  2. An empty string can be created with empty quotes: mystr = ""

06. lists

Lists

  • lists are a built-in Python datatype for storing multiple values
  • Denoted by square brackets, comma-separated
    • iterable lists of values
    • indexed and sliced like arrays
>>> odds = [1, 3, 5, 7]
>>> print("odds are:", odds)
>>> print('first and last:', odds[0], odds[-1])
>>> for number in odds:
...     print(number)

LIVE DEMO

Mutability

  • lists, like strings, are sequences
  • BUT list elements can be changed: lists are mutable
  • strings are not mutable
>>> names = ["Curie", "Darwing", "Turing"] # typo in Darwin's name
>>> print("names is originally:", names)
>>> names[1] = 'Darwin'    # correct the name
>>> print('final value of names:', names)
>>> name = "darwin"
>>> name[0] = "D"

LIVE DEMO
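
Since a string cannot be changed in place, the usual workaround is to build a new string and give it the old name. A minimal sketch; the try/except construct is used here only to let the example keep running after the expected error.

```python
name = "darwin"

# Strings are immutable: assigning to one character raises a TypeError
try:
    name[0] = "D"
except TypeError as err:
    print("cannot change a string in place:", err)

# Build a new string from pieces of the old one, and reuse the variable name
name = "D" + name[1:]
print(name)
```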

Changer danger

There are risks to modifying lists in-place

>>> my_list = [1, 2, 3, 4]
>>> your_list = my_list
>>> print("my list:", my_list)
>>> my_list[1] = 0
>>> print("your list:", your_list)

QUESTION: What is the value of your_list?

LIVE DEMO

list copies

  • To avoid this kind of effect:
    • make a copy of a list by slicing it or using the list() function
    • new_list = old_list[:]
>>> my_list = [1, 2, 3, 4]           # original list
>>> your_list = my_list[:]           # copy 1
>>> your_other_list = list(my_list)  # copy 2
>>> print("my_list:", my_list)
>>> my_list[1] = 0                   # change element
>>> print("my_list:", my_list)
>>> print("your_list:", your_list)
>>> print("your_other_list:", your_other_list)

LIVE DEMO
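
The difference between a second name for a list and a genuine copy can be tested directly with the is operator, which asks whether two names refer to the same object. A short sketch; the names alias and the_copy are assumptions for illustration.

```python
my_list = [1, 2, 3, 4]
alias = my_list          # a second name for the same list object
the_copy = my_list[:]    # a new list containing the same elements

my_list[1] = 0           # modify the original in place

print(alias is my_list)      # same object, so the change shows through alias
print(the_copy is my_list)   # different object, so the_copy is unaffected
print(alias, the_copy)
```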

list functions

  • lists are Python objects and have useful functions (methods)
>>> print(odds)
[1, 3, 5, 7]
>>> odds.append(9)
>>> print("odds after adding a value:", odds)
>>> odds.reverse()
>>> print("odds after reversing the list:", odds)
>>> odds.pop()
>>> print("odds after popping:", odds)

LIVE DEMO

Overloading

Overloading refers to an operator (e.g. +) having more than one meaning, depending on the thing it operates on.

  • For numbers, + means add
  • For lists, + means concatenate
>>> vowels = ['a', 'e', 'i', 'o', 'u']
>>> vowels_welsh = ['a', 'e', 'i', 'o', 'u', 'w', 'y']
>>> print(vowels + vowels_welsh)
>>> counts = [2, 4, 6, 8, 10]
>>> repeats = counts * 2
>>> print(repeats)

QUESTION: What does ‘multiplication’ (*) do for lists?

LIVE DEMO

07. Making choices

Conditionals

  • We often want to do <something> if some condition is true
  • To do this, we can use an if statement:
if <condition>:
    <executed if condition is True>
>>> num = 37
>>> if num > 100:
...     print('greater')
... 
>>> num = 149
>>> if num > 100:
...     print('greater')
... 

LIVE DEMO

if-else statements

  • An if statement executes code if the condition evaluates as true
    • But what if the condition evaluates as false?
if <condition>:
    <executed if condition is True>
else:
    <executed if condition is not True>
>>> num = 37
>>> if num > 100:
...     print('greater')
... else:
...     print('not greater')
... 

LIVE DEMO

Conditional logic

if-elif-else

  • We can chain tests together using elif (else if)
if <condition1>:
    <executed if condition1 is True>
elif <condition2>:
    <executed if condition2 is True and condition1 is not True>
else:
    <executed if no conditions True>
>>> num = -3
>>> if num > 0:
...     print(num, "is positive")
... elif num == 0:
...     print(num, "is zero")
... else:
...     print(num, "is negative")
... 

LIVE DEMO

Combining conditions

Conditions can be combined using Boolean Logic

  • Operators include and, or and not
>>> if (4 > 0) or (2 > 0):
...     print('at least one part is true')
... else:
...     print('both parts are false')
... 

LIVE DEMO

Exercise 07 (1min)

What is the result of executing the code below?

>>> if 4 > 5:
...     print('A')
... elif 4 == 5:
...     print('B')
... elif 4 < 5:
...     print('C')
  1. A
  2. B
  3. C
  4. B and C

More about operators

Two useful condition operators are == (equality) and in (membership)

>>> print(1 == 1)
>>> print(1 == 2)
>>> print('a' in 'toast')
>>> print('b' in 'toast')
>>> print(1 in [1, 2, 3])
>>> print(1 in range(3))

LIVE DEMO

08. Analysing multiple files

Analysing multiple files

  • We have several files of inflammation study data
    • We want to visualise/analyse each of them
    • We know how to load, visualise, loop over, and make decisions on the data

We will write a new script to do this:

  • analyse_files.py
  • BUT we need to know how to interact with the filesystem to get filenames
  • The os module allows interaction with the filesystem

os.listdir()

The os.listdir() function lists the contents of a directory

  • The list can be filtered with a for loop or list comprehension
  • Our data is in the data directory
import os

# Get a list of inflammation data files
files = []
for fname in os.listdir('data'):
    if 'inflammation' in fname:
        files.append(fname)
print("Inflammation data files:", files)

analyse_files.py

$ nano analyse_files.py
import matplotlib.pyplot
import numpy as np
import os
...

LIVE DEMO

$ python analyse_files.py

BUT something’s not quite right…

os.path.join()

  • The os.listdir() function only returns filenames, not the path (relative or absolute)

os.path.join() builds a path from directory and filenames, suitable for the underlying OS

files = []
for fname in os.listdir('data'):
    if 'inflammation' in fname:
        files.append(os.path.join('data', fname))
print("Inflammation data files:", files)
$ python analyse_files.py
Inflammation data files: ['data/inflammation-05.csv', …]

LIVE DEMO
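
The same filtering can be written as a list comprehension instead of a for loop. The sketch below builds a throwaway directory with made-up filenames so that it runs anywhere; in the course script the directory would simply be 'data'. Wrapping the result in sorted() is an addition of this sketch, because os.listdir() makes no promise about the order of the names it returns.

```python
import os
import tempfile

# Build a temporary directory with hypothetical filenames so the sketch is self-contained
with tempfile.TemporaryDirectory() as datadir:
    for fname in ['inflammation-02.csv', 'inflammation-01.csv', 'small-01.csv']:
        open(os.path.join(datadir, fname), 'w').close()

    # List comprehension: filter the names and build full paths in one expression
    files = sorted([os.path.join(datadir, fname)
                    for fname in os.listdir(datadir)
                    if 'inflammation' in fname])

    # Keep only the filename part for display (the temporary path differs on every run)
    basenames = [os.path.basename(path) for path in files]

print("Inflammation data files:", basenames)
```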

Visualising the data

Now we have all the tools we need to

  • load all the inflammation data files
  • visualise the mean, minimum and maximum values in an array of plots

The tools:

  • get a list of paths to the data files with os
  • load data from a file with np.loadtxt()
  • calculate summary statistics with np.mean(), np.max(), etc.
  • create figures with matplotlib
  • create arrays of figures with .add_subplot()

Visualisation code

We’re going to build the rest of this script together

$ nano analyse_files.py
$ python analyse_files.py 
Analysing data/inflammation-05.csv
Writing image to data/inflammation-05.png
Analysing data/inflammation-11.csv
Writing image to data/inflammation-11.png
[…]

LIVE DEMO

Checking Data

There are two suspicious features to some of the datasets

  1. The maximum values rose and fell as straight lines
  2. The minimum values are consistently zero

We’ll use if statements to test for these conditions and give a warning

Suspicious maxima

Is day zero value 0, and day 20 value 20?

$ nano analyse_files.py
    # Test for suspicious maxima
    if np.max(data, axis=0)[0] == 0 and np.max(data, axis=0)[20] == 20:
        print("Suspicious-looking maxima!")
$ python analyse_files.py

LIVE DEMO

Suspicious minima

Are all the minima zero? (do they sum to zero?)

$ nano analyse_files.py
    # Test for suspicious maxima
    if np.max(data, axis=0)[0] == 0 and np.max(data, axis=0)[20] == 20:
        print("Suspicious-looking maxima!")
    elif np.sum(np.min(data, axis=0)) == 0:
        print("Minima sum to zero!")
$ python analyse_files.py

LIVE DEMO

Being tidy

If everything’s OK, let’s be reassuring

$ nano analyse_files.py
    # Test for suspicious maxima
    if np.max(data, axis=0)[0] == 0 and np.max(data, axis=0)[20] == 20:
        print("Suspicious-looking maxima!")
    elif np.sum(np.min(data, axis=0)) == 0:
        print("Minima sum to zero!")
    else:
        print("Seems OK!")
$ python analyse_files.py

LIVE DEMO

XKCD: writing good code

10. Jupyter notebooks

Starting Jupyter

At the command-line, start Jupyter notebook:

jupyter notebook

LIVE DEMO

Jupyter landing page

LIVE DEMO

Create a new notebook

LIVE DEMO

My first notebook

  • Give your notebook a name (functions)

LIVE DEMO

Cell types

Jupyter documents are made up of cells

  • A Jupyter cell can have one of several types

Change the first cell to Markdown

LIVE DEMO

Markdown text

Markdown allows us to enter formatted text.

Execute a cell with Shift + Enter

LIVE DEMO

Entering code

Python code can be entered directly into a code cell

Execute a cell with Shift + Enter

LIVE DEMO

11. Functions

Motivation

  • We have code to plot values of interest from multiple datasets

BUT the code is long and complicated

  • It’s not flexible enough to deal with thousands of files
  • We can’t modify it easily

SO we will package our code for reuse: FUNCTIONS

What is a function?

Functions in code work like mathematical functions

\[y = f(x)\]

  • \(f()\) is the function
  • \(x\) is an input (or inputs)
  • \(y\) is the returned value, or output(s)

  • The output \(y\) depends in some way on the value of \(x\) - defined by \(f()\).

Not all functions you will see take input, or produce usable output, but the principle is generally the same.

My first function

fahr_to_kelvin() to convert Fahrenheit to Kelvin

\[f(x) = ((x - 32) \times \frac{5}{9}) + 273.15\]

LIVE DEMO
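
The slide gives the formula and leaves the code to the live demo. A minimal sketch of what the function might look like, transcribing the formula term by term; the parameter name temp and the docstring are assumptions.

```python
def fahr_to_kelvin(temp):
    """Convert a temperature in Fahrenheit to Kelvin."""
    return ((temp - 32) * (5 / 9)) + 273.15


print('freezing point of water:', fahr_to_kelvin(32))
print('boiling point of water:', fahr_to_kelvin(212))
```

The def line names the function and its input, the indented body does the work, and return hands the result back to whoever called it.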

Calling the function

  • Calling fahr_to_kelvin() in the notebook is the same as calling any other function
print('freezing point of water:', fahr_to_kelvin(32))
print('boiling point of water:', fahr_to_kelvin(212))

LIVE DEMO

Create a new function

Create a new function in your notebook, and call it.

def kelvin_to_celsius(temp):
    return temp - 273.15
print('freezing point of water', kelvin_to_celsius(273.15))

LIVE DEMO

Composing functions

Composing Python functions works the same way as for mathematical functions: \(y = f(g(x))\)

  • We could convert F (temp_f) to C (temp_c) by executing the code:
temp_c = kelvin_to_celsius(fahr_to_kelvin(temp_f))

LIVE DEMO

New functions from old

We can wrap this composed function inside a new function:

fahr_to_celsius:

def fahr_to_celsius(temp_f):
    return kelvin_to_celsius(fahr_to_kelvin(temp_f))
print('freezing point of water in Celsius:', fahr_to_celsius(32.0))

This is how programs are built:

combining small bits into larger bits until the behaviour we want is obtained

LIVE DEMO

Exercise 08 (5min)

Can you write a function called outer() that:

  • takes a single string argument
  • returns a string comprising only the first and last characters of the input, e.g.
print(outer("helium"))
hm

Function scope

Variables defined within a function, including parameters, are not ‘visible’ outside the function

  • This is called function scope
a = "Hello"

def my_fn(a):
    a = "Goodbye"

my_fn(a)  
print(a)

LIVE DEMO
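
A second sketch of function scope, from the other direction: a name created inside a function does not exist outside it, and return is the way to get a value out. The function add_unit and its variable label are hypothetical, and try/except is used only so the example keeps running after the expected error.

```python
def add_unit(value):
    label = str(value) + " K"   # 'label' exists only while the function runs
    return label                # return is how a value leaves the function


result = add_unit(300)
print(result)

# Outside the function, the local name is not defined
try:
    print(label)
except NameError as err:
    print("not visible outside the function:", err)
```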

Exercise 09 (1min)

What would be printed if you ran the code below?

a, b = 3, 7

def swap(a, b):
    temp = a
    a = b
    b = temp

swap(a, b)
print(b, a)
  1. 7 3
  2. 3 7
  3. 3 3
  4. 7 7

12. Refactoring

Tidying Up

Now we can write functions!

Let’s make the inflammation analysis easier to reuse: one function per operation

  • Open the analyse_files.py script from the first lesson

What operations should be put into functions?

The code is divisible into two sections

  1. check the data for problems
  2. plot the data

detect_problems()

  • We noticed that some data was questionable
  • This function spots problems with the data
    • Call the function after loading, before plotting
def detect_problems(data):
    if np.max(data, axis=0)[0] == 0 and np.max(data, axis=0)[20] == 20:
        print("Suspicious-looking maxima!")
    elif np.sum(np.min(data, axis=0)) == 0:
        print("Minima sum to zero!")
    else:
        print("Seems OK!")

LIVE DEMO

plot_data()

We’ll write a function called plot_data() that plots the data to file

def plot_data(data, fname):
    # create figure and three axes
    fig = matplotlib.pyplot.figure(figsize=(10.0, 3.0))
    [...]

LIVE DEMO

Code reuse

Our code is now much more readable

  • Loop over the files, load data, detect_problems() and plot_data()
# Analyse each file in turn
for fname in files:
    print("Analysing", fname)

    # load data
    data = np.loadtxt(fname=fname, delimiter=',')

    # identify problems in the data
    detect_problems(data)

    # plot image in file
    imgname = fname[:-4] + '.png'
    plot_data(data, imgname)        

Good code pays off

Why should I bother?

  • After 6 months, the referee’s report arrives and you need to rerun experiments
  • Another student is continuing the project
  • Some random person reads your article and asks for the code
  • Helps spot errors quickly
  • Clarifies structure in your mind as well as in the code
  • Saves you time in the long run! (“Future You” will back this up)

13. Command-line programs

Learning objectives

How can I write Python programs that will work like Unix command-line tools?

  • Use the values of command-line arguments in a program.
  • Handle flags and files separately in a command-line program.
  • Read data from standard input in a program so that it can be used in a pipeline (with pipes: |)

LIVE DEMO

The sys module

sys is a Python module for interacting with the Python interpreter and the environment it runs in

Open a new file called sys_version.py in your editor

$ nano sys_version.py
import sys
print('version is', sys.version)
$ python sys_version.py 

LIVE DEMO

sys.argv

sys.argv is a variable that contains the command-line arguments used to call our script

Open a new file called sys_argv.py in your editor

$ nano sys_argv.py
import sys
print('sys.argv is', sys.argv)
$ python sys_argv.py 
$ python sys_argv.py item1 item2 somefile.txt

LIVE DEMO
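
sys.argv is an ordinary list of strings, so it can be explored without the command line by passing a hand-made list of the same shape to a function. A sketch under that assumption; describe() and the example list are hypothetical.

```python
def describe(argv):
    """Split an argv-style list into the script name and the remaining arguments."""
    script = argv[0]       # element 0 is always the name of the script
    arguments = argv[1:]   # everything after it was typed by the user
    return script, arguments


# Same shape as sys.argv for: python sys_argv.py item1 item2 somefile.txt
script, arguments = describe(['sys_argv.py', 'item1', 'item2', 'somefile.txt'])
print('script:', script)
print('arguments:', arguments)

# Every element is a string, even if it looks like a number
count = int('3')           # convert explicitly when a number is needed
print(count + 1)
```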

Building a new script

We’re going to build a script that reports readings from data files

$ python readings.py mydata.csv
  • We will make it take options --min, --max, --mean
    • The script will report one of these
$ python readings.py --min mydata.csv
  • We will make it handle multiple files
$ python readings.py --min mydata.csv myotherdata.csv
  • We will make it take STDIN so we can use it with pipes
$ cat mydata.csv | python readings.py --min

Starting the framework

We start with a script that doesn’t do all that

  • We’ll build features in one by one
$ nano readings.py
import sys
import numpy

def main():
    script = sys.argv[0]
    filename = sys.argv[1]
    data = numpy.loadtxt(filename, delimiter=',')
    for m in numpy.mean(data, axis=1):
        print(m)

LIVE DEMO

Calling a script

A Python file can tell if it is being run as a script

  • If we do this, we can use the same file as:
    • a module (import readings)
    • a script ($ python readings.py)
  • The Python code has __name__ == '__main__' only when run as a script

We run main() only if the file is run as a script

if __name__ == '__main__':
    main()

Add this to readings.py and run the script

LIVE DEMO

Handling multiple files

We want to be able to analyse multiple files with one command

NOTE: wildcards are expanded by the shell before Python sees them

$ ls data/small-*
data/small-01.csv  data/small-02.csv  data/small-03.csv
$ python sys_argv.py data/small-*
  • All arguments from index 1 onwards are filenames
def main():
    script = sys.argv[0]
    for filename in sys.argv[1:]:
        print(filename)
        data = numpy.loadtxt(filename, delimiter=',')
        for m in numpy.mean(data, axis=1):
            print(m)

Handling flags

We want to use --min, --max, --mean to tell the script what to calculate

$ python readings.py --max myfile.csv

The flag will be sys.argv[1], so filenames are sys.argv[2:]

  • We should check that flags are valid
def main():
    script = sys.argv[0]
    action = sys.argv[1]
    filenames = sys.argv[2:]
    if action not in ['--min', '--mean', '--max']:
        print('Action is not one of --min, --mean, or --max: ' + action)
        sys.exit(1)
    for f in filenames:
        process(f, action)

Add process()

We split the script into two functions for readability

  • The process() function prints the summarised data
def process(filename, action):
    data = numpy.loadtxt(filename, delimiter=',')

    if action == '--min':
        values = numpy.min(data, axis=1)
    elif action == '--mean':
        values = numpy.mean(data, axis=1)
    elif action == '--max':
        values = numpy.max(data, axis=1)

    for m in values:
        print(m)

LIVE DEMO

Using STDIN

The final change will let us use STDIN if no file is specified

  • sys.stdin catches STDIN from the operating system
    if len(filenames) == 0:
        process(sys.stdin, action)
    else:
        for f in filenames:
            process(f, action)
$ python readings.py --max < data/small-01.csv

LIVE DEMO
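
Passing sys.stdin to process() works because numpy.loadtxt() accepts an open file-like object as well as a filename. The sketch below shows this with io.StringIO standing in for sys.stdin, so it runs without a pipe. It is a variation on the slide's process(): it returns the values instead of printing them, and the final else branch is an added assumption.

```python
import io
import numpy


def process(source, action):
    """Summarise each row of comma-separated data from a path or an open stream."""
    data = numpy.loadtxt(source, delimiter=',')
    if action == '--min':
        values = numpy.min(data, axis=1)
    elif action == '--mean':
        values = numpy.mean(data, axis=1)
    elif action == '--max':
        values = numpy.max(data, axis=1)
    else:
        raise ValueError('unknown action: ' + action)
    return values


# io.StringIO stands in for sys.stdin here (hypothetical two-row input)
fake_stdin = io.StringIO("0,1,2\n3,4,8\n")
row_maxima = process(fake_stdin, '--max')
print(row_maxima)
```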

14. Testing and documentation

Motivation

  • Once written, functions are reused
  • Functions might be reused without further checks
  • When functions are written:
    • test for correctness
    • document their function
  • Example: centring a numerical array

Create a new notebook

  • Call it testing

centre()

  • Add the function
import numpy as np

def centre(data, value):
    return (data - np.mean(data)) + value

LIVE DEMO

Test datasets

  • We could try centre() on real data
    • but we don’t know the answer!

Use numpy to create an artificial dataset

z = np.zeros((2, 2))
print(centre(z, 3.0))

LIVE DEMO

Real data

Try the function on real data…

data = np.loadtxt(fname='data/inflammation-01.csv', delimiter=',')
print(centre(data, 0))
  • But how do we know it worked?

LIVE DEMO

Check properties

  • We can check properties of the original and centred data
    • mean, min, max, std
centred = centre(data, 0)
print('original min, mean, and max are:', 
      np.min(data), np.mean(data), np.max(data))
print('min, mean, and max of centered data are:', 
      np.min(centred), np.mean(centred), np.max(centred))
print('std dev before and after:', 
      np.std(data), np.std(centred))      

LIVE DEMO
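
Checking properties by eye works, but the same checks can be written so that the computer does the comparison. A sketch using numpy.allclose(), which compares floating-point values within a small tolerance; the array sample is made up, and the names mean_ok and std_ok are assumptions.

```python
import numpy as np


def centre(data, value):
    return (data - np.mean(data)) + value


# Hypothetical test array with a known mean of 2.5
sample = np.array([[1.0, 2.0], [3.0, 4.0]])
centred = centre(sample, 0.0)

# Floating-point arithmetic rarely gives exactly 0.0, so compare with a tolerance
mean_ok = np.allclose(np.mean(centred), 0.0)
# Shifting every value by the same amount should not change the spread
std_ok = np.allclose(np.std(sample), np.std(centred))

print("mean is (close to) zero:", mean_ok)
print("standard deviation unchanged:", std_ok)
```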

Documenting functions

  • Writing comments in the code (using the hash #) is a good thing
  • Python provides for docstrings
    • These go immediately after the def line, as the first statement in the function body
    • Hook into Python’s help system
def centre(data, value):
    """Returns the array in data, recentered around value."""
    return (data - np.mean(data)) + value

help(centre)

LIVE DEMO

Default arguments

  • The centre() function requires two arguments
  • We can specify a default argument in how we define the function
def centre(data, value=0.0):
    """Returns the array in data, recentered around the value"""
    return (data - np.mean(data)) + value
centre(data, 0.0)
centre(data, value=0.0)
centre(data)

LIVE DEMO
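
The three ways of calling a function with a default argument can be tried without numpy. A sketch using a plain-Python stand-in, centre_list(), which is hypothetical and works on lists of numbers rather than arrays.

```python
def centre_list(values, value=0.0):
    """Return a new list with the numbers in values shifted so their mean is value."""
    mean = sum(values) / len(values)
    return [v - mean + value for v in values]


numbers = [1.0, 2.0, 3.0]

positional = centre_list(numbers, 10.0)      # second argument given by position
keyword = centre_list(numbers, value=10.0)   # same argument given by name
default = centre_list(numbers)               # value falls back to 0.0

print(positional)
print(keyword)
print(default)
```

A keyword argument must use the parameter's actual name; any other name raises a TypeError.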

Exercise 10 (10min)

Can you write a function called rescale() that
  • takes an array as input
  • returns an array with values scaled in the range [0.0, 1.0]
  • has an informative docstring
  • HINT: If lo and hi are the lowest and highest values in the original array, then the replacement for a value val should be (val - lo) / (hi - lo).

15. Errors and Exceptions

Create a new notebook

Errors

Programming n. - the process of making errors and correcting them until the code works

  • All programmers make errors
  • Identifying, fixing, and coping with errors is a valuable skill

Traceback

Python tries to tell you what has gone wrong by providing a traceback

def favourite_ice_cream():
    ice_creams = ["chocolate", "vanilla", "strawberry"]
    print(ice_creams[3])
favourite_ice_cream()

LIVE DEMO

Anatomy of a traceback

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-2-713948c7cbba> in <module>
----> 1 favourite_ice_cream()

<ipython-input-1-d5749f56e901> in favourite_ice_cream()
      1 def favourite_ice_cream():
      2     ice_creams = ["chocolate", "vanilla", "strawberry"]
----> 3     print(ice_creams[3])

IndexError: list index out of range
  • (mostly, you can just look at the last couple of levels)

LIVE DEMO

Syntax errors

  • Runtime errors (exceptions) occur when the code is valid Python but does something illegal when it runs
  • Syntax errors occur when the code is not understandable as Python
def some_function()
    msg = "hello, world!"
    print(msg)
     return msg

LIVE DEMO

Syntax traceback

  File "<ipython-input-3-dbf32ad5d3e8>", line 1
    def some_function()
                       ^
SyntaxError: invalid syntax

LIVE DEMO

Fixed?

def some_function():
    msg = "hello, world!"
    print(msg)
     return msg

LIVE DEMO

Not quite

  File "<ipython-input-4-e169556d667b>", line 4
    return msg
    ^
IndentationError: unexpected indent

LIVE DEMO

Name errors

  • NameErrors occur when a variable is not defined in scope
    • (often due to a typo!)
print(a)

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-5-c5a4f3535135> in <module>()
----> 1 print(a)

NameError: name 'a' is not defined
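
A sketch of the typo case mentioned above. The variable names are made up for illustration, and the try/except is only there so the example keeps running after the error:

```python
patient_count = 60

try:
    # 'patient_cuont' is a deliberate typo for 'patient_count'
    print(patient_cuont)
except NameError as err:
    name_error_message = str(err)
    print("NameError:", name_error_message)
```

The message names the misspelt variable, which is usually enough to spot the typo.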

LIVE DEMO

Index Errors

  • Trying to access an element of a collection that does not exist gives an IndexError
letters = ['a', 'b']
[...]
print("Letter #3 is", letters[2])

Letter #1 is a
Letter #2 is b
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-9-62bced7460d2> in <module>()
      2 print("Letter #1 is", letters[0])
      3 print("Letter #2 is", letters[1])
----> 4 print("Letter #3 is", letters[2])

IndexError: list index out of range
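
One possible way to avoid this error is to check the position against len() before indexing. A minimal sketch, where describe_letter is a hypothetical helper and not part of the lesson:

```python
letters = ['a', 'b']

def describe_letter(letters, position):
    """Return a description of the letter at a 1-based position."""
    # Valid 1-based positions run from 1 to len(letters)
    if 1 <= position <= len(letters):
        return "Letter #{} is {}".format(position, letters[position - 1])
    return "There is no letter #{}".format(position)

for position in range(1, 4):
    print(describe_letter(letters, position))
```

The third call reports that there is no letter #3 instead of raising an IndexError.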

LIVE DEMO

Exercise 11 (5min)

  • Can you read the code below, and (without running it) identify what the errors are?
  • Can you fix all the errors so the code prints abbabbabba?
for number in range(10):
    # use a if the number is a multiple of 3, otherwise use b
    if (Number % 3) = 0:
        message = message + a
    else:
        message = message + "b"
print(message)

16. Defensive programming

(Un)readable code

What does this function do?

def s(p):
    a = 0
    for v in p:
        a += v
    m = a / len(p)
    d = 0
    for v in p:
        d += (v - m) * (v - m)
    return numpy.sqrt(d / (len(p) - 1))

Readable code

What does this function do?

import numpy

def std_dev(sample):
    """Return the sample standard deviation of a sequence of numbers."""
    sample_sum = 0
    for value in sample:
        sample_sum += value

    sample_mean = sample_sum / len(sample)

    sum_squared_devs = 0
    for value in sample:
        sum_squared_devs += (value - sample_mean) * (value - sample_mean)

    return numpy.sqrt(sum_squared_devs / (len(sample) - 1))

First line of defence: sensible naming, style and documentation
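
Readable names also make the function easier to check. As a sketch, we can compare it against numpy.std with ddof=1, which divides by n - 1 just as the function does. The data values here are made up for illustration:

```python
import numpy

def std_dev(sample):
    """Return the sample standard deviation of a sequence of numbers."""
    sample_sum = 0
    for value in sample:
        sample_sum += value

    sample_mean = sample_sum / len(sample)

    sum_squared_devs = 0
    for value in sample:
        sum_squared_devs += (value - sample_mean) * (value - sample_mean)

    return numpy.sqrt(sum_squared_devs / (len(sample) - 1))

# Illustrative data only
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
ours = std_dev(data)
reference = numpy.std(data, ddof=1)  # ddof=1 gives the n - 1 denominator
print(numpy.isclose(ours, reference))
```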

Create a new notebook

Defensive programming

We’ve focused on the basics of building code: variables, loops, functions, etc.

  • We’ve not focused on whether the code is ‘correct’
  • Defensive programming is expecting your code to have mistakes, and guarding against them

Write code that checks its own operation

  • This is good practice
    • speeds up software development
    • helps ensure that your code is doing what you intend

Assertions

  • Assertions are a Pythonic way to see if code runs correctly
    • 10-20% of the Firefox source code is checks on the rest of the code!
  • We assert that a condition is True
    • If it’s True, the code may be correct
    • If it’s False, the code is not correct
assert <condition>, "Some text describing the problem"

Example assertion

numbers = [1.5, 2.3, 0.7, -0.001, 4.4]
total = 0.0
for n in numbers:
    assert n > 0.0, 'Data should only contain positive values'
    total += n
print('total is:', total)

QUESTION: What does this assertion do?
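
One way to explore the question is to run the loop inside try/except (an addition for illustration, not part of the slide) and see where it stops:

```python
numbers = [1.5, 2.3, 0.7, -0.001, 4.4]
total = 0.0
try:
    for n in numbers:
        # Stops the loop as soon as a non-positive value is seen
        assert n > 0.0, 'Data should only contain positive values'
        total += n
except AssertionError as err:
    assertion_message = str(err)
    print('assertion failed:', assertion_message)
print('total so far:', total)
```

The assertion fails at -0.001, so only the first three values are added and 4.4 is never reached.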

LIVE DEMO

When do we use assertions?

  • precondition - must be true at the start of an operation
  • postcondition - guaranteed to be true when the operation completes
  • invariant - always true at a particular point in the code
def normalise_rectangle(rect):
    """Normalises a rectangle to the origin, longest axis 1.0 units."""
    x0, y0, x1, y1 = rect
    dx = x1 - x0
    dy = y1 - y0
    
    if dx > dy:
        scaled = float(dy) / dx
        upper_x, upper_y = 1.0, scaled
    else:
        scaled = float(dx) / dy
        upper_x, upper_y = scaled, 1.0
        
    return (0, 0, upper_x, upper_y)
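
The three kinds of assertion listed above can be sketched on a smaller function. running_maximum is a hypothetical example and not part of the lesson:

```python
def running_maximum(values):
    """Return the largest value in a non-empty sequence."""
    # Precondition: there must be at least one value to inspect
    assert len(values) > 0, "Need at least one value"

    largest = values[0]
    for index, value in enumerate(values):
        if value > largest:
            largest = value
        # Invariant: largest is the maximum of everything seen so far
        assert largest == max(values[:index + 1]), "Lost track of the maximum"

    # Postcondition: the result is one of the input values
    assert largest in values, "Result is not one of the inputs"
    return largest

print(running_maximum([3, 9, 2]))
```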

Preconditions

Preconditions must be true at the start of an operation or function

  • Here, we want to ensure that rect has four values
def normalise_rectangle(rect):
    """Normalises a rectangle to the origin, longest axis 1.0 units."""
    assert len(rect) == 4, "Rectangle must have four co-ordinates"
    x0, y0, x1, y1 = rect
    [...]
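
To see the precondition in isolation, here is a sketch using a shorter, hypothetical helper (rectangle_extent is not part of the lesson) that applies the same check before unpacking:

```python
def rectangle_extent(rect):
    """Hypothetical helper: return (width, height) of a rectangle."""
    # Same precondition as on the slide
    assert len(rect) == 4, "Rectangle must have four co-ordinates"
    x0, y0, x1, y1 = rect
    return (x1 - x0, y1 - y0)

print(rectangle_extent((0.0, 0.0, 2.0, 4.0)))

try:
    # Only three values: the precondition fails before unpacking is attempted
    rectangle_extent((0.0, 1.0, 2.0))
except AssertionError as err:
    precondition_message = str(err)
    print("AssertionError:", precondition_message)
```

Without the assertion, the three-value input would still fail, but with a less specific unpacking error.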

LIVE DEMO

Postconditions

Postconditions must be true at the end of an operation or function.

  • Here, we want to assert that the upper x and y values are in the range (0, 1]
def normalise_rectangle(rect):
    """Normalises a rectangle to the origin, longest axis 1.0 units."""
    [...]

    assert 0 < upper_x <= 1.0, "Calculated upper x-coordinate invalid"
    assert 0 < upper_y <= 1.0, "Calculated upper y-coordinate invalid"    
        
    return (0, 0, upper_x, upper_y)
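
A sketch of the complete function with both kinds of check, putting together the pieces shown on the previous slides. The second input is an assumed example of bad data (x1 is smaller than x0), chosen to trigger a postcondition:

```python
def normalise_rectangle(rect):
    """Normalises a rectangle to the origin, longest axis 1.0 units."""
    assert len(rect) == 4, "Rectangle must have four co-ordinates"
    x0, y0, x1, y1 = rect
    dx = x1 - x0
    dy = y1 - y0

    if dx > dy:
        scaled = float(dy) / dx
        upper_x, upper_y = 1.0, scaled
    else:
        scaled = float(dx) / dy
        upper_x, upper_y = scaled, 1.0

    assert 0 < upper_x <= 1.0, "Calculated upper x-coordinate invalid"
    assert 0 < upper_y <= 1.0, "Calculated upper y-coordinate invalid"

    return (0, 0, upper_x, upper_y)

# A well-formed rectangle (x1 > x0, y1 > y0) passes both postconditions
good_result = normalise_rectangle((0.0, 0.0, 1.0, 5.0))
print(good_result)

# Assumed bad input: x1 < x0, so dx is negative and upper_x ends up below 0
try:
    normalise_rectangle((5.0, 0.0, 1.0, 4.0))
except AssertionError as err:
    postcondition_message = str(err)
    print("AssertionError:", postcondition_message)
```

The postcondition catches a problem that the length check alone lets through, which suggests a further precondition on the co-ordinate order might be worth adding.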

LIVE DEMO

Notes on assertions

Assertions help us understand programs

  • assertions declare what the program should be doing
  • assertions help the person reading the program match their understanding of the code to what the code expects

Fail early, fail often

  • Turn bugs into assertions or tests: if you’ve made the mistake once, you might make it again
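
A minimal sketch of turning a bug into an assertion and a test. The mean function and the bug it guards against (an empty list) are hypothetical examples, not taken from the lesson:

```python
def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    # This assertion records a past (hypothetical) bug: an empty list
    # would otherwise cause a ZeroDivisionError in the division below.
    assert len(values) > 0, "Cannot take the mean of no values"
    return sum(values) / len(values)

def test_mean():
    """Small regression tests that can be re-run after every change."""
    assert mean([1.0, 2.0, 3.0]) == 2.0
    assert mean([5]) == 5
    try:
        mean([])
    except AssertionError:
        pass  # expected: the precondition caught the empty input
    else:
        raise RuntimeError("mean([]) should have failed its precondition")

test_mean()
print("all tests passed")
```

Keeping test_mean alongside the code means the same mistake is caught immediately if it is ever reintroduced.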

09. Conclusions (Part 1)

Learning Outcomes

  • variables
  • data types: numpy.arrays, lists, strings, numbers
  • file IO: loading data, listing files, manipulating filenames
  • computing statistics
  • plotting data: plots and subplots
  • program flow: loops and conditionals
  • automating multiple analyses
  • Python scripts: edit-save-execute

Well done!

Building Programs With Python (Part 2)

Etherpad

Why are we here?

  • To learn basic concepts of programming (in Python)
  • How to solve problems in your research by…
    • Building scripts
    • Automating tasks
  • Good coding practice
    • Functions
    • Defensive programming