Pseudocode Programming: fake it before you make it

I’m interested in exploring ways to write better code. Ways of organising your workflow or your thoughts that are more advanced than ‘write down the first thing that comes into your head’. Techniques that will help to produce clean, reliable and fault-tolerant code, or at least help you to arrive at the same conclusion you would have anyway but more efficiently.

Something that I learned about when skimming through Code Complete 2 by Steve McConnell is the Pseudocode Programming Process. He seems to have been the one who invented it as I can’t find references to it outside of this context.

The idea is a simple one: before you start writing code, plan what you will do in pseudocode first. It’s very similar to what you would do in a whiteboarding interview, except without the sweaty anxiety of having to perform in front of a group of strangers.

The theory is that any kind of design process starts off with a bunch of preparatory work. You start off with drafts, wireframes or maquettes and iterate on these, front-loading your exploratory work while making mistakes is cheap. You don’t think about touching your final medium until you have most of the fundamentals worked out. This applies to building a bridge, making a painting, writing a book, or coding some software. That’s the idea, anyway.

McConnell goes into a lot more detail than that, which I’ve heavily summarised below.

The Pseudocode Programming Process, in brief

First: define your prerequisites

This is your pre-pseudocode preparatory work. You should consider:

  • what is the problem that the routine will solve?
  • is the job of the routine well-defined, and does it fit cleanly into the overall design?
  • what are the inputs, outputs, preconditions and postconditions?
  • what will you call your routine? Make sure it’s clear and unambiguous
  • how will you test your routine?
  • can you re-use functionality from standard or third-party libraries?
  • what could possibly go wrong, and how will you handle errors?

Next: design the routine in pseudocode

Now you can make your wireframe. During this stage you must resist the temptation, however strong it is, to write any real code. The idea here is to get a high-level framework in place; if you start writing code, you will start worrying prematurely about implementation details. Instead you should:

  • write in pseudocode, starting off from the general control flow and then get gradually more specific
  • use precise, language-agnostic English (or your preferred human language)
  • try out a few ideas, and pick the best one
  • back up and think how you’d explain it to someone else
  • keep refining until it feels like a waste of time not to write real code

Finally: write real code

The time has come to write real code! What a relief.

  • fill in the code below each comment, keeping the pseudocode as higher-level comments if you like (see the sketch after this list)
  • mentally check whether any further refactoring is needed
  • mentally check for errors
  • test out the code for real, and fix any errors
  • repeat as needed, going back to pseudocode if need be
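
To give a flavour of that final stage, here's a minimal sketch of a made-up routine (loading user settings) where the pseudocode has been kept as comments and the real code filled in beneath each one:

import json

def load_settings(path, defaults):
    # Try to read the settings file from disk.
    # If it's missing or malformed, fall back to the defaults.
    try:
        with open(path) as f:
            settings = json.load(f)
    except (OSError, ValueError):
        return dict(defaults)

    # Merge the loaded settings over the defaults,
    # so missing keys still get sensible values.
    merged = dict(defaults)
    merged.update(settings)
    return merged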

Trying it out

I experimentally used PPP last quarter to see if it would help me to write better code. Here are some of my thoughts about it:

  • I must confess that I never managed to do all of the steps of the (condensed) set above. I bet I'm not the only one. You have to be really conscientious not to skip any, and I wasn't that conscientious. There's a huge temptation to take shortcuts and jump straight in before completing the steps. However, as far as I can see, resisting that temptation is pretty much the whole point. The true value of the technique comes from pausing to look at what you are trying to achieve from a high level before diving in.

  • I found PPP most useful for tasks that were too big or complicated to hold in my head all at once. In these cases, it helped me break them down and make them less intimidating. I got myself out of a few ruts this way, where I was staring at my screen and didn't know where to start.

  • It sometimes felt like I had to check my ego before starting. Sometimes I felt too 'proud' to do PPP; I felt like I didn't 'need' it. On those occasions I probably ended up taking longer doing it the standard way (writing code as it came to me, then going back and revising) than I would have using PPP.

  • PPP didn’t feel worth it for small things (a few lines of code), although I suspect the level of complication where PPP starts to be useful is probably lower than it first appears.

In summary

PPP feels like a useful addition to my tool box, but I’m on the fence as to whether it’ll end up staying there long term. I’m going to try it out on more complicated projects and see where it takes me.


For more analysis and opinion about PPP, there are loads of interesting comments at the end of this article.

How I hacked my imposter syndrome using personal tracking

About nine months ago I started a job as a Data Scientist at DueDil. As this is my first proper data science job, a lot of things have been new to me: a whole new stack, new workflows for project management (like managing job tickets), goal-setting practices such as quarterly OKRs, not to mention the pace of development in a start-up. It wasn’t surprising to me that these took some time to get familiar with. However, an unexpected block I had was my perception of my own abilities. After some time, I found myself getting stressed and frustrated when I hit a bug or something I couldn’t understand.

I started to assume that whenever I couldn't do something, I was at fault for not being able to see the answer right away. Other people around me seemed to be able to waltz in and fix the problem immediately. The stress I was experiencing made it harder to think clearly, which in turn made it even harder to troubleshoot the bugs. The feelings were sometimes close to panic. It was a really uncomfortable state to be in.

It was clear I hadn't prepared myself for the feelings of frustration that are familiar to rookie and seasoned developers alike. And no, I haven't been living under a rock: I know what imposter syndrome is. I've listened to/read numerous talks, blogs and chat streams about it. Turns out, just knowing about a defect in your thought patterns isn't enough to fix it.

Introducing: the surprise journal

One of my good friends suggested I try something she called a ‘surprise journal’. The idea is that whenever you notice that you are surprised about something, you fill out an entry in the journal. You write down what surprised you, what you originally thought would happen, and what actually happened. Then later on, you can go back and review your entries and see what you’ve learned.

I decided that I’d adapt the idea for one of my OKRs this quarter. I thought that maybe by doing this, I could find some general patterns with things that I was getting stuck on, and that would help me to understand how I could improve myself. Actually, the journal ended up helping me in quite a different way to what I expected.

Format and process

I tweaked the format a bit to suit my own needs. I set it out in a table in Google Sheets. The column headers that I settled on were:

  • Date
  • What was I surprised or confused by
  • What I thought would happen
  • What actually happened
  • What did I learn
  • Notes
  • Category

I changed the criteria to include ‘confusion’, since this more accurately described my feelings than ‘surprise’ a lot of the time.

I decided that if I was stuck on a problem for more than about 15 minutes, then I’d include it in the journal. I didn’t want to put everything in there, since it’s typical to feel some level of confusion almost every other minute whilst coding, but most of these are tiny issues which are quickly resolved.

I'd try to fill out the first four columns first, whilst I was still struggling with the bug. That way I didn't yet know what the solution was, so I was filling it out in a state of uncertainty.

Once I had solved the problem (or gotten help from someone else), I filled in the ‘what did I learn’ column, and sometimes ‘notes’ if I needed more space for explanation.

Later on, after I had completed ten entries, I went back and filled out the ‘category’ column with a description of the problem domain. For example, some of my categories were Spark, bash, S3 and self-awareness.

An example entry

Here’s an example of one of my entries. It won’t make much sense out of context, but it gives a sense of how it looked.

Date: August 7 2017
What was I surprised or confused by: Why, when I run the data pipeline, are the metrics files not being appended together?
What I thought would happen: I expected a cumulative append of metrics files, rather than just one record for each date.
What actually happened: There was only the most recent record for each metric output. You can see in the Jenkins output that the three metrics files all run at the same time, so none has finished by the time the others look for a latest output file. Hence the more recent output cannot find any previous files to append.
What did I learn: Checking the Jenkins output is a good way to figure out order of execution.
Notes: -
Category: Spark, Jenkins

What did I learn?

I learned a lot of things while keeping the journal.

Keeping the journal made me calmer whilst debugging

I realised after a while that keeping the journal allowed me to remove my ego from the situation. It became a reminder that the difficulties I was having were normal and expected. Even the name 'surprise' felt quite pleasant and positive. Soon I wasn't feeling stressed about the bugs any more. It was like they turned from personal failings into puzzles to be solved.

I got used to the feeling of discomfort

Now that I look back on it, starting the entry before I knew the answer may have been crucial to the journal's impact. It forced me to pause and face my feelings of discomfort. Usually, my first instinct when debugging is to try and fix the problem as quickly as possible, because feeling stuck is so damn unpleasant. This time, I had to pause for long enough to actually articulate the problem. This gave me practice at experiencing the feeling of uncertainty, which made this feeling more routine and less scary.

Articulating the problem helped to solve it

Sometimes the journal became my rubber duck - articulating the problem helped me to get my thoughts in order and get to the answer quicker.

Everyone else is not a wizard

Something important that happened more than once was that a problem I thought would be trivial for someone else to solve ended up stumping them too. This gave me confidence that actually I was no different to anyone else. I also got to see how other people reacted to and solved the problem. I saw that it was possible to take a while to solve something without that seeming like a black mark against the person’s abilities.

There wasn’t something systematically wrong with my understanding

The thing I originally expected to happen (that I’d realise I was deficient in some area or other and that I would go away and read about it) didn’t materialise. The categories of things I got stuck on were more a reflection of whatever I was working on at the time than some systematic flaw in my reasoning. Looking back at the entries, none of them seem embarrassing or shameful - anyone could have gotten confused by these things.

Summary

In summary, this turned out to be a really useful tool for me to get some perspective and regain my confidence. I’m thinking about continuing it and maturing the format. I’d recommend it to anyone who wants to get a higher-level view on their problem-solving abilities.


For those currently struggling with imposter syndrome, an additional resource I found really helpful when I was researching this topic was this talk/blog post by Allison Kaptur on effective learning strategies for programmers.

You too can parallelise in Python

Parallelisation in Python has a bad rep, so much so that I’ve been put off learning about it in the past. However, sometimes in my work I come up against embarrassingly parallel problems: problems where parallelisation is a no-brainer. Parallel programming is a hot topic at the moment as a way to increase performance without increasing CPU power. Also, Python’s parallel programming libraries are maturing with each new release. It seems like a good time to get familiar with these concepts.

This post will start off with a brief introduction into what parallel programming is, and then describe some of the libraries you can use to do this in Python.

What is parallelisation?

A lot of the time people use ‘parallelisation’ or ‘parallel computing’ as generic umbrella terms to describe everything related to doing computation through simultaneous activity.

However, not all parallel computing is done in the same way. The broadest distinction is between two main types of parallel computing, concurrency and multiprocessing.

Concurrency versus multiprocessing

Concurrency and multiprocessing refer to two very different ways of doing simultaneous work. Concurrency describes interleaving execution of the tasks in the same timeframe, whereas multiprocessing involves adding more workers to do the tasks.

An analogy I came up with to explain the difference between these two concepts is a line of people queuing with their groceries at the supermarket. Each person has a certain amount of shopping to scan through the till. In the first scenario, everyone waits in a single line and has their shopping scanned through the till by the assistant one after the other. This is like an unparallelised job.

Concurrency

The first scenario works fine, but it's slow. Sometimes there are pauses which cause bottlenecks. Maybe someone forgot avocados for their guacamole, so everyone in the line has to wait while she goes and finds them (how selfish!). In scenario two, instead of waiting during this pause (as he did in scenario one), the till assistant starts scanning some items from the other customers. When the first customer returns with her avocados, the assistant resumes from where he was before. This is like how concurrency works.

Concurrency is when two or more tasks are being performed during overlapping time periods, but they're never being executed at exactly the same time. A real-world example would be a computer with a single core that is running drivers for the mouse, keyboard and display at the same time. The computer isn't ever running more than one process at a time; rather, it sneakily switches between them so fast that you don't notice the gaps.

This way, if one of the units of work is super slow, it doesn't mean all the others have to queue patiently before they can start; the others can carry on making progress in the meantime. Concurrency relies on the fact that processes can be broken down into discrete units (in some contexts these are called 'threads'), which can be run in any order without affecting the final output.

Multiprocessing

Back to the supermarket. Another thing that could happen when the queue gets too long is that they call up another shop assistant and open another till. Now the customers that were previously being served by one assistant are served by two. This is like how multiprocessing works.

Multiprocessing is bona fide parallelism (sometimes it’s just called ‘parallelism’) because at least two processes are being performed in parallel, at the same time. In practice this requires the individual pieces of computation to be performed on different CPUs on your machine (or on different machines entirely).

In other words, you could have hundreds of concurrent processes happening on just one core, but for multiprocessing you need multiple cores.

Note that if you start reading around about this topic, you'll find people saying that multiprocessing is trivial to implement whereas concurrency is really hard; it's been described by Dave Beazley as 'one of the most difficult topics in computer science (usually best avoided)'. It seems to me that when people say this they are referring specifically to multi-threaded code, whereas there are lots of alternative architectures you can use to implement concurrency. You can read about some of them in Seven Concurrency Models in Seven Weeks (it also looks like it's a good intro to parallel programming architectures in general).

Sounds great. Can I do parallelisation in Python?

Yes, you can, but it's not as trivial to implement as in some other languages.

Python (in particular CPython rather than say, Jython or IronPython) has something called the Global Interpreter Lock (GIL), which (for safety reasons) forces only one thread to be executed at a time.

The GIL's effect on the threads of your program is simple: 'one thread runs Python, while N others sleep or await I/O'. This makes performing multi-threading in Python as you would in other languages really difficult. It is possible to have control over the GIL if you write a Python library coded in C, but the prospect of that brings me out in a cold sweat. What's worse, the GIL is pretty much baked into CPython; removing it is really hard (so hard that it's been described as Python's hardest problem).

But don’t hate the GIL! Without it, the implementation of the CPython interpreter and C extensions would be much more complicated. We have the GIL to thank for much-loved C libraries such as NumPy and SciPy, which are some of the main reasons why Python is so popular as a data processing language.

Threads versus processes

Before looking at Python libraries, I’ll briefly mention the difference between a thread and a process, since these feature a lot in discussions about parallelisation.

A process is an instance of a computer program being executed, whereas a thread is a subset of a process.

The threads within a process share resources such as memory space; this means that if there is a variable within a process’ memory, it’s accessible to all the threads in that process.

Conversely, different processes don’t share the same memory space, so if they want to know what another one is thinking, they have to talk to one another. This can be an expensive operation to perform, so it’s better to avoid it if you can.
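
Here's a toy sketch of that difference, using a module-level counter (run it as a script):

import threading
import multiprocessing

counter = {'value': 0}

def increment():
    counter['value'] += 1

if __name__ == '__main__':
    # a thread shares the parent's memory, so the change is visible here
    t = threading.Thread(target=increment)
    t.start()
    t.join()
    print(counter['value'])  # 1

    # a child process works on its own copy of memory, so the
    # parent's counter is unchanged
    p = multiprocessing.Process(target=increment)
    p.start()
    p.join()
    print(counter['value'])  # still 1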

Python modules for parallelisation

I'll talk about four Python libraries for doing parallelisation: threading, multiprocessing, concurrent.futures and asyncio. There are others (such as Twisted, Tornado, and Celery) but these four are in the standard library, and as far as I can see they represent pretty well the main kinds of architecture for parallelisation that exist in Python.

threading

Using multiple threads to execute work is a form of concurrency. As I already said, in Python the GIL prevents two or more Python threads from executing at the same time. This is great for preventing threads from accidentally overwriting each others' memory (moar here), but it means that we can only parallelise using Python threads in certain situations.

In what situations should you use the Python threading module? Let’s say you are writing a program that is I/O limited (the time taken to complete an operation is limited by input/output operations). For example, you are fetching URLs by making requests over HTTP, and there are times when you are waiting for a response. If you could exploit these pauses to do other useful work, the overall time that your program takes to run will be shorter.

This is where threading comes in. In Python, if there is an I/O block, Python handily releases the GIL while waiting for the I/O block to resolve. In these situations there’s no pesky GIL preventing concurrent thread execution. Woohoo! So if you’re doing a lot of I/O bound operations, threading can be useful to make your programs’ execution more responsive.

Here is a toy example of how you could create threads to make HTTP requests. Note that time.sleep also releases the GIL while it waits, so we can use it to simulate a longer I/O block.

import threading
import time

import requests


urls = [
    'https://www.royalacademy.org.uk/',
    'http://www.metmuseum.org/',
    'http://www.artic.edu/',
]

def open_url(url):
    print(threading.current_thread().getName(), 'starting')
    print('Opening {}'.format(url))
    time.sleep(3)  # stands in for a long I/O wait (releases the GIL)
    requests.get(url)
    print(threading.current_thread().getName(), 'exiting')

if __name__ == '__main__':
    # spawn one thread per URL; they overlap while waiting on I/O
    for url in urls:
        t = threading.Thread(target=open_url, args=(url,))
        t.start()

This outputs the following in your terminal:

Thread-1 starting
Opening https://www.royalacademy.org.uk/
Thread-2 starting
Opening http://www.metmuseum.org/
Thread-3 starting
Opening http://www.artic.edu/
Thread-3 exiting
Thread-2 exiting
Thread-1 exiting

You can see that while one thread is waiting on I/O, the others get a chance to run; the three requests overlap instead of running one after another.

The whole syntax of instantiating a thread inside a for loop and passing in the arg is a bit unwieldy. Fortunately, there’s a nicer syntax set out in this great blog post involving mapping tasks over a thread pool.

A thread pool is a pool of workers that you can create and offload tasks to all at once. Unfortunately, you can't make Pool instances with threading. However, you can with multiprocessing.dummy, which replicates the multiprocessing API (more on this below) but uses threading under the covers (i.e. it uses threads instead of processes).

The pool instance has a handy map function that does just what you’d expect, except it makes the function calls concurrently:

import multiprocessing.dummy as dummy
import time

import requests


urls = [
    'https://www.royalacademy.org.uk/',
    'http://www.metmuseum.org/',
    'http://www.artic.edu/',
]

def open_url(url):
    print(dummy.current_process().name, 'starting')
    print('Opening {}'.format(url))
    time.sleep(3)  # stands in for a long I/O wait (releases the GIL)
    print(dummy.current_process().name, 'exiting')
    return requests.get(url)

if __name__ == '__main__':
    # a pool of three worker threads (despite the multiprocessing-style API)
    pool = dummy.Pool(3)

    # map the function over the URLs concurrently and collect the responses
    results = pool.map(open_url, urls)

    pool.close()
    pool.join()

This syntax is much cleaner and also has the advantage of giving me easy access to the results at the end.

multiprocessing

Threading is good for making your program run faster if it’s blocked by I/O operations. However, if your bottleneck is that your code is computationally expensive to run (it’s CPU-bound) then threading isn’t going to cut it. This is where you’ll need multiprocessing.

In contrast to threading, the multiprocessing module spawns new processes with separate GILs. This means that you can get around the threading issues and take advantage of multiple CPUs and cores. However, as I already mentioned it’s more difficult and more expensive to share data between processes than between threads.

Again, we can use a pool of workers and the mapping pattern. The API for multiprocessing is very similar to threading:

from multiprocessing import Pool

list_to_be_processed = range(10000)

def raise_number_to_itself(number):
    # CPU-heavy work: raise each number to its own power
    return number ** number

if __name__ == '__main__':
    pool = Pool()  # by default, one worker process per core
    results = pool.map(raise_number_to_itself, list_to_be_processed)
    pool.close()
    pool.join()

If you call top from your bash terminal you can see a list of your system processes. When I run the script, I can see nine Python3 processes pop up. By default Pool() will create as many processes as you have cores, and I have eight cores on my machine. The ninth must be the parent Python process that spawned the workers.

concurrent.futures

concurrent.futures is a standard-library module that was added in Python 3.2, and lets you do both threading and multiprocessing via the ThreadPoolExecutor and ProcessPoolExecutor classes.

Like the examples above for multiprocessing and multiprocessing.dummy, they let us create pools of threads or processes with a set number of workers, which we can submit tasks to. The pool takes care of distributing the tasks and scheduling.

Unlike multiprocessing, when we submit a task to the pool, we get back an instance of the Future class. A Future instance represents a deferred computation that may or may not have completed yet. Futures have a done method that we can call to check whether the task has finished executing, but the typical pattern is to use the add_done_callback method to notify the client code when the future is done executing (there's a sketch of this pattern after the example below). The result of the task might also be an exception. See the PEP that introduced futures for more information.

from concurrent.futures import ThreadPoolExecutor
import time

import requests


urls = [
    'https://www.royalacademy.org.uk/',
    'http://www.metmuseum.org/',
    'http://www.artic.edu/',
]

def open_url(url):
    time.sleep(3)
    return requests.get(url)

if __name__ == '__main__':
    with ThreadPoolExecutor(max_workers=5) as pool:
        results = pool.map(open_url, urls)

This time I used a context manager to take care of the cleanup, but otherwise it's almost the same syntax as the multiprocessing example above. In fact, I looked at the source code and saw that under the hood ProcessPoolExecutor uses the multiprocessing module, and ThreadPoolExecutor uses threading.
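
The map call above hides the Future objects entirely. Here's a minimal sketch of the submit/add_done_callback pattern mentioned earlier, assuming the same urls list and open_url function as before:

from concurrent.futures import ThreadPoolExecutor

def report(future):
    # called once the future completes; result() would also re-raise
    # any exception the task threw
    print('Fetched', future.result().url)

if __name__ == '__main__':
    with ThreadPoolExecutor(max_workers=5) as pool:
        for url in urls:
            future = pool.submit(open_url, url)
            print('Done yet?', future.done())  # almost certainly False
            future.add_done_callback(report)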

So why would you use concurrent.futures over threading or multiprocessing? Reasons include:

  • you get access to both threading and multiprocessing abilities in one library
  • it has a simpler interface, so is easier to use

Reasons you might not:

  • The API is more limited, so you can’t do as much as you can with multiprocessing or threading
  • If you’re using Python 2.7 or earlier you’ll have to use a backport of concurrent.futures

In general, the advice seems to be to use concurrent.futures if you can, since it’s intended to mostly replace multiprocessing and threading in the long-term.

asyncio

asyncio is apparently really hot right now. It's part of the standard library from Python 3.4 onwards. Luciano Ramalho in Fluent Python describes it as 'one of the largest and most ambitious libraries ever added to Python'.

Like threading and concurrent.futures, it is a package that implements concurrency. Also like concurrent.futures, Future objects form a foundation of the asyncio package. However, unlike any of the packages mentioned previously, asyncio uses some different constructs (event loops and coroutines) to implement a completely different concurrency architecture.

  • An event loop is a general computer science term for something which manages and distributes the execution of different tasks. It waits for events from an event provider and dispatches them to an event handler. An example might be the main loop in a program that is continually testing for whether the user has interacted with the user interface.

  • A coroutine is a Python construct that is an extension of a generator. How they work is beyond the scope of this post (there's a whole chapter dedicated to them in Fluent Python if you want to learn more, and this is an interesting and funny slide deck about them), but just know that where a generator only produces values, a coroutine can also consume them - i.e. you can send values to a coroutine. As such, they are able to receive values from an event loop. Importantly, you can use coroutines instead of threads to implement concurrent activities.

The key difference in how asyncio works compared to threading is that it uses a main event loop to drive coroutines that are executing concurrent activities, and it does so with a single thread of execution. Where threading cycles between threads in order to see whether they're still blocked, asyncio uses the event loop to keep track of and schedule all the coroutines that want time on the thread.

One tricky thing with asyncio is that you can't necessarily just use the same libraries that you would normally use synchronously, because they can block the event loop. You instead have to use special async versions of the libraries. For example, I used the requests library above to fetch web content over HTTP, but if I was using asyncio I'd have to use something like aiohttp.
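
For what it's worth, here is about the smallest sketch I can manage of the single-threaded event loop idea, using asyncio.sleep to stand in for a proper non-blocking request (Python 3.5+ syntax):

import asyncio

urls = [
    'https://www.royalacademy.org.uk/',
    'http://www.metmuseum.org/',
    'http://www.artic.edu/',
]

async def open_url(url):
    print('Opening {}'.format(url))
    # asyncio.sleep stands in for a real non-blocking request
    # (which would need something like aiohttp)
    await asyncio.sleep(3)
    print('Finished {}'.format(url))

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    # while one coroutine is 'sleeping', the loop runs the others,
    # all on a single thread
    loop.run_until_complete(asyncio.gather(*[open_url(u) for u in urls]))
    loop.close()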

I've been looking at the PEPs that were included in Python 3.5 and 3.6, and it looks like you can now use coroutines with a different syntax, and asynchronous versions of list, set and dictionary comprehensions, as well as generators. There are also plans to formalise the new asynchronous programming keywords async and await in Python 3.7. Clearly developing these features is a priority right now, and the asyncio ecosystem is maturing rapidly to keep pace with developments in other languages.

I really tried to put together a fuller equivalent of the examples above showing how asyncio works. I came up with something that kinda sorta works but I don't understand what it's doing. There are a lot of new concepts to take in with asyncio and I'm struggling to understand how they all fit together. It looks like I'm not the only one. Instead of trying to explain something I don't understand, I'm going to admit defeat at this point. Maybe I'll revisit asyncio at a later date.

If, unlike me, you can figure out how to use it, when should you use asyncio over concurrent.futures or threading? According to this person, asyncio is better when your I/O-bound operations are slow and there are many connections. This is because asyncio specifically keeps track of which coroutines are ready and which are still awaiting I/O, so it incurs fewer switching costs than threading does.

Summary

  • In summary, if you're doing work that is I/O-limited you need concurrency, whereas if you're CPU-limited you need multiprocessing (true parallelism). In Python you can achieve concurrency either using threads or using coroutines and event loops.
  • If you’re doing something pretty standard and can get away with using concurrent.futures, look no further.
  • If you need finer control and a richer API, use multiprocessing or threading.
  • If you are doing I/O with lots of connections (and really know what you're doing), use asyncio.

Scraps from my internship part 1: programming concepts

I am so behind with my blog that I haven't even gotten round to talking about Enthought yet, where I interned from September last year until a couple of weeks ago. Enthought write scientific software and run programming training courses. They do almost everything in Python, and there are some exceptionally skilled Pythonistas working there.

I actually originally learned Python using Enthought’s training videos and exercises, so it was very cool to get to work there. For my internship I worked on adding a new feature to the Canopy Geoscience application, a piece of software that helps geoscience researchers analyse their data.

I tried to keep a list of new concepts/tools as I came across them during the course of the internship. Rather than blog about the project I thought I’d jumble together some things I learned. Enthought very kindly gave me a copy of Fluent Python as a going away gift, so I’ve referenced it a few times here; it’s a very good book.

This was threatening to turn into a monster essay so I’ve decided to break it down into three separate posts:

  1. Programming concepts

  2. Python specifics

  3. Git tricks

Read on for part 1!

Part 1: Programming concepts

The hardest thing I found about my project was learning how to work with a big codebase. I imagined the codebase like a big and complex mechanism, with lots of interconnecting cogs, cams and pulleys. A bit like the inside of this watch:

Watch mechanism by Alex Brown

(Although unlike a watch, codebases tend to be sprawling and idiosyncratic, more like an organism that evolved over time than an elegant piece of machinery, so they are rarely neat and predictable. Anyway.)

I didn’t need to be intimately au fait with each and every part of the mechanism to be able to add a new feature, but I needed to be able to hold the general architecture in my head, and at the same time zoom in locally to the part I was working on. I found this really challenging and tbh I think it’s probably one of those skills that takes years to hone. Fortunately for would-be tinkerers, the danger of breaking something can be mitigated somewhat if you or someone else has written good …

Tests

Don't be a dummy. Write some tests. Source

When I was working on personal software or data science projects, I knew I was supposed to write tests, but it was really more of an aspiration than a necessity. In a grown-up software environment, tests are a necessity if you want to have any kind of safety net. Whenever I wrote a new feature, I got into the habit of writing tests for it using Python’s unittest module. Sometimes writing the tests took longer than writing the code itself.

I also used mocking a bit, which lets you replace parts of the system you want to test with mock objects. You might want to do this if the real objects are impractical to include in the test: perhaps you want to test a method that calls another method to open an internet page. You don't want to actually open the page during the test, so you mock the sub-method and check that it got called at the right time. You could do this within a context manager, as in the sketch below.
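
Here's a minimal sketch of that pattern; the Browser class and its methods are made up:

    import unittest
    from unittest import mock

    class Browser:
        def open_page(self, url):
            raise RuntimeError('no real network calls in tests!')

        def open_homepage(self):
            self.open_page('https://example.com/')

    class TestBrowser(unittest.TestCase):
        def test_open_homepage_opens_the_right_url(self):
            browser = Browser()
            # patch the sub-method within a context manager, so the
            # real call never happens
            with mock.patch.object(browser, 'open_page') as fake_open:
                browser.open_homepage()
            fake_open.assert_called_once_with('https://example.com/')

    if __name__ == '__main__':
        unittest.main()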

NB: You should avoid overzealous mocking. Ideally you should always be testing the behaviour of the original code pattern, rather than mocking your problems under the rug.

Model/View/Controller architecture

This is a broad programming concept that gets used a lot for designing user interfaces. The idea is that you have three discrete parts: the Model, which is the underlying data or information that you want to display; the View, which is the display itself; and the Controller, which combines the model and the view. In the canonical Model/View/Controller (MVC) system, the Model and the View should never know that each other exist; it's all up to the controller to manage the two-way data flow between them. In practice, the boundaries can be more fuzzy.

This separation of concerns is supposed to make your code more easily re-usable and testable: you can swap out the model for another and the view shouldn't care, and vice versa. I used MVC via Enthought's TraitsUI package. MVC seems to be one of those contentious programming concepts that makes people very cross, as evidenced by the comments on this article (warning to sensitive readers: the word 'hogwash' gets thrown around).
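
To make the idea concrete, here's a tiny sketch of the pattern (nothing to do with TraitsUI; all the names are invented):

    class TemperatureModel:
        """Model: holds the data, knows nothing about display."""
        def __init__(self):
            self.celsius = 0.0

    class TemperatureView:
        """View: displays whatever it is given, knows nothing about the model."""
        def render(self, celsius):
            print('Current temperature: {:.1f} C'.format(celsius))

    class TemperatureController:
        """Controller: the only part that knows about both."""
        def __init__(self, model, view):
            self.model = model
            self.view = view

        def set_temperature(self, celsius):
            self.model.celsius = celsius           # update the model...
            self.view.render(self.model.celsius)   # ...and refresh the view

    controller = TemperatureController(TemperatureModel(), TemperatureView())
    controller.set_temperature(21.5)  # prints: Current temperature: 21.5 C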

Interfaces and ABCs

Interfaces are another programming concept that you can be using in Python without even realising it. Fluent Python defines interfaces as:

The subset of an object’s public methods that enable it to play a specific role in the system.

To look at it another way, an interface is the essential set of functions or attributes that make your object … objecty.

For example, in Python the only methods that a class needs to be considered a sequence are the __len__ and __getitem__ methods. A class that implements these methods is a sequence, regardless of what classes it inherits from, or whatever other methods it has. This set of methods comprises a sequence's interface. In contexts like this, where the interface is not formally declared, it is known as a protocol. You might be familiar with language such as 'file-like object' or 'callable' to describe Python protocols - objects that behave in a certain way in certain contexts.
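
Here's a minimal sketch (essentially the card-deck example from Fluent Python):

    class Deck:
        ranks = [str(n) for n in range(2, 11)] + list('JQKA')
        suits = ['spades', 'hearts', 'diamonds', 'clubs']

        def __init__(self):
            self._cards = [(rank, suit) for suit in self.suits
                           for rank in self.ranks]

        # these two methods are all it takes to be a sequence
        def __len__(self):
            return len(self._cards)

        def __getitem__(self, position):
            return self._cards[position]

    deck = Deck()
    print(len(deck))    # 52
    print(deck[0])      # ('2', 'spades')
    print(deck[-1])     # ('A', 'clubs')
    for card in deck[:3]:   # slicing and iteration come for free
        print(card)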

The practice of operating with objects regardless of their types, as long as they implement certain protocols, is known in the programming community as duck typing (never mind if it is a duck – does it quack like one?). This is a central concept in a dynamically-typed language such as Python, where type-checking is done at runtime.

This duck is late for the train. Source.

There are also more formal ways of implementing interfaces in Python. These require you to explicitly define the interface and register any classes that implement it. When I was at Enthought I did this via the Traits library, which has its own custom syntax for interfaces (scroll down to ‘Implementing an Interface’). Core Python has an equivalent called ‘Abstract Base Classes’ (ABCs), which are nicely introduced in Fluent Python chapter 11.
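
As a (made-up) minimal sketch of the ABC approach:

    from abc import ABC, abstractmethod

    class Quacker(ABC):
        """A formally declared interface: implementers must provide quack()."""
        @abstractmethod
        def quack(self):
            ...

    class Duck(Quacker):
        def quack(self):
            return 'Quack!'

    print(Duck().quack())   # fine
    # Quacker()             # TypeError: can't instantiate abstract class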

Operator overloading

Following on from duck typing is operator overloading. This concept just means that certain operators (such as +, - and ==) can have different behaviours depending on the type of their arguments. So, if I define a custom class Book with an attribute pages, I can use + on instances of Book by defining the methods __add__ and __radd__ on my class (example from here).

    class Book:
        def __init__(self, pages):
            self.pages = pages
            
        def __add__(self, other):
            return self.pages + other
            
        # this is Python's 'reverse add', meaning if it is unable
        # to add a + b, it will try b + a (i.e. we need to make the
        # addition commutative)
        def __radd__(self, other):
            return self.pages + other

Now we can happily use the + operator, and with the magic of operator overloading we need never check the type of what we’re adding (it could be a number, a Book or a Duck, we don’t care).
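
For example, with some made-up page counts:

  >>> hobbit = Book(310)
  >>> hobbit + 20           # Book + number uses __add__
  330
  >>> 20 + hobbit           # number + Book falls back to __radd__
  330
  >>> hobbit + Book(250)    # even Book + Book works, via __radd__
  560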

Mixins

Finally for this post, mixins are classes that offer methods to other classes but are never designed to be instantiated themselves. Apparently this is a common design pattern in object-oriented languages. In Python, mixins are used via multiple inheritance: a class that needs the mixin's methods should inherit from the mixin class. However, the subclass should also inherit from another, non-mixin class. This is not a formal rule, but if you follow it (and the others in chapter 12 of Fluent Python!) you can bring order to the complexity of multiple inheritance.

Here’s an example I wrote involving cute furries. Imagine you have the superclasses Dog and Cat that each have their own attributes that are specific to their species.

  class Dog:
      def __init__(self, colour, bark):
          self.colour = colour
          self.bark = bark

  class Cat:
      def __init__(self, cuteness, name):
          self.cuteness = cuteness
          self.name = name

Now imagine that we want to make subclasses of Dog and Cat that have eating behaviour. We can create a mixin with eating methods:

  class Eat_Mixin:
      def eat(self, food):
          print('Yummy I like to eat {}'.format(food))

      def refuse(self, food):
          print('Yuck! I hate {}'.format(food))

Now we can use the mixin when creating the subclasses, using multiple inheritance (note super will only work without arguments like this in Python 3):

  class Daschund(Dog, Eat_Mixin):
      def __init__(self, colour='black', bark='loud'):
          super().__init__(colour, bark)

  class Manx(Cat, Eat_Mixin):
      def __init__(self, cuteness=5, name='bubbins'):
          super().__init__(cuteness, name)

Now if we instantiate Daschund or Manx, they will inherit the same eat methods from Eat_Mixin.

  >>> biggles = Daschund()
  >>> biggles.eat('cheese')
  Yummy I like to eat cheese

  >>> amanda = Manx()
  >>> amanda.refuse('chocolate')
  Yuck! I hate chocolate 

We shouldn't inherit from just Eat_Mixin, because it was never designed to be used as a concrete class.


That’s it for my first post in this series. Post 2 will look at some things about Python that I learned.

My summer of data science for social good

A happy new year to you for 2017 - here is a long overdue blog update! Almost a year ago now, I wrote a return statement for my time at the Recurse Center. Very belatedly, this is my return statement for DSSG (The Data Science for Social Good Summer Fellowship).

TLDR: It’s not an exaggeration to say that I had one of the best summers of my life at DSSG. If you’re considering whether this is a worthwhile thing to spend three months of your life doing, I say emphatically yes! Applications for summer 2017 are open now until the end of January.

What’s it all about?

DSSG is a programme that trains aspiring data scientists to work on social good problems. Around 45 fellows come to Chicago in the summer for 14 weeks to work in teams and build real data science solutions to problems faced by a range of social organisations. On the face of it, DSSG sounds similar in format to a data science bootcamp like Insight or The Data Incubator to name just two.

There are indeed similarities to a bootcamp: it is a high-intensity environment, there are a lot of people coming from academic backgrounds, and you spend your time working on data science projects. However, it is not a bootcamp as the term is commonly used. Its purpose is not to make money or to land its participants jobs after the summer. The goal of DSSG is to train fellows from academic backgrounds in how to use data science skills (such as statistics, machine learning, data mining etc.) to work on problems of a social nature. A parallel goal is to teach partner organisations in the social good space (such as non-profits or governmental organisations) how to work with data scientists to solve problems.

I will say, from my experience of also having attended the S2DS bootcamp in London, that DSSG felt completely different. There was a real buzz of excitement in the air from working on meaningful problems. People were having thoughtful conversations about our work and how it fitted into the wider world on a daily basis. We also got to interact with the social organisations themselves, and get a feel for the challenges they face. To put it bluntly, if you're just looking to find a data science job (and there's nothing wrong with that) you're probably better off trying one of the hundreds of bootcamps that already exist. If, however, you have data science skills (and that category has a broad scope) that you want to use for social good, this could be for you.

The project

The view from the DSSG work space in downtown Chicago

During the fellowship, I worked in a team of four on a project for the Metro Nashville Police Department, creating an early intervention system to predict officers who are at risk of adverse interactions with the public. An adverse interaction could be anything from an injury (to a civilian or officer), a poorly-judged use of force, or a citizen complaint. Such events range in severity, but at their worst they can have tragic and irreparable consequences for the individuals and departments involved.

At the heart of this project was the need to try to predict these types of events before they occur. With this information, departments can intervene (by offering officers extra training, counselling or other interventions) and ideally prevent these negative events from ever occurring at all. With recent events in the States, this is obviously a highly emotionally-charged topic. It was challenging to maintain a balanced perspective for our work, whilst there was (and still continues to be) what felt like real social upheaval happening around us. At times this was difficult to navigate, but it was also an immensely valuable learning experience.

A Black Lives Matter demonstration I saw on my way home from work. They were happening all over the States whilst I was in Chicago.

The data

We had access to 5 years’ worth of internal data from the police department, including information about the characteristics of individual officers (age, gender etc.), the districts they worked in, any activity they had participated in (such as when and where they were dispatched to) and any complaints filed against them, or compliments they had received. We used individual officers that had been involved in an adverse incident as our labelled data (the thing we were trying to predict and prevent).

We created a pipeline to clean the data, format it, run machine learning models on it and ultimately create a risk score for each officer. This risk score was a prediction for how likely an officer was to have an adverse interaction in the next year. In order to check the success of our predictions, we used a method called temporal cross-validation (otherwise known as ‘rolling forecasting origin’) which involved picking a day in the past, training the model using data up until that point, and then testing whether the model could predict what would happen in the ‘future’ (the remaining data we had).
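
Here's a toy sketch of the idea; the dates and function are invented, and the real pipeline was much more involved:

from datetime import date, timedelta

def temporal_splits(start, end, n_splits, horizon=timedelta(days=365)):
    """Yield (train_end, test_end) cut-offs that roll forward in time."""
    step = (end - start - horizon) // n_splits
    for i in range(1, n_splits + 1):
        train_end = start + step * i
        yield train_end, train_end + horizon

for train_end, test_end in temporal_splits(date(2010, 1, 1),
                                           date(2016, 1, 1), 4):
    # train on all data before train_end, then evaluate on the
    # 'future' window between train_end and test_end
    print('train < {} | test {} to {}'.format(train_end, train_end, test_end))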

Project collaboration and designing a database

Our project was actually a continuation of a project that was initiated in Charlotte, NC last year. The Charlotte project was also being continued by another team of fellows this year and so we worked in close collaboration with them, since our projects were so similar.

So similar, in fact, that we built a single pipeline that both of us used to create our officer risk scores. During the summer I guest blogged for DSSG about the process of creating a common database schema to house data from both Nashville and Charlotte’s police departments. It was really interesting to sit down and design a database schema and pipeline process from scratch. What is great about working on early-stage projects like these is how much input we each got to have over the design process.

One of many whiteboards we used to design our database

Feature generation

Aside from ETL (Extract, Transform, Load), a lot of our time was spent generating features. This is where our weekly meetings and the site visit to Nashville (you can read about the site visit that the Charlotte team went on) became so useful. The knowledge of the police officers was incredibly informative in telling us what features could turn out to be predictive. For example: one warning sign they suggested for an officer under stress is someone who is taking a lot of days off after their regular days off; this could be indicative of someone with a substance abuse problem. Similarly, an officer who is going through a divorce is likely to be under a lot more stress than usual. Working on this project, it was clear that a successful outcome was going to involve constant communication and feedback between us and our project partners in Nashville.

The results

By the end of the summer, our top-performing model (a variation on a Random Forest) was able to correctly flag 80% of officers who would go on to have an adverse interaction, whilst only requiring intervention on 30% of officers in order to do so. Although this was just a first pass, if we had been using a threshold-based system as has been used in other police departments, we would have needed to flag 2 out of every 3 police officers in the department for the same level of accuracy. The Center for Data Science and Public Policy continues to carry this project forward, and the intention is that it will soon be implemented in real life. The code that we wrote will also soon be made open source. You can read an official update on the police projects from this summer on the DSSG website.

The people

I’ve talked a lot about the project, but for me DSSG was really all about the people. The people were so great. It can be hard as an adult to make real and lasting friendships; DSSG was a wonderful exception to that rule. I found myself making real connections, quickly. Everyone was smart, engaged, and cared about the world. After the three months were over, it felt like we’d known each other a lot longer. The best way I can describe it is that it was like being at summer camp for adults. I’m looking forward to seeing what this community will go on to do in the coming years.

Summary

DSSG exceeded my expectations. I got to learn from and interact with a community of people who care about using their skills to do social good, some of whom I expect to be friends with for life. The project itself was exciting and I learned a lot about what it takes to work with data science on social problems (hint: it's not straightforward!). We were also given loads of opportunities to present our work to the other fellows and at meetups with the local tech community. The outcome of the projects themselves was only one small part of the wider goal of training and education. As far as making an impact is concerned, DSSG is playing the long game.