I'll be spending the summer in Chicago doing data science for social good

A quick update on some career progression stuff. It’s been a couple of months since RC finished (it feels like an age) and I’ve been applying to a few jobs and internships. I was delighted to find out recently that I’ve been awarded a place on Data Science for Social Good, a summer fellowship funded by the Eric and Wendy Schmidt foundation and run by the University of Chicago. In their own words:

The Eric & Wendy Schmidt Data Science for Social Good Fellowship is a University of Chicago summer program to train aspiring data scientists to work on data mining, machine learning, big data, and data science projects with social impact. Working closely with governments and nonprofits, fellows take on real-world problems in education, health, energy, public safety, transportation, economic development, international development, and more.

For three months in Chicago they learn, hone, and apply their data science, analytical, and coding skills, collaborate in a fast-paced atmosphere, and learn from mentors coming from industry and academia.

I’ve known about this programme for some time and applied unsuccessfully last year, so it feels great to have a place this year. Whereas previous programmes I’ve done this year (S2DS and RC) were mostly about experience and skill building, DSSG is also an opportunity to become acquainted with how data science works in a particular field: namely, in the social good realm. I know others who have done DSSG and have gotten a lot out of it, so I think that this is going to be a very valuable experience.

You can see my happy face on the 2016 fellows page!

A summary of my time at RC

RC is over for me, or at least the in-person component (RC-ers never graduate, remember?). This post is a reflection on the things that I learned there, as well as my highlights and lowlights of the past 3 months.

tl;dr: had a great time, learned a lot. If you want to get better at programming and also meet a whole lot of smart, passionate people who will help you do it, RC is the place.

Things I learned

I gained a heap of object-level knowledge:

  • I learned a lot more about Python, including decorators and generators and pickles, and I finally got to grips with classes by working on multiple projects
  • I learned about common computer science data structures and algorithms at workshops organised by Pauli and Javier; I coded some of these up during whiteboarding sessions
  • I learned how to use Flask to make Web applications
  • I learned how to use d3.js to make interactive visualisations
  • I learned about test driven development, and the modules for this in Python
  • I played around with web scraping in Selenium
  • I read a lot about modelling linear regressions in Gelman and Hill, and practiced these in R
  • I learned about relational databases, and using them to store and query georeferenced data
  • I learned how to use the tweepy library to scrape Twitter data
  • I had my first taste of AWS (specifically RDS for serving a database, and S3 for storing data). It was prohibitively complex for a beginner.
  • I learned how to quickly knock together a web page with HTML/CSS/Javascript and style it so it looks pretty
  • I learned that such things as monads exist… but I still don’t understand what they are :D

I also learned less tangible stuff:

  • I gained confidence in my programming ability, and gained a greater understanding of where I fit in the programming ability spectrum
  • I learned to enjoy whiteboarding! I went from ‘I am terrified I can’t even write’ to ‘this is fun, let’s do another one!’
  • I greatly broadened my knowledge of the programming ecosystem by seeing what kinds of projects other people were working on
  • I updated my expectations for how much I can hope to achieve in a certain amount of time. I seem to achieve less than I think I will on a given day but learn faster than I expect to over successive days/weeks

The best bits

Here’s my RC highlights reel:

  • Making Planigale with Dave, and getting featured on Jerry Coyne’s blog
  • Running around Manhattan on NYE with Shad, Andrew, Diego and Darius
  • Having other people be excited about stuff I made
  • Playing in the snow with Ezekiel
  • Experiencing the slow seeping of understanding about functional programming principles
  • Friday night talks, and the amazing projects people had made. A few of my favourites were Javier’s pixelarttoCSS, Carrie’s painting colour theory ML and Jesse’s Fourier transformations
  • Singing bad songs at karaoke
  • Defusing virtual bombs with multiple people.
  • Everyone’s excitement when Allie brought in her Arduino-hacked knitting machine
  • Making origami cranes over Christmas
  • Getting Chinese food, and Peruvian food, and Malaysian food, and Mexican food, and …

The not so best bits

For balance, here’s my RC bloopers reel:

  • Spending ~30 days (really, I checked my notes) at the start of my batch being confused about what to work on
  • Spending way too long worrying about housing for the second half of my batch
  • Getting a cold and RSI in my last week so that I effectively stopped typing
  • The days when I felt like I didn’t move forward with my project
  • Feeling project envy when everyone else seemed to be making awesome stuff

This post has turned into a list splurge. I will just finish up by saying that I am very glad I made the decision to come to RC. I can’t think how long it would have taken me to learn or encounter such things, or to meet such a community, had I not done so.

Slaying the SQL dragon

I think many developers have tools or techniques that they’re scared of using. Some magic that doesn’t make sense, so they avoid using it in the hopes it will go away. Maybe for some people it is multiple inheritance, for others functional programming. For me, it’s databases. I’m not really sure where my fear of databases came from. Maybe it’s because you have to use a special alien language to speak to them. Maybe it’s because they can be large and unwieldy and difficult to look at all at once. I don’t really know. All I know is, it’s time to slay this dragon. Or rather, not slay it, but learn how to speak to it nicely so that it will give me gold :D. What follows below is my brief introduction to dragons, er, relational databases.

What is a relational database?

A relational database is a digital database organised according to the relational model of data: a simple set of concepts that allows us to build very complex data structures. In essence, relational databases contain one or more tables with rows and columns, where each row has a unique key for identification. Rows in one table can be linked to rows in another table by storing the key of the row being linked to (a foreign key). A useful analogy I saw here is that a database is like a whole Excel spreadsheet file, whereas the individual database tables are like the tabs/worksheets in that Excel file.

Different relationships can link columns within and between tables in a relational database. You can have one-to-one relationships, one-to-many relationships and many-to-many relationships between rows in different tables.
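To make the many-to-many case concrete, here is a minimal sketch using Python's built-in sqlite3 module. The friends, clubs and memberships tables, and all the names in them, are made up for illustration; the point is only that a third "junction" table holds pairs of keys, so one friend can belong to many clubs and one club can have many friends.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: memberships is the junction table that links
# rows in friends to rows in clubs by storing both keys.
conn.executescript("""
    CREATE TABLE friends (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE clubs (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE memberships (
        friend_id INTEGER NOT NULL REFERENCES friends(id),
        club_id INTEGER NOT NULL REFERENCES clubs(id),
        PRIMARY KEY (friend_id, club_id)
    );
    INSERT INTO friends (id, name) VALUES (1, 'Ada'), (2, 'Alan');
    INSERT INTO clubs (id, name) VALUES (1, 'Chess'), (2, 'Running');
    INSERT INTO memberships (friend_id, club_id) VALUES (1, 1), (1, 2), (2, 1);
""")

# Follow the links in both directions to list who belongs to what.
pairs = conn.execute("""
    SELECT friends.name, clubs.name
    FROM memberships
    JOIN friends ON friends.id = memberships.friend_id
    JOIN clubs ON clubs.id = memberships.club_id
    ORDER BY friends.name, clubs.name
""").fetchall()

for friend, club in pairs:
    print(friend, "is a member of", club)
```

A one-to-many relationship needs no junction table: the "many" side simply stores the key of the single row it points to.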

Advantages of relational databases

Why not just use one big flat table? Why bother with linking between different tables? There are several advantages to relational databases compared to a standard flat file:

  1. Data is only stored once

    You don’t need to have multiple records for a single entity. Let’s say for example you have a database of ten friends in your address book, and these ten friends collectively live in four different cities. You can have two tables in your database, one for friend names and street address, and one for cities. You can create links between rows in your cities and friends tables without the need to duplicate any information. This is good for several reasons:

    • Less duplication means the database takes up less space and so is more storage efficient.

    • Since there’s no duplication, you also eliminate possible inconsistencies, for example if you had a column in a single flat table for city, you might end up with some items misspelled (e.g. ‘Londn’ instead of ‘London’).

    • It should also be much easier to change information if it only exists in one place, e.g. if for some bizarre reason the UK decided to change the name of London to ’Jabberwocky’, you’d only have to update this information once in our fictitious relational database of addresses.

  2. Data has a fixed type

    This means that your text will always be interpreted as text, your numbers as numbers, your dates as dates. You can avoid typos like iO instead of 10.

  3. You can apply complex queries

    …to pull out exactly the information you want, from multiple tables at once. You can use these queries for further analysis without having to duplicate your data, as you might do in an Excel spreadsheet, thus cutting out the middle man.

  4. It’s easier to maintain security

    By splitting the data up into separate tables, you can ensure that in certain situations, only part of the data can be made accessible to a particular individual. For example, if you are using a database for a web application, you might want to restrict an individual user to their own information, instead of giving them access to all of the email addresses of everyone signed up to your service.

  5. You can cater to future requirements

    It’s easy to add more data that are not yet needed, but might be in the future. For example, you might be going to a cheese rolling convention in Manhattan, where you anticipate making lots of new friends from around the world. In preparation for your trip, you could expand the cities table in your friend address database to include all of the cities in the world, even though they aren’t being referenced by anything yet. You can’t do this with a flat table (well, you could, but not without adding a lot of ugly null values). Of course, designing a database from scratch that is extensible and maintainable can be really tricky, as demonstrated in this fun blog post about designing the most egalitarian marriage database.
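The address book example from the first advantage can be sketched in a few lines with Python's built-in sqlite3 module. The table layout, column names and the three friends below are assumptions for illustration only. Each city name is stored once, a single query pulls data from both tables, and renaming London only touches one row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: each friend stores the key of a city rather
# than the city's name, so the name itself lives in exactly one place.
conn.execute("CREATE TABLE cities (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE friends (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        street TEXT,
        city_id INTEGER REFERENCES cities(id)
    )
""")
conn.executemany(
    "INSERT INTO cities (id, name) VALUES (?, ?)",
    [(1, "London"), (2, "Chicago")],
)
conn.executemany(
    "INSERT INTO friends (name, street, city_id) VALUES (?, ?, ?)",
    [
        ("Ada", "1 Example Road", 1),
        ("Grace", "2 Sample Street", 1),
        ("Alan", "3 Placeholder Avenue", 2),
    ],
)


def friends_with_cities(connection):
    """Return (friend name, city name) pairs by joining the two tables."""
    return connection.execute("""
        SELECT friends.name, cities.name
        FROM friends
        JOIN cities ON friends.city_id = cities.id
        ORDER BY friends.name
    """).fetchall()


before = friends_with_cities(conn)

# The rename happens in one row of cities; no friend rows are edited.
conn.execute("UPDATE cities SET name = ? WHERE name = ?", ("Jabberwocky", "London"))
after = friends_with_cities(conn)

print(before)
print(after)
```

In a flat table the same rename would mean editing every row that mentions London, and hoping none of them were spelled 'Londn'.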

Interacting with relational databases

Data manipulation in relational databases is performed by making queries in Structured Query Language (SQL). Every operation on the data falls into one of four fundamental types:

Create – Putting data into the table

Read – Querying data from the table

Update – Changing data that is already in the table

Delete – Removing data from the table

These all add up to the delightful acronym ‘CRUD’.
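As a rough sketch, the four operations map onto the SQL statements INSERT, SELECT, UPDATE and DELETE. The example below runs them through Python's built-in sqlite3 module against a made-up friends table; the table and its contents are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Setting up the table is schema definition, not one of the four
# CRUD operations on the data itself.
conn.execute("CREATE TABLE friends (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")


def all_friends(connection):
    """Read: query every (name, city) pair from the table."""
    return connection.execute(
        "SELECT name, city FROM friends ORDER BY name"
    ).fetchall()


# Create: put data into the table.
conn.execute("INSERT INTO friends (name, city) VALUES (?, ?)", ("Ada", "London"))
conn.execute("INSERT INTO friends (name, city) VALUES (?, ?)", ("Alan", "Chicago"))
rows_after_create = all_friends(conn)

# Update: change data that is already in the table.
conn.execute("UPDATE friends SET city = ? WHERE name = ?", ("New York", "Ada"))
rows_after_update = all_friends(conn)

# Delete: remove data from the table.
conn.execute("DELETE FROM friends WHERE name = ?", ("Alan",))
rows_after_delete = all_friends(conn)

print(rows_after_create)
print(rows_after_update)
print(rows_after_delete)
```

The ? placeholders let sqlite3 insert the values safely, rather than pasting them into the SQL string by hand.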

Wikipedia says that SQL is based on relational algebra and tuple relational calculus. I don’t know what those are, but what I should take from that is that SQL’s roots are in mathematics. It is not a general-purpose programming language. Thus, you’re not allowed to get annoyed if SQL isn’t like your favourite programming language. It’s a fundamentally different thing.

Learning SQL syntax is a whole massive topic in itself. I found the following resources to be helpful:

Database Management Systems (DBMSs)

To complicate things further, there are lots of different kinds of systems for letting a user and other applications communicate with a database. Popular DBMSs include PostgreSQL, MySQL, Microsoft SQL Server, Oracle and SQLite. They all have slightly different advantages and disadvantages. Of these, SQLite is the major odd one out as it doesn’t have a client-server architecture (in which the database lives on a server and is accessed from a separate machine, the client); SQLite is actually embedded in the end program itself. This makes it a good DBMS to start playing around with, as you don’t need to fiddle around with servers.
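Here is a small sketch of what "embedded" means in practice, using the sqlite3 module that ships with Python. The file name friends.db and the table are made up for illustration. The whole database is an ordinary file that the program opens directly, with no server to install or start.

```python
import os
import sqlite3
import tempfile

# Work in a throwaway directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    db_path = os.path.join(tmp_dir, "friends.db")  # hypothetical file name

    # Connecting creates the file if it doesn't exist yet.
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE friends (name TEXT)")
    conn.execute("INSERT INTO friends (name) VALUES (?)", ("Ada",))
    conn.commit()
    conn.close()

    # The database is just a file on disk...
    file_exists = os.path.isfile(db_path)

    # ...and a fresh connection can read what the first one wrote.
    conn = sqlite3.connect(db_path)
    names = [row[0] for row in conn.execute("SELECT name FROM friends")]
    conn.close()

print(file_exists, names)
```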

What about NoSQL??

Sigh. Just when you thought you were getting the hang of things, you find out there is another kind of database called NoSQL. NoSQL is an umbrella term for non-relational databases, which have a completely different kind of architecture. NoSQL is supposed to be more scalable and to fix some problems with the relational model. People on the internet seem to be treating it like it is the hot new thing. I don’t even want to think about this right now; I’ll mentally bookmark it for later.


Thus ends my very short introduction to relational databases. I’m currently learning to speak SQL by working on a project to build an API for some NASA rainfall data. This involves working with GIS data, which is another level of complexity. I’ll write up this project in another post.

Planigale is live!

UBER VICTORY POINTS UPDATE 11/1/2016: Planigale was featured on Jerry Coyne’s blog!!


Dave and I finally got Planigale up and running!

We also have Google Analytics so we can see statistics about visitors to our site. Here’s the breakdown of sessions played since we went live on Wednesday:

You can also see where in the world people are playing! Creepy.

I especially like the breakdown by city:

I can tell I’m going to have a lot of fun with this.

Python doctest

I recently learned about doctest in Python, and now I’m excited and want to use it for everything.

doctest is a module that lets you write tests within your docstrings. When you run the file as a script, these tests run; if they fail, you’ll get a printout of which tests failed. This is useful for making sure that your docstrings are up to date after you’ve modified your code.

Here’s an example. Let’s say I’m writing a script with some simple functions in it:

def salutation(name):
    print("Hello {}!".format(name))

def double(number):
    print(2 * number)

def add_three(number):
    number += 3
    print(number)

In order to use doctest, I’ll write some tests in the docstrings:

def salutation(name):
    '''
    Greets the user.

    >>> salutation('Lin')
    Hello Lin!
    '''
    print("Hello {}!".format(name))

def double(number):
    '''
    Doubles the input.

    >>> double(5)
    10
    '''
    print(2 * number)

def add_three(number):
    '''
    Adds 3 to the input.

    >>> add_three(2)
    5
    '''
    number += 3
    print(number)

Notice that >>> signifies where a Python code snippet starts (like in an interactive session). Lines that come directly after without the >>> are the expected result. Now we just need to add the following lines to the bottom of the script:

if __name__ == "__main__":
    import doctest
    doctest.testmod()

Here you can see that whenever we run the file as a script, the condition __name__ == "__main__" evaluates to True (I wrote about this briefly here). We then import the doctest module and execute its testmod() function. testmod() goes through all of the docstrings in the script and attempts to execute all of the code snippets it finds. If all of our expected values match the computed values (as in this case), doctest won’t give us any output. However, let’s say I went back and changed a couple of my functions:

def salutation(name):
    '''
    Greets the user.

    >>> salutation('Lin')
    Hello Lin!
    '''
    print("Bye {}!".format(name))

def double(number):
    '''
    Doubles the input.

    >>> double(5)
    10
    '''
    print(3 * number)

def add_three(number):
    '''
    Adds 3 to the input.

    >>> add_three(2)
    5
    '''
    number += 3
    print(number)

if __name__ == "__main__":
    import doctest
    doctest.testmod()

Now if I run the script again, I get this output in my terminal:

**********************************************************************
File "python_doctest.py", line 14, in __main__.double
Failed example:
    double(5)
Expected:
    10
Got:
    15
**********************************************************************
File "python_doctest.py", line 5, in __main__.salutation
Failed example:
    salutation('Lin')
Expected:
    Hello Lin!
Got:
    Bye Lin!
**********************************************************************
2 items had failures:
   1 of   1 in __main__.double
   1 of   1 in __main__.salutation
***Test Failed*** 2 failures.

We can see how many failures we had and which functions failed.
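The same counts are also available from within Python, which can be handy if you want to act on them rather than read a printout. The sketch below is one way to do it, not the only one: check_docstring is a hypothetical helper, and the two functions are variants of double that return their result instead of printing it (doctest then compares against the value as an interactive session would display it).

```python
import doctest


def double(number):
    """Return twice the input.

    >>> double(5)
    10
    """
    return 2 * number


def broken_double(number):
    """Claims to double the input, but the code has drifted.

    >>> broken_double(5)
    10
    """
    return 3 * number


def check_docstring(func):
    """Run the examples in func's docstring (hypothetical helper).

    Returns (failed, attempted, report), where report is the text
    doctest would normally have printed to the terminal.
    """
    parser = doctest.DocTestParser()
    # The examples only need to be able to see the function under test.
    test = parser.get_doctest(
        func.__doc__, {func.__name__: func}, func.__name__, None, 0
    )
    runner = doctest.DocTestRunner(verbose=False)
    report_parts = []
    results = runner.run(test, out=report_parts.append)
    return results.failed, results.attempted, "".join(report_parts)


for func in (double, broken_double):
    failed, attempted, report = check_docstring(func)
    print(func.__name__, "failed", failed, "of", attempted)
```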

doctest is great for keeping your docstrings accurate, and it’s best practice to write a docstring for every function and class. However, doctest is unwieldy if you want to do any significant testing (it’s annoying for your user to have to read long docstrings with loads of edge cases). For more significant testing, you can use the unittest module.
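As a taste of what that looks like, here is a minimal unittest sketch. It assumes a variant of double that returns its result rather than printing it, which makes the value easy to assert on; TestDouble and run_tests are names made up for this example. The edge cases live in the test class, so the docstring can stay short.

```python
import unittest


def double(number):
    """Return twice the input."""
    return 2 * number


class TestDouble(unittest.TestCase):
    """Edge cases that would clutter a docstring sit comfortably here."""

    def test_positive(self):
        self.assertEqual(double(5), 10)

    def test_zero(self):
        self.assertEqual(double(0), 0)

    def test_negative(self):
        self.assertEqual(double(-4), -8)


def run_tests():
    """Load and run the TestDouble cases, returning the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestDouble)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

In a real script you would more commonly put unittest.main() under the if __name__ == "__main__": guard and let it discover the test classes for you.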