Archive

Modern Pandas (Part 1)


This is part 1 in my series on writing modern, idiomatic pandas.


Effective Pandas

Introduction

This series is about how to make effective use of pandas, a data analysis library for the Python programming language. It's targeted at …


dplyr and pandas

This notebook compares pandas and dplyr. The comparison is just on syntax (verbiage), not performance. Whether you're an R user looking to switch to pandas (or the other way around), I hope this guide will help ease the transition.

We'll work through the introductory dplyr vignette to analyze some flight …


Practical Pandas Part 3 - Exploratory Data Analysis

Welcome back. As a reminder:

  • In part 1 we got a dataset with my cycling data from last year merged and stored in an HDF5 store.
  • In part 2 we did some cleaning and augmented the cycling data with data from http://forecast.io.

You can find the full source code …


Practical Pandas Part 2 - More Tidying, More Data, and Merging

This is Part 2 in the Practical Pandas Series, where I work through a data analysis problem from start to finish.

It's a misconception that we can cleanly separate the data analysis pipeline into a linear sequence of steps from

  1. data acquisition
  2. data tidying
  3. exploratory analysis
  4. model building
  5. production

As …


Practical Pandas Part 1 - Reading the Data

This is the first post in a series where I'll show how I use pandas on real-world datasets.

For this post, we'll look at data I collected with Cyclemeter on my daily bike ride to and from school last year. I had to manually start and stop the tracking at …


Using Python to tackle the CPS (Part 4)

Last time, we got to where we'd like to have started: One file per month, with each month laid out the same.

As a reminder, the CPS interviews households 8 times over the course of 16 months. They're interviewed for 4 months, take 8 months off, and are interviewed four …
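The rotation described in this excerpt can be sketched in a few lines. This is only an illustration: the function name `interview_offsets` is hypothetical, and it assumes the truncated sentence ends with a second four-month block, which is what 8 interviews over 16 months implies.

```python
def interview_offsets(on=4, off=8):
    """Month offsets (0 = first interview month) at which a household is interviewed.

    Assumes a block of `on` months, a gap of `off` months, then another
    block of `on` months.
    """
    first_block = list(range(on))
    second_block = list(range(on + off, on + off + on))
    return first_block + second_block


offsets = interview_offsets()
print(offsets)
```

Adding these offsets to a household's first interview month gives the monthly files in which that household should appear.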


Using Python to tackle the CPS (Part 3)

In part 2 of this series, we set the stage to parse the data files themselves.

As a reminder, we have a dictionary that looks like

         id  length  start  end
0    HRHHID      15      1   15
1   HRMONTH       2     16   17
2   HRYEAR4       4     18   21
3  HURESPLI       2     22   23 …
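
A dictionary like this maps fairly directly onto `pandas.read_fwf`. The sketch below is not the code from the post. It assumes `start` and `end` are 1-indexed and inclusive (consistent with the `length` column), and the sample record is made up for illustration rather than taken from real CPS data.

```python
import io

import pandas as pd

# The four dictionary rows shown in the excerpt.
dd = pd.DataFrame({
    "id": ["HRHHID", "HRMONTH", "HRYEAR4", "HURESPLI"],
    "length": [15, 2, 4, 2],
    "start": [1, 16, 18, 22],
    "end": [15, 17, 21, 23],
})

# read_fwf wants 0-indexed, half-open (start, stop) pairs,
# so shift each 1-indexed inclusive start down by one.
colspecs = [(int(s) - 1, int(e)) for s, e in zip(dd["start"], dd["end"])]

# Hypothetical record: a 15-character id, month "05", year "2010", code " 1".
sample = "000000000000123" + "05" + "2010" + " 1" + "\n"

df = pd.read_fwf(
    io.StringIO(sample),
    colspecs=colspecs,
    names=dd["id"].tolist(),
    dtype=str,  # keep leading zeros in identifier fields
)
print(df)
```

Reading everything as strings first is a conservative choice: identifier fields keep their leading zeros, and numeric columns can be converted afterwards.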

Tidy Data in Action

Hadley Wickham wrote a famous paper (for a certain definition of famous) about the importance of tidy data when doing data analysis. I want to talk a bit about that, using an example from a StackOverflow post, with a solution using pandas. The principles of tidy data aren't language-specific …


Organizing Papers

As a graduate student, you read a lot of journal articles... a lot. With the material in the articles being as difficult as it is, I didn't want to worry about organizing everything as well. That's why I wrote this script to help (I may have also been procrastinating from …


Using Python to tackle the CPS (Part 2)

Last time, we used Python to fetch some data from the Current Population Survey. Today, we'll work on parsing the files we just downloaded.


We downloaded two types of files last time:

  • CPS monthly tables: a fixed-width format text file with the actual data
  • Data Dictionaries: a text file describing …