Archive

Modern Pandas (Part 5): Tidy Data


This is part 5 in my series on writing modern idiomatic pandas.


Reshaping & Tidy Data

Structuring datasets to facilitate analysis (Wickham 2014)

So, you've sat down to analyze a new dataset. What do you do first?

In …


Modern Pandas (Part 3): Indexes


This is part 3 in my series on writing modern idiomatic pandas.


Indexes can be a difficult concept to grasp at first. I suspect this is partly because they're somewhat peculiar to pandas. These aren't like the …


Modern Pandas (Part 4): Performance


This is part 4 in my series on writing modern idiomatic pandas.


Wes McKinney, the creator of pandas, is kind of obsessed with performance. From micro-optimizations for element access, to embedding a fast hash table inside pandas …


Modern Pandas (Part 2): Method Chaining


This is part 2 in my series on writing modern idiomatic pandas.


Method Chaining

Method chaining, where you call methods on an object one after another, is in vogue at the moment. It's always been a style …
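To make "calling methods one after another" concrete, here is a minimal sketch of a chained pandas pipeline. The toy data and column names (`city`, `temp_f`, `temp_c`) are made up for illustration and are not taken from the post.

```python
import pandas as pd

# Hypothetical toy data, invented for this sketch.
df = pd.DataFrame({
    "city": ["A", "B", "A", "B"],
    "temp_f": [50.0, 68.0, 59.0, 86.0],
})

# Each method returns a new object, so the next call can hang off it.
result = (
    df
    .assign(temp_c=lambda d: (d["temp_f"] - 32) * 5 / 9)  # add a derived column
    .query("temp_c > 10")                                  # keep the warmer rows
    .groupby("city")["temp_c"]                             # split by city
    .mean()                                                # one value per city
)
print(result)
```

Wrapping the chain in parentheses lets each step sit on its own line, which reads top to bottom in the order the operations run.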


Modern Pandas (Part 1)


This is part 1 in my series on writing modern idiomatic pandas.


Effective Pandas

Introduction

This series is about how to make effective use of pandas, a data analysis library for the Python programming language. It's targeted …


dplyr and pandas

This notebook compares pandas and dplyr. The comparison is just on syntax (verbiage), not performance. Whether you're an R user looking to switch to pandas (or the other way around), I hope this guide will help ease the transition.

We'll work through the introductory dplyr vignette to analyze some flight …


Practical Pandas Part 3 - Exploratory Data Analysis

Welcome back. As a reminder:

  • In part 1 we got a dataset of my cycling data from last year merged and stored in an HDF5 store.
  • In part 2 we did some cleaning and augmented the cycling data with data from http://forecast.io.

You can find the full source code …


Practical Pandas Part 2 - More Tidying, More Data, and Merging

This is Part 2 in the Practical Pandas Series, where I work through a data analysis problem from start to finish.

It's a misconception that we can cleanly separate the data analysis pipeline into a linear sequence of steps from

  1. data acquisition
  2. data tidying
  3. exploratory analysis
  4. model building
  5. production

As …


Practical Pandas Part 1 - Reading the Data

This is the first post in a series where I'll show how I use pandas on real-world datasets.

For this post, we'll look at data I collected with Cyclemeter on my daily bike ride to and from school last year. I had to manually start and stop the tracking at …


Using Python to tackle the CPS (Part 4)

Last time, we got to where we'd like to have started: One file per month, with each month laid out the same.

As a reminder, the CPS interviews households 8 times over the course of 16 months. They're interviewed for 4 months, take 8 months off, and are interviewed four …
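The interview schedule described above can be sketched as a small calculation. This is only an illustration: the function name `interview_months` is hypothetical, and the length of the second interview block (four months) is inferred from the "8 times over the course of 16 months" figure rather than stated in full in the excerpt.

```python
def interview_months(first_month=1):
    """Return the months-in-sample at which a household is interviewed.

    Assumes 4 months on, 8 months off, then 4 more months on
    (the final block length is inferred, see the note above).
    """
    on, off = 4, 8
    first_block = list(range(first_month, first_month + on))
    second_start = first_month + on + off
    second_block = list(range(second_start, second_start + on))
    return first_block + second_block


months = interview_months()
print(months)                         # the eight interview months
print(months[-1] - months[0] + 1)     # total span in months
```

With the default start, the household is interviewed in months 1 to 4 and 13 to 16, which gives eight interviews spread over a 16-month span.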