pandas

Modern Pandas (Part 2): Method Chaining

This is part 2 in my series on writing modern idiomatic pandas. The series: Modern Pandas, Method Chaining, Indexes, Fast Pandas, Tidy Data, Visualization, Time Series, Scaling. Method chaining, where you call methods on an object one after another, is in vogue at the moment. It’s always been a style of programming that’s been possible with pandas, and over the past several releases, we’ve added methods that enable even more chaining....
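As a rough illustration of the chaining style that excerpt describes, here is a minimal sketch. The `flights` frame and its column names are invented for this example and are not taken from the post.

```python
import pandas as pd

# Toy data, invented for illustration only.
flights = pd.DataFrame({
    "carrier": ["AA", "AA", "DL", "DL", "UA"],
    "dep_delay": [5.0, 20.0, -3.0, 12.0, None],  # minutes
})

# Each method returns a new object, so the steps read top to bottom
# without intermediate variables.
mean_delay = (
    flights
    .dropna(subset=["dep_delay"])                              # drop rows with no delay recorded
    .assign(delay_hours=lambda df: df["dep_delay"] / 60)       # add a derived column
    .groupby("carrier")["delay_hours"]                         # group by carrier
    .mean()                                                    # average delay per carrier
)

print(mean_delay)
```

Passing a callable to `assign` lets the new column refer to the frame as it exists at that point in the chain, rather than to the original `flights`.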

Modern Pandas (Part 1)

This is part 1 in my series on writing modern idiomatic pandas. The series: Modern Pandas, Method Chaining, Indexes, Fast Pandas, Tidy Data, Visualization, Time Series, Scaling. Effective Pandas: Introduction. This series is about how to make effective use of pandas, a data analysis library for the Python programming language. It’s targeted at an intermediate level: people who have some experience with pandas, but are looking to improve. Prior Art: There are many great resources for learning pandas; this is not one of them....

dplyr and pandas

This notebook compares pandas and dplyr. The comparison is just on syntax (verbiage), not performance. Whether you’re an R user looking to switch to pandas (or the other way around), I hope this guide will help ease the transition. We’ll work through the introductory dplyr vignette to analyze some flight data. I’m working on a better layout to show the two packages side by side. But for now I’m just putting the dplyr code in a comment above each Python call....

Practical Pandas Part 3 - Exploratory Data Analysis

Welcome back. As a reminder: in part 1 we got a dataset with my cycling data from last year merged and stored in an HDF5 store; in part 2 we did some cleaning and augmented the cycling data with data from http://forecast.io. You can find the full source code and data at this project’s GitHub repo. Today we’ll use pandas, seaborn, and matplotlib to do some exploratory data analysis. For fun, we’ll make some maps at the end using folium....

Practical Pandas Part 2 - More Tidying, More Data, and Merging

This is Part 2 in the Practical Pandas Series, where I work through a data analysis problem from start to finish. It’s a misconception that we can cleanly separate the data analysis pipeline into a linear sequence of steps: data acquisition, data tidying, exploratory analysis, model building, production. As you work through a problem you’ll realize, “I need this other bit of data”, or “this would be easier if I stored the data this way”, or more commonly “strange, that’s not supposed to happen”....

Practical Pandas Part 1 - Reading the Data

This is the first post in a series where I’ll show how I use pandas on real-world datasets. For this post, we’ll look at data I collected with Cyclemeter on my daily bike ride to and from school last year. I had to manually start and stop the tracking at the beginning and end of each ride. There may have been times where I forgot to do that, so we’ll see if we can find those....

Using Python to tackle the CPS (Part 4)

Last time, we got to where we’d like to have started: One file per month, with each month laid out the same. As a reminder, the CPS interviews households 8 times over the course of 16 months. They’re interviewed for 4 months, take 8 months off, and are interviewed four more times. So if your first interview was in month $m$, you’re also interviewed in months $$m + 1, m + 2, m + 3, m + 12, m + 13, m + 14, m + 15$$....
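The 4-8-4 interview schedule in that excerpt can be sketched in a few lines. The function name `interview_months` is hypothetical, and months are treated as plain integers on a running counter rather than calendar dates.

```python
def interview_months(m):
    """Months in which a household first interviewed in month m is interviewed.

    Months are integers on a running counter (an assumption made for
    this sketch), so m + 12 is the same calendar month one year later.
    """
    first_spell = [m + k for k in range(4)]         # m, m+1, m+2, m+3
    second_spell = [m + 12 + k for k in range(4)]   # m+12, m+13, m+14, m+15
    return first_spell + second_spell


print(interview_months(1))
```

The eight months between `m + 3` and `m + 12` are the gap in which the household is not interviewed.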

Using Python to tackle the CPS (Part 3)

In part 2 of this series, we set the stage to parse the data files themselves. As a reminder, we have a dictionary that looks like

             id  length  start  end
    0    HRHHID      15      1   15
    1   HRMONTH       2     16   17
    2   HRYEAR4       4     18   21
    3  HURESPLI       2     22   23
    4   HUFINAL       3     24   26
    ...     ...     ...    ...  ...

giving the columns of the raw CPS data files. This post (or two) will describe the reading of the actual data files, and the somewhat tricky process of matching individuals across the different files....
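One way such a layout could drive the parsing is with `pandas.read_fwf`. This is a sketch under assumptions, not necessarily the approach the post takes: it uses only the five fields shown, assumes `start` and `end` are 1-based and inclusive (which matches the `length` column), and parses a made-up record rather than real CPS data.

```python
import io

import pandas as pd

# The five fields shown in the layout (positions assumed 1-based, inclusive).
layout = pd.DataFrame({
    "id": ["HRHHID", "HRMONTH", "HRYEAR4", "HURESPLI", "HUFINAL"],
    "start": [1, 16, 18, 22, 24],
    "end": [15, 17, 21, 23, 26],
})

# read_fwf wants 0-based, half-open (start, end) pairs.
colspecs = [(int(s) - 1, int(e)) for s, e in zip(layout["start"], layout["end"])]

# Made-up record, NOT real CPS data: 15 + 2 + 4 + 2 + 3 = 26 characters.
fake_line = "123456789012345" + "01" + "1994" + " 1" + "201" + "\n"

df = pd.read_fwf(
    io.StringIO(fake_line),
    colspecs=colspecs,
    names=list(layout["id"]),
    header=None,
    dtype=str,  # keep leading zeros such as the "01" month
)

print(df)
```

Reading everything as strings first avoids silently dropping leading zeros in identifier fields; numeric conversion can happen afterwards, column by column.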

Tidy Data in Action

Hadley Wickham wrote a famous paper (for a certain definition of famous) about the importance of tidy data when doing data analysis. I want to talk a bit about that, using an example from a StackOverflow post, with a solution using pandas. The principles of tidy data aren’t language specific. A tidy dataset must satisfy three criteria (page 4 in Wickham’s paper): Each variable forms a column. Each observation forms a row....

Using Python to tackle the CPS (Part 2)

Last time, we used Python to fetch some data from the Current Population Survey. Today, we’ll work on parsing the files we just downloaded. We downloaded two types of files last time: (1) CPS monthly tables, fixed-width format text files with the actual data, and (2) data dictionaries, text files describing the layout of the monthly tables. Our goal is to parse the monthly tables. Here are the first two lines from the unzipped January 1994 file:...