This notebook compares pandas and dplyr. The comparison is just on syntax (verbiage), not performance. Whether you’re an R user looking to switch to pandas (or the other way around), I hope this guide will help ease the transition.

We’ll work through the introductory dplyr vignette to analyze some flight data.

I’m working on a better layout to show the two packages side by side. But for now I’m just putting the dplyr code in a comment above each python call.

# Some prep work to get the data from R and into pandas
%matplotlib inline
%load_ext rpy2.ipython

import numpy as np  # used below for the sampling examples
import pandas as pd
import seaborn as sns

pd.set_option("display.max_rows", 5)
%%R
library("nycflights13")
write.csv(flights, "flights.csv")

Data: nycflights13

flights = pd.read_csv("flights.csv", index_col=0)
# dim(flights)   <--- The R code
flights.shape  # <--- The python code
(336776, 16)
# head(flights)
flights.head()
   year  month  day  dep_time  dep_delay  arr_time  arr_delay carrier tailnum  flight origin dest  air_time  distance  hour  minute
1  2013      1    1       517          2       830         11      UA  N14228    1545    EWR  IAH       227      1400     5      17
2  2013      1    1       533          4       850         20      UA  N24211    1714    LGA  IAH       227      1416     5      33
3  2013      1    1       542          2       923         33      AA  N619AA    1141    JFK  MIA       160      1089     5      42
4  2013      1    1       544         -1      1004        -18      B6  N804JB     725    JFK  BQN       183      1576     5      44
5  2013      1    1       554         -6       812        -25      DL  N668DN     461    LGA  ATL       116       762     5      54

Single table verbs

dplyr has a small set of nicely defined verbs. I’ve listed their closest pandas counterparts.

dplyr                           pandas
filter() (and slice())          query() (and loc[], iloc[])
arrange()                       sort_values()
select() (and rename())         __getitem__ (and rename())
distinct()                      drop_duplicates()
mutate() (and transmute())      None
summarise()                     None
sample_n() and sample_frac()    None

Some of the “missing” verbs in pandas are missing because there are other ways of achieving the same goal. For example, summarise is spread across mean, std, and friends. Others, like sample_n, just haven’t been implemented yet.
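For instance, what dplyr centralizes in summarise() is spread across ordinary reductions in pandas. A quick sketch (note that pandas skips NaNs by default, like na.rm = TRUE):

flights.dep_delay.mean()   # summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
flights.dep_delay.std()    # ...and sd()
flights.dep_delay.count()  # ...and n(); this counts non-NaN values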

Filter rows with filter(), query()

# filter(flights, month == 1, day == 1)
flights.query("month == 1 & day == 1")
[preview truncated: 842 rows × 16 columns]

The more verbose version:

# flights[flights$month == 1 & flights$day == 1, ]
flights[(flights.month == 1) & (flights.day == 1)]
[preview truncated: 842 rows × 16 columns]

# slice(flights, 1:10)
flights.iloc[:10]  # R's 1:10 is inclusive of both endpoints, so take 10 rows
[preview truncated: 10 rows × 16 columns]

Arrange rows with arrange(), sort_values()

# arrange(flights, year, month, day)
flights.sort_values(['year', 'month', 'day'])
[preview truncated: 336776 rows × 16 columns]

# arrange(flights, desc(arr_delay))
flights.sort_values('arr_delay', ascending=False)
[preview truncated: 336776 rows × 16 columns]

Select columns with select(), []

# select(flights, year, month, day) 
flights[['year', 'month', 'day']]
        year  month  day
1       2013      1    1
2       2013      1    1
...      ...    ...  ...
336775  2013      9   30
336776  2013      9   30

336776 rows × 3 columns

# select(flights, year:day)

# No real equivalent here, although I think that's OK.
# Typically I'll have the columns I want stored in a list
# somewhere, which can be passed right into __getitem__ ([]).

# select(flights, -(year:day))

# Again, a similar story. I would just use
# flights.drop(cols_to_drop, axis=1)
# or flights[flights.columns.difference(cols_to_drop)]
# Point to dplyr!
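That said, if the columns you want happen to be contiguous, label-based slicing with .loc gets close to select(year:day). A sketch (it relies on the column order, and cols_to_drop is a hypothetical list):

# Rough equivalents, relying on column order:
flights.loc[:, 'year':'day']        # like select(flights, year:day)
cols_to_drop = ['year', 'month', 'day']
flights.drop(cols_to_drop, axis=1)  # like select(flights, -(year:day))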
# select(flights, tail_num = tailnum)
flights.rename(columns={'tailnum': 'tail_num'})['tail_num']
1    N14228
...
336776    N839MQ
Name: tail_num, Length: 336776, dtype: object

But as Hadley mentions, this isn’t that useful since it only returns the one column. dplyr and pandas compare well here.

# rename(flights, tail_num = tailnum)
flights.rename(columns={'tailnum': 'tail_num'})
[preview truncated: 336776 rows × 16 columns, with tailnum renamed to tail_num]

Pandas is more verbose, but the argument to columns can be any mapping. So it’s often used with a function to perform a common task, say df.rename(columns=lambda x: x.replace('-', '_')) to replace any dashes with underscores. Also, rename (the pandas version) can be applied to the Index.
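A couple of illustrative sketches of that (not from the vignette):

flights.rename(columns=str.upper)                     # upper-case every column name
flights.rename(columns=lambda x: x.replace('_', ''))  # strip underscores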

Extract distinct (unique) rows

# distinct(select(flights, tailnum))
flights.tailnum.unique()
array(['N14228', 'N24211', 'N619AA', ..., 'N776SK', 'N785SK', 'N557AS'], dtype=object)

FYI this returns a numpy array instead of a Series.

# distinct(select(flights, origin, dest))
flights[['origin', 'dest']].drop_duplicates()
       origin dest
1         EWR  IAH
2         LGA  IAH
...       ...  ...
255456    EWR  ANC
275946    EWR  LGA

224 rows × 2 columns

OK, so dplyr wins there from a consistency point of view. unique is only defined on Series, not DataFrames. The original intention of drop_duplicates was to check for records that were accidentally included twice. Using it to select the distinct combinations feels a bit hacky, but it works!
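If you want to stay in DataFrame-land for the single-column case too, one trick (a sketch) is to select a one-column DataFrame before dropping duplicates:

# Returns a DataFrame rather than a numpy array:
flights[['tailnum']].drop_duplicates()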

Add new columns with mutate()

# mutate(flights,
#   gain = arr_delay - dep_delay,
#   speed = distance / air_time * 60)

flights['gain'] = flights.arr_delay - flights.dep_delay
flights['speed'] = flights.distance / flights.air_time * 60
flights
[preview truncated: 336776 rows × 18 columns, now including gain and speed]

# mutate(flights,
#   gain = arr_delay - dep_delay,
#   gain_per_hour = gain / (air_time / 60)
# )

flights['gain'] = flights.arr_delay - flights.dep_delay
flights['gain_per_hour'] = flights.gain / (flights.air_time / 60)
flights
[preview truncated: 336776 rows × 19 columns, now including gain_per_hour]

dplyr's approach may be nicer here since you get to refer to the variables in subsequent statements within the mutate(). To achieve this with pandas, you have to add the gain variable as another column in flights. If I don’t want it around I would have to explicitly drop it.
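For what it’s worth, newer pandas versions (0.16.2+) added DataFrame.assign, which returns a copy instead of mutating flights in place; chaining two assign calls lets the second refer to the first. A sketch, assuming a recent pandas:

(flights.assign(gain=flights.arr_delay - flights.dep_delay)
        .assign(gain_per_hour=lambda df: df.gain / (df.air_time / 60)))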

# transmute(flights,
#   gain = arr_delay - dep_delay,
#   gain_per_hour = gain / (air_time / 60)
# )

flights['gain'] = flights.arr_delay - flights.dep_delay
flights['gain_per_hour'] = flights.gain / (flights.air_time / 60)
flights[['gain', 'gain_per_hour']]
        gain  gain_per_hour
1          9       2.378855
2         16       4.229075
...      ...            ...
336775   NaN            NaN
336776   NaN            NaN

336776 rows × 2 columns

Summarise values with summarise()

# summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
flights.dep_delay.mean()
12.639070257304708

Randomly sample rows with sample_n() and sample_frac()

There’s an open PR on Github to make this nicer (closer to dplyr). For now you can drop down to numpy.
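If you’re reading this on a newer pandas (0.16.1+), that PR has since landed as DataFrame.sample, which covers both cases:

# On newer pandas versions:
flights.sample(n=10)       # like sample_n(flights, 10)
flights.sample(frac=0.01)  # like sample_frac(flights, 0.01)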

# sample_n(flights, 10)
flights.loc[np.random.choice(flights.index, 10, replace=False)]  # without replacement, like sample_n
[preview truncated: 10 rows × 19 columns]

# sample_frac(flights, 0.01)
flights.iloc[np.random.randint(0, len(flights),
                               int(.01 * len(flights)))]
[preview truncated: 3367 rows × 19 columns]

Grouped operations

# planes <- group_by(flights, tailnum)
# delay <- summarise(planes,
#   count = n(),
#   dist = mean(distance, na.rm = TRUE),
#   delay = mean(arr_delay, na.rm = TRUE))
# delay <- filter(delay, count > 20, dist < 2000)

planes = flights.groupby("tailnum")
delay = planes.agg({"year": "count",
                    "distance": "mean",
                    "arr_delay": "mean"})
delay.query("year > 20 & distance < 2000")
         year  arr_delay    distance
tailnum                             
N0EGMQ    371   9.982955  676.188679
N10156    153  12.717241  757.947712
...       ...        ...         ...
N999DN     61  14.311475  895.459016
N9EAMQ    248   9.235294  674.665323

2961 rows × 3 columns

For me, dplyr’s n() looked a bit strange at first, but it’s growing on me.

I think pandas is more difficult for this particular example. There isn’t as natural a way to mix column-agnostic aggregations (like count) with column-specific aggregations like the other two. You end up writing code like .agg({'year': 'count'}), which reads “I want the count of year”, even though you don’t care about year specifically. Additionally, names can’t be assigned as cleanly in pandas; you have to follow it up with a rename, like before.
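Concretely, the chained-rename version might look like this (a sketch):

delay = (planes.agg({'year': 'count',
                     'distance': 'mean',
                     'arr_delay': 'mean'})
               .rename(columns={'year': 'count',
                                'distance': 'dist',
                                'arr_delay': 'delay'}))
delay.query('count > 20 & dist < 2000')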

# destinations <- group_by(flights, dest)
# summarise(destinations,
#   planes = n_distinct(tailnum),
#   flights = n()
# )

destinations = flights.groupby('dest')
destinations.agg({
    'tailnum': lambda x: len(x.unique()),
    'year': 'count'
    }).rename(columns={'tailnum': 'planes',
                       'year': 'flights'})
      flights  planes
dest                 
ABQ       254     108
ACK       265      58
...       ...     ...
TYS       631     273
XNA      1036     176

105 rows × 2 columns

Similar to how dplyr provides optimized C++ versions of most of the summarise functions, pandas uses Cython-optimized versions of most of the agg methods.
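Passing the string names is what keeps you on those fast paths. For instance, on recent pandas versions this sketch swaps the Python-level lambda above for the built-in nunique:

# String aggregations dispatch to pandas' built-in implementations;
# an arbitrary lambda falls back to slower Python-level application.
destinations.agg({'tailnum': 'nunique', 'year': 'count'})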

# daily <- group_by(flights, year, month, day)
# (per_day   <- summarise(daily, flights = n()))

daily = flights.groupby(['year', 'month', 'day'])
per_day = daily['distance'].count()
per_day
year  month  day
2013  1      1      842
...
2013  12     31     776
Name: distance, Length: 365, dtype: int64
# (per_month <- summarise(per_day, flights = sum(flights)))
per_month = per_day.groupby(level=['year', 'month']).sum()
per_month
year  month
2013  1        27004
...
2013  12       28135
Name: distance, Length: 12, dtype: int64
# (per_year  <- summarise(per_month, flights = sum(flights)))
per_year = per_month.sum()
per_year
336776

I’m not sure how dplyr is handling the other columns, like year, in the last example. With pandas, it’s clear that we’re grouping by them since they’re included in the groupby. For the last example, we didn’t group by anything, so they aren’t included in the result.

Chaining

Any follower of Hadley’s twitter account will know how much R users love the %>% (pipe) operator. And for good reason!

# flights %>%
#   group_by(year, month, day) %>%
#   select(arr_delay, dep_delay) %>%
#   summarise(
#     arr = mean(arr_delay, na.rm = TRUE),
#     dep = mean(dep_delay, na.rm = TRUE)
#   ) %>%
#   filter(arr > 30 | dep > 30)
(
flights.groupby(['year', 'month', 'day'])
    [['arr_delay', 'dep_delay']]
    .mean()
    .query('arr_delay > 30 | dep_delay > 30')
)
                 arr_delay  dep_delay
year month day                       
2013 1     16    34.247362  24.612865
           31    32.602854  28.658363
...                    ...        ...
     12    17    55.871856  40.705602
           23    32.226042  32.254149

49 rows × 2 columns

Other Data Sources

Pandas has tons of IO tools to help you get data in and out, including SQL databases via SQLAlchemy.
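For example, a sketch of the SQL path (flights.db is a hypothetical SQLite file):

from sqlalchemy import create_engine

engine = create_engine('sqlite:///flights.db')  # hypothetical local database
pd.read_sql('SELECT * FROM flights WHERE month = 1', engine)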

Summary

I think pandas held up pretty well, considering this was a vignette written for dplyr. I found the degree of similarity more interesting than the differences. The most difficult task was renaming columns within an operation; that had to be followed up with a call to rename afterwards, which honestly isn’t that burdensome.

More and more it looks like we’re moving towards a future where being a language or package partisan just doesn’t make sense. Not when you can load up a Jupyter (formerly IPython) notebook to call up a library written in R, hand those results off to python or Julia or whatever for follow-up, and then go back to R to make a cool Shiny web app.

There will always be a place for your “utility belt” package like dplyr or pandas, but it wouldn’t hurt to be familiar with both.

If you want to contribute to pandas, we’re always looking for help at https://github.com/pydata/pandas/. You can get ahold of me directly on Twitter.