Pandas is a Python data analysis tool built on top of NumPy that provides a suite of data structures and data manipulation functions to work on those data structures. It is particularly well suited for working with time series data.
Installing with pip or apt-get::
pip install pandas
# or
sudo apt-get install python-pandas
Dependencies: required, recommended, and optional:
# Required
numpy, python-dateutil, pytz
# Recommended
numexpr, bottleneck
# Optional
cython, scipy, pytables, matplotlib, statsmodels, openpyxl
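A quick sanity check after installing is to import the required dependencies and print their versions (note that the python-dateutil package imports as dateutil):

```python
# Import each required dependency and print its version.
import numpy
import dateutil
import pytz
import pandas

for mod in (numpy, dateutil, pytz, pandas):
    print(mod.__name__, mod.__version__)
```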
Pandas works alongside, and is built on top of, several other Python projects.
IPython is a fancy Python console. Try running ipython or ipython --pylab on your command line. Some IPython tips:
# Special commands, 'magic functions', begin with %
%quickref, %who, %run, %reset
# Shell Commands
ls, cd, pwd, mkdir
# Need Help?
help(), help(obj), obj?, function?
# Tab completion of variables, attributes and methods
There is a web interface to IPython, known as the IPython Notebook; start it like this:
ipython notebook
# or to get all of the pylab components
ipython notebook --pylab
import numpy as np
# np.zeros, np.ones
data0 = np.zeros((2, 4))
data0
array([[ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.]])
# Make an array with 20 entries 0..19
data1 = np.arange(20)
# print the first 8
data1[0:8]
array([0, 1, 2, 3, 4, 5, 6, 7])
# make it a 4,5 array
data = np.arange(20).reshape(4, 5)
data
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])
Arrays have NumPy-specific types (dtypes) and can be operated on element-wise.
print "dtype: ", data.dtype
result = data * 20.5
print result
dtype:  int64
[[   0.    20.5   41.    61.5   82. ]
 [ 102.5  123.   143.5  164.   184.5]
 [ 205.   225.5  246.   266.5  287. ]
 [ 307.5  328.   348.5  369.   389.5]]
In my code samples, assume I import the following:
import pandas as pd
import numpy as np
A Series is a dict-like object.

s1 = pd.Series([1, 2, 3, 4, 5])
s1
0    1
1    2
2    3
3    4
4    5
dtype: int64
# integer multiplication
print s1 * 5
0     5
1    10
2    15
3    20
4    25
dtype: int64
# float multiplication
print s1 * 5.0
0     5
1    10
2    15
3    20
4    25
dtype: float64
s2 = pd.Series([1, 2, 3, 4, 5],
index=['a', 'b', 'c', 'd', 'e'])
s2
a    1
b    2
c    3
d    4
e    5
dtype: int64
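With a labeled index, a Series behaves like an ordered dict: you can look up values by label as well as by position. A small sketch using the same s2:

```python
import pandas as pd

s2 = pd.Series([1, 2, 3, 4, 5],
               index=['a', 'b', 'c', 'd', 'e'])

# look up a single value by label, like a dict
print(s2['c'])          # 3

# select several labels at once; the result is a new Series
print(s2[['a', 'e']])

# membership tests check the index, also like a dict
print('b' in s2)        # True
```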
A quick aside ...
dates = pd.date_range('20130626', periods=5)
print dates
print
print dates[0]
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-06-26, ..., 2013-06-30]
Length: 5, Freq: D, Timezone: None

2013-06-26 00:00:00
s3 = pd.Series([1, 2, 3, 4, 5], index=dates)
print s3
2013-06-26    1
2013-06-27    2
2013-06-28    3
2013-06-29    4
2013-06-30    5
Freq: D, dtype: int64
Note that integer position lookups still work alongside the new date index.
print s3[0]
print type(s3[0])
print
print s3[1:3]
print type(s3[1:3])
1
<type 'numpy.int64'>

2013-06-27    2
2013-06-28    3
Freq: D, dtype: int64
<class 'pandas.core.series.Series'>
s3[s3 < 3]
2013-06-26    1
2013-06-27    2
Freq: D, dtype: int64
s3['20130626':'20130628']
2013-06-26    1
2013-06-27    2
2013-06-28    3
Freq: D, dtype: int64
Things not covered but you should look into:

- creating a Series from a dict
- NaN, the missing-data marker

Other Series attributes:

- index and index.name
- name, the Series name

A DataFrame is a dict of Series objects; each Series is one column of the DataFrame.
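As a taste of the dict and NaN items above, here is a minimal sketch: building a Series from a dict, then seeing NaN appear when an index label has no matching value:

```python
import pandas as pd

# build a Series directly from a dict; keys become the index
d = {'a': 1, 'b': 2, 'c': 3}
s = pd.Series(d)

# ask for a label that is not in the dict and pandas fills it
# with NaN (not-a-number), its marker for missing data
s4 = pd.Series(d, index=['a', 'b', 'c', 'z'])
print(s4)
print(s4.isnull())   # True only for the missing label 'z'
```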
data1 = pd.DataFrame(np.random.rand(4, 4))
data1
|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 0.371333 | 0.788792 | 0.869380 | 0.084323 |
| 1 | 0.490858 | 0.134202 | 0.816307 | 0.019157 |
| 2 | 0.839547 | 0.721638 | 0.544628 | 0.042547 |
| 3 | 0.098594 | 0.954038 | 0.306561 | 0.689759 |
dates = pd.date_range('20130626', periods=4)
data2 = pd.DataFrame(
np.random.rand(4, 4),
index=dates, columns=list('ABCD'))
data2
|   | A | B | C | D |
|---|---|---|---|---|
| 2013-06-26 | 0.572954 | 0.785437 | 0.089758 | 0.872083 |
| 2013-06-27 | 0.857868 | 0.779294 | 0.453022 | 0.836332 |
| 2013-06-28 | 0.715369 | 0.355922 | 0.750194 | 0.770045 |
| 2013-06-29 | 0.409056 | 0.452993 | 0.937368 | 0.118998 |
data2['E'] = data2['B'] + 5 * data2['C']
data2
|   | A | B | C | D | E |
|---|---|---|---|---|---|
| 2013-06-26 | 0.572954 | 0.785437 | 0.089758 | 0.872083 | 1.234225 |
| 2013-06-27 | 0.857868 | 0.779294 | 0.453022 | 0.836332 | 3.044402 |
| 2013-06-28 | 0.715369 | 0.355922 | 0.750194 | 0.770045 | 4.106891 |
| 2013-06-29 | 0.409056 | 0.452993 | 0.937368 | 0.118998 | 5.139832 |
See? You never need Excel again!
Deleting a column.
# Deleting a Column
del data2['E']
data2
|   | A | B | C | D |
|---|---|---|---|---|
| 2013-06-26 | 0.572954 | 0.785437 | 0.089758 | 0.872083 |
| 2013-06-27 | 0.857868 | 0.779294 | 0.453022 | 0.836332 |
| 2013-06-28 | 0.715369 | 0.355922 | 0.750194 | 0.770045 |
| 2013-06-29 | 0.409056 | 0.452993 | 0.937368 | 0.118998 |
Remember this, data2, for the next examples.
data2
|   | A | B | C | D |
|---|---|---|---|---|
| 2013-06-26 | 0.572954 | 0.785437 | 0.089758 | 0.872083 |
| 2013-06-27 | 0.857868 | 0.779294 | 0.453022 | 0.836332 |
| 2013-06-28 | 0.715369 | 0.355922 | 0.750194 | 0.770045 |
| 2013-06-29 | 0.409056 | 0.452993 | 0.937368 | 0.118998 |
As a dict
data2['B']
2013-06-26    0.785437
2013-06-27    0.779294
2013-06-28    0.355922
2013-06-29    0.452993
Freq: D, Name: B, dtype: float64
As an attribute
data2.B
2013-06-26    0.785437
2013-06-27    0.779294
2013-06-28    0.355922
2013-06-29    0.452993
Freq: D, Name: B, dtype: float64
By row label
data2.loc['20130627']
A    0.857868
B    0.779294
C    0.453022
D    0.836332
Name: 2013-06-27 00:00:00, dtype: float64
By integer location
data2.iloc[1]
A    0.857868
B    0.779294
C    0.453022
D    0.836332
Name: 2013-06-27 00:00:00, dtype: float64
Access the column, then the row, or use iloc with [row, column] indexes.
print data2.B[0]
print data2['B'][0]
print data2.iloc[0,1] # [row,column]
0.785436968548
0.785436968548
0.785436968548
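The same loc/iloc machinery also accepts slices and lists, so you can pull out rectangular blocks of a DataFrame. A sketch on a freshly built frame (not the data2 above):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index=list('wxyz'), columns=list('ABCD'))

# rows and columns by integer position
print(df.iloc[0:2, 1:3])          # rows w,x / columns B,C

# rows and columns by label (loc slices are inclusive)
print(df.loc['w':'x', ['A', 'D']])
```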
Look at the beginning of the DataFrame.
data3 = pd.DataFrame(np.random.rand(100, 4))
data3.head()
|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 0.127258 | 0.981462 | 0.820096 | 0.650613 |
| 1 | 0.471623 | 0.118745 | 0.595012 | 0.205356 |
| 2 | 0.802777 | 0.398816 | 0.383789 | 0.025648 |
| 3 | 0.368724 | 0.922648 | 0.850099 | 0.659503 |
| 4 | 0.889618 | 0.565936 | 0.056413 | 0.768219 |
Look at the end of the DataFrame.
data3.tail()
|    | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 95 | 0.332815 | 0.920505 | 0.808580 | 0.161875 |
| 96 | 0.348941 | 0.532944 | 0.921147 | 0.736528 |
| 97 | 0.305776 | 0.747903 | 0.161359 | 0.808279 |
| 98 | 0.141267 | 0.878946 | 0.537137 | 0.157560 |
| 99 | 0.862024 | 0.519265 | 0.172454 | 0.665810 |
Just remember, a DataFrame is just a bunch of Series grouped together; each column of a DataFrame is a Series.

There is also the Panel, which is like a DataFrame but with 3 or more dimensions.
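That Series/DataFrame relationship is easy to verify: a dict of Series builds a DataFrame, and pulling a column back out hands you a Series. A minimal sketch:

```python
import pandas as pd

a = pd.Series([1, 2, 3])
b = pd.Series([4, 5, 6])

# a dict of Series becomes a DataFrame, one Series per column
df = pd.DataFrame({'a': a, 'b': b})

# and indexing a column hands the Series back
print(type(df['a']))    # <class 'pandas.core.series.Series'>
```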
Pandas has robust IO tools to read in data from a variety of sources.
The csv file (phx-temps.csv) contains Phoenix weather data from GSOD::
1973-01-01 00:00:00,53.1,37.9
1973-01-02 00:00:00,57.9,37.0
...
2012-12-30 00:00:00,64.9,39.0
2012-12-31 00:00:00,55.9,41.0
A simple read_csv():
# simple read_csv
phxtemps1 = pd.read_csv('phx-temps.csv')
phxtemps1.head()
|   | 1973-01-01 00:00:00 | 53.1 | 37.9 |
|---|---|---|---|
| 0 | 1973-01-02 00:00:00 | 57.9 | 37.0 |
| 1 | 1973-01-03 00:00:00 | 59.0 | 37.0 |
| 2 | 1973-01-04 00:00:00 | 57.9 | 41.0 |
| 3 | 1973-01-05 00:00:00 | 54.0 | 39.9 |
| 4 | 1973-01-06 00:00:00 | 55.9 | 37.9 |

Note that the first data row has been consumed as the column header.
An advanced read_csv(): parsing the dates, using them as the index, and naming the columns.
# define index, parse dates, name columns
phxtemps2 = pd.read_csv(
'phx-temps.csv', index_col=0,
names=['highs', 'lows'], parse_dates=True)
phxtemps2.head()
|   | highs | lows |
|---|---|---|
| 1973-01-01 | 53.1 | 37.9 |
| 1973-01-02 | 57.9 | 37.0 |
| 1973-01-03 | 59.0 | 37.0 |
| 1973-01-04 | 57.9 | 41.0 |
| 1973-01-05 | 54.0 | 39.9 |
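If you don't have phx-temps.csv handy, the same read_csv() options can be tried on an in-memory string; the two rows below are taken from the GSOD sample shown earlier:

```python
import pandas as pd
from io import StringIO

# a tiny in-memory stand-in for phx-temps.csv, in the
# same date,high,low layout as the real file
csv_text = (
    "1973-01-01 00:00:00,53.1,37.9\n"
    "1973-01-02 00:00:00,57.9,37.0\n"
)

df = pd.read_csv(StringIO(csv_text), index_col=0,
                 names=['highs', 'lows'], parse_dates=True)
print(df)
print(df.index.dtype)   # the index is parsed as datetimes
```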
import matplotlib.pyplot as plt
%matplotlib inline
phxtemps2.plot() # pandas convenience method
<matplotlib.axes.AxesSubplot at 0x7f9916ef4d90>
Boo, Pandas and Friends would cry if they saw such a plot.
Let's see a smaller slice of time:
phxtemps2['20120101':'20121231'].plot()
<matplotlib.axes.AxesSubplot at 0x7f9916ae8290>
Let's operate on the DataFrame: let's take the difference between the highs and lows.
phxtemps2['diff'] = phxtemps2.highs - phxtemps2.lows
phxtemps2['20120101':'20121231'].plot()
<matplotlib.axes.AxesSubplot at 0x7f99168bec10>