Pandas is a Python data analysis tool built on top of NumPy that provides a suite of data structures and data manipulation functions to work on those data structures. It is particularly well suited for working with time series data.
Installing with pip or apt-get::
pip install pandas
# or
sudo apt-get install python-pandas
Dependencies: required, recommended, and optional:
# Required
numpy, python-dateutil, pytz
# Recommended
numexpr, bottleneck
# Optional
cython, scipy, pytables, matplotlib, statsmodels, openpyxl
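A quick sanity check after installing is to import the required dependencies and print their versions (note that the python-dateutil package imports as dateutil):

```python
# Import each required dependency and print its version.
import numpy
import dateutil
import pytz
import pandas

for mod in (numpy, dateutil, pytz, pandas):
    print(mod.__name__, mod.__version__)
```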
Pandas works alongside, and is built on top of, several other Python projects.
IPython is a fancy Python console. Try running ipython or ipython --pylab on your command line. Some IPython tips:
# Special commands, 'magic functions', begin with %
%quickref, %who, %run, %reset
# Shell Commands
ls, cd, pwd, mkdir
# Need Help?
help(), help(obj), obj?, function?
# Tab completion of variables, attributes and methods
There is a web interface to IPython, known as the IPython Notebook; start it like this:
ipython notebook
# or to get all of the pylab components
ipython notebook --pylab
import numpy as np
# np.zeros, np.ones
data0 = np.zeros((2, 4))
data0
array([[ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.]])
# Make an array with 20 entries 0..19
data1 = np.arange(20)
# print the first 8
data1[0:8]
array([0, 1, 2, 3, 4, 5, 6, 7])
# make it a 4,5 array
data = np.arange(20).reshape(4, 5)
data
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])
Arrays have NumPy-specific types (dtypes) and can be operated on element-wise.
print "dtype: ", data.dtype
result = data * 20.5
print result
dtype:  int64
[[   0.    20.5   41.    61.5   82. ]
 [ 102.5  123.   143.5  164.   184.5]
 [ 205.   225.5  246.   266.5  287. ]
 [ 307.5  328.   348.5  369.   389.5]]
In my code samples, assume I import the following:
import pandas as pd
import numpy as np
A Series is a dict-like object.

s1 = pd.Series([1, 2, 3, 4, 5])
s1
0    1
1    2
2    3
3    4
4    5
dtype: int64
# integer multiplication
print s1 * 5
0     5
1    10
2    15
3    20
4    25
dtype: int64
# float multiplication
print s1 * 5.0
0     5
1    10
2    15
3    20
4    25
dtype: float64
s2 = pd.Series([1, 2, 3, 4, 5],
index=['a', 'b', 'c', 'd', 'e'])
s2
a    1
b    2
c    3
d    4
e    5
dtype: int64
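With a labeled index, a Series behaves like an ordered dict: you can look up values by label as well as by position. A small sketch using the same s2:

```python
import pandas as pd

s2 = pd.Series([1, 2, 3, 4, 5],
               index=['a', 'b', 'c', 'd', 'e'])

# look up a single value by label, like a dict
print(s2['c'])          # 3

# select several labels at once; the result is a new Series
print(s2[['a', 'e']])

# membership tests check the index, also like a dict
print('b' in s2)        # True
```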
A quick aside ...
dates = pd.date_range('20130626', periods=5)
print dates
print
print dates[0]
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-06-26, ..., 2013-06-30]
Length: 5, Freq: D, Timezone: None

2013-06-26 00:00:00
s3 = pd.Series([1, 2, 3, 4, 5], index=dates)
print s3
2013-06-26    1
2013-06-27    2
2013-06-28    3
2013-06-29    4
2013-06-30    5
Freq: D, dtype: int64
Note that integer position lookups still work alongside the new date index.
print s3[0]
print type(s3[0])
print
print s3[1:3]
print type(s3[1:3])
1
<type 'numpy.int64'>

2013-06-27    2
2013-06-28    3
Freq: D, dtype: int64
<class 'pandas.core.series.Series'>
s3[s3 < 3]
2013-06-26    1
2013-06-27    2
Freq: D, dtype: int64
s3['20130626':'20130628']
2013-06-26    1
2013-06-27    2
2013-06-28    3
Freq: D, dtype: int64
Things not covered but you should look into:

- creating a Series from a dict
- NaN, the missing-data marker

Other Series attributes:

- index and index.name
- name, the Series name

A DataFrame is a dict of Series objects; each Series is one column of the DataFrame.
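As a taste of the dict and NaN items above, here is a minimal sketch: building a Series from a dict, then seeing NaN appear when an index label has no matching value:

```python
import pandas as pd

# build a Series directly from a dict; keys become the index
d = {'a': 1, 'b': 2, 'c': 3}
s = pd.Series(d)

# ask for a label that is not in the dict and pandas fills it
# with NaN (not-a-number), its marker for missing data
s4 = pd.Series(d, index=['a', 'b', 'c', 'z'])
print(s4)
print(s4.isnull())   # True only for the missing label 'z'
```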
data1 = pd.DataFrame(np.random.rand(4, 4))
data1
|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 0.371333 | 0.788792 | 0.869380 | 0.084323 |
| 1 | 0.490858 | 0.134202 | 0.816307 | 0.019157 |
| 2 | 0.839547 | 0.721638 | 0.544628 | 0.042547 |
| 3 | 0.098594 | 0.954038 | 0.306561 | 0.689759 |
dates = pd.date_range('20130626', periods=4)
data2 = pd.DataFrame(
np.random.rand(4, 4),
index=dates, columns=list('ABCD'))
data2
|   | A | B | C | D |
|---|---|---|---|---|
| 2013-06-26 | 0.572954 | 0.785437 | 0.089758 | 0.872083 |
| 2013-06-27 | 0.857868 | 0.779294 | 0.453022 | 0.836332 |
| 2013-06-28 | 0.715369 | 0.355922 | 0.750194 | 0.770045 |
| 2013-06-29 | 0.409056 | 0.452993 | 0.937368 | 0.118998 |
data2['E'] = data2['B'] + 5 * data2['C']
data2
|   | A | B | C | D | E |
|---|---|---|---|---|---|
| 2013-06-26 | 0.572954 | 0.785437 | 0.089758 | 0.872083 | 1.234225 |
| 2013-06-27 | 0.857868 | 0.779294 | 0.453022 | 0.836332 | 3.044402 |
| 2013-06-28 | 0.715369 | 0.355922 | 0.750194 | 0.770045 | 4.106891 |
| 2013-06-29 | 0.409056 | 0.452993 | 0.937368 | 0.118998 | 5.139832 |
See? You never need Excel again!
Deleting a column.
# Deleting a Column
del data2['E']
data2
|   | A | B | C | D |
|---|---|---|---|---|
| 2013-06-26 | 0.572954 | 0.785437 | 0.089758 | 0.872083 |
| 2013-06-27 | 0.857868 | 0.779294 | 0.453022 | 0.836332 |
| 2013-06-28 | 0.715369 | 0.355922 | 0.750194 | 0.770045 |
| 2013-06-29 | 0.409056 | 0.452993 | 0.937368 | 0.118998 |
Remember this, data2, for the next examples.
data2
|   | A | B | C | D |
|---|---|---|---|---|
| 2013-06-26 | 0.572954 | 0.785437 | 0.089758 | 0.872083 |
| 2013-06-27 | 0.857868 | 0.779294 | 0.453022 | 0.836332 |
| 2013-06-28 | 0.715369 | 0.355922 | 0.750194 | 0.770045 |
| 2013-06-29 | 0.409056 | 0.452993 | 0.937368 | 0.118998 |
As a dict
data2['B']
2013-06-26    0.785437
2013-06-27    0.779294
2013-06-28    0.355922
2013-06-29    0.452993
Freq: D, Name: B, dtype: float64
As an attribute
data2.B
2013-06-26    0.785437
2013-06-27    0.779294
2013-06-28    0.355922
2013-06-29    0.452993
Freq: D, Name: B, dtype: float64
By row label
data2.loc['20130627']
A    0.857868
B    0.779294
C    0.453022
D    0.836332
Name: 2013-06-27 00:00:00, dtype: float64
By integer location
data2.iloc[1]
A    0.857868
B    0.779294
C    0.453022
D    0.836332
Name: 2013-06-27 00:00:00, dtype: float64
Access the column, then the row, or use iloc with [row, column] indexes.
print data2.B[0]
print data2['B'][0]
print data2.iloc[0,1] # [row,column]
0.785436968548
0.785436968548
0.785436968548
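The same loc/iloc machinery also accepts slices and lists, so you can pull out rectangular blocks of a DataFrame. A sketch on a freshly built frame (not the data2 above):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index=list('wxyz'), columns=list('ABCD'))

# rows and columns by integer position
print(df.iloc[0:2, 1:3])          # rows w,x / columns B,C

# rows and columns by label (loc slices are inclusive)
print(df.loc['w':'x', ['A', 'D']])
```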
Look at the beginning of the DataFrame.
data3 = pd.DataFrame(np.random.rand(100, 4))
data3.head()
|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 0.127258 | 0.981462 | 0.820096 | 0.650613 |
| 1 | 0.471623 | 0.118745 | 0.595012 | 0.205356 |
| 2 | 0.802777 | 0.398816 | 0.383789 | 0.025648 |
| 3 | 0.368724 | 0.922648 | 0.850099 | 0.659503 |
| 4 | 0.889618 | 0.565936 | 0.056413 | 0.768219 |
Look at the end of the DataFrame.
data3.tail()
|    | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 95 | 0.332815 | 0.920505 | 0.808580 | 0.161875 |
| 96 | 0.348941 | 0.532944 | 0.921147 | 0.736528 |
| 97 | 0.305776 | 0.747903 | 0.161359 | 0.808279 |
| 98 | 0.141267 | 0.878946 | 0.537137 | 0.157560 |
| 99 | 0.862024 | 0.519265 | 0.172454 | 0.665810 |
Just remember, a DataFrame is just a bunch of Series grouped together; each column of a DataFrame is a Series.

There is also the Panel, which is like a DataFrame but with 3 or more dimensions.
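That Series/DataFrame relationship is easy to verify: a dict of Series builds a DataFrame, and pulling a column back out hands you a Series. A minimal sketch:

```python
import pandas as pd

a = pd.Series([1, 2, 3])
b = pd.Series([4, 5, 6])

# a dict of Series becomes a DataFrame, one Series per column
df = pd.DataFrame({'a': a, 'b': b})

# and indexing a column hands the Series back
print(type(df['a']))    # <class 'pandas.core.series.Series'>
```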
Pandas has robust IO tools to read in data from a variety of sources.
The csv file (phx-temps.csv) contains Phoenix weather data from GSOD::
1973-01-01 00:00:00,53.1,37.9
1973-01-02 00:00:00,57.9,37.0
...
2012-12-30 00:00:00,64.9,39.0
2012-12-31 00:00:00,55.9,41.0
A simple read_csv():
# simple read_csv
phxtemps1 = pd.read_csv('phx-temps.csv')
phxtemps1.head()
|   | 1973-01-01 00:00:00 | 53.1 | 37.9 |
|---|---|---|---|
| 0 | 1973-01-02 00:00:00 | 57.9 | 37.0 |
| 1 | 1973-01-03 00:00:00 | 59.0 | 37.0 |
| 2 | 1973-01-04 00:00:00 | 57.9 | 41.0 |
| 3 | 1973-01-05 00:00:00 | 54.0 | 39.9 |
| 4 | 1973-01-06 00:00:00 | 55.9 | 37.9 |

Note that the first data row has been consumed as the column header.
An advanced read_csv(): parsing the dates, using them as the index, and naming the columns.
# define index, parse dates, name columns
phxtemps2 = pd.read_csv(
'phx-temps.csv', index_col=0,
names=['highs', 'lows'], parse_dates=True)
phxtemps2.head()
|   | highs | lows |
|---|---|---|
| 1973-01-01 | 53.1 | 37.9 |
| 1973-01-02 | 57.9 | 37.0 |
| 1973-01-03 | 59.0 | 37.0 |
| 1973-01-04 | 57.9 | 41.0 |
| 1973-01-05 | 54.0 | 39.9 |
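If you don't have phx-temps.csv handy, the same read_csv() options can be tried on an in-memory string; the two rows below are taken from the GSOD sample shown earlier:

```python
import pandas as pd
from io import StringIO

# a tiny in-memory stand-in for phx-temps.csv, in the
# same date,high,low layout as the real file
csv_text = (
    "1973-01-01 00:00:00,53.1,37.9\n"
    "1973-01-02 00:00:00,57.9,37.0\n"
)

df = pd.read_csv(StringIO(csv_text), index_col=0,
                 names=['highs', 'lows'], parse_dates=True)
print(df)
print(df.index.dtype)   # the index is parsed as datetimes
```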
import matplotlib.pyplot as plt
%matplotlib inline
phxtemps2.plot() # pandas convenience method
<matplotlib.axes.AxesSubplot at 0x7f9916ef4d90>
Boo, Pandas and Friends would cry if they saw such a plot.
Let's see a smaller slice of time:
phxtemps2['20120101':'20121231'].plot()
<matplotlib.axes.AxesSubplot at 0x7f9916ae8290>
Let's operate on the DataFrame: let's take the difference between the highs and lows.
phxtemps2['diff'] = phxtemps2.highs - phxtemps2.lows
phxtemps2['20120101':'20121231'].plot()
<matplotlib.axes.AxesSubplot at 0x7f99168bec10>