Intro to Machine Learning


24 September 2014

Sarah Braden


What is Machine Learning?

Python and Machine Learning


Why do you use Machine Learning?

To make predictions and decisions


When do you use it?

When the going gets tough

Who uses it?

  • Spam filters / Fraud detection
  • Sentiment Analysis
  • Computer Vision
  • Speech and Handwriting Recognition

What do you need?

  • A Problem!
  • Data
  • Features
  • Labels (for supervised learning)
  • Programming Skillz
  • Patience / Stubbornness / Math

Are there different kinds of Machine Learning?

  • Supervised Learning
    • Data has both features and labels
    • Classification (label is a class)
    • Regression (label is a continuous value)
  • Unsupervised Learning
    • Data only has features
    • Clustering
    • Use clustering and then classification together!
  • Feature Engineering
  • Many other things

I'm still interested. Tell me more...

Fire up scikit-learn!

scikit-learn dependencies

  • Python (>= 2.6 or >= 3.3)
  • NumPy (>= 1.6.1)
  • SciPy (>= 0.9)

pip install numpy scipy scikit-learn


Why is scikit-learn awesome?

  • Out-of-the-box Models
  • Model Selection (important!)
  • Data Preprocessing

More than just scikit-learn

  • PyMC (Bayesian modeling)
  • Shogun (Support Vector Machines)
  • Theano (Deep Learning)

Other Python Machine Learning Libraries

Example of a Spam Filter using Naive Bayes

Dataset

  • Text files of emails from Machine Learning in Action, published by Manning
  • unzip
  • Small dataset of ham and spam

Data Preprocessing

Convert each email into a word-count vector first; Naive Bayes then classifies those vectors.
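
To make the idea of a word vector concrete before the real emails are loaded, here is a minimal sketch on two made-up messages. The helper names `tokenize` and `count_vector` and the sample sentences are illustrative assumptions, not part of the original notebook; the tokenizing rule mirrors the one used in the cells that follow (lowercase, drop tokens of two characters or fewer).

```python
import re
from collections import Counter

def tokenize(text):
    """Split on runs of non-word characters; keep lowercase tokens longer than 2 characters."""
    return [t.lower() for t in re.split(r'\W+', text) if len(t) > 2]

def count_vector(vocab, tokens):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

# Two made-up messages, purely for illustration (hypothetical data)
docs = ["Buy cheap pills now, buy today!", "Lunch today at noon?"]
token_lists = [tokenize(d) for d in docs]

# The vocabulary is every distinct token, sorted so the column order is stable
vocab = sorted(set(word for tokens in token_lists for word in tokens))
vectors = [count_vector(vocab, tokens) for tokens in token_lists]
```

Each document becomes a row with one column per vocabulary word, which is the shape a classifier expects as its feature matrix.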

In [21]:
import re
import numpy as np
from glob import glob

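# Split the text on any run of characters that is not a letter, digit, or underscore.
# '\W+' (rather than '\W*') never matches the empty string, so the split
# behaves the same on Python 2 and Python 3.
regEx = re.compile(r'\W+')

email_text = open('email/ham/1.txt').read()
# tokens are roughly the words of the email
list_of_tokens = regEx.split(email_text)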
In [55]:
email_text = open('email/spam/1.txt').read()
# tokens are roughly the words of the email
list_of_tokens = regEx.split(email_text)
In [22]:
def parse_text(email_filename):
    """Lowercase all tokens and drop tokens of two characters or fewer."""
    with open(email_filename) as email_file:
        email_text = email_file.read()
    tokens = re.split(r'\W+', email_text)
    return [token.lower() for token in tokens if len(token) > 2]

def get_all_text(email_type):
    """Parse every .txt file under email/<email_type>/ into a list of tokens."""
    files = glob('email/' + email_type + '/*.txt')
    return [parse_text(filename) for filename in files]
In [54]:
def create_vocab_list(data_set):
    """Return a list of every distinct token that appears in any document."""
    vocab_set = set()  # start with an empty set
    for document in data_set:
        vocab_set |= set(document)  # union of the two sets
    return list(vocab_set)

def bag_of_words(vocab_list, input_words):
    """Count how often each vocabulary word occurs in input_words."""
    return_vec = [0] * len(vocab_list)
    for word in input_words:
        if word in vocab_list:
            return_vec[vocab_list.index(word)] += 1
    return return_vec
In [24]:
email_types = ['ham', 'spam']

ham = get_all_text('ham')
spam = get_all_text('spam')      

all_documents = ham + spam
all_labels = ['ham'] * len(ham) + ['spam'] * len(spam)  # 25 of each in this dataset

vocab_list = create_vocab_list(all_documents)  #create vocabulary

# Convert the documents into word vectors
features = [bag_of_words(vocab_list, document) for document in all_documents]

print np.array(features).shape
print np.array(all_labels).shape
(50, 692)
(50,)

We have features and labels!

In [29]:
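# Hold out 30% of the data as a test set
# (scikit-learn 0.18 and later provide train_test_split in sklearn.model_selection)
from sklearn import cross_validation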

X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    np.array(features), np.array(all_labels), test_size=0.3, random_state=0)

In [30]:
print "Training set:", X_train.shape, y_train.shape
print "Test set:", X_test.shape, y_test.shape
Training set: (35, 692) (35,)
Test set: (15, 692) (15,)
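
As a rough picture of what a shuffled hold-out split does, the sketch below shuffles the row indices with a seeded random number generator and sets aside 30% of them. The function `shuffle_split` and the stand-in data are hypothetical, and the sketch does not reproduce the exact rows that scikit-learn's `train_test_split` selects; it only shows where the 35/15 sizes come from.

```python
import random

def shuffle_split(features, labels, test_size=0.3, seed=0):
    """Shuffle row indices with a seeded RNG, then hold out a fraction for testing."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(test_size * len(features)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def take(rows, idx):
        return [rows[i] for i in idx]

    return (take(features, train_idx), take(features, test_idx),
            take(labels, train_idx), take(labels, test_idx))

# Stand-in data with the same row count as the email set (50 documents)
toy_features = [[i] for i in range(50)]
toy_labels = ['ham'] * 25 + ['spam'] * 25
X_tr, X_te, y_tr, y_te = shuffle_split(toy_features, toy_labels)
```

With 50 rows and `test_size=0.3`, 15 rows go to the test set and 35 stay for training, and no row appears in both.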

In [31]:
import numpy as np
from sklearn.naive_bayes import GaussianNB

classifier = GaussianNB().fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print "Number of mislabeled points : %d" % (y_test != y_pred).sum()
print "Score:", classifier.score(X_test, y_test)
Number of mislabeled points : 1
Score: 0.933333333333
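
To show roughly how a Naive Bayes classifier reasons about word counts, here is a small pure-Python sketch. Note the assumption: it implements the multinomial variant with add-one smoothing, which is a different likelihood model from the `GaussianNB` used in the cell above. The functions `train_nb` and `predict_nb` and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def train_nb(token_lists, labels):
    """Count documents per class and words per class; also collect the vocabulary."""
    class_counts = Counter(labels)
    word_counts = {label: Counter() for label in class_counts}
    for tokens, label in zip(token_lists, labels):
        word_counts[label].update(tokens)
    vocab = set(word for counts in word_counts.values() for word in counts)
    return class_counts, word_counts, vocab

def predict_nb(model, tokens):
    """Pick the class with the highest log prior plus summed log word likelihoods."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float('-inf')
    for label in sorted(class_counts):
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / float(total_docs))
        for word in tokens:
            if word in vocab:  # words never seen in training are ignored
                # add-one (Laplace) smoothing keeps the probability above zero
                likelihood = (word_counts[label][word] + 1.0) / (total_words + len(vocab))
                score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Tiny made-up training set (hypothetical data)
train_docs = [['meeting', 'lunch', 'today'], ['project', 'meeting', 'notes'],
              ['cheap', 'pills', 'buy'], ['buy', 'cheap', 'watches']]
train_labels = ['ham', 'ham', 'spam', 'spam']
model = train_nb(train_docs, train_labels)
```

Calling `predict_nb(model, ['cheap', 'pills', 'today'])` favors spam because "cheap" and "pills" occur only in the spam documents, while `predict_nb(model, ['meeting', 'notes'])` favors ham.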

In [32]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))
             precision    recall  f1-score   support

        ham       0.86      1.00      0.92         6
       spam       1.00      0.89      0.94         9

avg / total       0.94      0.93      0.93        15
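
The numbers in the report can be reproduced by hand. The sketch below assumes the one layout of predictions consistent with the output above: 6 ham and 9 spam test emails, with a single spam email predicted as ham. The ordering of the lists and the helper `precision_recall_f1` are illustrative; only the counts come from the report.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall and F1 for one class from paired label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / float(tp + fp)   # of everything predicted positive, how much was right
    recall = tp / float(tp + fn)      # of everything truly positive, how much was found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Reconstructed labels: 6 ham, 9 spam, one spam email predicted as ham (assumed ordering)
y_true = ['ham'] * 6 + ['spam'] * 9
y_pred = ['ham'] * 6 + ['ham'] + ['spam'] * 8

ham_scores = precision_recall_f1(y_true, y_pred, 'ham')
spam_scores = precision_recall_f1(y_true, y_pred, 'spam')
```

Ham precision is 6/7 because one of the seven "ham" predictions was really spam, and spam recall is 8/9 because one of the nine spam emails was missed.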

Example of Unsupervised Learning using K-means Clustering

Iris Flower Data Set

  • The data set consists of 50 samples from each of three species of Iris
    • Iris setosa
    • Iris virginica
    • Iris versicolor
  • Four features were measured from each sample:
    • the length of the sepals
    • the width of the sepals
    • the length of the petals
    • the width of the petals


In [53]:
import matplotlib.pyplot as plt
from sklearn import datasets
%matplotlib inline

iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features.
Y = iris.target

# Plot the points, colored by species
plt.scatter(X[:, 0], X[:, 1], c=Y)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
<matplotlib.text.Text at 0x113578e50>
In [49]:
from sklearn import cluster, datasets

iris = datasets.load_iris()
X_iris = iris.data  # features
y_iris = iris.target  # labels
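
For intuition about what k-means does with features alone, here is a bare-bones sketch of the classic assign-then-update loop (Lloyd's algorithm) on six made-up 2-D points. It is a simplification under stated assumptions: the function `kmeans` is hypothetical, it starts from the first k points, and it runs a fixed number of iterations, whereas scikit-learn's `KMeans` uses a smarter initialization and several restarts.

```python
def kmeans(points, k, n_iter=10):
    """Lloyd's algorithm on a list of equal-length tuples; starts from the first k points."""
    centers = [tuple(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center (squared Euclidean distance)
        for i, p in enumerate(points):
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            assignments[i] = distances.index(min(distances))
        # Update step: each center moves to the mean of its current members
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centers[j] = tuple(sum(col) / float(len(members)) for col in zip(*members))
    return assignments, centers

# Two obvious groups of made-up points (hypothetical data)
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.1, 4.9), (0.9, 1.1), (4.8, 5.2)]
assignments, centers = kmeans(points, k=2)
```

No labels are passed in; the groups emerge from distances between the points, and the cluster numbers are just the order in which the centers happened to be created.
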
In [50]:
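k_means = cluster.KMeans(n_clusters=3)
k_means.fit(X_iris)

# Does the k-means clustering match reality?
# Cluster ids are arbitrary: only cluster 2 happens to share its number with species 2
print k_means.labels_[::10]
print y_iris[::10]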
[1 1 1 1 1 0 0 0 0 0 2 2 2 2 2]
[0 0 0 0 0 1 1 1 1 1 2 2 2 2 2]
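
Because cluster numbers carry no meaning of their own, one way to compare the two printed arrays is to map each cluster to the species that occurs most often inside it. The sketch below does that for the 15 values shown above; `majority_mapping` is a hypothetical helper, and the result says nothing about the 135 samples that were not printed.

```python
from collections import Counter

# The two arrays printed above: cluster ids from k-means, then the true species ids
cluster_ids = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2]
species_ids = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]

def majority_mapping(clusters, labels):
    """Map each cluster id to the label that occurs most often among its members."""
    votes = {}
    for c, y in zip(clusters, labels):
        votes.setdefault(c, Counter())[y] += 1
    return {c: counter.most_common(1)[0][0] for c, counter in votes.items()}

mapping = majority_mapping(cluster_ids, species_ids)
relabeled = [mapping[c] for c in cluster_ids]
matches = sum(1 for r, y in zip(relabeled, species_ids) if r == y)
agreement = matches / float(len(species_ids))
```

On these 15 samples, cluster 1 corresponds to species 0 and cluster 0 to species 1, so after relabeling the clustering agrees with the species on every printed sample.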


