LAB 05.02 - Model evaluation

!wget --no-cache -O init.py -q https://raw.githubusercontent.com/fagonzalezo/ai4eng-unal/main/content/init.py
import init; init.init(force_download=False); init.get_weblink()
from local.lib.rlxmoocapi import submit, session
session.LoginSequence(endpoint=init.endpoint, course_id=init.course_id, lab_id="L05.02", varname="student");

Task 1: Randomly partition numpy arrays

observe how we can select specific rows and/or columns of a numpy array

import numpy as np

x = np.random.randint(100, size=(20,5))
x[:,0] = range(len(x))
x[0,:] = range(x.shape[1])
x
ridxs = np.r_[2,4,5]
x[ridxs]
cidxs = np.r_[1,3]
x[:,cidxs]
x[ridxs][:, cidxs]
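
as a side note, numpy also provides np.ix_, which selects the same sub-block of rows and columns in a single indexing operation instead of chaining two selections as above

x[np.ix_(ridxs, cidxs)]   # same result as x[ridxs][:, cidxs]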

and the dimensions of the array are accessible through len and shape

len(x), x.shape
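
note that for a 2D array len returns the number of rows, i.e. the size along the first axis; this is why an assert like len(X)==len(y), used in the tasks below, works even when X is 2D and y is 1D

len(x) == x.shape[0]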

observe also how we can partition it

x[:3]
x[3:]

we can do the same thing with vectors

v = np.arange(100,120)
v
v[:5], v[5:]

finally, observe how we can create a random permutation of a specific vector

np.random.permutation(v)

or of the first \(n\) natural numbers, from \(0\) to \(n-1\), by just passing the integer \(n\)

p = np.random.permutation(20)
p

how do you interpret this?

v[p[5:]]
x[p[:5]]
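
a hint for interpreting the cells above (not the full solution): the two pieces of the permutation together contain every index from 0 to 19 exactly once, with no repetitions and no overlap, which is exactly the property you need to split an array into two disjoint random parts

np.sort(np.concatenate([p[:5], p[5:]]))   # recovers 0,1,...,19, each index appearing exactly once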

assignment

in this task you will have to complete the function split_data below so that:

  • it accepts two arguments X and y, each of which can be any numpy array (1D, 2D, etc.) with the same length \(n\) along the first axis (observe the assert statement), and a percentage pct

  • creates a random permutation of the natural numbers from \(0\) to \(n-1\)

  • partitions the permutation so that the first partition contains the first n1_elements \(=\) int(n * pct) numbers, and the second partition contains the rest

  • interprets the components of each permutation partition as indices into X and y, so that X is partitioned into X1, X2 and y into y1, y2 respectively

note that indices into an array must be of type int. do the following to convert a float into an int

a,b = 10,.3
c = a*b
print (c)
c = int(c)
print(c)
def split_data(X, y, pct):
    
    assert len(X)==len(y), "X and y must have the same length"
    assert pct>0 and pct<1, "pct must be in the (0,1) interval"
    
    permutation = ...
    n1_elements = ...
    permutation_partition_1 = ...
    permutation_partition_2 = ...
    X1 = ...
    X2 = ...
    y1 = ...
    y2 = ...
    return X1, X2, y1, y2

check your solution manually with the following code

XX = np.random.randint(100, size=(20,8))
yy = np.arange(100,100+len(XX))
XX[:,0] = range(len(XX))
XX[0,:] = range(XX.shape[1])
print (XX)
print (yy)
Xtr, Xts, ytr, yts = split_data(XX, yy, pct=.7)
# check the partition is ok: the first two sums should be equal, and so should the last two
np.sum(XX), np.sum(Xtr) + np.sum(Xts), np.sum(yy), np.sum(ytr)+np.sum(yts)
print (Xtr, "\n--")
print (Xts, "\n--")
print (ytr, "\n--")
print (yts, "\n--")
Xts
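
as an optional extra check (not part of the task, which must use numpy as described above), sklearn provides a similar utility called train_test_split; it shuffles with its own random state, so the actual rows will differ from yours, but with train_size=.7 on 20 rows the shapes should be 14 and 6, matching your split

from sklearn.model_selection import train_test_split
Xtr2, Xts2, ytr2, yts2 = train_test_split(XX, yy, train_size=.7)
Xtr2.shape, Xts2.shape, ytr2.shape, yts2.shape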

submit your code

student.submit_task(globals(), task_id="task_01");

Task 2: Fit a model and make predictions

observe how we create new data using the synthetic dataset generators available in sklearn

from sklearn.datasets import make_moons
import matplotlib.pyplot as plt
from local.lib import mlutils
%matplotlib inline
X, y = make_moons(200, noise=0.2)
X.shape, y.shape
mlutils.plot_2Ddata(X,y); plt.grid();

observe also how we create an algorithm instance and fit a model

from sklearn.svm import SVC
estimator = SVC(gamma=1)
estimator.fit(X,y)
mlutils.plot_2Ddata_with_boundary(estimator.predict, X, y)

and how we make predictions

preds = estimator.predict(X)
print (preds.shape)
preds
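
a related convenience worth knowing: sklearn classifiers also expose a score method, which returns the mean accuracy of their predictions on the data you pass in (here we score on the same data used for training, so the value will be optimistic)

estimator.score(X, y)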

in this task you have to complete the following function so that:

  • it makes two non-random partitions of X and y: one containing the first half of the data and one containing the second half. If the number of elements of X is odd, then the second half must contain one more element than the first half (see the slicing sketch after this list).

  • it fits the model with the first part of the data

  • it makes predictions on the second half of the data

  • returns the fitted estimator and the predictions on the second half of the data.
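
the sketch below is just a reminder of how integer division // interacts with slicing on an odd-length array, so that the second part naturally gets the extra element; it is not the full solution, and w is only an illustrative name

w = np.arange(7)              # 7 elements, an odd length
w[:len(w)//2], w[len(w)//2:]  # first part has 3 elements, second part has 4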

def fit_and_predict(estimator, X, y):
    assert len(X)==len(y), "X and y must have the same length"
    
    predictions = ...
    
    return estimator, predictions

check your code. your predictions should be similar to

preds
>> array([0, 0, 0, 0, 1, 0, 1, 1, 1, 0])
X = np.array([[ 0.74799424, -0.5867667 ],
       [-0.64457753,  1.25127894],
       [ 0.53682593,  0.10931563],
       [-0.88825294, -0.06987509],
       [ 0.99612638, -0.52295157],
       [ 1.20586692,  0.01930477],
       [-0.19368482,  0.65121567],
       [ 0.1973759 ,  0.82250723],
       [ 0.94859234, -0.5457241 ],
       [ 1.87967948, -0.22740261],
       [ 0.58766146,  0.3982837 ],
       [ 0.27731571,  1.14369568],
       [-0.67421956,  0.12785382],
       [ 0.56957459,  1.05330376],
       [ 1.52435938, -0.29864338],
       [-0.15973608,  0.21790711],
       [ 1.59037406, -0.56875485],
       [ 0.43257507, -0.48900315],
       [ 1.09440413, -0.73789029],
       [-0.32940869,  0.74671384]])
y = np.array([1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
X.shape, y.shape
from sklearn.linear_model import LogisticRegression
estimator = LogisticRegression()
estimator, preds = fit_and_predict(estimator, X, y)
preds

submit your code

student.submit_task(globals(), task_id="task_02");

Task 3: Select data with indices

Observe how we can create a vector or matrix of True/False (boolean) by applying a condition to any matrix or vector

import numpy as np
y = np.random.randint(10, size=15)
print (y)
y_less_than_5 = y<5
print (y_less_than_5)

and how we can select elements of a vector using a boolean vector of the same length

y[y_less_than_5]
y[y<5]

numpy doesn’t really care how you construct the vector of booleans used to index another vector or array, as long as its length matches

v = np.random.randint(20, size=15)
v
v[y<5]
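
the same kind of boolean vector also selects rows of a 2D array, as long as its length matches the number of rows; M below is just an illustrative matrix

M = np.random.randint(20, size=(15,3))
M[y<5]    # keeps only the rows of M at the positions where y is less than 5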

in this task you will complete the function select_per_class such that:

  • receives an array of data X and a vector of labels y, of the same length

  • the labels y are binary, they can only have values 0 or 1

  • makes two partitions of X, one corresponding to the places where y is 0, and another where y is 1

  • returns the two partitions

For instance, for the following X and y

X = np.array([[8, 8, 5, 2, 0, 0],
              [4, 4, 8, 1, 3, 7],
              [4, 5, 3, 6, 9, 6],
              [0, 3, 5, 3, 5, 3],
              [0, 7, 2, 7, 1, 7],
              [5, 7, 7, 1, 8, 5],
              [2, 5, 7, 3, 8, 0],
              [7, 2, 5, 9, 8, 7],
              [1, 6, 6, 1, 6, 0],
              [0, 7, 6, 5, 3, 4]])

y = np.array([0, 0, 0, 0, 1, 1, 0, 0, 1, 1])

your function must return the following two matrices:

[[8 8 5 2 0 0]
 [4 4 8 1 3 7]
 [4 5 3 6 9 6]
 [0 3 5 3 5 3]
 [2 5 7 3 8 0]
 [7 2 5 9 8 7]]
 
[[0 7 2 7 1 7]
 [5 7 7 1 8 5]
 [1 6 6 1 6 0]
 [0 7 6 5 3 4]]
def select_per_class(X, y):
    X1 = ...
    X2 = ...
    return X1, X2

check your code manually

X = np.array([[8, 8, 5, 2, 0, 0],
              [4, 4, 8, 1, 3, 7],
              [4, 5, 3, 6, 9, 6],
              [0, 3, 5, 3, 5, 3],
              [0, 7, 2, 7, 1, 7],
              [5, 7, 7, 1, 8, 5],
              [2, 5, 7, 3, 8, 0],
              [7, 2, 5, 9, 8, 7],
              [1, 6, 6, 1, 6, 0],
              [0, 7, 6, 5, 3, 4]])

y = np.array([0, 0, 0, 0, 1, 1, 0, 0, 1, 1])
a,b = select_per_class(X, y)
print (a)
print (b)

submit your code

student.submit_task(globals(), task_id="task_03");

Task 4: Measure accuracy

complete the following function such that:

  • it receives two binary vectors (composed of 0’s and 1’s) of the same length

  • returns the fraction of positions at which both vectors have the same value

recall that

  • if a and b are vectors of the same length, a==b returns a vector of booleans in which a True at a given position signals that the elements at that position are the same

  • if k is a vector of booleans, sum(k) returns the number of True elements.

for the following two vectors you should get 0.375

a = np.array([1,0,0,0,1,1,0,0])
b = np.array([1,1,1,1,0,1,0,1])
accuracy(a, b)
>>> 0.375
def accuracy(y_true, y_pred):
    result = ...
    return result
a = np.array([1,0,0,0,1,1,0,0])
b = np.array([1,1,1,1,0,1,0,1])
accuracy(a,b)
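
to see where 0.375 comes from, note that a and b agree only at positions 0, 5 and 6, so there are 3 matching positions out of 8, and 3/8 = 0.375; the expression below reproduces that count using the two facts recalled above

np.sum(a==b), np.sum(a==b)/len(a)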

submit your code

student.submit_task(globals(), task_id="task_04");

Task 5: Random split, fit and predict

complete the following function so that:

  • fits the estimator with a random sample containing a fraction train_pct of the data X and the binary labels y. You can use the split_data function developed previously

  • makes predictions on the test part of the data

  • measures the accuracy of those predictions. you may use the accuracy function created previously

  • returns the fitted estimator, the test parts of X and y, and the accuracy measured

the execution below should return something with the following structure (the actual numbers will change)

(LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                    intercept_scaling=1, l1_ratio=None, max_iter=100,
                    multi_class='warn', n_jobs=None, penalty='l2',
                    random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                    warm_start=False), array([[-0.76329684,  0.2572069 ],
        [ 1.02356829,  0.37629873],
        [ 0.32099415,  0.82244488],
        [ 1.08858315, -0.61299904],
        [ 0.58470767,  0.58510559],
        [ 1.60827644, -0.15477173],
        [ 1.53121784,  0.78121504],
        [-0.42734156,  0.87585237],
        [-0.36368682,  0.72152586],
        [ 1.05312619,  0.19835526]]), array([0, 0, 1, 1, 0, 1, 1, 0, 0, 0]), 0.6)
def split_fit_predict(estimator, X, y, train_pct):
    
    def split_data(X, y, pct):
        # your code here
    
    def accuracy(y_true, y_pred):
        # your code here

    Xtr, Xts, ytr, yts = ...   # split the data
    # ... fit the estimator ...
    preds_ts = ...             # obtain predictions on the test part
    return estimator, Xts, yts, accuracy(yts, preds_ts)
        
        
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression

X, y = make_moons(100, noise=0.2)
estimator = LogisticRegression(solver="lbfgs")
split_fit_predict(estimator, X, y, train_pct=0.9)
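
once your function is complete, it is informative to run it several times: since the split is random, the measured accuracy changes from run to run, which gives you a feel for how much the estimate depends on the particular split. a minimal sketch, assuming your split_fit_predict above is finished:

accs = [split_fit_predict(estimator, X, y, train_pct=0.9)[-1] for _ in range(10)]
print (accs)
print ("mean accuracy %.3f, std %.3f"%(np.mean(accs), np.std(accs)))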

submit your code

student.submit_task(globals(), task_id="task_05");