Commit a6b9bde1 by Stenli Karanxha

### Completed the linear regression and fixed the links.


The target of the project was to solve a classification problem: identifying which of 87 species of birds and amphibians are present in a list of continuous wild song recordings. The problem is presented as a Kaggle competition; for more information see: https://www.kaggle.com/c/multilabel-bird-species-classification-nips2013.

The main characteristics of the problem are:

1. The training is simplified by the fact that each training sample contains a single call.
2. …

…as valid only if it has a better performance than that.

3. Linear regression

1. Basics

Linear regression is a commonly used predictive model in machine learning. In our case, the linear regression algorithm is used to perform the multi-class classification. For this purpose, I created the class 'LinearRegression' containing the functions explained below, as well as some required utility functions.

The following modules are needed in the code:

• numpy - for various operations on arrays
• scipy.optimize - to calculate the optimal value of the parameters
• mathutils - for the sigmoid function
• Sample from sample - to get the samples to train the algorithm

2. Implementation

The 'LinearRegression' class contains the following functions:

• the train function

It trains the algorithm by analysing the list of samples from Sample.

It uses the '_get_flat_biased_data' function to assign all the training samples to a matrix X, in which every row contains one (flattened) sample and the number of columns is given by the length of the bias-extended feature vector. Thus, the matrix X contains all the training samples.

Furthermore, it assigns a matrix with the expected outputs to y by calling the 'get_classification()' function from Sample. Each row of the matrix y contains one expected output, and the number of columns is given by the number of classes.

After minimizing the cost function, the train function assigns the optimal value of the parameters to self.parameters. This value is obtained through the BFGS algorithm, via the 'fmin_bfgs' function of 'scipy.optimize'.

Last, the train function relies on the cost and the gradient utility functions, which are passed to the optimizer.
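Pieced together, the train function could look roughly like this. This is only a sketch, not the project's actual code: the class layout, the 'regularization' parameter and the internal '_cost'/'_gradient' helpers are assumptions; only '_get_flat_biased_data', 'get_classification()'-style inputs and the use of BFGS come from the description above.

```python
import numpy as np
from scipy.optimize import fmin_bfgs


class LinearRegression:
    """Hypothetical sketch of the multi-class regression trainer."""

    def __init__(self, regularization=1.0):
        self.regularization = regularization
        self.parameters = None

    def _get_flat_biased_data(self, samples):
        # Flatten each sample and prepend a bias column of ones.
        X = np.array([np.ravel(s) for s in samples], dtype=float)
        return np.hstack([np.ones((X.shape[0], 1)), X])

    def train(self, samples, classifications):
        X = self._get_flat_biased_data(samples)     # one row per sample
        y = np.array(classifications, dtype=float)  # one row per expected output
        theta0 = np.zeros(X.shape[1] * y.shape[1])
        # BFGS minimization of the regularized cost, as in the report.
        flat = fmin_bfgs(self._cost, theta0, fprime=self._gradient,
                         args=(X, y), disp=False)
        self.parameters = flat.reshape(X.shape[1], y.shape[1])

    def _cost(self, theta, X, y):
        m = len(y)
        t = theta.reshape(X.shape[1], -1)
        h = np.clip(1.0 / (1.0 + np.exp(-X @ t)), 1e-12, 1 - 1e-12)
        reg = self.regularization / (2 * m) * np.sum(t[1:] ** 2)  # bias excluded
        return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / m + reg

    def _gradient(self, theta, X, y):
        m = len(y)
        t = theta.reshape(X.shape[1], -1)
        h = 1.0 / (1.0 + np.exp(-X @ t))
        grad = X.T @ (h - y) / m
        grad[1:] += self.regularization / m * t[1:]  # bias row not regularized
        return grad.ravel()
```

The optimizer receives the cost and the gradient as separate callables, which is why the report describes them as standalone utility functions.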

• the evaluate function

Using the already trained algorithm, the evaluate function performs the evaluation of the test samples. In other words, for a given sample it calculates the probability of belonging to each of the 87 classes.
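A minimal sketch of such an evaluate step; the function signature and the parameter layout (one column of parameters per class) are assumptions:

```python
import numpy as np


def evaluate(sample, parameters):
    """Return the per-class membership probabilities for one sample.

    'parameters' is assumed to be an (n_features + 1) x n_classes matrix
    learned during training; the sample gets the same bias term prepended
    as the training data did.
    """
    x = np.hstack([[1.0], np.ravel(sample)])      # bias + flattened features
    return 1.0 / (1.0 + np.exp(-x @ parameters))  # sigmoid, one value per class
```

Applied to the bird problem, 'parameters' would have 87 columns, yielding one probability per species.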

The required utility functions are the following:

• the cost function

The regression hypothesis is defined as:

$$h_\Theta(x) = g(\Theta^T x)$$

with

$$g(z) = \frac{1}{1 + e^{-z}}$$

the sigmoid function, which is provided by the module mathutils.

In our case Θ is the vector of parameters and, with m = len(y), the cost function is given by the following formula:

$$J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\Theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\Theta(x^{(i)})\right) \right]$$

But we are going to use the regularized version of the cost function, so the regularization term needs to be added. The regularized cost function follows:

$$J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\Theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\Theta(x^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \Theta_j^2$$

"In mathematics, the gradient is a generalization of the usual concept of derivative to the functions of several variables. If f(x1, ..., xn) is a differentiable function of several variables, also called a 'scalar field', its gradient is the vector of the n partial derivatives of f. It is thus a vector-valued function, also called a vector field." (Wikipedia extract)

Thus the gradient of the regularized cost function is a vector whose elements are defined as follows:

$$\frac{\partial J(\Theta)}{\partial \Theta_0} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\Theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}$$

$$\frac{\partial J(\Theta)}{\partial \Theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\Theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \Theta_j \qquad (j \geq 1)$$
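The gradient formula can be sanity-checked numerically by comparing the analytic partial derivatives against finite differences of the cost. A self-contained sketch (function names and the single-output simplification are assumptions; the bias parameter Θ₀ is left unregularized, matching the convention above):

```python
import numpy as np


def cost(theta, X, y, lam):
    # Regularized logistic cost J(theta) for a single output column.
    m = len(y)
    h = np.clip(1.0 / (1.0 + np.exp(-X @ theta)), 1e-12, 1 - 1e-12)
    return (-np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / m
            + lam / (2 * m) * np.sum(theta[1:] ** 2))


def gradient(theta, X, y, lam):
    # Analytic gradient, term by term the formula above:
    # (1/m) * sum((h - y) * x_j), plus (lam/m) * theta_j for j >= 1.
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-X @ theta))
    grad = X.T @ (h - y) / m
    grad[1:] += lam / m * theta[1:]
    return grad


def numeric_gradient(theta, X, y, lam, eps=1e-6):
    # Central finite differences, one parameter at a time.
    g = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        g[j] = (cost(theta + step, X, y, lam)
                - cost(theta - step, X, y, lam)) / (2 * eps)
    return g
```

If the two gradients agree to within a few decimal places, the derivation and its implementation are consistent.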

4. K-nearest neighbours

…based on the categories of the k nearest neighbors in the training data set.

Preparation

The first idea was to use Fast ICA, described in the paper: http://mlsp.cs.cmu.edu/courses/fall2012/lectures/ICA_Hyvarinen.pdf. That approach was rather complex, and in the end I opted for k-nearest neighbours, which is a simpler and more efficient approach.
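To illustrate the k-nearest-neighbours idea itself, here is a minimal sketch (plain numpy, Euclidean distance, majority vote; all names are hypothetical, not taken from the project code):

```python
import numpy as np
from collections import Counter


def knn_classify(query, train_X, train_labels, k=3):
    """Predict a category for 'query' from its k closest training samples."""
    # Euclidean distance from the query to every training sample.
    dists = np.linalg.norm(np.asarray(train_X, dtype=float)
                           - np.asarray(query, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]                  # indices of the k neighbours
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]                # majority category
```

For the bird data the feature vectors would of course be far longer than two dimensions, which is exactly why complexity and efficiency matter here.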

…which should do the bird classification.

Material and Preparation

As my knowledge about neural networks was very limited (to abstract basic principles), it was necessary to become familiar with the basic concepts. Stenli found the following Coursera course about machine learning: https://www.coursera.org/course/ml. The material for weeks 4 and 5 is concerned with neural networks, and I used it to get the gist of the topic. I then found another Coursera class about neural networks in machine learning by Geoffrey Hinton: https://www.coursera.org/course/neuralnets. Some of the lectures can also be found on YouTube.

I watched the first 5 lectures which provided a more profound theoretical background and introduced me to backpropagation.

I limited the scope of my algorithm to this material, as more complex neural networks would have meant studying even more material. Python provides many modules for the implementation of neural networks (pybrain: http://pybrain.org/docs/, neurolab: http://pythonhosted.org/neurolab/intro.html#support-neural-networks-types, ...).

I decided not to use these modules, as I wanted to make every step of the algorithm explicit. Implementing a neural network and a learning algorithm with these modules can basically be reduced to two function calls, but I was interested in what is going on behind the scenes. I am well aware of the fact that my algorithm's performance lies far below the performance of the algorithms used in these modules.


Implementation

So my implementation is a basic feed-forward neural network consisting of one input, one hidden and one output layer. To adjust the weights during training I use the backpropagation algorithm (http://en.wikipedia.org/wiki/Backpropagation). The main functions of the algorithm are the initialisation of the neural network, the training of the network with the training data, and the evaluation of the testing data. Several other functions were necessary as helpers to realise these three functions. More detailed information can be found in the code.

…where n depends on the length of the input wav file. This means that the input varies in length, resulting in 17*n input nodes of the network (it does not matter that two-dimensional data is fed in as one-dimensional, as the network will find its own way to extract the information). The problem, however, is that the input length is variable, while in feed-forward networks the input must always be of the same length. For variable-length input, other networks such as recurrent networks (which are beyond my knowledge) are more suitable.
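The kind of network described here, a single hidden layer trained with backpropagation, can be sketched as follows. This is a toy illustration, not the project's code: layer sizes, learning rate, the weight initialisation and the squared-error delta are all assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


class FeedForwardNet:
    """Minimal 1-input, 1-hidden, 1-output-layer network (hypothetical sketch)."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = np.random.default_rng(seed)
        # Small random initial weights; the extra row in each matrix is the bias.
        self.W1 = rng.normal(0.0, 0.5, (n_in + 1, n_hidden))
        self.W2 = rng.normal(0.0, 0.5, (n_hidden + 1, n_out))

    def _forward(self, x):
        a1 = np.append(x, 1.0)                       # input activations + bias
        a2 = np.append(sigmoid(a1 @ self.W1), 1.0)   # hidden activations + bias
        a3 = sigmoid(a2 @ self.W2)                   # output activations
        return a1, a2, a3

    def evaluate(self, x):
        return self._forward(x)[2]

    def train(self, X, Y, epochs=2000, lr=1.0):
        for _ in range(epochs):
            for x, y in zip(X, Y):
                a1, a2, a3 = self._forward(x)
                # Backpropagate the squared-error gradient through both layers.
                d3 = (a3 - y) * a3 * (1 - a3)
                d2 = (self.W2[:-1] @ d3) * a2[:-1] * (1 - a2[:-1])
                self.W2 -= lr * np.outer(a2, d3)
                self.W1 -= lr * np.outer(a1, d2)
```

On a fixed-length toy problem this converges quickly; the variable-length 17*n input described above is precisely what such a network cannot accept without padding or truncating to a common length.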

…the results by using the verification process integrated in the base script.

We ran some tests of all the algorithms, using the full training data, with 75% of it for learning and 25% for verification. The neural network algorithm performed quite well, reaching on a limited set of training samples a correspondence index of 77% (against the benchmark of around 50% of the random algorithm). The linear regression, on the other hand, was not a viable choice, as Python could not cope with the matrix of parameters coming out of it, and the K-nearest-neighbours algorithm had implementation problems which did not allow it to reach acceptable results.

7. Further development

…to be able to re-use the algorithm running and verification parts. Another improvement can be made to the verification process, which could better replicate the one used in the Kaggle competition, and the reading and writing of files could be made operating-system independent.

The K-nearest-neighbours algorithm of course needs to be fixed, and generally all the algorithms could be implemented more efficiently, as complexity and efficiency are of concern for such amounts of data.

A better tuning of the parameters of the algorithms could also be done, to improve the performance.
