Commit e5c3adec authored by Stenli Karanxha

Completed the K-neighbours.

parent 734d1a21
@@ -9,8 +9,8 @@ https://www.kaggle.com/c/multilabel-bird-species-classification-nips2013.
<p>The main characteristics of the problem are:</p>
<ol>
<li> 687 training samples are available, each containing a single call, which simplifies the training.</li>
<li> The samples are pre-processed to remove contaminating information and silence.</li>
<li> The samples are not all of the same length. This requires some padding to simplify the algorithms used (a minimal sketch follows this list).</li>
</ol>
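<p>As an illustration of the padding mentioned in the last point, a minimal sketch in Python (assuming the features are stored as NumPy arrays of shape 17 x n; this is not the project's actual code) could look like this:</p>
<pre><code>
import numpy as np

def pad_features(samples, pad_value=0.0):
    # Zero-pad a list of (17 x n) feature matrices along the time axis
    # so that they all share the length of the longest sample.
    max_len = max(s.shape[1] for s in samples)
    padded = []
    for s in samples:
        extra = max_len - s.shape[1]
        padded.append(np.pad(s, ((0, 0), (0, extra)), constant_values=pad_value))
    return np.stack(padded)
</code></pre>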
@@ -79,15 +79,46 @@ as valid only if it has a better performance than that. </p>
<h1> 4. K-nearest neighbours </h1>
<h2> Idea</h2>
<p> The k-nearest neighbours algorithm is a non-parametric method for
multi-class classification. It belongs to the supervised learning family and
does not need to fit a model to the data. Instead, data points are classified
based on the categories of their k nearest neighbours in the training data set.</p>
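<p>As a minimal illustration of this decision rule (a generic Python sketch, not the project's implementation), a query sample can be classified by a majority vote of its k nearest training samples:</p>
<pre><code>
import numpy as np
from collections import Counter

def knn_predict(train_x, train_y, query, k=5):
    # Classify one query point by majority vote of its k nearest
    # training points, using Euclidean distance. Illustrative only.
    dists = np.linalg.norm(train_x - query, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]
</code></pre>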
<h2> Preparation </h2>
<p> The first idea was to use Fast ICA, as described in
http://mlsp.cs.cmu.edu/courses/fall2012/lectures/ICA_Hyvarinen.pdf.
That approach turned out to be rather complex, and in the end I opted for k-nearest neighbours,
which is a simpler and more efficient approach. </p>
<h2> Implementation </h2>
<p>The following parameters define the behavior of the algorithm (a rough sketch of their interplay follows the list):</p>
<ol>
<li><b>soft_boundaries</b>: Allows a fuzzy assignment, with the same element assigned
partially to several classes. Setting this argument to 0 forces the element to be assigned
only to the class with the highest probability.
</li>
<li><b>normalization:</b> Whitens the data; ideal to use when the distance metric is corr.
</li>
<li><b>distance_metric:</b> Defines the assignment ranking, either correlation-based or distance-based.
</li>
</ol>
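<p>A rough sketch of how these parameters could interact (the parameter names mirror the list above; the body is only an assumed illustration in Python, not the project's actual implementation):</p>
<pre><code>
import numpy as np

def knn_scores(train_x, train_y, query, k=5,
               soft_boundaries=0, normalization=True,
               distance_metric="corr"):
    # Return per-class scores for one query sample.
    # Illustrative assumption of how the parameters above could work.
    if normalization:
        # whiten: zero mean, unit variance per feature
        mean, std = train_x.mean(axis=0), train_x.std(axis=0) + 1e-12
        train_x = (train_x - mean) / std
        query = (query - mean) / std

    if distance_metric == "corr":
        # correlation-based ranking: higher correlation means closer
        sims = np.array([np.corrcoef(t, query)[0, 1] for t in train_x])
        order = np.argsort(-sims)
    else:
        # plain Euclidean distance ranking
        dists = np.linalg.norm(train_x - query, axis=1)
        order = np.argsort(dists)

    neighbours = train_y[order[:k]]
    classes, counts = np.unique(neighbours, return_counts=True)
    probs = counts / counts.sum()
    if soft_boundaries:
        return dict(zip(classes, probs))        # fuzzy assignment
    return {classes[np.argmax(probs)]: 1.0}     # hard assignment
</code></pre>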
<h1> 5. Neural networks </h1>
<h2> Idea </h2>
<p>The goal of this part of the project was to create an algorithm implementing a neural network
to perform the bird classification.</p>
<h2> Material and Preparation </h2>
<p>As my knowledge about neural networks was very limited (to abstract basic principles), it was necessary to become familiar with the basic concepts first.
Stenli found the following Coursera course about machine learning: https://www.coursera.org/course/ml
@@ -109,7 +140,7 @@ that my algorithms' performance lies far below the performance of the algorithms
</p>
<h2> Implementation </h2>
<p>My implementation is therefore a basic feed-forward neural network consisting of one input, one hidden and one output layer.
To adjust the weights during training I use the backpropagation algorithm (http://en.wikipedia.org/wiki/Backpropagation).
@@ -118,7 +149,7 @@ Several other functions were necessary as helpers to realise these 3 functions.
</p>
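<p>For illustration, a single training step of such a network with one hidden layer and backpropagation could look as follows (a generic Python sketch using a squared-error loss and sigmoid activations, not the code used in the project):</p>
<pre><code>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x, y, w1, w2, lr=0.1):
    # One backpropagation step for a 1-hidden-layer network.
    # x: input vector, y: target vector (one value per class),
    # w1, w2: weight matrices. Generic sketch, not the project code.

    # forward pass
    hidden = sigmoid(w1 @ x)
    output = sigmoid(w2 @ hidden)

    # backward pass (squared-error loss, sigmoid derivatives)
    delta_out = (output - y) * output * (1.0 - output)
    delta_hid = (w2.T @ delta_out) * hidden * (1.0 - hidden)

    # gradient descent update of the weights
    w2 -= lr * np.outer(delta_out, hidden)
    w1 -= lr * np.outer(delta_hid, x)
    return w1, w2
</code></pre>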
<h2> Remarks </h2>
<p>The samples in the training and testing data are two-dimensional cepstral feature matrices with 17 features (17*n),
where n depends on the length of the input wav file. This means that the input varies in length.
@@ -138,9 +169,12 @@ For variable input-length other networks such as recurrent networks (which are b
Unfortunately we could not produce a proper algorithm in time for submission, so we had to check
the results by using the verification process integrated in the base script. </p>
<p> We ran some tests of all the algorithms on the full training data, using 75% of
it for learning and 25% for verification. The neural network algorithm performed quite well,
reaching a correspondence index of 77% on a limited set of training samples (against the
benchmark of around 50% for the random algorithm). The linear regression and K-neighbours algorithms,
on the other hand, had implementation problems which did not allow them to reach acceptable
results. </p>
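<p>A minimal sketch of such a 75%/25% split with a simple correspondence measure (plain accuracy here; illustrative Python, with predict_fn standing in for any of the classifiers, not the verification code of the base script):</p>
<pre><code>
import numpy as np

def split_and_score(features, labels, predict_fn, train_frac=0.75, seed=0):
    # Hold out 25% of the samples for verification and report the
    # fraction of correct predictions as a simple correspondence measure.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(labels))
    cut = int(train_frac * len(labels))
    train_idx, test_idx = idx[:cut], idx[cut:]

    predictions = [predict_fn(features[train_idx], labels[train_idx], features[i])
                   for i in test_idx]
    return np.mean(np.array(predictions) == labels[test_idx])
</code></pre>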
<h1> 7. Further development </h1>
@@ -150,6 +184,10 @@ to be able to re-use the algorithm running and verification parts. Another inter
can be done on the verification process, which could better replicate the one used in the Kaggle
competition, and on making the reading and writing of files operating-system independent.</p>
<p>The linear regression and K-neighbours algorithms of course need to be fixed, and in general all
of the algorithms need a more efficient approach, as the complexity and efficiency
are of concern for such amounts of data. </p>
<p>A better tuning of the algorithms' parameters should also be done to improve the performance.</p>