- This article looks at several issues to consider when designing a machine learning system, using spam classification as the running example. Given a labeled training set (spam y = 1, non-spam y = 0), we build a classifier by supervised learning.

### First, consider how to construct the feature vector x

- In spam classification, we can pick some words that help distinguish spam from non-spam. For example, deal, buy and discount are common in spam, while a name (Andrew) and now (signaling urgency) are common in non-spam. For each chosen word, the feature is 1 if the word appears in the email and 0 if it does not, and together these values form the feature vector x
- The features above are chosen by hand. In practice, we usually take the most frequent words in the training emails as the features
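
The encoding described above can be sketched in a few lines of Python. This is only an illustration: the five-word `VOCABULARY`, the helper names `build_vocabulary` and `email_to_features`, and the simple letters-only tokenizer are assumptions made for the example, not part of the original notes.

```python
import re
from collections import Counter

# Hypothetical hand-picked vocabulary, kept in a fixed order.
VOCABULARY = ["andrew", "buy", "deal", "discount", "now"]


def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(emails, size):
    """Pick the `size` most frequent words in the training emails."""
    counts = Counter(word for email in emails for word in tokenize(email))
    return sorted(word for word, _ in counts.most_common(size))


def email_to_features(text, vocabulary=VOCABULARY):
    """Return a 0/1 list: 1 if the vocabulary word occurs in the email."""
    words = set(tokenize(text))
    return [1 if word in words else 0 for word in vocabulary]


x = email_to_features("Deal of the week! Buy now")
print(x)  # [0, 1, 1, 0, 1]
```

The vector has one entry per vocabulary word, so its length is fixed by the vocabulary and not by the length of the email.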

### Second, consider which methods can improve the accuracy of the spam classifier

- Intuitively, we would start by collecting a lot more data, but that is not necessarily useful
- We can also consider using more sophisticated features:

(1) Spam is often sent through unusual servers with an obscured origin, so we can build features from the email routing information

(2) We can build more sophisticated features from the message body: different forms of a word can be treated as the same word (discount and discounts, deal and dealer); upper and lower case can be treated as the same; spam punctuation has its own patterns (promotional emails use a lot of exclamation marks); and spam may use deliberate misspellings to slip past spam keyword filters

### Error analysis: how to choose an appropriate method to improve the algorithm

- Error analysis starts from the advice that, when building a machine learning system, it is better not to begin with a complex system and complex features. Instead, implement a simple algorithm that can be trained quickly. The model may not perform well, but its errors tell us whether to collect more data or to build more sophisticated features
- We use error analysis because, before we have a learning curve, we cannot know in advance whether we need more sophisticated features or more data, so we do not know where our time is best spent. Building a simple algorithm first lets us plot the learning curve and avoid the premature optimization problem familiar from programming: decisions should be guided by evidence rather than intuition, because intuition is often wrong
- The premise of error analysis: simple and complex algorithms tend to make the same kinds of mistakes, that is, they generally misclassify the same categories of email
- In the spam example, we train a simple model and evaluate it on the cross-validation set. We then manually inspect the misclassified emails, look for systematic patterns, and see which kinds of email are consistently misclassified. This suggests new features that could improve the algorithm
- Suppose that, among the misclassified emails, we find 12 pharmaceutical sales emails, 4 emails selling counterfeit goods, 53 phishing emails and 31 others. The algorithm is clearly worst at recognizing phishing emails, so we should study this category more closely and build better features for it
- Next we look at which cues would help classify these emails: deliberate misspellings would catch 5 of them, unusual email routing would catch 16, and unusual punctuation would catch 32. Punctuation is clearly a strong signal, so it is worth spending time on more sophisticated punctuation features
- To sum up, error analysis is a manual inspection process for finding out which kinds of errors the algorithm makes
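
The bookkeeping behind this kind of error analysis is just counting. The sketch below reuses the counts quoted above; the category names and the variable names are made up for the example, and in a real project the labels would come from reading each misclassified email by hand.

```python
from collections import Counter

# Hypothetical hand-assigned categories for the 100 misclassified emails,
# using the counts from the text above.
error_categories = (
    ["pharma"] * 12 + ["counterfeit"] * 4 + ["phishing"] * 53 + ["other"] * 31
)

counts = Counter(error_categories)
total = sum(counts.values())
for category, n in counts.most_common():
    print(f"{category:12s} {n:3d}  {n / total:.0%}")

# The category with the most errors is the one to study first.
worst_category = counts.most_common(1)[0][0]

# How many misclassified emails each candidate cue would have caught.
cue_counts = {
    "deliberate misspelling": 5,
    "unusual routing": 16,
    "unusual punctuation": 32,
}
best_cue = max(cue_counts, key=cue_counts.get)
print(worst_category, "/", best_cue)  # phishing / unusual punctuation
```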

### Numerical evaluation of the algorithm

- When developing a learning algorithm, it is very useful to have a single number that measures how well it performs.
- In the spam example, suppose error analysis leads us to consider a new feature that treats discount, discounts, discounted and discounting as the same word. In natural language processing this is done with stemming software such as the Porter stemmer. Roughly speaking, such software looks at whether the first few letters of two words match to decide whether to treat them as the same word

- However, matching on the first few letters is not always correct. For example, universe and university share a prefix but do not mean the same thing. So after adding this feature, it is hard to tell by inspection whether it helps
- We therefore use cross-validation: compare the error rate with stemming against the error rate without stemming, and use the result to decide whether to keep the feature

- In general, error analysis helps us find the weaknesses of the algorithm and suggests possible improvements, while numerical evaluation tells us whether a given improvement actually helps. Numerical evaluation needs an appropriate error metric, however, and the plain error rate is sometimes not appropriate, so we need to choose the metric carefully
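
To make the stemming pitfall concrete, here is a deliberately crude prefix-based stemmer. It is not the Porter algorithm; `crude_stem` and its `prefix_length` parameter are assumptions that only mirror the "first few letters" simplification used above.

```python
def crude_stem(word, prefix_length=7):
    """Toy stemmer: lowercase the word and keep only its first few letters."""
    return word.lower()[:prefix_length]


# The four forms of "discount" collapse to the same stem, as intended.
for word in ["discount", "discounts", "discounted", "discounting"]:
    print(word, "->", crude_stem(word))

# Two unrelated words collapse as well, which is the failure mode
# that makes a numerical comparison necessary.
print(crude_stem("universe") == crude_stem("university"))  # True
```

Whether the merged features help overall is exactly what comparing the cross-validation error with and without stemming is meant to settle.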

### Skewed classes


- In the tumor classification example, suppose our model has only a 1% error rate on the test set, which looks very good. But if only 0.5% of the people in the test set have cancer, the 1% error rate is no longer impressive
- If we ignore the input and always predict y = 0, that is, predict that nobody has cancer, the error rate is only 0.5%, which is lower than 1%, yet this is obviously not a useful classifier.
- When one class has far more examples than the other, so that the ratio of positive to negative examples is close to an extreme, we say the classes are skewed. In that case, always predicting y = 0 (or always predicting y = 1) gives a very low error rate even though the model is useless. For skewed classes we therefore want a different evaluation metric, namely precision and recall
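
The arithmetic behind the 0.5% figure can be checked directly. The test set below (1000 patients, 5 with cancer) is a made-up example chosen to match the 0.5% rate in the text, and `error_rate` is a helper defined only for this sketch.

```python
# Hypothetical test set: 1000 patients, 5 of whom (0.5%) have cancer.
y_true = [1] * 5 + [0] * 995


def error_rate(y_true, y_pred):
    """Fraction of examples where the prediction differs from the label."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


# A "classifier" that ignores its input and always predicts y = 0.
always_zero = [0] * len(y_true)
print(error_rate(y_true, always_zero))  # 0.005
```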

##### Precision and recall

- Here are four definitions:

(1) True positive: actual y = 1, predicted y = 1

(2) True negative: actual y = 0, predicted y = 0

(3) False positive: actual y = 0, predicted y = 1

(4) False negative: actual y = 1, predicted y = 0

Here, true and false describe whether the prediction is correct, while positive and negative describe the predicted value

- With these four definitions, we can introduce precision and recall

(1) Precision: of all patients we predicted to have cancer, the fraction who actually have cancer, i.e. true positives / (true positives + false positives). High precision means that when we predict a patient has cancer, we are usually right

(2) Recall: of all patients who actually have cancer, the fraction we correctly predicted to have cancer, i.e. true positives / (true positives + false negatives).

With these two numbers we can judge whether a classifier for skewed classes is any good. Note that precision and recall are defined here with y = 1 as the positive class, which is a convention rather than a rule. Usually, the rarer class is defined as positive

- Precision and recall handle the skewed class problem well. If we always predict y = 0 (nobody has cancer), there are no true positives, so recall is 0 and the useless classifier is exposed
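
A minimal sketch of these definitions in Python follows. The function names and the toy labels are assumptions for the example, and returning 0.0 when a denominator is zero is a convention chosen here, since precision is strictly undefined when nothing is predicted positive.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) with y = 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall(y_true, y_pred):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    # Convention for this sketch: report 0.0 when the denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical skewed test set: 5 cancer patients out of 1000.
y_true = [1] * 5 + [0] * 995

# Always predicting y = 0 has a low error rate but zero recall.
print(precision_recall(y_true, [0] * 1000))  # (0.0, 0.0)

# A classifier that finds 4 of the 5 patients and raises 4 false alarms.
y_pred = [1, 1, 1, 1, 0] + [1] * 4 + [0] * 991
print(precision_recall(y_true, y_pred))  # (0.5, 0.8)
```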

##### Tradeoff between precision and recall

- In the cancer prediction example, we train a logistic regression model that outputs a probability between 0 and 1. By default, we predict y = 1 when the hypothesis h(x) is greater than or equal to 0.5
- However, we may want to predict cancer only when we are very confident, because telling a patient that they have cancer is very bad news and leads to a painful course of treatment.
- We can therefore raise the threshold to 0.7, so that we predict cancer only when the probability is relatively high. Among the patients we flag, the fraction who really have cancer is then high, which gives high precision but low recall
- On the other hand, we may want to avoid missing patients who do have cancer, that is, to avoid false negatives. We can then lower the threshold (for example to 0.3), so that most people who really have cancer are flagged, but many of the people we flag do not actually have cancer. This gives low precision and high recall
- To sum up, there is a tradeoff between precision and recall: we usually cannot maximize both, so we have to decide which one matters more for the application
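
The effect of moving the threshold can be seen on a small made-up example. The probabilities and labels below are invented for illustration, and `precision_recall` is a helper written for this sketch with y = 1 as the positive class.

```python
def precision_recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up model outputs h(x) and true labels for 10 patients.
probabilities = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25, 0.15, 0.05]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

results = {}
for threshold in (0.3, 0.5, 0.7):
    # Predict y = 1 only when h(x) reaches the threshold.
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    results[threshold] = precision_recall(labels, predictions)
    precision, recall = results[threshold]
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

On this toy data, raising the threshold from 0.3 to 0.7 moves precision from 0.57 to 0.67 while recall drops from 1.00 to 0.50, which is the tradeoff described above.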

##### How to use precision and recall to compare models

- When the error rate is the evaluation metric, we only need to compare one real number. With precision and recall we have two numbers to compare. How do we settle on a single criterion, for example to decide whether (precision 0.5, recall 0.4) is better or worse than (precision 0.7, recall 0.1)?
- A first idea is to take the average of precision and recall, but this has a flaw. If we predict y = 0 almost all the time, we get recall close to 0 and precision close to 1; if we always predict y = 1, we get recall of 1 and very low precision. In both cases the average can still be fairly high, so the plain average is not a good way to evaluate algorithms
- So we use the F score (also called the F1 score), F = 2PR / (P + R), where P is precision and R is recall. It is also a kind of average, but it gives more weight to the lower of the two values. The numerator is a product, so if either value is 0 the whole score is 0. The F score ranges from 0 to 1

- We can try different thresholds, evaluate each on the cross-validation set, and choose the one with the highest F score
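
The following sketch computes the F score for the two pairs mentioned above and then picks a threshold by F score. The `candidates` table of per-threshold precision and recall is a hypothetical stand-in for values that would be measured on the cross-validation set.

```python
def f1_score(precision, recall):
    """F = 2PR / (P + R), defined as 0.0 when both inputs are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# The two pairs from the text: the more balanced pair scores higher.
print(round(f1_score(0.5, 0.4), 3))  # 0.444
print(round(f1_score(0.7, 0.1), 3))  # 0.175

# A near-degenerate classifier: the plain average looks acceptable,
# but the F score stays close to zero.
precision, recall = 1.0, 0.01
print((precision + recall) / 2, round(f1_score(precision, recall), 3))

# Hypothetical (precision, recall) per threshold on the cross-validation set.
candidates = {0.3: (0.57, 1.00), 0.5: (0.60, 0.75), 0.7: (0.67, 0.50)}
best_threshold = max(candidates, key=lambda t: f1_score(*candidates[t]))
print(best_threshold)  # 0.3
```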

### The importance of data volume

- So far we have discussed how to evaluate a model. Next, we look at the importance of the amount of data
- An earlier study compared several algorithms across different training set sizes. The task was to classify easily confused words (to, two, too) in context. The researchers used a perceptron (a variant of logistic regression), the Winnow algorithm, a memory-based learning algorithm and naive Bayes, and ran the experiments while varying the amount of training data.

- Conclusion: most of the algorithms perform similarly, and the performance of every algorithm improves steadily as the amount of training data grows. The amount of data therefore matters a great deal: an inferior algorithm given far more data than a superior algorithm may achieve better results
- A common saying: the winner is not whoever has the best algorithm, but whoever has the most data.
- But the importance of data volume rests on a premise: the features x must contain enough information to predict y. For example, in the confusable-word task, the words surrounding the blank give enough information. In house price prediction, by contrast, if we are given only the size, without the location, the condition of the house or the number of bedrooms, it is hard to predict the price
- A useful test of whether the features carry enough information is to ask whether a human expert could make the prediction from the same input. An expert English speaker shown the sentence can fill in the right word, but a real estate agent shown only the size of a house cannot predict its price

- To sum up, the keys to a high-performance learning algorithm are as follows

(1) The features contain enough information to predict y

(2) There is a lot of training data. If the training set is very large, much larger than the number of parameters, the model is unlikely to overfit and the training error will be close to the test error

(3) The learning algorithm has many parameters, for example linear regression or logistic regression with many features, or a neural network with many hidden units. A model with many parameters can fit very complex functions and achieve a low training error