Contents
1. Classification
1.1 Text Classification Tasks
2. Algorithms for Classification
2.1 Choosing a Classification Algorithm
2.2 Naïve Bayes
2.3 Logistic Regression
2.4 Support Vector Machines
2.5 K-Nearest Neighbour
2.6 Decision Trees
2.7 Random Forests
2.8 Neural Networks
2.9 Hyper-parameter Tuning
3. Evaluation
3.1 Accuracy
3.2 Precision & Recall
3.3 F(1)-score
4. A Final Word
1. Classification
Input
- A document d
- Often represented as a vector of features
- A fixed output set of classes C = {c1, c2, …, ck}
- Categorical, not continuous (regression) or ordinal (ranking)
Output
- A predicted class c ∈ C
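To make the input/output concrete, here is a minimal scikit-learn sketch (the toy documents, labels, and class names are invented for illustration) using a bag-of-words representation, one common choice of feature vector:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: each document d becomes a row (a sparse feature vector)
docs = ["the match was a thrilling final",
        "shares fell sharply in early trading"]
labels = ["sport", "finance"]  # the fixed output set C = {sport, finance}

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(X.shape)                 # (2, number_of_distinct_words)
print(vectorizer.vocabulary_)  # word -> feature-index mapping
```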
1.1 Text Classification Tasks
Some common examples:
- Topic classification
- Sentiment analysis
- Native-language identification
- Natural language inference
- Automatic fact-checking
- Paraphrase detection
The input may not be a long document (e.g., a sentence pair for inference, a single tweet for sentiment)
2. Algorithms for Classification
2.1 Choosing a Classification Algorithm
- Bias vs. Variance
- Bias: the assumptions we make in our model
- Variance: sensitivity to the training set
- Underlying assumptions, e.g., independence
- Complexity
- Speed
2.2 Naïve Bayes

Pros:
- Fast to train and classify
- Robust, low-variance → good for low-data situations
- Optimal classifier if the independence assumption is correct
- Extremely simple to implement
Cons:
- Independence assumption rarely holds
- Low accuracy compared to similar methods in most situations
- Smoothing required for unseen class/feature combinations
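A minimal scikit-learn sketch (toy data invented for illustration); the `alpha` parameter is the smoothing mentioned in the last point:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["great plot and acting", "boring and slow",
        "loved every minute", "terrible film"]
labels = ["pos", "neg", "pos", "neg"]

vec = CountVectorizer()
X = vec.fit_transform(docs)

# alpha=1.0 is Laplace (add-one) smoothing, which handles
# class/feature combinations never seen in training
clf = MultinomialNB(alpha=1.0).fit(X, labels)
print(clf.predict(vec.transform(["slow and boring plot"])))  # likely "neg"
```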
2.3 Logistic Regression

Pros:
- Unlike Naïve Bayes, not confounded by diverse, correlated features → better performance
Cons:
- Slow to train
- Feature scaling needed
- Requires a lot of data to work well in practice
- Choosing a regularisation strategy is important, since overfitting is a big problem
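A minimal sketch (toy dense features invented for illustration) showing the two practical points above: feature scaling via a pipeline, and regularisation via `C`, scikit-learn's inverse regularisation strength:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical handcrafted document features on very different scales
X = np.array([[1.0, 200.0], [0.5, 150.0], [3.0, 30.0], [2.5, 10.0]])
y = ["neg", "neg", "pos", "pos"]

# StandardScaler handles feature scaling; smaller C means a
# stronger penalty on model complexity (less overfitting)
clf = make_pipeline(StandardScaler(), LogisticRegression(C=1.0))
clf.fit(X, y)
```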
2.4 Support Vector Machines

Finds the hyperplane that separates the training data with maximum margin
Pros:
- Fast and accurate linear classifier
- Can do non-linearity with kernel trick
- Works well with huge feature sets
Cons:
- Multiclass classification is awkward
- Feature scaling needed
- Deals poorly with class imbalance
- Limited interpretability
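A minimal sketch of the kernel trick (toy XOR-style data invented for illustration): the data is not linearly separable, but an RBF kernel implicitly maps it into a space where a maximum-margin hyperplane exists:

```python
import numpy as np
from sklearn.svm import SVC

# XOR labels: no straight line separates the two classes
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = [0, 1, 1, 0]

clf = SVC(kernel="rbf", gamma=1.0, C=10.0).fit(X, y)
print(clf.predict(X))  # recovers [0, 1, 1, 0]
```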
2.5 K-Nearest Neighbour

Pros:
- Simple but surprisingly effective
- No training required
- Inherently multiclass
- Optimal classifier with infinite data
Cons:
- Have to select k
- Issues with imbalanced classes
- Often slow (nearest-neighbour search is expensive)
- Features must be selected carefully
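A minimal sketch (toy data invented for illustration); note that k is chosen by hand and `fit` does little more than store the training points:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])
y = ["a", "a", "b", "b"]

# n_neighbors (k) must be selected, e.g. on a development set
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(clf.predict([[0.15, 0.15]]))  # the nearest neighbours are mostly "a"
```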
2.6 Decision Trees

Pros:
- Fast to build and test
- Feature scaling irrelevant
- Good for small feature sets
- Handles non-linearly-separable problems
Cons:
- In practice, not that interpretable (highly redundant sub-trees)
- Not competitive for large feature sets
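A minimal sketch (toy XOR data invented for illustration) of a non-linearly-separable problem a tree handles, with `max_depth` capping tree complexity; note that no feature scaling is applied anywhere:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# XOR: impossible for a linear classifier, easy for a shallow tree
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = [0, 1, 1, 0]

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(clf.predict(X))  # [0, 1, 1, 0]
```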
2.7 Random Forests

Pros:
- Usually more accurate and more robust than decision trees
- Great classifier for medium feature sets
- Training easily parallelised
Cons:
- Limited interpretability
- Slow with large feature sets
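A minimal sketch (toy data invented for illustration); because each of the `n_estimators` trees is trained independently, `n_jobs=-1` parallelises training across all cores:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 5)  # repeated toy rows
y = [0, 1, 1, 0] * 5

clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
clf.fit(X, y)
```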
2.8 Neural Networks

Pros:
- Extremely powerful, dominant method in NLP and vision
- Little feature engineering
Cons:
- Not an off-the-shelf classifier
- Many hyper-parameters, difficult to optimise
- Slow to train
- Prone to overfitting
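Even a minimal multi-layer perceptron sketch (toy XOR data invented for illustration) exposes several of the hyper-parameters mentioned above: layer sizes, regularisation strength, solver, and iteration budget:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = [0, 1, 1, 0]

# hidden_layer_sizes, alpha (L2 regularisation), solver and max_iter
# all need tuning; defaults are rarely right for a given task
clf = MLPClassifier(hidden_layer_sizes=(8,), alpha=1e-4,
                    solver="lbfgs", max_iter=2000,
                    random_state=0).fit(X, y)
```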
2.9 Hyper-parameter Tuning
- Dataset for tuning
- Development set
- Not the training set or the test set
- k-fold cross-validation
- The specific hyper-parameters are classifier-specific
- E.g. tree depth for decision trees
- But many hyper-parameters relate to regularisation
- Regularisation hyper-parameters penalise model complexity
- Used to prevent overfitting
- For multiple hyper-parameters, use grid search
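A minimal sketch of grid search with k-fold cross-validation in scikit-learn (toy random data invented for illustration); the test set is never touched during tuning:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(40, 5)        # toy features
y = rng.randint(0, 2, 40)  # toy binary labels

# Every combination in param_grid is evaluated with 5-fold
# cross-validation over the training/development data
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid={"max_depth": [2, 4, 8],
                                "min_samples_leaf": [1, 5]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_)
```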
3. Evaluation
3.1 Accuracy
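Accuracy is the proportion of documents classified correctly:

$$\text{Accuracy} = \frac{\text{number of correct predictions}}{\text{total number of predictions}}$$

Simple, but it can be misleading with imbalanced classes: always predicting the majority class may already score highly.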

3.2 Precision & Recall
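For a class of interest (the "positive" class), with true positives (TP), false positives (FP), and false negatives (FN):

$$\text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN}$$

Precision asks: of the documents we labelled positive, how many really are? Recall asks: of the truly positive documents, how many did we find?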

3.3 F(1)-score
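The F(1)-score, the β = 1 case of the general F(β) measure, is the harmonic mean of precision and recall:

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

For multiclass problems it is typically macro-averaged (compute F1 per class, then average) or micro-averaged.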

4. A Final Word
- There are many algorithms you can try on a task you are interested in (see scikit-learn)
- But if your goal is good results on a new task, good annotation, a plentiful dataset, and appropriate features often matter more than the specific algorithm used