
Whiteboard Derivations - Machine Learning - Linear Classification

1. Linear Classification - Background

Linear Regression: $f(w, b) = w^T x + b$

  • Linearity (and ways to break it):

    • Nonlinear in the attributes: feature transformation (polynomial regression)
    • Nonlinear globally: linear classification (the activation function is nonlinear)
    • Nonlinear in the coefficients: neural networks, perceptron
  • Globality (broken by): linear spline regression, decision trees

  • Unprocessed data (handled by): PCA, manifold learning

Classification:

  • Hard classification: outputs 0 or 1

    • Linear discriminant analysis (Fisher)
    • Perceptron
  • Soft classification: outputs a probability in [0, 1]

    • Generative: Gaussian discriminant analysis
    • Discriminative: logistic regression

2. Linear Classification - Perceptron

Idea: error-driven — update only on misclassified samples.

Model: $f(x) = \operatorname{sign}(w^T x), \; x \in R^p, \; w \in R^p$,
where $\operatorname{sign}(a) = 1$ if $a \ge 0$, and $\operatorname{sign}(a) = -1$ otherwise.

Strategy: take the number of misclassified points as the loss function
$$L(w) = \sum_{i=1}^N I(y_i w^T x_i < 0)$$
The indicator function is not differentiable, so relax it to a sum over the set $\mathcal{M}$ of misclassified samples:
$$L(w) = \sum_{(x_i, y_i) \in \mathcal{M}} -y_i w^T x_i$$

Solution: SGD, updating on each misclassified sample $(x_i, y_i)$ (a sketch follows): $w^{t+1} = w^t - \lambda\frac{\partial L(w)}{\partial w} = w^t + \lambda y_i x_i$
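
A minimal NumPy sketch of this update rule, assuming labels in $\{+1, -1\}$; the function name, bias-absorption trick, and default hyperparameters are illustrative, not from the notes:

```python
import numpy as np

def perceptron_sgd(X, y, lr=1.0, epochs=100):
    """Error-driven SGD for the perceptron.

    X: (N, p) samples; y: (N,) labels in {+1, -1}.
    """
    X = np.hstack([X, np.ones((X.shape[0], 1))])  # absorb the bias b into w
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:    # misclassified: y_i w^T x_i <= 0
                w += lr * yi * xi     # w <- w + lambda * y_i * x_i
                errors += 1
        if errors == 0:               # no mistakes left: converged
            break
    return w
```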

3. Linear Classification - Linear Discriminant Analysis (Fisher)

  • Notation: $X = (x_1, \dots, x_N)^T \in R^{N\times p}, \; x_i \in R^p, \; y_i \in \{+1, -1\}$,
    $X_{c_1} = \{x_i \mid y_i = +1\}$, $X_{c_2} = \{x_i \mid y_i = -1\}$, $|X_{c_1}| = N_1$, $|X_{c_2}| = N_2$, $N_1 + N_2 = N$

  • Idea: make the within-class variance small and the between-class separation large.

  • Formulas (project each sample onto $w$):
    $$z_i = w^T x_i$$
    $$\bar{z} = \frac{1}{N}\sum_{i=1}^N z_i = \frac{1}{N}\sum_{i=1}^N w^T x_i$$
    $$S_z = \frac{1}{N}\sum_{i=1}^N (z_i - \bar{z})(z_i - \bar{z})^T = \frac{1}{N}\sum_{i=1}^N (w^T x_i - \bar{z})(w^T x_i - \bar{z})^T$$

    $$c_1:\; \bar{z}_1 = \frac{1}{N_1}\sum_{x_i \in X_{c_1}} w^T x_i$$
    $$S_1 = \frac{1}{N_1}\sum_{x_i \in X_{c_1}} (w^T x_i - \bar{z}_1)(w^T x_i - \bar{z}_1)^T$$
    $$c_2:\; \bar{z}_2 = \frac{1}{N_2}\sum_{x_i \in X_{c_2}} w^T x_i$$
    $$S_2 = \frac{1}{N_2}\sum_{x_i \in X_{c_2}} (w^T x_i - \bar{z}_2)(w^T x_i - \bar{z}_2)^T$$

    Between-class: $(\bar{z}_1 - \bar{z}_2)^2$

    Within-class: $S_1 + S_2$

    Objective function (here $S_{c_1}, S_{c_2}$ are the covariance matrices of $x$ within each class, so $S_k = w^T S_{c_k} w$):
    $$J(w) = \frac{(\bar{z}_1 - \bar{z}_2)^2}{S_1 + S_2} = \frac{w^T(\overline{x_{c_1}} - \overline{x_{c_2}})(\overline{x_{c_1}} - \overline{x_{c_2}})^T w}{w^T(S_{c_1} + S_{c_2})w} = \frac{w^T S_b w}{w^T S_w w}$$

    $S_b$: between-class scatter matrix

    $S_w$: within-class scatter matrix

    Setting $\frac{\partial J(w)}{\partial w} = 0$ gives $S_w w \propto S_b w$; since $S_b w = (\overline{x_{c_1}} - \overline{x_{c_2}})(\overline{x_{c_1}} - \overline{x_{c_2}})^T w$ always points along $\overline{x_{c_1}} - \overline{x_{c_2}}$, this simplifies to $w \propto S_w^{-1}(\overline{x_{c_1}} - \overline{x_{c_2}})$.
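
A hedged NumPy sketch of this closed-form direction (the helper name is mine); solving the linear system $S_w w = \overline{x_{c_1}} - \overline{x_{c_2}}$ avoids forming $S_w^{-1}$ explicitly:

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher direction w ∝ S_w^{-1} (mean of c1 - mean of c2).

    X1: (N1, p) samples of class c1; X2: (N2, p) samples of class c2.
    Assumes S_w is nonsingular.
    """
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Class covariances S_{c1}, S_{c2}, so S_w = S_{c1} + S_{c2}
    S1 = (X1 - mu1).T @ (X1 - mu1) / len(X1)
    S2 = (X2 - mu2).T @ (X2 - mu2) / len(X2)
    Sw = S1 + S2
    w = np.linalg.solve(Sw, mu1 - mu2)  # solve S_w w = mu1 - mu2
    return w / np.linalg.norm(w)        # scale is irrelevant; normalize
```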

4. Linear Classification - Logistic Regression

  • Data: $\{(x_i, y_i)\}_{i=1}^N, \; x_i \in R^p, \; y_i \in \{0, 1\}$

  • Sigmoid function: $\sigma(z) = \frac{1}{1+\exp(-z)}$
    $$P(y=1|x) = \sigma(w^T x) = \frac{1}{1+\exp(-w^T x)}$$
    $$P(y=0|x) = 1 - P(y=1|x) = \frac{\exp(-w^T x)}{1+\exp(-w^T x)}$$
    The two cases combine into a single Bernoulli likelihood:
    $$P(y|x) = P(y=1|x)^y \, P(y=0|x)^{1-y}$$

  • MLE yields the cross-entropy loss:
    $$\hat{w} = \argmax_w \prod_{i=1}^N P(y_i|x_i)$$
    $$= \argmax_w \sum_{i=1}^N \big[y_i \log P(y=1|x_i) + (1-y_i)\log P(y=0|x_i)\big]$$
    $$= \argmin_w -\sum_{i=1}^N \big[y_i \log P(y=1|x_i) + (1-y_i)\log P(y=0|x_i)\big]$$
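
The cross-entropy has no closed-form minimizer, so $w$ is found iteratively. A minimal gradient-descent sketch (function names and hyperparameters are illustrative); the gradient of the mean negative log-likelihood is $\frac{1}{N}X^T(\sigma(Xw) - y)$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_gd(X, y, lr=0.1, epochs=1000):
    """Gradient descent on the mean cross-entropy loss.

    X: (N, p) samples; y: (N,) labels in {0, 1}.
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = sigmoid(X @ w)               # P(y=1|x_i) for every sample
        grad = X.T @ (p - y) / len(y)    # mean NLL gradient
        w -= lr * grad
    return w
```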

5. Linear Classification - Gaussian Discriminant Analysis

  • Data: $\{(x_i, y_i)\}_{i=1}^N, \; x_i \in R^p, \; y_i \in \{0, 1\}$

  • Generative model (both classes share the covariance $\Sigma$):
    $$\hat{y} = \argmax_y P(y|x) = \argmax_y P(y)P(x|y)$$
    $$y \sim \text{Bernoulli}(\phi): \quad P(y=1)=\phi, \; P(y=0)=1-\phi$$
    $$x \mid y=1 \sim N(\mu_1, \Sigma)$$
    $$x \mid y=0 \sim N(\mu_2, \Sigma)$$

  • Log-likelihood:
    $$l(\theta) = \log\prod_{i=1}^N P(x_i, y_i)$$
    $$= \sum_{i=1}^N \log\big(P(x_i|y_i)P(y_i)\big)$$
    $$= \sum_{i=1}^N \big[\log P(x_i|y_i) + \log P(y_i)\big]$$
    $$= \sum_{i=1}^N \big[\log N(x_i|\mu_1, \Sigma)^{y_i} N(x_i|\mu_2, \Sigma)^{1-y_i} + \log \phi^{y_i}(1-\phi)^{1-y_i}\big]$$
    $$= \sum_{i=1}^N \big[y_i \log N(x_i|\mu_1, \Sigma) + (1-y_i)\log N(x_i|\mu_2, \Sigma) + y_i \log\phi + (1-y_i)\log(1-\phi)\big]$$

  • Setup:
    $$\hat{\theta} = \argmax_{\theta} l(\theta), \quad \theta = (\mu_1, \mu_2, \Sigma, \phi)$$
    Let $N_1$ be the number of samples with $y=1$ and $N_2$ the number with $y=0$, so $N = N_1 + N_2$.

  • Partial derivative with respect to $\phi$:
    $$\frac{\partial l(\theta)}{\partial \phi} = \sum_{i=1}^N \Big(\frac{y_i}{\phi} - \frac{1-y_i}{1-\phi}\Big) = 0$$
    $$\to \sum_{i=1}^N \big[y_i(1-\phi) - (1-y_i)\phi\big] = \sum_{i=1}^N (y_i - \phi) = 0$$
    $$\to \sum_{i=1}^N y_i - N\phi = 0$$
    $$\to \hat{\phi} = \frac{1}{N}\sum_{i=1}^N y_i = \frac{N_1}{N}$$

For $\mu_1$, only the first term of $l(\theta)$ matters; call it ①:
$$① = \sum_{i=1}^N \log N(x_i|\mu_1,\Sigma)^{y_i} = \sum_{i=1}^N y_i \log\bigg[\frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\exp\Big(-\frac{1}{2}(x_i - \mu_1)^T\Sigma^{-1}(x_i - \mu_1)\Big)\bigg]$$
$$\hat{\mu}_1 = \argmax_{\mu_1} ① = \argmax_{\mu_1}\sum_{i=1}^N y_i\Big(-\frac{1}{2}(x_i - \mu_1)^T\Sigma^{-1}(x_i - \mu_1)\Big)$$
$$\frac{\partial ①}{\partial \mu_1} = -\frac{1}{2}\sum_{i=1}^N y_i(-2\Sigma^{-1}x_i + 2\Sigma^{-1}\mu_1) = 0$$
$$\to \sum_{i=1}^N y_i(\mu_1 - x_i) = 0$$
$$\to \hat{\mu}_1 = \frac{\sum_{i=1}^N y_i x_i}{\sum_{i=1}^N y_i} = \frac{\sum_{i=1}^N y_i x_i}{N_1}$$

  • The remaining parameters $\hat{\mu}_2$ and $\hat{\Sigma}$ follow from the same MLE procedure; all four estimates are collected in the sketch below.
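
A NumPy sketch of the closed-form estimates (the helper name is mine; the shared $\Sigma$ is the pooled scatter of both classes around their own means, which is its MLE under the shared-covariance model):

```python
import numpy as np

def gda_fit(X, y):
    """MLE for theta = (mu_1, mu_2, Sigma, phi).

    X: (N, p) samples; y: (N,) labels in {0, 1}.
    """
    N = len(y)
    N1 = int(y.sum())                    # number of y=1 samples
    phi = N1 / N                         # hat{phi} = N_1 / N
    mu1 = X[y == 1].mean(axis=0)         # hat{mu}_1 = sum(y_i x_i) / N_1
    mu2 = X[y == 0].mean(axis=0)         # hat{mu}_2, by symmetry
    D1 = X[y == 1] - mu1
    D2 = X[y == 0] - mu2
    Sigma = (D1.T @ D1 + D2.T @ D2) / N  # shared covariance
    return mu1, mu2, Sigma, phi
```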

6. Linear Classification - Naive Bayes Classifier

  • Conditional independence assumption across the $p$ features of $x$: $P(x|y) = \prod_{j=1}^p P(x_j|y)$

  • Purpose (motivation): simplify the computation of $P(x|y)$.

  • $\hat{y} = \argmax_y P(y|x) = \argmax_y \frac{P(x, y)}{P(x)} = \argmax_y P(y) P(x|y)$

  • Distribution assumptions on $x$ (a Gaussian sketch follows):

    • $x_j$ discrete: $x_j \sim \text{Categorical}$
    • $x_j$ continuous: $x_j \sim N(\mu_j, \sigma_j^2)$
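
A minimal Gaussian naive Bayes sketch for the continuous case (helper names and the variance-smoothing constant are illustrative, not from the notes):

```python
import numpy as np

def gaussian_nb_fit(X, y):
    """Per-class prior plus per-feature (mu_j, sigma_j^2) under
    the conditional-independence assumption.

    X: (N, p) samples; y: (N,) labels in {0, 1}.
    """
    params = {}
    for c in (0, 1):
        Xc = X[y == c]
        params[c] = {
            "prior": len(Xc) / len(X),     # P(y=c)
            "mu": Xc.mean(axis=0),         # mu_j for each feature j
            "var": Xc.var(axis=0) + 1e-9,  # sigma_j^2, smoothed
        }
    return params

def gaussian_nb_predict(params, x):
    """hat{y} = argmax_y log P(y) + sum_j log N(x_j | mu_j, sigma_j^2)."""
    scores = {}
    for c, p in params.items():
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * p["var"])
                                + (x - p["mu"]) ** 2 / p["var"])
        scores[c] = np.log(p["prior"]) + log_lik
    return max(scores, key=scores.get)
```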