PCA: Intelligent Data Analysis


Process

  1. Start with a question about the data (degrees of freedom, content, predictions).
  2. Build a model to answer it.

Questions (Outline)

  1. How many degrees of freedom are in the data? (What are the real factors that are changing? Rotate the axes, find the directions of large variability, ignore the noise.)
  2. Are there natural groups (clusters) in the data?
  3. Can we build predictive models on the data?
  4. Can we deal with documents (unstructured data)?
  5. Ranking: PageRank (not included in the test).

Basic Concepts of Statistics & Probability

Notation:

\(X=\{x^1,x^2,\dots\},\quad x^i\in \mathbb{R}^n\)

  1. Expectation and Mean: \(E[x]=\frac{1}{N}\sum_{i=1}^{N}x^i=\mu=\sum_{j=1}^{k}j\,P(j)\)
  2. Variance: \(Var[x]=\frac{1}{N}\sum_{i=1}^{N}(x^i-E[x])^2=\sum_{j=1}^{k}P(j)(j-E[x])^2\)
  3. Standard Deviation: \(\sigma=\sqrt{Var[X]}\)
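
A minimal numpy sketch of these three quantities (the sample values in `x` are made up for illustration):

```python
import numpy as np

# Hypothetical 1-D sample; any list of numbers works here.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = x.mean()                    # E[x] = (1/N) * sum over samples
var = ((x - mean) ** 2).mean()     # Var[x], dividing by N
std = np.sqrt(var)                 # sigma = sqrt(Var[x])

print(mean, var, std)              # 5.0 4.0 2.0 for this sample
```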

Variability in more dimensions

\(\underline{X}= \left[ \begin{array}{c} x_1\\ x_2\\ x_3\\ \vdots \end{array} \right]\).

\(Cov[x_1,x_2] = E[(x_1-E[x_1])(x_2-E[x_2])]\).
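
A small sketch of this definition for two variables (the paired measurements are invented for illustration):

```python
import numpy as np

# Two hypothetical measurements per sample, e.g. two sensor readings.
x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = np.array([2.0, 4.0, 6.0, 8.0])   # moves together with x1

# Cov[x1, x2] = E[(x1 - E[x1]) * (x2 - E[x2])]
cov = ((x1 - x1.mean()) * (x2 - x2.mean())).mean()

print(cov)                               # 2.5 for this data
print(np.cov(x1, x2, bias=True)[0, 1])   # same value via numpy's covariance matrix
```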


Independent measurement

Data transform:

  • Mean = 0.
  • Standard deviation = 1.
\[\frac{x_j^i-\mu_j}{\sigma_j}\rightarrow x_j^i\]
  • After the transform, the covariance equals the correlation coefficient (see the sketch below):
\[Cov[X_1,X_2]=E[X_1X_2]\]
  • Covariance may change with the coordinate system; a suitable rotation makes it vanish:
\[Cov[X_1,X_2]>0\rightarrow Cov[\hat{X_1},\hat{X_2}]=0\]
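
Here is a sketch of the standardizing transform, using a made-up 5×2 data matrix: after it each column has mean 0 and standard deviation 1, and the covariance of two standardized columns is exactly their correlation coefficient.

```python
import numpy as np

# Hypothetical data: N = 5 samples (rows), d = 2 dimensions (columns).
X = np.array([[1.0, 10.0],
              [2.0, 14.0],
              [3.0, 13.0],
              [4.0, 17.0],
              [5.0, 21.0]])

# Standardize each column: (x_j^i - mu_j) / sigma_j
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

print(Xs.mean(axis=0))   # ~[0, 0]
print(Xs.std(axis=0))    # [1, 1]

# Covariance of the standardized columns = correlation of the original columns.
print((Xs[:, 0] * Xs[:, 1]).mean())
print(np.corrcoef(X[:, 0], X[:, 1])[0, 1])   # same number
```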

More dimensions

Covariance Matrix

Calculate the covariance of every pair of dimensions:

\[x^k\in \mathbb{R}^d\] \[C_{jk}=Cov[X_j,X_k]\quad(d\times{}d\ \text{matrix})\] \[Cov[X_j,X_k]=Cov[X_k,X_j]\] \[C_{jj}=1\quad\text{(after standardization, since each variance is 1)}\]
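
A sketch of computing the full \(d\times d\) matrix; the data-matrix layout (dimensions as rows, samples as columns) and the random data are assumptions for illustration:

```python
import numpy as np

# Hypothetical data matrix: rows are the d dimensions, columns are the N samples,
# matching X = [x^1 x^2 ...] with each x^k in R^d.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 100))          # d = 3, N = 100

# Standardize each dimension (row), then C = (1/N) X X^T.
X = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
C = X @ X.T / X.shape[1]

print(C.shape)                 # (3, 3): a d x d matrix
print(np.allclose(C, C.T))     # True: Cov[X_j, X_k] = Cov[X_k, X_j]
print(np.diag(C))              # ~[1, 1, 1]: unit diagonal after standardization
```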

After the Transform

In the new coordinate system the covariance matrix is diagonal:

\[\tilde{C}_{jk}=\sigma_j^2\quad(j=k,\ \text{diagonal})\] \[\tilde{C}_{jk}=0\quad(j\ne k)\]
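
A quick numerical check of this claim, assuming the same data-matrix layout as above: rotating into the eigenvector basis of C makes the off-diagonal covariances vanish.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 200))                  # hypothetical d = 3, N = 200 data
X = X - X.mean(axis=1, keepdims=True)          # centre each dimension

C = X @ X.T / X.shape[1]                       # covariance matrix
eigvals, V = np.linalg.eigh(C)                 # C is symmetric, so eigh applies

C_tilde = V.T @ C @ V                          # covariance in the rotated coordinates
print(np.allclose(C_tilde, np.diag(eigvals)))  # True: it is diagonal
```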

What do we want?

    Datasets usually have ten or even more dimensions. Many of these dimensions may be correlated, so we want a new coordinate system in which the data are uncorrelated (independent). The entries of the covariance matrix describe how each axis varies with the others, so we want to transform it into a coordinate system that is as independent as possible and then project each data point onto the new axes. In other words, we combine correlated axes to obtain uncorrelated ones.

Steps

  • Normalization
\[\frac{x_j^i-\mu_j}{\sigma_j}\rightarrow x_j^i\]
  • V and X

    \[V = \begin{bmatrix} v^1&v^2&...\end{bmatrix}\quad\text{(unit vectors of the new axes)}\] \[X = \begin{bmatrix}x^1&x^2&...\end{bmatrix}\quad\text{(data, one column per sample)}\]
  • Covariance Matrix (of the normalized data, so \(\mu=0\))

    \[C_{jk}=\frac{1}{N}\sum_{i=1}^{N}(x_j^i-\mu_j)(x_k^i-\mu_k)=\frac{1}{N}\sum_{i=1}^{N}x_j^ix_k^i\]
  • Matrix Form

    \[C=\frac{1}{N}XX^T\quad\text{and}\quad \tilde{C}=\frac{1}{N}\tilde{X}\tilde{X}^T\]
  • Substitute \(\tilde{X} = V^TX\)

\[\tilde{C}=\frac{1}{N}V^TX(V^TX)^T\] \[\tilde{C}=\frac{1}{N}V^TXX^TV\] \[\tilde{C}=V^TCV\]

    \(\tilde{C}\) is the diagonal matrix of eigenvalues and \(V^T=V^{-1}\) (the eigenvectors form an orthonormal basis). The number of dimensions D we keep is usually smaller than the dataset dimension d, and the fraction of variance retained is

\[\mu(D)=\frac{\sigma_1^2+\sigma_2^2+...+\sigma_D^2}{\sigma_1^2+\sigma_2^2+...+\sigma_d^2}\]

    The eigenvectors capture the axes along which the dimensions vary together. Projecting onto them (the PCA projection) gives a much clearer view! A sketch of the full pipeline follows.
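
Putting the steps together, here is a minimal PCA sketch under the assumptions above: normalize, eigendecompose \(C\), project onto the top D eigenvectors, and report the retained-variance fraction \(\mu(D)\). The synthetic data and the choice D = 2 are only for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic correlated data: N = 500 samples in d = 5 dimensions (columns = samples).
latent = rng.normal(size=(2, 500))
mix = rng.normal(size=(5, 2))
X = mix @ latent + 0.05 * rng.normal(size=(5, 500))

# 1. Normalize each dimension to mean 0, standard deviation 1.
X = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

# 2. Covariance matrix C = (1/N) X X^T and its eigendecomposition.
C = X @ X.T / X.shape[1]
eigvals, V = np.linalg.eigh(C)           # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]        # re-sort into descending order
eigvals, V = eigvals[order], V[:, order]

# 3. Keep D < d axes and project: the new coordinates are V^T X, restricted to the top D rows.
D = 2
X_proj = V[:, :D].T @ X                  # shape (D, N)

# 4. Fraction of variance kept: mu(D) = (top D eigenvalues) / (all eigenvalues).
mu_D = eigvals[:D].sum() / eigvals.sum()
print(X_proj.shape, round(mu_D, 3))      # (2, 500) and a ratio close to 1
```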


Author: 卢弘毅