Intelligent data analysis
Process
- Start with a question about the data (degrees of freedom, content, predictions).
- Build a model to answer it.
Questions (Outline)
- How many degrees of freedom are in the data? (What are the real factors that are changing? Rotate the axes, find the directions of large variability, ignore noise.)
- Are there ‘mutual’ groups in the data?
- Can we build predictive models on the data?
- Can we deal with documents (unorganized data)?
- Ranking: PageRank (not included in the test).
Basic Concepts of Statistics & Probability
Notation:
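\(R=\{x^1,x^2,\dots,x^N\},\quad x^i\in R^n\)
- Expectation and Mean: \(E[x]=\frac{1}{N}\sum_{i=1}^{N}x^i=\mu\); for a discrete variable taking the values \(j=1,\dots,k\) with probability \(P(j)\), \(E[x]=\sum_{j=1}^{k}jP(j)\)
- Variance: \(Var[x]=\frac{1}{N}\sum_{i=1}^{N}(x^i-E[x])^2\); in the discrete case, \(Var[x]=\sum_{j=1}^{k}P(j)(j-E[x])^2\)
- Standard Deviation: \(\sigma=\sqrt{Var[x]}\)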
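The definitions above can be checked on a small sample. The following is a minimal sketch in plain Python, assuming the 1/N convention used in these notes; the function names and the sample values are made up for illustration.

```python
import math
from collections import Counter


def mean(xs):
    """Sample mean: (1/N) * sum of the x^i."""
    return sum(xs) / len(xs)


def variance(xs):
    """Variance with the 1/N convention used in the notes."""
    mu = mean(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)


def std_dev(xs):
    """Standard deviation: square root of the variance."""
    return math.sqrt(variance(xs))


# Illustrative sample (not from the notes).
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Discrete form of the mean: sum over distinct values j of j * P(j),
# where P(j) is the relative frequency of j in the sample.
counts = Counter(data)
mean_from_probs = sum(j * c / len(data) for j, c in counts.items())

print(mean(data), variance(data), std_dev(data), mean_from_probs)
```

Both forms of the mean agree because P(j) is estimated here as the relative frequency of each value in the sample.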
Variability in more dimensions
\(\underline{X}= \left[ \begin{array}{c} x_1\\ x_2\\ x_3\\ \vdots \end{array} \right]\)
\(Cov[x_1,x_2] = E[(x_1-E[x_1])(x_2-E[x_2])]\).
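As a small illustration of the covariance formula, here is a sketch in plain Python that again assumes the 1/N convention. The helper name `covariance` and the three coordinate lists are hypothetical.

```python
def covariance(xs, ys):
    """Cov[x, y] = E[(x - E[x]) * (y - E[y])], with 1/N averaging."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


# Illustrative coordinates measured on the same four samples.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [2.0, 4.0, 6.0, 8.0]   # grows with x1: positive covariance
x3 = [4.0, 3.0, 2.0, 1.0]   # shrinks as x1 grows: negative covariance

print(covariance(x1, x2))
print(covariance(x1, x3))
print(covariance(x1, x1))   # covariance of a coordinate with itself is its variance
```

The sign of the covariance shows whether two coordinates tend to move in the same or in opposite directions.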
Independent measurement
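Data transform:
- Mean = 0.
- Standard deviation = 1.
- After the transform, the covariance of two coordinates equals their correlation coefficient: \(\rho_{12}=\frac{Cov[x_1,x_2]}{\sigma_1\sigma_2}\).
- Covariance may change with the coordinate system.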
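The effect of the data transform can be sketched as follows. This is an illustrative example in plain Python, and it uses a change of units (centimetres to metres) as a simple stand-in for a change of coordinate system. The data and the helper names `standardize` and `covariance` are assumptions made for the example.

```python
import math


def standardize(xs):
    """Shift to mean 0 and scale to standard deviation 1 (1/N convention)."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    return [(x - mu) / sigma for x in xs]


def covariance(xs, ys):
    """Cov[x, y] with 1/N averaging."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


# Made-up measurements; the same heights expressed in two different units.
height_cm = [150.0, 160.0, 170.0, 180.0]
height_m = [h / 100.0 for h in height_cm]
weight_kg = [50.0, 58.0, 71.0, 77.0]

# The raw covariance depends on the units chosen for the axes.
cov_cm = covariance(height_cm, weight_kg)
cov_m = covariance(height_m, weight_kg)

# After standardizing, the covariance is the correlation coefficient,
# which is the same in both unit systems.
corr_cm = covariance(standardize(height_cm), standardize(weight_kg))
corr_m = covariance(standardize(height_m), standardize(weight_kg))

print(cov_cm, cov_m)
print(corr_cm, corr_m)
```

The raw covariances differ by the unit factor of 100, while the standardized values coincide and lie between -1 and 1.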
More dimensions
Covariance Matrix
Calculate the covariance of every pair of coordinates.
\[X^i\in R^d\]
\[C_{jk}=Cov[X_j,X_k]\quad(d\times d\text{ matrix})\]
\[Cov[X_j,X_k]=Cov[X_k,X_j]\]
\[C_{jj}=1\quad\text{(diagonal, after normalization)}\]
After Transform
\[\tilde{C}_{jk}=\sigma_j^2\quad(j=k,\ \text{diagonal})\]
\[\tilde{C}_{jk}=0\quad(j\ne k)\]
What do we want?
Datasets usually have 10 or more dimensions. Many of these dimensions may be related, so we want a new set of coordinates in which the data are unrelated (uncorrelated). The entries of the covariance matrix describe how each axis changes with every other axis, so we look for a transform into a coordinate system whose axes are uncorrelated, and then project each data point onto these axes. In other words, we combine related axes to obtain unrelated ones.
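To make the covariance matrix discussed above concrete, here is a minimal NumPy sketch. It assumes the samples are stored as the columns of a \(d\times N\) array and uses 1/N averaging; the data values are made up.

```python
import numpy as np

# Illustrative data: d = 3 coordinates (rows), N = 4 samples (columns).
X = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
])
N = X.shape[1]

# Center each coordinate, then average the outer products: C = (1/N) Xc Xc^T.
Xc = X - X.mean(axis=1, keepdims=True)
C = Xc @ Xc.T / N

# After also scaling each coordinate to unit standard deviation,
# every diagonal entry of the covariance matrix becomes 1.
Z = Xc / Xc.std(axis=1, keepdims=True)
C_normalized = Z @ Z.T / N

print(C)
print(C_normalized)
```

The large off-diagonal entries show that the three made-up coordinates are strongly related, which is the situation the new coordinate system is meant to remove.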
Steps
- Normalization
- V and X
\[V = \begin{bmatrix} v^1&v^2&\dots\end{bmatrix}\quad\text{(unit vectors of the new axes)}\]
\[X = \begin{bmatrix}x^1&x^2&\dots\end{bmatrix}\]
- Covariance Matrix (with \(\mu = 0\) after normalization)
\[\hat{C}_{jk}=\frac{1}{N}\sum_{i=1}^{N}(x_j^i-\mu_j)(x_k^i-\mu_k)=\frac{1}{N}\sum_{i=1}^{N}x_j^ix_k^i\]
- Matrix Form
\[\hat{C}=\frac{1}{N}XX^T\quad\text{and}\quad\tilde{C}=\frac{1}{N}\tilde{X}\tilde{X}^T\]
- Substitute \(\tilde{X} = V^TX\)
\[\tilde{C}=\frac{1}{N}V^TXX^TV=V^T\hat{C}V\]
Choosing the columns of \(V\) as the orthonormal eigenvectors of \(\hat{C}\) makes \(\tilde{C}\) the diagonal matrix of eigenvalues, and \(V^T=V^{-1}\). The number of dimensions \(D\) we want to keep is usually smaller than the dataset dimension \(d\).
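The steps above can be sketched with NumPy as follows. This is a minimal illustration rather than a reference implementation: the 2-D synthetic data, the random seed, and the choice \(D=1\) are assumptions. `np.linalg.eigh` is used because the covariance matrix is symmetric.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500

# Hypothetical data: the second coordinate is strongly related to the first.
t = rng.normal(size=N)
X = np.vstack([t, 0.8 * t + 0.2 * rng.normal(size=N)])  # shape (d, N), d = 2

# Normalization: mean 0 and standard deviation 1 for every coordinate.
X = X - X.mean(axis=1, keepdims=True)
X = X / X.std(axis=1, keepdims=True)

# Covariance matrix in matrix form: C_hat = (1/N) X X^T.
C_hat = X @ X.T / N

# Columns of V are orthonormal eigenvectors of C_hat.
# eigh returns eigenvalues in ascending order, so reverse to get largest first.
eigvals, V = np.linalg.eigh(C_hat)
eigvals, V = eigvals[::-1], V[:, ::-1]

# Substitute X_tilde = V^T X; its covariance is diagonal.
X_tilde = V.T @ X
C_tilde = X_tilde @ X_tilde.T / N

# Keep only the first D new axes.
D = 1
X_reduced = V[:, :D].T @ X

print(np.round(C_tilde, 6))
print(X_reduced.shape)
```

The printed \(\tilde{C}\) is diagonal up to rounding, and its diagonal entries are the eigenvalues of \(\hat{C}\).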
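\[\mu(D)=\frac{\sigma_1^2+\sigma_2^2+\dots+\sigma_D^2}{\sigma_1^2+\sigma_2^2+\dots+\sigma_d^2}\]
Here \(\mu(D)\) is the fraction of the total variance retained by the first \(D\) new axes, with the eigenvalues \(\sigma_j^2\) sorted in decreasing order. The eigenvectors capture the axes along which the coordinates vary together. The PCA projection gives a clearer view of the data!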
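One way to use the ratio \(\mu(D)\) is to pick the smallest \(D\) that keeps a chosen share of the variance. The sketch below is plain Python; the eigenvalues, the 0.9 threshold, and the helper names are hypothetical.

```python
def retained_variance(eigenvalues, D):
    """mu(D): share of the total variance carried by the D largest eigenvalues."""
    ordered = sorted(eigenvalues, reverse=True)
    return sum(ordered[:D]) / sum(ordered)


def smallest_D(eigenvalues, threshold):
    """Smallest D whose retained variance reaches the given threshold."""
    for D in range(1, len(eigenvalues) + 1):
        if retained_variance(eigenvalues, D) >= threshold:
            return D
    return len(eigenvalues)


# Hypothetical eigenvalues (variances along the new axes), d = 5.
eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.5]

for D in range(1, len(eigenvalues) + 1):
    print(D, retained_variance(eigenvalues, D))

print(smallest_D(eigenvalues, 0.9))
```

The threshold is a modelling choice: a higher value keeps more axes and discards less variance.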
Author: 卢弘毅