Document Mining
The data are now text documents:
- Transform each document into a vector
- over a fixed set of terms
- \(d_k^i\) measures how important term \(t_k\) is in document \(d^i\)
Ways of choosing \(d_k^i\) (a minimal sketch of these schemes follows the list):
- Binary:
\[d_k^i=\left\{\begin{array}{ll} 0 & (t_k \text{ does not appear in } d^i)\\ 1 & (t_k \text{ appears in } d^i)\end{array}\right.\]
- Frequency: \(N_k^i\), the number of times \(t_k\) occurs in \(d^i\)
- Relative frequency: \(\frac{N_k^i}{|d^i|}\)
- Term frequency-inverse document frequency (TF-IDF)
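A minimal sketch of these weighting schemes on a toy two-document corpus (the corpus, the helper name `document_vector`, and all variable names are illustrative assumptions, not from the notes); the TF-IDF branch uses the information-theoretic form derived in the next section, \(\frac{N_k^i}{|d^i|}\log\frac{1}{N_k/N}\):

```python
import math
from collections import Counter

# Toy "library" of two tokenized documents (illustrative data, not from the notes).
library = [
    ["rain", "cloud", "rain", "wind"],
    ["stock", "market", "stock", "cloud"],
]

terms = sorted({t for doc in library for t in doc})   # the fixed term set
N = sum(len(doc) for doc in library)                  # N: total term count in the library
N_k = Counter(t for doc in library for t in doc)      # N_k: occurrences of t_k in the library

def document_vector(doc, scheme="tfidf"):
    """Return d^i as a list indexed like `terms`, under one of the weighting schemes."""
    counts = Counter(doc)        # N_k^i: occurrences of t_k in d^i
    size = len(doc)              # |d^i|
    vec = []
    for t in terms:
        if scheme == "binary":
            w = 1 if counts[t] > 0 else 0
        elif scheme == "frequency":
            w = counts[t]
        elif scheme == "relative":
            w = counts[t] / size
        else:                    # "tfidf": P(t_k|d^i) * log(1 / P(t_k|Library))
            w = (counts[t] / size) * math.log(N / N_k[t])
        vec.append(w)
    return vec

print(terms)
print(document_vector(library[0], "binary"))
print(document_vector(library[0], "tfidf"))
```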
Information Theory Viewpoint
\[d_k^i=\frac{N^i_k}{|d^i|}\log\frac{1}{N_k/N}\]
\[\frac{N_k}{N}=P(t_k\mid \text{Library}),\qquad \frac{N_k^i}{|d^i|}=P(t_k\mid d^i)\]
Consider two dice with outcomes \(\{1,2,3,4,5,6\}\), a random variable \(X\) with distribution \(P(X)\), and event set \(A\).
For an event \(x\in A\) (e.g. \(x=3\)):
- \(P(x)\uparrow\) => lower surprise => less information.
- \(P(x)\downarrow\) => higher surprise => more information.
So the information in \(x\) should grow with \(\frac{1}{P(x)}\). To fix the scale, consider transmitting events over a binary channel: with \(N\) equally likely events, each event needs a codeword of \(\log_2 N\) bits (worked out below for \(N=8\)).
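As a concrete check (the choice \(N=8\) is mine, for illustration only): with \(N=8\) equally likely events, each event gets a fixed 3-bit codeword (000, 001, ..., 111), which already matches the logarithmic information measure:
\[I(x)=\log_2\frac{1}{P(x)}=\log_2\frac{1}{1/8}=\log_2 8=3\text{ bits}.\]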
New Situation
Unfair die 1, with distribution \(P(x)\).
Information: \(I(x)=\log_2\frac{1}{P(x)}\); the optimal average code length is:
\[E_{P(x)}\left[\log_2\frac{1}{P(x)}\right]=\sum_{x\in A}P(x)\log_2\frac{1}{P(x)}=H(X)\quad(\text{entropy})\]
But we can only estimate \(P(x)\): nature follows \(P(x)\), while what we actually have is our estimate \(Q(x)\), so the code lengths we use are \(I(x)=\log_2\frac{1}{Q(x)}\).
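A minimal numerical sketch of the entropy formula (the loaded-die probabilities below are made up for illustration): a fair die has the maximum entropy \(\log_2 6\), while a biased one needs fewer bits on average.

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(X) = sum_x P(x) * log2(1/P(x)), in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                 # treat 0 * log2(1/0) as 0
    return float(np.sum(p * np.log2(1.0 / p)))

fair_die = np.ones(6) / 6                                # P(x) = 1/6 for every face
loaded_die = np.array([0.5, 0.1, 0.1, 0.1, 0.1, 0.1])    # an illustrative unfair die

print(entropy(fair_die))     # log2(6) ~ 2.585 bits
print(entropy(loaded_die))   # ~ 2.161 bits: less uncertainty, shorter optimal codes
```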
True average code length (expectation under \(P\), using the code built from \(Q\)):
\[E_{P(x)}\left[\log_2\frac{1}{Q(x)}\right]=\sum_{x\in A}P(x)\log_2\frac{1}{Q(x)}\]
But, Gul'dan, what is the cost?
The cost is the K-L divergence:
\[E_{P(x)}\left[\log_2\frac{1}{Q(x)}\right]-E_{P(x)}\left[\log_2\frac{1}{P(x)}\right]\]
\[=\sum_{x\in A}P(x)\left[\log_2\frac{1}{Q(x)}+\log_2 P(x)\right]\]
\[=\sum_{x\in A}P(x)\log_2\frac{P(x)}{Q(x)}\quad(\text{K-L divergence})\]
This measures the difference between two probability distributions. It is also why TF-IDF uses the cross-entropy term \(E_{P(x)}[\log_2\frac{1}{Q(x)}]\): it lets a document stand out against the distribution of the whole library.
The term \(E_{P(x)}[\log_2\frac{1}{P(x)}]\) only normalizes the difference; it does not depend on \(Q\).
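A small numeric sketch of these three quantities (the distributions \(P\) and \(Q\) below are illustrative): coding with the wrong distribution costs exactly the K-L divergence, which is never negative.

```python
import numpy as np

def avg_code_length(p, q):
    """Cross entropy E_P[log2(1/Q)]: average code length when nature follows p
    but the code is built from the estimate q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log2(1.0 / q)))

p = np.array([0.5, 0.25, 0.25])     # the "true" distribution of nature
q = np.array([1/3, 1/3, 1/3])       # our (wrong) estimate

cross_entropy = avg_code_length(p, q)   # ~ 1.585 bits
H = avg_code_length(p, p)               # entropy H(P) = 1.5 bits, the optimum
kl = cross_entropy - H                  # ~ 0.085 bits of extra cost, always >= 0
print(cross_entropy, H, kl)
```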
Some facts about entropy
\[H[X,Y]=H[X]+H[Y]\quad(X,Y\text{ independent})\]
\[H[X,Y]=H[X]+H[Y\mid X]\quad(\text{in general})\]
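A quick numerical check of the chain rule (the 2x2 joint distribution is an arbitrary example of mine, not from the notes): the joint entropy equals \(H[X]+H[Y\mid X]\) whether or not \(X\) and \(Y\) are independent.

```python
import numpy as np

def H(p):
    """Entropy in bits of a (possibly multi-dimensional) probability table."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(np.sum(p * np.log2(1.0 / p)))

# Joint distribution P(X, Y) on a 2x2 grid (illustrative numbers).
pxy = np.array([[0.4, 0.1],
                [0.2, 0.3]])

px = pxy.sum(axis=1)                                   # marginal P(X)
H_joint = H(pxy)                                       # H[X, Y]
H_x = H(px)                                            # H[X]
H_y_given_x = sum(px[i] * H(pxy[i] / px[i])            # H[Y|X] = sum_x P(x) H(Y | X=x)
                  for i in range(len(px)))

print(H_joint, H_x + H_y_given_x)                      # both ~ 1.846 bits
```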
Semantic Meaning
In document mining, a document vector usually has 1000 or more dimensions: \(d^i\in\mathbb{R}^T\) with \(T\approx1000\).
But words are usually related! If an article mentions 'rain', it is very likely to also mention 'cloud', so PCA can give a remarkable reduction.
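A toy sketch of why PCA helps here (the tiny term-document matrix is invented for illustration): because 'rain' and 'cloud' co-occur, they load on the same principal direction, and one or two components already capture most of the variance.

```python
import numpy as np

# Toy term-document matrix: rows = documents, columns = terms.
# Columns: ['rain', 'cloud', 'stock', 'market']  (illustrative counts)
X = np.array([
    [3.0, 2.0, 0.0, 0.0],   # weather article
    [4.0, 3.0, 0.0, 1.0],   # weather article
    [0.0, 0.0, 5.0, 4.0],   # finance article
    [1.0, 0.0, 3.0, 3.0],   # finance article
])

# PCA via SVD of the mean-centered matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

explained = s**2 / np.sum(s**2)
print(explained)            # the first component dominates the variance
print(Vt[0])                # its loadings pair 'rain'/'cloud' against 'stock'/'market'

# Reduced representation: project every document onto the top 2 components.
Z = Xc @ Vt[:2].T
print(Z)
```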
Author: 卢弘毅