Document Mining
The data are now text documents:
- Transform each document into a vector
- over a fixed set of terms
- \(d_k^i\) measures how important term \(t_k\) is in document \(d^i\)
Ways of choosing \(d_k^i\) (a minimal sketch of these schemes follows the list):
- Binary:
\[d_k^i=\left\{\begin{array}{ll} 0 & (t_k \text{ does not appear in } d^i)\\ 1 & (t_k \text{ appears in } d^i)\end{array}\right.\]
- Frequency: \(N_k^i\), the number of times \(t_k\) occurs in \(d^i\)
- Relative frequency: \(\frac{N_k^i}{|d^i|}\)
- Term frequency-inverse document frequency (TF-IDF)
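A minimal sketch of these weighting schemes on a toy two-document corpus (the corpus, the helper name `document_vector`, and all variable names are illustrative assumptions, not from the notes); the TF-IDF branch uses the information-theoretic form derived in the next section, \(\frac{N_k^i}{|d^i|}\log\frac{1}{N_k/N}\):

```python
import math
from collections import Counter

# Toy "library" of two tokenized documents (illustrative data, not from the notes).
library = [
    ["rain", "cloud", "rain", "wind"],
    ["stock", "market", "stock", "cloud"],
]

terms = sorted({t for doc in library for t in doc})   # the fixed term set
N = sum(len(doc) for doc in library)                  # N: total term count in the library
N_k = Counter(t for doc in library for t in doc)      # N_k: occurrences of t_k in the library

def document_vector(doc, scheme="tfidf"):
    """Return d^i as a list indexed like `terms`, under one of the weighting schemes."""
    counts = Counter(doc)        # N_k^i: occurrences of t_k in d^i
    size = len(doc)              # |d^i|
    vec = []
    for t in terms:
        if scheme == "binary":
            w = 1 if counts[t] > 0 else 0
        elif scheme == "frequency":
            w = counts[t]
        elif scheme == "relative":
            w = counts[t] / size
        else:                    # "tfidf": P(t_k|d^i) * log(1 / P(t_k|Library))
            w = (counts[t] / size) * math.log(N / N_k[t])
        vec.append(w)
    return vec

print(terms)
print(document_vector(library[0], "binary"))
print(document_vector(library[0], "tfidf"))
```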
Information Theory Viewpoint
\[d_k^i=\frac{N^i_k}{|d^i|}\log\frac{1}{N_k/N}\]
\[\frac{N_k}{N}=P(t_k\mid \text{Library}),\qquad \frac{N_k^i}{|d^i|}=P(t_k\mid d^i)\]
Consider two dice with outcomes \(\{1,2,3,4,5,6\}\), a random variable \(X\) with distribution \(P(X)\), and event set \(A\).
For an event \(x\in A\) (e.g. \(x=3\)):
- \(P(x)\uparrow\) => lower surprise => less information.
- \(P(x)\downarrow\) => higher surprise => more information.
So the information in \(x\) should grow with \(\frac{1}{P(x)}\). To fix the scale, consider transmitting events over a binary channel: with \(N\) equally likely events, each event needs a codeword of \(\log_2 N\) bits (worked out below for \(N=8\)).
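As a concrete check (the choice \(N=8\) is mine, for illustration only): with \(N=8\) equally likely events, each event gets a fixed 3-bit codeword (000, 001, ..., 111), which already matches the logarithmic information measure:
\[I(x)=\log_2\frac{1}{P(x)}=\log_2\frac{1}{1/8}=\log_2 8=3\text{ bits}.\]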
New Situation
Unfair die 1, with distribution \(P(x)\).
Information: \(I(x)=\log_2\frac{1}{P(x)}\); the optimal average code length is:
\[E_{P(x)}\left[\log_2\frac{1}{P(x)}\right]=\sum_{x\in A}P(x)\log_2\frac{1}{P(x)}=H(X)\quad(\text{entropy})\]
But we can only estimate \(P(x)\): nature follows \(P(x)\), while what we actually have is our estimate \(Q(x)\), so the code lengths we use are \(I(x)=\log_2\frac{1}{Q(x)}\).
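A minimal numerical sketch of the entropy formula (the loaded-die probabilities below are made up for illustration): a fair die has the maximum entropy \(\log_2 6\), while a biased one needs fewer bits on average.

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(X) = sum_x P(x) * log2(1/P(x)), in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                 # treat 0 * log2(1/0) as 0
    return float(np.sum(p * np.log2(1.0 / p)))

fair_die = np.ones(6) / 6                                # P(x) = 1/6 for every face
loaded_die = np.array([0.5, 0.1, 0.1, 0.1, 0.1, 0.1])    # an illustrative unfair die

print(entropy(fair_die))     # log2(6) ~ 2.585 bits
print(entropy(loaded_die))   # ~ 2.161 bits: less uncertainty, shorter optimal codes
```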
True average code length (expectation under \(P\), using the code built from \(Q\)):
\[E_{P(x)}\left[\log_2\frac{1}{Q(x)}\right]=\sum_{x\in A}P(x)\log_2\frac{1}{Q(x)}\]
But, Gul'dan, what is the cost?
The cost is the K-L divergence:
\[E_{P(x)}\left[\log_2\frac{1}{Q(x)}\right]-E_{P(x)}\left[\log_2\frac{1}{P(x)}\right]\]
\[=\sum_{x\in A}P(x)\left[\log_2\frac{1}{Q(x)}+\log_2 P(x)\right]\]
\[=\sum_{x\in A}P(x)\log_2\frac{P(x)}{Q(x)}\quad(\text{K-L divergence})\]
This measures the difference between two probability distributions. It is also why TF-IDF uses the cross-entropy term \(E_{P(x)}[\log_2\frac{1}{Q(x)}]\): it lets a document stand out against the distribution of the whole library.
The term \(E_{P(x)}[\log_2\frac{1}{P(x)}]\) only normalizes the difference; it does not depend on \(Q\).
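A small numeric sketch of these three quantities (the distributions \(P\) and \(Q\) below are illustrative): coding with the wrong distribution costs exactly the K-L divergence, which is never negative.

```python
import numpy as np

def avg_code_length(p, q):
    """Cross entropy E_P[log2(1/Q)]: average code length when nature follows p
    but the code is built from the estimate q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log2(1.0 / q)))

p = np.array([0.5, 0.25, 0.25])     # the "true" distribution of nature
q = np.array([1/3, 1/3, 1/3])       # our (wrong) estimate

cross_entropy = avg_code_length(p, q)   # ~ 1.585 bits
H = avg_code_length(p, p)               # entropy H(P) = 1.5 bits, the optimum
kl = cross_entropy - H                  # ~ 0.085 bits of extra cost, always >= 0
print(cross_entropy, H, kl)
```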
Some facts about entropy
\[H[X,Y]=H[X]+H[Y]\quad(X,Y\text{ independent})\]
\[H[X,Y]=H[X]+H[Y\mid X]\quad(\text{in general})\]
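A quick numerical check of the chain rule (the 2x2 joint distribution is an arbitrary example of mine, not from the notes): the joint entropy equals \(H[X]+H[Y\mid X]\) whether or not \(X\) and \(Y\) are independent.

```python
import numpy as np

def H(p):
    """Entropy in bits of a (possibly multi-dimensional) probability table."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(np.sum(p * np.log2(1.0 / p)))

# Joint distribution P(X, Y) on a 2x2 grid (illustrative numbers).
pxy = np.array([[0.4, 0.1],
                [0.2, 0.3]])

px = pxy.sum(axis=1)                                   # marginal P(X)
H_joint = H(pxy)                                       # H[X, Y]
H_x = H(px)                                            # H[X]
H_y_given_x = sum(px[i] * H(pxy[i] / px[i])            # H[Y|X] = sum_x P(x) H(Y | X=x)
                  for i in range(len(px)))

print(H_joint, H_x + H_y_given_x)                      # both ~ 1.846 bits
```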
Semantic Meaning
In document mining, a document vector usually has 1000 or more dimensions: \(d^i\in\mathbb{R}^T\) with \(T\approx1000\).
But words are usually related! If an article mentions 'rain', it is very likely to also mention 'cloud', so PCA can give a remarkable reduction.
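A toy sketch of why PCA helps here (the tiny term-document matrix is invented for illustration): because 'rain' and 'cloud' co-occur, they load on the same principal direction, and one or two components already capture most of the variance.

```python
import numpy as np

# Toy term-document matrix: rows = documents, columns = terms.
# Columns: ['rain', 'cloud', 'stock', 'market']  (illustrative counts)
X = np.array([
    [3.0, 2.0, 0.0, 0.0],   # weather article
    [4.0, 3.0, 0.0, 1.0],   # weather article
    [0.0, 0.0, 5.0, 4.0],   # finance article
    [1.0, 0.0, 3.0, 3.0],   # finance article
])

# PCA via SVD of the mean-centered matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

explained = s**2 / np.sum(s**2)
print(explained)            # the first component dominates the variance
print(Vt[0])                # its loadings pair 'rain'/'cloud' against 'stock'/'market'

# Reduced representation: project every document onto the top 2 components.
Z = Xc @ Vt[:2].T
print(Z)
```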
Author: 卢弘毅