# Data are now text documents

• Transform each document into a vector:
$d^i\rightarrow \underline{d^i}=\begin{bmatrix}d^i_1\\ d^i_2\\ \vdots\\ d^i_T\end{bmatrix}\quad(\underline{d^i} \in \mathbb{R}^T)$
• Set of terms:
$T = \{t_1, t_2, \ldots, t_T\}$
• $$d_k^i$$ measures how important term $$t_k$$ is in $$d^i$$

# Ways of deciding $$d^i_k$$

• Binary:

$d_k^i= \begin{cases} 0 & t_k \text{ does not occur in } d^i\\ 1 & t_k \text{ occurs in } d^i \end{cases}$
• Frequency, where $$N^i_k$$ is the number of occurrences of $$t_k$$ in $$d^i$$:

$d_k^i=N^i_k$
• Relative frequency, where $$|d^i|$$ is the length of $$d^i$$ in tokens:
$d_k^i=\frac{N^i_k}{|d^i|}$
• Term frequency-inverse document frequency (TF-IDF), where $$N$$ is the number of documents in the library and $$N_k$$ the number of documents containing $$t_k$$ (all four weightings are sketched in code below):
$d_k^i=\frac{N^i_k}{|d^i|}\log\frac{N}{N_k}$
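As a quick illustration, here is a minimal sketch of all four weightings on a made-up toy corpus (the corpus, the whitespace tokenization, and the reading of $$N$$ as number of documents and $$N_k$$ as number of documents containing $$t_k$$ are assumptions):

```python
import math
from collections import Counter

# Toy library (assumption: tiny whitespace-tokenized corpus for illustration).
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "a cat and a dog".split(),
]
N = len(docs)                        # N: number of documents in the library
counts = [Counter(d) for d in docs]  # counts[i][t_k] = N_k^i

def weight(i, term, scheme="tfidf"):
    """Weight d_k^i of term t_k in document d^i under the chosen scheme."""
    N_ki = counts[i][term]                     # N_k^i
    doc_len = len(docs[i])                     # |d^i|
    N_k = sum(1 for c in counts if term in c)  # documents containing t_k
    if scheme == "binary":
        return 1 if N_ki > 0 else 0
    if scheme == "freq":
        return N_ki
    if scheme == "relfreq":
        return N_ki / doc_len
    # TF-IDF: (N_k^i / |d^i|) * log(N / N_k)
    return (N_ki / doc_len) * math.log(N / N_k) if N_k else 0.0

print(weight(0, "the", "relfreq"))  # 2/6 ≈ 0.333
print(weight(0, "cat", "tfidf"))    # (1/6) * log(3/2) ≈ 0.068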

# Information Theory Viewpoint

$d_k^i=\frac{N^i_k}{|d^i|}\log\frac{1}{N_k/N}$, where $\frac{N_k}{N}=P(t_k|\text{Library})$ and $\frac{N_k^i}{|d^i|}=P(t_k|d^i)$.
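For example (the numbers are invented for illustration): if $$t_k$$ accounts for 1% of the library but 5% of $$d^i$$, then

$d_k^i=0.05\log_2\frac{1}{0.01}=0.05\log_2 100\approx 0.332$

so the term gets a large weight precisely because it is far more common in the document than in the library at large.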

Consider two dice with outcomes $$\{1,2,3,4,5,6\}$$, a random variable $$X$$ with distribution $$P(X)$$, and an event set $$A$$.

Take an event $$x\in A$$ (e.g. $$x=3$$). For this event:

$$P(x)\uparrow$$ => lower surprise => less information.

$$P(x)\downarrow$$ => higher surprise => more information.

A first guess: information grows with $$\frac{1}{P(x)}$$. To make this precise, consider an information-transfer (coding) situation.

Binary information channel with $$N$$ equally likely events: each event needs $$\log_2N$$ bits.
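A quick numeric check of the equally likely case (the values of $$N$$ are arbitrary):

```python
import math

# N equally likely events: each event carries log2(N) bits of information,
# e.g. N = 8 outcomes can be labeled with 3-bit codewords 000, ..., 111.
for N in (2, 4, 8, 6):
    print(f"N = {N}: log2(N) = {math.log2(N):.3f} bits per event")
# A fair die (N = 6) needs log2(6) ≈ 2.585 bits per outcome.
```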

# New Situation

Unfair die 1, with distribution $$P(x)$$.

Information: $$I(x)=\log_2\frac{1}{P(x)}$$, and the optimal average code length is the entropy:

$E_{P(x)}[\log_2\frac{1}{P(x)}]=\sum_{x\in A}P(x)\log_2\frac{1}{P(x)}=H(X)\quad(\text{entropy})$
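A minimal sketch of this, assuming a hypothetical loaded die for $$P(x)$$:

```python
import math

# Hypothetical unfair die: face 6 comes up half the time (assumption).
P = {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}

# H(X) = sum_x P(x) log2(1/P(x)): the optimal average code length in bits.
H = sum(p * math.log2(1 / p) for p in P.values())
print(f"H(X) = {H:.3f} bits")  # ≈ 2.161 < log2(6) ≈ 2.585 for a fair die
```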

But we can only estimate $$P(x)$$: 'nature' draws from $$P(x)$$, while what we have is our estimate $$Q(x)$$, so the code length we actually assign is $$I(x)=\log_2\frac{1}{Q(x)}$$.

True average code length (coding with $$Q$$ while nature follows $$P$$):

$E_{P(x)}[\log_2\frac{1}{Q(x)}]=\sum_{x\in A}P(x)\log_2\frac{1}{Q(x)}=H(P,Q)\quad(\text{cross entropy})$

# But, Gul'dan, what is the cost?

The cost is the K-L divergence:

$E_{P(x)}[\log_2\frac{1}{Q(x)}]-E_{P(x)}[\log_2\frac{1}{P(x)}]=\sum_{x\in A}P(x)[\log_2\frac{1}{Q(x)}+\log_2 P(x)]=\sum_{x\in A}P(x)\log_2\frac{P(x)}{Q(x)}=D_{KL}(P\,\|\,Q)$
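A numeric check that the cost is cross entropy minus entropy; the unfair die $$P$$ and the fair-die estimate $$Q$$ are assumptions for illustration:

```python
import math

P = {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}  # nature's unfair die
Q = {x: 1 / 6 for x in range(1, 7)}                    # our fair-die estimate

H_P  = sum(p * math.log2(1 / p) for p in P.values())   # entropy H(P)
H_PQ = sum(P[x] * math.log2(1 / Q[x]) for x in P)      # cross entropy H(P, Q)
D_KL = sum(P[x] * math.log2(P[x] / Q[x]) for x in P)   # K-L divergence

print(f"H(P) = {H_P:.3f}, H(P,Q) = {H_PQ:.3f}, D_KL = {D_KL:.3f}")
assert abs(D_KL - (H_PQ - H_P)) < 1e-12  # cost = cross entropy - entropy
```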

This can be used to evaluate the difference between two probability distributions. That is also why TF-IDF uses $$E_{P(x)}[\log_2\frac{1}{Q(x)}]$$, the cross entropy with $$P=P(t_k|d^i)$$ and $$Q=P(t_k|\text{Library})$$: it lets the document stand out against the distribution of the whole library.

The $$E_{P(x)}[\log_2\frac{1}{P(x)}]$$ term just normalizes the difference.
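A sketch of that reading: treating $$N_k/N$$ as the term's relative frequency over all library tokens (an assumption, matching this section's identification $$\frac{N_k}{N}=P(t_k|\text{Library})$$), the TF-IDF summands of a document add up to exactly the cross entropy $$E_{P(x)}[\log_2\frac{1}{Q(x)}]$$:

```python
import math
from collections import Counter

# Toy library (assumption); each TF-IDF summand P(t|d) * log2(N/N_k) is one
# term of the cross entropy between document and library distributions.
library = ["the cat sat".split(), "the dog ran".split(), "the cat ran".split()]
doc = library[0]

tokens = [t for d in library for t in d]
Q = Counter(tokens)  # library term counts N_k (token counts, per this reading)
P = Counter(doc)     # document term counts N_k^i

cross_entropy = sum(
    (P[t] / len(doc)) * math.log2(len(tokens) / Q[t])  # P(t|d) log2 1/P(t|Lib)
    for t in P
)
print(f"H(P, Q) = {cross_entropy:.3f} bits")  # ≈ 2.308
```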

$H[X,Y]=H[X]+H[Y]\quad(X, Y\ \text{independent})$

$H[X,Y]=H[X]+H[Y|X]\quad(\text{in general})$
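A numeric check of the chain rule, using a made-up dependent pair where $$Y$$ copies $$X$$ with probability 0.9:

```python
import math

# Hypothetical joint distribution over two binary variables (assumption:
# X is a fair coin and Y copies X with probability 0.9).
Pxy = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}
Px = {0: 0.5, 1: 0.5}

def H(dist):
    """Entropy of a distribution given as {outcome: probability}."""
    return sum(p * math.log2(1 / p) for p in dist.values() if p > 0)

# H[Y|X] = sum_x P(x) * H(Y | X = x)
H_Y_given_X = sum(
    Px[x] * H({y: Pxy[x, y] / Px[x] for y in (0, 1)}) for x in (0, 1)
)
print(f"H[X,Y] = {H(Pxy):.3f}")                      # ≈ 1.469
print(f"H[X] + H[Y|X] = {H(Px) + H_Y_given_X:.3f}")  # ≈ 1.469, matches
```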
In document mining, a document vector usually has 1000 or more dimensions: $$d^i\rightarrow \underline{d^i}\in \mathbb{R}^T\ (T\approx1000)$$.