A **Topic Model** is a type of statistical model for finding the topics that occur in a collection of documents.
It is a frequently used text-mining tool for discovering the hidden semantic structure in a body of text. For example, imagine we have some articles or a series of social media messages, and we want to understand what is going on inside them. A common way to approach this problem is with an **Unsupervised Machine Learning model**.
From a high-level perspective, given a collection of documents, and much as in **K-Means Clustering**, we want to find the **K** topics that best describe our corpus of text.

From the figure above, we have three different topics: technology (yellow), business (orange), and arts (blue). In addition to the topics themselves, we also have an association of documents to topics. In fact, a document can be entirely about a single topic such as technology (i.e. red light, green light, 2-tone led, simplify screen), but a document can also be a mixture of two or more topics (see the grey text in the figure below).

Topic modeling can be seen as a **Matrix Factorization Problem**, where **K** is the number of topics, **M** is the number of documents, and **V** is the size of the vocabulary.

The **MxV** document-word matrix is factorized into an **MxK** document-topic matrix and a **KxV** matrix that corresponds to the distribution of words for each of the topics; **Latent Semantic Analysis** is widely used to find this factorization.

An alternative to the **Matrix Factorization Problem** is the **Generative Model**.

In particular, **Latent Dirichlet Allocation (LDA)** is the generative model most commonly used in topic modelling.

\[ P(\boldsymbol{p} | \alpha \boldsymbol{m})=\frac{\Gamma\left(\sum_{k} \alpha m_{k}\right)}{\prod_{k} \Gamma\left(\alpha m_{k}\right)} \prod_{k} p_{k}^{\alpha m_{k}-1} \]

Here above, we have the Dirichlet Distribution equation, where **alpha** is a concentration parameter that controls the variance (the larger **alpha**, the smaller the variance), and **m** is the mean.

As described in the figure above, when we have a **uniform mean** (1/3, 1/3, 1/3) and a **concentration parameter** (alpha) of three, multiplying them together gives **alpha·m_k = 1** for every component, so the exponents in the density vanish and the distribution is uniform over the simplex: each topic mixture is equally likely.
But as **alpha** grows larger and larger, the Dirichlet concentrates the distribution around the mean (the dark circle in the middle of the triangle).
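The effect of alpha can be checked numerically with numpy's Dirichlet sampler. The specific alpha values (3 and 300) and sample size below are illustrative assumptions:

```python
# Draw from a Dirichlet with parameters alpha * m and compare the spread.
# alpha = 3 with a uniform mean gives parameters (1, 1, 1): the uniform
# density over the simplex. A large alpha concentrates draws at the mean.
import numpy as np

rng = np.random.default_rng(0)
m = np.array([1/3, 1/3, 1/3])   # uniform mean

uniform_draws = rng.dirichlet(3 * m, size=1000)    # alpha * m = (1, 1, 1)
peaked_draws = rng.dirichlet(300 * m, size=1000)   # concentrated at the mean

# Per-component standard deviation shrinks as alpha grows
print("alpha =   3:", uniform_draws.std(axis=0))
print("alpha = 300:", peaked_draws.std(axis=0))
```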

There are other ways to parametrize the LDA distribution.

For example, we can place the mean in a different location (see the left and center figures below).

We can also make the **concentration parameter alpha** smaller than **1**. In this case we push the probability mass to the edges of the simplex (see the right figure below): with **alpha < 1** we have a preference for multinomial distributions that land far away from the center.
To be **far away from the center** means that each draw concentrates its mass on just a few topics rather than mixing all of them evenly. This matches how people actually write documents, where each document tends to cover only a few concepts.
In other words, the Dirichlet Distribution gives us a distribution over all the places where the Multinomial Distribution can land.
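The sparsity induced by alpha < 1 is easy to see empirically. The alpha values and the 90% dominance threshold below are illustrative assumptions:

```python
# With alpha < 1 the Dirichlet favors the corners of the simplex, so most
# draws are dominated by a single component (a nearly pure topic mixture).
import numpy as np

rng = np.random.default_rng(0)
m = np.ones(3) / 3   # uniform mean over 3 topics

sparse_draws = rng.dirichlet(0.1 * np.ones(3), size=1000)   # alpha < 1
dense_draws = rng.dirichlet(10 * np.ones(3), size=1000)     # alpha > 1

# Fraction of draws where a single component holds > 90% of the mass
print("alpha = 0.1:", (sparse_draws.max(axis=1) > 0.9).mean())
print("alpha = 10 :", (dense_draws.max(axis=1) > 0.9).mean())
```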

The **Dirichlet Distribution** can be used to isolate which document is about a specific topic.
For each topic **k**, we draw a multinomial distribution **beta_k**, called the **Topic Distribution**, from a Dirichlet Distribution with parameter **lambda**.
The next step is to draw, for each document, a **multinomial distribution over topics** represented by **theta_d**. Once we have it, we can draw for each word the so-called **Topic Assignment**, represented in the figure below by **z_n**.
Up to this point, we still don't know what the word is. We have to look at the **Topic Distribution beta_k** of the assigned topic in order to generate the word, which is drawn from that multinomial distribution.
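The generative story above can be sketched directly in numpy. All sizes and hyperparameter values here (K, M, V, the document length, lambda, alpha) are made-up illustrative assumptions:

```python
# LDA generative process sketch:
#   beta_k  ~ Dirichlet(lambda)   -> topic distribution over words
#   theta_d ~ Dirichlet(alpha)    -> topic mixture of document d
#   z_n     ~ Multinomial(theta_d) -> topic assignment of word n
#   w_n     ~ Multinomial(beta_{z_n}) -> the observed word
import numpy as np

rng = np.random.default_rng(0)
K, M, V, N = 3, 5, 20, 10      # topics, documents, vocab size, words/doc
lam, alpha = 0.5, 0.8          # Dirichlet hyperparameters (illustrative)

# One topic-word distribution beta_k per topic: a K x V matrix
beta = rng.dirichlet(np.full(V, lam), size=K)

corpus = []
for d in range(M):
    theta_d = rng.dirichlet(np.full(K, alpha))   # topic mixture for doc d
    doc = []
    for n in range(N):
        z_n = rng.choice(K, p=theta_d)           # topic assignment
        w_n = rng.choice(V, p=beta[z_n])         # word drawn from topic z_n
        doc.append(w_n)
    corpus.append(doc)

print(f"{len(corpus)} documents of {len(corpus[0])} word ids each")
```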

The graph above is a representation of the probabilistic model, also called **Plate Notation**.

As we can see, the Plate Notation shows **K** topics, **M** documents, and **N** words in each document.

Crucially, the only things that we observe are the **words**, and our task is to infer the topic assignments **z_n**.
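Inverting the generative process, i.e. recovering topics from observed words alone, is exactly what scikit-learn's `LatentDirichletAllocation` does. The toy corpus and K = 2 here are illustrative assumptions:

```python
# Fit LDA on observed words only and recover per-document topic mixtures.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "screen led technology device screen",
    "led device technology screen led",
    "market business economy trade",
    "business trade market economy",
]

# M x V document-word count matrix: the only observed data
X = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)   # M x K: theta_d for each document
topic_word = lda.components_       # K x V: unnormalized beta_k

# Each row of doc_topic is a topic mixture that sums to 1
print(doc_topic.round(2))
```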

Ideally, once we have the collection of words per topic, we can check interpretability with a word-intrusion test: if the topic is interpretable, people will consistently pick out the true **Intruder**, i.e. the word that doesn't belong.

To learn more about LDA please check out this **link**.