Tuesday, April 1, 2008

Precision Vs Recall

In any application involving Information Retrieval, one of the main concerns is balancing Precision and Recall. The typical scenario is that a user submits a query and the documents in the corpus that match it are retrieved using some information retrieval model. Well-known information retrieval models include the Vector Space model, Latent Semantic Indexing and several probabilistic models.

Precision = Number of relevant documents retrieved / Total number of retrieved documents.
Recall = Number of relevant documents retrieved / Total number of relevant documents in the corpus.
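
As a minimal sketch of the two ratios above, assuming the retrieved and relevant documents are available as collections of document IDs (the function name `precision_recall` and the IDs are made up for illustration):

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    A ratio with an empty denominator is reported as None,
    because 0/0 is undefined.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else None
    recall = hits / len(relevant) if relevant else None
    return precision, recall


# Hypothetical query result: 4 documents returned, 3 relevant ones exist.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d5"]
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four returned documents are relevant, and two of the three relevant documents were found.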

Now one can observe that Recall will be 1 if one returns all the documents in the corpus, but Precision then drops to the fraction of the corpus that is relevant, and the application is useless because the user has to sift through everything. At the other extreme, if we return no documents, Recall is 0 and Precision is undefined (0/0), not infinity. The application is still unusable because nothing is shown to the user.
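
The two extremes can be checked with a small sketch. The 10-document corpus and the choice of relevant documents are invented for illustration, and an undefined ratio is represented as `None`:

```python
# Hypothetical corpus of 10 documents, 2 of which are relevant to the query.
corpus = {f"doc{i}" for i in range(1, 11)}
relevant = {"doc3", "doc7"}


def evaluate(retrieved):
    """Return (precision, recall) for a set of retrieved document IDs."""
    hits = len(retrieved & relevant)
    # With nothing retrieved the precision ratio is 0/0, i.e. undefined.
    precision = hits / len(retrieved) if retrieved else None
    recall = hits / len(relevant)
    return precision, recall


everything = evaluate(corpus)  # return the whole corpus
nothing = evaluate(set())      # return no documents at all
print("return everything:", everything)
print("return nothing:   ", nothing)
```

Returning everything gives perfect recall with a precision of only 2 in 10, while returning nothing gives zero recall and no meaningful precision value.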
So a balance between precision and recall has to be established in any Information Retrieval application. Because natural language is ambiguous, many irrelevant documents match a query, which leaves text search applications with low precision. One of the main reasons Google established itself as a giant is its PageRank algorithm, which greatly improves the precision of the results shown first. So while developing an application, one can improve its performance by weighting precision and recall as the specifications of that particular application require.
For example, there might be applications where all the relevant documents need to be displayed, irrespective of the total number of retrieved documents. In this case one needs to give a high weight to recall while designing the search query. Commercial search engines, on the other hand, require a high degree of precision because the number of web pages (each web page can be viewed as a document, and the set of all web pages is the corpus) is enormous. There are several measures that combine the two, such as the F1 measure, the harmonic mean of precision and recall, in which both are evenly weighted.
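
One common way to express such a weighting is the F-beta measure, of which F1 is the special case beta = 1. The sketch below is only an illustration (the function name `f_beta` and the sample values are assumptions, not taken from any particular library): beta greater than 1 favours recall, beta less than 1 favours precision.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta = 1 gives the F1 measure; beta > 1 weights recall more heavily,
    beta < 1 weights precision more heavily.
    """
    if precision == 0 and recall == 0:
        return 0.0  # convention used here to avoid dividing by zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical system with low precision but perfect recall.
p, r = 0.2, 1.0
f1 = f_beta(p, r)                # precision and recall evenly weighted
f2 = f_beta(p, r, beta=2.0)      # recall-oriented application
f_half = f_beta(p, r, beta=0.5)  # precision-oriented application
print(f"F1={f1:.3f} F2={f2:.3f} F0.5={f_half:.3f}")
```

The same system scores higher when recall is favoured and lower when precision is favoured, which is the trade-off an application's requirements should decide.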
