I was born in India in the state of Uttar Pradesh (Hindi, my native language, for “Northern State”). I spent most of my boyhood in the foothills of the Himalayas. I got a BS degree in Computer Science from the University of Roorkee (now IIT Roorkee) in India, an MS, Computer Science again, from the University of Minnesota (somehow, back then, I always found myself in cold places) and a PhD in Computer Science from Cornell University. At Cornell I studied with the late Prof. Gerard Salton, one of the founders of the field of IR. Somewhere between my degrees I had real jobs doing database programming and IR system hacking. After my PhD I joined AT&T Labs in 1996. In 2000, my friend Krishna Bharat persuaded me to join Google.
My research interests are in the area of information retrieval (IR), its application to web search, web graph analysis, and user interfaces for search. Here are some of my selected publications (chronologically ordered). At Google I have worked on using IR techniques to improve web search. Before joining Google in 2000, I did research in the following sub-areas of Information Retrieval:
- Speech Retrieval: Increasing amounts of spoken communication are stored in digital form for archival purposes (for instance, broadcast material). With advances in automatic speech recognition (ASR) technology, it is now possible to automatically transcribe speech with reasonable accuracy. Once transcribed, IR methods can be used to search speech collections. Think of this as a search engine for speech. However, the interesting problem is to search speech given the large number of automatic speech recognition errors. More recently I have done some work in this area. While I was at AT&T Labs, we developed SCAN, a system that combines speech recognition, information retrieval and user interface techniques to provide a multimodal interface to speech archives.
- Document Ranking: Also called text/document searching/retrieval (that makes four phrases by the way), this is the best-known part of our field. If you are reading this page, chances are that you have already used a “search engine” before. Document ranking is what search engines do: given a user query, rank a large collection of documents (web pages, news articles, your email, someone else’s email that you happen to have hacked, …) so that what you are looking for is ranked ahead of other less useful (or useless) documents.
- Question Answering: People have questions and they need answers, not documents. Automatic question answering will definitely be a significant advance in the state of the art in information retrieval technology. Systems that can do reliable question answering without domain restrictions have not been developed yet. I organized the first few runnings of the QA Track under the Text REtrieval Conference (TREC) umbrella to advance this sub-field of language processing.
- Document Routing/Filtering: This is the “query by example” version of document ranking. Once you point the system to a few “good documents”, the system then tracks all NEW documents and points you to only those that you should be looking at. Typically the system tries to find new documents that are similar to the documents that you said were good.
- Automatic Text Summarization: Documents are huge and we don’t always want to read them all. (I don’t know about you but I certainly don’t have the patience. And given the stuff you find on the web …) Techniques that automatically “summarize” documents will be tremendously useful. Domain-independent text summarization is very hard, at times even for humans; typically machines do summarization by text extraction. Relevant pieces (sentences, paragraphs, …) of text are extracted and presented as a “summary”.
- Miscellaneous (TREC): Since 1992 the National Institute of Standards and Technology (NIST), along with DARPA, has sponsored an annual conference called the Text REtrieval Conference (TREC) to support research within the information retrieval community by providing the infrastructure necessary for large-scale evaluation of text retrieval methodologies. I have been actively participating in TRECs since TREC-3 (held in 1994).
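To make the document ranking idea above a little more concrete, here is a minimal sketch in Python. It is only an illustration, not the method used in any of the systems mentioned on this page: it assumes whitespace tokenization and a textbook TF-IDF weighting with cosine similarity, and the names `tokenize` and `rank` are made up for this example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace; real systems do much more.
    return text.lower().split()


def rank(query, docs):
    """Return document indices ordered from most to least similar to the query."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)

    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency (one common textbook variant).
    idf = {t: math.log((n + 1) / (count + 1)) + 1.0 for t, count in df.items()}

    def vectorize(tokens):
        # Sparse TF-IDF vector; terms never seen in the collection are dropped.
        tf = Counter(tokens)
        return {t: c * idf[t] for t, c in tf.items() if t in idf}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        if norm_a == 0.0 or norm_b == 0.0:
            return 0.0
        return dot / (norm_a * norm_b)

    q = vectorize(tokenize(query))
    scores = [cosine(q, vectorize(tokens)) for tokens in tokenized]
    return sorted(range(n), key=lambda i: scores[i], reverse=True)


docs = [
    "speech recognition for broadcast archives",
    "ranking web pages for a user query",
    "summarizing long documents by extraction",
]
best = rank("web query ranking", docs)[0]
print(docs[best])
```

The same similarity score could, under the same assumptions, serve the routing/filtering case by using a known “good document” in place of the query.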