Approach to bootstrapping domain specific language models

**WARNING**: Please note that domain specific language models are built in ./alex/applications/*/lm
This text explains a simple approach to building domain specific language models, which can be different for every
domain.

While an acoustic model can be built independently of the domain, the language models (LMs) must be domain specific to ensure high accuracy of the ASR.

In general, building an in-domain LM is easy as long as one has enough in-domain training data. However, when in-domain data is scarce, e.g. when deploying a new dialogue system, the task is difficult and some bootstrap solution is needed.

The approach described here builds on:

  1. some bootstrap text - probably handcrafted, which captures the main aspects of the domain
  2. LM classes - which cluster words into classes; these can be derived from some domain ontology. For example, all food types belong to the FOOD class and all public transport stops belong to the STOP class
  3. in-domain data - collected using some prototype or final system
  4. general out-of-domain data - for example Wikipedia - from which a subset of data similar to the in-domain data is selected
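
To illustrate the LM classes from item 2, here is a minimal sketch of replacing words with their class labels before training a class based LM. The class lexicon, the function names, and the example words are all assumptions made for illustration; in practice the classes would be derived from the domain ontology, and this sketch handles single-word entries only.

```python
# Hypothetical class lexicon; a real one would come from the domain ontology.
CLASSES = {
    "FOOD": ["chinese", "indian", "italian"],
    "STOP": ["kingsway", "parkside"],
}


def build_word_to_class(classes):
    """Invert the class lexicon into a word -> class label lookup."""
    word_to_class = {}
    for label, words in classes.items():
        for word in words:
            word_to_class[word] = label
    return word_to_class


def replace_with_classes(sentence, word_to_class):
    """Lowercase the sentence and replace each known word with its class label."""
    tokens = sentence.lower().split()
    return " ".join(word_to_class.get(token, token) for token in tokens)


word_to_class = build_word_to_class(CLASSES)
print(replace_with_classes("I want Chinese food", word_to_class))
```

Multi-word entries, such as stop names consisting of several words, would need phrase matching rather than the token-by-token lookup used here.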

Then a simple process of building a domain specific language model can be described as follows:

  1. Append bootstrap text to the text extracted from the in-domain data.
  2. Build a class based language model using the data generated in the previous step and the classes derived from the domain ontology.
  3. Score the general (domain independent) data using the LM built in the previous step.
  4. Select some sentences with the lowest perplexity given the class based language model.
  5. Append the selected sentences to the training data generated in step 1.
  6. Re-build the class based language model.
  7. Generate dictionaries.
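
The data selection in steps 1 to 5 can be sketched as follows. This is only an illustration under stated assumptions: a unigram model with add-one smoothing stands in for the class based language model, and all function names and example sentences are hypothetical. The actual build.py scripts may use different tools and models.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Train a unigram model with add-one smoothing; returns a word -> probability function."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot for unknown words

    def prob(word):
        return (counts[word] + 1) / (total + vocab_size)

    return prob


def perplexity(sentence, prob):
    """Per-word perplexity of a sentence under the given model."""
    words = sentence.split()
    if not words:
        return float("inf")
    log_prob = sum(math.log(prob(word)) for word in words)
    return math.exp(-log_prob / len(words))


def select_similar(general, prob, n):
    """Return the n general-domain sentences with the lowest perplexity."""
    return sorted(general, key=lambda sentence: perplexity(sentence, prob))[:n]


# Step 1: append the bootstrap text to the in-domain text.
bootstrap = ["i want chinese food", "i want indian food"]
in_domain = ["chinese food please"]
training = bootstrap + in_domain

# Steps 2-4: build a model, score the general data, select the closest sentences.
general = ["the treaty was signed in 1648", "i want food", "food"]
selected = select_similar(general, train_unigram(training), 2)

# Step 5: extend the training data; the model would then be re-built (step 6).
training = training + selected
```

Selecting a fixed number of sentences is one possible design choice; a perplexity threshold would be another way to decide how much general data to keep.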

Structure of each domain's scripts

Each of the projects should contain:

  1. build.py - builds the final LMs and computes their perplexity

Necessary files for the LM

For each domain the LM package should contain:

  1. ARPA trigram language model (final.tg.arpa)
  2. ARPA bigram language model (final.bg.arpa)
  3. HTK wordnet bigram language model (final.bg.wdnet)
  4. List of all words in the language model (final.vocab)
  5. Dictionary including all words in the language model, using a phone set compatible with the language specific acoustic model (final.dict without pauses, and final.dict.sp_sil with short and long pauses)
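
A quick way to verify that an LM package is complete is to check for the files listed above. The sketch below is not part of the project; the function name and the idea of passing the LM directory as an argument are assumptions, and only the file names are taken from the list.

```python
import os

# File names as listed above for each domain LM package.
REQUIRED_FILES = [
    "final.tg.arpa",
    "final.bg.arpa",
    "final.bg.wdnet",
    "final.vocab",
    "final.dict",
    "final.dict.sp_sil",
]


def missing_lm_files(lm_dir):
    """Return the required LM files that are not present in lm_dir."""
    return [
        name
        for name in REQUIRED_FILES
        if not os.path.isfile(os.path.join(lm_dir, name))
    ]
```

An empty list returned for a domain's lm directory means that all expected files are present.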

CamInfoRest

For more details please see alex.applications.CamInfoRest.lm.README.