Building the language model for the Public Transport Info telephone service (Czech)

*WARNING* To build the language model, you will need a machine with a lot of memory (more than 16GB RAM).

The data

To build the domain-specific language model, we follow the approach described in "Approach to bootstraping the domain specific language models". So far, we have collected the following data:

  1. selected out-of-domain data - more than 2000 sentences
  2. bootstrap text - 289 sentences
  3. in-domain data - more than 9000 sentences (of which about 900 are used as development data)
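
The text above does not say how the development sentences are chosen, so the following is only a minimal sketch of one possible split. It assumes that every tenth in-domain sentence is held out, which gives roughly the 900 out of 9000 ratio mentioned above. The function name split_train_dev and the parameter dev_every are hypothetical and are not taken from build.py.

```python
def split_train_dev(sentences, dev_every=10):
    """Hold out every dev_every-th sentence as development data.

    Assumption: a deterministic every-Nth split; the real build.py
    may select its development data differently.
    """
    train, dev = [], []
    for i, sentence in enumerate(sentences):
        # Positions 9, 19, 29, ... (0-based) go to the development set.
        if i % dev_every == dev_every - 1:
            dev.append(sentence)
        else:
            train.append(sentence)
    return train, dev
```

A deterministic split like this keeps the development set stable between rebuilds, so perplexity figures from different runs stay comparable.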

Building the models

The models are built using the build.py script.

The following variables must be set:

bootstrap_text                  = "bootstrap.txt"
classes                         = "../data/database_SRILM_classes.txt"
indomain_data_dir               = "indomain_data"

Description of the variables:

  • bootstrap_text - the bootstrap.txt file contains handcrafted in-domain sentences.
  • classes - the ../data/database_SRILM_classes.txt file is created by the database.py script in the alex/applications/PublicTransportInfoCS/data directory.
  • indomain_data_dir - the indomain_data directory should contain links to directories containing asr_transcribed.xml files with transcribed audio data.
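
Before starting a long build, it can help to check that indomain_data_dir really leads to transcription files. The sketch below is an illustration only and is not part of build.py; the helper name find_transcription_files is hypothetical. It walks the directory, follows symbolic links (since the directory is expected to contain links), and collects every asr_transcribed.xml it finds.

```python
import os


def find_transcription_files(indomain_data_dir, name="asr_transcribed.xml"):
    """Return sorted paths of all files called `name` below the directory.

    followlinks=True is needed because the directory is expected to
    contain symbolic links to the actual data directories.
    Assumption: the links do not form a cycle.
    """
    found = []
    for root, _dirs, files in os.walk(indomain_data_dir, followlinks=True):
        if name in files:
            found.append(os.path.join(root, name))
    return sorted(found)
```

An empty result usually means that the links are missing or broken.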

The process of building/re-building the LM is:

cd ../data
./database.py dump
cd ../lm
./build.py
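
The same two steps can also be driven from Python, which makes it easy to stop as soon as one step fails. This is a minimal sketch, not an existing script in the repository: the names REBUILD_STEPS and rebuild are hypothetical, and it assumes that the data directory sits next to the lm directory, as the cd commands above imply.

```python
import subprocess
from pathlib import Path

# (directory relative to the lm directory, command) pairs, in order.
REBUILD_STEPS = [
    ("../data", ["./database.py", "dump"]),
    (".", ["./build.py"]),
]


def rebuild(lm_dir, run=subprocess.check_call):
    """Run the rebuild steps in order, starting from lm_dir.

    subprocess.check_call raises CalledProcessError on a non-zero exit
    status, so build.py is not started if the database dump fails.
    The `run` parameter exists only so the steps can be dry-run or tested.
    """
    lm_dir = Path(lm_dir)
    for rel_dir, command in REBUILD_STEPS:
        run(command, cwd=str(lm_dir / rel_dir))
```

Calling rebuild(".") from inside the lm directory corresponds to the shell commands above.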

Distribution of the models

The final.* models are large. Therefore, they should be distributed online, on demand, using the online_update function. After each build, remember to place the models generated by the ./build.py script on the distribution servers.

Reuse of build.py

The build.py script can be easily generalised to a different language or different text data, e.g. the in-domain data.