Building an SLU for the PTIen domain¶

Available data¶

At the moment, the only available data were annotated automatically by running our handcrafted SLU (HDC SLU) parser on the transcribed audio. In general, the quality of the automatic annotation is very good.

The data can be prepared using the prepare_data.py script. It assumes that the indomain_data directory exists and contains links to directories with asr_transcribed.xml files. The script extracts the transcriptions from these files and generates automatic SLU annotations using the PTIENHDCSLU parser from the hdc_slu.py file.

The script generates the following files:

*.trn: contains manual transcriptions
*.trn.hdc.sem: contains automatic annotation from transcriptions using handcrafted SLU
*.asr: contains ASR 1-best results
*.asr.hdc.sem: contains automatic annotation from 1-best ASR using handcrafted SLU
*.nbl: contains ASR N-best results
*.nbl.hdc.sem: contains automatic annotation from N-best ASR using handcrafted SLU
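
The naming scheme above can be summarised in a short sketch. The function name expected_outputs and the prefix "all" are illustrative assumptions (the prefix matches the all.trn and all.asr files used in the evaluation section); the helper is not part of prepare_data.py.

```python
def expected_outputs(prefix):
    """Return the six file names listed above for one data prefix (hypothetical helper)."""
    names = []
    for ext in ("trn", "asr", "nbl"):
        names.append(f"{prefix}.{ext}")          # transcriptions / 1-best / N-best
        names.append(f"{prefix}.{ext}.hdc.sem")  # matching HDC SLU annotation
    return names


for name in expected_outputs("all"):
    print(name)
```

Each input file is paired with one *.hdc.sem file holding the HDC SLU annotation of the same content.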

The script accepts the --uniq parameter for fast generation of unique HDC SLU annotations. This is useful when tuning the HDC SLU.

The script also accepts the --fast parameter for fast approximate preparation of all data. With this option, the HDC SLU output for an N-best list is approximated by the output obtained from parsing the 1-best ASR result.

Building the models¶

First, prepare the data. Link the directories with the in-domain data into the indomain_data directory. Then run the following command:

./prepare_data.py
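
The linking step that precedes this command can be sketched in Python as follows. This is only an illustration: the helper link_indomain_data and the example source path are hypothetical, and the same result can be obtained with ln -s.

```python
import os


def link_indomain_data(source_dirs, target_dir="indomain_data"):
    """Symlink each source directory into target_dir and return the link paths.

    Hypothetical helper: source_dirs are directories assumed to contain
    asr_transcribed.xml files.
    """
    os.makedirs(target_dir, exist_ok=True)
    links = []
    for src in source_dirs:
        src = os.path.abspath(src)
        link = os.path.join(target_dir, os.path.basename(src))
        if not os.path.lexists(link):  # do not overwrite an existing link
            os.symlink(src, link)
        links.append(link)
    return links


# Example call (the path is a placeholder, not a real location):
# link_indomain_data(["/path/to/call_logs"])
```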

Second, train and test the models.

./train.py && ./test.py && ./test_bootstrap.py

Third, look at the *.score files or compute the interesting scores by running:

./print_scores.sh

Future work¶

The prepare_data.py script should use ASR, NBLIST, and CONFNET data generated by the latest ASR system instead of the logged ASR results, because the ASR system can change over time.
Condition the SLU DialogueActItem decoding on the previous system dialogue act.

Evaluation¶

Evaluation of ASR from the call log files¶

The current ASR performance computed from the call logs is shown below. Please note that the scoring implicitly ignores all non-speech events.

Ref: all.trn
Tst: all.asr
|==============================================================================================|
|            | # Sentences  |  # Words  |   Corr   |   Sub    |   Del    |   Ins    |   Err    |
|----------------------------------------------------------------------------------------------|
| Sum/Avg    |     9111     |   24728   |  56.15   |  16.07   |  27.77   |   1.44   |  45.28   |
|==============================================================================================|

The results above were obtained using the Google ASR.
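
The columns of the table are related by simple arithmetic, assuming the usual sclite-style definitions (Err = Sub + Del + Ins, with all values given as percentages of the reference words). A minimal check using the numbers from the Sum/Avg row:

```python
# Percentages copied from the Sum/Avg row of the table above.
corr, sub, dele, ins = 56.15, 16.07, 27.77, 1.44

# Assumed definition: word error rate is the sum of the three error types.
err = sub + dele + ins

# Every reference word is either correct, substituted, or deleted,
# so these three should add up to roughly 100 % (up to rounding).
accounted = corr + sub + dele

print(f"Err = {err:.2f} %")                     # Err = 45.28 %
print(f"Corr + Sub + Del = {accounted:.2f} %")  # Corr + Sub + Del = 99.99 %
```

Deletions account for more than half of the total error in this table.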

Evaluation of the minimum number of feature counts¶

Using 9111 training examples, we found that pruning should be set to

min feature count = 3
min classifier count = 4

to prevent overfitting.
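
The exact pruning semantics are defined in the training code. The sketch below only illustrates one plausible reading, in which a feature (or a classifier's target) is dropped when it is seen in fewer than the given number of training examples. The function prune_by_count and the toy feature names are assumptions made for illustration.

```python
from collections import Counter


def prune_by_count(examples, min_count):
    """Drop items that occur in fewer than min_count examples.

    examples: list of sets of hashable items (e.g. feature names).
    Returns the pruned examples and the set of items that were kept.
    """
    counts = Counter(item for example in examples for item in set(example))
    kept = {item for item, count in counts.items() if count >= min_count}
    return [set(example) & kept for example in examples], kept


# Toy data: "from" occurs in three examples, the other features in one each.
features = [{"from", "to"}, {"from"}, {"from", "hello"}]
pruned, kept = prune_by_count(features, min_count=3)
print(kept)  # {'from'}
```

Rare features and rarely seen targets give the model little evidence to generalise from, which is why removing them can reduce overfitting.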

Cheating experiment: train and test on all data¶

Due to data sparsity, evaluation on proper test and dev sets suffers from sampling errors. Therefore, we present here results obtained when all data are used for training and the metrics are evaluated on that same training data.

Running ./print_scores.sh prints scores for assessing the quality of the trained models. The results from experiments are stored in the old.scores.* files. Please look at the results marked as DATA ALL ASR - *.

If the automatic annotations were correct, we could conclude that the F-measure of the HDC SLU parser on 1-best is higher than its F-measure on N-best. This is confusing, as it suggests that decoding from N-best lists gives worse results than decoding from the 1-best ASR hypothesis.

Evaluation of TRN model on test data¶

The TRN model is trained on transcriptions and evaluated on transcriptions from test data. Please look at the results marked as DATA TEST TRN - *. One can see that the performance of the TRN model on TRN test data does NOT reach 100 %. This is probably due to the mismatch between the train and test data sets. Once more training data become available, we can expect better results.

Evaluation of ASR model on test data¶

The ASR model is trained on 1-best ASR output and evaluated on the 1-best ASR output from test data. Please look at the results marked as DATA TEST ASR - *. On the ASR test data, the ASR model scores significantly better than the HDC SLU parser evaluated on the same data. The improvement is about 20 % absolute in F-measure. This shows that an SLU trained on the ASR data can be beneficial.
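
For readers unfamiliar with the metric, the sketch below shows one common way to compute an F-measure over dialogue act items: precision and recall are micro-averaged over all utterances. This is an assumption for illustration, since the scoring behind print_scores.sh may differ in detail, and the dialogue act item strings are made up.

```python
def dai_f_measure(reference, hypothesis):
    """Micro-averaged precision, recall, and F-measure over dialogue act items.

    reference, hypothesis: parallel lists, one set of items per utterance.
    """
    tp = fp = fn = 0
    for ref, hyp in zip(reference, hypothesis):
        ref, hyp = set(ref), set(hyp)
        tp += len(ref & hyp)  # items found in both
        fp += len(hyp - ref)  # items predicted but not in the reference
        fn += len(ref - hyp)  # reference items that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Illustrative data only.
reference = [{"inform(from_stop=A)", "inform(to_stop=B)"}, {"hello()"}]
hypothesis = [{"inform(from_stop=A)"}, {"hello()", "inform(to_stop=C)"}]
precision, recall, f = dai_f_measure(reference, hypothesis)
print(f"P = {precision:.3f}, R = {recall:.3f}, F = {f:.3f}")
```

An absolute improvement of 20 % means a difference of 20 percentage points between two such F-measure values, not a relative change.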

Evaluation of NBL model on test data¶

The NBL model is trained on N-best ASR output and evaluated on the N-best ASR output from test data. Please look at the results marked as DATA TEST NBL - *. One can see that using N-best lists, even from Google ASR, can help, though only a little (about 1 %). When more data are available, more testing and more feature engineering can be done. However, we are more interested in extracting features from lattices or confusion networks.

Now, we have to wait for a working decoder that generates good lattices. The OpenJulius decoder is not suitable, as it crashes unexpectedly and therefore cannot be used in a real system.