Building a SLU for the PTIen domain¶
At the moment, we only have data that were automatically generated using our handcrafted SLU (HDC SLU) parser on the transcribed audio. In general, the quality of the automatic annotation is very good.
The data can be prepared using the
prepare_data.py script. It assumes that there exists the
indomain_data directory with links to directories containing
asr_transcribed.xml files. It then uses these files to extract transcriptions
and generate automatic SLU annotations using the PTIENHDCSLU parser.
The script generates the following files:
*.trn: contains manual transcriptions
*.trn.hdc.sem: contains automatic annotation from transcriptions using handcrafted SLU
*.asr: contains ASR 1-best results
*.asr.hdc.sem: contains automatic annotation from 1-best ASR using handcrafted SLU
*.nbl: contains ASR N-best results
*.nbl.hdc.sem: contains automatic annotation from N-best ASR using handcrafted SLU
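The *.trn and *.trn.hdc.sem files (and likewise the ASR variants) are parallel, so corresponding entries can be joined to build training pairs. A minimal sketch, assuming each file stores one utterance per line in a hypothetical `key => value` layout (the actual file layout may differ):

```python
def load_utterances(path):
    """Parse lines of the assumed form 'utt_key => value' into a dict."""
    utts = {}
    with open(path) as f:
        for line in f:
            if '=>' not in line:
                continue
            key, value = line.split('=>', 1)
            utts[key.strip()] = value.strip()
    return utts

def make_pairs(trn_path, sem_path):
    """Join transcriptions with their automatic semantic annotations."""
    trn = load_utterances(trn_path)
    sem = load_utterances(sem_path)
    # keep only utterances present in both files
    return [(trn[k], sem[k]) for k in sorted(trn) if k in sem]
```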
The script accepts the
--uniq parameter for fast generation of unique HDC SLU annotations.
This is useful when tuning the HDC SLU.
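When tuning the parser, each distinct parse is typically needed only once, so deduplication can be pictured as keeping the first occurrence of every (transcription, annotation) pair. An illustrative helper, not the script's actual code:

```python
def uniq_annotations(pairs):
    """Keep one example per distinct (transcription, annotation) pair,
    preserving the original order."""
    seen = set()
    out = []
    for trn, sem in pairs:
        if (trn, sem) not in seen:
            seen.add((trn, sem))
            out.append((trn, sem))
    return out
```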
The script also accepts the
--fast parameter for fast approximate preparation of all data.
It approximates the HDC SLU output for an N-best list using the output obtained by parsing the 1-best ASR result.
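The approximation can be sketched as parsing only the top hypothesis and reusing that annotation for the whole N-best list; the `parse` argument below is a stand-in for the HDC SLU parser, and the N-best list representation is an assumption:

```python
def approx_nblist_sem(nblist, parse):
    """Approximate the SLU output for an N-best list by parsing only
    the 1-best hypothesis instead of the full list.

    nblist: list of (probability, hypothesis) pairs, best first
    parse:  any function mapping a hypothesis string to an annotation
    """
    best_prob, best_hyp = nblist[0]
    return parse(best_hyp)
```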
Building the models¶
First, prepare the data. Link the directories with the in-domain data into the
indomain_data directory. Then run the prepare_data.py script.
Second, train and test the models.
./train.py && ./test.py && ./test_bootstrap.py
Third, look at the
*.score files or compute the scores of interest by running ./print_scores.sh.

Future work:
- prepare_data.py will have to use ASR, NBLIST, and CONFNET data generated by the latest ASR system instead of the logged ASR results, because the ASR can change over time.
- Condition the SLU DialogueActItem decoding on the previous system dialogue act.
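Conditioning the decoding on the previous system dialogue act could be realized by adding features derived from that act next to the lexical ones. A rough sketch with illustrative feature names (not the framework's actual feature extraction):

```python
def make_features(utterance, prev_sys_da):
    """Combine lexical unigram features with a feature derived from
    the previous system dialogue act (e.g. 'request(time)')."""
    feats = {}
    for word in utterance.lower().split():
        feats['w_' + word] = 1.0
    # condition the decoder on what the system just said
    feats['prev_da_' + prev_sys_da] = 1.0
    return feats
```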
Evaluation of ASR from the call logs files¶
The current ASR performance computed from the call logs is as follows:
Please note that the scoring implicitly ignores all non-speech events.

Ref: all.trn
Tst: all.asr

|========================================================================|
|         | # Sentences | # Words |  Corr |   Sub |   Del |  Ins |  Err  |
|------------------------------------------------------------------------|
| Sum/Avg |        9111 |   24728 | 56.15 | 16.07 | 27.77 | 1.44 | 45.28 |
|========================================================================|
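As a sanity check on the table, the error rate is the sum of the substitution, deletion, and insertion rates, while the correct, substitution, and deletion rates together cover all reference words:

```python
# rates copied from the table above, in percent
corr, sub, dele, ins = 56.15, 16.07, 27.77, 1.44

# the Err column is the sum of substitution, deletion, and insertion rates
err = round(sub + dele + ins, 2)
print(err)  # 45.28

# Corr + Sub + Del covers all reference words (99.99 here due to rounding)
print(round(corr + sub + dele, 2))  # 99.99
```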
The results above were obtained using the Google ASR.
Evaluation of the minimum number of feature counts¶
Using 9111 training examples, we found that pruning should be set to
- min feature count = 3
- min classifier count = 4
to prevent overfitting.
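Pruning by a minimum feature count can be sketched as dropping any feature that occurs in fewer training examples than the threshold. Illustrative code under that assumption, not the trainer's actual implementation:

```python
from collections import Counter

def prune_features(examples, min_count=3):
    """Drop features occurring in fewer than min_count training examples.

    examples: list of feature dicts, one per training example
    """
    counts = Counter(f for ex in examples for f in ex)
    kept = {f for f, c in counts.items() if c >= min_count}
    return [{f: v for f, v in ex.items() if f in kept} for ex in examples]
```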
Cheating experiment: train and test on all data¶
Due to data sparsity, the evaluation on proper test and development sets suffers from sampling errors. Therefore, here we present results where all data are used as training data and the metrics are evaluated on the training data itself.
By running ./print_scores.sh, one can get scores for assessing the quality of the trained models. The results from these experiments are stored in the
old.scores.* files. Please look at the results marked as
DATA ALL ASR - *.
If the automatic annotations were correct, we could conclude that the F-measure of the HDC SLU parser on the 1-best hypothesis is higher than its F-measure on N-best lists. This is confusing, as it suggests that decoding from N-best lists gives worse results than decoding from the 1-best ASR hypothesis.
Evaluation of TRN model on test data¶
The TRN model is trained on transcriptions and evaluated on transcriptions from the test data. Please look at the results marked as
DATA TEST TRN - *. One can see that the performance of the TRN model on the TRN test data is NOT
100 % perfect. This is probably due to a mismatch between the train and test data sets. Once more training data
is available, we can expect better results.
Evaluation of ASR model on test data¶
The ASR model is trained on 1-best ASR output and evaluated on the 1-best ASR output from test data. Please look at
the results marked as
DATA TEST ASR - *. On the ASR test data, the ASR model scores significantly better than the HDC SLU parser
evaluated on the same data; the improvement is about 20 % in F-measure (absolute).
This shows that an SLU trained on ASR data can be beneficial.
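The F-measures reported in these sections are computed over dialogue act items. A minimal sketch of the metric, assuming each annotation is compared as a set of items:

```python
def f_measure(reference, hypothesis):
    """Item-level F-measure between two sets of dialogue act items."""
    ref, hyp = set(reference), set(hypothesis)
    correct = len(ref & hyp)
    if not correct:
        return 0.0
    precision = correct / len(hyp)
    recall = correct / len(ref)
    # harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)
```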
Evaluation of NBL model on test data¶
The NBL model is trained on N-best ASR output and evaluated on the N-best ASR from test data. Please look at
the results marked as
DATA TEST NBL - *. One can see that using N-best lists even from the Google ASR can help, though
only a little (about 1 %). When more data becomes available, more testing and more feature engineering can be done.
However, we are more interested in extracting features from lattices or confusion networks.
Now, we have to wait for a working decoder that generates good lattices. The OpenJulius decoder is not suitable, as it crashes unexpectedly and therefore cannot be used in a real system.
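Once such a decoder is available, features could be extracted from confusion networks, for example as posterior-weighted word counts. A minimal sketch; the slot-based representation below is an assumption for illustration, not the framework's actual data structure:

```python
def confnet_features(confnet):
    """Extract posterior-weighted word features from a confusion network.

    confnet: list of slots, each a list of (posterior, word) alternatives
    """
    feats = {}
    for slot in confnet:
        for posterior, word in slot:
            # accumulate the expected count of each word
            feats['w_' + word] = feats.get('w_' + word, 0.0) + posterior
    return feats
```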