alex.corpustools package¶
Submodules¶
alex.corpustools.asr_decode module¶
alex.corpustools.asrscore module¶
alex.corpustools.autopath module¶
Self-cloning, automatic path configuration.
Copy this into any subdirectory of pypy from which scripts need to be run, typically all of the test subdirs. The idea is that any such script simply issues
import autopath
and this will make sure that the parent directory containing “pypy” is in sys.path.
If you modify the master “autopath.py” version (in pypy/tool/autopath.py), you can run it directly, and it will copy itself over all autopath.py files it finds under the pypy root directory.
This module always provides these attributes:
- pypydir – pypy root directory path
- this_dir – directory where this autopath.py resides
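The path setup described above can be sketched as follows. This is a minimal illustration and not the actual autopath.py: the function name `add_root_parent_to_path` and its parameters are hypothetical, and the self-cloning step is left out.

```python
import os
import sys


def add_root_parent_to_path(start_dir, root_name="pypy"):
    """Walk upwards from start_dir until a directory called root_name is found,
    then put the directory containing it at the front of sys.path.

    Returns (root_dir, parent_dir). Raises EnvironmentError when no ancestor
    of start_dir is called root_name.
    """
    current = os.path.abspath(start_dir)
    while True:
        if os.path.basename(current) == root_name:
            parent = os.path.dirname(current)
            if parent not in sys.path:
                sys.path.insert(0, parent)
            return current, parent
        up = os.path.dirname(current)
        if up == current:  # reached the filesystem root without a match
            raise EnvironmentError(
                "no %r directory above %s" % (root_name, start_dir))
        current = up
```

In this sketch the two returned values play the roles of `pypydir` and of the directory that ends up in sys.path.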
alex.corpustools.cued-audio2ufal-audio module¶
alex.corpustools.cued-call-logs-sem2ufal-call-logs-sem module¶
alex.corpustools.cued-sem2ufal-sem module¶
alex.corpustools.cued module¶
This module is meant to collect functionality for handling call logs – both working with the call log files in the filesystem, and parsing them.
alex.corpustools.cued.find_logs(infname, ignore_list_file=None, verbose=False)[source]¶
Finds CUED logs below the paths specified and returns their filenames. The logs are determined as files matching one of the following patterns:
user-transcription.norm.xml
user-transcription.xml
user-transcription-all.xml
If multiple patterns are matched by files in the same directory, only the first match is taken.
- Arguments:
- infname – either a directory, or a file. In the first case, logs are looked for below that directory. In the latter case, the file is read line by line, each line specifying a directory or a glob determining the log to include.
- ignore_list_file – a file of absolute paths or globs (can be mixed) specifying logs that should be excluded from the results
- verbose – print lots of output?
Returns a set of paths to files satisfying the criteria.
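The "only the first match is taken" rule can be illustrated with a short sketch. It is not the library's implementation: the helper name `pick_logs` is hypothetical, and the sketch handles only a directory argument, with no ignore list.

```python
import os

# File names in priority order, as listed in the description above.
LOG_NAMES = (
    "user-transcription.norm.xml",
    "user-transcription.xml",
    "user-transcription-all.xml",
)


def pick_logs(root_dir, names=LOG_NAMES):
    """Return a set with at most one log path per directory below root_dir."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        present = set(filenames)
        for name in names:
            if name in present:
                found.add(os.path.join(dirpath, name))
                break  # only the first matching name counts for this directory
    return found
```

A directory holding both user-transcription.xml and user-transcription.norm.xml therefore contributes only the latter.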
alex.corpustools.cued.find_wavs(infname, ignore_list_file=None)[source]¶
Finds wavs below the paths specified and returns their filenames.
- Arguments:
- infname – either a directory, or a file. In the first case, wavs are looked for below that directory. In the latter case, the file is read line by line, each line specifying a directory or a glob determining the wav to include.
- ignore_list_file – a file of absolute paths or globs (can be mixed) specifying wavs that should be excluded from the results
Returns a set of paths to files satisfying the criteria.
alex.corpustools.cued.find_with_ignorelist(infname, pat, ignore_list_file=None, find_kwargs={})[source]¶
Finds specific files below the paths specified and returns their filenames.
- Arguments:
- pat – globbing pattern specifying the files to look for
- infname – either a directory, or a file. In the first case, files are looked for below that directory. In the latter case, the file is read line by line, each line specifying a directory or a glob determining the files to include.
- ignore_list_file – a file of absolute paths or globs (can be mixed) specifying files that should be excluded from the results
- find_kwargs – if provided, this dictionary is used as additional keyword arguments for the function `utils.fs.find` for finding positive examples of files (not the ignored ones)
Returns a set of paths to files satisfying the criteria.
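How an ignore list that mixes absolute paths and globs might be applied can be sketched as below. The helper `filter_ignored` is hypothetical, and the matching rule (absolute entries compared against the whole path, other entries against the file name only) is an assumption of this sketch, not documented behaviour of `find_with_ignorelist`.

```python
import fnmatch
import os


def filter_ignored(paths, ignore_entries):
    """Drop paths matched by an ignore list of absolute paths and globs.

    Assumption: entries that are absolute paths are matched against the
    whole path, all other entries against the file name only.
    """
    kept = set()
    for path in paths:
        name = os.path.basename(path)
        ignored = False
        for entry in ignore_entries:
            target = path if os.path.isabs(entry) else name
            if fnmatch.fnmatchcase(target, entry):
                ignored = True
                break
        if not ignored:
            kept.add(path)
    return kept
```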
alex.corpustools.cued2utt_da_pairs module¶
class alex.corpustools.cued2utt_da_pairs.TurnRecord(transcription, cued_da, cued_dahyp, asrhyp, audio)¶
Bases: tuple
asrhyp¶
Alias for field number 3
audio¶
Alias for field number 4
cued_da¶
Alias for field number 1
cued_dahyp¶
Alias for field number 2
transcription¶
Alias for field number 0
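The "Alias for field number N" entries are the docstrings that `collections.namedtuple` generates, so TurnRecord presumably behaves like the type sketched below. The field values are invented purely for illustration.

```python
from collections import namedtuple

# Field order matches the alias numbers listed above.
TurnRecord = namedtuple(
    "TurnRecord",
    ["transcription", "cued_da", "cued_dahyp", "asrhyp", "audio"],
)

# All values below are made up for this example.
record = TurnRecord(
    transcription="i want a cheap restaurant",
    cued_da="inform(pricerange=cheap)",
    cued_dahyp="inform(pricerange=cheap)",
    asrhyp="i want a cheap restaurant",
    audio="example.wav",
)

# Fields are reachable by name or by position, and the record unpacks
# like any other tuple.
same_field = record.asrhyp == record[3]
transcription, cued_da, cued_dahyp, asrhyp, audio = record
```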
alex.corpustools.cued2utt_da_pairs.extract_trns_sems(infname, verbose, fields=None, ignore_list_file=None, do_exclude=True, normalise=True, known_words=None)[source]¶
Extracts transcriptions and their semantic annotation from a directory containing CUED call log files.
- Arguments:
- infname – either a directory, or a file. In the first case, logs are looked for below that directory. In the latter case, the file is read line by line, each line specifying a directory or a glob determining the call log to include.
- verbose – print lots of output?
- fields – names of fields that should be required for the output. Field names are strings corresponding to the element names in the transcription XML format. (default: all five of them)
- ignore_list_file – a file of absolute paths or globs (can be mixed) specifying logs that should be skipped
- normalise – whether to do normalisation on transcriptions
- do_exclude – whether to exclude transcriptions not considered suitable
- known_words – a collection of words. If provided, transcriptions that contain any other words are excluded. If not provided, transcriptions that contain any of _excluded_characters are excluded. What “excluded” means depends on whether the transcriptions are required by being specified in `fields`.
Returns a list of TurnRecords.
alex.corpustools.cued2utt_da_pairs.extract_trns_sems_from_file(fname, verbose, fields=None, normalise=True, do_exclude=True, known_words=None, robust=False)[source]¶
Extracts transcriptions and their semantic annotation from a CUED call log file.
- Arguments:
- fname – path to the call log file
- verbose – print lots of output?
- fields – names of fields that should be required for the output. Field names are strings corresponding to the element names in the transcription XML format. (default: all five of them)
- normalise – whether to do normalisation on transcriptions
- do_exclude – whether to exclude transcriptions not considered suitable
- known_words – a collection of words. If provided, transcriptions that contain any other words are excluded. If not provided, transcriptions that contain any of _excluded_characters are excluded. What “excluded” means depends on whether the transcriptions are required by being specified in `fields`.
- robust – whether to assign recordings to turns robustly or to trust their position in the log. This could be useful for older CUED logs, where elements sometimes end up in a different <turn> than the one they belong to. However, in cases where `robust` leads to finding the correct recording for the user turn, the log is damaged in other places too, and the resulting turn record would be misleading. Therefore, we recommend leaving robust=False.
Returns a list of TurnRecords.
alex.corpustools.cued2wavaskey module¶
Finds CUED XML files describing calls in the directory specified, extracts a couple of fields from them for each turn (transcription, ASR 1-best, semantics transcription, SLU 1-best) and outputs them to separate files in the following format:
{wav_filename} => {field}
An example ignore list file could contain the following three lines:
/some-path/call-logs/log_dir/some_id.wav
some_id.wav
jurcic-??[13579]*.wav
The first one is an example of an ignored path. On UNIX, it has to start with a slash. On other platforms, an analogous convention has to be used.
The second one is an example of a literal glob.
The last one is an example of a more advanced glob. It says basically that all odd dialogue turns should be ignored.
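To see what the last glob selects, it can be tried against a few names with Python's `fnmatch` module. Whether the script uses `fnmatch` semantics internally is an assumption here, and the file names are hypothetical, since the real naming scheme is not given in this document.

```python
import fnmatch

IGNORE_GLOB = "jurcic-??[13579]*.wav"

# Hypothetical file names, invented for this example.
names = [
    "jurcic-001-a.wav",
    "jurcic-002-a.wav",
    "jurcic-013-b.wav",
    "other-001.wav",
]

# "??" consumes two arbitrary characters, "[13579]" requires an odd digit
# in the next position, and "*" absorbs the rest of the name before ".wav".
ignored = [n for n in names if fnmatch.fnmatchcase(n, IGNORE_GLOB)]
# ignored == ["jurcic-001-a.wav", "jurcic-013-b.wav"]
```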
alex.corpustools.cuedda module¶
alex.corpustools.fisherptwo2ufal-audio module¶
alex.corpustools.gen-gsm-audio module¶
alex.corpustools.grammar_weighted module¶
alex.corpustools.kky-transcriber2ufal-audio module¶
alex.corpustools.librispeech2ufal-audio module¶
alex.corpustools.lm module¶
alex.corpustools.malach-en2ufal-audio module¶
alex.corpustools.merge_uttcns module¶
alex.corpustools.num_time_stats module¶
Traverses the filesystem below a specified directory, looking for call log directories. Writes a file containing statistics about each phone number (extracted from the call log dirs’ names):
- number of calls
- total size of recorded wav files
- last expected date the caller would call
- last date the caller actually called
- the phone number
Call with -h to obtain the help for command line arguments.
2012-12-11 Matěj Korvas
alex.corpustools.recording_splitter module¶
alex.corpustools.semscore module¶
alex.corpustools.semscore.score(fn_refsem, fn_testsem, item_level=False, detailed_error_output=False, outfile=<open file '<stdout>', mode 'w'>)[source]¶
alex.corpustools.semscore.score_da(ref_da, test_da, daid)[source]¶
Computed according to http://en.wikipedia.org/wiki/Precision_and_recall
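How `score_da` counts dialogue act items and what it returns is not documented here. The sketch below only illustrates the standard precision and recall definitions, treating each dialogue act as a set of items. The function name `precision_recall` is hypothetical.

```python
def precision_recall(ref_items, test_items):
    """Precision and recall of test_items measured against ref_items.

    precision = |ref & test| / |test|
    recall    = |ref & test| / |ref|
    An empty denominator yields 0.0 in this sketch.
    """
    ref, test = set(ref_items), set(test_items)
    true_positives = len(ref & test)
    precision = true_positives / float(len(test)) if test else 0.0
    recall = true_positives / float(len(ref)) if ref else 0.0
    return precision, recall
```

For a reference with two items and a hypothesis with two items, one of which is correct, both values come out as 0.5.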
alex.corpustools.split-asr-data module¶
alex.corpustools.srilm_ppl_filter module¶
alex.corpustools.text_norm_cs module¶
This module provides tools for CZECH normalisation of transcriptions, mainly for those obtained from human transcribers.
alex.corpustools.text_norm_en module¶
This module provides tools for ENGLISH normalisation of transcriptions, mainly for those obtained from human transcribers.
alex.corpustools.text_norm_es module¶
This module provides tools for SPANISH normalisation of transcriptions, mainly for those obtained from human transcribers.
alex.corpustools.ufal-call-logs-audio2ufal-audio module¶
alex.corpustools.ufal-transcriber2ufal-audio module¶
alex.corpustools.ufaldatabase module¶
alex.corpustools.vad-mlf-from-ufal-audio module¶
alex.corpustools.vctk2ufal-audio module¶
alex.corpustools.voxforge2ufal-audio module¶
alex.corpustools.wavaskey module¶
alex.corpustools.wavaskey.load_wavaskey(fname, constructor, limit=None, encoding=u'UTF-8')[source]¶
Loads a dictionary of objects stored in the “wav as key” format.
The input file is assumed to contain lines of the following form:
[[:space:]..]<key>[[:space:]..]=>[[:space:]..]<obj_str>[[:space:]..]
or just (without keys):
[[:space:]..]<obj_str>[[:space:]..]
where <obj_str> is to be given as the only argument to the `constructor` when constructing the objects stored in the file.
- Arguments:
- fname – path to the file to read the objects from
- constructor – function that will be called on each string stored in the file and whose result will become a value of the returned dictionary
- limit – limit on the number of objects to read
- encoding – the file encoding
Returns a dictionary with objects constructed by `constructor` as values.
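The line format described above can be parsed along these lines. This is a sketch under stated assumptions rather than the library's code: the helper name `load_key_value_lines` is hypothetical, blank lines are skipped, and lines without a key get their 0-based line index (as a string) as the key, which the documentation does not specify.

```python
import io


def load_key_value_lines(fname, constructor, limit=None, encoding="UTF-8"):
    """Read "<key> => <obj_str>" lines into a dict of constructed objects.

    Assumption: a line without "=>" is stored under its 0-based line index
    converted to a string.
    """
    result = {}
    with io.open(fname, encoding=encoding) as infile:
        for index, line in enumerate(infile):
            if limit is not None and len(result) >= limit:
                break  # enough objects have been read
            line = line.strip()
            if not line:
                continue
            if "=>" in line:
                # Split on the first "=>" only, then trim surrounding spaces.
                key, obj_str = line.split("=>", 1)
                key, obj_str = key.strip(), obj_str.strip()
            else:
                key, obj_str = str(index), line
            result[key] = constructor(obj_str)
    return result
```

Passing `constructor=lambda s: s.split()` would, for instance, turn each stored string into a list of words.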
alex.corpustools.wavaskey.save_wavaskey(fname, in_dict, encoding=u'UTF-8', trans=<function <lambda>>)[source]¶
Saves a dictionary of objects in the “wav as key” format into a file.
Parameters:
- fname – name of the target file
- in_dict – a dictionary with the objects where the keys are the names of the corresponding wave files
- trans – a function which can transform a saved object
Returns: None