alex.corpustools package¶

Submodules¶

alex.corpustools.asr_decode module¶

alex.corpustools.asrscore module¶

alex.corpustools.autopath module¶

self cloning, automatic path configuration

Copy this into any subdirectory of pypy from which scripts need to be run, typically all of the test subdirs. The idea is that any such script simply issues

import autopath

and this will make sure that the parent directory containing “pypy” is in sys.path.

If you modify the master “autopath.py” version (in pypy/tool/autopath.py), you can run it directly, and it will copy itself over all autopath.py files it finds under the pypy root directory.

This module always provides these attributes:

pypydir – pypy root directory path

this_dir – directory where this autopath.py resides
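The path lookup described above can be sketched roughly as follows. This is not the actual autopath.py code; `find_parent_of` is a hypothetical helper that walks upwards from a starting directory until it meets a directory with the given name and returns that directory's parent, which a script could then put on sys.path.

```python
import os


def find_parent_of(start_dir, dirname="pypy"):
    """Walk upwards from start_dir; return the parent of the first directory
    named `dirname`, or None if no such directory is found (hypothetical helper)."""
    head = os.path.abspath(start_dir)
    while True:
        parent, tail = os.path.split(head)
        if tail == dirname:
            return parent
        if parent == head:  # reached the filesystem root
            return None
        head = parent


# A script could then do something like:
#   root = find_parent_of(os.path.dirname(os.path.abspath(__file__)))
#   if root is not None and root not in sys.path:
#       sys.path.insert(0, root)
```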

alex.corpustools.cued-audio2ufal-audio module¶

alex.corpustools.cued-call-logs-sem2ufal-call-logs-sem module¶

alex.corpustools.cued-sem2ufal-sem module¶

alex.corpustools.cued module¶

This module is meant to collect functionality for handling call logs – both working with the call log files in the filesystem, and parsing them.

alex.corpustools.cued.find_logs(infname, ignore_list_file=None, verbose=False)[source]¶

Finds CUED logs below the paths specified and returns their filenames. The logs are determined as files matching one of the following patterns:

user-transcription.norm.xml
user-transcription.xml
user-transcription-all.xml

If multiple patterns are matched by files in the same directory, only the first match is taken.

Arguments:

infname – either a directory or a file. In the first case, logs are looked for below that directory. In the latter case, the file is read line by line, each line specifying a directory or a glob determining the logs to include.

ignore_list_file – a file of absolute paths or globs (can be mixed) specifying logs that should be excluded from the results

verbose – print lots of output?

Returns a set of paths to files satisfying the criteria.
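A minimal sketch of the "only the first match is taken" rule, assuming that "first" refers to the order in which the patterns are listed above. `select_logs` is a hypothetical helper written for illustration in Python 3; it works on a list of candidate paths and is not the function from alex.corpustools.cued.

```python
import os

# File names in priority order, as listed in the docstring.
LOG_PATTERNS = (
    "user-transcription.norm.xml",
    "user-transcription.xml",
    "user-transcription-all.xml",
)


def select_logs(paths):
    """Keep at most one log per directory, preferring earlier patterns."""
    best = {}  # directory -> (priority, path)
    for path in paths:
        dirname, basename = os.path.split(path)
        if basename not in LOG_PATTERNS:
            continue  # not a log file at all
        priority = LOG_PATTERNS.index(basename)
        if dirname not in best or priority < best[dirname][0]:
            best[dirname] = (priority, path)
    return {path for _priority, path in best.values()}
```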

alex.corpustools.cued.find_wavs(infname, ignore_list_file=None)[source]¶

Finds wavs below the paths specified and returns their filenames.

Arguments:

infname – either a directory or a file. In the first case, wavs are looked for below that directory. In the latter case, the file is read line by line, each line specifying a directory or a glob determining the wavs to include.

ignore_list_file – a file of absolute paths or globs (can be mixed) specifying wavs that should be excluded from the results

Returns a set of paths to files satisfying the criteria.

alex.corpustools.cued.find_with_ignorelist(infname, pat, ignore_list_file=None, find_kwargs={})[source]¶

Finds specific files below the paths specified and returns their filenames.

Arguments:

pat – globbing pattern specifying the files to look for

infname – either a directory or a file. In the first case, files are looked for below that directory. In the latter case, the file is read line by line, each line specifying a directory or a glob determining the files to include.

ignore_list_file – a file of absolute paths or globs (can be mixed) specifying files that should be excluded from the results

find_kwargs – if provided, this dictionary is used as additional keyword arguments for the function `utils.fs.find’ for finding positive examples of files (not the ignored ones)

Returns a set of paths to files satisfying the criteria.

alex.corpustools.cued2utt_da_pairs module¶

class alex.corpustools.cued2utt_da_pairs.TurnRecord¶

Bases: tuple

TurnRecord(transcription, cued_da, cued_dahyp, asrhyp, audio)

asrhyp¶: Alias for field number 3

audio¶: Alias for field number 4

cued_da¶: Alias for field number 1

cued_dahyp¶: Alias for field number 2

transcription¶: Alias for field number 0
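Since TurnRecord is a named tuple, each field can be read either by name or by position. The sketch below recreates an equivalent tuple locally with `collections.namedtuple` instead of importing it from the package, and the field values are invented purely for illustration.

```python
from collections import namedtuple

# Local stand-in with the same field order as documented above.
TurnRecord = namedtuple(
    "TurnRecord",
    ["transcription", "cued_da", "cued_dahyp", "asrhyp", "audio"],
)

# Illustrative values only; they do not come from any real call log.
rec = TurnRecord(
    transcription="hello",
    cued_da="hello()",
    cued_dahyp="hello()",
    asrhyp="hallo",
    audio="turn-001.wav",
)

# Access by name and by index refers to the same field.
same_field = rec.asrhyp == rec[3]
```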

alex.corpustools.cued2utt_da_pairs.extract_trns_sems(infname, verbose, fields=None, ignore_list_file=None, do_exclude=True, normalise=True, known_words=None)[source]¶

Extracts transcriptions and their semantic annotation from a directory containing CUED call log files.

Arguments:

infname – either a directory or a file. In the first case, logs are looked for below that directory. In the latter case, the file is read line by line, each line specifying a directory or a glob determining the call logs to include.

verbose – print lots of output?

fields – names of fields that should be required for the output. Field names are strings corresponding to the element names in the transcription XML format. (default: all five of them)

ignore_list_file – a file of absolute paths or globs (can be mixed) specifying logs that should be skipped

normalise – whether to do normalisation on transcriptions

do_exclude – whether to exclude transcriptions not considered suitable

known_words – a collection of words. If provided, transcriptions that contain other words are excluded. If not provided, transcriptions that contain any of _excluded_characters are excluded. What “excluded” means depends on whether the transcriptions are required by being specified in `fields’.

Returns a list of TurnRecords.

alex.corpustools.cued2utt_da_pairs.extract_trns_sems_from_file(fname, verbose, fields=None, normalise=True, do_exclude=True, known_words=None, robust=False)[source]¶

Extracts transcriptions and their semantic annotation from a CUED call log file.

Arguments:

fname – path towards the call log file

verbose – print lots of output?

fields – names of fields that should be required for the output. Field names are strings corresponding to the element names in the transcription XML format. (default: all five of them)

normalise – whether to do normalisation on transcriptions

do_exclude – whether to exclude transcriptions not considered suitable

known_words – a collection of words. If provided, transcriptions that contain other words are excluded. If not provided, transcriptions that contain any of _excluded_characters are excluded. What “excluded” means depends on whether the transcriptions are required by being specified in `fields’.

robust – whether to assign recordings to turns robustly or trust where they are in the log. This could be useful for older CUED logs where the elements sometimes escape to a <turn> other than the one they belong to. However, in cases where `robust’ leads to finding the correct recording for the user turn, the log is damaged at other places too, and the resulting turn record would be misleading. Therefore, we recommend leaving robust=False.

Returns a list of TurnRecords.

alex.corpustools.cued2utt_da_pairs.write_asrhyp_sem(outdir, fname, data)[source]¶

alex.corpustools.cued2utt_da_pairs.write_asrhyp_semhyp(outdir, fname, data)[source]¶

alex.corpustools.cued2utt_da_pairs.write_data(outdir, fname, data, tpt)[source]¶

alex.corpustools.cued2utt_da_pairs.write_trns_sem(outdir, fname, data)[source]¶

alex.corpustools.cued2wavaskey module¶

Finds CUED XML files describing calls in the directory specified, extracts a couple of fields from them for each turn (transcription, ASR 1-best, semantics transcription, SLU 1-best) and outputs them to separate files in the following format:

{wav_filename} => {field}

An example ignore list file could contain the following three lines:

/some-path/call-logs/log_dir/some_id.wav
some_id.wav
jurcic-??[13579]*.wav

The first one is an example of an ignored path. On UNIX, it has to start with a slash. On other platforms, an analogous convention has to be used.

The second one is an example of a literal glob.

The last one is an example of a more advanced glob. It says basically that all odd dialogue turns should be ignored.
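The three kinds of entries can be illustrated with the standard `fnmatch` module. This is a sketch under assumptions: entries starting with a slash are compared to the whole path, all other entries are matched as globs against the file name only, and `is_ignored` is a hypothetical helper, not the package's own matching code.

```python
import fnmatch
import os

IGNORE_ENTRIES = [
    "/some-path/call-logs/log_dir/some_id.wav",  # absolute path
    "some_id.wav",                               # literal glob
    "jurcic-??[13579]*.wav",                     # more advanced glob
]


def is_ignored(path, entries=IGNORE_ENTRIES):
    """Return True if `path` is covered by any ignore-list entry."""
    basename = os.path.basename(path)
    for entry in entries:
        if entry.startswith("/"):
            # Assumption: absolute entries must equal the full path.
            if path == entry:
                return True
        elif fnmatch.fnmatchcase(basename, entry):
            # Assumption: globs are matched against the file name only.
            return True
    return False
```

In the last glob, `??` matches any two characters, `[13579]` matches one odd digit, and `*` matches the rest of the name up to `.wav`.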

alex.corpustools.cued2wavaskey.main(args)[source]¶

alex.corpustools.cuedda module¶

class alex.corpustools.cuedda.CUEDDialogueAct(text, da, database=None, dictionary=None)[source]¶

get_cued_da()[source]¶

get_slots_and_values()[source]¶

get_ufal_da()[source]¶

parse()[source]¶

class alex.corpustools.cuedda.CUEDSlot(slot)[source]¶

parse()[source]¶

alex.corpustools.fisherptwo2ufal-audio module¶

alex.corpustools.grammar_weighted module¶

class alex.corpustools.grammar_weighted.A(*rules)[source]¶: Bases: alex.corpustools.grammar_weighted.Alternative

class alex.corpustools.grammar_weighted.Alternative(*rules)[source]¶

Bases: alex.corpustools.grammar_weighted.Rule

sample()[source]¶

class alex.corpustools.grammar_weighted.GrammarGen(root)[source]¶

Bases: object

sample(n)[source]¶: Sampling of n sentences.

sample_uniq(n)[source]¶: Unique sampling of n sentences.

class alex.corpustools.grammar_weighted.O(rule, prob=0.5)[source]¶: Bases: alex.corpustools.grammar_weighted.Option

class alex.corpustools.grammar_weighted.Option(rule, prob=0.5)[source]¶

Bases: alex.corpustools.grammar_weighted.Rule

sample()[source]¶

class alex.corpustools.grammar_weighted.Rule[source]¶: Bases: object

class alex.corpustools.grammar_weighted.S(*rules)[source]¶: Bases: alex.corpustools.grammar_weighted.Sequence

class alex.corpustools.grammar_weighted.Sequence(*rules)[source]¶

Bases: alex.corpustools.grammar_weighted.Rule

sample()[source]¶

class alex.corpustools.grammar_weighted.T(string)[source]¶: Bases: alex.corpustools.grammar_weighted.Terminal

class alex.corpustools.grammar_weighted.Terminal(string)[source]¶

Bases: alex.corpustools.grammar_weighted.Rule

sample()[source]¶

class alex.corpustools.grammar_weighted.UA(*rules)[source]¶: Bases: alex.corpustools.grammar_weighted.UniformAlternative

class alex.corpustools.grammar_weighted.UniformAlternative(*rules)[source]¶

Bases: alex.corpustools.grammar_weighted.Rule

load(fn)[source]¶

Load alternative terminal strings from a file.

Parameters:	fn – a file name

sample()[source]¶

alex.corpustools.grammar_weighted.as_terminal(rule)[source]¶

alex.corpustools.grammar_weighted.as_weight_tuple(rule, def_weight=1.0)[source]¶

alex.corpustools.grammar_weighted.clamp_01(number)[source]¶

alex.corpustools.grammar_weighted.counter_weight(rules)[source]¶

alex.corpustools.grammar_weighted.remove_spaces(utterance)[source]¶
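The general idea of sampling sentences from a weighted grammar can be sketched with a small standalone toy. The `Toy*` classes and `sample_uniq` below are hypothetical and only loosely mirror the class names listed above; they do not reproduce the package's implementation, its constructor conventions, or its weighting scheme.

```python
import random


class ToyTerminal:
    """Always yields a fixed string."""
    def __init__(self, string):
        self.string = string

    def sample(self, rng):
        return self.string


class ToyOption:
    """Yields the wrapped rule with probability `prob`, else an empty string."""
    def __init__(self, rule, prob=0.5):
        self.rule = rule
        self.prob = prob

    def sample(self, rng):
        return self.rule.sample(rng) if rng.random() < self.prob else ""


class ToyAlternative:
    """Picks one of several (rule, weight) pairs in proportion to the weights."""
    def __init__(self, *weighted_rules):
        self.rules = [rule for rule, _weight in weighted_rules]
        self.weights = [weight for _rule, weight in weighted_rules]

    def sample(self, rng):
        chosen = rng.choices(self.rules, weights=self.weights, k=1)[0]
        return chosen.sample(rng)


class ToySequence:
    """Concatenates the samples of its sub-rules, skipping empty ones."""
    def __init__(self, *rules):
        self.rules = rules

    def sample(self, rng):
        parts = [rule.sample(rng) for rule in self.rules]
        return " ".join(part for part in parts if part)


def sample_uniq(root, n, rng, max_tries=1000):
    """Collect up to n distinct sentences, giving up after max_tries draws."""
    seen = set()
    for _ in range(max_tries):
        if len(seen) >= n:
            break
        seen.add(root.sample(rng))
    return seen


grammar = ToySequence(
    ToyOption(ToyTerminal("please"), prob=0.5),
    ToyAlternative((ToyTerminal("call"), 3.0), (ToyTerminal("ring"), 1.0)),
    ToyTerminal("me"),
)
rng = random.Random(0)
sentences = [grammar.sample(rng) for _ in range(20)]
```

Unique sampling needs a cap on the number of draws, because a small grammar may not be able to produce n distinct sentences at all.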

alex.corpustools.librispeech2ufal-audio module¶

alex.corpustools.lm module¶

alex.corpustools.malach-en2ufal-audio module¶

alex.corpustools.merge_uttcns module¶

alex.corpustools.merge_uttcns.find_best_cn(cns)[source]¶: Determines which one of decoded confnets seems the best.

alex.corpustools.merge_uttcns.merge_files(fnames, outfname)[source]¶

alex.corpustools.num_time_stats module¶

Traverses the filesystem below a specified directory, looking for call log directories. Writes a file containing statistics about each phone number (extracted from the call log dirs’ names):

number of calls

total size of recorded wav files

last expected date the caller would call

last date the caller actually called

the phone number

Call with -h to obtain the help for command line arguments.

2012-12-11 Matěj Korvas

alex.corpustools.num_time_stats.get_call_data_from_fs(rootdir)[source]¶

alex.corpustools.num_time_stats.get_call_data_from_log(log_fname)[source]¶

alex.corpustools.num_time_stats.get_timestamp(date)[source]¶: Total seconds in the timedelta.

alex.corpustools.num_time_stats.mean(collection)[source]¶

alex.corpustools.num_time_stats.sd(collection)[source]¶

alex.corpustools.num_time_stats.set_and_ret(indexable, idx, val)[source]¶

alex.corpustools.num_time_stats.var(collection)[source]¶

alex.corpustools.recording_splitter module¶

alex.corpustools.semscore module¶

alex.corpustools.semscore.load_semantics(file_name)[source]¶

alex.corpustools.semscore.score(fn_refsem, fn_testsem, item_level=False, detailed_error_output=False, outfile=sys.stdout)[source]¶

alex.corpustools.semscore.score_da(ref_da, test_da, daid)[source]¶: Computed according to http://en.wikipedia.org/wiki/Precision_and_recall
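Precision and recall over two collections of dialogue act items can be computed as in the sketch below. It assumes that items are compared as plain strings and that an empty denominator gives 0.0; `precision_recall` is a hypothetical helper, and the actual bookkeeping in score_da may differ.

```python
def precision_recall(ref_items, test_items):
    """Return (precision, recall) of test_items measured against ref_items."""
    ref, test = set(ref_items), set(test_items)
    true_positives = len(ref & test)
    # Assumption: an empty denominator yields 0.0 rather than an error.
    precision = true_positives / len(test) if test else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    return precision, recall
```

Precision is the fraction of hypothesised items that are correct, and recall is the fraction of reference items that were found.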

alex.corpustools.semscore.score_file(refsem, testsem)[source]¶

alex.corpustools.split-asr-data module¶

alex.corpustools.srilm_ppl_filter module¶

alex.corpustools.srilm_ppl_filter.main()[source]¶

alex.corpustools.srilm_ppl_filter.srilm_scores(d3)[source]¶

alex.corpustools.text_norm_cs module¶

This module provides tools for CZECH normalisation of transcriptions, mainly for those obtained from human transcribers.

alex.corpustools.text_norm_cs.normalise_text(text)[source]¶: Normalises the transcription. This is the main function of this module.

alex.corpustools.text_norm_cs.exclude_by_dict(text, known_words)[source]¶

Determines whether text is not good enough and should be excluded.

“Good enough” is defined as having all its words present in the `known_words’ collection.
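The check described above can be sketched in a few lines, assuming that words are obtained by splitting the text on whitespace. `exclude_by_dict_sketch` is a hypothetical stand-in and may tokenise differently from the real function.

```python
def exclude_by_dict_sketch(text, known_words):
    """Return True if any word of `text` is missing from `known_words`."""
    # Assumption: words are separated by whitespace.
    return any(word not in known_words for word in text.split())
```

Passing a set as `known_words` keeps each membership test cheap when many transcriptions are checked.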

alex.corpustools.text_norm_en module¶

This module provides tools for ENGLISH normalisation of transcriptions, mainly for those obtained from human transcribers.

alex.corpustools.text_norm_en.normalise_text(text)[source]¶: Normalises the transcription. This is the main function of this module.

alex.corpustools.text_norm_en.exclude_by_dict(text, known_words)[source]¶

Determines whether text is not good enough and should be excluded.

“Good enough” is defined as having all its words present in the `known_words’ collection.

alex.corpustools.text_norm_es module¶

This module provides tools for SPANISH normalisation of transcriptions, mainly for those obtained from human transcribers.

alex.corpustools.text_norm_es.normalise_text(text)[source]¶: Normalises the transcription. This is the main function of this module.

alex.corpustools.text_norm_es.exclude_by_dict(text, known_words)[source]¶

Determines whether text is not good enough and should be excluded.

“Good enough” is defined as having all its words present in the `known_words’ collection.

alex.corpustools.ufal-call-logs-audio2ufal-audio module¶

alex.corpustools.ufal-transcriber2ufal-audio module¶

alex.corpustools.ufaldatabase module¶

alex.corpustools.ufaldatabase.save_database(odir, slots)[source]¶

alex.corpustools.vad-mlf-from-ufal-audio module¶

alex.corpustools.voxforge2ufal-audio module¶

alex.corpustools.wavaskey module¶

alex.corpustools.wavaskey.load_wavaskey(fname, constructor, limit=None, encoding=u'UTF-8')[source]¶

Loads a dictionary of objects stored in the “wav as key” format.

The input file is assumed to contain lines of the following form:

[[:space:]..]<key>[[:space:]..]=>[[:space:]..]<obj_str>[[:space:]..]

or just (without keys):

[[:space:]..]<obj_str>[[:space:]..]

where <obj_str> is to be given as the only argument to the `constructor’ when constructing the objects stored in the file.

Arguments:

fname – path towards the file to read the objects from

constructor – function that will be called on each string stored in the file and whose result will become a value of the returned dictionary

limit – limit on the number of objects to read

encoding – the file encoding

Returns a dictionary with objects constructed by `constructor’ as values.
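The line format can be illustrated with a small parser that works on an iterable of already decoded lines. This is a Python 3 sketch, not the package's code: `parse_wavaskey_lines` is hypothetical, and using the line index as the key for lines without `=>` is an assumption, since the text above does not say which key such lines receive.

```python
def parse_wavaskey_lines(lines, constructor, limit=None):
    """Parse '<key> => <obj_str>' lines into a dict of constructed objects."""
    result = {}
    for index, line in enumerate(lines):
        if limit is not None and len(result) >= limit:
            break
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if "=>" in line:
            key, obj_str = line.split("=>", 1)
            key, obj_str = key.strip(), obj_str.strip()
        else:
            # Assumption: keyless lines are keyed by their line index.
            key, obj_str = index, line
        result[key] = constructor(obj_str)
    return result
```

Splitting on the first `=>` only keeps any later `=>` inside the object string intact.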

alex.corpustools.wavaskey.save_wavaskey(fname, in_dict, encoding=u'UTF-8', trans=<function <lambda> at 0x7fb1c74b6050>)[source]¶

Saves a dictionary of objects in the “wav as key” format into a file.

Parameters:	fname – name of the target file
	in_dict – a dictionary with the objects, where the keys are the names of the corresponding wave files
	trans – a function which can transform a saved object
Returns:	None

Module contents¶