alex.corpustools package

Submodules

alex.corpustools.asr_decode module

alex.corpustools.asrscore module

alex.corpustools.autopath module

self cloning, automatic path configuration

Copy this into any subdirectory of pypy from which scripts need to be run, typically all of the test subdirs. The idea is that any such script simply issues

import autopath

and this will make sure that the parent directory containing “pypy” is in sys.path.

If you modify the master “autopath.py” version (in pypy/tool/autopath.py), you can run it directly, and it will copy itself over all autopath.py files it finds under the pypy root directory.

This module always provides these attributes:

pypydir – pypy root directory path
this_dir – directory where this autopath.py resides

alex.corpustools.cued-audio2ufal-audio module

alex.corpustools.cued-call-logs-sem2ufal-call-logs-sem module

alex.corpustools.cued-sem2ufal-sem module

alex.corpustools.cued module

This module is meant to collect functionality for handling call logs – both working with the call log files in the filesystem, and parsing them.

alex.corpustools.cued.find_logs(infname, ignore_list_file=None, verbose=False)[source]

Finds CUED logs below the paths specified and returns their filenames. The logs are determined as files matching one of the following patterns:

user-transcription.norm.xml
user-transcription.xml
user-transcription-all.xml

If multiple patterns are matched by files in the same directory, only the first match is taken.

Arguments:
    infname – either a directory, or a file. In the first case, logs are looked for below that directory. In the latter case, the file is read line by line, each line specifying a directory or a glob determining the log to include.
    ignore_list_file – a file of absolute paths or globs (can be mixed) specifying logs that should be excluded from the results
    verbose – print lots of output?

Returns a set of paths to files satisfying the criteria.
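The "first match per directory" rule can be sketched as follows. This is only an illustration written for Python 3, not the code of find_logs: the helper name first_log_per_dir is hypothetical, the three patterns are treated as literal file names, and neither the ignore list nor the list-file form of infname is handled.

```python
import os

# Patterns in priority order, as listed in the description above.
LOG_PATTERNS = (
    "user-transcription.norm.xml",
    "user-transcription.xml",
    "user-transcription-all.xml",
)


def first_log_per_dir(root):
    """Return a set with at most one log path per directory below `root`."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        names = set(filenames)
        for pattern in LOG_PATTERNS:
            if pattern in names:
                found.add(os.path.join(dirpath, pattern))
                break  # only the first matching pattern is taken
    return found
```

With this ordering, a directory holding both user-transcription.xml and user-transcription-all.xml contributes only the former.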

alex.corpustools.cued.find_wavs(infname, ignore_list_file=None)[source]

Finds wavs below the paths specified and returns their filenames.

Arguments:
    infname – either a directory, or a file. In the first case, wavs are looked for below that directory. In the latter case, the file is read line by line, each line specifying a directory or a glob determining the wav to include.
    ignore_list_file – a file of absolute paths or globs (can be mixed) specifying wavs that should be excluded from the results

Returns a set of paths to files satisfying the criteria.

alex.corpustools.cued.find_with_ignorelist(infname, pat, ignore_list_file=None, find_kwargs={})[source]

Finds specific files below the paths specified and returns their filenames.

Arguments:
    pat – globbing pattern specifying the files to look for
    infname – either a directory, or a file. In the first case, matching files are looked for below that directory. In the latter case, the file is read line by line, each line specifying a directory or a glob determining the file to include.
    ignore_list_file – a file of absolute paths or globs (can be mixed) specifying files that should be excluded from the results
    find_kwargs – if provided, this dictionary is used as additional keyword arguments for the function ‘utils.fs.find’ for finding positive examples of files (not the ignored ones)

Returns a set of paths to files satisfying the criteria.

alex.corpustools.cued2utt_da_pairs module

class alex.corpustools.cued2utt_da_pairs.TurnRecord

Bases: tuple

TurnRecord(transcription, cued_da, cued_dahyp, asrhyp, audio)

asrhyp

Alias for field number 3

audio

Alias for field number 4

cued_da

Alias for field number 1

cued_dahyp

Alias for field number 2

transcription

Alias for field number 0
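The "Bases: tuple" and "Alias for field number" entries are what Sphinx prints for a collections.namedtuple, so a record of this shape can be sketched as below. The definition is reconstructed from the field list above, and the field values are invented purely for illustration.

```python
from collections import namedtuple

# Field order follows the aliases listed above (0..4).
TurnRecord = namedtuple(
    "TurnRecord",
    ["transcription", "cued_da", "cued_dahyp", "asrhyp", "audio"],
)

# Hypothetical values, only to show the access patterns.
record = TurnRecord(
    transcription="i want a cheap restaurant",
    cued_da="inform(pricerange=cheap)",
    cued_dahyp="inform(pricerange=cheap)",
    asrhyp="i want a cheap restaurant",
    audio="turn-001.wav",
)

# Fields are reachable both by name and by position.
same_field = record.asrhyp == record[3]
transcription, cued_da, cued_dahyp, asrhyp, audio = record
```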

alex.corpustools.cued2utt_da_pairs.extract_trns_sems(infname, verbose, fields=None, ignore_list_file=None, do_exclude=True, normalise=True, known_words=None)[source]

Extracts transcriptions and their semantic annotation from a directory containing CUED call log files.

Arguments:
    infname – either a directory, or a file. In the first case, logs are looked for below that directory. In the latter case, the file is read line by line, each line specifying a directory or a glob determining the call log to include.
    verbose – print lots of output?
    fields – names of fields that should be required for the output. Field names are strings corresponding to the element names in the transcription XML format. (default: all five of them)
    ignore_list_file – a file of absolute paths or globs (can be mixed) specifying logs that should be skipped
    normalise – whether to do normalisation on transcriptions
    do_exclude – whether to exclude transcriptions not considered suitable
    known_words – a collection of words. If provided, transcriptions containing any other words are excluded. If not provided, transcriptions containing any of _excluded_characters are excluded. What “excluded” means depends on whether the transcriptions are required by being specified in ‘fields’.

Returns a list of TurnRecords.

alex.corpustools.cued2utt_da_pairs.extract_trns_sems_from_file(fname, verbose, fields=None, normalise=True, do_exclude=True, known_words=None, robust=False)[source]

Extracts transcriptions and their semantic annotation from a CUED call log file.

Arguments:
    fname – path to the call log file
    verbose – print lots of output?
    fields – names of fields that should be required for the output. Field names are strings corresponding to the element names in the transcription XML format. (default: all five of them)
    normalise – whether to do normalisation on transcriptions
    do_exclude – whether to exclude transcriptions not considered suitable
    known_words – a collection of words. If provided, transcriptions containing any other words are excluded. If not provided, transcriptions containing any of _excluded_characters are excluded. What “excluded” means depends on whether the transcriptions are required by being specified in ‘fields’.
    robust – whether to assign recordings to turns robustly or trust where they are in the log. This could be useful for older CUED logs, where elements sometimes end up in a different <turn> than the one they belong to. However, in cases where ‘robust’ leads to finding the correct recording for the user turn, the log is damaged in other places too, and the resulting turn record would be misleading. Therefore, we recommend leaving robust=False.

Returns a list of TurnRecords.

alex.corpustools.cued2utt_da_pairs.write_asrhyp_sem(outdir, fname, data)[source]
alex.corpustools.cued2utt_da_pairs.write_asrhyp_semhyp(outdir, fname, data)[source]
alex.corpustools.cued2utt_da_pairs.write_data(outdir, fname, data, tpt)[source]
alex.corpustools.cued2utt_da_pairs.write_trns_sem(outdir, fname, data)[source]

alex.corpustools.cued2wavaskey module

Finds CUED XML files describing calls in the directory specified, extracts a couple of fields from them for each turn (transcription, ASR 1-best, semantics transcription, SLU 1-best) and outputs them to separate files in the following format:

{wav_filename} => {field}

An example ignore list file could contain the following three lines:

/some-path/call-logs/log_dir/some_id.wav
some_id.wav
jurcic-??[13579]*.wav

The first one is an example of an ignored path. On UNIX, it has to start with a slash. On other platforms, an analogous convention has to be used.

The second one is an example of a literal glob.

The last one is an example of a more advanced glob. It says basically that all odd dialogue turns should be ignored.
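To see what the advanced glob selects, it can be tried against a few file names with the standard fnmatch module. This assumes the ignore list follows ordinary shell-style glob semantics, and the file names below are made up for the demonstration.

```python
from fnmatch import fnmatchcase

pattern = "jurcic-??[13579]*.wav"

# Hypothetical turn recordings, numbered 001..004.
names = [
    "jurcic-001-a.wav",
    "jurcic-002-a.wav",
    "jurcic-003-a.wav",
    "jurcic-004-a.wav",
]

# "??" consumes two characters, "[13579]" requires an odd digit next,
# and "*" swallows the rest of the name up to ".wav".
ignored = [name for name in names if fnmatchcase(name, pattern)]
kept = [name for name in names if not fnmatchcase(name, pattern)]
```

Here ignored holds the turns numbered 001 and 003, while the even-numbered ones stay in kept.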

alex.corpustools.cued2wavaskey.main(args)[source]

alex.corpustools.cuedda module

class alex.corpustools.cuedda.CUEDDialogueAct(text, da, database=None, dictionary=None)[source]
get_cued_da()[source]
get_slots_and_values()[source]
get_ufal_da()[source]
parse()[source]
class alex.corpustools.cuedda.CUEDSlot(slot)[source]
parse()[source]

alex.corpustools.fisherptwo2ufal-audio module

alex.corpustools.grammar_weighted module

class alex.corpustools.grammar_weighted.A(*rules)[source]

Bases: alex.corpustools.grammar_weighted.Alternative

class alex.corpustools.grammar_weighted.Alternative(*rules)[source]

Bases: alex.corpustools.grammar_weighted.Rule

sample()[source]
class alex.corpustools.grammar_weighted.GrammarGen(root)[source]

Bases: object

sample(n)[source]

Sampling of n sentences.

sample_uniq(n)[source]

Unique sampling of n sentences.

class alex.corpustools.grammar_weighted.O(rule, prob=0.5)[source]

Bases: alex.corpustools.grammar_weighted.Option

class alex.corpustools.grammar_weighted.Option(rule, prob=0.5)[source]

Bases: alex.corpustools.grammar_weighted.Rule

sample()[source]
class alex.corpustools.grammar_weighted.Rule[source]

Bases: object

class alex.corpustools.grammar_weighted.S(*rules)[source]

Bases: alex.corpustools.grammar_weighted.Sequence

class alex.corpustools.grammar_weighted.Sequence(*rules)[source]

Bases: alex.corpustools.grammar_weighted.Rule

sample()[source]
class alex.corpustools.grammar_weighted.T(string)[source]

Bases: alex.corpustools.grammar_weighted.Terminal

class alex.corpustools.grammar_weighted.Terminal(string)[source]

Bases: alex.corpustools.grammar_weighted.Rule

sample()[source]
class alex.corpustools.grammar_weighted.UA(*rules)[source]

Bases: alex.corpustools.grammar_weighted.UniformAlternative

class alex.corpustools.grammar_weighted.UniformAlternative(*rules)[source]

Bases: alex.corpustools.grammar_weighted.Rule

load(fn)[source]

Load alternative terminal strings from a file.

Parameters:fn – a file name
sample()[source]
alex.corpustools.grammar_weighted.as_terminal(rule)[source]
alex.corpustools.grammar_weighted.as_weight_tuple(rule, def_weight=1.0)[source]
alex.corpustools.grammar_weighted.clamp_01(number)[source]
alex.corpustools.grammar_weighted.counter_weight(rules)[source]
alex.corpustools.grammar_weighted.remove_spaces(utterance)[source]
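
The class names above suggest a small combinator library for sampling sentences from a weighted grammar. The sketch below illustrates that general idea in Python 3. It is not the Alex implementation: the classes are re-declared locally, passing a random generator to sample() is a choice made for this sketch, and the (rule, weight) pair convention and the max_tries guard are assumptions.

```python
import random


class Terminal:
    """A fixed string."""
    def __init__(self, string):
        self.string = string

    def sample(self, rng):
        return self.string


class Sequence:
    """Samples every sub-rule in order and joins the non-empty parts."""
    def __init__(self, *rules):
        self.rules = rules

    def sample(self, rng):
        parts = (rule.sample(rng) for rule in self.rules)
        return " ".join(part for part in parts if part)


class Alternative:
    """Picks one sub-rule; takes (rule, weight) pairs (assumed convention)."""
    def __init__(self, *weighted_rules):
        self.rules = [rule for rule, _ in weighted_rules]
        self.weights = [weight for _, weight in weighted_rules]

    def sample(self, rng):
        chosen = rng.choices(self.rules, weights=self.weights, k=1)[0]
        return chosen.sample(rng)


class Option:
    """Emits the sub-rule with probability `prob`, otherwise nothing."""
    def __init__(self, rule, prob=0.5):
        self.rule = rule
        self.prob = prob

    def sample(self, rng):
        return self.rule.sample(rng) if rng.random() < self.prob else ""


def sample_uniq(root, n, rng, max_tries=1000):
    """Collect up to n distinct sentences, giving up after max_tries draws."""
    seen = set()
    for _ in range(max_tries):
        if len(seen) >= n:
            break
        seen.add(root.sample(rng))
    return seen


root = Sequence(
    Option(Terminal("please")),
    Alternative((Terminal("hello"), 3.0), (Terminal("hi"), 1.0)),
)
rng = random.Random(0)
sentences = [root.sample(rng) for _ in range(20)]
unique = sample_uniq(root, 4, rng)
```

Unique sampling needs some bound such as max_tries, because a grammar may generate fewer than n distinct sentences.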

alex.corpustools.librispeech2ufal-audio module

alex.corpustools.lm module

alex.corpustools.malach-en2ufal-audio module

alex.corpustools.merge_uttcns module

alex.corpustools.merge_uttcns.find_best_cn(cns)[source]

Determines which one of decoded confnets seems the best.

alex.corpustools.merge_uttcns.merge_files(fnames, outfname)[source]

alex.corpustools.num_time_stats module

Traverses the filesystem below a specified directory, looking for call log directories. Writes a file containing statistics about each phone number (extracted from the call log dirs’ names):

  • number of calls
  • total size of recorded wav files
  • last expected date the caller would call
  • last date the caller actually called
  • the phone number

Call with -h to obtain the help for command line arguments.

2012-12-11 Matěj Korvas

alex.corpustools.num_time_stats.get_call_data_from_fs(rootdir)[source]
alex.corpustools.num_time_stats.get_call_data_from_log(log_fname)[source]
alex.corpustools.num_time_stats.get_timestamp(date)[source]

Total seconds in the timedelta.

alex.corpustools.num_time_stats.mean(collection)[source]
alex.corpustools.num_time_stats.sd(collection)[source]
alex.corpustools.num_time_stats.set_and_ret(indexable, idx, val)[source]
alex.corpustools.num_time_stats.var(collection)[source]

alex.corpustools.recording_splitter module

alex.corpustools.semscore module

alex.corpustools.semscore.load_semantics(file_name)[source]
alex.corpustools.semscore.score(fn_refsem, fn_testsem, item_level=False, detailed_error_output=False, outfile=sys.stdout)[source]
alex.corpustools.semscore.score_da(ref_da, test_da, daid)[source]

Computed according to http://en.wikipedia.org/wiki/Precision_and_recall
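
As a reminder of the definitions on the referenced page, precision and recall over sets of dialogue act items can be computed as in the sketch below. It is not the code of score_da: the function name is hypothetical, items are compared as plain strings, and returning 0.0 when a denominator is zero is an assumed convention.

```python
def precision_recall(ref_items, test_items):
    """Return (tp, fp, fn, precision, recall) for two collections of items."""
    ref, test = set(ref_items), set(test_items)
    tp = len(ref & test)   # items present in both reference and test
    fp = len(test - ref)   # items hypothesised but not in the reference
    fn = len(ref - test)   # reference items that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return tp, fp, fn, precision, recall


# Hypothetical dialogue act items.
ref = ["inform(food=chinese)", "inform(area=north)"]
test = ["inform(food=chinese)", "inform(pricerange=cheap)"]
result = precision_recall(ref, test)
```

For this pair of dialogue acts there is one true positive, one false positive and one false negative, so both precision and recall are 0.5.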

alex.corpustools.semscore.score_file(refsem, testsem)[source]

alex.corpustools.split-asr-data module

alex.corpustools.srilm_ppl_filter module

alex.corpustools.srilm_ppl_filter.main()[source]
alex.corpustools.srilm_ppl_filter.srilm_scores(d3)[source]

alex.corpustools.text_norm_cs module

This module provides tools for CZECH normalisation of transcriptions, mainly for those obtained from human transcribers.

alex.corpustools.text_norm_cs.normalise_text(text)[source]

Normalises the transcription. This is the main function of this module.

alex.corpustools.text_norm_cs.exclude_by_dict(text, known_words)[source]

Determines whether text is not good enough and should be excluded.

“Good enough” is defined as having all its words present in the ‘known_words’ collection.
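
The rule can be sketched in a few lines. This is a minimal illustration of the description above, not the module's code: the function name is hypothetical and whitespace tokenisation is an assumption.

```python
def exclude_by_dict_sketch(text, known_words):
    """Return True if `text` contains a word missing from `known_words`."""
    return any(word not in known_words for word in text.split())


known = {"dobrý", "den"}
keep_it = exclude_by_dict_sketch("dobrý den", known)      # every word is known
drop_it = exclude_by_dict_sketch("dobrý večer", known)    # "večer" is unknown
```

Passing known_words as a set keeps each membership test cheap even for a large vocabulary.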

alex.corpustools.text_norm_en module

This module provides tools for ENGLISH normalisation of transcriptions, mainly for those obtained from human transcribers.

alex.corpustools.text_norm_en.normalise_text(text)[source]

Normalises the transcription. This is the main function of this module.

alex.corpustools.text_norm_en.exclude_by_dict(text, known_words)[source]

Determines whether text is not good enough and should be excluded.

“Good enough” is defined as having all its words present in the ‘known_words’ collection.

alex.corpustools.text_norm_es module

This module provides tools for SPANISH normalisation of transcriptions, mainly for those obtained from human transcribers.

alex.corpustools.text_norm_es.normalise_text(text)[source]

Normalises the transcription. This is the main function of this module.

alex.corpustools.text_norm_es.exclude_by_dict(text, known_words)[source]

Determines whether text is not good enough and should be excluded.

“Good enough” is defined as having all its words present in the ‘known_words’ collection.

alex.corpustools.ufal-call-logs-audio2ufal-audio module

alex.corpustools.ufal-transcriber2ufal-audio module

alex.corpustools.ufaldatabase module

alex.corpustools.ufaldatabase.save_database(odir, slots)[source]

alex.corpustools.vad-mlf-from-ufal-audio module

alex.corpustools.voxforge2ufal-audio module

alex.corpustools.wavaskey module

alex.corpustools.wavaskey.load_wavaskey(fname, constructor, limit=None, encoding=u'UTF-8')[source]

Loads a dictionary of objects stored in the “wav as key” format.

The input file is assumed to contain lines of the following form:

[[:space:]..]<key>[[:space:]..]=>[[:space:]..]<obj_str>[[:space:]..]

or just (without keys):

[[:space:]..]<obj_str>[[:space:]..]

where <obj_str> is to be given as the only argument to the ‘constructor’ when constructing the objects stored in the file.

Arguments:
    fname – path to the file to read the objects from
    constructor – function that will be called on each string stored in the file and whose result will become a value of the returned dictionary
    limit – limit on the number of objects to read
    encoding – the file encoding

Returns a dictionary with objects constructed by ‘constructor’ as values.
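
The line format described above can be parsed roughly as follows. This is a sketch in Python 3 that works on an iterable of already decoded lines rather than on a file, and it is not the code of load_wavaskey. The function name is hypothetical, and using the line index as the key for keyless lines is an assumption, since the description does not say which keys such lines receive.

```python
def parse_wavaskey_lines(lines, constructor, limit=None):
    """Build a dictionary from lines of the form `<key> => <obj_str>`."""
    result = {}
    for idx, line in enumerate(lines):
        if limit is not None and len(result) >= limit:
            break
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if "=>" in line:
            # Split on the first "=>" only, so obj_str may contain "=>" itself.
            key, obj_str = (part.strip() for part in line.split("=>", 1))
        else:
            # Assumption: a keyless line is stored under its line index.
            key, obj_str = idx, line
        result[key] = constructor(obj_str)
    return result


lines = ["  a.wav => hello world  ", "b.wav=>hi"]
loaded = parse_wavaskey_lines(lines, str.upper)
first_only = parse_wavaskey_lines(lines, str.upper, limit=1)
```

Any callable that accepts a single string can serve as the constructor, for example str.upper as here or a dialogue act class.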

alex.corpustools.wavaskey.save_wavaskey(fname, in_dict, encoding=u'UTF-8', trans=<function <lambda>>)[source]

Saves a dictionary of objects in the “wav as key” format into a file.

Parameters:
  • fname – name of the target file
  • in_dict – a dictionary of the objects, where the keys are the names of the corresponding wave files
  • trans – a function which can transform a saved object

Returns:

None

Module contents