audiofilemfcc

This module contains the following classes:

  • AudioFileMFCC, representing a mono WAVE audio file as a matrix of Mel-frequency cepstral coefficients (MFCC).
class aeneas.audiofilemfcc.AudioFileMFCC(file_path=None, file_format=None, mfcc_matrix=None, audio_file=None, rconf=None, logger=None)

A monaural (single channel) WAVE audio file, represented as a NumPy 2D matrix of Mel-frequency cepstral coefficients (MFCC).

The matrix is “fat”, that is, its number of rows is equal to the number of MFCC coefficients and its number of columns is equal to the number of window shifts in the audio file. The number of MFCC coefficients and the MFCC window shift can be modified via the MFCC_SIZE and MFCC_WINDOW_SHIFT keys in the rconf object.

If mfcc_matrix is not None, it will be used as the MFCC matrix.

If file_path or audio_file is not None, the MFCCs will be computed upon creation of the object, possibly converting to PCM16 Mono WAVE and/or loading audio data in memory.

The MFCCs for the entire wave are divided into three contiguous intervals, each of which may be zero-length:

HEAD   = [:middle_begin[
MIDDLE = [middle_begin:middle_end[
TAIL   = [middle_end:[

The usual NumPy convention of including the left/start index and excluding the right/end index is adopted.

For alignment purposes, only the MIDDLE portion of the wave is taken into account; the HEAD and TAIL intervals are ignored.

This class heavily uses NumPy views and in-place operations to avoid creating temporary data or copying data around.
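
The layout described above can be illustrated with plain NumPy slicing. This is a minimal sketch, not the class's implementation: the matrix contents, the choice of 13 coefficients, and the `middle_begin`/`middle_end` values are placeholders made up for the example.

```python
import numpy as np

# Hypothetical "fat" MFCC matrix: 13 coefficients (rows) x 10 frames (columns).
n_coefficients, n_frames = 13, 10
mfcc = np.arange(n_coefficients * n_frames, dtype=float).reshape(n_coefficients, n_frames)

# Placeholder boundaries, expressed as frame (column) indices.
middle_begin, middle_end = 2, 7

# Basic slicing returns views, so no MFCC data is copied.
head = mfcc[:, :middle_begin]               # HEAD   = [:middle_begin[
middle = mfcc[:, middle_begin:middle_end]   # MIDDLE = [middle_begin:middle_end[
tail = mfcc[:, middle_end:]                 # TAIL   = [middle_end:[

# The three intervals are contiguous and cover every frame exactly once.
assert head.shape[1] + middle.shape[1] + tail.shape[1] == n_frames

# Writing through a view changes the underlying matrix.
middle[0, 0] = -1.0
```

Because `middle` is a view, the assignment on the last line is visible through `mfcc[0, middle_begin]` as well.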

Parameters:
  • file_path (string) – the path of the PCM16 mono WAVE file, or None
  • file_format (tuple) – the format of the audio file, if known in advance: (codec, channels, rate) or None
  • mfcc_matrix (numpy.ndarray) – the MFCC matrix to be set, or None
  • audio_file (AudioFile) – an audio file, or None
  • rconf (RuntimeConfiguration) – a runtime configuration
  • logger (Logger) – the logger object
Raises:

ValueError: if file_path, audio_file, and mfcc_matrix are all None

New in version 1.5.0.

all_length

The length, in MFCC coefficients, of the entire audio file, that is, HEAD + MIDDLE + TAIL.

Return type:int
all_mfcc

The MFCCs of the entire audio file, that is, HEAD + MIDDLE + TAIL.

Return type:numpy.ndarray (2D)
audio_length

The length, in seconds, of the audio file.

This value is the actual length of the audio file, computed as number of samples / sample_rate, hence it might differ from len(self.__mfcc) * mfcc_window_shift.

Return type:TimeValue
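
The discrepancy comes from the frame count being an integer. The arithmetic can be sketched as follows, assuming a 16000 Hz sample rate, a 0.040 s window shift, and a frame count obtained by truncation; all three are assumptions made for the example, not values taken from the library.

```python
from decimal import Decimal

sample_rate = 16000                  # assumed sample rate, in Hz
n_samples = 100000                   # assumed number of samples in the file
window_shift = Decimal("0.040")      # assumed MFCC window shift, in seconds

# Actual length: number of samples / sample rate.
audio_length = Decimal(n_samples) / Decimal(sample_rate)    # 6.25 s

# Assumed frame count: whole window shifts that fit in the audio.
n_frames = int(audio_length / window_shift)                 # 156

# Length implied by the frame count alone.
length_from_frames = n_frames * window_shift                # 6.240 s

difference = audio_length - length_from_frames              # 0.010 s
```

Decimal arithmetic is used here only to keep the example free of floating-point rounding.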
head_length

The length, in MFCC coefficients, of the HEAD of the audio file.

Return type:int
inside_nonspeech(index)

If index is contained in a nonspeech interval, return a pair (interval_begin, interval_end) such that interval_begin <= index < interval_end, i.e., interval_end is assumed not to be included.

Otherwise, return None.

Return type:None or tuple
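
The half-open containment test can be sketched in plain Python. The helper name `inside_interval` and the sample intervals are hypothetical; the class may well use a different lookup strategy internally.

```python
def inside_interval(intervals, index):
    """Return the (begin, end) pair with begin <= index < end, or None."""
    for begin, end in intervals:
        if begin <= index < end:
            return (begin, end)
    return None


# Hypothetical nonspeech intervals, as half-open frame index pairs.
nonspeech = [(0, 3), (10, 14)]

first = inside_interval(nonspeech, 0)     # (0, 3): the begin index is included
at_end = inside_interval(nonspeech, 3)    # None: the end index is excluded
```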
intervals(speech=True, time=True)

Return a list of intervals:

[(b_1, e_1), (b_2, e_2), ..., (b_k, e_k)]

where b_i is the time when the i-th interval begins, and e_i is the time when it ends.

Parameters:
  • speech (bool) – if True, return speech intervals, otherwise return nonspeech intervals
  • time (bool) – if True, return TimeInterval objects, otherwise return indices (int)
Return type:

list of pairs (see above)
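
As an illustration of how such a list relates to a per-frame speech mask, the sketch below groups consecutive frames into intervals and optionally scales them to seconds. The helper `runs`, the half-open index convention, and the 0.5 s window shift are assumptions chosen for the example; the real method returns TimeInterval objects when time is True.

```python
def runs(mask, value):
    """Return half-open (begin, end) index pairs of consecutive frames equal to value."""
    result = []
    begin = None
    for i, flag in enumerate(mask):
        if flag == value and begin is None:
            begin = i                      # a new run starts here
        elif flag != value and begin is not None:
            result.append((begin, i))      # the run ended at the previous frame
            begin = None
    if begin is not None:
        result.append((begin, len(mask)))  # the run reaches the last frame
    return result


# Hypothetical mask: True marks a speech frame.
mask = [False, False, True, True, True, False, True]

speech_indices = runs(mask, True)        # [(2, 5), (6, 7)]
nonspeech_indices = runs(mask, False)    # [(0, 2), (5, 6)]

# Scale indices to seconds with an assumed window shift.
window_shift = 0.5
speech_times = [(b * window_shift, e * window_shift) for b, e in speech_indices]
```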

is_reversed

Return True if currently reversed.

Return type:bool
masked_length

Return the number of MFCC speech frames in the FULL wave.

Return type:int
masked_map

Return the map from the MFCC speech frame indices to the MFCC FULL frame indices.

Return type:numpy.ndarray (1D)
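
One way to picture this map is as the positions of the True entries in a per-frame speech mask. The sketch below uses `numpy.flatnonzero` on a made-up mask and matrix; it illustrates the idea and is not necessarily how the class builds the map.

```python
import numpy as np

# Hypothetical speech mask over 5 frames and a 2 x 5 MFCC matrix.
mask = np.array([False, True, True, False, True])
mfcc = np.arange(10, dtype=float).reshape(2, 5)

# Entry k of the map is the FULL frame index of the k-th speech frame.
masked_map = np.flatnonzero(mask)     # array([1, 2, 4])

# Selecting the speech columns uses fancy indexing, which returns a copy.
masked_mfcc = mfcc[:, masked_map]
```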
masked_mfcc

Return the MFCC speech frames in the FULL wave.

Return type:numpy.ndarray (2D)
masked_middle_length

Return the number of MFCC speech frames in the MIDDLE portion of the wave.

Return type:int
masked_middle_map

Return the map from the MFCC speech frame indices in the MIDDLE portion of the wave to the MFCC FULL frame indices.

Return type:numpy.ndarray (1D)
masked_middle_mfcc

Return the MFCC speech frames in the MIDDLE portion of the wave.

Return type:numpy.ndarray (2D)
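
The MIDDLE variants can be thought of as the same speech-frame map restricted to indices inside [middle_begin:middle_end[. A minimal sketch, with a made-up mask and boundaries:

```python
import numpy as np

# Hypothetical speech mask over 7 frames and MIDDLE boundaries.
mask = np.array([True, False, True, True, False, True, True])
middle_begin, middle_end = 2, 6

# FULL frame indices of all speech frames.
full_indices = np.flatnonzero(mask)                   # array([0, 2, 3, 5, 6])

# Keep only the speech frames that fall inside MIDDLE (half-open interval).
in_middle = (full_indices >= middle_begin) & (full_indices < middle_end)
masked_middle_map = full_indices[in_middle]           # array([2, 3, 5])
masked_middle_length = len(masked_middle_map)         # 3
```

Note that the resulting values are still FULL frame indices, not indices relative to the start of MIDDLE.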
middle_begin

Return the index where MIDDLE starts.

Return type:int
middle_begin_seconds

Return the time instant, in seconds, where MIDDLE starts.

Return type:TimeValue
middle_end

Return the index where MIDDLE ends, that is, the index of the last MIDDLE frame plus one (the end index is excluded).

Return type:int
middle_end_seconds

Return the time instant, in seconds, where MIDDLE ends.

Return type:TimeValue
middle_length

The length, in MFCC coefficients, of the middle part of the audio file, that is, without HEAD and TAIL.

Return type:int
middle_map

Return the map from the MFCC frame indices in the MIDDLE portion of the wave to the MFCC FULL frame indices, that is, numpy.arange(self.middle_begin, self.middle_end).

NOTE: to translate indices of MIDDLE, instead of using fancy indexing with the result of this function, you might want to simply add self.head_length. This function is provided mostly for consistency with the MASKED case.

Return type:numpy.ndarray (1D)
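
The equivalence mentioned in the note can be checked with a short sketch. The lengths below are made up, and the example relies on HEAD starting at index 0, so that middle_begin equals head_length.

```python
import numpy as np

# Hypothetical lengths, in frames.
head_length, middle_length = 3, 4
middle_begin = head_length
middle_end = head_length + middle_length

# The map is just the range of FULL indices covered by MIDDLE.
middle_map = np.arange(middle_begin, middle_end)    # array([3, 4, 5, 6])

# Some indices relative to the start of MIDDLE.
local = np.array([0, 2, 3])

via_map = middle_map[local]          # fancy indexing through the map
via_offset = local + head_length     # plain addition, no intermediate map needed
```

Both approaches yield the same FULL indices; the addition avoids materializing the map.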
middle_mfcc

The MFCCs of the middle part of the audio file, that is, without HEAD and TAIL.

Return type:numpy.ndarray (2D)
reverse()

Reverse the audio file.

The reversal is done efficiently in place using NumPy views, instead of swapping values.

Only speech and nonspeech intervals are actually recomputed as Python lists.
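
The two ingredients can be sketched separately: a negative-stride view over the frame axis, and mirrored interval lists. The helper `mirror` and the sample data are hypothetical, and the half-open interval convention is assumed.

```python
import numpy as np

# Hypothetical 2 x 6 MFCC matrix.
mfcc = np.arange(12, dtype=float).reshape(2, 6)

# Reversing the frame (column) axis with a negative step yields a view:
# no values are moved or copied.
reversed_view = mfcc[:, ::-1]


def mirror(intervals, n_frames):
    """Mirror half-open (begin, end) intervals around the end of an n_frames-long wave."""
    return [(n_frames - end, n_frames - begin) for begin, end in reversed(intervals)]


# Interval lists are plain Python data, so they have to be rebuilt.
speech = [(1, 3), (4, 6)]
mirrored = mirror(speech, mfcc.shape[1])    # [(0, 2), (3, 5)]
```

Mirroring twice returns the original list, which matches the expectation that reversing twice restores the original state.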

run_vad(log_energy_threshold=None, min_nonspeech_length=None, extend_before=None, extend_after=None)

Determine which frames contain speech and nonspeech, and store the resulting boolean mask internally.

The four parameters might be None: in this case, the corresponding RuntimeConfiguration values are applied.

Parameters:
  • log_energy_threshold (float) – the minimum log energy threshold to consider a frame as speech
  • min_nonspeech_length (int) – the minimum length, in frames, of a nonspeech interval
  • extend_before (int) – extend each speech interval by this number of frames to the left (before)
  • extend_after (int) – extend each speech interval by this number of frames to the right (after)
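
To show how the four parameters interact, here is a simplified energy-threshold detector in plain Python. It is an illustrative sketch only: the function `simple_vad`, the absolute threshold, and the order of the steps are assumptions, and the library's actual VAD algorithm may differ.

```python
def simple_vad(log_energy, threshold, min_nonspeech_length, extend_before, extend_after):
    """Return a per-frame boolean mask, True where the frame is labeled speech."""
    n = len(log_energy)

    # Step 1: a frame is speech if its log energy reaches the threshold.
    speech = [energy >= threshold for energy in log_energy]

    # Step 2: nonspeech runs shorter than min_nonspeech_length become speech.
    i = 0
    while i < n:
        if speech[i]:
            i += 1
            continue
        j = i
        while j < n and not speech[j]:
            j += 1
        if j - i < min_nonspeech_length:
            for k in range(i, j):
                speech[k] = True
        i = j

    # Step 3: extend speech to the left and to the right.
    extended = list(speech)
    for i, flag in enumerate(speech):
        if flag:
            for k in range(max(0, i - extend_before), min(n, i + extend_after + 1)):
                extended[k] = True
    return extended


# Hypothetical log energies for 11 frames.
log_energy = [0.1, 0.1, 0.1, 0.9, 0.9, 0.1, 0.9, 0.1, 0.1, 0.1, 0.1]
mask = simple_vad(log_energy, threshold=0.5, min_nonspeech_length=2,
                  extend_before=1, extend_after=1)
```

In this example the single low-energy frame at index 5 is absorbed into the surrounding speech, and the resulting speech run (frames 3 to 6) is widened by one frame on each side.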
set_head_middle_tail(head_length=None, middle_length=None, tail_length=None)

Set the HEAD, MIDDLE, TAIL explicitly.

If a parameter is None, it will be ignored. If both middle_length and tail_length are specified, only middle_length will be applied.

Parameters:
  • head_length (TimeValue) – the length of HEAD, in seconds
  • middle_length (TimeValue) – the length of MIDDLE, in seconds
  • tail_length (TimeValue) – the length of TAIL, in seconds
Raises:
  • TypeError: if one of the arguments is neither None nor a TimeValue
  • ValueError: if one of the arguments is greater than the length of the audio file
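
The precedence rule between middle_length and tail_length can be sketched as a conversion from seconds to frame indices. The helper `head_middle_tail`, the 0.040 s window shift, the truncating conversion, and the defaults used when an argument is None (no HEAD, no TAIL) are all assumptions made for this example.

```python
from decimal import Decimal


def head_middle_tail(all_length, window_shift, head=None, middle=None, tail=None):
    """Return (middle_begin, middle_end) as frame indices."""
    def to_frames(seconds):
        # Assumed conversion: whole window shifts contained in the duration.
        return int(seconds / window_shift)

    middle_begin = 0 if head is None else to_frames(head)
    if middle is not None:
        # middle takes precedence: tail is ignored when both are given.
        middle_end = middle_begin + to_frames(middle)
    elif tail is not None:
        middle_end = all_length - to_frames(tail)
    else:
        middle_end = all_length

    if not 0 <= middle_begin <= middle_end <= all_length:
        raise ValueError("requested lengths exceed the audio length")
    return middle_begin, middle_end


shift = Decimal("0.040")     # assumed window shift, in seconds
all_length = 250             # 250 frames, that is, 10 s of audio

with_tail = head_middle_tail(all_length, shift,
                             head=Decimal("1.000"), tail=Decimal("2.000"))
with_both = head_middle_tail(all_length, shift, head=Decimal("1.000"),
                             middle=Decimal("4.000"), tail=Decimal("2.000"))
```

Here `with_tail` is (25, 200), while `with_both` is (25, 125) because the tail value is ignored once a middle length is supplied.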

tail_begin

The index, in MFCC coefficients, where the TAIL of the audio file starts.

Return type:int
tail_length

The length, in MFCC coefficients, of the TAIL of the audio file.

Return type:int