textfile¶

This module contains the following classes:

TextFile, representing a text file;
TextFileFormat, an enumeration of text file formats;
TextFilter, an abstract class for filtering text;
TextFilterIgnoreRegex, a regular expression text filter;
TextFilterTransliterate, a transliteration text filter;
TextFragment, representing a single text fragment;
TransliterationMap, a full transliteration map.

class aeneas.textfile.TextFile(file_path=None, file_format=None, parameters=None, rconf=None, logger=None)[source]¶

A tree of text fragments, representing a text file.

Parameters:
  file_path (string) – the path to the text file. If neither `file_path` nor `file_format` is `None`, the file is read immediately.
  file_format (`TextFileFormat`) – the format of the text file
  parameters (dict) – additional parameters used to parse the text file
  rconf (`RuntimeConfiguration`) – a runtime configuration
  logger (`Logger`) – the logger object
Raises:
  OSError: if `file_path` cannot be read
  TypeError: if `parameters` is not an instance of `dict`
  ValueError: if the `file_format` value is not allowed

add_fragment(fragment, as_last=True)[source]¶

Add the given text fragment as the first or last child of the root node of the text file tree.

Parameters:
  fragment (`TextFragment`) – the text fragment to be added
  as_last (bool) – if `True` append fragment, otherwise prepend it

characters¶

The number of characters in this text file.

Return type:	int

chars¶

Return the number of characters of the text file, not counting line or fragment separators.

Return type:	int

children_not_empty¶

Return the non-empty direct children of the root of the fragments tree, as TextFile objects.

Return type:	list of `TextFile`

clear()[source]¶

Clear the text file, removing all the current fragments.

file_format¶

The format of the text file.

Return type:	`TextFileFormat`

file_path¶

The path of the text file.

Return type:	string

fragments¶

The current list of text fragments which are the children of the root node of the text file tree.

Return type:	list of `TextFragment`

fragments_tree¶

Return the current tree of fragments.

Return type:	`Tree`

get_slice(start=None, end=None)[source]¶

Return a new TextFile containing the text fragments indexed from start (included) to end (excluded).

Parameters:
  start (int) – the start index, included
  end (int) – the end index, excluded
Return type:	`TextFile`

get_subtree(root)[source]¶

Return a new TextFile object, rooted at the given node root.

Parameters:	root (`Tree`) – the root node
Return type:	`TextFile`

parameters¶

Additional parameters used to parse the text file.

Return type:	dict

read_from_list(lines)[source]¶

Read text fragments from a given list of strings:

[fragment_1, fragment_2, ..., fragment_n]

Parameters:	lines (list) – the text fragments

read_from_list_with_ids(lines)[source]¶

Read text fragments from a given list of tuples:

[(id_1, text_1), (id_2, text_2), ..., (id_n, text_n)].

Parameters:	lines (list) – the list of `(id, text)` tuples (see above)

set_language(language)[source]¶

Set the given language for all the text fragments.

Parameters:	language (`Language`) – the language of the text fragments

class aeneas.textfile.TextFileFormat[source]¶

Enumeration of the supported formats for text files.

ALLOWED_VALUES = ['mplain', 'munparsed', 'parsed', 'plain', 'subtitles', 'unparsed']¶

List of all the allowed values.

MPLAIN = 'mplain'¶

Multilevel version of the PLAIN format.

The text file contains fragments on multiple levels: paragraphs are separated by at least one blank line, each sentence is on its own line, and words are recognized automatically:

First sentence of Paragraph One.
Second sentence of Paragraph One.

First sentence of Paragraph Two.

First sentence of Paragraph Three.
Second sentence of Paragraph Three.
Third sentence of Paragraph Three.

The above will produce the following text tree:

Paragraph1 ("First ... One.")
  Sentence1 ("First ... One.")
    Word1 ("First")
    Word2 ("sentence")
    ...
    Word5 ("One.")
  Sentence2 ("Second ... One.")
    Word1 ("Second")
    Word2 ("sentence")
    ...
    Word5 ("One.")
Paragraph2 ("First ... Two.")
  Sentence1 ("First ... Two.")
    Word1 ("First")
    Word2 ("sentence")
    ...
    Word5 ("Two.")
...
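
The paragraph/sentence/word nesting shown above can be illustrated with a minimal sketch. This is not the library's parser: the function name `parse_mplain` is hypothetical, and splitting words on whitespace is an assumption about how "recognized automatically" might work.

```python
def parse_mplain(text):
    """Return a nested list: paragraphs -> sentences -> words."""
    paragraphs = []
    current = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            # Each non-blank line is one sentence; words are split on
            # whitespace (an assumption made for this sketch).
            current.append(line.split())
        elif current:
            # A blank line closes the current paragraph.
            paragraphs.append(current)
            current = []
    if current:
        paragraphs.append(current)
    return paragraphs


sample = """First sentence of Paragraph One.
Second sentence of Paragraph One.

First sentence of Paragraph Two.
"""
tree = parse_mplain(sample)
```

Here `tree` has one entry per paragraph, each holding one list of words per sentence.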

MULTILEVEL_VALUES = ['mplain', 'munparsed']¶

List of all multilevel formats.

MUNPARSED = 'munparsed'¶

Multilevel version of the UNPARSED format.

The text file contains fragments on three levels: level 1 (paragraph), level 2 (sentence), level 3 (word):

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" lang="en" xml:lang="en">
 <head>
  <meta charset="utf-8"/>
  <link rel="stylesheet" href="../Styles/style.css" type="text/css"/>
  <title>Sonnet I</title>
 </head>
 <body>
  <div id="divTitle">
   <h1>
    <span id="p000001">
     <span id="p000001s000001">
      <span id="p000001s000001w000001">I</span>
     </span>
    </span>
   </h1>
  </div>
  <div id="divSonnet">
   <p class="stanza" id="p000002">
    <span id="p000002s000001">
     <span id="p000002s000001w000001">From</span>
     <span id="p000002s000001w000002">fairest</span>
     <span id="p000002s000001w000003">creatures</span>
     <span id="p000002s000001w000004">we</span>
     <span id="p000002s000001w000005">desire</span>
     <span id="p000002s000001w000006">increase,</span>
    </span><br/>
    <span id="p000002s000002">
     <span id="p000002s000002w000001">That</span>
     <span id="p000002s000002w000002">thereby</span>
     <span id="p000002s000002w000003">beauty’s</span>
     <span id="p000002s000002w000004">rose</span>
     <span id="p000002s000002w000005">might</span>
     <span id="p000002s000002w000006">never</span>
     <span id="p000002s000002w000007">die,</span>
    </span><br/>
    ...
   </p>
   ...
  </div>
 </body>
</html>

PARSED = 'parsed'¶

The text file contains the fragments, one per line, with the syntax id|text, where id is a non-empty fragment identifier and text is the text of the fragment:

f001|Text of the first fragment
f002|Text of the second fragment
f003|Text of the third fragment
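
As a rough illustration of the `id|text` syntax, the sketch below splits each line on the first `|`. The helper name `parse_parsed_lines` is hypothetical, and skipping malformed lines is an assumption made here, not documented behavior.

```python
def parse_parsed_lines(lines):
    """Return a list of (identifier, text) tuples from `id|text` lines."""
    fragments = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the first "|" only, so the text may itself contain "|".
        identifier, separator, text = line.partition("|")
        if not separator or not identifier:
            # Assumption: lines without a separator or with an empty
            # identifier are skipped in this sketch.
            continue
        fragments.append((identifier, text))
    return fragments


parsed = parse_parsed_lines([
    "f001|Text of the first fragment",
    "f002|Text of the second fragment",
    "no separator here",
])
```

The resulting tuples have the same shape as the input expected by `read_from_list_with_ids`.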

PLAIN = 'plain'¶

The text file contains the fragments, one per line, without explicitly-assigned identifiers:

Text of the first fragment
Text of the second fragment
Text of the third fragment
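
Since the lines carry no identifiers, something has to generate them. The sketch below shows one possible scheme; the helper name `assign_identifiers` and the `f{:06d}` pattern are assumptions for illustration, not necessarily what the library produces.

```python
def assign_identifiers(lines, id_format="f{:06d}"):
    """Pair each non-blank line with a generated identifier."""
    texts = [line.strip() for line in lines if line.strip()]
    # Identifiers are numbered from 1, in the order the lines appear.
    return [(id_format.format(i), text) for i, text in enumerate(texts, start=1)]


plain = assign_identifiers([
    "Text of the first fragment",
    "Text of the second fragment",
    "Text of the third fragment",
])
```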

SUBTITLES = 'subtitles'¶

The text file contains the fragments without explicitly-assigned identifiers. Each fragment spans one or more consecutive lines, and fragments are separated by at least one blank line. Use this format if you want to output to SRT/TTML/VTT and keep the line breaks within each fragment in the output file:

Fragment on a single row

Fragment on two rows
because it is quite long

Another one liner

Another fragment
on two rows
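
The grouping rule (consecutive lines form one fragment, blank lines separate fragments) can be sketched as follows. The function name `parse_subtitles` is hypothetical and the code is only an illustration of the rule.

```python
def parse_subtitles(text):
    """Return a list of fragments, each a list of its lines."""
    fragments = []
    current = []
    for line in text.splitlines():
        if line.strip():
            # Consecutive non-blank lines belong to the same fragment.
            current.append(line.strip())
        elif current:
            # One or more blank lines end the current fragment.
            fragments.append(current)
            current = []
    if current:
        fragments.append(current)
    return fragments


subtitles = parse_subtitles(
    "Fragment on a single row\n"
    "\n"
    "Fragment on two rows\n"
    "because it is quite long\n"
    "\n"
    "\n"
    "Another one liner\n"
)
```

Keeping each fragment as a list of lines, rather than joining them, is what allows the multiline layout to be preserved in the output.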

UNPARSED = 'unparsed'¶

The text file is a well-formed HTML/XHTML file, where the text fragments have already been marked up.

The text fragments will be extracted by matching the id and/or class attributes of each element against the provided regular expressions:

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" lang="en" xml:lang="en">
 <head>
  <meta charset="utf-8"/>
  <link rel="stylesheet" href="../Styles/style.css" type="text/css"/>
  <title>Sonnet I</title>
 </head>
 <body>
  <div id="divTitle">
   <h1><span class="ra" id="f001">I</span></h1>
  </div>
  <div id="divSonnet">
   <p>
    <span class="ra" id="f002">From fairest creatures we desire increase,</span><br/>
    <span class="ra" id="f003">That thereby beauty’s rose might never die,</span><br/>
    <span class="ra" id="f004">But as the riper should by time decease,</span><br/>
    <span class="ra" id="f005">His tender heir might bear his memory:</span><br/>
    <span class="ra" id="f006">But thou contracted to thine own bright eyes,</span><br/>
    <span class="ra" id="f007">Feed’st thy light’s flame with self-substantial fuel,</span><br/>
    <span class="ra" id="f008">Making a famine where abundance lies,</span><br/>
    <span class="ra" id="f009">Thy self thy foe, to thy sweet self too cruel:</span><br/>
    <span class="ra" id="f010">Thou that art now the world’s fresh ornament,</span><br/>
    <span class="ra" id="f011">And only herald to the gaudy spring,</span><br/>
    <span class="ra" id="f012">Within thine own bud buriest thy content,</span><br/>
    <span class="ra" id="f013">And tender churl mak’st waste in niggarding:</span><br/>
    <span class="ra" id="f014">Pity the world, or else this glutton be,</span><br/>
    <span class="ra" id="f015">To eat the world’s due, by the grave and thee.</span>
   </p>
  </div>
 </body>
</html>
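
A minimal sketch of the matching idea, using only the standard library. The function `extract_fragments` and its keyword arguments are hypothetical, and the sketch assumes that every fragment element has an `id` attribute; the library's own extraction may differ.

```python
import re
import xml.etree.ElementTree as ET


def extract_fragments(xhtml, id_regex=None, class_regex=None):
    """Return (id, text) tuples for elements matching the given regexes."""
    id_re = re.compile(id_regex) if id_regex else None
    class_re = re.compile(class_regex) if class_regex else None
    fragments = []
    # iter() visits the elements in document order.
    for element in ET.fromstring(xhtml).iter():
        element_id = element.get("id")
        if element_id is None:
            continue
        if id_re and not id_re.fullmatch(element_id):
            continue
        if class_re and not class_re.search(element.get("class", "")):
            continue
        # Join the text of the element and of all its descendants.
        fragments.append((element_id, "".join(element.itertext()).strip()))
    return fragments


sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<h1><span class="ra" id="f001">I</span></h1>'
    '<p><span class="ra" id="f002">From fairest creatures we desire increase,</span><br/>'
    '<span class="note" id="n001">editor note</span></p>'
    '</body></html>'
)
extracted = extract_fragments(sample, id_regex=r"f[0-9]+", class_regex=r"ra")
```

Only the two `ra` spans are returned; the `n001` element is filtered out by the id regex.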

class aeneas.textfile.TextFilter(rconf=None, logger=None)[source]¶

A text filter is a function acting on a list of strings, and returning a new list of strings derived from the former (with the same number of elements).

For example, a filter might apply a regex to the input string, or it might transliterate its characters.

Filters can be chained, to the left or to the right.

Parameters:
  rconf (`RuntimeConfiguration`) – a runtime configuration
  logger (`Logger`) – the logger object

add_filter(new_filter, as_last=True)[source]¶

Compose this filter with the given new_filter filter.

Parameters:
  new_filter (`TextFilter`) – the filter to be composed
  as_last (bool) – if `True`, compose to the right, otherwise to the left

apply_filter(strings)[source]¶

Apply the text filter to the given list of strings.

Parameters:	strings (list) – the list of input strings
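
The chaining behavior can be pictured with a small stand-in class. `SketchFilter` is hypothetical and is not the library's `TextFilter`; in particular, it assumes that composing "to the right" means the new filter runs after the existing ones.

```python
class SketchFilter:
    """A chain of string -> string functions applied to a list of strings."""

    def __init__(self, func=None):
        self.funcs = [func] if func is not None else []

    def add_filter(self, new_filter, as_last=True):
        if as_last:
            # Assumption: "to the right" means applied after the current chain.
            self.funcs = self.funcs + new_filter.funcs
        else:
            self.funcs = new_filter.funcs + self.funcs

    def apply_filter(self, strings):
        result = list(strings)
        for func in self.funcs:
            # Each step maps every string, so the length never changes.
            result = [func(s) for s in result]
        return result


chain = SketchFilter()
chain.add_filter(SketchFilter(lambda s: s + "1"))
chain.add_filter(SketchFilter(lambda s: s + "2"))
chain.add_filter(SketchFilter(lambda s: s + "0"), as_last=False)
chained = chain.apply_filter(["x", "y"])
```

The output list always has as many elements as the input list, matching the definition of a text filter given above.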

class aeneas.textfile.TextFilterIgnoreRegex(regex, rconf=None, logger=None)[source]¶

Delete the text matching the given regex.

Leading and trailing spaces, as well as repeated spaces, are removed.

Parameters:
  regex (regex) – the regular expression to be applied
  rconf (`RuntimeConfiguration`) – a runtime configuration
  logger (`Logger`) – the logger object
Raises:	ValueError: if `regex` is not a valid regex
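
A sketch of the described behavior using the standard `re` module. The function `ignore_regex` is hypothetical, and collapsing repeated spaces into a single space is an assumption about what "removed" means here.

```python
import re


def ignore_regex(strings, regex):
    """Delete text matching regex, then normalize spaces in each string."""
    try:
        pattern = re.compile(regex)
    except re.error as exc:
        # Mirror the documented contract: an invalid regex is a ValueError.
        raise ValueError("invalid regex: %s" % regex) from exc
    result = []
    for string in strings:
        string = pattern.sub("", string)
        # Collapse runs of spaces, then trim both ends.
        string = re.sub(r" +", " ", string).strip()
        result.append(string)
    return result


cleaned = ignore_regex(["Hello [laughs] world", "  [x]  plain  "], r"\[.*?\]")
```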

class aeneas.textfile.TextFilterTransliterate(map_file_path=None, map_object=None, rconf=None, logger=None)[source]¶

Transliterate the text using the given map file.

Leading and trailing spaces, as well as repeated spaces, are removed.

Parameters:
  map_object (`TransliterationMap`) – the map object
  map_file_path (string) – the path to a map file
  rconf (`RuntimeConfiguration`) – a runtime configuration
  logger (`Logger`) – the logger object
Raises:
  OSError: if `map_file_path` cannot be read
  TypeError: if `map_object` is not an instance of `TransliterationMap`

class aeneas.textfile.TextFragment(identifier=None, language=None, lines=None, filtered_lines=None)[source]¶

A text fragment.

Internally, all the text objects are Unicode strings.

Parameters:
  identifier (string) – the identifier of the fragment
  language (`Language`) – the language of the text of the fragment
  lines (list) – the lines into which the text is split
  filtered_lines (list) – the lines into which the text is split, possibly filtered for alignment purposes
Raises:
  TypeError: if `identifier` is not a Unicode string
  TypeError: if `lines` is not an instance of `list` or it contains an element which is not a Unicode string

characters¶

The number of characters in this text fragment, including line separators, if any.

Return type:	int

chars¶

Return the number of characters of the text fragment, not including the line separators.

Return type:	int
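
The difference between `characters` and `chars` can be illustrated with a sketch. The helper names are hypothetical, and the sketch assumes a one-character separator between consecutive lines; the library's exact counting may differ.

```python
def count_chars(lines):
    """Count characters, not including line separators."""
    return sum(len(line) for line in lines)


def count_characters(lines, separator=" "):
    """Count characters, including one separator between consecutive lines."""
    if not lines:
        return 0
    # Assumption: a single separator sits between each pair of lines.
    return count_chars(lines) + len(separator) * (len(lines) - 1)


fragment_lines = ["Fragment on two rows", "because it is quite long"]
without_separators = count_chars(fragment_lines)
with_separators = count_characters(fragment_lines)
```

For a single-line fragment the two counts coincide under this assumption.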

filtered_characters¶

The number of filtered characters in this text fragment.

Return type:	int

filtered_text¶

The filtered text of the text fragment.

Return type:	string

identifier¶

The identifier of the text fragment.

Return type:	string

language¶

The language of the text fragment.

Return type:	`Language`

lines¶

The lines of the text fragment.

Return type:	list of strings

text¶

The text of the text fragment.

Return type:	string

class aeneas.textfile.TransliterationMap(file_path, rconf=None, logger=None)[source]¶

A transliteration map is a dictionary that maps Unicode characters to their equivalent Unicode characters or strings (character sequences). If a character is unmapped, its image is the character itself. If a character is mapped to the empty string, it will be deleted. Otherwise, a character will be replaced with the associated string.

For its format, please read the initial comment included at the top of the transliteration.map sample file.

Parameters:
  file_path (string) – the path to the map file to be read
  rconf (`RuntimeConfiguration`) – a runtime configuration
  logger (`Logger`) – the logger object
Raises:	OSError: if `file_path` cannot be read
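
The three mapping rules (unmapped characters kept, characters mapped to the empty string deleted, other characters replaced) can be sketched with a plain `dict`. This stands in for the map object only as an illustration; the `transliterate` helper and the sample map are hypothetical, and the map file format is not modelled.

```python
def transliterate(string, trans_map):
    """Replace each character by its image in trans_map."""
    # dict.get(ch, ch) keeps unmapped characters unchanged, and an
    # empty-string image deletes the character.
    return "".join(trans_map.get(ch, ch) for ch in string)


sample_map = {
    "ä": "ae",  # replaced by a two-character string
    "ß": "ss",
    "–": "",    # deleted
}
transliterated = transliterate("Straße–Bär", sample_map)
```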

file_path¶

The path of the map file.

Return type:	string