textfile

This module contains the following classes:

class aeneas.textfile.TextFile(file_path=None, file_format=None, parameters=None, rconf=None, logger=None)

A tree of text fragments, representing a text file.

Parameters:
  • file_path (string) – the path to the text file. If both file_path and file_format are not None, the file is read immediately.
  • file_format (TextFileFormat) – the format of the text file
  • parameters (dict) – additional parameters used to parse the text file
  • rconf (RuntimeConfiguration) – a runtime configuration
  • logger (Logger) – the logger object
Raises:
  • OSError – if file_path cannot be read
  • TypeError – if parameters is not an instance of dict
  • ValueError – if file_format value is not allowed

add_fragment(fragment, as_last=True)

Add the given text fragment as the first or last child of the root node of the text file tree.

Parameters:
  • fragment (TextFragment) – the text fragment to be added
  • as_last (bool) – if True append fragment, otherwise prepend it
characters

The number of characters in this text file.

Return type:int
chars

Return the number of characters of the text file, not counting line or fragment separators.

Return type:int
children_not_empty

Return the direct non-empty children of the root of the fragments tree, as TextFile objects.

Return type:list of TextFile
clear()

Clear the text file, removing all the current fragments.

file_format

The format of the text file.

Return type:TextFileFormat
file_path

The path of the text file.

Return type:string
fragments

The current list of text fragments which are the children of the root node of the text file tree.

Return type:list of TextFragment
fragments_tree

Return the current tree of fragments.

Return type:Tree
get_slice(start=None, end=None)

Return a new TextFile containing the text fragments indexed from start (included) to end (excluded).

Parameters:
  • start (int) – the start index, included
  • end (int) – the end index, excluded
Return type:TextFile

get_subtree(root)

Return a new TextFile object, rooted at the given node root.

Parameters:root (Tree) – the root node
Return type:TextFile
parameters

Additional parameters used to parse the text file.

Return type:dict
read_from_list(lines)

Read text fragments from a given list of strings:

[fragment_1, fragment_2, ..., fragment_n]
Parameters:lines (list) – the text fragments
read_from_list_with_ids(lines)

Read text fragments from a given list of tuples:

[(id_1, text_1), (id_2, text_2), ..., (id_n, text_n)].
Parameters:lines (list) – the list of (id, text) tuples (see above)
set_language(language)

Set the given language for all the text fragments.

Parameters:language (Language) – the language of the text fragments
class aeneas.textfile.TextFileFormat

Enumeration of the supported formats for text files.

ALLOWED_VALUES = ['mplain', 'munparsed', 'parsed', 'plain', 'subtitles', 'unparsed']

List of all the allowed values

MPLAIN = 'mplain'

Multilevel version of the PLAIN format.

The text file contains fragments on multiple levels: paragraphs are separated by at least one blank line, each sentence is on its own line, and words are recognized automatically:

First sentence of Paragraph One.
Second sentence of Paragraph One.

First sentence of Paragraph Two.

First sentence of Paragraph Three.
Second sentence of Paragraph Three.
Third sentence of Paragraph Three.

The above will produce the following text tree:

Paragraph1 ("First ... One.")
  Sentence1 ("First ... One.")
    Word1 ("First")
    Word2 ("sentence")
    ...
    Word5 ("One.")
  Sentence2 ("Second ... One.")
    Word1 ("Second")
    Word2 ("sentence")
    ...
    Word5 ("One.")
Paragraph2 ("First ... Two.")
  Sentence1 ("First ... Two.")
    Word1 ("First")
    Word2 ("sentence")
    ...
    Word5 ("Two.")
...
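
The paragraph/sentence/word splitting described above can be sketched in plain Python. This is an illustration only, not the aeneas parser: the function name parse_mplain is hypothetical, and splitting words on whitespace is a simplifying assumption about how words are recognized.

```python
import re


def parse_mplain(text):
    """Split multilevel plain text into paragraphs -> sentences -> words.

    Hypothetical helper: paragraphs are separated by one or more blank
    lines, each non-empty line is one sentence, and words are split on
    whitespace (an assumption; the real word recognition may differ).
    """
    tree = []
    # A run of blank (or whitespace-only) lines ends a paragraph.
    for block in re.split(r"\n\s*\n", text.strip()):
        sentences = [line.strip() for line in block.splitlines() if line.strip()]
        if sentences:
            tree.append([sentence.split() for sentence in sentences])
    return tree


sample = (
    "First sentence of Paragraph One.\n"
    "Second sentence of Paragraph One.\n"
    "\n"
    "First sentence of Paragraph Two.\n"
)
tree = parse_mplain(sample)
# tree[0] is Paragraph1, tree[0][0] is its Sentence1, tree[0][0][0] is Word1.
```

The result is a nested list (paragraphs, then sentences, then words), mirroring the three levels of the text tree shown above.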
MULTILEVEL_VALUES = ['mplain', 'munparsed']

List of all multilevel formats

MUNPARSED = 'munparsed'

Multilevel version of the UNPARSED format.

The text file contains fragments on three levels: level 1 (paragraph), level 2 (sentence), level 3 (word):

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" lang="en" xml:lang="en">
 <head>
  <meta charset="utf-8"/>
  <link rel="stylesheet" href="../Styles/style.css" type="text/css"/>
  <title>Sonnet I</title>
 </head>
 <body>
  <div id="divTitle">
   <h1>
    <span id="p000001">
     <span id="p000001s000001">
      <span id="p000001s000001w000001">I</span>
     </span>
    </span>
   </h1>
  </div>
  <div id="divSonnet">
   <p class="stanza" id="p000002">
    <span id="p000002s000001">
     <span id="p000002s000001w000001">From</span>
     <span id="p000002s000001w000002">fairest</span>
     <span id="p000002s000001w000003">creatures</span>
     <span id="p000002s000001w000004">we</span>
     <span id="p000002s000001w000005">desire</span>
     <span id="p000002s000001w000006">increase,</span>
    </span><br/>
    <span id="p000002s000002">
     <span id="p000002s000002w000001">That</span>
     <span id="p000002s000002w000002">thereby</span>
     <span id="p000002s000002w000003">beauty’s</span>
     <span id="p000002s000002w000004">rose</span>
     <span id="p000002s000002w000005">might</span>
     <span id="p000002s000002w000006">never</span>
     <span id="p000002s000002w000007">die,</span>
    </span><br/>
    ...
   </p>
   ...
  </div>
 </body>
</html>
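
One way to see how the three levels relate is to group elements by the shape of their id attribute. The sketch below uses only the Python standard library and is not the aeneas implementation: the three regular expressions and the function names are assumptions modeled on the identifiers in the example above.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical per-level id patterns, modeled on the example identifiers.
LEVEL1 = re.compile(r"p[0-9]+$")                # paragraph
LEVEL2 = re.compile(r"p[0-9]+s[0-9]+$")         # sentence
LEVEL3 = re.compile(r"p[0-9]+s[0-9]+w[0-9]+$")  # word


def collect(element, pattern):
    """Return element and its descendants whose id matches pattern."""
    return [e for e in element.iter() if pattern.match(e.get("id", ""))]


def parse_multilevel(xhtml):
    """Build {paragraph_id: {sentence_id: [(word_id, word_text), ...]}}."""
    root = ET.fromstring(xhtml)
    tree = {}
    for paragraph in collect(root, LEVEL1):
        tree[paragraph.get("id")] = {
            sentence.get("id"): [
                (word.get("id"), "".join(word.itertext()).strip())
                for word in collect(sentence, LEVEL3)
            ]
            for sentence in collect(paragraph, LEVEL2)
        }
    return tree


sample = (
    '<body><p id="p000002"><span id="p000002s000001">'
    '<span id="p000002s000001w000001">From</span> '
    '<span id="p000002s000001w000002">fairest</span>'
    "</span></p></body>"
)
tree = parse_multilevel(sample)
```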
PARSED = 'parsed'

The text file contains the fragments, one per line, with the syntax id|text, where id is a non-empty fragment identifier and text is the text of the fragment:

f001|Text of the first fragment
f002|Text of the second fragment
f003|Text of the third fragment
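
A minimal sketch of reading the id|text syntax in plain Python follows. The helper name parse_parsed_line is hypothetical, and splitting at the first "|" only (so that the text may itself contain "|") is an assumption rather than documented aeneas behavior.

```python
def parse_parsed_line(line):
    """Split one 'id|text' line into an (identifier, text) tuple.

    Hypothetical helper: splits at the first '|' and rejects lines
    with a missing separator or an empty identifier.
    """
    identifier, separator, text = line.partition("|")
    if not separator or not identifier.strip():
        raise ValueError("expected 'id|text' with a non-empty id: %r" % (line,))
    return identifier.strip(), text.strip()


lines = [
    "f001|Text of the first fragment",
    "f002|Text of the second fragment",
    "f003|Text of the third fragment",
]
fragments = [parse_parsed_line(line) for line in lines]
```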
PLAIN = 'plain'

The text file contains the fragments, one per line, without explicitly-assigned identifiers:

Text of the first fragment
Text of the second fragment
Text of the third fragment
SUBTITLES = 'subtitles'

The text file contains the fragments without explicitly-assigned identifiers. Each fragment spans one or more consecutive lines, and fragments are separated by at least one blank line. Use this format if you want to output to SRT/TTML/VTT and keep multiline fragments in the output file:

Fragment on a single row

Fragment on two rows
because it is quite long

Another one liner

Another fragment
on two rows
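
The grouping rule above (consecutive lines form one fragment, blank lines separate fragments) can be sketched as follows. The function name parse_subtitles is hypothetical and the code is an illustration, not the aeneas parser.

```python
def parse_subtitles(text):
    """Group consecutive non-blank lines into fragments.

    Returns a list of fragments, each a list of its lines, so the
    original line breaks inside a fragment are preserved.
    """
    fragments = []
    current = []
    for line in text.splitlines():
        if line.strip():
            current.append(line.strip())
        elif current:
            # A blank line closes the fragment being collected.
            fragments.append(current)
            current = []
    if current:
        fragments.append(current)
    return fragments


sample = (
    "Fragment on a single row\n"
    "\n"
    "Fragment on two rows\n"
    "because it is quite long\n"
    "\n"
    "Another one liner\n"
    "\n"
    "Another fragment\n"
    "on two rows\n"
)
fragments = parse_subtitles(sample)
```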
UNPARSED = 'unparsed'

The text file is a well-formed HTML/XHTML file, where the text fragments have already been marked up.

The text fragments will be extracted by matching the id and/or class attributes of each element with the provided regular expressions:

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" lang="en" xml:lang="en">
 <head>
  <meta charset="utf-8"/>
  <link rel="stylesheet" href="../Styles/style.css" type="text/css"/>
  <title>Sonnet I</title>
 </head>
 <body>
  <div id="divTitle">
   <h1><span class="ra" id="f001">I</span></h1>
  </div>
  <div id="divSonnet">
   <p>
    <span class="ra" id="f002">From fairest creatures we desire increase,</span><br/>
    <span class="ra" id="f003">That thereby beauty’s rose might never die,</span><br/>
    <span class="ra" id="f004">But as the riper should by time decease,</span><br/>
    <span class="ra" id="f005">His tender heir might bear his memory:</span><br/>
    <span class="ra" id="f006">But thou contracted to thine own bright eyes,</span><br/>
    <span class="ra" id="f007">Feed’st thy light’s flame with self-substantial fuel,</span><br/>
    <span class="ra" id="f008">Making a famine where abundance lies,</span><br/>
    <span class="ra" id="f009">Thy self thy foe, to thy sweet self too cruel:</span><br/>
    <span class="ra" id="f010">Thou that art now the world’s fresh ornament,</span><br/>
    <span class="ra" id="f011">And only herald to the gaudy spring,</span><br/>
    <span class="ra" id="f012">Within thine own bud buriest thy content,</span><br/>
    <span class="ra" id="f013">And tender churl mak’st waste in niggarding:</span><br/>
    <span class="ra" id="f014">Pity the world, or else this glutton be,</span><br/>
    <span class="ra" id="f015">To eat the world’s due, by the grave and thee.</span>
   </p>
  </div>
 </body>
</html>
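
The extraction by id and/or class can be approximated with the Python standard library. This is a sketch under stated assumptions, not the aeneas code: extract_fragments and its id_regex and class_regex arguments are hypothetical names, the id must match the whole expression, and the class only needs to contain a match.

```python
import re
import xml.etree.ElementTree as ET


def extract_fragments(xhtml, id_regex=None, class_regex=None):
    """Return (id, text) pairs for elements matching the given regexes.

    Hypothetical helper: an element needs an id attribute; if id_regex
    is given the id must match it fully, and if class_regex is given
    the class attribute must contain a match.
    """
    id_re = re.compile(id_regex) if id_regex else None
    class_re = re.compile(class_regex) if class_regex else None
    fragments = []
    for element in ET.fromstring(xhtml).iter():
        identifier = element.get("id")
        if identifier is None:
            continue
        if id_re is not None and not id_re.fullmatch(identifier):
            continue
        if class_re is not None and not class_re.search(element.get("class", "")):
            continue
        # Join nested text nodes and normalize whitespace.
        text = " ".join("".join(element.itertext()).split())
        fragments.append((identifier, text))
    return fragments


sample = """<html xmlns="http://www.w3.org/1999/xhtml">
 <body>
  <div id="divTitle">
   <h1><span class="ra" id="f001">I</span></h1>
  </div>
  <p>
   <span class="ra" id="f002">From fairest creatures we desire increase,</span><br/>
  </p>
 </body>
</html>"""
fragments = extract_fragments(sample, id_regex=r"f[0-9]+", class_regex=r"ra")
```

The div with id divTitle is skipped because its id does not match the id expression, so only the marked-up spans are returned.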
class aeneas.textfile.TextFilter(rconf=None, logger=None)

A text filter is a function acting on a list of strings, and returning a new list of strings derived from the former (with the same number of elements).

For example, a filter might apply a regex to the input string, or it might transliterate its characters.

Filters can be chained, to the left or to the right.

Parameters:
  • rconf (RuntimeConfiguration) – a runtime configuration
  • logger (Logger) – the logger object
add_filter(new_filter, as_last=True)

Compose this filter with the given new_filter.

Parameters:
  • new_filter (TextFilter) – the filter to be composed
  • as_last (bool) – if True, compose to the right, otherwise to the left
apply_filter(strings)

Apply the text filter to the given list of strings.

Parameters:strings (list) – the list of input strings
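
To illustrate what chaining to the left or to the right can mean, here is a small stand-in written from scratch. SketchFilter is a hypothetical class, not aeneas.textfile.TextFilter, and it assumes that composing "to the right" (as_last=True) means the new filter runs after the existing ones.

```python
class SketchFilter:
    """Hypothetical chainable filter over lists of strings."""

    def __init__(self, function=None):
        # Ordered list of str -> str functions, applied first to last.
        self.functions = [] if function is None else [function]

    def add_filter(self, new_filter, as_last=True):
        """Chain new_filter after (as_last=True) or before this filter."""
        if as_last:
            self.functions = self.functions + new_filter.functions
        else:
            self.functions = new_filter.functions + self.functions

    def apply_filter(self, strings):
        """Return a new list with every function applied to every string."""
        result = list(strings)
        for function in self.functions:
            result = [function(string) for string in result]
        return result


def add_x(string):
    return string + "x"


right = SketchFilter(add_x)
right.add_filter(SketchFilter(str.upper))                 # add_x, then upper
left = SketchFilter(add_x)
left.add_filter(SketchFilter(str.upper), as_last=False)   # upper, then add_x

right_result = right.apply_filter(["hello", "world"])
left_result = left.apply_filter(["hello", "world"])
```

The two chains contain the same filters but apply them in opposite orders, so the appended "x" is uppercased in one result and not in the other. In both cases the output list has as many elements as the input.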
class aeneas.textfile.TextFilterIgnoreRegex(regex, rconf=None, logger=None)

Delete the text matching the given regex.

Leading and trailing spaces are stripped, and runs of repeated spaces are collapsed.

Parameters:
  • regex (regex) – the regular expression to be applied
  • rconf (RuntimeConfiguration) – a runtime configuration
  • logger (Logger) – the logger object
Raises:ValueError – if regex is not a valid regex
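
The behavior described for this filter can be approximated with the re module. The function ignore_regex below is a hypothetical sketch, not the aeneas implementation; it normalizes all whitespace rather than spaces only, and it converts re.error into ValueError to mirror the documented exception.

```python
import re


def ignore_regex(strings, regex):
    """Delete text matching regex from each string, then tidy spaces.

    Hypothetical helper: returns a new list with the same number of
    elements as the input.
    """
    try:
        pattern = re.compile(regex)
    except re.error as exc:
        raise ValueError("invalid regex: %r" % (regex,)) from exc
    # split() + join strips the ends and collapses whitespace runs.
    return [" ".join(pattern.sub("", string).split()) for string in strings]


cleaned = ignore_regex(["Hello [laughs]  world ", "[music]"], r"\[.*?\]")
```

A string that consists only of matching text becomes the empty string rather than being dropped, which keeps the output aligned with the input list.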

class aeneas.textfile.TextFilterTransliterate(map_file_path=None, map_object=None, rconf=None, logger=None)

Transliterate the text using the given map file.

Leading and trailing spaces are stripped, and runs of repeated spaces are collapsed.

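Parameters:
  • map_file_path (string) – the path to the map file
  • map_object (TransliterationMap) – the map object
  • rconf (RuntimeConfiguration) – a runtime configuration
  • logger (Logger) – the logger object
Raises:
  • OSError – if map_file_path cannot be read
  • TypeError – if map_object is not an instance of TransliterationMap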

class aeneas.textfile.TextFragment(identifier=None, language=None, lines=None, filtered_lines=None)

A text fragment.

Internally, all the text objects are Unicode strings.

Parameters:
  • identifier (string) – the identifier of the fragment
  • language (Language) – the language of the text of the fragment
  • lines (list) – the lines into which the text is split
  • filtered_lines (list) – the lines into which the text is split, possibly filtered for alignment purposes
Raises:
  • TypeError – if identifier is not a Unicode string
  • TypeError – if lines is not an instance of list or it contains an element which is not a Unicode string

characters

The number of characters in this text fragment, including line separators, if any.

Return type:int
chars

Return the number of characters of the text fragment, not including the line separators.

Return type:int
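
The difference between characters and chars can be shown with a small calculation. This sketch does not call aeneas; it assumes that the lines of a fragment are joined with a single-space separator, which is an assumption and not something stated in this reference.

```python
lines = ["Fragment on two rows", "because it is quite long"]

# Assumption: one separator character between consecutive lines.
separator = " "

# Count without line separators (what chars is described as returning).
chars = sum(len(line) for line in lines)

# Count including line separators (what characters is described as returning).
characters = len(separator.join(lines))
```

Under that assumption the two counts differ by the number of line breaks in the fragment, that is, by len(lines) - 1.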
filtered_characters

The number of filtered characters in this text fragment.

Return type:int
filtered_text

The filtered text of the text fragment.

Return type:string
identifier

The identifier of the text fragment.

Return type:string
language

The language of the text fragment.

Return type:Language
lines

The lines of the text fragment.

Return type:list of strings
text

The text of the text fragment.

Return type:string
class aeneas.textfile.TransliterationMap(file_path, rconf=None, logger=None)

A transliteration map is a dictionary that maps Unicode characters to their equivalent Unicode characters or strings (character sequences). If a character is unmapped, its image is the character itself. If a character is mapped to the empty string, it will be deleted. Otherwise, a character will be replaced with the associated string.

For its format, please read the initial comment included at the top of the transliteration.map sample file.

Parameters:
  • file_path (string) – the path to the map file to be read
  • rconf (RuntimeConfiguration) – a runtime configuration
  • logger (Logger) – the logger object
Raises:OSError – if file_path cannot be read

file_path

The path of the map file.

Return type:string
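
The three mapping rules described for a transliteration map (unmapped characters are kept, characters mapped to the empty string are deleted, all others are replaced) can be sketched with a plain dict. The function transliterate and the sample mapping are hypothetical; the sketch does not read the map file format.

```python
def transliterate(text, mapping):
    """Apply a character-level mapping to text.

    Unmapped characters map to themselves, characters mapped to ""
    are deleted, and any other character is replaced by its string.
    """
    return "".join(mapping.get(character, character) for character in text)


# Hypothetical sample mapping, for illustration only.
mapping = {
    "ß": "ss",      # one character replaced by a two-character string
    "é": "e",       # one character replaced by another character
    "\u00ad": "",   # soft hyphen mapped to the empty string: deleted
}

plain = transliterate("Straße café", mapping)
joined = transliterate("co\u00adop", mapping)
```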