textfile
This module contains the following classes:
- TextFile, representing a text file;
- TextFileFormat, an enumeration of text file formats;
- TextFilter, an abstract class for filtering text;
- TextFilterIgnoreRegex, a regular expression text filter;
- TextFilterTransliterate, a transliteration text filter;
- TextFragment, representing a single text fragment;
- TransliterationMap, a full transliteration map.

class aeneas.textfile.TextFile(file_path=None, file_format=None, parameters=None, rconf=None, logger=None)
A tree of text fragments, representing a text file.
Parameters:
- file_path (string) – the path to the text file. If not None (and file_format is also not None), the file will be read immediately.
- file_format (TextFileFormat) – the format of the text file
- parameters (dict) – additional parameters used to parse the text file
- rconf (RuntimeConfiguration) – a runtime configuration
- logger (Logger) – the logger object
Raises: OSError: if file_path cannot be read
Raises: TypeError: if parameters is not an instance of dict
Raises: ValueError: if file_format value is not allowed

add_fragment(fragment, as_last=True)
Add the given text fragment as the first or last child of the root node of the text file tree.
Parameters:
- fragment (TextFragment) – the text fragment to be added
- as_last (bool) – if True, append the fragment; otherwise prepend it

characters
The number of characters in this text file.
Return type: int

chars
Return the number of characters of the text file, not counting line or fragment separators.
Return type: int

children_not_empty
Return the non-empty direct children of the root of the fragments tree, as TextFile objects.
Return type: list of TextFile

file_format
The format of the text file.
Return type: TextFileFormat

file_path
The path of the text file.
Return type: string

fragments
The current list of text fragments which are the children of the root node of the text file tree.
Return type: list of TextFragment

fragments_tree
Return the current tree of fragments.
Return type: Tree

get_slice(start=None, end=None)
Return a new list of text fragments, indexed from start (included) to end (excluded).
Parameters:
- start (int) – the start index, included
- end (int) – the end index, excluded
Return type:

get_subtree(root)
Return a new TextFile object, rooted at the given node root.
Parameters: root (Tree) – the root node
Return type: TextFile

parameters
Additional parameters used to parse the text file.
Return type: dict

read_from_list(lines)
Read text fragments from a given list of strings:
[fragment_1, fragment_2, ..., fragment_n]
Parameters: lines (list) – the text fragments

class aeneas.textfile.TextFileFormat
Enumeration of the supported formats for text files.

ALLOWED_VALUES = ['mplain', 'munparsed', 'parsed', 'plain', 'subtitles', 'unparsed']
List of all the allowed values

MPLAIN = 'mplain'
Multilevel version of the PLAIN format.
The text file contains fragments on multiple levels: paragraphs are separated by (at least) a blank line, sentences are on different lines, words will be recognized automatically:

    First sentence of Paragraph One.
    Second sentence of Paragraph One.

    First sentence of Paragraph Two.

    First sentence of Paragraph Three.
    Second sentence of Paragraph Three.
    Third sentence of Paragraph Three.

The above will produce the following text tree:

    Paragraph1 ("First ... One.")
      Sentence1 ("First ... One.")
        Word1 ("First")
        Word2 ("sentence")
        ...
        Word5 ("One.")
      Sentence2 ("Second ... One.")
        Word1 ("Second")
        Word2 ("sentence")
        ...
        Word5 ("One.")
    Paragraph2 ("First ... Two.")
      Sentence1 ("First ... Two.")
        Word1 ("First")
        Word2 ("sentence")
        ...
        Word5 ("Two.")
    ...
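
The three-level split described above can be sketched in a few lines of standalone Python. This is only an illustration and does not use aeneas: the function name `build_tree` is hypothetical, and splitting words on whitespace is an assumption, so the library's own word recognition may behave differently.

```python
import re


def build_tree(text):
    """Split text into paragraphs -> sentences -> words (illustrative only)."""
    tree = []
    # Paragraphs are separated by one or more blank lines.
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        sentences = []
        # Each non-empty line of a paragraph is one sentence.
        for line in paragraph.splitlines():
            line = line.strip()
            if line:
                # Assumption: words are whitespace-separated tokens.
                sentences.append((line, line.split()))
        tree.append(sentences)
    return tree


sample = """First sentence of Paragraph One.
Second sentence of Paragraph One.

First sentence of Paragraph Two.

First sentence of Paragraph Three.
Second sentence of Paragraph Three.
Third sentence of Paragraph Three.
"""

tree = build_tree(sample)
for p_index, sentences in enumerate(tree, start=1):
    print("Paragraph%d" % p_index)
    for s_index, (sentence, words) in enumerate(sentences, start=1):
        print("  Sentence%d (%d words)" % (s_index, len(words)))
```

Each paragraph is a list of `(sentence, words)` pairs, which mirrors the Paragraph/Sentence/Word nesting of the text tree shown above.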

MULTILEVEL_VALUES = ['mplain', 'munparsed']
List of all multilevel formats

MUNPARSED = 'munparsed'
Multilevel version of the UNPARSED format.
The text file contains fragments on three levels: level 1 (paragraph), level 2 (sentence), level 3 (word):

    <?xml version="1.0" encoding="UTF-8"?>
    <html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" lang="en" xml:lang="en">
     <head>
      <meta charset="utf-8"/>
      <link rel="stylesheet" href="../Styles/style.css" type="text/css"/>
      <title>Sonnet I</title>
     </head>
     <body>
      <div id="divTitle">
       <h1>
        <span id="p000001">
         <span id="p000001s000001">
          <span id="p000001s000001w000001">I</span>
         </span>
        </span>
       </h1>
      </div>
      <div id="divSonnet">
       <p class="stanza" id="p000002">
        <span id="p000002s000001">
         <span id="p000002s000001w000001">From</span>
         <span id="p000002s000001w000002">fairest</span>
         <span id="p000002s000001w000003">creatures</span>
         <span id="p000002s000001w000004">we</span>
         <span id="p000002s000001w000005">desire</span>
         <span id="p000002s000001w000006">increase,</span>
        </span><br/>
        <span id="p000002s000002">
         <span id="p000002s000002w000001">That</span>
         <span id="p000002s000002w000002">thereby</span>
         <span id="p000002s000002w000003">beauty’s</span>
         <span id="p000002s000002w000004">rose</span>
         <span id="p000002s000002w000005">might</span>
         <span id="p000002s000002w000006">never</span>
         <span id="p000002s000002w000007">die,</span>
        </span><br/>
        ...
       </p>
       ...
      </div>
     </body>
    </html>

PARSED = 'parsed'
The text file contains the fragments, one per line, with the syntax id|text, where id is a non-empty fragment identifier and text is the text of the fragment:

    f001|Text of the first fragment
    f002|Text of the second fragment
    f003|Text of the third fragment
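
To make the `id|text` syntax concrete, here is a minimal standalone sketch of a parser for such lines. It is not the aeneas implementation: the function name `parse_parsed_lines` is hypothetical, and skipping blank lines and splitting on the first `|` only are assumptions chosen for the example.

```python
def parse_parsed_lines(lines):
    """Parse 'id|text' lines into (identifier, text) pairs (illustrative only)."""
    fragments = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are skipped
        # Split on the first '|' only, so the text itself may contain '|'.
        identifier, separator, text = line.partition("|")
        if not separator or not identifier:
            raise ValueError("Expected 'id|text' with a non-empty id: %r" % line)
        fragments.append((identifier, text))
    return fragments


fragments = parse_parsed_lines([
    "f001|Text of the first fragment",
    "f002|Text of the second fragment",
    "f003|Text of the third fragment",
])
for identifier, text in fragments:
    print(identifier, "->", text)
```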

PLAIN = 'plain'
The text file contains the fragments, one per line, without explicitly-assigned identifiers:

    Text of the first fragment
    Text of the second fragment
    Text of the third fragment
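
Since this format carries no identifiers, a reader has to generate them. The standalone sketch below shows one way to do that. It does not use aeneas, and the function name `parse_plain_lines`, the `id_format` parameter and its default value are all hypothetical; the identifiers the library actually assigns may look different.

```python
def parse_plain_lines(lines, id_format="f%06d"):
    """Turn one-fragment-per-line text into (identifier, text) pairs."""
    fragments = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # assumption: blank lines do not produce fragments
        # Assumption: identifiers are sequential, starting from 1.
        identifier = id_format % (len(fragments) + 1)
        fragments.append((identifier, text))
    return fragments


fragments = parse_plain_lines([
    "Text of the first fragment",
    "Text of the second fragment",
    "Text of the third fragment",
])
for identifier, text in fragments:
    print(identifier, "->", text)
```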

SUBTITLES = 'subtitles'
The text file contains the fragments without explicitly-assigned identifiers. Each fragment spans one or more consecutive lines, and fragments are separated by (at least) a blank line. Use this format if you want to output to SRT/TTML/VTT and you want to keep multilines in the output file:

    Fragment on a single row

    Fragment on two rows
    because it is quite long

    Another one liner

    Another fragment
    on two rows
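
The grouping rule (consecutive lines form one fragment, blank lines separate fragments) can be sketched as follows. This is a standalone illustration, not the aeneas parser; the function name `parse_subtitles` is hypothetical.

```python
def parse_subtitles(text):
    """Group consecutive non-blank lines into fragments (lists of lines)."""
    fragments = []
    current = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            current.append(line)
        elif current:
            # A blank line closes the fragment being collected.
            fragments.append(current)
            current = []
    if current:
        fragments.append(current)
    return fragments


sample = """Fragment on a single row

Fragment on two rows
because it is quite long

Another one liner

Another fragment
on two rows
"""

fragments = parse_subtitles(sample)
for lines in fragments:
    print(lines)
```

Keeping each fragment as a list of lines, rather than joining them, is what allows the line breaks to be preserved in a subtitle-style output.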

UNPARSED = 'unparsed'
The text file is a well-formed HTML/XHTML file, where the text fragments have already been marked up.
The text fragments will be extracted by matching the id and/or class attributes of each element with the provided regular expressions:

    <?xml version="1.0" encoding="UTF-8"?>
    <html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" lang="en" xml:lang="en">
     <head>
      <meta charset="utf-8"/>
      <link rel="stylesheet" href="../Styles/style.css" type="text/css"/>
      <title>Sonnet I</title>
     </head>
     <body>
      <div id="divTitle">
       <h1><span class="ra" id="f001">I</span></h1>
      </div>
      <div id="divSonnet">
       <p>
        <span class="ra" id="f002">From fairest creatures we desire increase,</span><br/>
        <span class="ra" id="f003">That thereby beauty’s rose might never die,</span><br/>
        <span class="ra" id="f004">But as the riper should by time decease,</span><br/>
        <span class="ra" id="f005">His tender heir might bear his memory:</span><br/>
        <span class="ra" id="f006">But thou contracted to thine own bright eyes,</span><br/>
        <span class="ra" id="f007">Feed’st thy light’s flame with self-substantial fuel,</span><br/>
        <span class="ra" id="f008">Making a famine where abundance lies,</span><br/>
        <span class="ra" id="f009">Thy self thy foe, to thy sweet self too cruel:</span><br/>
        <span class="ra" id="f010">Thou that art now the world’s fresh ornament,</span><br/>
        <span class="ra" id="f011">And only herald to the gaudy spring,</span><br/>
        <span class="ra" id="f012">Within thine own bud buriest thy content,</span><br/>
        <span class="ra" id="f013">And tender churl mak’st waste in niggarding:</span><br/>
        <span class="ra" id="f014">Pity the world, or else this glutton be,</span><br/>
        <span class="ra" id="f015">To eat the world’s due, by the grave and thee.</span>
       </p>
      </div>
     </body>
    </html>
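
As a rough illustration of attribute matching, the standalone sketch below extracts fragments from a shortened XHTML sample using only the Python standard library. It is not how aeneas is implemented. The function name `extract_fragments`, its `id_regex` and `class_regex` parameters, and the extra `n001` element in the sample are hypothetical, and requiring both patterns to match when both are given is an assumption.

```python
import re
import xml.etree.ElementTree as ET

# Shortened sample; the "n001" element is added only to show a non-match.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
 <body>
  <h1><span class="ra" id="f001">I</span></h1>
  <p>
   <span class="ra" id="f002">From fairest creatures we desire increase,</span><br/>
   <span class="note" id="n001">An editorial note</span>
  </p>
 </body>
</html>"""


def extract_fragments(xhtml, id_regex=None, class_regex=None):
    """Return (id, text) pairs for elements whose attributes match."""
    id_pattern = re.compile(id_regex) if id_regex else None
    class_pattern = re.compile(class_regex) if class_regex else None
    fragments = []
    # iter() walks the elements in document order.
    for element in ET.fromstring(xhtml).iter():
        element_id = element.get("id")
        if element_id is None:
            continue
        if id_pattern and not id_pattern.search(element_id):
            continue
        if class_pattern and not class_pattern.search(element.get("class", "")):
            continue
        # Collect the text of the element and its descendants,
        # collapsing runs of whitespace.
        text = " ".join("".join(element.itertext()).split())
        fragments.append((element_id, text))
    return fragments


fragments = extract_fragments(XHTML, id_regex=r"^f[0-9]+$", class_regex=r"ra")
for identifier, text in fragments:
    print(identifier, "->", text)
```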

class aeneas.textfile.TextFilter(rconf=None, logger=None)
A text filter is a function acting on a list of strings, and returning a new list of strings derived from the former (with the same number of elements).
For example, a filter might apply a regex to the input string, or it might transliterate its characters.
Filters can be chained, to the left or to the right.
Parameters:
- rconf (RuntimeConfiguration) – a runtime configuration
- logger (Logger) – the logger object

add_filter(new_filter, as_last=True)
Compose this filter with the given new_filter filter.
Parameters:
- new_filter (TextFilter) – the filter to be composed
- as_last (bool) – if True, compose to the right, otherwise to the left
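
The effect of composing to the left or to the right can be illustrated with a small standalone class. This sketch does not use aeneas: `ListFilter` and its `apply` method are hypothetical stand-ins, and it assumes that composing "to the right" means the new filter runs after the existing ones.

```python
class ListFilter(object):
    """Hypothetical chainable filter acting on a list of strings."""

    def __init__(self, function):
        # Functions are applied in list order, each to every string.
        self.functions = [function]

    def add_filter(self, new_filter, as_last=True):
        if as_last:
            # Assumption: "to the right" means the new filter runs last.
            self.functions = self.functions + new_filter.functions
        else:
            self.functions = new_filter.functions + self.functions

    def apply(self, strings):
        result = list(strings)
        for function in self.functions:
            result = [function(string) for string in result]
        return result


b_to_c = ListFilter(lambda s: s.replace("b", "c"))

right = ListFilter(lambda s: s.replace("a", "b"))
right.add_filter(b_to_c, as_last=True)    # a->b runs first, then b->c

left = ListFilter(lambda s: s.replace("a", "b"))
left.add_filter(b_to_c, as_last=False)    # b->c runs first, then a->b

print(right.apply(["ab", "x"]))
print(left.apply(["ab", "x"]))
```

The two replacements do not commute, so the two chains give different results for the same input, while the output list always has as many elements as the input list.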

class aeneas.textfile.TextFilterIgnoreRegex(regex, rconf=None, logger=None)
Delete the text matching the given regex.
Leading/trailing spaces, and repeated spaces are removed.
Parameters:
- regex (regex) – the regular expression to be applied
- rconf (RuntimeConfiguration) – a runtime configuration
- logger (Logger) – the logger object
Raises: ValueError: if regex is not a valid regex
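
The behaviour described above (delete matches, then normalize spaces) can be approximated with the standard `re` module. The function `ignore_regex` below is a hypothetical standalone sketch, not the aeneas class; it collapses any run of whitespace, which is an assumption about what "repeated spaces" covers.

```python
import re


def ignore_regex(strings, regex):
    """Delete regex matches from each string and normalize whitespace."""
    try:
        pattern = re.compile(regex)
    except re.error as exc:
        # Mirror the documented behaviour of raising ValueError.
        raise ValueError("Invalid regex %r: %s" % (regex, exc))
    result = []
    for string in strings:
        cleaned = pattern.sub("", string)
        # split() + join() strips the ends and collapses repeated whitespace.
        result.append(" ".join(cleaned.split()))
    return result


filtered = ignore_regex(["  Hello [laughs]   world ", "[music]"], r"\[.*?\]")
print(filtered)
```

A string that consists only of matched text becomes an empty string instead of being dropped, so the output keeps the same number of elements as the input.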

class aeneas.textfile.TextFilterTransliterate(map_file_path=None, map_object=None, rconf=None, logger=None)
Transliterate the text using the given map file.
Leading/trailing spaces, and repeated spaces are removed.
Parameters:
- map_object (TransliterationMap) – the map object
- map_file_path (string) – the path to a map file
- rconf (RuntimeConfiguration) – a runtime configuration
- logger (Logger) – the logger object
Raises: OSError: if map_file_path cannot be read
Raises: TypeError: if map_object is not an instance of TransliterationMap

class aeneas.textfile.TextFragment(identifier=None, language=None, lines=None, filtered_lines=None)
A text fragment.
Internally, all the text objects are Unicode strings.
Parameters:
- identifier (string) – the identifier of the fragment
- language (Language) – the language of the text of the fragment
- lines (list) – the lines into which the text is split
- filtered_lines (list) – the lines into which the text is split, possibly filtered for alignment purposes
Raises: TypeError: if identifier is not a Unicode string
Raises: TypeError: if lines is not an instance of list or it contains an element which is not a Unicode string

characters
The number of characters in this text fragment, including line separators, if any.
Return type: int

chars
Return the number of characters of the text fragment, not including the line separators.
Return type: int
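
The difference between the two counts is easiest to see on a fragment with more than one line. The standalone sketch below assumes that lines are joined with a single space as the line separator; that separator is an assumption made for the example and is not stated on this page.

```python
lines = ["First line", "second line"]

# Assumption: a single space is used as the line separator.
LINE_SEPARATOR = " "

# Count including separators: length of the joined text.
characters = len(LINE_SEPARATOR.join(lines))

# Count excluding separators: sum of the individual line lengths.
chars = sum(len(line) for line in lines)

print(characters, chars)
```

With n lines and a one-character separator, the two values differ by n - 1.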

filtered_characters
The number of filtered characters in this text fragment.
Return type: int

filtered_text
The filtered text of the text fragment.
Return type: string

identifier
The identifier of the text fragment.
Return type: string

lines
The lines of the text fragment.
Return type: list of strings

text
The text of the text fragment.
Return type: string

class aeneas.textfile.TransliterationMap(file_path, rconf=None, logger=None)
A transliteration map is a dictionary that maps Unicode characters to their equivalent Unicode characters or strings (character sequences). If a character is unmapped, its image is the character itself. If a character is mapped to the empty string, it will be deleted. Otherwise, a character will be replaced with the associated string.
For its format, please read the initial comment included at the top of the transliteration.map sample file.
Parameters:
- file_path (string) – the path to the map file to be read
- rconf (RuntimeConfiguration) – a runtime configuration
- logger (Logger) – the logger object
Raises: OSError: if file_path cannot be read

file_path
The path of the map file.
Return type: string
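
The three mapping rules for a transliteration map (unmapped characters stay as they are, characters mapped to the empty string are deleted, other mapped characters are replaced) can be sketched with a plain dictionary. This standalone example does not read a map file and does not use aeneas; the function name `transliterate` and the sample mappings are hypothetical.

```python
def transliterate(string, trans_map):
    """Apply a character-to-string map; unmapped characters are kept."""
    return "".join(trans_map.get(char, char) for char in string)


trans_map = {
    u"\u00e4": u"ae",  # a with diaeresis -> two-character string
    u"\u00df": u"ss",  # sharp s -> two-character string
    u"\u2019": u"",    # right single quotation mark -> deleted
}

result = transliterate(u"Stra\u00dfe\u2019s", trans_map)
print(result)
```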