Fixing problems and getting explanations
Ode to a Shipping Label
A poem about mojibake, whose original author might be Carlos Bueno on Facebook, shows a shipping label that serves as an excellent example for this section. The label is addressed to the surname L&AMP;AMP;ATILDE;&AMP;AMP;SUP3;PEZ.
We can use ftfy not only to fix the text that was on the label, but to show us what happened to it (like the poem does):
>>> from ftfy import fix_and_explain, apply_plan
>>> shipping_label = "L&AMP;AMP;ATILDE;&AMP;AMP;SUP3;PEZ"
>>> fixed, explanation = fix_and_explain(shipping_label)
>>> fixed
'LóPEZ'
>>> explanation
[('apply', 'unescape_html'),
 ('apply', 'unescape_html'),
 ('apply', 'unescape_html'),
 ('encode', 'latin-1'),
 ('decode', 'utf-8')]
The capitalization of the fixed text is inconsistent because the label still contained the encoding of a lowercase “ó”, even though everything was printed in capital letters.
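Reading the explanation backwards suggests how a name could end up in this state. The sketch below is one plausible reconstruction that uses only the Python standard library. The starting name "López" and the exact chain of systems involved are assumptions for illustration, not something the label itself records.

```python
import html
from html.entities import codepoint2name

original = "López"  # assumed original surname

# 1. UTF-8 bytes are misread as Latin-1, turning "ó" into "Ã³".
mojibake = original.encode("utf-8").decode("latin-1")

# 2. The non-ASCII characters are written out as named HTML entities.
as_entities = "".join(
    f"&{codepoint2name[ord(ch)]};" if ord(ch) > 127 else ch
    for ch in mojibake
)

# 3. The ampersands are HTML-escaped two more times.
escaped = html.escape(html.escape(as_entities, quote=False), quote=False)

# 4. The label printer capitalizes everything it can.
label = escaped.upper()

print(mojibake)     # LÃ³pez
print(as_entities)  # L&Atilde;&sup3;pez
print(label)        # L&AMP;AMP;ATILDE;&AMP;AMP;SUP3;PEZ
```

Capitalizing "Ã³" changes nothing, because "Ã" is already a capital letter and "³" has no case. The bytes therefore still spell a lowercase "ó" once the mojibake is undone, which gives "LóPEZ" rather than "LÓPEZ".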
The explanation returned by fix_and_explain() can sometimes be applied to different text with the same problem:
>>> label2 = "CARR&AMP;AMP;ATILDE;&AMP;AMP;COPY;"
>>> apply_plan(label2, explanation)
'CARRé'
Functions that fix text
The function that you’ll probably use most often is ftfy.fix_text(), which applies all the fixes it can to every line of text, and returns the fixed text.
ftfy.fix_and_explain() takes the same arguments as ftfy.fix_text(), but also returns an explanation, as shown in the first section.
Unlike ftfy.fix_text(), ftfy.fix_and_explain() doesn’t separate the text into lines that it fixes separately – because it’s looking for a unified explanation of what happened to the text, not a different one for each line.
A more targeted function is ftfy.fix_encoding_and_explain(), which only fixes problems that can be solved by encoding and decoding the text, not other problems such as HTML entities.
This function has a counterpart, ftfy.fix_encoding(), that returns just the fixed string without the explanation. It still fixes the string as a whole, not line by line.
The return type of the ..._and_explain functions is a kind of NamedTuple called ExplainedText.
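To show what a NamedTuple return value allows, here is a minimal stand-in. ExplainedTextSketch is a hypothetical class defined only for this example. It assumes the two fields are named text and explanation, in the order used by the tuple unpacking in the first section, and it is not ftfy's actual class definition.

```python
from typing import List, NamedTuple, Tuple


class ExplainedTextSketch(NamedTuple):
    """Hypothetical stand-in with the assumed (text, explanation) shape."""

    text: str
    explanation: List[Tuple[str, str]]


result = ExplainedTextSketch(
    text="LóPEZ",
    explanation=[("encode", "latin-1"), ("decode", "utf-8")],
)

# A NamedTuple supports both tuple unpacking and attribute access.
fixed, explanation = result
print(result.text)
for action, parameter in result.explanation:
    print(action, parameter)
```

Tuple unpacking is convenient when both values are needed, and attribute access reads better when only one of them is.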
These explanations can be re-applied to text using apply_plan().
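To make the idea of replaying a plan concrete, the following sketch interprets a list of (action, parameter) steps using only the standard library. The name replay_plan and the FIXERS table are hypothetical and are not ftfy's implementation. html.unescape stands in for ftfy's unescape_html fixer, so the sample text uses normally capitalized entity names, which html.unescape requires for names such as &Atilde;. The sketch handles only the three actions that appear in the plan shown earlier.

```python
import html
from typing import Callable, Dict, List, Tuple, Union

# Stand-in table of named text fixers (an assumption for this sketch).
FIXERS: Dict[str, Callable[[str], str]] = {"unescape_html": html.unescape}


def replay_plan(text: str, plan: List[Tuple[str, str]]) -> str:
    """Apply each step of a plan in order and return the resulting string."""
    obj: Union[str, bytes] = text
    for action, parameter in plan:
        if action == "encode" and isinstance(obj, str):
            obj = obj.encode(parameter)      # str -> bytes
        elif action == "decode" and isinstance(obj, bytes):
            obj = obj.decode(parameter)      # bytes -> str
        elif action == "apply" and isinstance(obj, str):
            obj = FIXERS[parameter](obj)     # str -> str
        else:
            raise ValueError(f"cannot perform step {(action, parameter)!r}")
    if not isinstance(obj, str):
        raise ValueError("plan did not end with a decoded string")
    return obj


plan = [
    ("apply", "unescape_html"),
    ("apply", "unescape_html"),
    ("apply", "unescape_html"),
    ("encode", "latin-1"),
    ("decode", "utf-8"),
]
print(replay_plan("CARR&amp;amp;Atilde;&amp;amp;copy;", plan))  # CARRé
```

Because the plan is plain data, it can be stored and reused on other text that was damaged in the same way. It will raise an error or produce wrong output on text that was damaged differently.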
Showing the characters in a string
A different kind of explanation you might need is simply a breakdown of what Unicode characters a string contains. For this, ftfy provides a utility function, ftfy.explain_unicode().
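For a rough idea of what such a breakdown contains, the sketch below builds a similar report with the standard library's unicodedata module. The describe_characters helper is hypothetical, and its line format is only an approximation, not the exact output of ftfy.explain_unicode().

```python
import unicodedata
from typing import List


def describe_characters(text: str) -> List[str]:
    """Return one line per character: code point, glyph, category, name."""
    lines = []
    for ch in text:
        name = unicodedata.name(ch, "<unknown>")
        category = unicodedata.category(ch)
        # Show a placeholder for control characters and other unprintables.
        glyph = ch if ch.isprintable() else "\N{REPLACEMENT CHARACTER}"
        lines.append(f"U+{ord(ch):04X}  {glyph}  [{category}] {name}")
    return lines


for line in describe_characters("LÃ³"):
    print(line)
# U+004C  L  [Lu] LATIN CAPITAL LETTER L
# U+00C3  Ã  [Lu] LATIN CAPITAL LETTER A WITH TILDE
# U+00B3  ³  [No] SUPERSCRIPT THREE
```

A listing like this makes mojibake easy to spot, because it shows that what looks like a single garbled letter is really two unrelated characters.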
A command-line utility that provides similar information, and even more detail, is lunasorcery’s utf8info.