KITCHEN(1) | kitchen | KITCHEN(1) |
kitchen - kitchen 1.2.6
We've all done it. In the process of writing a brand new application we've discovered that we need a little bit of code that we've invented before. Perhaps it's something to handle unicode text. Perhaps it's something to make a bit of python-2.5 code run on python-2.4. Whatever it is, it ends up being a tiny bit of code that seems too small to worry about pushing into its own module so it sits there, a part of your current project, waiting to be cut and pasted into your next project. And the next. And the next. And since that little bittybit of code proved so useful to you, it's highly likely that it proved useful to someone else as well. Useful enough that they've written it and copy and pasted it over and over into each of their new projects.
Well, no longer! Kitchen aims to pull these small snippets of code into a few python modules which you can import and use within your project. No more copy and paste! Now you can let someone else maintain and release these small snippets so that you can get on with your life.
This package forms the core of Kitchen. It contains some useful modules for using newer python standard library modules on older python versions, text manipulation, PEP 386 versioning, and initializing gettext. With this package we're trying to provide a few useful features that don't have too many dependencies outside of the python standard library. We'll be releasing other modules that drop into the kitchen namespace to add other features (possibly with larger deps) as time goes on.
We've tried to keep the core kitchen module's requirements lightweight. At the moment kitchen only requires
WARNING:
If found, these libraries will be used to make the implementation of some part of kitchen better in some way. If they are not present, the API that they enable will still exist but may function in a different manner.
These libraries implement commonly used functionality that everyone seems to invent. Rather than reinvent their wheel, I simply list the things that they do well for now. Perhaps if people can't find them normally, I'll add them as requirements in setup.py or link them into kitchen's namespace. For now, I just mention them here:
This python module is distributed under the terms of the GNU Lesser General Public License Version 2 or later.
NOTE:
Kitchen's functions won't automatically make you a better programmer. You have to learn when and how to use them as well. This section of the documentation is intended to show you some of the ways that you can apply kitchen's functions to problems that may have arisen in your life. The goal of this section is to give you enough information to understand what the kitchen API can do for you and where in the Kitchen API docs to look for something that can help you with your next issue. Along the way, you might pick up the knack for identifying issues with your code before you publish it. And that will make you a better coder.
In python-2.x, there are two types that deal with text.
NOTE:
One mistake that people encountering this issue for the first time make is confusing the unicode type and the encodings of unicode stored in the str type. In python, the unicode type stores an abstract sequence of code points. Each code point represents a grapheme. By contrast, byte str stores a sequence of bytes which can then be mapped to a sequence of code points. Each unicode encoding (UTF-8, UTF-7, UTF-16, UTF-32, etc) maps different sequences of bytes to the unicode code points.
What does that mean to you as a programmer? When you're dealing with text manipulations (finding the number of characters in a string or cutting a string on word boundaries) you should be dealing with unicode strings as they abstract characters in a manner that's appropriate for thinking of them as a sequence of letters that you will see on a page. When dealing with I/O, reading to and from the disk, printing to a terminal, sending something over a network link, etc, you should be dealing with byte str as those devices are going to need to deal with concrete implementations of what bytes represent your abstract characters.
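For instance (a minimal illustration; the string is arbitrary and the snippet assumes a UTF-8 encoded source file):

# -*- coding: utf-8 -*-
u_string = u'café'                    # unicode: four code points
b_string = u_string.encode('utf-8')   # byte str: five bytes in UTF-8

print len(u_string)   # 4 -- the number of characters a reader sees
print len(b_string)   # 5 -- the é takes two bytes in this encoding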
In the python2 world many APIs use these two classes interchangeably but there are several important APIs where only one or the other will do the right thing. When you give the wrong type of string to an API that wants the other type, you may end up with an exception being raised (UnicodeDecodeError or UnicodeEncodeError). However, these exceptions aren't always raised because python implicitly converts between types... sometimes.
Although converting when possible seems like the right thing to do, it's actually the first source of frustration. A programmer can test out their program with a string like: The quick brown fox jumped over the lazy dog and not encounter any issues. But when they release their software into the wild, someone enters the string: I sat down for coffee at the café and suddenly an exception is thrown. The reason? The mechanism that converts between the two types is only able to deal with ASCII characters. Once you throw non-ASCII characters into your strings, you have to start dealing with the conversion manually.
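A quick interactive session shows the difference (assuming a UTF-8 terminal; the strings are only examples):

>>> u'The quick brown fox' + ' jumped over the lazy dog'   # ASCII only: implicit conversion works
u'The quick brown fox jumped over the lazy dog'
>>> u'I sat down for coffee at the ' + 'café'              # non-ASCII: it fails
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)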
So, if I manually convert everything to either byte str or unicode strings, will I be okay? The answer is.... sometimes.
The problem you run into when converting everything to byte str or unicode strings is that you'll be using someone else's API quite often (this includes the APIs in the python standard library) and find that the API will only accept byte str or only accept unicode strings. Or worse, that the code will accept either when you're dealing with strings that consist solely of ASCII but throw an error when you give it a string that's got non-ASCII characters. When you encounter these APIs you first need to identify which type will work better and then you have to convert your values to the correct type for that code. Thus the programmer that wants to proactively fix all unicode errors in their code needs to do two things:
NOTE:
Alright, since the python community is moving to using unicode strings everywhere, we might as well convert everything to unicode strings and use that by default, right? Sounds good most of the time but there's at least one huge caveat to be aware of. Anytime you output text to the terminal or to a file, the text has to be converted into a byte str. Python will try to implicitly convert from unicode to byte str... but it will throw an exception if the bytes are non-ASCII:
>>> string = unicode(raw_input(), 'utf8')
café
>>> log = open('/var/tmp/debug.log', 'w')
>>> log.write(string)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)
Okay, this is simple enough to solve: Just convert to a byte str and we're all set:
>>> string = unicode(raw_input(), 'utf8')
café
>>> string_for_output = string.encode('utf8', 'replace')
>>> log = open('/var/tmp/debug.log', 'w')
>>> log.write(string_for_output)
>>>
So that was simple, right? Well... there's one gotcha that makes things a bit harder to debug sometimes. When you attempt to write non-ASCII unicode strings to a file-like object you get a traceback every time. But what happens when you use print()? The terminal is a file-like object so it should raise an exception right? The answer to that is.... sometimes:
$ python
>>> print u'café'
café
No exception. Okay, we're fine then?
We are until someone does one of the following:
$ LC_ALL=C python
>>> # Note: if you're using a good terminal program when running in the C locale
>>> # The terminal program will prevent you from entering non-ASCII characters
>>> # python will still recognize them if you use the codepoint instead:
>>> print u'caf\xe9'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)
$ cat test.py
#!/usr/bin/python -tt
# -*- coding: utf-8 -*-
print u'café'
$ ./test.py >t
Traceback (most recent call last):
  File "./test.py", line 4, in <module>
    print u'café'
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)
Okay, the locale thing is a pain but understandable: the C locale doesn't understand any characters outside of ASCII so naturally attempting to display those won't work. Now why does redirecting to a file cause problems? It's because print() in python2 is treated specially. Whereas the other file-like objects in python always convert to ASCII unless you set them up differently, using print() to output to the terminal will use the user's locale to convert before sending the output to the terminal. When print() is not outputting to the terminal (being redirected to a file, for instance), print() decides that it doesn't know what locale to use for that file and so it tries to convert to ASCII instead.
So what does this mean for you, as a programmer? Unless you have the luxury of controlling how your users use your code, you should always, always, always convert to a byte str before outputting strings to the terminal or to a file. Python even provides you with a facility to do just this. If you know that every unicode string you send to a particular file-like object (for instance, stdout) should be converted to a particular encoding you can use a codecs.StreamWriter object to convert from a unicode string into a byte str. In particular, codecs.getwriter() will return a StreamWriter class that will help you to wrap a file-like object for output. Using our print() example:
$ cat test.py
#!/usr/bin/python -tt
# -*- coding: utf-8 -*-
import codecs
import sys

UTF8Writer = codecs.getwriter('utf8')
sys.stdout = UTF8Writer(sys.stdout)
print u'café'
$ ./test.py >t
$ cat t
café
In English, there's a saying "waiting for the other shoe to drop". It means that when one event (usually bad) happens, you come to expect another event (usually worse) to come after. In this case we have two other shoes.
If you wrap sys.stdout using codecs.getwriter() and think you are now safe to print any variable without checking its type I am afraid I must inform you that you're not paying enough attention to Murphy's Law. The StreamWriter that codecs.getwriter() provides will take unicode strings and transform them into byte str before they get to sys.stdout. The problem is if you give it something that's already a byte str it tries to transform that as well. To do that it tries to turn the byte str you give it into unicode and then transform that back into a byte str... and since it uses the ASCII codec to perform those conversions, chances are that it'll blow up when making them:
>>> import codecs
>>> import sys
>>> UTF8Writer = codecs.getwriter('utf8')
>>> sys.stdout = UTF8Writer(sys.stdout)
>>> print 'café'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.6/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
To work around this, kitchen provides an alternate version of codecs.getwriter() that can deal with both byte str and unicode strings. Use kitchen.text.converters.getwriter() in place of the codecs version like this:
>>> import sys
>>> from kitchen.text.converters import getwriter
>>> UTF8Writer = getwriter('utf8')
>>> sys.stdout = UTF8Writer(sys.stdout)
>>> print u'café'
café
>>> print 'café'
café
Sometimes you do everything right in your code but other people's code fails you. With unicode issues this happens more often than we want. A glaring example of this is when you get values back from a function that aren't consistently unicode string or byte str.
An example from the python standard library is gettext. The gettext functions are used to help translate messages that you display to users in the users' native languages. Since most languages contain letters outside of the ASCII range, the values that are returned contain unicode characters. gettext provides you with ugettext() and ungettext() to return these translations as unicode strings and gettext(), ngettext(), lgettext(), and lngettext() to return them as encoded byte str. Unfortunately, even though they're documented to return only one type of string or the other, the implementation has corner cases where the wrong type can be returned.
This means that even if you separate your unicode string and byte str correctly before you pass your strings to a gettext function, afterwards, you might have to check that you have the right sort of string type again.
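One defensive sketch of that check (the 'myprogram' domain is only a placeholder) is to normalize whatever gettext hands back with kitchen's converters:

import gettext
from kitchen.text.converters import to_unicode

translations = gettext.translation('myprogram', fallback=True)
# ugettext() is documented to return unicode, but corner cases can hand
# back a byte str; normalizing the result makes the type predictable.
msg = to_unicode(translations.ugettext('Hello world'), 'utf-8')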
NOTE:
Now that we've identified the issues, can we define a comprehensive strategy for dealing with them?
If you get some piece of text from a library, read from a file, etc, turn it into a unicode string immediately. Since python is moving in the direction of unicode strings everywhere it's going to be easier to work with unicode strings within your code.
If your code is heavily involved with using things that are bytes, you can do the opposite and convert all text into byte str at the border and only convert to unicode when you need it for passing to another library or performing string operations on it.
In either case, the important thing is to pick a default type for strings and stick with it throughout your code. When you mix the types it becomes much easier to operate on a string with a function that can only use the other type by mistake.
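Here is a minimal sketch of the "convert at the border" approach, assuming UTF-8 files (the function names are made up for illustration):

from kitchen.text.converters import to_unicode, to_bytes

def read_report(filename):
    # Border crossing in: decode to a unicode string immediately
    return to_unicode(open(filename, 'rb').read(), 'utf-8')

def write_report(filename, u_text):
    # Border crossing out: encode back to a byte str just before writing
    open(filename, 'wb').write(to_bytes(u_text, 'utf-8'))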
NOTE:
Sometimes you're converting nearly all of your data to unicode strings but you have one or two values where you have to keep byte str around. This is often the case when you need to use the value verbatim with some external resource. For instance, filenames or key values in a database. When you do this, use a naming convention for the data you're working with so you (and others reading your code later) don't get confused about what's being stored in the value.
If you need both a textual string to present to the user and a byte value for an exact match, consider keeping both versions around. You can either use two variables for this or a dict whose key is the byte value.
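For example (a hypothetical filename; the b_ and u_ prefixes are simply the naming convention being described):

from kitchen.text.converters import to_unicode

# b_filename keeps the exact bytes needed to reopen the file;
# u_filename is only used when showing the name to the user.
b_filename = '/var/tmp/caf\xc3\xa9.txt'
u_filename = to_unicode(b_filename, 'utf-8')
filenames = {b_filename: u_filename}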
NOTE:
When you go to send your data back outside of your program (to the filesystem, over the network, displaying to the user, etc) turn the data back into a byte str. How you do this will depend on the expected output format of the data. For displaying to the user, you can use the user's default encoding using locale.getpreferredencoding(). For entering into a file, your best bet is to pick a single encoding and stick with it.
WARNING:
You can use kitchen.text.converters.getwriter() to do this automatically for sys.stdout. When creating exception messages be sure to convert to bytes manually.
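A small sketch of both halves of that advice (the message text is arbitrary):

import sys
from kitchen.text.converters import getwriter, to_bytes

# Unicode sent through print is converted automatically once stdout is wrapped
sys.stdout = getwriter('utf-8')(sys.stdout)
print u'caf\xe9'

# Exception messages bypass the wrapped stdout, so encode them yourself
raise ValueError(to_bytes(u'Could not parse caf\xe9 menu'))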
Unless you know that a specific portion of your code will only deal with ASCII, be sure to include non-ASCII values in your unittests. Including a few characters from several different scripts is highly advised as well because some code may have special cased accented roman characters but not know how to handle characters used in Asian alphabets.
Similarly, unless you know that that portion of your code will only be given unicode strings or only byte str be sure to try variables of both types in your unittests. When doing this, make sure that the variables are also non-ASCII as python's implicit conversion will mask problems with pure ASCII data. In many cases, it makes sense to check what happens if byte str and unicode strings that won't decode in the present locale are given.
Make sure that the libraries you use return only unicode strings or byte str. Unittests can help you spot issues here by running many variations of data through your functions and checking that you're still getting the types of string that you expect.
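A sketch of what such a test might look like (it exercises kitchen's own to_unicode() purely as an example target):

# -*- coding: utf-8 -*-
import unittest
from kitchen.text.converters import to_unicode

class NonAsciiTests(unittest.TestCase):
    def test_byte_str_and_unicode_input(self):
        # Use non-ASCII data and both string types so implicit ASCII
        # conversion cannot hide a bug.
        self.assertEqual(to_unicode('café'), u'café')
        self.assertEqual(to_unicode(u'café'), u'café')

if __name__ == '__main__':
    unittest.main()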
The kitchen library provides a wide array of functions to help you deal with byte str and unicode strings in your program. Here's a short example that uses many kitchen functions to do its work:
#!/usr/bin/python -tt
# -*- coding: utf-8 -*-

import locale
import os
import sys
import unicodedata

from kitchen.text.converters import getwriter, to_bytes, to_unicode
from kitchen.i18n import get_translation_object

if __name__ == '__main__':
# Setup gettext driven translations but use the kitchen functions so
# we don't have the mismatched bytes-unicode issues.
translations = get_translation_object('example')
# We use _() for marking strings that we operate on as unicode
# This is pretty much everything
_ = translations.ugettext
# And b_() for marking strings that we operate on as bytes.
# This is limited to exceptions
b_ = translations.lgettext
# Setup stdout
encoding = locale.getpreferredencoding()
Writer = getwriter(encoding)
sys.stdout = Writer(sys.stdout)
# Load data. Format is filename\0description
# description should be utf-8 but filename can be any legal filename
# on the filesystem
# Sample datafile.txt:
# /etc/shells\x00Shells available on caf\xc3\xa9.lan
# /var/tmp/file\xff\x00File with non-utf8 data in the filename
#
# And to create /var/tmp/file\xff (under bash or zsh) do:
# echo 'Some data' > /var/tmp/file$'\377'
datafile = open('datafile.txt', 'r')
data = {}
for line in datafile:
# We're going to keep filename as bytes because we will need the
# exact bytes to access files on a POSIX operating system.
# description, we'll immediately transform into unicode type.
b_filename, description = line.split('\0', 1)
# to_unicode defaults to decoding output from utf-8 and replacing
# any problematic bytes with the unicode replacement character
# We accept mangling of the description here knowing that our file
# format is supposed to use utf-8 in that field and that the
# description will only be displayed to the user, not used as
# a key value.
description = to_unicode(description, 'utf-8').strip()
data[b_filename] = description
datafile.close()
# We're going to add a pair of extra fields onto our data to show the
# length of the description and the filesize. We put those between
# the filename and description because we haven't checked that the
# description is free of NULLs.
datafile = open('newdatafile.txt', 'w')
# Name filename with a b_ prefix to denote byte string of unknown encoding
for b_filename in data:
# Since we have the byte representation of filename, we can read any
# filename
if os.access(b_filename, os.F_OK):
size = os.path.getsize(b_filename)
else:
size = 0
# Because the description is unicode type, we know the number of
# characters corresponds to the length of the normalized unicode
# string.
length = len(unicodedata.normalize('NFC', description))
# Print a summary to the screen
# Note that we do not let implicit type conversion from str to
# unicode transform b_filename into a unicode string. That might
# fail as python would use the ASCII filename. Instead we use
# to_unicode() to explicitly transform in a way that we know will
# not traceback.
print _(u'filename: %s') % to_unicode(b_filename)
print _(u'file size: %s') % size
print _(u'desc length: %s') % length
print _(u'description: %s') % data[b_filename]
# First combine the unicode portion
line = u'%s\0%s\0%s' % (size, length, data[b_filename])
# Since the filenames are bytes, turn everything else to bytes before combining
# Turning into unicode first would be wrong as the bytes in b_filename
# might not convert
b_line = '%s\0%s\n' % (b_filename, to_bytes(line))
# Just to demonstrate that getwriter will pass bytes through fine
print b_('Wrote: %s') % b_line
datafile.write(b_line)
datafile.close()
# And just to show how to properly deal with an exception.
# Note two things about this:
# 1) We use the b_() function to translate the string. This returns a
# byte string instead of a unicode string
# 2) We're using the b_() function returned by kitchen. If we had
# used the one from gettext we would need to convert the message to
# a byte str first
message = u'Demonstrate the proper way to raise exceptions. Sincerely, \u3068\u3057\u304a'
raise Exception(b_(message))
SEE ALSO:
APIs that deal with byte str and unicode strings are difficult to get right. Here are a few strategies with pros and cons of each.
In this strategy, you allow the user to enter either unicode strings or byte str but what you give back is always unicode. This strategy is easy for novice endusers to start using immediately as they will be able to feed either type of string into the function and get back a string that they can use in other places.
However, it does lead to the novice writing code that functions correctly when testing it with ASCII-only data but fails when given data that contains non-ASCII characters. Worse, if your API is not designed to be flexible, the consumer of your code won't be able to easily correct those problems once they find them.
Here's a good API that uses this strategy:
from kitchen.text.converters import to_unicode

def truncate(msg, max_length, encoding='utf8', errors='replace'):
    msg = to_unicode(msg, encoding, errors)
    return msg[:max_length]
The call to truncate() starts with the essential parameters for performing the task. It ends with two optional keyword arguments that define the encoding to use to transform from a byte str to unicode and the strategy to use if undecodable bytes are encountered. The defaults may vary depending on the use cases you have in mind. When the output is generally going to be printed for the user to see, errors='replace' is a good default. If you are constructing keys to a database, raising an exception (with errors='strict') may be a better default. In either case, having both parameters allows the person using your API to choose how they want to handle any problems. Having the values is also a clue to them that a conversion from byte str to unicode string is going to occur.
NOTE:
Evaluate your usages of the variables in question to see what makes sense.
Here's a bad example of using this strategy:
from kitchen.text.converters import to_unicode

def truncate(msg, max_length):
    msg = to_unicode(msg)
    return msg[:max_length]
In this example, we don't have the optional keyword arguments for encoding and errors. A user who uses this function is more likely to miss the fact that a conversion from byte str to unicode is going to occur. And once an error is reported, they will have to look through their backtrace and think harder about where they want to transform their data into unicode strings instead of having the opportunity to control how the conversion takes place in the function itself. Note that the user does have the ability to make this work by making the transformation to unicode themselves:
from kitchen.text.converters import to_unicode

msg = to_unicode(msg, encoding='euc_jp', errors='ignore')
new_msg = truncate(msg, 5)
This strategy is sometimes called polymorphic because the type of data that is returned is dependent on the type of data that is received. The concept is that when you are given a byte str to process, you return a byte str in your output. When you are given unicode strings to process, you return unicode strings in your output.
This can work well for end users as the ones that know about the difference between the two string types will already have transformed the strings to their desired type before giving it to this function. The ones that don't can remain blissfully ignorant (at least, as far as your function is concerned) as the function does not change the type.
In cases where the encoding of the byte str is known or can be discovered based on the input data this works well. If you can't figure out the input encoding, however, this strategy can fail in any of the following cases:
First, a couple examples of using this strategy in a good way:
def translate(msg, table):
    replacements = table.keys()
    new_msg = []
    for index, char in enumerate(msg):
        if char in replacements:
            new_msg.append(table[char])
        else:
            new_msg.append(char)
    return ''.join(new_msg)
In this example, all of the strings that we use (except the empty string which is okay because it doesn't have any characters to encode) come from outside of the function. Due to that, the user is responsible for making sure that the msg, and the keys and values in table all match in terms of type (unicode vs str) and encoding (You can do some error checking to make sure the user gave all the same type but you can't do the same for the user giving different encodings). You do not need to make changes to the string that require you to know the encoding or type of the string; everything is a simple replacement of one element in the array of characters in message with the character in table.
import json
from kitchen.text.converters import to_unicode, to_bytes

def first_field_from_json_data(json_string):
    '''Return the first field in a json data structure.

    The format of the json data is a simple list of strings.
    '["one", "two", "three"]'
    '''
    if isinstance(json_string, unicode):
        # On all python versions, json.loads() returns unicode if given
        # a unicode string
        return json.loads(json_string)[0]
    # Byte str: figure out which encoding we're dealing with
    if '\x00' not in json_string[:2]:
        encoding = 'utf8'
    elif '\x00\x00\x00' == json_string[:3]:
        encoding = 'utf-32-be'
    elif '\x00\x00\x00' == json_string[1:4]:
        encoding = 'utf-32-le'
    elif '\x00' == json_string[0] and '\x00' == json_string[2]:
        encoding = 'utf-16-be'
    else:
        encoding = 'utf-16-le'

    data = json.loads(unicode(json_string, encoding))
    return data[0].encode(encoding)
In this example the function takes either a byte str type or a unicode string that has a list in json format and returns the first field from it as the type of the input string. The first section of code is very straightforward; we receive a unicode string, parse it with a function, and then return the first field from our parsed data (which our function returned to us as json data).
The second portion that deals with byte str is not so straightforward. Before we can parse the string we have to determine what characters the bytes in the string map to. If we didn't do that, we wouldn't be able to properly find which characters are present in the string. In order to do that we have to figure out the encoding of the byte str. Luckily, the json specification states that all strings are unicode and encoded with one of UTF32be, UTF32le, UTF16be, UTF16le, or UTF-8. It further defines the format such that the first two characters are always ASCII. Each of these has a different sequence of NULLs when they encode an ASCII character. We can use that to detect which encoding was used to create the byte str.
Finally, we return the byte str by encoding the unicode back to a byte str.
As you can see, in this example we have to convert from byte str to unicode and back. But we know from the json specification that byte str has to be one of a limited number of encodings that we are able to detect. That ability makes this strategy work.
Now for some examples of using this strategy in ways that fail:
import unicodedata

def first_char(msg):
    '''Return the first character in a string'''
    if not isinstance(msg, unicode):
        try:
            msg = unicode(msg, 'utf8')
        except UnicodeError:
            msg = unicode(msg, 'latin1')
    msg = unicodedata.normalize('NFC', msg)
    return msg[0]
If you look at that code and think that there's something fragile and prone to breaking in the try: except: block you are correct in being suspicious. This code will fail on multi-byte character sets that aren't UTF-8. It can also fail on data where the sequence of bytes is valid UTF-8 but the bytes are actually of a different encoding. The reason this code fails is that we don't know what encoding the bytes are in and the code must convert from a byte str to a unicode string in order to function.
In order to make this code robust we must know the encoding of msg. The only way to know that is to ask the user so the API must do that:
import unicodedata

def number_of_chars(msg, encoding='utf8', errors='strict'):
    if not isinstance(msg, unicode):
        msg = unicode(msg, encoding, errors)
    msg = unicodedata.normalize('NFC', msg)
    return len(msg)
Another example of failure:
import os

def listdir(directory):
    files = os.listdir(directory)
    if isinstance(directory, str):
        return files
    # files could contain both bytes and unicode
    new_files = []
    for filename in files:
        if not isinstance(filename, unicode):
            # What to do here?
            continue
        new_files.append(filename)
    return new_files
This function illustrates the second failure mode. Here, not all of the possible values can be represented as unicode without knowing more about the encoding of each of the filenames involved. Since each filename could have a different encoding there's a few different options to pursue. We could make this function always return byte str since that can accurately represent anything that could be returned. If we want to return unicode we need to at least allow the user to specify what to do in case of an error decoding the bytes to unicode. We can also let the user specify the encoding to use for doing the decoding but that won't help in all cases since not all files will be in the same encoding (or even necessarily in any encoding):
import locale
import os

def listdir(directory, encoding=locale.getpreferredencoding(), errors='strict'):
    # Note: In python-3.1+, surrogateescape may be a better default
    files = os.listdir(directory)
    if isinstance(directory, str):
        return files
    new_files = []
    for filename in files:
        if not isinstance(filename, unicode):
            filename = unicode(filename, encoding=encoding, errors=errors)
        new_files.append(filename)
    return new_files
Note that although we use errors in this example as what to pass to the codec that decodes to unicode we could also have an errors argument that decides other things to do like skip a filename entirely, return a placeholder (Nondisplayable filename), or raise an exception.
This leaves us with one last failure to describe:
def first_field(csv_string):
    '''Return the first field in a comma separated values string.'''
    try:
        return csv_string[:csv_string.index(',')]
    except ValueError:
        return csv_string
This code looks simple enough. The hidden error here is that we are searching for a comma character in a byte str but not all encodings will use the same sequence of bytes to represent the comma. If you use an encoding that's not ASCII compatible on the byte level, then the literal comma ',' in the above code will match inappropriate bytes. Some examples of how it can fail:
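For instance (UTF-16 is just one convenient example of a non-ASCII compatible encoding):

>>> u','.encode('utf-8')           # ASCII compatible: the comma is the same single byte
','
>>> u','.encode('utf-16-le')       # not ASCII compatible: the comma is two bytes
',\x00'
>>> u'\u4e2c'.encode('utf-16-le')  # and an unrelated character contains the comma byte
',N'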
There are two ways to solve this. You can either take the encoding value from the user or you can take the separator value from the user. Of the two, taking the encoding is the better option for two reasons:
NOTE:
With that in mind, here's how to improve the API:
def first_field(csv_string, encoding='utf-8', errors='replace'):
    if not isinstance(csv_string, unicode):
        u_string = unicode(csv_string, encoding, errors)
        is_unicode = False
    else:
        u_string = csv_string
        is_unicode = True
    try:
        field = u_string[:u_string.index(u',')]
    except ValueError:
        return csv_string
    if not is_unicode:
        field = field.encode(encoding, errors)
    return field
NOTE:
def first_field(csv_string, encoding='utf-8'):
    try:
        return csv_string[:csv_string.index(','.encode(encoding))]
    except ValueError:
        return csv_string
Sometimes you want to be able to take either byte str or unicode strings, perform similar operations on either one and then return data in the same format as was given. Probably the easiest way to do that is to have separate functions for each and adopt a naming convention to show that one is for working with byte str and the other is for working with unicode strings:
def translate_b(msg, table):
    '''Replace values in str with other byte values like unicode.translate'''
    if not isinstance(msg, str):
        raise TypeError('msg must be of type str')
    str_table = [chr(s) for s in xrange(0, 256)]
    delete_chars = []
    for chr_val in (k for k in table.keys() if isinstance(k, int)):
        if chr_val > 255:
            raise ValueError('Keys in table must not exceed 255')
        if table[chr_val] is None:
            delete_chars.append(chr(chr_val))
        elif isinstance(table[chr_val], int):
            if table[chr_val] > 255:
                raise TypeError('table values cannot be more than 255 or less than 0')
            str_table[chr_val] = chr(table[chr_val])
        else:
            if not isinstance(table[chr_val], str):
                raise TypeError('character mapping must return integer, None or str')
            str_table[chr_val] = table[chr_val]
    str_table = ''.join(str_table)
    delete_chars = ''.join(delete_chars)
    return msg.translate(str_table, delete_chars)

def translate(msg, table):
    '''Replace values in a unicode string with other values'''
    if not isinstance(msg, unicode):
        raise TypeError('msg must be of type unicode')
    return msg.translate(table)
There are several things that we have to do in this API:
Not all functions have a return value. Sometimes a function is there to interact with something external to python, for instance, writing a file out to disk or a method exists to update the internal state of a data structure. One of the main questions with these APIs is whether to take byte str, unicode string, or both. The answer depends on your use case but I'll give some examples here.
When your information is going to an external data source like writing to a file you need to decide whether to take in unicode strings or byte str. Remember that most external data sources are not going to be dealing with unicode directly. Instead, they're going to be dealing with a sequence of bytes that may be interpreted as unicode. With that in mind, you either need to have the user give you a byte str or convert to a byte str inside the function.
Next you need to think about the type of data that you're receiving. If it's textual data, (for instance, this is a chat client and the user is typing messages that they expect to be read by another person) it probably makes sense to take in unicode strings and do the conversion inside your function. On the other hand, if this is a lower level function that's passing data into a network socket, it probably should be taking byte str instead.
Just as noted in the API notes above, you should specify an encoding and errors argument if you need to transform from unicode string to byte str and you are unable to guess the encoding from the data itself.
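A sketch of that pattern (save_message and its defaults are invented for illustration):

from kitchen.text.converters import to_bytes

def save_message(filename, msg, encoding='utf-8', errors='replace'):
    # msg is expected to be a unicode string; it is converted to a byte
    # str at the border using the caller's encoding and error strategy.
    f = open(filename, 'wb')
    try:
        f.write(to_bytes(msg, encoding, errors))
    finally:
        f.close()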
Sometimes your API is just going to update a data structure and not immediately output that data anywhere. Just as when writing external data, you should think about both what your function is going to do with the data eventually and what the caller of your function is thinking that they're giving you. Most of the time, you'll want to take unicode strings and enter them into the data structure as unicode when the data is textual in nature. You'll want to take byte str and enter them into the data structure as byte str when the data is not text. Use a naming convention so the user knows what's expected.
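A small sketch of such a data structure (the class and the prefixes are hypothetical, following the naming convention above):

from kitchen.text.converters import to_unicode

class FileCatalog(object):
    def __init__(self):
        self.entries = {}

    def add(self, b_filename, description):
        # The filename stays a byte str because it is not really text;
        # the human readable description is stored as unicode.
        self.entries[b_filename] = to_unicode(description, 'utf-8')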
There are a few APIs that are just wrong. If you catch yourself making an API that does one of these things, change it before anyone sees your code.
This type of API usually deals with byte str at some point and converts it to unicode because it's usually thought to be text. However, there are times when the bytes fail to convert to a unicode string. When that happens, this API returns the raw byte str instead of a unicode string. One example of this is present in the python standard library: python2's os.listdir():
>>> import os
>>> import locale
>>> locale.getpreferredencoding()
'UTF-8'
>>> os.mkdir('/tmp/mine')
>>> os.chdir('/tmp/mine')
>>> open('nonsense_char_\xff', 'w').close()
>>> open('all_ascii', 'w').close()
>>> os.listdir(u'.')
[u'all_ascii', 'nonsense_char_\xff']
The problem with APIs like this is that they cause failures that are hard to debug because they don't happen where the variables are set. For instance, let's say you take the filenames from os.listdir() and give it to this function:
def normalize_filename(filename):
    '''Change spaces and dashes into underscores'''
    return filename.translate({ord(u' '): u'_', ord(u'-'): u'_'})
When you test this, you use filenames that all are decodable in your preferred encoding and everything seems to work. But when this code is run on a machine that has filenames in multiple encodings the filenames returned by os.listdir() suddenly include byte str. And byte str has a different string.translate() function that takes different values. So the code raises an exception where it's not immediately obvious that os.listdir() is at fault.
An early version of python3 attempted to fix the os.listdir() problem pointed out in the last section by returning all values that were decodable to unicode and omitting the filenames that were not. This led to the following output:
>>> import os
>>> import locale
>>> locale.getpreferredencoding()
'UTF-8'
>>> os.mkdir('/tmp/mine')
>>> os.chdir('/tmp/mine')
>>> open(b'nonsense_char_\xff', 'w').close()
>>> open('all_ascii', 'w').close()
>>> os.listdir('.')
['all_ascii']
The issue with this type of code is that it is silently doing something surprising. The caller expects to get a full list of files back from os.listdir(). Instead, it silently ignores some of the files, returning only a subset. This leads to code that doesn't do what is expected, a problem that may go unnoticed until the code is in production and someone notices that something important is being missed.
Believe it or not, a few libraries exist that make it impossible to deal with unicode text without raising a UnicodeError. What seems to occur in these libraries is that the library has functions that expect to receive a unicode string. However, internally, those functions call other functions that expect to receive a byte str. The programmer of the API was smart enough to convert from a unicode string to a byte str but they did not give the user the chance to specify the encodings to use or how to deal with errors. This results in exceptions when the user passes in a byte str because the initial function wants a unicode string and exceptions when the user passes in a unicode string because the function can't convert the string to bytes in the encoding that it's selected.
Do not put the user in the position of not being able to use your API without raising a UnicodeError with certain values. If you can only safely take unicode strings, document that byte str is not allowed and vice versa. If you have to convert internally, make sure to give the caller of your function parameters to control the encoding and how to treat errors that may occur during the encoding/decoding process. If your code will raise a UnicodeError with non-ASCII values no matter what, you should probably rethink your API.
If you've read all the way down to this section without skipping you've seen several admonitions about the type of data you are processing affecting the viability of the various API choices.
Here are a few things to consider in your data:
Much of the data in libraries, programs, and the general environment outside of python is written where strings are sequences of bytes. So when we interact with data that comes from outside of python or data that is about to leave python it may make sense to only operate on the data as a byte str. There are two times when this may make sense:
Even when your code is operating in this area you still need to think a little more about your data. For instance, it might make sense for the person using your API to pass in unicode strings and let the function convert that into the byte str that it then sends over the wire.
There are also times when it might make sense to operate only on unicode strings. unicode represents text so anytime that you are working on textual data that isn't going to leave python it has the potential to be a unicode-only API. However, there's two things that you should consider when designing a unicode-only API:
NOTE:
If you determine that you have to deal with byte str you should realize that not all encodings are created equal. Each has different properties that may make it possible to provide a simpler API provided that you can reasonably tell the users of your API that they cannot use certain classes of encodings.
As one example, if you are required to find a comma (,) in a byte str you have different choices based on what encodings are allowed. If you can reasonably restrict your API users to only giving ASCII compatible encodings you can do this simply by searching for the literal comma character because that character will be represented by the same byte sequence in all ASCII compatible encodings.
The following are some classes of encodings to be aware of as you decide how generic your code needs to be.
Single byte encodings can only represent 256 total characters. They encode the code points for a character to the equivalent number in a single byte.
Most single byte encodings are ASCII compatible. ASCII compatible encodings are the most likely to be usable without changes to code so this is good news. A notable exception to this is the EBCDIC family of encodings.
Multibyte encodings use more than one byte to encode some characters.
Fixed width encodings have a set number of bytes to represent all of the characters in the character set. UTF-32 is an example of a fixed width encoding that uses four bytes per character and can express every unicode character. There are a number of problems with writing APIs that need to operate on fixed width, multibyte characters. To go back to our earlier example of finding a comma in a string, we have to realize that even in UTF-32 where the code point for ASCII characters is the same as in ASCII, the byte sequence for them is different. So you cannot search for the literal byte character as it may pick up false positives and may break a byte sequence in an odd place.
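A short demonstration of why the single byte search breaks (UTF-32-BE is used here only as an example):

>>> b = u'a,b'.encode('utf-32-be')
>>> b
'\x00\x00\x00a\x00\x00\x00,\x00\x00\x00b'
>>> b.find(',')   # a byte offset in the middle of a character, not a character index
7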
UTF-8 and the EUC family of encodings are examples of ASCII compatible multi-byte encodings. They achieve this by adhering to two principles:
Some multibyte encodings work by using only bytes from the ASCII encoding but when a particular sequence of those bytes is found, they are interpreted as meaning something other than their ASCII values. UTF-7 is one such encoding that can encode all of the unicode code points. For instance, here are some Japanese characters encoded as UTF-7:
>>> a = u'\u304f\u3089\u3068\u307f'
>>> print a
くらとみ
>>> print a.encode('utf-7')
+ME8wiTBoMH8-
These encodings can be used when you need to encode unicode data that may contain non-ASCII characters for inclusion in an ASCII only transport medium or file.
However, they are not ASCII compatible in the sense that we used earlier as the bytes that represent an ASCII character are being reused as part of other characters. If you were to search for a literal plus sign in this encoded string, you would run across many false positives, for instance.
There are many other popular variable width encodings, for instance UTF-16 and shift-JIS. Many of these are not ASCII compatible so you cannot search for a literal ASCII character without danger of false positives or false negatives.
Kitchen is structured as a collection of modules. In its current configuration, Kitchen ships with the following modules. Other addon modules that may drag in more dependencies can be found on the project webpage
I18N is an important piece of any modern program. Unfortunately, setting up i18n in your program is often a confusing process. The functions provided here aim to make the programming side of that a little easier.
Most projects will be able to do something like this when they startup:
# myprogram/__init__.py:
import os
import sys

from kitchen.i18n import easy_gettext_setup

_, N_ = easy_gettext_setup('myprogram', localedirs=(
    os.path.join(os.path.realpath(os.path.dirname(__file__)), 'locale'),
    os.path.join(sys.prefix, 'lib', 'locale')
))
Then, in other files that have strings that need translating:
# myprogram/commands.py:
from myprogram import _, N_

def print_usage():
    print _(u"""available commands are:
    --help              Display help
    --version           Display version of this program
    --bake-me-a-cake    as fast as you can
    """)

def print_invitations(age):
    print _('Please come to my party.')
    print N_('I will be turning %(age)s year old',
             'I will be turning %(age)s years old', age) % {'age': age}
See the documentation of easy_gettext_setup() and get_translation_object() for more details.
SEE ALSO:
easy_gettext_setup() should satisfy the needs of most users. get_translation_object() is designed to ease the way for anyone that needs more control.
Setting up gettext can be a little tricky because of lack of documentation. This function will setup gettext using the Class-based API for you. For the simple case, you can use the default arguments and call it like this:
_, N_ = easy_gettext_setup()
This will get you two functions, _() and N_() that you can use to mark strings in your code for translation. _() is used to mark strings that don't need to worry about plural forms no matter what the value of the variable is. N_() is used to mark strings that do need to have a different form if a variable in the string is plural.
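For example (the 'myprogram' domain and the file count are placeholders):

from kitchen.i18n import easy_gettext_setup

_, N_ = easy_gettext_setup('myprogram')

n_files = 3
print _(u'Copying files')
print N_(u'%(n)d file copied', u'%(n)d files copied', n_files) % {'n': n_files}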
Changed in version kitchen-0.2.4 (API kitchen.i18n 2.0.0): Changed easy_gettext_setup() to return the lgettext functions instead of gettext functions when use_unicode=False.
Iterator of language codes to check for message catalogs. If unspecified, the user's locale settings will be used.
SEE ALSO:
If you need more flexibility than easy_gettext_setup(), use this function. It sets up a gettext Translation object and returns it to you. Then you can access any of the methods of the object that you need directly. For instance, if you specifically need to access lgettext():
translations = get_translation_object('foo')
translations.lgettext('My Message')
This function is similar to the python standard library gettext.translation() but makes it better in two ways
The latter is important when setting up gettext in a portable manner. There is not a common directory for translations across operating systems so one needs to look in multiple directories for the translations. get_translation_object() is able to handle that if you give it a list of directories to search for catalogs:
translations = get_translation_object('foo', localedirs=(
os.path.join(os.path.realpath(os.path.dirname(__file__)), 'locale'),
os.path.join(sys.prefix, 'lib', 'locale')))
This will search for several different directories:
This allows gettext to work on Windows and in development (where the message catalogs are typically in the toplevel module directory) and also when installed under Linux (where the message catalogs are installed in /usr/share/locale). You (or the system packager) just need to install the message catalogs in /usr/share/locale and remove the locale directory from the module to make this work. ie:
In development:
~/foo # Toplevel module directory
~/foo/__init__.py
~/foo/locale # With message catalogs below here:
~/foo/locale/es/LC_MESSAGES/foo.mo

Installed on Linux:
/usr/lib/python2.7/site-packages/foo
/usr/lib/python2.7/site-packages/foo/__init__.py
/usr/share/locale/ # With message catalogs below here:
/usr/share/locale/es/LC_MESSAGES/foo.mo
Changed in version kitchen-1.1.0 (API kitchen.i18n 2.1.0): Added more parameters to get_translation_object() so it can more easily be used as a replacement for gettext.translation(). Also changed the way localedirs is used: we now cycle through them until we find a suitable locale file rather than simply cycling through until we find a directory that exists. The new code is based heavily on the python standard library gettext.translation() function.
Changed in version kitchen-1.2.0 (API kitchen.i18n 2.2.0): Added the python2_api parameter.
The standard translation objects from the gettext module suffer from several problems:
DummyTranslations and NewGNUTranslations were written to fix these issues.
This Translations class doesn't translate the strings and is intended to be used as a fallback when there were errors setting up a real Translations object. It's safer than gettext.NullTranslations in its handling of byte str vs unicode strings.
Unlike NullTranslations, this Translation class will never throw a UnicodeError. The code that you have around a call to DummyTranslations might throw a UnicodeError but at least that will be in code you control and can fix. Also, unlike NullTranslations, all of this Translation object's methods guarantee to return byte str except for ugettext() and ungettext(), which guarantee to return unicode strings.
When byte str are returned, the strings will be encoded according to this algorithm:
For ugettext() and ungettext(), we go through the same set of steps with the following differences:
Any characters that aren't able to be transformed from a byte str to unicode string or vice versa will be replaced with a replacement character (ie: u'�' in unicode based encodings, '?' in other ASCII compatible encodings).
Changed in version kitchen-1.1.0 (API kitchen.i18n 2.1.0):
* Although we had adapted gettext(), ngettext(), lgettext(), and lngettext() to always return byte str, we hadn't forced those byte str to always be in a specified charset. We now make sure that gettext() and ngettext() return byte str encoded using output_charset if set, otherwise charset, and if neither of those, UTF-8. With lgettext() and lngettext(), output_charset if set, otherwise locale.getpreferredencoding().
* Make setting input_charset and output_charset also set those attributes on any fallback translation objects.
Changed in version kitchen-1.2.0 (API kitchen.i18n 2.2.0): Added the python2_api parameter to __init__().
This serves two purposes. The normal gettext.NullTranslations.set_output_charset() does not set the output charset on fallback objects. On python-2.3, gettext.NullTranslations objects don't contain this method.
gettext.GNUTranslations suffers from two problems that this class fixes.
When byte str are returned, the strings will be encoded according to this algorithm:
For ugettext() and ungettext(), we go through the same set of steps with the following differences:
Any characters that aren't able to be transformed from a byte str to unicode string or vice versa will be replaced with a replacement character (ie: u'�' in unicode based encodings, '?' in other ASCII compatible encodings).
Changed in version kitchen-1.1.0 (API kitchen.i18n 2.1.0): Although we had adapted gettext(), ngettext(), lgettext(), and lngettext() to always return byte str, we hadn't forced those byte str to always be in a specified charset. We now make sure that gettext() and ngettext() return byte str encoded using output_charset if set, otherwise charset, and if neither of those, UTF-8. With lgettext() and lngettext(), output_charset if set, otherwise locale.getpreferredencoding().
The kitchen.text module contains functions that deal with text manipulation.
Functions to handle conversion of byte str and unicode strings.
Changed in version kitchen-0.2a2 (API kitchen.text 2.0.0): Added getwriter().
Changed in version kitchen-0.2.2 (API kitchen.text 2.1.0): Added exception_to_unicode(), exception_to_bytes(), EXCEPTION_CONVERTERS, and BYTE_EXCEPTION_CONVERTERS.
Changed in version kitchen-1.0.1 (API kitchen.text 2.1.1): Deprecated BYTE_EXCEPTION_CONVERTERS as we've simplified exception_to_unicode() and exception_to_bytes() to make it unnecessary.
Python2 has two string types, str and unicode. unicode represents an abstract sequence of text characters. It can hold any character that is present in the unicode standard. str can hold any byte of data. The operating system and python work together to display these bytes as characters in many cases but you should always keep in mind that the information is really a sequence of bytes, not a sequence of characters. In python2 these types are interchangeable a large amount of the time. They are one of the few pairs of types that automatically convert when used in equality:
>>> # string is converted to unicode and then compared
>>> "I am a string" == u"I am a string"
True
>>> # Other types, like int, don't have this special treatment
>>> 5 == "5"
False
However, this automatic conversion tends to lull people into a false sense of security. As long as you're dealing with ASCII characters the automatic conversion will save you from seeing any differences. Once you start using characters that are not in ASCII, you will start getting UnicodeError and UnicodeWarning as the automatic conversions between the types fail:
>>> "I am an ñ" == u"I am an ñ" __main__:1: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal False
Why do these conversions fail? The reason is that the python2 unicode type represents an abstract sequence of unicode text known as code points. str, on the other hand, really represents a sequence of bytes. Those bytes are converted by your operating system to appear as characters on your screen using a particular encoding (usually with a default defined by the operating system and customizable by the individual user.) Although ASCII characters are fairly standard in what bytes represent each character, the bytes outside of the ASCII range are not. In general, each encoding will map a different character to a particular byte. Newer encodings map individual characters to multiple bytes (which the older encodings will instead treat as multiple characters). In the face of these differences, python refuses to guess at an encoding and instead issues a warning or exception and refuses to convert.
SEE ALSO:
So what is the best method of dealing with this weltering babble of incoherent encodings? The basic strategy is to explicitly turn everything into unicode when it first enters your program. Then, when you send it to output, you can transform the unicode back into bytes. Doing this allows you to control the encodings that are used and avoid getting tracebacks due to UnicodeError. Using the functions defined in this module, that looks something like this:
>>> from kitchen.text.converters import to_unicode, to_bytes
>>> name = raw_input('Enter your name: ')
Enter your name: Toshio くらとみ
>>> name
'Toshio \xe3\x81\x8f\xe3\x82\x89\xe3\x81\xa8\xe3\x81\xbf'
>>> type(name)
<type 'str'>
>>> unicode_name = to_unicode(name)
>>> type(unicode_name)
<type 'unicode'>
>>> unicode_name
u'Toshio \u304f\u3089\u3068\u307f'
>>> # Do a lot of other things before needing to save/output again:
>>> output = open('datafile', 'w')
>>> output.write(to_bytes(u'Name: %s\n' % unicode_name))
A few notes:
Looking at line 6, you'll notice that the input we took from the user was a byte str. In general, anytime we're getting a value from outside of python (The filesystem, reading data from the network, interacting with an external command, reading values from the environment) we are interacting with something that will want to give us a byte str. Some python standard library modules and third party libraries will automatically attempt to convert a byte str to unicode strings for you. This is both a boon and a curse. If the library can guess correctly about the encoding that the data is in, it will return unicode objects to you without you having to convert. However, if it can't guess correctly, you may end up with one of several problems:
On line 8, we convert from a byte str to a unicode string. to_unicode() does this for us. It has some error handling and sane defaults that make this a nicer function to use than calling str.decode() directly:
All three of these can be overridden using different keyword arguments to the function. See the to_unicode() documentation for more information.
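For instance (the byte values here are only illustrative):

from kitchen.text.converters import to_unicode

print to_unicode('caf\xe9', encoding='latin-1')   # pick a different codec
try:
    to_unicode('caf\xff', errors='strict')        # raise instead of replacing
except UnicodeDecodeError:
    pass
print to_unicode(Exception(u'boom'), nonstring='simplerepr')   # control non-string handling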
On line 15 we push the data back out to a file. Two things you should note here:
The default strategy of decoding to unicode strings when you take data in and encoding to a byte str when you send the data back out works great for most problems but there are a few times when you shouldn't:
In each of these instances, there is a reason to keep around the byte str version of a value. Here are a few hints to keep your sanity in these situations:
try:
    b_input = to_bytes(input_should_be_bytes_already, errors='strict', nonstring='strict')
except:
    handle_errors_somehow()
The reason is that the default of to_bytes() will take characters that are illegal in the chosen encoding and transform them to replacement characters. Since the point of keeping this data as a byte str is to keep the exact same bytes when you send it outside of your code, changing things to replacement characters should be raising red flags that something is wrong. Setting errors to strict will raise an exception which gives you an opportunity to fail gracefully.
print to_bytes(_('Username: %(user)s'), 'utf-8') % {'user': b_username}
Even when you have a good conceptual understanding of how python2 treats unicode and str there are still some things that can surprise you. In most cases this is because, as noted earlier, python or one of the python libraries you depend on is trying to convert a value automatically and failing. Explicit conversion at the appropriate place usually solves that.
One common idiom for getting a simple, string representation of an object is to use:
str(obj)
Unfortunately, this is not safe. Sometimes str(obj) will return unicode. Sometimes it will return a byte str. Sometimes, it will attempt to convert from a unicode string to a byte str, fail, and throw a UnicodeError. To be safe from all of these, first decide whether you need unicode or str to be returned. Then use to_unicode() or to_bytes() to get the simple representation like this:
u_representation = to_unicode(obj, nonstring='simplerepr')
b_representation = to_bytes(obj, nonstring='simplerepr')
python has a builtin print() statement that outputs strings to the terminal. This originated in a time when python only dealt with byte str. When unicode strings came about, some enhancements were made to the print() statement so that it could print those as well. The enhancements make print() work most of the time. However, the times when it doesn't work tend to make for cryptic debugging.
The basic issue is that print() has to figure out what encoding to use when it prints a unicode string to the terminal. When python is attached to your terminal (ie, you're running the interpreter or running a script that prints to the screen) python is able to take the encoding value from your locale settings LC_ALL or LC_CTYPE and print the characters allowed by that encoding. On most modern Unix systems, the encoding is utf-8 which means that you can print any unicode character without problem.
There are two common cases of things going wrong:
$ LC_ALL=C python
>>> print u'\ufffd'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 0: ordinal not in range(128)
This often happens when a script that you've written and debugged from the terminal is run from an automated environment like cron. It also occurs when you have written a script using a utf-8 aware locale and released it for consumption by people all over the internet. Inevitably, someone is running with a locale that can't handle all unicode characters and you get a traceback reported.
#! /usr/bin/python -tt
print u'\ufffd'
And then look at the difference between running it normally and redirecting to a file:
$ ./test.py
�
$ ./test.py > t
Traceback (most recent call last):
  File "test.py", line 3, in <module>
    print u'\ufffd'
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 0: ordinal not in range(128)
The short answer to dealing with this is to always use bytes when writing output. You can do this by explicitly converting to bytes like this:
from kitchen.text.converters import to_bytes
u_string = u'\ufffd'
print to_bytes(u_string)
or you can wrap stdout and stderr with a StreamWriter. A StreamWriter is convenient in that you can assign it to sys.stdout or sys.stderr and have output converted automatically, but it has the drawback that it can still throw a UnicodeError if the writer can't encode all possible unicode codepoints. Kitchen provides an alternate version, retrieved with kitchen.text.converters.getwriter(), which will not traceback in its standard configuration.
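For example, a minimal sketch of the wrapping approach (the choice of 'utf-8' here is just an assumption; use whatever encoding your output actually needs):

import sys
from kitchen.text.converters import getwriter

# Wrap stdout so unicode strings are encoded on their way out
UTF8Writer = getwriter('utf-8')
sys.stdout = UTF8Writer(sys.stdout)
print u'café'   # encoded by the wrapper instead of raising UnicodeError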
The hash() of the ASCII characters is the same for unicode and byte str. When you use them in dict keys, they evaluate to the same dictionary slot:
>>> u_string = u'a'
>>> b_string = 'a'
>>> hash(u_string), hash(b_string)
(12416037344, 12416037344)
>>> d = {}
>>> d[u_string] = 'unicode'
>>> d[b_string] = 'bytes'
>>> d
{u'a': 'bytes'}
When you deal with key values outside of ASCII, unicode and byte str evaluate unequally no matter what their character content or hash value:
>>> u_string = u'ñ'
>>> b_string = u_string.encode('utf-8')
>>> print u_string
ñ
>>> print b_string
ñ
>>> d = {}
>>> d[u_string] = 'unicode'
>>> d[b_string] = 'bytes'
>>> d
{u'\xf1': 'unicode', '\xc3\xb1': 'bytes'}
>>> b_string2 = '\xf1'
>>> hash(u_string), hash(b_string2)
(30848092528, 30848092528)
>>> d = {}
>>> d[u_string] = 'unicode'
>>> d[b_string2] = 'bytes'
>>> d
{u'\xf1': 'unicode', '\xf1': 'bytes'}
How do you work with this one? Remember rule #1: Keep your unicode and byte str values separate. That goes for keys in a dictionary just like anything else.
>>> from kitchen.text.converters import to_unicode
>>> u_string = u'one'
>>> b_string = 'two'
>>> d = {}
>>> d[to_unicode(u_string)] = 1
>>> d[to_unicode(b_string)] = 2
>>> d
{u'two': 2, u'one': 1}
How to treat nonstring values. Possible values are:
Default is simplerepr
Usually this should be used on a byte str but it can take either a byte str or a unicode string intelligently. Nonstring objects are handled in different ways depending on the setting of the nonstring parameter.
The default values of this function are set so as to always return a unicode string and never raise an error when converting from a byte str to unicode. However, when you do not pass validly encoded text (or a nonstring object), you may end up with output that you don't expect. Be sure you understand the requirements of your data; don't just ignore errors by passing it through this function.
Changed in version 0.2.1a2: Deprecated non_string in favor of nonstring parameter and changed default value to simplerepr
If errors are found while encoding, perform this action. Defaults to replace which replaces the invalid bytes with a character that means the bytes were unable to be encoded. Other values are the same as the error handling schemes in the codec base classes. For instance strict which raises an exception and ignore which simply omits the non-encodable characters.
How to treat nonstring values. Possible values are:
Default is simplerepr.
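As a rough illustration of how the errors and nonstring parameters interact (a sketch; the exact replacement output depends on the codec):

>>> from kitchen.text.converters import to_bytes
>>> to_bytes(u'café', encoding='ascii')                   # errors='replace' is the default
'caf?'
>>> to_bytes(u'café', encoding='ascii', errors='ignore')
'caf'
>>> to_bytes(5)                                           # nonstring='simplerepr' is the default
'5'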
WARNING:
to_bytes(to_unicode(text), encoding='utf-8')
The initial to_unicode() call will ensure text is a unicode string. Then, to_bytes() will turn that into a byte str with the specified encoding.
Usually, this should be used on a unicode string but it can take either a byte str or a unicode string intelligently. Nonstring objects are handled in different ways depending on the setting of the nonstring parameter.
The default values of this function are set so as to always return a byte str and never raise an error when converting from unicode to bytes. However, when you do not pass an encoding that can validly encode the object (or a non-string object), you may end up with output that you don't expect. Be sure you understand the requirements of your data; don't just ignore errors by passing it through this function.
Changed in version 0.2.1a2: Deprecated non_string in favor of nonstring parameter and changed default value to simplerepr
This is a reimplementation of codecs.getwriter() that returns a StreamWriter that resists issuing tracebacks. The StreamWriter that is returned uses kitchen.text.converters.to_bytes() to convert unicode strings into byte str. The departures from codecs.getwriter() are:
Example usage:
$ LC_ALL=C python
>>> import sys
>>> from kitchen.text.converters import getwriter
>>> UTF8Writer = getwriter('utf-8')
>>> unwrapped_stdout = sys.stdout
>>> sys.stdout = UTF8Writer(unwrapped_stdout)
>>> print 'caf\xc3\xa9'
café
>>> print u'caf\xe9'
café
>>> ASCIIWriter = getwriter('ascii')
>>> sys.stdout = ASCIIWriter(unwrapped_stdout)
>>> print 'caf\xc3\xa9'
café
>>> print u'caf\xe9'
caf?
SEE ALSO:
New in version kitchen: 0.2a2, API: kitchen.text 1.1.0
This function converts something to a byte str if it isn't one. It's used to call str() or unicode() on the object to get its simple representation without danger of getting a UnicodeError. You should be using to_unicode() or to_bytes() explicitly instead.
If you need unicode strings:
to_unicode(obj, nonstring='simplerepr')
If you need byte strings:
to_bytes(obj, nonstring='simplerepr')
Convert a unicode string to an encoded utf-8 byte str. You should be using to_bytes() instead:
to_bytes(obj, encoding='utf-8', non_string='passthru')
Control characters are not allowed in XML documents. When we encounter those we need to know what to do. Valid options are:
XML files consist mainly of text encoded using a particular charset. XML also disallows certain bytes in the encoded text (example: ASCII Null). There are also special characters that must be escaped if they are present in the input (example: <). This function takes care of all of those issues for you.
There are a few different ways to use this function depending on your needs. The simplest invocation is like this:
unicode_to_xml(u'String with non-ASCII characters: <"á と">')
This will return the following to you, encoded in utf-8:
'String with non-ASCII characters: &lt;"á と"&gt;'
Pretty straightforward. Now, what if you need to encode your document in something other than utf-8? For instance, latin-1? Let's see:
unicode_to_xml(u'String with non-ASCII characters: <"á と">', encoding='latin-1')
'String with non-ASCII characters: &lt;"á &#12392;"&gt;'
Because the と character is not available in the latin-1 charset, it is replaced with &#12392; in our output. This is an xml character reference which represents the character at unicode codepoint 12392, the と character.
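You can verify the codepoint from the python prompt; unichr(12392) is that same character:

>>> print unichr(12392)
と
>>> unichr(12392)
u'\u3068'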
When you want to reverse this, use xml_to_unicode(), which will turn a byte str into a unicode string and replace the xml character references with the unicode characters.
XML also has the quirk of not allowing control characters in its output. The control_chars parameter allows us to specify what to do with those. For use cases that don't need absolute character by character fidelity (example: holding strings that will just be used for display in a GUI app later), the default value of replace works well:
unicode_to_xml(u'String with disallowed control chars: \u0000\u0007')
'String with disallowed control chars: ??'
If you do need to be able to reproduce all of the characters at a later date (examples: if the string is a key value in a database or a path on a filesystem) you have many choices. Here are a few that rely on utf-7, a verbose encoding that encodes control characters (as well as non-ASCII unicode values) to characters from within the ASCII printable characters. The good thing about doing this is that the code is pretty simple. You just need to use utf-7 both when encoding the field for xml and when decoding it for use in your python program:
unicode_to_xml(u'String with unicode: と and control char: \u0007', encoding='utf7')
'String with unicode: +MGg and control char: +AAc-'
# [...]
xml_to_unicode('String with unicode: +MGg and control char: +AAc-', encoding='utf7')
u'String with unicode: と and control char: \u0007'
As you can see, the utf-7 encoding will transform even characters that would be representable in utf-8. This can be a drawback if you want unicode characters in the file to be readable without being decoded first. You can work around this with increased complexity in your application code:
encoding = 'utf-8'
u_string = u'String with unicode: と and control char: \u0007'
try:
    # First attempt to encode to utf8
    data = unicode_to_xml(u_string, encoding=encoding, errors='strict')
except XmlEncodeError:
    # Fallback to utf-7
    encoding = 'utf-7'
    data = unicode_to_xml(u_string, encoding=encoding, errors='strict')
write_tag('<mytag encoding=%s>%s</mytag>' % (encoding, data))
# [...]
encoding = tag.attributes.encoding
u_string = xml_to_unicode(u_string, encoding=encoding)
Using code similar to that, you can have some fields encoded using your default encoding and fallback to utf-7 if there are control characters present.
NOTE:
SEE ALSO:
This function attempts to reverse what unicode_to_xml() does. It takes a byte str (presumably read in from an xml file), expands all the html entities into unicode characters, and decodes the byte str into a unicode string. One thing it cannot do is restore any control characters that were removed prior to inserting into the file. If you need to keep such characters you need to use xml_to_bytes() and bytes_to_xml() or use one of the strategies documented in unicode_to_xml() instead.
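For instance, reversing the earlier unicode_to_xml() example might look like this (a sketch; the encoding keyword shown is an assumption and the output is illustrative):

>>> from kitchen.text.converters import xml_to_unicode
>>> print xml_to_unicode('String with non-ASCII characters: &lt;"á と"&gt;', encoding='utf-8')
String with non-ASCII characters: <"á と">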
How to handle errors encountered while decoding the byte_string into unicode at the beginning of the process. Values are:
XML does not allow control characters. When we encounter those we need to know what to do. Valid options are:
Use this when you have a byte str representing text that you need to make suitable for output to xml. There are several cases where this is useful. For instance, if you need to transform some strings encoded in latin-1 to utf-8 for output:
utf8_string = byte_string_to_xml(latin1_string, input_encoding='latin-1')
If you already have strings in the proper encoding you may still want to use this function to remove control characters:
cleaned_string = byte_string_to_xml(string, input_encoding='utf-8', output_encoding='utf-8')
SEE ALSO:
This function attempts to reverse what unicode_to_xml() does. It takes a byte str (presumably read in from an xml file), expands all the html entities into unicode characters, and decodes the byte str into a unicode string. One thing it cannot do is restore any control characters that were removed prior to inserting into the file. If you need to keep such characters you need to use xml_to_bytes() and bytes_to_xml() or use one of the strategies documented in unicode_to_xml() instead.
This function exists especially to put binary information into xml documents.
This function is intended for encoding things that must be preserved byte-for-byte. If you want to encode a byte string that's text and don't mind losing the actual bytes you probably want to try byte_string_to_xml() or guess_encoding_to_xml() instead.
NOTE:
If you've got fields in an xml document that were encoded with bytes_to_xml() then you want to use this function to decode them. It converts a base64 encoded string back into a byte str.
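A short round-trip sketch (assuming the default base64 encoding on both sides):

>>> from kitchen.text.converters import bytes_to_xml, xml_to_bytes
>>> raw = '\x00\xff binary \x07 data'
>>> field = bytes_to_xml(raw)       # an ascii-safe, base64 encoded string
>>> xml_to_bytes(field) == raw      # round trips byte for byte
True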
NOTE:
from kitchen.text.converters import (EXCEPTION_CONVERTERS,
                                     exception_to_unicode)

class MyError(Exception):
    def __init__(self, message):
        self.value = message

c = [lambda e: e.value]
c.extend(EXCEPTION_CONVERTERS)

try:
    raise MyError('An Exception message')
except MyError, e:
    print exception_to_unicode(e, converters=c)
Another reason would be if you're converting to a byte str and you know the bytes need to be in a non-utf-8 encoding. exception_to_bytes() defaults to utf-8 but if you convert into a byte str explicitly using a converter then you can choose a different encoding:
from kitchen.text.converters import (EXCEPTION_CONVERTERS,
                                     exception_to_bytes, to_bytes)

c = [lambda e: to_bytes(e.args[0], encoding='euc_jp'),
     lambda e: to_bytes(e, encoding='euc_jp')]
c.extend(EXCEPTION_CONVERTERS)

try:
    do_something()
except Exception, e:
    log = open('logfile.euc_jp', 'a')
    log.write('%s\n' % exception_to_bytes(e, converters=c))
Each function in this list should take the exception as its sole argument and return a string containing the message representing the exception. The functions may return the message as a byte str, a unicode string, or even an object if you trust the object to return a decent string representation. The exception_to_unicode() and exception_to_bytes() functions will make sure to convert the string to the proper type before returning.
New in version 0.2.2.
Tuple of functions to try to use to convert an exception into a string representation. This tuple is similar to the one in EXCEPTION_CONVERTERS but it's used with exception_to_bytes() instead. Ideally, these functions should do their best to return the data as a byte str but the results will be run through to_bytes() before being returned.
New in version 0.2.2.
Changed in version 1.0.1: Deprecated as simplifications allow EXCEPTION_CONVERTERS to perform the same function.
New in version 0.2.2.
New in version 0.2.2.
Changed in version 1.0.1: Code simplification allowed us to switch to using EXCEPTION_CONVERTERS as the default value of converters.
Functions related to displaying unicode text. Unicode characters don't all have the same width so we need helper functions for displaying them.
New in version 0.2: kitchen.display API 1.0.0
Specify how to deal with control characters. Possible values are:
NOTE:
This is what you want to use instead of %.*s, as it does the "right" thing with regard to UTF-8 sequences, control characters, and characters that take more than one cell position. Eg:
>>> # Wrong: only displays 8 characters because it is operating on bytes
>>> print "%.*s" % (10, 'café ñunru!')
café ñun
>>> # Properly operates on graphemes
>>> '%s' % (textual_width_chop('café ñunru!', 10))
café ñunru
>>> # takes too many columns because the kanji need two cell positions
>>> print '1234567890\n%.*s' % (10, u'一二三四五六七八九十')
1234567890
一二三四五六七八九十
>>> # Properly chops at 10 columns
>>> print '1234567890\n%s' % (textual_width_chop(u'一二三四五六七八九十', 10))
1234567890
一二三四五
NOTE:
WARNING:
This function expands a string to fill a field of a particular textual width. Use it instead of %*.*s, as it does the "right" thing with regard to UTF-8 sequences, control characters, and characters that take more than one cell position in a display. Example usage:
>>> msg = u'一二三四五六七八九十'
>>> # Wrong: This uses 10 characters instead of 10 cells:
>>> u":%-*.*s:" % (10, 10, msg[:9])
:一二三四五六七八九 :
>>> # This uses 10 cells like we really want:
>>> u":%s:" % (textual_width_fill(msg[:9], 10, 10))
:一二三四五:
>>> # Wrong: Right aligned in the field, but too many cells
>>> u"%20.10s" % (msg)
          一二三四五六七八九十
>>> # Correct: Right aligned with proper number of cells
>>> u"%s" % (textual_width_fill(msg, 20, 10, left=False))
          一二三四五
>>> # Wrong: Adding some escape characters to highlight the line but too many cells
>>> u"%s%20.10s%s" % (prefix, msg, suffix)
u'\x1b[7m          一二三四五六七八九十\x1b[0m'
>>> # Correct highlight of the line
>>> u"%s%s%s" % (prefix, display.textual_width_fill(msg, 20, 10, left=False), suffix)
u'\x1b[7m          一二三四五\x1b[0m'
>>> # Correct way to not highlight the fill
>>> u"%s" % (display.textual_width_fill(msg, 20, 10, left=False, prefix=prefix, suffix=suffix))
u'          \x1b[7m一二三四五\x1b[0m'
textwrap.wrap() from the python standard library has two drawbacks that this attempts to fix:
SEE ALSO:
This function is a light wrapper around kitchen.text.display.wrap(). Where that function returns a list of lines, this function returns one string with each line separated by a newline.
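A small sketch of the relationship between the two (the width keyword and the sample text are assumptions for illustration; the chopped output is elided):

>>> from kitchen.text.display import wrap, fill
>>> msg = u'くらとみ kuratomi くらとみ kuratomi'
>>> wrap(msg, width=12)    # a list of lines, each at most 12 cells wide
[...]
>>> fill(msg, width=12)    # the same lines joined with newlines into one string
u'...'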
NOTE:
SEE ALSO:
There are a few internal functions and variables in this module. Code outside of kitchen shouldn't use them but people coding on kitchen itself may find them useful.
This table was last regenerated on python-3.8.0a3 with unicodedata.unidata_version 12.0.0
In normal use, this function serves to tell how we're generating the combining char list. For speed reasons, we use this to generate a static list and just use that later.
Markus Kuhn's list of combining characters is more complete than what's in the python unicodedata library but the python unicodedata is synced against later versions of the unicode database
This is used to generate the _COMBINING table.
This will print a new _COMBINING table in the format used in kitchen/text/display.py. It's useful for updating the _COMBINING table with updated data from a new python as the format won't change from what's already in the file.
This function checks whether a numeric value is present within a table of intervals. It checks using a binary search algorithm, dividing the list of values in half and checking against the values until it determines whether the value is in the table.
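The idea is the classic interval bisection; here is a rough sketch of the approach (not kitchen's actual code, just an illustration of the algorithm described above):

def value_in_table(value, table):
    # table is a sorted sequence of (start, end) intervals
    low, high = 0, len(table) - 1
    while low <= high:
        mid = (low + high) // 2
        start, end = table[mid]
        if value < start:
            high = mid - 1
        elif value > end:
            low = mid + 1
        else:
            return True
    return False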
Specify how to deal with control characters. Possible values are:
NOTE:
We often want to know "does X fit in Y". It takes a while to use textual_width() to calculate this. However, we know that each canonically composed unicode character has a textual width of either 1 or 2 cells. With this we can take the following shortcuts:
The textual width of a canonically composed unicode string will always be greater than or equal to the number of unicode characters. So we can first check if the number of composed unicode characters is less than the asked-for width. If it is, we can return True immediately. If not, then we must do a full textual width lookup.
Collection of text functions that don't fit in another category.
Changed in version kitchen: 1.2.0, API: kitchen.text 2.2.0 Added isbasestring(), isbytestring(), and isunicodestring() to help tell which string type is which on python2 and python3
NOTE:
In some cases you'll have a whole bunch of byte strings and, rather than transforming them to unicode and back to byte str for output to xml, you will just want to make sure they work with the xml file you're constructing. This function will help you do that. Example:
ARRAY_OF_MOSTLY_UTF8_STRINGS = [...]
processed_array = []
for string in ARRAY_OF_MOSTLY_UTF8_STRINGS:
    if byte_string_valid_xml(string, 'utf-8'):
        processed_array.append(string)
    else:
        processed_array.append(guess_bytes_to_xml(string, encoding='utf-8'))
output_xml(processed_array)
We start by attempting to decode the byte str as UTF-8. If this succeeds we tell the world it's UTF-8 text. If it doesn't, and chardet is installed on the system and disable_chardet is False, this function will use it to try detecting the encoding of byte_string. If chardet is not installed or cannot determine the encoding with high enough confidence then we rather arbitrarily claim that it is latin-1. Since latin-1 maps every byte to a character, decoding from latin-1 to unicode will not cause UnicodeErrors, although the output might be mangled.
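For illustration, a hedged sketch of kitchen.text.misc.guess_encoding() in action (the exact spelling of the returned encoding names is an assumption, and the second result depends on chardet being disabled or absent):

>>> from kitchen.text.misc import guess_encoding
>>> guess_encoding('caf\xc3\xa9')                      # valid utf-8 bytes
'utf-8'
>>> guess_encoding('caf\xe9', disable_chardet=True)    # falls back to latin-1
'latin-1'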
In python2 this is equivalent to isinstance(obj, basestring). In python3 it checks whether the object is an instance of str, bytes, or bytearray. This is an aid to porting code that needed to test whether an object was derived from basestring in python2 (commonly used in unicode-bytes conversion functions).
New in version kitchen: 1.2.0, API: kitchen.text 2.2.0
In python2 this is equivalent to isinstance(obj, str). In python3 it checks whether the object is an instance of bytes or bytearray.
New in version kitchen: 1.2.0, API: kitchen.text 2.2.0
In python2 this is equivalent to isinstance(obj, unicode). In python3 it checks whether the object is an instance of str.
New in version kitchen: 1.2.0, API: kitchen.text 2.2.0
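A quick python2 illustration of the three helpers described above (outputs assume python2 semantics):

>>> from kitchen.text.misc import isbasestring, isbytestring, isunicodestring
>>> isbasestring('abc'), isbasestring(u'abc'), isbasestring(5)
(True, True, False)
>>> isbytestring('abc'), isbytestring(u'abc')
(True, False)
>>> isunicodestring(u'abc'), isunicodestring('abc')
(True, False)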
XML does not allow ASCII control characters. When we encounter those we need to know what to do. Valid options are:
Changed in version kitchen: 1.2.0, API: kitchen.text 2.2.0 Strip out the C1 control characters in addition to the C0 control characters.
This function prevents UnicodeError (python-2.4 or less) and UnicodeWarning (python-2.5 and higher) when we compare a unicode string to a byte str. The errors normally arise because the implicit conversion is done to ASCII. This function lets you convert with utf-8 or another encoding instead.
NOTE:
Note that str1 == str2 is faster than this function if you can accept the following limitations:
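Whatever the limitations, the basic usage of the helper looks like this (a sketch; the encoding keyword is shown explicitly, with utf-8 as the assumed default):

>>> from kitchen.text.misc import str_eq
>>> str_eq(u'café', 'caf\xc3\xa9', encoding='utf-8')
True
>>> u'café' == 'caf\xc3\xa9'    # emits UnicodeWarning on python2 and compares unequal
False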
Functions for operating on byte strings encoded as UTF-8
NOTE:
WARNING:
Use kitchen.text.display.fill() instead.
Use kitchen.text.display.wrap() instead
Use kitchen.text.misc.byte_string_valid_encoding() instead.
Use kitchen.text.display.textual_width() instead.
Use textual_width_chop() and textual_width() instead:
>>> msg = 'く ku ら ra と to み mi'
>>> # Old way:
>>> utf8_width_chop(msg, 5)
(5, 'く ku')
>>> # New way
>>> from kitchen.text.converters import to_bytes
>>> from kitchen.text.display import textual_width, textual_width_chop
>>> (textual_width(msg), to_bytes(textual_width_chop(msg, 5)))
(5, 'く ku')
Use byte_string_textual_width_fill() instead
kitchen.collections.StrictDict provides a dictionary that treats bytes and str as distinct key values.
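A brief sketch of the difference from a plain dict (the import path kitchen.collections is assumed here):

>>> from kitchen.collections import StrictDict
>>> d = StrictDict()
>>> d[u'a'] = 'unicode key'
>>> d['a'] = 'byte key'
>>> len(d)    # a plain dict would have collapsed these into one entry
2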
Functions to manipulate iterables
New in version kitchen: 0.2.1a1
Module author: Toshio Kuratomi <toshio@fedoraproject.org>
Module author: Luke Macken <lmacken@redhat.com>
This function will create an iterator out of any scalar or iterable. It is useful for turning a value handed to you into an iterable before operating on it. Iterables have their items returned. Scalars are transformed into iterables. A string is treated as a scalar value unless the include_string parameter is set to True. Example usage:
>>> list(iterate(None))
[None]
>>> list(iterate([None]))
[None]
>>> list(iterate([1, 2, 3]))
[1, 2, 3]
>>> list(iterate(set([1, 2, 3])))
[1, 2, 3]
>>> list(iterate(dict(a='1', b='2')))
['a', 'b']
>>> list(iterate(1))
[1]
>>> list(iterate(iter([1, 2, 3])))
[1, 2, 3]
>>> list(iterate('abc'))
['abc']
>>> list(iterate('abc', include_string=True))
['a', 'b', 'c']
PEP 386 defines a standard format for version strings. This module contains a function for creating strings in that format.
This function implements just enough of PEP 386 to satisfy our needs. PEP 386 defines a standard format for version strings and refers to a function that will be merged into the python standard library that transforms a tuple of version information into a standard version string. This function is an implementation of that function. Once that function becomes available in the python standard library we will start using it and deprecate this function.
version_info takes the form that PEP 386's NormalizedVersion.from_parts() uses:
((Major, Minor, [Micros]), [(Alpha/Beta/rc marker, version)],
 [(post/dev marker, version)])

Ex: ((1, 0, 0), ('a', 2), ('dev', 3456))
It generates a PEP 386 compliant version string:
N.N[.N]+[{a|b|c|rc}N[.N]+][.postN][.devN]

Ex: 1.0.0a2.dev3456
WARNING:
It's recommended that you use this function to keep a __version_info__ tuple and __version__ string in your modules. Why do we need both a tuple and a string? The string is often useful for putting into human readable locations like release announcements, version strings in tarballs, etc. Meanwhile the tuple is very easy for a computer to compare. For example, kitchen sets up its version information like this:
from kitchen.versioning import version_tuple_to_string

__version_info__ = ((0, 2, 1),)
__version__ = version_tuple_to_string(__version_info__)
Other programs that depend on a kitchen version between 0.2.1 and 0.3.0 can find whether the present version is okay with code like this:
from kitchen import __version_info__, __version__

if __version_info__ < ((0, 2, 1),) or __version_info__ >= ((0, 3, 0),):
    print 'kitchen is present but not at the right version.'
    print 'We need at least version 0.2.1 and less than 0.3.0'
    print 'Currently found: kitchen-%s' % __version__
Kitchen has a hierarchy of exceptions that should make it easy to catch many errors emitted by kitchen itself.
Exception classes for kitchen and the root of the exception hierarchy for all kitchen modules.
Exception classes thrown by kitchen's text processing routines.
The 0.1 through 1.0.0 releases focused on bringing in functions from yum and python-fedora. This porting guide tells how to port from those APIs to their kitchen replacements.
python-fedora | kitchen replacement |
fedora.iterutils.isiterable() | kitchen.iterutils.isiterable() [1] |
fedora.textutils.to_unicode() | kitchen.text.converters.to_unicode() |
fedora.textutils.to_bytes() | kitchen.text.converters.to_bytes() |
>>> # Old code
>>> isiterable('abcdef')
True
>>> # New code
>>> isiterable('abcdef', include_string=True)
True
yum | kitchen replacement |
yum.i18n.dummy_wrapper() | kitchen.i18n.DummyTranslations.ugettext() [2] |
yum.i18n.dummyP_wrapper() | kitchen.i18n.DummyTranslations.ungettext() [2] |
yum.i18n.utf8_width() | kitchen.text.display.textual_width() |
yum.i18n.utf8_width_chop() | kitchen.text.display.textual_width_chop() and kitchen.text.display.textual_width() [3] [5] |
yum.i18n.utf8_valid() | kitchen.text.misc.byte_string_valid_encoding() |
yum.i18n.utf8_text_wrap() | kitchen.text.display.wrap() [4] |
yum.i18n.utf8_text_fill() | kitchen.text.display.fill() [4] |
yum.i18n.to_unicode() | kitchen.text.converters.to_unicode() [6] |
yum.i18n.to_unicode_maybe() | kitchen.text.converters.to_unicode() [6] |
yum.i18n.to_utf8() | kitchen.text.converters.to_bytes() [6] |
yum.i18n.to_str() | kitchen.text.converters.to_unicode() or kitchen.text.converters.to_bytes() [7] |
yum.i18n.str_eq() | kitchen.text.misc.str_eq() |
yum.misc.to_xml() | kitchen.text.converters.unicode_to_xml() or kitchen.text.converters.byte_string_to_xml() [8] |
yum.i18n._() | See: Initializing Yum i18n |
yum.i18n.P_() | See: Initializing Yum i18n |
yum.i18n.exception2msg() | kitchen.text.converters.exception_to_unicode() or kitchen.text.converters.exception_to_bytes() [9] |
>>> # Old way
>>> utf8_width_chop(msg, 5)
(5, 'く ku')
>>> # New way
>>> from kitchen.text.display import textual_width, textual_width_chop
>>> (textual_width(msg), textual_width_chop(msg, 5))
(5, u'く ku')
>>> from kitchen.text.converters import to_unicode
>>> to_unicode(5)
u'5'
>>> to_unicode(5, nonstring='passthru')
5
from kitchen.text.converters import EXCEPTION_CONVERTERS, \
    BYTE_EXCEPTION_CONVERTERS, exception_to_unicode, \
    exception_to_bytes

def exception2umsg(e):
    '''Return a unicode representation of an exception'''
    c = [lambda e: e.value]
    c.extend(EXCEPTION_CONVERTERS)
    return exception_to_unicode(e, converters=c)

def exception2bmsg(e):
    '''Return a utf8 encoded str representation of an exception'''
    c = [lambda e: e.value]
    c.extend(BYTE_EXCEPTION_CONVERTERS)
    return exception_to_bytes(e, converters=c)
The reason to define this wrapper is that many of the exceptions in yum put the message in the value attribute of the Exception instead of adding it to the args attribute. So the default EXCEPTION_CONVERTERS don't know where to find the message. The wrapper tells kitchen to check the value attribute for the message. The reason to define two wrappers may be less obvious. yum.i18n.exception2msg() can return a unicode string or a byte str depending on a combination of what attributes are present on the Exception and what locale the function is being run in. By contrast, kitchen.text.converters.exception_to_unicode() only returns unicode strings and kitchen.text.converters.exception_to_bytes() only returns byte str. This is much safer as it keeps code that can only handle unicode or only handle byte str correctly from getting the wrong type when an input changes but it means you need to examine the calling code when porting from yum.i18n.exception2msg() and use the appropriate wrapper.
Previously, yum had several pieces of code to initialize i18n. From the toplevel of yum/i18n.py:
try:
    '''
    Setup the yum translation domain and make _() and P_() translation wrappers
    available.
    using ugettext to make sure translated strings are in Unicode.
    '''
    import gettext
    t = gettext.translation('yum', fallback=True)
    _ = t.ugettext
    P_ = t.ungettext
except:
    '''
    Something went wrong so we make a dummy _() wrapper there is just
    returning the same text
    '''
    _ = dummy_wrapper
    P_ = dummyP_wrapper
With kitchen, this can be changed to this:
from kitchen.i18n import easy_gettext_setup, DummyTranslations

try:
    _, P_ = easy_gettext_setup('yum')
except:
    translations = DummyTranslations()
    _ = translations.ugettext
    P_ = translations.ungettext
NOTE:
b_, bP_ = easy_gettext_setup('yum', use_unicode=False)
The second place where i18n is setup is in yum.YumBase._getConfig() in yum/__init__.py if gaftonmode is in effect:
if startupconf.gaftonmode:
    global _
    _ = yum.i18n.dummy_wrapper
This can be changed to:
if startupconf.gaftonmode:
    global _
    _ = DummyTranslations().ugettext
At the moment, we're supporting python-2.4 and above. Understand that there's a lot of python features that we cannot use because of this.
Sometimes modules in the python standard library can be added to kitchen so that they're available. When we do that we need to be careful of several things:
def to_unicode(msg, encoding='utf8', errors='replace'):
    return unicode(msg, encoding, errors)

# Smoketest only.  This will give 100% coverage for your code (it
# tests all of the code inside of to_unicode) but it leaves a lot of
# room for errors as it doesn't test all combinations of arguments
# that are then passed to the unicode() function.
tools.ok_(to_unicode('abc') == u'abc')

# Better -- tests now cover non-ascii characters and that error conditions
# occur properly.  There's a lot of other permutations that can be
# added along these same lines.
tools.ok_(to_unicode(u'café', 'utf8', 'replace'))
tools.assert_raises(UnicodeError, to_unicode, [u'cafè ñunru'.encode('latin1')])
We use sphinx to build our documentation. We use the sphinx autodoc extension to pull docstrings out of the modules for API documentation. This means that docstrings for subpackages and modules should follow a certain pattern. The general structure is:
Currently the kitchen library is in early stages of development. While we're in this state, the main kitchen library uses the following pattern for version information:
NOTE:
All strings that are used as feedback for users need to be translated. kitchen sets up several functions for this. _() is used for marking things that are shown to users via print, GUIs, or other "standard" methods. Strings for exceptions are marked with b_(). This function returns a byte str which is needed for use with exceptions:
from kitchen import _, b_

def print_message(msg, username):
    print _('%(user)s, your message of the day is: %(message)s') % {
        'message': msg, 'user': username}

raise Exception(b_('Test message'))
This serves several purposes:
NOTE:
paver (http://www.blueskyonmars.com/projects/paver/) and babel (http://babel.edgewall.org/) are used to extract the strings.
Kitchen strives to have a long deprecation cycle so that people have time to switch away from any APIs that we decide to discard. Discarded APIs should raise a DeprecationWarning and clearly state in the warning message and the docstring how to convert old code to use the new interface. An example of deprecating a function:
import warnings

from kitchen import _
from kitchen.text.converters import to_bytes, to_unicode
from kitchen.text.new_module import new_function

def old_function(param):
    '''**Deprecated**

    This function is deprecated.  Use
    :func:`kitchen.text.new_module.new_function` instead.  If you want
    unicode strings as output, switch to::

        >>> from kitchen.text.new_module import new_function
        >>> output = new_function(param)

    If you want byte strings, use::

        >>> from kitchen.text.new_module import new_function
        >>> from kitchen.text.converters import to_bytes
        >>> output = to_bytes(new_function(param))
    '''
    warnings.warn(_('kitchen.text.old_function is deprecated.  Use'
            ' kitchen.text.new_module.new_function instead'),
            DeprecationWarning, stacklevel=2)

    as_unicode = isinstance(param, unicode)
    message = new_function(to_unicode(param))
    if not as_unicode:
        message = to_bytes(message)
    return message
If a particular API change is very intrusive, it may be better to create a new version of the subpackage and ship both the old version and the new version.
Update the NEWS file when you make a change that will be visible to users. This is not a ChangeLog file so we don't need to list absolutely everything, but it should give the user an idea of how this version differs from prior versions. API changes should be listed here explicitly. Bugfixes can be more general:
-----
0.2.0
-----
* Relicense to LGPLv2+
* Add kitchen.text.format module with the following functions:
  textual_width, textual_width_chop.
* Rename the kitchen.text.utils module to kitchen.text.misc. use of the
  old names is deprecated but still available.
* bugfixes applied to kitchen.pycompat24.defaultdict that fixes some
  tracebacks
Kitchen itself is a namespace. The kitchen sdist (tarball) provides certain useful subpackages.
SEE ALSO:
Each subpackage should have its own version information which is independent of the other kitchen subpackages and the main kitchen library version. This is used so that code that depends on kitchen APIs can check the version information. The standard way to do this is to put something like this in the subpackage's __init__.py:
from kitchen.versioning import version_tuple_to_string

__version_info__ = ((1, 0, 0),)
__version__ = version_tuple_to_string(__version_info__)
__version_info__ is documented in kitchen.versioning. The values of the first tuple should describe API changes to the module. There are at least three numbers present in the tuple: (Major, minor, micro). The major version number is for backwards incompatible changes (For instance, removing a function, or adding a new mandatory argument to a function). Whenever one of these occurs, you should increment the major number and reset minor and micro to zero. The second number is the minor version. Anytime new but backwards compatible changes are introduced this number should be incremented and the micro version number reset to zero. The micro version should be incremented when a change is made that does not change the API at all. This is a common case for bugfixes, for instance.
Version information beyond the first three parts of the first tuple may be useful for versioning but semantically have similar meaning to the micro version.
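Applied to the rules above, a hypothetical sequence of changes might look like this:

__version_info__ = ((1, 0, 0),)   # starting point
__version_info__ = ((1, 1, 0),)   # added a new, backwards compatible function
__version_info__ = ((1, 1, 1),)   # fixed a bug without touching the API
__version_info__ = ((2, 0, 0),)   # removed a deprecated function (incompatible)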
NOTE:
Subpackages within kitchen should meet these criteria:
SEE ALSO:
Addon packages are very similar to subpackages integrated into the kitchen sdist. This section just lists some of the differences to watch out for.
Your setup.py should contain entries like this:
# It's suggested to use a dotted name like this so the package is easily
# findable on pypi:
setup(name='kitchen.config',
    # Include kitchen in the keywords, again, for searching on pypi
    keywords=['kitchen', 'configuration'],
    # This package lives in the directory kitchen/config
    packages=['kitchen.config'],
    # [...]
    )
Create a kitchen directory in the toplevel. Place the addon subpackage in there. For example:
./                           <== toplevel with README, setup.py, NEWS, etc
kitchen/
kitchen/__init__.py
kitchen/config/              <== subpackage directory
kitchen/config/__init__.py
The __init__.py in the kitchen directory is special. It won't be installed. It just needs to pull in the kitchen package from the system so that you are able to test your module. You should be able to use this boilerplate:
# Fake module.  This is not installed.  It's just made to import the real
# kitchen modules for testing this module
import pkgutil

# Extend the __path__ with everything in the real kitchen module
__path__ = pkgutil.extend_path(__path__, __name__)
NOTE:
Your unittests should now be able to find both your submodule and the main kitchen module.
It is recommended that addon packages version similarly to Versioning. The __version_info__ and __version__ strings can be changed independently of the version exposed by setup.py so that you have both an API version (__version_info__) and release version that's easier for people to parse. However, you aren't required to do this and you could follow a different methodology if you want (for instance, Kitchen versioning)
More information about the project can be found on the project webpage
The latest published version of this documentation can be found on the documentation page
unknown
2022 Red Hat, Inc. and others
December 24, 2022 | 0.2 |