PEP: 218 Title: Adding a Built-In Set Object Type Version: $Revision$ Author: gvwilson@ddj.com (Greg Wilson) Status: Draft Type: Standards Track Python-Version: 2.2 Created: 31-Jul-2000 Post-History: Introduction This PEP proposes adding a Set module to the standard Python library, and to then make sets a built-in Python type if that module is widely used. After explaining why sets are desirable, and why the common idiom of using dictionaries in their place is inadequate, we describe how we intend built-in sets to work, and then how the preliminary Set module will behave. The penultimate section discusses the mutability (or otherwise) of sets and set elements, and the solution which the Set module will implement. The last section then looks at alternatives that were considered, but discarded. Rationale Sets are a fundamental mathematical structure, and are very commonly used in algorithm specifications. They are much less frequently used in implementations, even when they are the "right" structure. Programmers frequently use lists instead, even when the ordering information in lists is irrelevant, and by-value lookups are frequent. (Most medium-sized C programs contain a depressing number of start-to-end searches through malloc'd vectors to determine whether particular items are present or not...) Programmers are often told that they can implement sets as dictionaries with "don't care" values. Items can be added to these "sets" by assigning the "don't care" value to them; membership can be tested using "dict.has_key"; and items can be deleted using "del". However, the other main operations on sets (union, intersection, and difference) are not directly supported by this representation, since their meaning is ambiguous for dictionaries containing key/value pairs. Long-Term Proposal The long-term goal of this PEP is to add a built-in set type to Python. This type will be an unordered collection of unique values, just as a dictionary is an unordered collection of key/value pairs. Constant sets will be represented using the usual mathematical notation, so that "{1, 2, 3}" will be a set of three integers. In order to avoid ambiguity, the empty set will be written "{-}", rather than "{}" (which is already used to represent empty dictionaries). We feel that this notation is as reasonable as the use of "(3,)" to represent single-element tuples; a more radical strategy is discussed in the "Alternatives" section, and more readable than the earlier proposal "{,}". Iteration and comprehension will be implemented in the obvious ways, so that: for x in S: will step through the elements of S in arbitrary order, while: {x**2 for x in S} will produce a set containing the squares of all elements in S, Membership will be tested using "in" and "not in", and basic set operations will be implemented by a mixture of overloaded operators: | union & intersection ^ symmetric difference - asymmetric difference and methods: S.add(x) Add "x" to the set. S.update(s) Add all elements of sequence "s" to the set. S.remove(x) Remove "x" from the set. If "x" is not present, this method raises a LookupError exception. S.discard(x) Remove "x" from the set if it is present, or do nothing if it is not. S.popitem() Remove and return an arbitrary element, raising a LookupError if the element is not present. and one new built-in conversion function: set(x) Create a set containing the elements of the collection "x". Notes: 1. We propose using the bitwise operators "|&" for intersection and union. While "+" for union would be intuitive, "*" for intersection is not (very few of the people asked guessed what it did correctly). 2. We considered using "+" to add elements to a set, rather than "add". However, Guido van Rossum pointed out that "+" is symmetric for other built-in types (although "*" is not). Use of "add" will also avoid confusion between that operation and set union. 3. Sets raise "LookupError" exceptions, rather than "KeyError" or "ValueError", because set elements are neither keys nor values. Short-Term Proposal In order to determine whether there is enough demand for sets to justify making them a built-in type, and to give users a chance to try out the semantics we propose for sets, our short-term proposal is to add a "Set" class to the standard Python library. This class will have the operators and methods described above; it will also have named methods corresponding to all of the operations: a "union" method for "|", and a "union_update" method for "|=", and so on. This class will use a dictionary internally to contain set values. In order to avoid having to duplicate values (e.g. for iteration through the set), the class will rely on the iterators which are scheduled to appear in Python 2.2. Tim Peters believes that the class's constructor should take a single sequence as an argument, and populate the set with that sequence's elements. His argument is that in most cases, programmers will be created sets from pre-existing sequences, so that common case should be usual. However, this would require users to remember an extra set of parentheses when initializing a set with known values: >>> Set((1, 2, 3, 4)) # case 1 On the other hand, feedback from a small number of novice Python users (all of whom were very experienced with other languages) indicates that people will find a "parenthesis-free" syntax more natural: >>> Set(1, 2, 3, 4) # case 2 On the other, other hand, if Python does adopt a dictionary-like notation for sets in the future, then case 2 will become redundant. We have therefore adopted the first strategy, in which the initializer takes a single sequence argument. Mutability The most difficult question to resolve in this proposal was whether sets ought to be able to contain mutable elements. A dictionary's keys must be immutable in order to support fast, reliable lookup. While it would be easy to require set elements to be immutable, this would preclude sets of sets (which are widely used in graph algorithms and other applications). At Tim Peters' suggestion, we will implement the following compromise. A set may only contain immutable elements, but is itself mutable *until* its hash code is calculated. As soon as that happens, the set is "frozen", i.e. becomes immutable. Thus, a set may be used as a dictionary key, or as a set element, but cannot be updated after this is done. Peters reports that this behavior rarely causes problems in practice. Alternatives An alternative to the notation "{-}" for the empty set would be to re-define "{}" to be the empty collection, rather than the empty dictionary. Operations which made this object non-empty would silently convert it to either a dictionary or a set; it would then retain that type for the rest of its existence. This idea was rejected because of its potential impact on existing Python programs. A similar proposal to modify "dict.keys" and "dict.values" to return sets, rather than lists, was rejected for the same reasons. Copyright This document has been placed in the Public Domain. Local Variables: mode: indented-text indent-tabs-mode: nil End: