Sets and Frozen Sets Here are the bare bones. I will demonstrate sets in class, including some scripts and functions we can write using sets (and to a lesser extent frozensets). The reality of understanding sets/frozensets understanding the basic operations we can perform on them. Sets are mostly like lists (but share two properties with dictionaries: see 2-3), but with three differences (1) Sets do not contain duplicates; if we add a valute that already is in a set, the set remains unchanged; this means we can often add a value to a set without first checking if it is in the set: if it isn't in the set, it is added; if it is in the set, the set remains unchanged. (2) Sets are unordered: we cannot index different values, and when we iterate through them the order of the values produced is not fixed (like dictionaries) (3) All values in sets (like keys in dictionaries) must be immutable. So we can have sets of tuples, but not sets of lists There are also a large number of operators/methods that take sets as arguments and produce sets as results (discussed below). Frozensets are like immutatable sets: they have the two big properties listed above, but their methods are restricted to those that do not mutate the frozenset. So frozensets are to sets like tuples are to lists. As with tuples, we can use frozensets as keys to dictionaries because frozensets are immutable. Sets have literals: a positive number of values (1 or more) separated by commas, all in braces. So, these literals are like dicts, but there is no colon between key:value pairs (which is how Python tells the difference between a dict and a set. But there is one problem Python cannot tell whether {} is an empty dict or an empty set. That is why the rule above says "1 or more"; we write empty dictionaries as {}; we must write empty sets as set(). So a = set() is the empty set (no/0 values) b = {'a', b', 'c'} is a set of str c = {('ICS-31','MATH-2A','ICS-6B'), ('ICS-31','BIO-9','ICS-6D')} is a set of tuples Note that if we write the set {[]} Python raises: TypeError: unhashable type: 'list' Because list are mutable and we cannot have mutable values as keys in dictionaries or as values in sets. I would prefer is say TypeError: mutable type: 'list' but it doesn't; instead it says unhashable. Hashable means immutable, so unhashable means-un immutable, which means mutatable. Set operations: (1) len: we can compute the length of a set (# of values at the top-level) len(a) is 0; len(b) is 3 len(c) is 2. (2) No Indexing: the values in sets are unordered so it makes no sense to try to index them (3) No Slicing (likewise) (4) Checking containment: the in/not in operators These operators work on the values in a set 'a' in a is False; 'a' in b is True; 'ICS-31' in c is False, but ('ICS-31','MATH-2A','ICS-6B') in c is True; note that ('ICS-6B','MATH-2A','ICS-31') in c is False, because for two tuples to be considered the same, their values must be in the same order (5) No Catenation (6) No Multiplication (7) Iterability: for i in b: produces all the top-level values in a (there are len(a) of them): for i in b: print(i,end='') prints: abc Note that we can write iter(aset) for use in while loops, to eventually produce all the values stored in aset. Note that the functions max and sum work on lists, tuples, sets, and frozensets (and on adict.keys() and adict.values()) -so long as the values are numeric. We will write our own functions that will take arguments that are any iterable type of data, and thus work for all these different types of data. In fact, the constructors for all these types take arguments that are iterable. We have seen how to construct a list from a tuple and a tuple from a list. We can also construct list from sets and sets from lists, by writing aset = set(alist) # len(aset) <= len(alist): aset has no duplicated values alist = list(aset) # len(alist) == len(aset): it has no duplicated values # b ecause aset has no duplicated valuesl Note set('abc') constructs the set {'a','b','c'} because strings are iterable; of course if we print that set the values can appear in any order. (8) There are a variety of set operations (from mathematics) that appear in Python in both a method and operator form. There are mutation versions of these operators as well (much like + from mathematics and += in Python). aset1 == aset2 # set equality aset1 != aset2 # set inequality two sets are equal if they have exactly the same values, otherwise they are not equal: {1,2,3} == {3,2,1} is True because order in sets makes no difference (unlike strings, lists, and tuples; but like dictionaries). Note that sets are never equal to lists. For two objects to be == they must be the same data-type (two lists, two tuples, two dicts, two sets) and store the same values. aset1.isdisjoint(aset2) : do these sets have no common values aset1.issubset(aset2) : every value in aset1 is also in aset2 aset1 <= aset2 aset1 < aset2 : aset1 <= aset2 and aset1 != asets2 aset1.issuperset(aset2) : every value in aset2 is also in aset1 aset1>=aset2 sometimes if aset1 <= aset2 we say that aset1 is contained in aset2, and if aset1 >= aset2 we say that aset2 is contained in aset1 aset1.union(aset2, ..., asetn) aset1 | aset2 | ... | asetn produces a new set with the union of all the sets: the new set has one of every value in the other sets: {1,2} | {2,3,4} | {1,3,6} is {1,2,3,4,6} so unions construct new sets whose length are bigger aset1.intersection(aset2, ..., asetn) aset1 & aset2 & ... & asetn produces a new set with the intersection of all the sets: the new set has only values that are in every other set: {1,2} | {2,3,4} | {1,3,6} is {1} so intersections construct new sets whose length are smaller aset1.difference(aset2, ..., asetn) aset1 - aset2 - ... - asetn produces a new set with the difference between aset1 and all the other sets: the new set has all the values that are in aset1 but not in any of the other sets: {1,2,3,4,5,6} - {2,4} - {4,5} is {1,3,6}; so differences construct new sets smaller than the first aset1.symmetric_difference(aset2) aset1 ^ set2 produces a new set with the values in one set but not the other: {0,2,4,5,6} ^ {1,3,5,6} is {0,1,2,3,4}; so symmetic_differences produce sets smaller than each argument/operand. Symmetric difference produces all values not in the intersection: note {0,2,4,5,6} & {1,3,5,6} is {5,6} and the symmetric difference is all avalues in the set, except these. We can define a^b as (a|b) - (a&b). There is one big difference between methods and operators: the operators require sets for both operands, but the methods allow any iterables for their arguments. So we CANNOT write {1,2,3} | [2,3,4], but we CAN write {1,2,3}.union([2,3,4,2]): Python turns the list [2,3,4,2] into a set {2,3,4} and then performs the union operation, which constructs the new set {1,2,3,4}. Set (mutation) operations (a) aset.add(value): add value to set: does nothing if value is already in aset Adding to a set is fundamental, like appending to a list. Remember that we "append" to a list but "add" to a set. Suppose x = {1,2,3} After aset.add(2), the set is unchanged After aset.add('x'), the set is {1,2,3,'x'} (iterated in any order) aset.remove(value) : remove value from aset: if not in aset raise KeyError aset.discard(value): remove value from aset: if not in aset do nothing aset.pop() : remove random value from aset: if empty raise KeyError aset.clear() : remove all values from aset: make it empty (b) These update operations for sets are similar to update operations for numeric values: aset1 |= aset2 is like a += b; the former translates into aset1 = aset1 | aset 2 and the latter into a = a + b. aset1.update(aset2,...asetn) aset1 |= aset2 | ... | asetn mutates aset1 to include all the values found in aset1 and any other set aset1.intersection_update(aset2,...asetn) aset1 &= aset2 & ... & asetn mutates aset1 to include only the values found in aset1 and every other set aset1.difference_update(aset2,...asetn) aset1 -= aset2 - ... - asetn mutates aset1 to include only the values found aset1 and no other sets aset1.symmetric_difference_update(aset2) aset1 ^= aset2 mutates aset1 to include only the values found aset1 or aset2 but not both Frozensets are very similar to sets, but we cannot use any of the muation methods or operators. The constructor is named frozenset: frozenset() constructs an empty frozenset. We can use this constructor to convert back and forth easily bewteen sets and frozensets: frozenset(aset) constructs a frozenset with all the values in aset and set(afrozenset) constructs a set with all the values in afrozenset. ------------------------------------------------------------------------------ Comprehensions As with lists/tuples, we can build sets/frozensets via comprehensions as s = {comprehension} fs = frozenset({comprehension}) which constructs a frozenset from a set as above So, to create a set of words (no duplicates), split (by spaces) from a string, we could write words = {s for s in 'to be or not to be that is the question'.split(' ')} here words is now {'to', 'be', 'or', 'not', 'that', 'is', 'the', 'question'} If we wanted only the words of 3 or fewer characters, we could include the option and write: words={s for s in 'to be or not to be that is the question'.split(' ') if len(s)<=3} here words is now {'to', 'be', 'or', 'not', 'is', 'the'} Generally, we can translate a set comprehension as follows. comprehension = set() for i in iterable: if bool_expression-i: comprehension.add(i) Notice that we don't need to write if bool_expression-i and not i in comprehension: comprehension.add(i) because the add method automatically does the right thing. We shouldn't write such redundant checks. What add does is do that check first anyway, so if write such a check Python is doing it twice ------------------------------------------------------------------------------ A Quick use of Sets Recall that we discussed the following reverse method in the previous lecture. def reverse(adict): answer = {} for k,k_vals in adict.items(): for v in k_vals: answer.setdefault(v,[]).append(k) return answer But one problem with it was that the answer dictionary could contain duplicate values in the list associated with its keys. We solved the problem by writing code to not append the value to the list if it was already there. def reverse_distinct(adict): answer = {} for k,k_vals in adict.items(): for v in k_vals: where = answer.setdefault(v,[]) if k not in where: where.append(k) return answer But really, we should have chosen sets to use as the values in the answer dictionary. When using sets, there is a much easier solution: def reverse_distinct(adict): answer = {} for k,k_vals in adict.items(): for v in k_vals: answer.setdefault(v,set()).add(k) return answer Notice that the only change was to the line (in the original reverse_distinct) answer.setdefault(v,set()).add(k) Here we set the default (if v is not in aswer) to be the empty set (which recall we must write as set(), not {} which is an empty dictionary). Also we must substitute add (the method for adding a value to a set) for append (the method for appending a value to a list) When printed (with print_dict), the answer looks as follows AZ -> {'alex'} CA -> {'rich', 'alex', 'ellen', 'mark'} IL -> {'rich'} IN -> {'mark'} NY -> {'alex', 'david'} OR -> {'ellen', 'patty'} PA -> {'david', 'alex', 'rich', 'ellen', 'mark', 'patty'} RI -> {'david'} WA -> {'david', 'alex', 'rich', 'ellen', 'mark', 'patty'} ------------------------------------------------------------------------------ Default Dictionaries: A new kind of dictionary that is often simpler to use There is a special kind of dictionary, called a defaultdict, that makes the code above even simpler. It also makes the code for count simpler. Let's take a quick look at defaultdict and how to simplify the code for these two dictionaries. First, we must import it from the collections module: typically by from collections import defaultdict Finally (that was short!) when we define a defaultdict we specify an argument that is often just the name of the type to construct an object from if we look up a key that is not in the defaultdict: that is, when we define a defaultdict we specify what default value to use when a new key is used with a dictionary. Other than that, we use a defaultdict just like a dict (although it will print a bit differently). With this new kind of dictionary (and I use it a lot) the above code simplifies to def reverse_distinct(adict): answer = defaultdict(set) # key not in answer? use/put a set() in for k,k_vals in adict.items(): for v in k_vals: answer[v].add(k) # add it to current set, or a new one return answer Here, each time we lookup value v in the answer defaultdict (answer[v]), if it is not there it assocates this value with an empty set (set()) and then adds k to that empty set. Likewise, we can simplify the count function to def count(alist): answer = defaultdict(int) # key not in answer? use/put a int()/0 for v in alist: answer[v] += 1 # increment current value, or 0 return answer Note that int() returns a reference to the 0 int object (how convenient). Here, each time we lookup value v in the answer defaultdict (answer[v]), if it is not there it assocates this value with the value 0 (int()) and then increments that value to 1. Often when we build dictionaries, it is easier to use a defaultdict; but it is not much harder to specify a dict and use a setdefault method for it, or even use an if instead. Recall our original definition of cout was def count(alist): answer = {} for v in alist: if v in alist: # Check if key v is in dictionary answer[v] += 1 # Yes, increment its asscoated value else: # No, answer[v] = 1 # set its value to 1 return answer or def count(alist): answer = {} for v in alist: if v not in alist: # Check if key v is NOT in dictionary answer[v] = 0 # Not present set its value to 0 answer[v] += 1 # Increment it associated value, which # might be the 0 just put there return answer