Sets and Frozen Sets


Here are the bare bones. I will demonstrate sets in class, including some
scripts and functions we can write using sets (and to a lesser extent
frozensets). The reality of understanding sets/frozensets is understanding the
basic operations we can perform on them.

Sets are mostly like lists (but share two properties with dictionaries: see
2-3), but with three differences:

  (1) Sets do not contain duplicates; if we add a value that already is in a
        set, the set remains unchanged; this means we can often add a value to
        a set without first checking if it is in the set: if it isn't in the
        set, it is added; if it is in the set, the set remains unchanged.

  (2) Sets are unordered: we cannot index different values, and when we iterate
        through them the order of the values produced is not fixed
        (dictionaries cannot be indexed by position either, although since
        Python 3.7 they do iterate in insertion order)

  (3) All values in sets (like keys in dictionaries) must be immutable
       (technically, hashable). So we can have sets of tuples, but not sets
       of lists.

There are also a large number of operators/methods that take sets as arguments
and produce sets as results (discussed below).

Frozensets are like immutable sets: they have the three properties listed
above, but their methods are restricted to those that do not mutate the
frozenset. So frozensets are to sets like tuples are to lists. As with tuples,
we can use frozensets as keys to dictionaries because frozensets are immutable.

Sets have literals: a positive number of values (1 or more) separated by
commas, all in braces. So, these literals are like dicts, but there is no colon
between key:value pairs (which is how Python tells the difference between a
dict and a set).

But there is one problem: Python cannot tell whether {} is an empty dict or an
empty set, so it always treats {} as an empty dict. That is why the rule above
says "1 or more"; we write empty dictionaries as {}; we must write empty sets
as set().
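
A quick check in the interpreter makes the rule concrete. This is only a
minimal sketch; the variable names are made up for illustration.

```python
empty_dict = {}        # braces with nothing inside always mean an empty dict
empty_set = set()      # the only way to write an empty set
one_value = {'a'}      # one or more values and no colons: a set

print(type(empty_dict).__name__)   # dict
print(type(empty_set).__name__)    # set
print(type(one_value).__name__)    # set
print(len(empty_set))              # 0
```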

So 

a = set() is the empty set (no/0 values)

b = {'a', 'b', 'c'} is a set of str

c = {('ICS-31','MATH-2A','ICS-6B'), ('ICS-31','BIO-9','ICS-6D')}
    is a set of tuples

Note that if we write the set {[]} Python raises:

  TypeError: unhashable type: 'list'

This happens because lists are mutable and we cannot have mutable values as
keys in dictionaries or as values in sets. I would prefer it say

 TypeError: mutable type: 'list'

but it doesn't; instead it says unhashable. For Python's built-in types,
hashable effectively means immutable, so unhashable means un-immutable, which
means mutable.
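
The sketch below probes which values Python accepts in a set. The helper
can_be_in_set is hypothetical (it is not part of Python); it just tries to
build a one-value set and reports whether that raised TypeError. The last call
shows a subtlety: a tuple is immutable, but a tuple that contains a list is
still unhashable.

```python
def can_be_in_set(value):
    """Return True if value can be stored in a set (that is, if it is hashable)."""
    try:
        {value}            # building a one-value set hashes the value
        return True
    except TypeError:      # raised for unhashable values such as lists
        return False

print(can_be_in_set(('ICS-31', 'BIO-9')))   # True:  tuple of str
print(can_be_in_set(['ICS-31', 'BIO-9']))   # False: list
print(can_be_in_set(frozenset({1, 2})))     # True:  frozenset
print(can_be_in_set({1, 2}))                # False: set (mutable)
print(can_be_in_set((1, [2, 3])))           # False: tuple holding a list
```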


Set operations:

(1) len: we can compute the length of a set (# of values at the top-level)
    len(a) is 0; len(b) is 3; len(c) is 2.

(2) No Indexing: the values in sets are unordered so it makes no sense to try
      to index them

(3) No Slicing (likewise)

(4) Checking containment: the in/not in operators
    These operators work on the values in a set
    'a' in a is False; 'a' in b is True; 'ICS-31' in c is False, but
    ('ICS-31','MATH-2A','ICS-6B') in c is True; note that
    ('ICS-6B','MATH-2A','ICS-31') in c is False, because for two tuples to be
    considered the same, their values must be in the same order

(5) No Catenation

(6) No Multiplication

(7) Iterability: for i in b:  produces all the top-level values in b
       (there are len(b) of them):
    for i in b:
        print(i,end='')
    might print: abc (or the same letters in another order, since sets are
    unordered)

    Note that we can write iter(aset) for use in while loops, to eventually
    produce all the values stored in aset.

    Note that the functions max and sum work on lists, tuples, sets, and
    frozensets (and on adict.keys() and adict.values()), so long as the values
    are numeric. We will write our own functions that will take arguments that
    are any iterable type of data, and thus work for all these different types
    of data.

    In fact, the constructors for all these types take arguments that are
    iterable. We have seen how to construct a list from a tuple and a tuple from
    a list. We can also construct lists from sets and sets from lists, by writing

    aset  = set(alist)  # len(aset) <= len(alist): aset has no duplicated values
    alist = list(aset)  # len(alist) == len(aset): it has no duplicated values
                        # because aset has no duplicated values

    Note set('abc') constructs the set {'a','b','c'} because strings are
    iterable; of course if we print that set the values can appear in any
    order.
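
As a minimal sketch of the points above (the sample list is made up), the
following converts between lists and sets, applies max and sum, and uses iter
with next in a while loop. Results are sorted before printing because a set
can produce its values in any order.

```python
alist = [3, 1, 3, 2, 1]
aset = set(alist)          # duplicates disappear
back = list(aset)          # same values, in whatever order the set yields

print(len(alist), len(aset), len(back))   # 5 3 3
print(sorted(back))                       # [1, 2, 3]
print(max(aset), sum(aset))               # 3 6

# Visit every value with an explicit iterator in a while loop
it = iter(aset)
seen = []
while True:
    try:
        seen.append(next(it))   # next raises StopIteration when exhausted
    except StopIteration:
        break
print(sorted(seen))                       # [1, 2, 3]
print(set('abc') == {'a', 'b', 'c'})      # True
```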

(8) There are a variety of set operations (from mathematics) that appear in
    Python in both a method and operator form. There are mutation versions of
    these operators as well (much like + from mathematics and += in Python).

    aset1 == aset2       # set equality
    aset1 != aset2       # set inequality

    two sets are equal if they have exactly the same values, otherwise they
    are not equal: {1,2,3} == {3,2,1} is True because order in sets makes no
    difference (unlike strings, lists, and tuples; but like dictionaries).

    Note that sets are never equal to lists. For two collections to be == they
    must generally be the same data-type (two lists, two tuples, two dicts, two
    sets) and store the same values; one exception is that a set and a
    frozenset storing the same values are ==.

    aset1.isdisjoint(aset2)             : do these sets have no common values

    aset1.issubset(aset2)               : every value in aset1 is also in aset2
    aset1 <= aset2                      

    aset1 < aset2                       : aset1 <= aset2 and aset1 != aset2

    aset1.issuperset(aset2)             : every value in aset2 is also in aset1
    aset1 >= aset2

    sometimes if aset1 <= aset2 we say that aset1 is contained in aset2, and if
      aset1 >= aset2 we say that aset2 is contained in aset1

    aset1.union(aset2, ..., asetn)
    aset1 | aset2 | ... | asetn
      produces a new set with the union of all the sets: the new set has one
      of every value in the other sets: {1,2} | {2,3,4} | {1,3,6} is {1,2,3,4,6}
      so unions construct new sets at least as big as each of the sets

    aset1.intersection(aset2, ..., asetn)
    aset1 & aset2 & ... & asetn
      produces a new set with the intersection of all the sets: the new set has
      only values that are in every set: {1,2} & {1,2,3,4} & {1,3,6} is {1}
      so intersections construct new sets no bigger than any of the sets

    aset1.difference(aset2, ..., asetn)
    aset1 - aset2 - ... - asetn
      produces a new set with the difference between aset1 and all the other
      sets: the new set has all the values that are in aset1 but not in any of
      the other sets: {1,2,3,4,5,6} - {2,4} - {4,5} is {1,3,6}; so differences
      construct new sets no bigger than the first

    aset1.symmetric_difference(aset2)
    aset1 ^ aset2
      produces a new set with the values in one set but not the other:
      {0,2,4,5,6} ^ {1,3,5,6} is {0,1,2,3,4}; so symmetric_differences produce
      sets no bigger than the union of the operands. Symmetric difference
      produces all values not in the intersection: note {0,2,4,5,6} & {1,3,5,6}
      is {5,6} and the symmetric difference is all values in the union, except
      these. We can define a^b as (a|b) - (a&b).

  There is one big difference between methods and operators: the operators
  require sets for both operands, but the methods allow any iterables for
  their arguments. So we CANNOT write {1,2,3} | [2,3,4], but we CAN write
  {1,2,3}.union([2,3,4,2]): Python turns the list [2,3,4,2] into a set {2,3,4}
  and then performs the union operation, which constructs the new set {1,2,3,4}.
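
The operations in (8) can be tried on two small sets. This is just an
illustrative sketch with made-up values; each result is compared against the
expected set rather than printed directly, because the printed order of a set
is not fixed.

```python
a = {1, 2, 3, 4}
b = {3, 4, 5}

print(a | b == {1, 2, 3, 4, 5})        # True: union
print(a & b == {3, 4})                 # True: intersection
print(a - b == {1, 2})                 # True: difference
print(a ^ b == {1, 2, 5})              # True: symmetric difference
print(a ^ b == (a | b) - (a & b))      # True: the definition given above
print({3, 4} <= a, {3, 4} < a, a < a)  # True True False
print(a.isdisjoint({7, 8}))            # True: no common values

# Methods accept any iterable; operators require sets on both sides
print(a.union([4, 5, 5]) == {1, 2, 3, 4, 5})   # True
try:
    a | [4, 5, 5]
    operator_failed = False
except TypeError:
    operator_failed = True
print(operator_failed)                 # True
```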
  

Set (mutation) operations

(a) aset.add(value): add value to set: does nothing if value is already in aset
    Adding to a set is fundamental, like appending to a list.
    Remember that we "append" to a list but "add" to a set.

    Suppose aset = {1,2,3}
    After aset.add(2), the set is unchanged
    After aset.add('x'), the set is {1,2,3,'x'} (iterated in any order)

    aset.remove(value) : remove value from aset: if not in aset raise KeyError
    aset.discard(value): remove value from aset: if not in aset do nothing
    aset.pop()         : remove arbitrary value from aset: if empty raise KeyError
    aset.clear()       : remove all values from aset: make it empty
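
A short sketch of these mutators in action follows. Which value pop removes is
not specified, so the code checks only the size afterwards.

```python
aset = {1, 2, 3}
aset.add(2)                  # already present: the set is unchanged
aset.add('x')                # now holds 1, 2, 3, and 'x'
print(len(aset))             # 4

aset.discard(99)             # not present: silently does nothing
try:
    aset.remove(99)          # not present: raises KeyError
    remove_failed = False
except KeyError:
    remove_failed = True
print(remove_failed)         # True

aset.remove(1)               # present: removed
popped = aset.pop()          # removes and returns some remaining value
print(popped in aset)        # False
print(len(aset))             # 2

aset.clear()
print(aset)                  # set()
```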
   
(b) These update operations for sets are similar to update operations for
    numeric values: aset1 |= aset2 is like a += b; the former has the same
    effect on aset1 as aset1 = aset1 | aset2 (but mutates the existing set
    rather than constructing a new one) and the latter translates into
    a = a + b.

    aset1.update(aset2,...asetn)
    aset1 |= aset2 | ... | asetn
    mutates aset1 to include all the values found in aset1 and any other set

    aset1.intersection_update(aset2,...asetn)
    aset1 &= aset2 & ... & asetn
    mutates aset1 to include only the values found in aset1 and every other set

    aset1.difference_update(aset2,...asetn)
    aset1 -= aset2 | ... | asetn
    mutates aset1 to include only the values found in aset1 and in no other set

    aset1.symmetric_difference_update(aset2)
    aset1 ^= aset2
    mutates aset1 to include only the values found in aset1 or aset2 but not both
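
One way to see that the update forms really mutate is to keep a second name
for the same set. The sketch below (with made-up values) contrasts s |= ...,
which changes the existing set, with s = s | ..., which constructs a new one.
It also shows that removing several sets at once with difference_update
matches subtracting their union.

```python
s = {1, 2, 3}
alias = s                    # a second name for the same set object
s |= {4}                     # mutates that object; alias sees the change
print(alias == {1, 2, 3, 4})             # True

s = s | {5}                  # constructs a new set and rebinds s
print(alias == {1, 2, 3, 4})             # True: the old object is untouched
print(s == {1, 2, 3, 4, 5})              # True

t = {1, 2, 3, 4, 5, 6}
t.difference_update({2, 4}, {4, 5})      # method form, several arguments
u = {1, 2, 3, 4, 5, 6}
u -= {2, 4} | {4, 5}                     # operator form: subtract the union
print(t == {1, 3, 6}, u == t)            # True True
```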


Frozensets are very similar to sets, but we cannot use any of the mutation
methods or operators. The constructor is named frozenset: frozenset() constructs
an empty frozenset.

We can use this constructor to convert back and forth easily between sets and
frozensets: frozenset(aset) constructs a frozenset with all the values in aset
and set(afrozenset) constructs a set with all the values in afrozenset.
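
Here is a small sketch of frozensets at work. The dictionary unlocks, mapping a
frozenset of courses to another course, is invented purely for illustration.

```python
fs = frozenset([1, 2, 2, 3])
print(len(fs))                       # 3: duplicates are dropped
print(set(fs) == {1, 2, 3})          # True: converting back to a set

# A frozenset can be a dictionary key; order of its values is irrelevant
unlocks = {frozenset({'ICS-31', 'MATH-2A'}): 'ICS-6B'}   # hypothetical data
print(unlocks[frozenset({'MATH-2A', 'ICS-31'})])         # ICS-6B

# Non-mutating operations still work and produce a new frozenset
bigger = fs | {4}
print(type(bigger).__name__)         # frozenset
print(bigger == {1, 2, 3, 4})        # True

# Mutation methods such as add simply do not exist on a frozenset
print(hasattr(fs, 'add'))            # False
```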

------------------------------------------------------------------------------

Comprehensions

As with lists/tuples, we can build sets/frozensets via comprehensions as

s  = {comprehension}
fs = frozenset({comprehension}) which constructs a frozenset from a set as above

So, to create a set of words (no duplicates), split (by spaces) from a string,
we could write

words = {s for s in 'to be or not to be that is the question'.split(' ')}

here words is now {'to', 'be', 'or', 'not', 'that', 'is', 'the', 'question'}

If we wanted only the words of 3 or fewer characters, we could include the
option and write:

words={s for s in 'to be or not to be that is the question'.split(' ') if len(s)<=3}

here words is now {'to', 'be', 'or', 'not', 'is', 'the'}

Generally, we can translate a set comprehension as follows.

  comprehension = set()
  for i in iterable:
      if bool_expression-i:
          comprehension.add(i)

Notice that we don't need to write

      if bool_expression-i and i not in comprehension:
          comprehension.add(i)

because the add method automatically does the right thing. We shouldn't write
such redundant checks. The add method performs that check itself anyway, so if
we write such a check Python does it twice.
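
To check the translation, the sketch below builds the same set twice, once
with a comprehension and once with the explicit loop, and compares them. The
name loop_words is made up for this example.

```python
text = 'to be or not to be that is the question'

# Comprehension form
words = {s for s in text.split(' ') if len(s) <= 3}

# Equivalent loop form; add ignores values that are already present
loop_words = set()
for s in text.split(' '):
    if len(s) <= 3:
        loop_words.add(s)

print(words == loop_words)   # True
print(sorted(words))         # ['be', 'is', 'not', 'or', 'the', 'to']

# A frozenset can be built the same way from a set comprehension
frozen = frozenset({s for s in text.split(' ') if len(s) <= 3})
print(frozen == words)       # True
```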

------------------------------------------------------------------------------

A Quick use of Sets

Recall that we discussed the following reverse function in the previous lecture.

def reverse(adict):
    answer = {}
    for k,k_vals in adict.items():
        for v in k_vals:
            answer.setdefault(v,[]).append(k)
    return answer

But one problem with it was that the answer dictionary could contain duplicate
values in the list associated with its keys. We solved the problem by writing
code to not append the value to the list if it was already there.

def reverse_distinct(adict):
    answer = {}
    for k,k_vals in adict.items():
        for v in k_vals:
            where = answer.setdefault(v,[])
            if k not in where:
                where.append(k)
    return answer

But really, we should have chosen sets to use as the values in the answer
dictionary. When using sets, there is a much easier solution:

def reverse_distinct(adict):
    answer = {}
    for k,k_vals in adict.items():
        for v in k_vals:
            answer.setdefault(v,set()).add(k)
    return answer

Notice that the only change (from the original reverse function) was to the line

  answer.setdefault(v,set()).add(k)

Here we set the default (if v is not in answer) to be the empty set (which recall
we must write as set(), not {} which is an empty dictionary). Also we must
substitute add (the method for adding a value to a set) for append (the method
for appending a value to a list).

When printed (with print_dict), the answer looks as follows

  AZ -> {'alex'}
  CA -> {'rich', 'alex', 'ellen', 'mark'}
  IL -> {'rich'}
  IN -> {'mark'}
  NY -> {'alex', 'david'}
  OR -> {'ellen', 'patty'}
  PA -> {'david', 'alex', 'rich', 'ellen', 'mark', 'patty'}
  RI -> {'david'}
  WA -> {'david', 'alex', 'rich', 'ellen', 'mark', 'patty'}
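
For a smaller run that is easy to check by hand, here is the set-based
function applied to a tiny dictionary. The data in visited is hypothetical
(it is not the lecture's data) and deliberately repeats 'CA' for one key to
show that no duplicate appears in the result.

```python
def reverse_distinct(adict):
    answer = {}
    for k, k_vals in adict.items():
        for v in k_vals:
            answer.setdefault(v, set()).add(k)   # a new set on first sight of v
    return answer

# Hypothetical sample data: name -> list of states (with a repeated 'CA')
visited = {'alex': ['CA', 'NY', 'CA'], 'rich': ['CA', 'IL']}
result = reverse_distinct(visited)

for state in sorted(result):
    print(state, '->', sorted(result[state]))
# CA -> ['alex', 'rich']
# IL -> ['rich']
# NY -> ['alex']
```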

------------------------------------------------------------------------------

Default Dictionaries: A new kind of dictionary that is often simpler to use

There is a special kind of dictionary, called a defaultdict, that makes the
code above even simpler. It also makes the code for count simpler. Let's take a
quick look at defaultdict and how to simplify the code for these two
dictionaries.

First, we must import it from the collections module: typically by

  from collections import defaultdict

Finally (that was short!) when we define a defaultdict we specify an argument
that is often just the name of the type to construct an object from if we look
up a key that is not in the defaultdict: that is, when we define a defaultdict
we specify what default value to use when a new key is used with a dictionary.

Other than that, we use a defaultdict just like a dict (although it will print
a bit differently). With this new kind of dictionary (and I use it a lot) the
above code simplifies to

def reverse_distinct(adict):
    answer = defaultdict(set)		# key not in answer? use/put a set() in
    for k,k_vals in adict.items():
        for v in k_vals:
            answer[v].add(k)		# add it to current set, or a new one
    return answer

Here, each time we look up value v in the answer defaultdict (answer[v]), if it
is not there it associates this value with an empty set (set()) and then adds
k to that empty set.

Likewise, we can simplify the count function to

def count(alist):
    answer = defaultdict(int)		# key not in answer? use/put a int()/0
    for v in alist:
        answer[v] += 1			# increment current value, or 0
    return answer

Note that int() returns a reference to the 0 int object (how convenient).
Here, each time we look up value v in the answer defaultdict (answer[v]), if it
is not there it associates this value with the value 0 (int()) and then
increments that value to 1.
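
The sketch below runs this count on a made-up list. It also shows one thing to
watch for: merely looking up a missing key in a defaultdict stores that key
with its default value.

```python
from collections import defaultdict

def count(alist):
    answer = defaultdict(int)    # a missing key starts at int(), which is 0
    for v in alist:
        answer[v] += 1
    return answer

counts = count(['a', 'b', 'a'])
print(counts['a'], counts['b'])            # 2 1
print(dict(counts) == {'a': 2, 'b': 1})    # True

# Looking up a missing key inserts it with the default value
print('z' in counts)                       # False
print(counts['z'])                         # 0
print('z' in counts)                       # True
```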

Often when we build dictionaries, it is easier to use a defaultdict; but it is
not much harder to specify a dict and use a setdefault method for it, or even
use an if instead. Recall our original definition of count was

def count(alist):
    answer = {}
    for v in alist:
        if v in answer:                 # Check if key v is in dictionary
            answer[v] += 1              #   Yes, increment its associated value
        else:                           #   No,
            answer[v] = 1               #     set its value to 1
    return answer

or

def count(alist):
    answer = {}
    for v in alist:
        if v not in answer:             # Check if key v is NOT in dictionary
            answer[v] = 0               #   Not present: set its value to 0
        answer[v] += 1                  # Increment its associated value, which
                                        #   might be the 0 just put there
    return answer