enumerate() returns an iterator that yields a running index along with each item in the container. The optional start argument sets the number the index starts at (default 0).

list(enumerate([1,2,3]))
## [(0, 1), (1, 2), (2, 3)]
list(enumerate(range(3), start=10))
## [(10, 0), (11, 1), (12, 2)]

Fizz Buzz

Write a program that prints the numbers from 1 to 100. For multiples of 3, print “Fizz” instead of the number; for multiples of 5, print “Buzz”; and for multiples of both 3 and 5, print “Fizz Buzz”.

for i in range(1,101):
  if i % 3 == 0:
    if i % 5 == 0:
      print('Fizz Buzz')
      continue
    print('Fizz')
    continue
  if i % 5 == 0:
    if i % 3 == 0:
      print('Fizz Buzz')
      continue
    print('Buzz')
    continue
  else:
    print(i)
## 1
## 2
## Fizz
## 4
## Buzz
## Fizz
## 7
## 8
## Fizz
## Buzz
## 11
## Fizz
## 13
## 14
## Fizz Buzz
## 16
## 17
## Fizz
## 19
## Buzz
## Fizz
## 22
## 23
## Fizz
## Buzz
## 26
## Fizz
## 28
## 29
## Fizz Buzz
## 31
## 32
## Fizz
## 34
## Buzz
## Fizz
## 37
## 38
## Fizz
## Buzz
## 41
## Fizz
## 43
## 44
## Fizz Buzz
## 46
## 47
## Fizz
## 49
## Buzz
## Fizz
## 52
## 53
## Fizz
## Buzz
## 56
## Fizz
## 58
## 59
## Fizz Buzz
## 61
## 62
## Fizz
## 64
## Buzz
## Fizz
## 67
## 68
## Fizz
## Buzz
## 71
## Fizz
## 73
## 74
## Fizz Buzz
## 76
## 77
## Fizz
## 79
## Buzz
## Fizz
## 82
## 83
## Fizz
## Buzz
## 86
## Fizz
## 88
## 89
## Fizz Buzz
## 91
## 92
## Fizz
## 94
## Buzz
## Fizz
## 97
## 98
## Fizz
## Buzz

Write a function with the same requirements as above, but have it mutate a list in place instead of printing. Use if and elif so that each element is checked in order and replaced at most once.

def fizz_buzz(numbers):
  '''
  Given a list of integers:
  1. replace all integers that are evenly divisible by 3 with 'fizz'
  2. replace all integers divisible by 5 with 'buzz'
  3. replace all integers divisible by both 3 and 5 with 'fizzbuzz'
  
  >>> numbers = [45, 22, 14, 65, 97, 72]
  >>> fizz_buzz(numbers)
  >>> numbers
  ['fizzbuzz', 22, 14, 'buzz', 97, 'fizz']
  '''
  for i in range(len(numbers)):
    num = numbers[i]
    if num % 3 == 0 and num % 5 == 0:
      numbers[i] = 'fizzbuzz'
    elif num % 3 == 0:
      numbers[i] = 'fizz'
    elif num % 5 == 0:
      numbers[i] = 'buzz'
numbers = [1,2,3,4,5,6,7,8,9,10,15,45, 22, 14, 65, 97, 72]
fizz_buzz(numbers)
numbers
## [1, 2, 'fizz', 4, 'buzz', 'fizz', 7, 8, 'fizz', 'buzz', 'fizzbuzz', 'fizzbuzz', 22, 14, 'buzz', 97, 'fizz']

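You can also use enumerate() instead of range(len()). The difference is that enumerate() takes the iterable itself and yields (index, item) pairs, while range() takes a number (here, the length of the list) and yields only the indices.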

def fizz_buzz(numbers):
  '''
  Given a list of integers:
  1. replace all integers that are evenly divisible by 3 with 'fizz'
  2. replace all integers divisible by 5 with 'buzz'
  3. replace all integers divisible by both 3 and 5 with 'fizzbuzz'
  
  >>> numbers = [45, 22, 14, 65, 97, 72]
  >>> fizz_buzz(numbers)
  >>> numbers
  ['fizzbuzz', 22, 14, 'buzz', 97, 'fizz']
  '''
  for i, num in enumerate(numbers):
    if num % 3 == 0 and num % 5 == 0:
      numbers[i] = 'fizzbuzz'
    elif num % 3 == 0:
      numbers[i] = 'fizz'
    elif num % 5 == 0:
      numbers[i] = 'buzz'
numbers_1 = list(range(1,101))
fizz_buzz(numbers_1)
numbers_1
## [1, 2, 'fizz', 4, 'buzz', 'fizz', 7, 8, 'fizz', 'buzz', 11, 'fizz', 13, 14, 'fizzbuzz', 16, 17, 'fizz', 19, 'buzz', 'fizz', 22, 23, 'fizz', 'buzz', 26, 'fizz', 28, 29, 'fizzbuzz', 31, 32, 'fizz', 34, 'buzz', 'fizz', 37, 38, 'fizz', 'buzz', 41, 'fizz', 43, 44, 'fizzbuzz', 46, 47, 'fizz', 49, 'buzz', 'fizz', 52, 53, 'fizz', 'buzz', 56, 'fizz', 58, 59, 'fizzbuzz', 61, 62, 'fizz', 64, 'buzz', 'fizz', 67, 68, 'fizz', 'buzz', 71, 'fizz', 73, 74, 'fizzbuzz', 76, 77, 'fizz', 79, 'buzz', 'fizz', 82, 83, 'fizz', 'buzz', 86, 'fizz', 88, 89, 'fizzbuzz', 91, 92, 'fizz', 94, 'buzz', 'fizz', 97, 98, 'fizz', 'buzz']

Doctest

Doctest is a module that scans the Docstrings of functions looking for lines that look like input and output and uses them to test your function. The Doctest module can be either added into the script or run in concert with the script using command line arguments.

Adding to the script

if __name__ == "__main__":
    import doctest
    doctest.testmod()

Running from the command line

python -m doctest -v temp.py

The -v switch will print the tests and results that Doctest is running. Without the switch, it will run but not print anything to the console unless there is an error found.

Examples

def factorial(n):
    """Return the factorial of n, an exact integer >= 0.

    >>> [factorial(n) for n in range(6)]
    [1, 1, 2, 6, 24, 120]
    >>> factorial(30)
    265252859812191058636308480000000
    >>> factorial(-1)
    Traceback (most recent call last):
        ...
    ValueError: n must be >= 0

    Factorials of floats are OK, but the float must be an exact integer:
    >>> factorial(30.1)
    Traceback (most recent call last):
        ...
    ValueError: n must be exact integer
    >>> factorial(30.0)
    265252859812191058636308480000000

    It must also not be ridiculously large:
    >>> factorial(1e100)
    Traceback (most recent call last):
        ...
    OverflowError: n too large
    """

    import math
    if not n >= 0:
        raise ValueError("n must be >= 0")
    if math.floor(n) != n:
        raise ValueError("n must be exact integer")
    if n+1 == n:  # catch a value like 1e300
        raise OverflowError("n too large")
    result = 1
    factor = 2
    while factor <= n:
        result *= factor
        factor += 1
    return result


if __name__ == "__main__":
    import doctest
    doctest.testmod()

Differences between two lists

When comparing two lists, you can convert them into sets to eliminate any duplicate numbers. Lists do not support the ‘-’ operator, but sets do. Lists do support ‘+’, which simply concatenates the two lists.

list1 = [1,2,3,4]
list2 = [3,3,4,5,6,7,7]
set(list1) - set(list2)
## {1, 2}
list1 = [1,2,3,4]
list2 = [3,3,4,5,6,7,7]
set(list2) - set(list1)
## {5, 6, 7}
def diff(list1, list2):
  return list(set(list1) - set(list2)) + list(set(list2) - set(list1))
list1 = [1,2,3,4]
list2 = [3,3,4,5,6,7,7]
diff(list1, list2)
## [1, 2, 5, 6, 7]

Without using set() we can use list comprehension


def diff2(list1, list2):
    list_dif = [i for i in list1 + list2 if i not in list1 or i not in list2]
    return list_dif
list1 = [1,2,3,4]
list2 = [3,3,4,5,6,7,7]
list3 = diff2(list1, list2)
print(list3)
## [1, 2, 5, 6, 7, 7]

The difference between diff() and diff2() is that if you don’t convert the lists to sets then you keep both 7's in list2.
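
As a possible shortcut, the two set subtractions in diff() can be combined with the built-in symmetric difference operator ^ (or the .symmetric_difference() method). The sketch below reuses list1 and list2 from above; the names sym_diff and result are only illustrative, and sorted() is used because sets have no guaranteed order.

```python
list1 = [1, 2, 3, 4]
list2 = [3, 3, 4, 5, 6, 7, 7]

# ^ on two sets keeps the elements that appear in exactly one of them.
sym_diff = set(list1) ^ set(list2)

# Sets are unordered, so sort the result to get a stable list.
result = sorted(sym_diff)
print(result)
```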

Underscore ’_’

The underscore can be used to ignore a single value


a, _, b = (1, 2, 3) # a = 1, b = 3
print(a, b)
## 1 3

It can also be used to ignore multiple values. Prefixing a name with * while unpacking collects all the remaining values into a list assigned to that name. This is called “Extended Unpacking” and is only available in Python 3.x.

a, *_, b = (7, 6, 5, 4, 3, 2, 1)
print(a, b)
## 7 1
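
To see what the starred name actually collects, here is a small sketch that uses a real name instead of the underscore. The names first, middle, and last are chosen only for illustration.

```python
# The starred name receives every value not claimed by the other names.
first, *middle, last = (7, 6, 5, 4, 3, 2, 1)

print(first, last)
print(middle)  # a list, even though the source was a tuple
```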

Separating Digits Of Numbers

If you have a number with many digits, you can separate groups of digits with underscores to make it easier to read (Python 3.6 and later).

million = 1_000_000
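
A quick check that the underscores are purely visual, plus a sketch of how the same grouping can be produced when formatting a number back into text. The variable names with_underscores and with_commas are illustrative.

```python
million = 1_000_000

# The underscores do not change the value.
same_value = million == 1000000

# The "_" and "," format specifiers add grouping when printing.
with_underscores = f"{million:_}"
with_commas = f"{million:,}"
print(same_value, with_underscores, with_commas)
```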

List comprehension

lst = [1,2,-5,4]

def square(x):
  return x*x

Instead of doing the following:

list(map(square, lst))
## [1, 4, 25, 16]

You can use list comprehension:

[square(num) for num in lst]
## [1, 4, 25, 16]
def is_odd(x):
  return x % 2 == 1

We can use the filter() function to filter out elements of our list according to if they meet certain criteria.

list(filter(is_odd, lst))
## [1, -5]
[x for x in lst if is_odd(x)]
## [1, -5]

Let’s create a 2x3 matrix filled with zeros and define it using lists.

grid = [[0,0,0],
        [0,0,0]]

Using for loop

num_rows = 2
num_columns = 3
grid = []

for _ in range(num_rows):
  curr_row = []
  for _ in range(num_columns):
    curr_row.append(0)
  grid.append(curr_row)

grid
## [[0, 0, 0], [0, 0, 0]]

Using list comprehension

num_rows = 2
num_columns = 3
grid = []
grid = [[0 for _ in range(num_columns)] for _ in range(num_rows)]
grid
## [[0, 0, 0], [0, 0, 0]]

Compare the above to this where the brackets are in different places.

num_rows = 2
num_columns = 3
grid = []

grid = [[0 for _ in range(num_columns) for _ in range(num_rows)]]
grid
## [[0, 0, 0, 0, 0, 0]]
[] = list
() = tuple
{} = dictionary/set

The function max() takes in some numbers and returns the max number

L = [1,2,3, -4]
t = (1,2,3, -4)
s = {1,2,3, -4}
max(L)
## 3
max(t)
## 3
max(s)
## 3
max(1,2,3)
## 3
max(L, key=lambda x: x*x)
## -4

min() is the same as max but returns the min
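
For comparison with the max() calls above, a short sketch of min() on the same list, with and without a key function. The names smallest and smallest_square are illustrative.

```python
L = [1, 2, 3, -4]

# Without a key, min() compares the values themselves.
smallest = min(L)

# With a key, min() returns the item whose key (here, its square) is smallest.
smallest_square = min(L, key=lambda x: x * x)
print(smallest, smallest_square)
```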

any() takes in an iterable and returns True if any of the values in the iterable are True and returns False if none of the values are True

any(L)
## True
any([False, False])
## False

We can’t use any() with a key function to check whether any item in our list is odd, because any() does not accept a key argument. The following call raises a TypeError:

any(L, key=lambda x: x % 2 == 1)

Instead, we can use a list comprehension. You can pass in an argument (num) to a lambda by adding it in parentheses after the lambda function.

[(lambda x:x % 2 == 1)(num) for num in L]
## [True, False, True, False]
any([(lambda x:x % 2 == 1)(num) for num in L])
## True

all() is the same as any() but only returns True if all of the items are True

all([(lambda x:x % 2 == 1)(num) for num in L])
## False
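
As an alternative sketch, any() and all() also accept a generator expression directly, which avoids both the intermediate list and the lambda. The names has_odd and all_odd are illustrative.

```python
L = [1, 2, 3, -4]

# The test is written inline; any() stops at the first True it sees.
has_odd = any(x % 2 == 1 for x in L)

# all() stops at the first False it sees.
all_odd = all(x % 2 == 1 for x in L)
print(has_odd, all_odd)
```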

F-strings when creating a new class

class A(object):
  def __init__(self,name,age):
    self.name = name
    self.age = age
  def __repr__(self):
    return f"""
      My name is {self.name}.
      I am {self.age + 5} years old
      """
name = 'Bob'
age = 15
print(A(name,age))
## 
##       My name is Bob.
##       I am 20 years old
## 
A('Nathan', 31)
## 
##       My name is Nathan.
##       I am 36 years old
## 

Sorting

Can sort a list alphabetically.

animals = ["cat", "dog", "cheetah", "rhino"]

sorted(animals)
## ['cat', 'cheetah', 'dog', 'rhino']
sorted(animals, reverse=True)
## ['rhino', 'dog', 'cheetah', 'cat']

Here, we have a list of animals, each defined by a dictionary.

animals = [
  {'type': 'cat', 'name': 'Stephanie', 'age': 8},
  {'type': 'dog', 'name': 'Devon', 'age': 3},
  {'type': 'rhino', 'name': 'Moe', 'age': 5},
]

Dictionaries can’t be compared with each other, so sorted() can’t order this list directly, but you can pass a lambda as the key to sort by.

sorted(animals, key = lambda animal: animal['age'])
## [{'type': 'dog', 'name': 'Devon', 'age': 3}, {'type': 'rhino', 'name': 'Moe', 'age': 5}, {'type': 'cat', 'name': 'Stephanie', 'age': 8}]

If you want the oldest animal, pass in reverse=True and index the first item of the sorted list.


sorted(animals, key = lambda animal: animal['age'], reverse=True)[0]
## {'type': 'cat', 'name': 'Stephanie', 'age': 8}

You can do the same thing with the .sort() method, which mutates the list in place instead of returning a new one.

animals.sort(key = lambda animal: animal['age'], reverse=True)
animals
## [{'type': 'cat', 'name': 'Stephanie', 'age': 8}, {'type': 'rhino', 'name': 'Moe', 'age': 5}, {'type': 'dog', 'name': 'Devon', 'age': 3}]
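
If only the oldest animal is needed, sorting the whole list is more work than necessary. One possible alternative, sketched here with the same animals data, is to pass the same key to max(). The name oldest is illustrative.

```python
animals = [
    {'type': 'cat', 'name': 'Stephanie', 'age': 8},
    {'type': 'dog', 'name': 'Devon', 'age': 3},
    {'type': 'rhino', 'name': 'Moe', 'age': 5},
]

# max() scans the list once and returns the item with the largest key.
oldest = max(animals, key=lambda animal: animal['age'])
print(oldest)
```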

Set()

  • Sets are unordered
  • Set elements are unique (no duplicates)
  • A set may be modified but the elements contained in the set must be of an immutable type
s = 'quux'
list(s)
## ['q', 'u', 'u', 'x']
set(s)
## {'q', 'u', 'x'}
x = {'foo', 'bar', 'baz', 'foo', 'qux'}
x
## {'foo', 'bar', 'qux', 'baz'}
set('foo')
## {'o', 'f'}
x = {}
type(x)
## <class 'dict'>
x = set()
type(x)
## <class 'set'>

Don’t forget that set elements must be immutable. For example, a tuple may be included in a set:

x = {42, 'foo', (1, 2, 3), 3.14159}
x
## {'foo', 42, (1, 2, 3), 3.14159}

But lists and dictionaries are mutable, so they can’t be set elements:

a = [1, 2, 3]
{a}

d = {'a': 1, 'b': 2}
{d}
x = {'foo', 'bar', 'baz'}

len(x)
## 3
'bar' in x
## True
'qux' in x
## False

In Python, set union can be performed with the | operator:

x1 = {'foo', 'bar', 'baz'}
x2 = {'baz', 'qux', 'quux'}
x1 | x2
## {'foo', 'baz', 'bar', 'quux', 'qux'}

Set union can also be obtained with the .union() method. The method is invoked on one of the sets, and the other is passed as an argument

x1.union(x2)
## {'foo', 'baz', 'bar', 'quux', 'qux'}

When you use the | operator, both operands must be sets. The .union() method, on the other hand, will take any iterable as an argument, convert it to a set, and then perform the union.

x1.union(('baz', 'qux', 'quux'))
## {'baz', 'bar', 'qux', 'foo', 'quux'}

More than two sets may be specified with either the operator or the method:

a = {1, 2, 3, 4}
b = {2, 3, 4, 5}
c = {3, 4, 5, 6}
d = {4, 5, 6, 7}

a.union(b, c, d)
## {1, 2, 3, 4, 5, 6, 7}
a | b | c | d
## {1, 2, 3, 4, 5, 6, 7}

Set intersection can be performed with the & operator or the .intersection() method. The resulting set contains only elements that are present in all of the specified sets.

a = {1, 2, 3, 4}
b = {2, 3, 4, 5}
c = {3, 4, 5, 6}
d = {4, 5, 6, 7}

a.intersection(b, c, d)
## {4}
a & b & c & d
## {4}

x1.difference(x2) and x1 - x2 return the set of all elements that are in x1 but not in x2:


x1 = {'foo', 'bar', 'baz'}
x2 = {'baz', 'qux', 'quux'}

x1.difference(x2)
## {'foo', 'bar'}
x1 - x2
## {'foo', 'bar'}

Frozen Sets

Python also provides frozenset, which is in all respects exactly like a set, except that a frozenset is immutable. You can perform non-modifying operations on a frozenset, but methods that attempt to modify a frozenset fail.
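
A minimal sketch of that behaviour, using illustrative names (fs, common, error_message, lookup). It also shows one practical consequence of immutability: a frozenset is hashable, so it can be used as a dictionary key.

```python
fs = frozenset(['foo', 'bar', 'baz'])

# Non-modifying operations work as they do on a regular set.
common = fs & {'baz', 'qux'}

# Modifying methods such as .add() do not exist on a frozenset.
try:
    fs.add('qux')
except AttributeError as exc:
    error_message = str(exc)

# Being immutable, a frozenset is hashable and can be a dict key.
lookup = {fs: 'hashable'}
print(common, error_message)
```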

How to Use Generators and yield in Python

Have you ever had to work with a dataset so large that it overwhelmed your machine’s memory? Or maybe you have a complex function that needs to maintain an internal state every time it’s called, but the function is too small to justify creating its own class. In these cases and more, generators and the Python yield statement are here to help.

Using Generators

Introduced with PEP 255, generator functions are a special kind of function that return a lazy iterator. These are objects that you can loop over like a list. However, unlike lists, lazy iterators do not store their contents in memory.

This is a reasonable explanation, but would this design still work if the file is very large? What if the file is larger than the memory you have available? To answer this question, let’s assume that csv_reader() just opens the file and reads it into an array:

def csv_reader(file_name):
    file = open(file_name)
    result = file.read().split("\n")
    return result

This function opens a given file and uses file.read() along with .split() to add each line as a separate element to a list. If you were to use this version of csv_reader() in the row counting code block you saw further up, then you’d get the following output: MemoryError

In this case, open() returns a file object that you can lazily iterate through line by line. However, file.read().split() loads everything into memory at once, causing the MemoryError.

However you can turn csv_reader() into a generator function:

def csv_reader(file_name):
    for row in open(file_name, "r"):
        yield row

This version opens a file, loops through each line, and yields each row instead of returning it.

You can also define a generator expression (also called a generator comprehension), which has a very similar syntax to list comprehensions. In this way, you can use the generator without calling a function:

csv_gen = (row for row in open(file_name))
  • Using yield will result in a generator object.
  • Using return will result in the first line of the file only.

Generating an Infinite Sequence

In Python, to get a finite sequence, you call range() and evaluate it in a list context:

a = range(5)
list(a)
## [0, 1, 2, 3, 4]

Generating an infinite sequence, however, will require the use of a generator, since your computer memory is finite:

def infinite_sequence():
    num = 0
    while True:
        yield num
        num += 1
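
Looping over this generator directly never ends, so to inspect it safely you can take a bounded slice. The sketch below uses itertools.islice for that; the name first_five is illustrative.

```python
from itertools import islice

def infinite_sequence():
    num = 0
    while True:
        yield num
        num += 1

# islice() pulls only the first five values, then stops asking for more.
first_five = list(islice(infinite_sequence(), 5))
print(first_five)
```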

Detecting Palindromes

def is_palindrome(num):
    # Skip single-digit inputs
    # Floor division (//) returns the whole-number quotient, not the remainder like %.
    if num // 10 == 0:
        return False
    temp = num
    reversed_num = 0

    while temp != 0:
        reversed_num = (reversed_num * 10) + (temp % 10)
        temp = temp // 10

    if num == reversed_num:
        return num
    else:
        return False

Don’t worry too much about understanding the underlying math in this code. Just note that the function takes an input number, reverses it, and checks to see if the reversed number is the same as the original. Now you can use your infinite sequence generator to get a running list of all numeric palindromes:

for i in infinite_sequence():
    pal = is_palindrome(i)
    if pal:
        print(pal)

Understanding Generators

yield indicates where a value is sent back to the caller, but unlike return, you don’t exit the function afterward.

Instead, the state of the function is remembered. That way, when next() is called on a generator object (either explicitly or implicitly within a for loop), the previously yielded variable num is incremented, and then yielded again.

You can create Generator Expressions without building and holding the entire object in memory before iteration. In other words, you’ll have no memory penalty when you use generator expressions. Take this example of squaring some numbers.

nums_squared_lc = [num**2 for num in range(5)]
nums_squared_lc
## [0, 1, 4, 9, 16]
nums_squared_gc = (num**2 for num in range(5))
nums_squared_gc
## <generator object <genexpr> at 0x0000000060651AC0>

Profiling Generator Performance

You learned earlier that generators are a great way to optimize memory. While an infinite sequence generator is an extreme example of this optimization, let’s amp up the number squaring examples you just saw and inspect the size of the resulting objects. You can do this with a call to sys.getsizeof():

import sys
nums_squared_lc = [i ** 2 for i in range(10000)]
sys.getsizeof(nums_squared_lc)
## 85176
nums_squared_gc = (i ** 2 for i in range(10000))
print(sys.getsizeof(nums_squared_gc))
## 112

There is one thing to keep in mind, though. If the list is smaller than the running machine’s available memory, then list comprehensions can be faster to evaluate than the equivalent generator expression. To explore this, let’s sum across the results from the two comprehensions above. You can generate a readout with cProfile.run():

import cProfile

cProfile.run('sum([i * 2 for i in range(10000)])')
##          5 function calls in 0.001 seconds
## 
##    Ordered by: standard name
## 
##    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
##         1    0.001    0.001    0.001    0.001 <string>:1(<listcomp>)
##         1    0.000    0.000    0.001    0.001 <string>:1(<module>)
##         1    0.000    0.000    0.001    0.001 {built-in method builtins.exec}
##         1    0.000    0.000    0.000    0.000 {built-in method builtins.sum}
##         1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
cProfile.run('sum((i * 2 for i in range(10000)))')
##          10005 function calls in 0.002 seconds
## 
##    Ordered by: standard name
## 
##    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
##     10001    0.001    0.000    0.001    0.000 <string>:1(<genexpr>)
##         1    0.000    0.000    0.002    0.002 <string>:1(<module>)
##         1    0.000    0.000    0.002    0.002 {built-in method builtins.exec}
##         1    0.001    0.001    0.002    0.002 {built-in method builtins.sum}
##         1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

In the readout above, summing across the list comprehension took about half the time of summing across the generator expression (0.001 versus 0.002 seconds), although timings this small are noisy and will vary between runs. If speed is an issue and memory isn’t, then a list comprehension is likely a better tool for the job.

Understanding the Python Yield Statement

When you call a generator function or use a generator expression, you return a special iterator called a generator. You can assign this generator to a variable in order to use it. When you call special methods on the generator, such as next(), the code within the function is executed up to yield.

When the Python yield statement is hit, the program suspends function execution and returns the yielded value to the caller. (In contrast, return stops function execution completely.) When a function is suspended, the state of that function is saved. This includes any variable bindings local to the generator, the instruction pointer, the internal stack, and any exception handling.

This allows you to resume function execution whenever you call one of the generator’s methods. In this way, all function evaluation picks back up right after yield. You can see this in action by using multiple Python yield statements:

def multi_yield():
    yield_str = "This will print the first string"
    yield yield_str
    yield_str = "This will print the second string"
    yield yield_str

multi_obj = multi_yield()
print(next(multi_obj))

print(next(multi_obj))

print(next(multi_obj))

The following is the output, which ends with a traceback.

This will print the first string
This will print the second string
StopIteration

Take a closer look at that last call to next(). You can see that execution has blown up with a traceback. This is because generators, like all iterators, can be exhausted. Unless your generator is infinite, you can iterate through it one time only. Once all values have been evaluated, iteration will stop and the for loop will exit. If you used next(), then instead you’ll get an explicit StopIteration exception.
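
If you would rather not handle StopIteration yourself, one option is to give next() a default value, which it returns once the generator is exhausted. This sketch uses a shortened version of multi_yield(); the names gen and values are illustrative.

```python
def multi_yield():
    yield "first"
    yield "second"

gen = multi_yield()

# The third call finds the generator exhausted and returns the default (None).
values = [next(gen, None) for _ in range(3)]
print(values)
```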

Using Advanced Generator Methods

  • .send()
  • .throw()
  • .close()
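
The notes above only name these three methods, so here is a minimal sketch of what each one does. The generator running_total() is hypothetical and exists only for illustration: .send() passes a value in at the paused yield, .throw() raises an exception at that point, and .close() shuts the generator down.

```python
def running_total():
    """Hypothetical generator that keeps a running total of sent values."""
    total = 0
    while True:
        try:
            value = yield total
        except ValueError:
            # .throw(ValueError(...)) lands here; treat it as a reset signal.
            total = 0
        else:
            if value is not None:
                total += value

gen = running_total()
start = next(gen)                            # run to the first yield
after_first_send = gen.send(10)              # value becomes 10
after_second_send = gen.send(5)              # value becomes 5
after_throw = gen.throw(ValueError("reset")) # handled inside the generator
gen.close()                                  # the generator is now finished

print(start, after_first_send, after_second_send, after_throw)
```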

Creating Data Pipelines With Generators

Data pipelines allow you to string together code to process large datasets or streams of data without maxing out your machine’s memory. Imagine that you have a large CSV file:

Let’s think of a strategy:

  1. Read every line of the file.
  2. Split each line into a list of values.
  3. Extract the column names.
  4. Use the column names and lists to create a dictionary.
  5. Filter out the rounds you aren’t interested in.
  6. Calculate the total and average values for the rounds you are interested in.

Normally, you can do this with a package like pandas, but you can also achieve this functionality with just a few generators. You’ll start by reading each line from the file with a generator expression:

file_name = "data/TechCrunchcontinentalUSA.csv"
lines = (line for line in open(file_name))
# Then, you’ll use another generator expression in concert with the previous one to split each line into a list:
list_line = (s.rstrip().split(",") for s in lines)

Here, you created the generator list_line, which iterates through the first generator lines. This is a common pattern to use when designing generator pipelines. Next, you’ll pull the column names out of techcrunch.csv. Since the column names tend to make up the first line in a CSV file, you can grab that with a short next() call:

cols = next(list_line)

To sum this up, you first create a generator expression lines to yield each line in a file. Next, you iterate through that generator within the definition of another generator expression called list_line, which turns each line into a list of values. Then, you advance the iteration of list_line just once with next() to get a list of the column names from your CSV file.

To help you filter and perform operations on the data, you’ll create dictionaries where the keys are the column names from the CSV:

company_dicts = (dict(zip(cols, data)) for data in list_line)

This generator expression iterates through the lists produced by list_line. Then, it uses zip() and dict() to create the dictionary as specified above. Now, you’ll use a fourth generator to filter the funding round you want and pull raisedAmt as well:

funding = (
    int(company_dict["raisedAmt"])
    for company_dict in company_dicts
    if company_dict["round"] == "a"
)

In this code snippet, your generator expression iterates through the results of company_dicts and takes the raisedAmt for any company_dict where the round key is “a”.

Remember, you aren’t iterating through all these at once in the generator expression. In fact, you aren’t iterating through anything until you actually use a for loop or a function that works on iterables, like sum(). In fact, call sum() now to iterate through the generators:

total_series_a = sum(funding)
print(f"Total series A fundraising: ${total_series_a}")
## Total series A fundraising: $4376015000

Putting this all together, you’ll produce the following script:

file_name = "techcrunch.csv"
lines = (line for line in open(file_name))
list_line = (s.rstrip().split(",") for s in lines)
cols = next(list_line)
company_dicts = (dict(zip(cols, data)) for data in list_line)
funding = (
    int(company_dict["raisedAmt"])
    for company_dict in company_dicts
    if company_dict["round"] == "a"
)
total_series_a = sum(funding)
print(f"Total series A fundraising: ${total_series_a}")

This script pulls together every generator you’ve built, and they all function as one big data pipeline. Here’s a line by line breakdown:

  • Line 2 reads in each line of the file.
  • Line 3 splits each line into values and puts the values into a list.
  • Line 4 uses next() to store the column names in a list.
  • Line 5 creates dictionaries and unites them with a zip() call:
  • The keys are the column names cols from line 4.
  • The values are the rows in list form, created in line 3.
  • Line 6 gets each company’s series A funding amounts. It also filters out any other raised amount.
  • Line 11 begins the iteration process by calling sum() to get the total amount of series A funding found in the CSV.

When you run this code on techcrunch.csv, you should find a total of $4,376,015,000 raised in series A funding rounds.

Note: The methods for handling CSV files developed in this tutorial are important for understanding how to use generators and the Python yield statement. However, when you work with CSV files in Python, you should instead use the csv module included in Python’s standard library. This module has optimized methods for handling CSV files efficiently.
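
As a rough sketch of what that could look like, csv.DictReader replaces the manual splitting, header extraction, and dict(zip()) steps, and it still reads rows lazily. The in-memory sample data below (including the company column and all values) is made up so the example runs without the real file; only the round and raisedAmt column names come from the pipeline above.

```python
import csv
import io

# Hypothetical sample standing in for the real CSV file.
sample = io.StringIO(
    "company,round,raisedAmt\n"
    "Acme,a,1000\n"
    "Globex,b,2000\n"
    "Initech,a,500\n"
)

# DictReader yields one dictionary per row, keyed by the header line.
reader = csv.DictReader(sample)

total_series_a = sum(
    int(row["raisedAmt"]) for row in reader if row["round"] == "a"
)
print(f"Total series A fundraising: ${total_series_a}")
```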

Using zip() in Python

Python’s zip() function is defined as zip(*iterables). The function takes in iterables as arguments and returns an iterator. This iterator generates a series of tuples containing elements from each iterable. zip() can accept any type of iterable, such as files, lists, tuples, dictionaries, sets, and so on.

Passing n Arguments

If you use zip() with n arguments, then the function will return an iterator that generates tuples of length n. To see this in action, take a look at the following code block:

numbers = [1, 2, 3]
letters = ['a', 'b', 'c']
zipped = zip(numbers, letters)
zipped  # Holds an iterator object
## <zip object at 0x0000000060792E00>
type(zipped)
## <class 'zip'>
list(zipped)
## [(1, 'a'), (2, 'b'), (3, 'c')]

Here, you use zip(numbers, letters) to create an iterator that produces tuples of the form (x, y). In this case, the x values are taken from numbers and the y values are taken from letters. Notice how the Python zip() function returns an iterator. To retrieve the final list object, you need to use list() to consume the iterator.

If you’re working with sequences like lists, tuples, or strings, then your iterables are guaranteed to be evaluated from left to right. This means that the resulting list of tuples will take the form [(numbers[0], letters[0]), (numbers[1], letters[1]),…, (numbers[n], letters[n])]. However, for other types of iterables (like sets), you might see some weird results:

s1 = {2, 3, 1}
s2 = {'b', 'a', 'c'}
list(zip(s1, s2))
## [(1, 'b'), (2, 'c'), (3, 'a')]

When to use median

But then, what are the advantages of the median? To illustrate this, we return to the five systolic blood pressure values used before:

142, 124, 121, 151, 132.

We assume that 151 is the correct value, but that a device failure leads to a false measurement of 171. Let’s see what happens to the mean and the median.

The mean of the resulting five values now is 138 instead of 134, as calculated from the original data, thus showing a considerable effect of the incorrect measurement.

To derive the median, we sort the data again by size:

121, 124, 132, 142, 171.

As before, the value 132 is in the center of the data row, so the median actually is unaltered by the false measurement.

That is why the median is called “robust against outliers“, whereas the mean actually is “sensitive to outliers“.
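
The same comparison can be reproduced with Python’s statistics module. This is a small sketch using the blood pressure values from above; the variable names are illustrative.

```python
from statistics import mean, median

original = [142, 124, 121, 151, 132]
faulty = [142, 124, 121, 171, 132]   # 151 replaced by the false reading 171

# The mean shifts with the outlier; the median does not.
mean_original, mean_faulty = mean(original), mean(faulty)
median_original, median_faulty = median(original), median(faulty)

print(mean_original, mean_faulty)
print(median_original, median_faulty)
```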

