import numpy as np
import math
import matplotlib.pyplot as plt
This discussion is going to involve a bit more math than the previous. If I refer to a topic you don't know, please feel free to ask me to explain it. It will be hard to understand how Numpy can be useful if you aren't familiar with the concepts I'm pulling from math.
Numpy is often used for linear algebra. That is, matrix and vector operations like matrix multiplication and the dot product. Situations where we need to run these computations occur commonly in a variety of settings, including data science. Numpy gives you a way to perform these operations quickly and without having to program them yourself.
For example, consider matrix multiplication. Remember that a matrix is a two-dimensional array of numbers with its size denoted by (# of rows) x (# of columns). If we have two matrices, one with size L x N and the other with size N x M we can multiply them. An example is below.
#[1 2] [2] [1x2 + 2x1] [4]
#[2 1] X [1] = [2x2 + 1x1] = [5]
We could write something like this in Python ourselves, storing a matrix as a list of lists (each inner list will represent one row).
def mat_multiply(M1, M2):
result = []
for i in range(len(M1)):
row = []
for j in range(len(M2[0])):
component = 0
for k in range(len(M2)):
component += M1[i][k]*M2[k][j]
row.append(component)
result.append(row)
return result
M1 = [[1, 2], [2, 1]]
M2 = [[2], [1]]
print(mat_multiply(M1,M2))
Alternatively, we can use Numpy. The command np.matmul
is used to multiply two matrices. Here, the matrices must be stored as Numpy arrays, which we can easily create from the above.
numpy_M1 = np.array(M1)
numpy_M2 = np.array(M2)
print(np.matmul(numpy_M1, numpy_M2))
One immediate benefit is that the output above is formatted more like a matrix, and not just a list of lists. Another benefit, as mentioned before, is speed. Here's what happens when we multiply two random matrices:
import random
import time
M1 = [[random.random() for _ in range(200)] for _ in range(200)]
M2 = [[random.random() for _ in range(200)] for _ in range(200)]
numpy_M1 = np.array(M1)
numpy_M2 = np.array(M2)
t0 = time.time()
mat_multiply(M1, M2)
t1 = time.time()
print(f'Time for our multiplication: {t1-t0}')
t0 = time.time()
np.matmul(numpy_M1, numpy_M2)
t1 = time.time()
print(f'Time for Numpy multiplication: {t1-t0}')
The Numpy implementation is much faster, which is what we would expect.
Now, let's discuss other things we may wish to do. For starters, instead of using np.matmul
, we can use the @
operator. This does the same thing.
M1 = [[1, 2], [2, 1]]
M2 = [[2], [1]]
numpy_M1 = np.array(M1)
numpy_M2 = np.array(M2)
print(numpy_M1 @ numpy_M2)
We also have the dot product, which takes two vectors and returns the sum of their coordinates multiplied together.
v1 = np.array([1, 2, 3])
v2 = np.array([2, 1, 5])
print(np.dot(v1, v2)) # get the value of 1*2 + 2*1 + 3*5
Lastly, another useful command is np.sum
. By default, this takes an array and returns the sum of every entry of it.
print(np.sum(numpy_M1))
print(np.sum(v1 * v2)) # same as np.dot(v1, v2)), but a little slower
np.sum
can also take in a parameter called axis
. This allows us to sum over just one dimension of the array. For example, we could get the sum of every row in a matrix. The axis we sum over is indexed as normal: 0 is the first axis, 1 is the second, and so on.
For example, summing over axis 0 in a 2D matrix means we are summing over rows, so we will get the sum of each column. (The "over" part means that every part of the sum comes from a different row but the rest of the coordinates are the same.) This is a little tricky to get correct when you're starting out, and sometimes I still get confused about this. It's never a bad idea to check your results to make sure you're summing the way you want to.
M = np.array([[1, 1, 1], [2, 2, 2]])
print(M)
print('')
print(np.sum(M,axis=0))
print(np.sum(M,axis=1))
Numpy has its own way of generating randomness. This is mostly used if we want to have your randomness be immediately used to generate or sample a numpy array. For example, in the example above when we generated a random Numpy matrix, we could have instead done:
numpy_M1 = np.random.rand(100, 100)
numpy_M2 = np.random.rand(100, 100)
np.matmul(numpy_M1, numpy_M2)
The function np.random.rand(...)
returns an array of random values between 0 and 1 with the specified shape. Here we did 100 x 100, so it is a 2d array with 100 rows and 100 columns.
Similarly, we have np.random.randn(...)
. This does the same thing, except the random values come from the normal distribution.
numpy_M1 = np.random.randn(100, 100)
numpy_M2 = np.random.randn(100, 100)
np.matmul(numpy_M1, numpy_M2)
There are other methods in the np.random
submodule, these are just two examples. Other methods that may be useful is a method to generate Numpy array of random integers and a method to randomly shuffle a Numpy array. Like other Numpy methods, its not too important to have these memorized. What is important to know is that they exist, and can be looked up when needed.
Let's do one multi-part exercise to practice working with Numpy.
Our goal in this exercise is to estimate the value of pi. We will do this by randomly sampling points in the unit square (the part of the X-Y plane where both coordinates are between -1 and 1). This square has area 4. In this square is the unit circle, which covers all points of distance no more than 1 from 0. This circle has area pi. Therefore, if we randomly sample points in this square, we expect pi/4 of them to be in the circle.
Throughout this problem, N
will refer to the number of points we sample. This allows us to easily rerun the experiment again with more points if we wish.
N = 1000
Part 1: Generate N
random points in the unit square. This will be represented as an N x 2 array where every entry is a value between -1 and 1. (We will consider each row to be a single point.)
(solution below)
.
.
.
.
.
.
.
.
.
.
.
.
# solution
points = np.random.rand(N,2)*2 - 1
#print(points)
Part 2: Get the squared distance between each point and 0. Remember that for a point (x, y), the squared distance between it and 0 is x^2 + y^2.
(solution below)
.
.
.
.
.
.
.
.
.
.
.
.
# solution
square_dists = np.sum(points*points, axis=1)
# alternative solution
# we can work with the unsquared distances, in which case we can use
dists = np.linalg.norm(points, axis = 1)
# but the previous solution requires less knowledge of Numpy
Part 3: Get the number of points which lie inside the circle and compute our estimate for pi.
(solution below)
.
.
.
.
.
.
.
.
.
.
.
.
# solution
points_inside = np.sum(square_dists <= 1) # True counts as 1 in a sum, False as 0
# alternative solution
points_inside = len(points[square_dists <= 1])
print(f"Pi is approximately {4*points_inside/N}")
Part 4 (together): Plot the points on the X-Y plane, as well as the unit circle.
(solution below)
.
.
.
.
.
.
.
.
.
.
.
.
# we plot points inside and outside separately to make them different colors
plt.scatter(points[square_dists <= 1,0],points[square_dists <= 1,1])
plt.scatter(points[square_dists > 1,0],points[square_dists > 1,1])
# easy way of plotting a circle
theta = np.linspace(0, 2*np.pi, 100)
x = np.cos(theta)
y = np.sin(theta)
plt.plot(x, y)
# make plot square
plt.gca().set_aspect('equal')
plt.show()