Commit 4119c20

jstac and claude committed
Reorganize parallel programming lectures and improve content flow
Major restructuring of parallelization-related content across lectures to improve pedagogical flow and consolidate related material.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
1 parent 5b5e342 commit 4119c20

File tree

9 files changed: +864 -809 lines
Two binary image files changed (432 KB and 1.6 MB).

lectures/_toc.yml

Lines changed: 1 addition & 1 deletion

@@ -23,7 +23,7 @@ parts:
   numbered: true
   chapters:
   - file: numba
-  - file: parallelization
+  - file: numpy_vs_numba_vs_jax
   - file: jax_intro
 - caption: Working with Data
   numbered: true

lectures/jax_intro.md

Lines changed: 1 addition & 154 deletions

@@ -513,167 +513,14 @@ plt.show()

 We defer further exploration of automatic differentiation with JAX until {doc}`jax:autodiff`.

-## Writing vectorized code

-Writing fast JAX code requires shifting repetitive tasks from loops to array processing operations, so that the JAX compiler can easily understand the whole operation and generate more efficient machine code.

-This procedure is called **vectorization** or **array programming**, and will be
-familiar to anyone who has used NumPy or MATLAB.

-In most ways, vectorization is the same in JAX as it is in NumPy.

-But there are also some differences, which we highlight here.

-As a running example, consider the function

-$$
-f(x,y) = \frac{\cos(x^2 + y^2)}{1 + x^2 + y^2}
-$$

-Suppose that we want to evaluate this function on a square grid of $x$ and $y$ points and then plot it.

-To clarify, here is the slow `for` loop version.

-```{code-cell} ipython3
-@jax.jit
-def f(x, y):
-    return jnp.cos(x**2 + y**2) / (1 + x**2 + y**2)
-
-n = 80
-x = jnp.linspace(-2, 2, n)
-y = x
-
-z_loops = np.empty((n, n))
-```

-```{code-cell} ipython3
-with qe.Timer():
-    for i in range(n):
-        for j in range(n):
-            z_loops[i, j] = f(x[i], y[j])
-```

-Even for this very small grid, the run time is extremely slow.

-(Notice that we used a NumPy array for `z_loops` because we wanted to write to it.)

-+++

-OK, so how can we do the same operation in vectorized form?

-If you are new to vectorization, you might guess that we can simply write

-```{code-cell} ipython3
-z_bad = f(x, y)
-```

-But this gives us the wrong result because JAX doesn't understand the nested for loop.

-```{code-cell} ipython3
-z_bad.shape
-```

-Here is what we actually wanted:

-```{code-cell} ipython3
-z_loops.shape
-```

-To get the right shape and the correct nested for loop calculation, we can use a `meshgrid` operation designed for this purpose:

-```{code-cell} ipython3
-x_mesh, y_mesh = jnp.meshgrid(x, y)
-```

-Now we get what we want and the execution time is very fast.

-```{code-cell} ipython3
-with qe.Timer():
-    z_mesh = f(x_mesh, y_mesh).block_until_ready()
-```

-Let's run again to eliminate compile time.

-```{code-cell} ipython3
-with qe.Timer():
-    z_mesh = f(x_mesh, y_mesh).block_until_ready()
-```

-Let's confirm that we got the right answer.

-```{code-cell} ipython3
-jnp.allclose(z_mesh, z_loops)
-```

-Now we can set up a serious grid and run the same calculation (on the larger grid) in a short amount of time.

-```{code-cell} ipython3
-n = 6000
-x = jnp.linspace(-2, 2, n)
-y = x
-x_mesh, y_mesh = jnp.meshgrid(x, y)
-```

-```{code-cell} ipython3
-with qe.Timer():
-    z_mesh = f(x_mesh, y_mesh).block_until_ready()
-```

-But there is one problem here: the mesh grids use a lot of memory.

-```{code-cell} ipython3
-x_mesh.nbytes + y_mesh.nbytes
-```

-By comparison, the flat array `x` is just

-```{code-cell} ipython3
-x.nbytes # and y is just a pointer to x
-```

-This extra memory usage can be a big problem in actual research calculations.

-So let's try a different approach using [jax.vmap](https://docs.jax.dev/en/latest/_autosummary/jax.vmap.html)

-+++

-First we vectorize `f` in `y`.

-```{code-cell} ipython3
-f_vec_y = jax.vmap(f, in_axes=(None, 0))
-```

-In the line above, `(None, 0)` indicates that we are vectorizing in the second argument, which is `y`.

-Next, we vectorize in the first argument, which is `x`.

-```{code-cell} ipython3
-f_vec = jax.vmap(f_vec_y, in_axes=(0, None))
-```

-With this construction, we can now call the function $f$ on flat (low memory) arrays.

-```{code-cell} ipython3
-with qe.Timer():
-    z_vmap = f_vec(x, y).block_until_ready()
-```

-The execution time is essentially the same as the mesh operation but we are using much less memory.

-And we produce the correct answer:

-```{code-cell} ipython3
-jnp.allclose(z_vmap, z_mesh)
-```

 ## Exercises


 ```{exercise-start}
 :label: jax_intro_ex2
 ```

-In the Exercise section of [a lecture on Numba and parallelization](https://python-programming.quantecon.org/parallelization.html), we used Monte Carlo to price a European call option.
+In the Exercise section of {doc}`a lecture on Numba <numba>`, we used Monte Carlo to price a European call option.

 The code was accelerated by Numba-based multithreading.

lectures/need_for_speed.md

Lines changed: 168 additions & 0 deletions

@@ -320,3 +320,171 @@ traditional vectorization and towards the use of [just-in-time compilers](https:

In later lectures in this series, we will learn about how modern Python libraries exploit
just-in-time compilers to generate fast, efficient, parallelized machine code.

## Parallelization

The growth of CPU clock speed (i.e., the speed at which a single chain of logic can
be run) has slowed dramatically in recent years.

This is unlikely to change in the near future, due to inherent physical
limitations on the construction of chips and circuit boards.

Chip designers and computer programmers have responded to the slowdown by
seeking a different path to fast execution: parallelization.

Hardware makers have increased the number of cores (physical CPUs) embedded in each machine.

For programmers, the challenge has been to exploit these multiple CPUs by running many processes in parallel (i.e., simultaneously).

This is particularly important in scientific programming, which requires handling

* large amounts of data and
* CPU-intensive simulations and other calculations.

In this lecture we discuss parallelization for scientific computing, with a focus on

1. the best tools for parallelization in Python and
1. how these tools can be applied to quantitative economic problems.

Let's start with some imports:

```{code-cell} ipython
import numpy as np
import quantecon as qe
import matplotlib.pyplot as plt
```

### Parallelization on CPUs

Large textbooks have been written on different approaches to parallelization, but we will keep a tight focus on what's most useful to us.

We will briefly review the two main kinds of CPU-based parallelization commonly used in
scientific computing and discuss their pros and cons.

#### Multiprocessing

Multiprocessing means concurrent execution of multiple processes using more than one processor.

In this context, a **process** is a chain of instructions (i.e., a program).

Multiprocessing can be carried out on one machine with multiple CPUs or on a
collection of machines connected by a network.

In the latter case, the collection of machines is usually called a
**cluster**.

With multiprocessing, each process has its own memory space, although the
physical memory chip might be shared.

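As a minimal sketch of the idea (not part of the lecture code), the snippet below spreads a CPU-bound task over several worker processes with the standard-library `concurrent.futures` module; the function `count_primes` and its inputs are invented purely for illustration.

```python
from concurrent.futures import ProcessPoolExecutor

def count_primes(n):
    "Count primes below n by trial division (deliberately CPU-bound)."
    count = 0
    for k in range(2, n):
        if all(k % d != 0 for d in range(2, int(k**0.5) + 1)):
            count += 1
    return count

if __name__ == "__main__":
    inputs = [50_000, 60_000, 70_000, 80_000]
    # Each input is handled by a separate worker process with its own memory space.
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(count_primes, inputs))
    print(results)
```

Because the workers are separate processes, nothing is shared implicitly: inputs and results are serialized and passed between them, which reflects the memory separation just described.
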
#### Multithreading

Multithreading is similar to multiprocessing, except that, during execution, the threads all share the same memory space.

Native Python struggles to implement multithreading due to some [legacy design
features](https://wiki.python.org/moin/GlobalInterpreterLock).

But this is not a restriction for scientific libraries like NumPy and Numba.

Functions imported from these libraries and JIT-compiled code run in low-level
execution environments where Python's legacy restrictions don't apply.

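For a rough illustration of such multithreading (a sketch, not the lecture's own example), the following uses Numba's `prange` to split a loop across CPU threads; the function `row_means` and the array shape are arbitrary.

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def row_means(x):
    "Mean of each row, with rows distributed across threads."
    n, m = x.shape
    out = np.empty(n)
    for i in prange(n):  # iterations are split across the available cores
        s = 0.0
        for j in range(m):
            s += x[i, j]
        out[i] = s / m
    return out

x = np.random.randn(2_000, 5_000)
print(row_means(x)[:5])
```

All threads read from the same array `x` and write to the same output array `out`, taking advantage of the shared memory space mentioned above.
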
#### Advantages and Disadvantages

Multithreading is more lightweight because most system and memory resources
are shared by the threads.

In addition, the fact that multiple threads all access a shared pool of memory
is extremely convenient for numerical programming.

On the other hand, multiprocessing is more flexible and can be distributed
across clusters.

For the great majority of what we do in these lectures, multithreading will
suffice.

### Hardware Accelerators

While CPUs with multiple cores have become standard for parallel computing, a more dramatic shift has occurred with the rise of specialized hardware accelerators.

These accelerators are designed specifically for the kinds of highly parallel computations that arise in scientific computing, machine learning, and data science.

#### GPUs and TPUs

The two most important types of hardware accelerators are

* **GPUs** (Graphics Processing Units) and
* **TPUs** (Tensor Processing Units).

GPUs were originally designed for rendering graphics, which requires performing the same operation on many pixels simultaneously.

Scientists and engineers realized that this same architecture --- many simple processors working in parallel --- is ideal for scientific computing tasks such as

* matrix operations,
* numerical simulation,
* solving partial differential equations and
* training machine learning models.

TPUs are a more recent development, designed by Google specifically for machine learning workloads.

Like GPUs, TPUs excel at performing massive numbers of matrix operations in parallel.

#### Why GPUs Matter for Scientific Computing

The performance gains from using GPUs can be dramatic.

A modern GPU can contain thousands of small processing cores, compared to the 8-64 cores typically found in CPUs.

When a problem can be expressed as many independent operations on arrays of data, GPUs can be orders of magnitude faster than CPUs.

This is particularly relevant for scientific computing because many algorithms in

* linear algebra,
* optimization,
* Monte Carlo simulation and
* numerical methods for differential equations

naturally map onto the parallel architecture of GPUs.

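To make this concrete, here is a small sketch (with an arbitrary matrix size) of the kind of array workload that maps well onto such hardware, written with JAX; the same code runs unchanged on a CPU or, when one is available, a GPU.

```python
import jax
import jax.numpy as jnp

@jax.jit
def step(a):
    # One dense linear-algebra step: a matrix product followed by a nonlinearity.
    return jnp.tanh(a @ a)

key = jax.random.PRNGKey(0)
a = jax.random.normal(key, (2_000, 2_000))

# The compiled code targets whatever backend JAX detects (CPU, GPU, or TPU).
result = step(a).block_until_ready()
print(result.shape)
```

Nothing in the code refers to the hardware; the JIT compiler generates kernels for whichever backend is present.
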
#### Single GPUs vs GPU Servers

There are two common ways to access GPU resources:

**Single GPU Systems**

Many workstations and laptops now come with capable GPUs, or can be equipped with them.

```{figure} /_static/lecture_specific/need_for_speed/geforce.png
:scale: 40
```

A single modern GPU can dramatically accelerate many scientific computing tasks.

For individual researchers and small projects, a single GPU is often sufficient.

Python libraries like JAX, PyTorch, and TensorFlow can automatically detect and use available GPUs with minimal code changes.

**Multi-GPU Servers**

For larger-scale problems, servers containing multiple GPUs (often 4-8 GPUs per server) are increasingly common.

```{figure} /_static/lecture_specific/need_for_speed/dgx.png
:scale: 23
```

These can be located

* in local compute clusters,
* in university or national lab computing facilities, or
* in cloud computing platforms (AWS, Google Cloud, Azure, etc.).

With appropriate software, computations can be distributed across multiple GPUs, either within a single server or across multiple servers.

This enables researchers to tackle problems that would be infeasible on a single GPU or CPU.

#### GPU Programming in Python

The good news for Python users is that many scientific libraries now support GPU acceleration with minimal changes to existing code.

For example, JAX code that runs on CPUs can often run on GPUs simply by ensuring the data is placed on the GPU device.

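As a small illustrative sketch of that device-placement point (using only standard JAX calls and an arbitrary array), the following lists the devices JAX has detected and places an array on the first one; on a CPU-only machine it simply reports the CPU backend.

```python
import jax
import jax.numpy as jnp

# Accelerators JAX has detected (GPUs/TPUs if present, otherwise CPUs).
print(jax.devices())

x = jnp.linspace(0.0, 1.0, 1_000_000)

# Explicitly place the data on the first available device;
# subsequent operations on x_dev execute there.
x_dev = jax.device_put(x, jax.devices()[0])
print(jnp.sum(x_dev))
```
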
We will explore GPU computing in more detail in later lectures, particularly when we discuss JAX.
