1 change: 0 additions & 1 deletion content/c-pointers-arrays-strings/pointers.md
@@ -401,7 +401,6 @@ int main() {
}
```


:::{figure} images/array-indexing.png
:label: fig-ptr-indexing
:width: 80%
26 changes: 26 additions & 0 deletions content/floating-point/exercises.md
@@ -0,0 +1,26 @@
---
title: "Exercises"
subtitle: "Check your knowledge before section"
---

## Conceptual Review

1. Question

:::{note} Solution
:class: dropdown

Solution

<!--See: [Lecture 2 Slide 13](https://docs.google.com/presentation/d/1dmCk2fZz-P8VedzAXnVmJiYPKszVka5NKmTuLJ6hqZc/edit?slide=id.g2af3b38b3e2_1_154#slide=id.g2af3b38b3e2_1_154)-->
:::

## Short Exercises

1. **True/False**:

:::{note} Solution
:class: dropdown
**True.** Explanation
:::

145 changes: 145 additions & 0 deletions content/floating-point/fp-discussion.md
@@ -0,0 +1,145 @@
---
title: "Floating Point: More Discussion"
subtitle: "This content is not tested"
---

(sec-float-discussion)=
## Learning Outcomes

* Understand which floating point formats are used in practice

::::{note} 🎥 Lecture Video
:class: dropdown

:::{iframe} https://www.youtube.com/embed/VkLcogCQAho
:width: 100%
:title: "[CS61C FA20] Lecture 06.5 - Floating Point: Floating Point Discussion"
:::

::::

In a previous version of the course, we covered floating point in much more detail over multiple lectures. In recent semesters, we have reduced floating point topics to focus on the core of the standard, and we have not covered more advanced topics like arithmetic, casting, and other floating-point representations. For now, we leave this out-of-scope content below as general reference.

## Floating Point Addition

Let's consider arithmetic with floating point numbers.

Floating point addition is more complex than integer addition. We can't just add significands without considering the exponent values. In general (a worked decimal analogy follows the list):

* Denormalize to match exponents
* Add significands together
* Keep the matched exponent
* Normalize, possibly changing the exponent
* (Note: If signs differ, just perform a subtract instead.)
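To make these steps concrete, here is the analogous process in decimal scientific notation (our own illustration), adding $9.999 \times 10^1$ and $1.610 \times 10^{-1}$ while keeping four significand digits:

$$
\begin{align}
9.999 \times 10^1 + 1.610 \times 10^{-1} &= 9.999 \times 10^1 + 0.016 \times 10^1 \\
&= 10.015 \times 10^1 \\
&= 1.0015 \times 10^2 \\
&\approx 1.002 \times 10^2
\end{align}
$$

We denormalize $1.610 \times 10^{-1}$ to $0.016 \times 10^1$ to match exponents (losing digits to our four-digit precision), add the significands under the shared exponent, then renormalize and round the result.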

Because of how floating point numbers are stored, simple operations like addition are not always associative.

Define `x`, `y`, and `z` as $-1.5 \times 10^{38}$, $1.5 \times 10^{38}$, and $1.0$, respectively.

$$
\begin{align}
\texttt{x + (y + z)} &= -1.5 \times 10^{38} + (1.5 \times 10^{38} + 1.0) \\
&= -1.5 \times 10^{38} + (1.5 \times 10^{38}) \\
&= 0.0
\end{align}
$$

$$
\begin{align}
\texttt{(x + y) + z} &= (-1.5 \times 10^{38} + 1.5 \times 10^{38}) + 1.0 \\
&= 0.0 + 1.0\\
&= 1.0
\end{align}
$$


Remember, floating point effectively **approximates** real results. With bigger exponents, step size between floats gets bigger too. In this example, $1.5 \times 10^{38}$ is so much larger than $1.0$ that $1.5 \times 10^{38} + 1.0$ in floating point representation rounds to $1.5 \times 10^{38}$.
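Here is a minimal C sketch of this effect (the printed values assume IEEE 754 single precision with round-to-nearest, and a compiler that evaluates `float` arithmetic in single precision):

```c
#include <stdio.h>

int main(void) {
    float x = -1.5e38f;
    float y = 1.5e38f;
    float z = 1.0f;

    /* y + z rounds back to y: 1.0 is far smaller than the step size near 1.5e38 */
    printf("x + (y + z) = %f\n", x + (y + z));  /* 0.000000 */
    printf("(x + y) + z = %f\n", (x + y) + z);  /* 1.000000 */
    return 0;
}
```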

## Floating Point Rounding Modes

When we perform math on real numbers, we have to worry about rounding to fit the result in the significand field. The floating point hardware carries two extra bits of precision, and then rounds to get the proper value.

There are four primary rounding modes:

* **Round towards $+\infty$**. ALWAYS round “up”: 2.001 → 3, -2.001 → -2
* **Round towards $-\infty$**. ALWAYS round “down”: 1.999 → 1, -1.999 → -2
* **Truncate**. Just drop the last bits (round towards 0)
* **Unbiased**. If midway, round to even.

The unbiased mode is the default, though the others can be specified. Unbiased works _almost_ like normal rounding. Generally, we round to the nearest representable number, e.g., 2.4 rounds to 2, 2.6 to 3, 2.5 to 2, 3.5 to 4, etc. If the value is on the borderline, we round to the nearest even number. In other words, when there is a "tie", half the time we round up and half the time we round down. This "unbiased" behavior keeps calculations fair by balancing out rounding errors.
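C99 exposes these modes through `<fenv.h>`. Below is a rough sketch using `rint`, which rounds according to the current mode; note that support varies by platform, and strict conformance technically requires `#pragma STDC FENV_ACCESS ON` to keep the compiler from constant-folding:

```c
#include <fenv.h>
#include <math.h>
#include <stdio.h>

int main(void) {
    fesetround(FE_TONEAREST);   /* unbiased (the default) */
    printf("%g %g\n", rint(2.5), rint(3.5));   /* 2 4: ties round to even */

    fesetround(FE_UPWARD);      /* round towards +infinity */
    printf("%g\n", rint(-2.001));              /* -2 */

    fesetround(FE_DOWNWARD);    /* round towards -infinity */
    printf("%g\n", rint(1.999));               /* 1 */

    fesetround(FE_TOWARDZERO);  /* truncate */
    printf("%g\n", rint(-1.999));              /* -1 */
    return 0;
}
```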

## Casting and converting

Rounding also occurs when converting between numeric types. In C:

* **`int` to `float`**: There are large integers that a `float` cannot represent exactly because it lacks enough bits in the significand. For instance, $2^{24} + 1$ will "snap" to the nearest representable float, $2^{24}$.
* **`float` to `int`**: Floating point values with fractional components simply don't have exact integer representations. C uses **truncation** (rounding toward zero) when converting a floating point value to an integer. For example, `(int) 1.5` gets chopped off to `1`.

Double-casting therefore does not work as expected. Code A and Code B below may not always print `"true"`:

```c
/* Code A */
int i = …;
if (i == (int)((float) i)) {
printf("true\n");
}

/* Code B */
float f = …;
if (f == (float)((int) f)) {
printf("true\n");
}
```
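To see a concrete failure in each direction, here is one pair of values (our own picks, assuming IEEE 754 single precision):

```c
#include <stdio.h>

int main(void) {
    /* Code A fails: 2^24 + 1 needs 25 significand bits, so the
       round trip through float lands on 2^24 instead. */
    int i = (1 << 24) + 1;              /* 16777217 */
    printf("%d\n", (int)((float) i));   /* 16777216 */

    /* Code B fails: (int) truncates the fractional part. */
    float f = 1.5f;
    printf("%f\n", (float)((int) f));   /* 1.000000 */
    return 0;
}
```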

## Other Floating Point Representations

### Precision vs. Accuracy

Recall from before:

* **Precision** is a count of the number of bits used to represent a value.
* **Accuracy** is the difference between the actual value of a number and its computer representation.

High precision permits high accuracy but doesn’t guarantee it.
It is possible to have high precision but low accuracy.

For example, consider `float pi = 3.14;`. `pi` will be represented using all 23 bits of the significand ("highly precise"), but it is only an approximation of $\pi$ ("not accurate").
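One way to see this is to print the value `pi` actually stores (a sketch; the exact digits assume IEEE 754 single precision):

```c
#include <stdio.h>

int main(void) {
    float pi = 3.14f;
    /* All 23 significand bits are in use, yet the stored value is
       only the nearest representable float to 3.14, not 3.14 itself. */
    printf("%.20f\n", pi);   /* 3.14000010490417480469 */
    return 0;
}
```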

Below, we discuss other floating point representations that can yield more accurate numbers in certain cases. However, because all of these representations are fixed precision (i.e., fixed bit-width) we cannot represent everything perfectly.

### Even More Floating Point Representations

Still more representations exist. Here are a few from the IEEE 754 standard:

* **Quad-precision**, or IEEE 754 quadruple-precision format binary128. Defined as 128 bits (15 exponent bits, 112 significand bits) with unbelievable range and precision.
* **Oct-Precision**, or IEEE 754 octuple-precision format binary256. Defined as 256 bits (19 exponent bits, 237 significand bits).
* **Half-Precision**, or IEEE 754 half-precision format binary16. Defined as 16 bits (5 exponent bits, 10 significand bits).

Domain-specific architectures demand different number formats (@tab-float-types). For example, bfloat16[^bf16] on Google's Tensor Processing Unit (TPU) is defined over 16 bits (8 exponent bits, 7 significand bits); because of its wider exponent field, it covers the same range as the IEEE 754 single-precision format at the expense of significand precision. This tradeoff suits neural network training, where gradients vanish towards zero and dynamic range matters more than fine precision.

:::{table} Different domain accelerators support various integer and floating-point formats.
:label: tab-float-types
:align: center

| Accelerator | int4 | int8 | int16 | fp16 | bf16[^bf16] | fp32 | tf32[^tf32] |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Google TPU v1 | | x | | | | | |
| Google TPU v2 | | | | | x | | |
| Google TPU v3 | | | | | x | | |
| Nvidia Volta TensorCore | | x | | x | | x | |
| Nvidia Ampere TensorCore | x | x | x | x | x | x | x |
| Nvidia DLA | | x | x | x | | | |
| Intel AMX | | x | | | x | | |
| Amazon AWS Inferentia | | x | | x | x | | |
| Qualcomm Hexagon | | x | | | | | |
| Huawei Da Vinci | | x | | x | | | |
| MediaTek APU 3.0 | | x | x | x | | | |
| Samsung NPU | | x | | | | | |
| Tesla NPU | | x | | | | | |

:::

[^tf32]: See [Nvidia's TensorFloat-32](https://en.wikipedia.org/wiki/TensorFloat-32).
[^bf16]: See [Google's bfloat16](https://docs.cloud.google.com/tpu/docs/bfloat16).

For those interested, we recommend reading about the proposed [Unum format](https://en.wikipedia.org/wiki/Unum_%28number_format%29), which suggests using _variable_ field widths for the exponent and significand. This format adds a "u-bit" to tell whether the number is exact or in-between unums.
207 changes: 207 additions & 0 deletions content/floating-point/fp-examples.md
@@ -0,0 +1,207 @@
---
title: "Normalized Numbers: Practice"
---

## Learning Outcomes

* Practice converting between IEEE 754 single-precision floating point format and decimal numbers.


::::{note} 🎥 Lecture Video
:class: dropdown

:::{iframe} https://www.youtube.com/embed/7MRtSYK1IOI
:width: 100%
:title: "[CS61C FA20] Lecture 06.4 - Floating Point: Examples, Discussion"
:::

::::

In this course, we expect you to be able to translate between a number's IEEE 754 floating point format and its decimal representation. We give some examples below.

In practice, you can and should use floating point converters. [This web app](https://www.h-schmidt.net/FloatConverter/IEEE754.html) is a fantastic converter for exploring numbers beyond those discussed below!

## Example 1: Floating Point to Decimal

:::{tip} Example 1

What is the decimal number represented by this IEEE 754 single-precision binary floating point format?

| s | exponent | significand |
| :--: | :--: | :--: |
| `1` | `1000 0001` | `111 0000 0000 0000 0000 0000` |

* **A.** $-7 \times 2^{129}$
* **B.** -3.5
* **C.** -3.75
* **D.** 7
* **E.** -7.5
* **F.** Something else

:::

:::{note} Show Answer
:class: dropdown

**E.** -7.5.

We separate these 32 bits into the bit fields of sign (1 bit), exponent (8 bits), and significand (23 bits) first and translate each part separately.

* s: 1, so sign is negative
* exponent: `1000 0001` is $128 + 1 = 129$, so exponent value is $129 - 127 = 2$
* significand: `1110...0` has significant bits `111`, so the mantissa value is `1.111` (in base 2)

Plug into our formula, noting that components are decimal unless otherwise noted with subscript:

$$
\begin{align}
(-1)^\text{s} \times (1 + \text{significand})_{\text{two}} \times 2^{(\text{exponent}-127)} \\
= (-1)^1 \times (1 + .111)_{\text{two}} \times 2^{(129-127)} \\
= -1 \times (1.111)_{\text{two}} \times 2^2 \\
= -111.1_{\text{two}} \\
= -7.5
\end{align}
$$

* Second to last line: $(1.111)_{\text{two}} \times 2^2$ involves moving the binary point to the right by two spots, yielding $111.1_{\text{two}}$.
* Last line: Integer component `111` is $7$; fractional component `.1` is $\texttt{1} \times 2^{-1} = 1/2 = 0.5$.

For those interested, this means that writing the C statement `float x = -7.5;` results in `x` having the bit pattern `0xC0F00000`.

:::
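For a quick check of that bit pattern, here is a minimal sketch using `memcpy` to reinterpret the bits (assuming a 32-bit `float`):

```c
#include <inttypes.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    float x = -7.5f;
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);     /* reinterpret the 32 bits */
    printf("0x%08" PRIX32 "\n", bits);  /* 0xC0F00000 */
    return 0;
}
```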

## Example 2: Step size with limited precision

Because we have a fixed number of bits (precision), we cannot represent all numbers in a range. With floating point numbers, the exponent field determines our step size.

:::{tip} Example 2
Suppose `y` has the floating point format below. What is the **step size** around `y`?

| s | exponent | significand |
| :--: | :--: | :--: |
| `0` | `1000 0001` | `111 0000 0000 0000 0000 0000` |

_Hint_: Consider the difference between the bit patterns of `y` and the next representable number after (or before) `y`.

:::

:::{note} Show Answer
:class: dropdown

We consider the next representable number **after** `y`. This involves incrementing the significand by `1` in the least significant bit, which corresponds to the smallest possible increment ("step size"):

| s | exponent | significand |
| :--: | :--: | :--: |
| `0` | `1000 0001` | `111 0000 0000 0000 0000 0001` |

This new number is `y + z`, for some small step size `z`:

$$
y + z = y + \left( (0.0\ldots01)_{\text{two}} \times 2^{(\text{exponent}-127)} \right)
$$

Rather than translating both `y` and `y + z` to decimal and taking their difference, we note that we are trying to find `z` itself. Let's figure out exactly what power of 2 `z` represents, given the provided exponent.

Consider the place value of each bit in any normalized binary mantissa:

* The implicit leading 1 is not represented by any of the 23 bits, but corresponds to $2^0$
* bit 22 (most significant bit) of significand is $2^{-1}$
* bit 0 (least significant bit) of significand is $2^{-23}$

In other words, for an exponent value of $127 - 127 = 0$ (no scaling), `z` would be $2^{-23}$. Our actual exponent field then scales this place value to the appropriate power.

The exponent value for `1000 0001` is $129 - 127 = 2$, which scales $2^{-23}$ up by a factor of $2^2$. Our step size `z` is therefore

$$\left(2^{-23} \times 2^2\right) = 2^{-21}.$$

:::

Bigger exponents mean bigger step sizes, and vice versa. This is actually the desired behavior: for very large numbers, fractional differences matter relatively little, while for tiny numbers, small step sizes are more valuable, so our precision (the bits of the significand) goes towards representing those fine differences.
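You can also measure step sizes directly with `nextafterf` from `<math.h>`. A quick sketch using Example 2's `y` (which decodes to $7.5$):

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    float y = 7.5f;   /* 0 | 1000 0001 | 111 0000 0000 0000 0000 0000 */
    /* Distance from y to the next representable float above it. */
    float z = nextafterf(y, INFINITY) - y;
    printf("%g\n", z);   /* 4.76837e-07, i.e., 2^-21 */
    return 0;
}
```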

## Example 3: Floating Point to Decimal

:::{tip} Example 3

What is the decimal number represented by this IEEE 754 single-precision binary floating point format?

| s | exponent | significand |
| :--: | :--: | :--: |
| `0` | `0110 1000` | `101 0101 0100 0011 0100 0010` |

:::

::::{note} Show Answer
:class: dropdown

$$1.666115 \times 2^{-23} \approx 1.986 \times 10^{-7}$$

<!-- TODO: change image to translated example -->

For now, the explanation is given by the image below (@fig-float-ex3):

:::{figure} images/float-ex3.png
:label: fig-float-ex3
:width: 100%
:alt: "TODO"

Example 3, explained
:::

::::

## Example 4: Decimal to Floating Point


:::{tip} Example 4
What is $-2.340625 \times 10^1$ in IEEE 754 single-precision binary floating point format?
:::

::::{note} Show Answer
:class: dropdown

| s | exponent | significand |
| :--: | :--: | :--: |
| `1` | `1000 0011` | `011 1011 0100 0000 0000 0000` |

<!-- TODO: change image to translated example -->

For now, the explanation is given by the image below (@fig-float-ex4):

:::{figure} images/float-ex4.png
:label: fig-float-ex4
:width: 100%
:alt: "TODO"

Example 4, explained
:::

::::

## Example 5: Decimal to Floating Point

This exercise shows the limitations of accurate representation using the fixed-_precision_ IEEE 754 standard. After all, fixed precision means we only have 32 bits, and binary representations sometimes fall short.

:::{tip} Example 5
What is $\frac{1}{3}$ in IEEE 754 single-precision binary floating point format?
:::

::::{note} Show Answer
:class: dropdown

| s | exponent | significand |
| :--: | :--: | :--: |
| `0` | `0111 1101` | `010 1010 1010 1010 1010 1011` |

<!-- TODO: change image to translated example -->

For now, the explanation is given by the image below (@fig-float-ex5). Note that the discarded bits of the repeating fraction are nonzero, so round-to-nearest rounds the last significand bit up:

:::{figure} images/float-ex5.png
:label: fig-float-ex5
:width: 100%
:alt: "TODO"

Example 5, explained
:::

::::
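As a sanity check on that last rounded bit, here is a sketch that prints the bits of `1.0f / 3.0f` (assuming IEEE 754 single precision with the default round-to-nearest mode):

```c
#include <inttypes.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    float third = 1.0f / 3.0f;
    uint32_t bits;
    memcpy(&bits, &third, sizeof bits);
    /* 0 | 0111 1101 | 010 1010 1010 1010 1010 1011 */
    printf("0x%08" PRIX32 "\n", bits);   /* 0x3EAAAAAB */
    printf("%.10f\n", third);            /* 0.3333333433 */
    return 0;
}
```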