For :
You are familiar with 64-bit and 32-bit floating point, and may have heard about 16-bit floating point (present in some GPUs), but there is actually work on 8-BIT floating point!

arxiv.org/abs/2209.05433
developer.nvidia.com/blog/nvid

There is the "E5M2" variant, a "truncated IEEE FP16 format" (nice if your hardware lacks native FP8, since it shares FP16's exponent layout). At the minuscule 8-bit level, though, you don't necessarily need multiple NaNs or infinities, so there is the "E4M3" variant as well.
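To make the two layouts concrete, here is a rough Python sketch of decoding a single FP8 byte. The function names `decode_e5m2` and `decode_e4m3` are made up for illustration. The E4M3 rules assumed here (exponent bias 7, no infinities, a single NaN mantissa pattern) follow my reading of the linked arXiv paper, so treat this as a sketch rather than a reference implementation.

```python
import math
import struct


def decode_e5m2(byte: int) -> float:
    """Decode an E5M2 byte (1 sign, 5 exponent, 2 mantissa bits).

    E5M2 is the top 8 bits of an IEEE FP16 value, so we pad with
    8 zero bits and reuse Python's half-precision decoding ('e').
    """
    return struct.unpack("<e", struct.pack("<H", (byte & 0xFF) << 8))[0]


def decode_e4m3(byte: int) -> float:
    """Decode an E4M3 byte (1 sign, 4 exponent, 3 mantissa bits).

    Assumed rules: bias 7, no infinities, and only the all-ones
    exponent with all-ones mantissa encodes NaN.
    """
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return math.nan
    if exp == 0:
        # Subnormal: no implicit leading 1, fixed exponent of 1 - bias.
        return sign * (man / 8) * 2.0 ** (1 - 7)
    return sign * (1 + man / 8) * 2.0 ** (exp - 7)


if __name__ == "__main__":
    for b in (0x38, 0x3C, 0x7B, 0x7C, 0x7E, 0x7F):
        print(f"0x{b:02X}  E5M2={decode_e5m2(b)}  E4M3={decode_e4m3(b)}")
```

Under these assumptions, giving up the infinities and most NaN encodings lets E4M3 spend nearly all of its top exponent on ordinary numbers, while E5M2 keeps FP16's special values and its wider exponent range.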

Mastodon word count for this post: 499. (1 off, man 😁)
