graphics - When converting floating point to fixed point why have I seen some code do BIT_WIDTH ^ 2 MINUS 1? - Stack Overflow

admin•2025-04-19 23:10:19•questions•4 views

Intuitively, and even after some thought, if I want to convert a normalised float value back to fixed point, I would multiply it by the maximum value the fixed-point format can hold: 255, 65535, or whatever. However, in some cases I have seen people or code insist that the correct way is to multiply by 255 - 1 or 65535 - 1. I don't know why this is or why it should be the case. Is this to deal with overflow? I don't see what bad could happen if you multiplied 1.0f * 255. Even if 1.0f couldn't be perfectly represented in IEEE 754 and the multiplication yielded 255.0001, casting that to an integer would still give 255, with no overflow.

Edit: Sorry, there are two mistakes in my question. I meant 2 ^ bit_width - 1, not bit_width ^ 2 - 1, which is presumably correct: for an 8-bit integer you would multiply by 255, and for a 16-bit integer by 65535. My question is about code that, I'm pretty sure, multiplies by 255 - 1, that is, 254. Is there any reason to multiply by 254 instead of 255?

asked Mar 7 at 10:17, edited Mar 7 at 15:11 by Zebrafish

"BIT_WIDTH ^ 2 MINUS 1" is like "256 - 1, or 65536 - 1", not "255 - 1, or 65535 - 1". Please post sample code of the conversion. I would expect a multiply by 256 or 65536. – chux Commented Mar 7 at 11:01
The question title is a bit confused. For an 8 bit entity the formula bit_width^2 -1 = 63 or 8 depending on your reading. I suspect the OP meant 2^(bit_width-1), which is the safe scaling used for signed integer conversions to avoid the awkward problem of -2^(bit_width-1) having no corresponding positive representation (in two's complement arithmetic). – Martin Brown Commented Mar 7 at 12:10

1 Answer

It is very unlikely you saw any code multiplying by BIT_WIDTH ^ 2 MINUS 1. For a width of, say, 16, this would multiply by 16²−1 = 256−1 = 255, while the maximum value in a 16-bit unsigned integer is 65,535. It is more likely you saw code multiplying by 2 ^ BIT_WIDTH - 1, which would produce 65,535.
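As a side note for anyone writing this in C: `^` there is bitwise XOR, not exponentiation, and it binds more loosely than `-`, so the literal expression `2 ^ 16 - 1` evaluates to 2 XOR 15 = 13. A minimal sketch of how 2ⁿ−1 is usually computed with a shift instead (the helper name `max_unsigned_value` is made up for illustration, and the 32-bit limit is an assumption of this sketch):

```c
#include <stdint.h>

/* Hypothetical helper: largest value of an n-bit unsigned integer,
   i.e. 2^n - 1. Assumes 1 <= n <= 32. */
static uint32_t max_unsigned_value(unsigned n)
{
    /* Shifting a 32-bit value by 32 is undefined behaviour in C,
       so the full-width case is handled separately. */
    if (n >= 32u) {
        return UINT32_MAX;
    }
    return (UINT32_C(1) << n) - 1u;
}
```

With this, `max_unsigned_value(8)` is 255 and `max_unsigned_value(16)` is 65535.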

Consider an n-bit unsigned integer format. Its values range from 0 to 2ⁿ−1. Somebody mapping this to a floating-point interval [0, 1] might choose the mapping f: x ➝ x / (2ⁿ−1), as that maps 0 to 0, 2ⁿ−1 to 1, and every value between 0 and 2ⁿ−1 to a value between 0 and 1.

Once that choice is made, the inverse mapping is f⁻¹: y ➝ y•(2ⁿ−1).
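For n = 8, the two maps can be sketched in C roughly as follows. The function names are hypothetical, and the clamping and round-to-nearest (`+ 0.5f`) in the inverse are assumptions of this sketch rather than part of the mapping itself; plain truncation would also map 1.0 to 255.

```c
#include <stdint.h>

/* Forward map f: 8-bit unsigned value -> float in [0, 1]. */
static float unorm8_to_float(uint8_t x)
{
    return (float)x / 255.0f;
}

/* Inverse map: float in [0, 1] -> 8-bit unsigned value.
   Out-of-range inputs are clamped and the result is rounded to the
   nearest integer (both are choices made for this sketch). */
static uint8_t float_to_unorm8(float y)
{
    if (!(y > 0.0f)) {
        return 0;           /* negative values, zero, and NaN */
    }
    if (y > 1.0f) {
        y = 1.0f;
    }
    return (uint8_t)(y * 255.0f + 0.5f);
}
```

Under these assumptions every 8-bit value survives the round trip `float_to_unorm8(unorm8_to_float(x))`, and 1.0f converts to 255 with no overflow.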

"However in some cases I have seen some people or code insist that the correct way is to multiply by 255 - 1, or 65535 - 1."

I see no reason for this. Show us the words of those people or the text of that code.

If the reverse map is y ➝ y•(2ⁿ−2), then the forward map is x ➝ x / (2ⁿ−2). For n = 8, this would map 255 to 255/254 = 1.0039…, which is not within what would usually be used for a normalized interval, [0, 1].
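A small sketch makes the mismatch concrete. The helpers below are hypothetical and take the scale factor as a parameter so that 254 and 255 can be compared directly; the `+ 0.5f` rounding is again an assumption of the sketch.

```c
#include <stdint.h>

/* Hypothetical forward map with a caller-chosen scale factor. */
static float to_float_scaled(uint8_t x, float scale)
{
    return (float)x / scale;
}

/* Hypothetical inverse map with the same scale factor.
   The caller must keep y in [0, 1] and scale <= 255 so that the
   result fits in 8 bits. */
static uint8_t to_fixed_scaled(float y, float scale)
{
    return (uint8_t)(y * scale + 0.5f);
}
```

With a scale of 254, `to_float_scaled(255, 254.0f)` is greater than 1, and `to_fixed_scaled(1.0f, 254.0f)` is 254, so the code 255 is never produced from an input in [0, 1]. With a scale of 255, both endpoints line up.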

One way this might make sense is if the fixed-point format reserved the value 2ⁿ−1 to denote some exceptional condition (such as that an error has occurred or data is missing), so its interval of numeric values were [0, 2ⁿ−2]. In that case, the maps would be x ➝ x / (2ⁿ−2) and y ➝ y•(2ⁿ−2).
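A minimal sketch of such a format for n = 8, assuming (purely for illustration) that the code 255 means "missing" and that a NaN input is encoded as that reserved code. All names here are hypothetical and not taken from any real library.

```c
#include <stdbool.h>
#include <stdint.h>

#define SAMPLE_MISSING 255u  /* hypothetical reserved code: no data */
#define SAMPLE_MAX     254u  /* largest numeric code, 2^8 - 2 */

/* Forward map: writes x / 254 to *out and returns true, or returns
   false (leaving *out untouched) for the reserved code. */
static bool sample_decode(uint8_t x, float *out)
{
    if (x == SAMPLE_MISSING) {
        return false;
    }
    *out = (float)x / (float)SAMPLE_MAX;
    return true;
}

/* Inverse map: y * 254, clamped to [0, 1] and rounded to nearest.
   NaN is mapped to the reserved code (an assumption of this sketch). */
static uint8_t sample_encode(float y)
{
    if (y != y) {
        return (uint8_t)SAMPLE_MISSING;  /* NaN compares unequal to itself */
    }
    if (y <= 0.0f) {
        return 0;
    }
    if (y > 1.0f) {
        y = 1.0f;
    }
    return (uint8_t)(y * (float)SAMPLE_MAX + 0.5f);
}
```

Here 1.0 encodes to 254 and 254 decodes back to exactly 1.0, while 255 stays outside the numeric range, which is the only situation in which multiplying by 254 is consistent.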