A first sample version of FloatQuant
#159
base: feature/float_quant
Conversation
Sample `FloatQuant` function implemented. A sample use of the function can be found in the `Examples`. `±inf` are clipped to `±max_val`. `±NaN` are mapped to `NaN`. The zero is always representable. I tested with subnormals (to be understood as subnormals for the output representation) and the quantizer represented the subnormals with no loss (I didn't test this part extensively, though). I tested the function against the Brevitas `FloatQuant` implementation: they do not always match. For example, I think `0.3125` should be representable (`x == xq`) by a float quantizer with 4 bits for the mantissa, 4 bits for the exponent, 0 bias and 1 bit for the sign. The Brevitas `FloatQuant` implementation quantizes it to `0.25`. I am not sure which result should be considered correct for this case.
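For reference, the raw arithmetic behind the `0.3125` example (a sketch of the decomposition only, not a ruling on which implementation is correct):

```python
import math

# Decompose 0.3125 into mantissa and exponent with plain float arithmetic,
# independent of any particular quantizer implementation.
m, e = math.frexp(0.3125)  # frexp yields m in [0.5, 1): (0.625, -1)
m, e = m * 2, e - 1        # renormalize to [1, 2): 0.3125 == 1.25 * 2**-2
assert (m, e) == (1.25, -2)
# 1.25 is 1.01 in binary, so the fraction needs only 2 of the 4 available
# mantissa bits; representability then hinges on whether the exponent -2 is
# reachable under the chosen 4-bit exponent encoding with a bias of 0.
```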
Brevitas developer here, thanks for this concrete example - I will look into it ASAP!

Hi @nickfraser

Yes, please do - if you can provide minimal examples as well, this will make it much easier for us 🙏. Note, if you have proposed solutions, feel free to make PRs as well (pointing to …
Co-authored-by: Nicolo Ghielmetti <[email protected]>
… provided. Some other tests have been added
```python
    exponent_bias=None,
    max_val=None,
    rounding_mode="ROUND",
    lt_subnorm_to_zero=False,
```
@maltanar, this name is terrible! Please help me find a better one 🤦♂️
… quantization logic. Now QONNX and Brevitas float quantisers match.
Thanks @nghielme! Looking good, but needs a few fixes before merging.
#### Sample Implementation

TODO

let's add a comment that links this back to the source file, in case it changes in the future but we forget to update it here, e.g. `# see src/qonnx/custom_op/general/floatquant.py for up-to-date implementation`
```python
import numpy as np

from qonnx.custom_op.general.floatquant import compute_max_val, float_quantize
```
wrong name imported? the impl seems to have been renamed to `float_quant` (and is no longer `float_quantize`)
```python
    exponent_bias,
    signed,
```
since we have a notion of a default exponent bias (as implemented by the `compute_default_exponent_bias` function), I suggest enabling the default by setting `exponent_bias=None`. In that case `signed` should either be moved up in the parameter list to come before the params with default values, or be assigned some default value (`True`?) itself
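For concreteness, a minimal sketch of how that suggestion might look, assuming the renamed `float_quant` and the `compute_default_exponent_bias` helper mentioned above (the `exponent_bitwidth`/`mantissa_bitwidth` names and the exact parameter order are guesses, not the final API):

```python
def compute_default_exponent_bias(exponent_bitwidth):
    # conventional IEEE-style bias, assumed here for illustration
    return (2 ** (exponent_bitwidth - 1)) - 1

def float_quant(
    X,
    scale,
    exponent_bitwidth,
    mantissa_bitwidth,
    exponent_bias=None,  # None -> fall back to the default bias
    signed=True,         # given a default so it may follow exponent_bias
    max_val=None,
    rounding_mode="ROUND",
    lt_subnorm_to_zero=False,
):
    if exponent_bias is None:
        exponent_bias = compute_default_exponent_bias(exponent_bitwidth)
    ...
```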
```python
assert np.all(float_quantize(testcase_a, unit_scale, 2, 3) == testcase_a)
assert np.all(float_quantize(testcase_b, unit_scale, 2, 3) == testcase_b)
assert np.all(float_quantize(testcase_c, unit_scale, 2, 3) == compute_max_val(2, 3))
assert np.all(float_quantize(testcase_d, unit_scale, 3, 2) == compute_max_val(3, 2))
assert np.all(float_quantize(testcase_e, unit_scale, 2, 1) == compute_max_val(2, 1))
assert np.all(float_quantize(testcase_f, unit_scale, 2, 3, lt_subnorm_to_zero=True) == 0.0)
```
missing `signed` and perhaps `exponent_bias` args?
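A hedged sketch of what the fixed assertions might look like, assuming the renamed `float_quant` with positional `exponent_bias` and `signed` and the default-bias helper from the earlier comment (names and order to be confirmed against the actual signature):

```python
bias = compute_default_exponent_bias(2)  # = 1 for a 2-bit exponent
assert np.all(float_quant(testcase_a, unit_scale, 2, 3, bias, True) == testcase_a)
assert np.all(
    float_quant(testcase_c, unit_scale, 2, 3, bias, True) == compute_max_val(2, 3)
)
```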
```python
assert compute_max_val(2, 3) == 7.5   # FP6 E2M3
assert compute_max_val(3, 2) == 28.0  # FP6 E3M2
assert compute_max_val(2, 1) == 6.0   # FP4 E2M1
```
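As a sanity check on those constants: the usual minifloat maximum is `(2 - 2**-M) * 2**(2**E - 1 - bias)`, assuming the IEEE-style default bias `2**(E-1) - 1` and no exponent encodings reserved for inf/NaN. A re-derivation independent of the qonnx implementation:

```python
def expected_max_val(exp_bits, man_bits, bias=None):
    # assumes default bias 2**(E-1) - 1 and that the all-ones exponent
    # encodes a normal value rather than inf/NaN
    if bias is None:
        bias = 2 ** (exp_bits - 1) - 1
    max_exponent = 2**exp_bits - 1 - bias  # largest unbiased exponent
    max_mantissa = 2 - 2**-man_bits        # 1.11...1 with man_bits bits
    return max_mantissa * 2.0**max_exponent

assert expected_max_val(2, 3) == 7.5   # (2 - 1/8) * 2**2
assert expected_max_val(3, 2) == 28.0  # (2 - 1/4) * 2**4
assert expected_max_val(2, 1) == 6.0   # (2 - 1/2) * 2**2
```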
please also add the test with the example executing QONNX exported from Brevitas & comparing against the (pre-generated) reference values from Brevitas, which you shared with me previously. Please also consider that the function should be tested more extensively.
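A minimal sketch of such a test, assuming pre-generated artifacts on disk (the file names below are hypothetical placeholders for the model and reference values shared earlier):

```python
import numpy as np
from qonnx.core.modelwrapper import ModelWrapper
from qonnx.core.onnx_exec import execute_onnx

def test_float_quant_matches_brevitas_reference():
    # hypothetical artifact paths; substitute the real shared files
    model = ModelWrapper("floatquant_exported_from_brevitas.onnx")
    inp = np.load("floatquant_test_input.npy")
    ref = np.load("floatquant_brevitas_reference_output.npy")
    iname = model.graph.input[0].name
    oname = model.graph.output[0].name
    out = execute_onnx(model, {iname: inp})[oname]
    assert np.allclose(out, ref)
```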