Refactor into C++ #56

SethMMorton · 2022-12-14T03:56:40Z

This will close #45.

Feature neutral, but the code is now a bit more object oriented which makes it more manageable and extendable.

The main entry-point file now uses Cython. It's basically a stright-translaton from the original C. The code is not idiomatic and is rather raw/ugly at this time. As more functionality is moved to Cython the main file will become cleaner.

This is to avoid name conflicts within Cython. On import in python-land the name is remapped to int and float, respectively.

Currently, only code to query an object type or see if an object is a number is implemented. The functions have been removed from the C code. Cython does not natively allow multiple .pyx files to comprise an .so file, so to workaround this we include the non-main .pyx files into main.pyx, rather than import them. It's a bit hacky, but it gets the job done.

We will migrate to using C++ for non-cython code going forward.

The goal now is that most logic will be done by C++ classes, and cython will be used to only handle conversions to python objects and error handling (which is challenging to do by hand).

This class is designed to accept python objects that are numeric, and determine the numeric type and how to best convert them if conversion is needed. This class is implemented as header-only since it is pretty small. A Payload class is also included in this commit, but not yet used. The NumericAnalyzer has been hooked into the query_types function - it will be used in other function in future commits.

There is now an Evaluator class, and a Parser hierarchy. The Evaluator constructs the correct Parser subclass, and then that is used to do the acual analysis. The Evaluator encodes some logic.

There are still LOTS of bugs and missing functionality, but this is a good skeleton and commit point.

Invalid unicode data should raise ValueError, not TypeError.

If base was not given, it was set to NULL, which would cause a segfault when converting to Py_ssize_t. Explict NULL checking is now done.

A handful of older C-files are no longer used in the current refactor. These have been removed to make it obvious what is actually going on.

If an overflow might occur, the parsing is dispatched to Python for more accurate handling.

There were two issues - If the base was explictly given it was not correctly disabling unicode parsing - It was converting numeric values to integers erroneously

If a class defines __int__ or __float__ or __index__, this is now supported. To clarify internal workings, "not_numeric" is now "not_float_or_int".

If a length was given to vector::reserve is too large, it caused a segfault. std::size_t was being used for the return of PyUnicode_GET_LEGNTH instead of Py_ssize_t, and these have a sign difference. The working hypothesis is there was an overflow happening that caused a very large number to be passed to vector::reserve. Using the correct types will hopefully avoid segfaults going forward.

Now taking advantage of templates and other C++ nicities. Awesome.

The PyNumberType enum was moved to evaluator.hpp and renamed to UserType. Some other enum values have been renamed to make all enum values unique in the Cython file.

This will make it easier to manager removing underscores if necessary.

If the input was of zero-length or a single whitespace, the previous algorithm would inadvertently store the end pointer before the start pointer. While we are at it, change the parser to store length internally instead of the end pointer, which is a bit cleaner.

Underscore parsing will allocate new memory, but it is assumed this is a low-probability event so this is deemed an OK happening.

For C++, we are now using clang-format instead of astyle

Instead of being a single class with switch statements used to choose correct behavior, a class hierarchy is used with overridden functions. This makes the functions a bit easier to understand, and will facilitate more flexible future enhancements.

In the previous C implementation, if a string was too long to convert naively the character array was passed to a python conversion funciotn. Prior to this commit in the C++ implemenation, if the character array was too long then the original object was given to the python conversion function. This was a waste of time because the character array then had to be re-extracted under the hood. Now, the extracted character array is used as it was in the C implementation. Furthermore, the code has actually been simplified because now we can return Python objects in the Payload, rather than sending an action signal indicating how to treat the python object. This removes a lot of unnecessary and ugly code.

Separate out the exception raising into a function so that excpetion messages are only built if the exception is going to be raised.

Numbers are pre-checked for correctness before sending to the python conversion functions because that is faster than creating an exception object when an error does occur. In the process, some code was refactored to remove copy/paste.

It turns out that Cython adds some overhead that makes fastnumbers just "numbers", lol.

The changes boil down to the following - Addition of the UserOptions object to store user options - Creation of the TextExtractor object to perform the actual extraction of text (pulled out of the Evaluator) - The Parser objects are now injected into the Evaluator, and are now created from the TextExtractor

The final details of what to return to the user are now owned by the Payload object, as part of the resolve() function.

Instead of storing a pointer so that we can use polymorphism, we are using templates and a reference to the Parser in the Evaluator class. Ideally, this will be slightly more performant because the compiler will no longer need to make indirect calls. As part of this, some of the methods of the Parser base class have been made pure virutal since there is now no chance of instantiating a Parser.

Remove the union, convert to PyObject* on construction, not on request.

codecov · 2022-12-14T04:01:26Z

Codecov Report

Base: 95.61% // Head: 97.60% // Increases project coverage by +1.99% 🎉

Coverage data is based on head (d2b4800) compared to base (96b798c).
Patch coverage: 97.60% of modified lines in pull request are covered.

Additional details and impacted files

@@            Coverage Diff             @@
##           master      #56      +/-   ##
==========================================
+ Coverage   95.61%   97.60%   +1.99%     
==========================================
  Files           6        4       -2     
  Lines        1048      668     -380     
==========================================
- Hits         1002      652     -350     
+ Misses         46       16      -30

Impacted Files	Coverage Δ
src/cpp/parser.cpp	`94.11% <94.11%> (ø)`
src/cpp/extractor.cpp	`94.59% <94.59%> (ø)`
src/cpp/fastnumbers.cpp	`98.61% <98.61%> (ø)`
src/cpp/c_str_parsing.cpp	`99.46% <99.46%> (ø)`

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

This is EOL, and is no longer supported by the testing infrastructure.

This is deprecated.

This reverts commit ebaae51.

SethMMorton added 30 commits December 13, 2022 19:59

Rename int and float to fn_int and fn_float in c-code

0904320

This is to avoid name conflicts within Cython. On import in python-land the name is remapped to int and float, respectively.

Use C++ to compile main file

62cff46

We will migrate to using C++ for non-cython code going forward.

Combine all cython code into main.pyx

33928ac

The goal now is that most logic will be done by C++ classes, and cython will be used to only handle conversions to python objects and error handling (which is challenging to do by hand).

Hook NumericAnalyzer class into the "is*" functions

51178e9

Checking functions use C++ code

05b1d5f

There is now an Evaluator class, and a Parser hierarchy. The Evaluator constructs the correct Parser subclass, and then that is used to do the acual analysis. The Evaluator encodes some logic.

Complete main logic for converting numbers

0c53f66

There are still LOTS of bugs and missing functionality, but this is a good skeleton and commit point.

Add missing handling for negative NaN values

eac8663

Fix invalid error messages for invalid unicode

c690d97

Invalid unicode data should raise ValueError, not TypeError.

Add NaN and infinity handling

f57ce9f

Fix segfault in fastnumbers.int

61a3263

If base was not given, it was set to NULL, which would cause a segfault when converting to Py_ssize_t. Explict NULL checking is now done.

Remove obviated files

3d78b7d

A handful of older C-files are no longer used in the current refactor. These have been removed to make it obvious what is actually going on.

Add overflow pre-detection

bf36ff6

If an overflow might occur, the parsing is dispatched to Python for more accurate handling.

Handle negative infinity for real conversions

99090c0

Fix issues with converting unicode in fast_int

e74535c

There were two issues - If the base was explictly given it was not correctly disabling unicode parsing - It was converting numeric values to integers erroneously

Enable integer conversion for non-base-10

69941b0

Add support for custom user classes

eef222f

If a class defines __int__ or __float__ or __index__, this is now supported. To clarify internal workings, "not_numeric" is now "not_float_or_int".

Add missing object decref in Parser destructor

57dd4c3

Disable using base with non-strings

566c715

Refactor raw C-str parsing code from C to C++

488b38f

Now taking advantage of templates and other C++ nicities. Awesome.

Remove last of the C-style files

d6fdeff

The PyNumberType enum was moved to evaluator.hpp and renamed to UserType. Some other enum values have been renamed to make all enum values unique in the Cython file.

Annotate enum values

d906c78

Use a Buffer class for the character buffer

6ae6d67

This will make it easier to manager removing underscores if necessary.

Add support for parsing underscores

2f2107e

Underscore parsing will allocate new memory, but it is assumed this is a low-probability event so this is deemed an OK happening.

Apply automated formatting

64433f1

For C++, we are now using clang-format instead of astyle

Remove unneeded long double cast in parse_float

1b674a6

SethMMorton added 14 commits December 13, 2022 20:00

Fix formatting checks in CI

5f38411

Refactor parser as a class hierarchy

afd26b5

Instead of being a single class with switch statements used to choose correct behavior, a class hierarchy is used with overridden functions. This makes the functions a bit easier to understand, and will facilitate more flexible future enhancements.

Clean up Cython exception handling

14e8ecf

Separate out the exception raising into a function so that excpetion messages are only built if the exception is going to be raised.

Make MSVC warnings less anal

06cbbf9

Move cython compiler directives in-file

db14369

Implement char detection with ctype functions

be3ddd8

Improve runtime by avoiding creating exceptions

e46cc45

Numbers are pre-checked for correctness before sending to the python conversion functions because that is faster than creating an exception object when an error does occur. In the process, some code was refactored to remove copy/paste.

No longer use Cython for the Python/C interface

5c1efe2

It turns out that Cython adds some overhead that makes fastnumbers just "numbers", lol.

Payload object now owns Python object resolution

16fbd2a

The final details of what to return to the user are now owned by the Payload object, as part of the resolve() function.

Use Py_UNREACHABLE for all unreachable blocks

c6ff8eb

Simplify Payload

d49d96c

Remove the union, convert to PyObject* on construction, not on request.

SethMMorton force-pushed the refactor-using-cython branch from d6b1cfd to d49d96c Compare December 14, 2022 04:01

SethMMorton force-pushed the refactor-using-cython branch 2 times, most recently from e148397 to cdc70db Compare December 14, 2022 05:06

SethMMorton added 3 commits December 13, 2022 21:12

Remove support for Python 3.6

aace345

This is EOL, and is no longer supported by the testing infrastructure.

Update changelog

b797006

Fix issues found in CI

30c80a3

SethMMorton force-pushed the refactor-using-cython branch from cdc70db to 30c80a3 Compare December 14, 2022 05:12

SethMMorton added 3 commits December 14, 2022 10:08

Switch to pyproject.toml

ebaae51

Stop using python setup.py install

5defbbc

This is deprecated.

Revert "Switch to pyproject.toml"

1a016b2

This reverts commit ebaae51.

SethMMorton force-pushed the refactor-using-cython branch from 12c6c6e to 1a016b2 Compare December 14, 2022 18:12

Potential macos/windows fix

d2b4800

SethMMorton force-pushed the refactor-using-cython branch from 365ef4e to d2b4800 Compare December 14, 2022 18:46

SethMMorton merged commit c43d274 into master Dec 14, 2022

SethMMorton deleted the refactor-using-cython branch December 14, 2022 18:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor into C++ #56

Refactor into C++ #56

SethMMorton commented Dec 14, 2022

codecov bot commented Dec 14, 2022 •

edited

Loading

Refactor into C++ #56

Refactor into C++ #56

Conversation

SethMMorton commented Dec 14, 2022

codecov bot commented Dec 14, 2022 • edited Loading

Codecov Report

codecov bot commented Dec 14, 2022 •

edited

Loading