Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor into C++ #56

Merged
merged 61 commits into from
Dec 14, 2022
Merged

Refactor into C++ #56

merged 61 commits into from
Dec 14, 2022

Conversation

SethMMorton
Copy link
Owner

This will close #45.

Feature neutral, but the code is now a bit more object oriented which makes it more manageable and extendable.

The main entry-point file now uses Cython. It's basically a
stright-translaton from the original C.

The code is not idiomatic and is rather raw/ugly at this time. As more
functionality is moved to Cython the main file will become cleaner.
This is to avoid name conflicts within Cython. On import in python-land
the name is remapped to int and float, respectively.
Currently, only code to query an object type or see if an object is a
number is implemented. The functions have been removed from the C code.

Cython does not natively allow multiple .pyx files to comprise an .so
file, so to workaround this we include the non-main .pyx files into
main.pyx, rather than import them. It's a bit hacky, but it gets the job
done.
We will migrate to using C++ for non-cython code going forward.
The goal now is that most logic will be done by C++ classes, and cython
will be used to only handle conversions to python objects and error
handling (which is challenging to do by hand).
This class is designed to accept python objects that are numeric, and
determine the numeric type and how to best convert them if conversion is
needed.

This class is implemented as header-only since it is pretty small.

A Payload class is also included in this commit, but not yet used.

The NumericAnalyzer has been hooked into the query_types function - it
will be used in other function in future commits.
There is now an Evaluator class, and a Parser hierarchy. The Evaluator
constructs the correct Parser subclass, and then that is used to do the
acual analysis. The Evaluator encodes some logic.
There are still LOTS of bugs and missing functionality, but this is a
good skeleton and commit point.
Invalid unicode data should raise ValueError, not TypeError.
If base was not given, it was set to NULL, which would cause a segfault
when converting to Py_ssize_t. Explict NULL checking is now done.
A handful of older C-files are no longer used in the current refactor.
These have been removed to make it obvious what is actually going on.
If an overflow might occur, the parsing is dispatched to Python for more
accurate handling.
There were two issues

- If the base was explictly given it was not correctly disabling unicode
  parsing
- It was converting numeric values to integers erroneously
If a class defines __int__ or __float__ or __index__, this is now
supported. To clarify internal workings, "not_numeric" is now
"not_float_or_int".
If a length was given to vector::reserve is too large, it caused a
segfault. std::size_t was being used for the return of
PyUnicode_GET_LEGNTH instead of Py_ssize_t, and these have a sign
difference. The working hypothesis is there was an overflow happening
that caused a very large number to be passed to vector::reserve.

Using the correct types will hopefully avoid segfaults going forward.
Now taking advantage of templates and other C++ nicities. Awesome.
The PyNumberType enum was moved to evaluator.hpp and renamed to
UserType.

Some other enum values have been renamed to make all enum values unique
in the Cython file.
This will make it easier to manager removing underscores if necessary.
If the input was of zero-length or a single whitespace, the previous
algorithm would inadvertently store the end pointer before the start
pointer.

While we are at it, change the parser to store length internally instead
of the end pointer, which is a bit cleaner.
Underscore parsing will allocate new memory, but it is assumed this is a
low-probability event so this is deemed an OK happening.
For C++, we are now using clang-format instead of astyle
Instead of being a single class with switch statements used to choose
correct behavior, a class hierarchy is used with overridden functions.
This makes the functions a bit easier to understand, and will facilitate
more flexible future enhancements.
In the previous C implementation, if a string was too long to convert
naively the character array was passed to a python conversion
funciotn.

Prior to this commit in the C++ implemenation, if the character
array was too long then the original object was given to the python
conversion function. This was a waste of time because the character
array then had to be re-extracted under the hood.

Now, the extracted character array is used as it was in the C
implementation.

Furthermore, the code has actually been simplified because now we
can return Python objects in the Payload, rather than sending an action
signal indicating how to treat the python object. This removes a lot of
unnecessary and ugly code.
Separate out the exception raising into a function so that excpetion
messages are only built if the exception is going to be raised.
Numbers are pre-checked for correctness before sending to the python
conversion functions because that is faster than creating an exception
object when an error does occur.

In the process, some code was refactored to remove copy/paste.
It turns out that Cython adds some overhead that makes fastnumbers just
"numbers", lol.
The changes boil down to the following

- Addition of the UserOptions object to store user options
- Creation of the TextExtractor object to perform the actual extraction
  of text (pulled out of the Evaluator)
- The Parser objects are now injected into the Evaluator, and are now
  created from the TextExtractor
The final details of what to return to the user are now owned by the
Payload object, as part of the resolve() function.
Instead of storing a pointer so that we can use polymorphism, we are
using templates and a reference to the Parser in the Evaluator class.
Ideally, this will be slightly more performant because the compiler will
no longer need to make indirect calls.

As part of this, some of the methods of the Parser base class have been
made pure virutal since there is now no chance of instantiating a
Parser.
Remove the union, convert to PyObject* on construction, not on request.
@SethMMorton SethMMorton force-pushed the refactor-using-cython branch from d6b1cfd to d49d96c Compare December 14, 2022 04:01
@codecov
Copy link

codecov bot commented Dec 14, 2022

Codecov Report

Base: 95.61% // Head: 97.60% // Increases project coverage by +1.99% 🎉

Coverage data is based on head (d2b4800) compared to base (96b798c).
Patch coverage: 97.60% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master      #56      +/-   ##
==========================================
+ Coverage   95.61%   97.60%   +1.99%     
==========================================
  Files           6        4       -2     
  Lines        1048      668     -380     
==========================================
- Hits         1002      652     -350     
+ Misses         46       16      -30     
Impacted Files Coverage Δ
src/cpp/parser.cpp 94.11% <94.11%> (ø)
src/cpp/extractor.cpp 94.59% <94.59%> (ø)
src/cpp/fastnumbers.cpp 98.61% <98.61%> (ø)
src/cpp/c_str_parsing.cpp 99.46% <99.46%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@SethMMorton SethMMorton force-pushed the refactor-using-cython branch 2 times, most recently from e148397 to cdc70db Compare December 14, 2022 05:06
This is EOL, and is no longer supported by the testing infrastructure.
@SethMMorton SethMMorton force-pushed the refactor-using-cython branch from cdc70db to 30c80a3 Compare December 14, 2022 05:12
@SethMMorton SethMMorton force-pushed the refactor-using-cython branch from 12c6c6e to 1a016b2 Compare December 14, 2022 18:12
@SethMMorton SethMMorton force-pushed the refactor-using-cython branch from 365ef4e to d2b4800 Compare December 14, 2022 18:46
@SethMMorton SethMMorton merged commit c43d274 into master Dec 14, 2022
@SethMMorton SethMMorton deleted the refactor-using-cython branch December 14, 2022 18:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Re-write using C++ and pybind11
1 participant