Add support for Singapore TIN Number #203

Closed

Contributor

@unho unho commented Mar 18, 2020

Fixes #111.

@unho unho force-pushed the sg-tin branch 6 times, most recently from af88095 to 36ad63d, March 20, 2020 08:23
@unho
Contributor Author

unho commented Mar 20, 2020

@arthurdejong Ready for review.

@unho unho force-pushed the sg-tin branch 2 times, most recently from 3eb5fd1 to 997d3e4, March 20, 2020 08:36
@unho unho changed the title from "Add support for Singapore Unique Entity Number" to "Add support for Singapore TIN Number" Apr 2, 2020
@arthurdejong
Owner

Thanks for the PR. It is a pity that the documentation on the check digit algorithm could not be obtained. However, since such a large dataset of valid numbers has been published, it is actually not that hard to reverse-engineer the algorithm.

First I looked at the distribution of the check digits across all the numbers and found that they were not evenly distributed. However, when filtering the numbers by type I found:

  • when looking only at the business numbers the check characters are evenly distributed across the letters ABCDEJKLMWX
  • when looking only at the local company numbers the check characters are evenly distributed across the letters CDEGHKMNRWZ
  • when looking only at the other numbers the check characters are evenly distributed across the letters ABCDEFGHJKL

Each of these sets contains exactly 11 letters, which suggests a mod 11 algorithm where the check digit alphabet depends on the type. Assuming a simple weighted algorithm, we can try to guess the weights.

Looking only at the business numbers for now, we can group numbers that differ only in the last digit (the one before the check digit) and see how that digit changes the check digit:
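The per-type distribution check described above could be sketched roughly as follows. The `guess_type` rule (classifying by length and first character) is an assumption made for illustration and is not stated in this thread; `check_char_distribution` and `check_alphabet` are hypothetical helper names.

```python
from collections import Counter, defaultdict


def guess_type(number):
    """Hypothetical classifier: assumes the type follows from length and first character."""
    if len(number) == 9:
        return 'business'
    if len(number) == 10 and number[0].isdigit():
        return 'local company'
    return 'other'


def check_char_distribution(numbers, classify=guess_type):
    """Count how often each check character (last character) occurs per number type."""
    counts = defaultdict(Counter)
    for number in numbers:
        counts[classify(number)][number[-1]] += 1
    return counts


def check_alphabet(counter):
    """Return the sorted set of check characters seen for one type."""
    return ''.join(sorted(counter))
```

If a type yields exactly 11 distinct check characters with roughly equal counts, a mod 11 scheme with a type-specific alphabet is a plausible first guess.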

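from collections import defaultdict
import random

# numbers: the business numbers from the published data set (assumed to be sorted)
sames = defaultdict(list)
for number in numbers:
    # group numbers that differ only in the last digit before the check digit
    sames[number[:7] + 'x'].append(number)
# keep only groups in which all ten values of the varying digit occur
complete = [number for number, values in sames.items() if len(values) == 10]
for i in range(5):
    number = random.choice(complete)
    print('%s %s' % (number, ''.join(x[-1] for x in sames[number])))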
5286165x AWLJDBXMKE
5336500x CAWLJDBXMK
5310314x CAWLJDBXMK
5313062x LJDBXMKECA
5322613x ECAWLJDBXM

Every group shows a 10-letter window of the same cyclic sequence, so the check digit alphabet is MKECAWLJDBX or some rotation of it. Continuing with the second digit from the right:

alphabet = 'MKECAWLJDBX'  # taken from the previous step
sames = defaultdict(list)
for number in numbers:
    # group numbers that differ only in the second digit from the right
    sames[number[:6] + 'x' + number[7:8]].append(number)
complete = [number for number, values in sames.items() if len(values) == 10]
for i in range(5):
    number = random.choice(complete)
    print('%s %s' % (number, ', '.join(str(alphabet.index(x[-1])) for x in sames[number])))
530705x5 9, 5, 1, 8, 4, 0, 7, 3, 10, 6
531093x1 6, 2, 9, 5, 1, 8, 4, 0, 7, 3
533736x9 0, 7, 3, 10, 6, 2, 9, 5, 1, 8
528194x1 4, 0, 7, 3, 10, 6, 2, 9, 5, 1
532139x3 6, 2, 9, 5, 1, 8, 4, 0, 7, 3

This shows that every time x goes up by one, the index of the check digit goes down by 4, which implies that the weight should be 7 (-4 mod 11).

Doing this for every digit (the first digit requires a few tweaks because only values from 0 to 5 occur) and shifting the alphabet a bit to get the correct offset, we get:
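The step-to-weight arithmetic can be made explicit with a small helper. This is only a sketch: `infer_weight` is a hypothetical name, and it assumes the indices are listed in ascending order of the varying digit, as the outputs above appear to be.

```python
def infer_weight(indices, modulus=11):
    """Return the constant step between consecutive check indices, modulo `modulus`.

    Raising the digit by one adds its weight to the weighted sum, so the
    step between neighbouring check indices equals the weight.
    """
    steps = {(b - a) % modulus for a, b in zip(indices, indices[1:])}
    if len(steps) != 1:
        raise ValueError('step is not constant: %r' % sorted(steps))
    return steps.pop()


alphabet = 'MKECAWLJDBX'
# Second digit from the right, first row of the output above (530705x5).
print(infer_weight([9, 5, 1, 8, 4, 0, 7, 3, 10, 6]))  # 7
# Last digit, first group of the earlier output (5286165x).
print(infer_weight([alphabet.index(c) for c in 'AWLJDBXMKE']))  # 1
```

A non-constant step would raise an error, which is a quick signal that the simple weighted mod 11 assumption does not hold for that position.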

def calc_business_check_digit(number):
    """Calculate the check digit for the 8 digits of a business number."""
    number = compact(number)
    weights = (10, 4, 9, 3, 8, 2, 7, 1)
    return 'XMKECAWLJDB'[sum(int(n) * w for n, w in zip(number, weights)) % 11]
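To illustrate the "shifting the alphabet" step: once the weights are known, a single known-valid number is enough to fix the rotation. The sketch below is an illustration under assumptions and not necessarily how the offset was found here. `find_rotation` and `weighted_sum` are hypothetical helpers, and 52861650A is read off the first group in the earlier output (5286165x, check characters AWLJDBXMKE), assuming that group is listed in ascending order of the varying digit.

```python
WEIGHTS = (10, 4, 9, 3, 8, 2, 7, 1)


def weighted_sum(digits, weights=WEIGHTS):
    """Weighted digit sum modulo 11."""
    return sum(int(n) * w for n, w in zip(digits, weights)) % 11


def find_rotation(alphabet, known_valid, weights=WEIGHTS):
    """Rotate `alphabet` so that it maps `known_valid` to its own check character."""
    target = weighted_sum(known_valid[:-1], weights)
    current = alphabet.index(known_valid[-1])
    shift = (target - current) % len(alphabet)
    # Move the last `shift` characters to the front.
    return alphabet[len(alphabet) - shift:] + alphabet[:len(alphabet) - shift]


print(find_rotation('MKECAWLJDBX', '52861650A'))  # XMKECAWLJDB
```

Here the weighted sum for 52861650 is 203, which is 5 mod 11, while A sits at index 4 in MKECAWLJDBX, so the alphabet has to be rotated by one position.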

Running calc_business_check_digit() over the data set, I found only 11 numbers where the check digit does not match:

50856857D
52737212B
52803596X
52804404A
52805118K
52813100D
52853385J
52856860B
52870338A
52882019E
52923950C
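A check like the one that produced this list could look roughly like the sketch below. The `compact` here is a simplified stand-in for the module's helper, `find_mismatches` is a hypothetical name, and 52861650A is inferred from the first group in the earlier output rather than taken from the data set directly.

```python
def compact(number):
    """Simplified stand-in: strip whitespace and upper-case the number."""
    return number.replace(' ', '').strip().upper()


def calc_business_check_digit(number):
    """Check character for the 8 digits of a business number (weights from above)."""
    number = compact(number)
    weights = (10, 4, 9, 3, 8, 2, 7, 1)
    return 'XMKECAWLJDB'[sum(int(n) * w for n, w in zip(number, weights)) % 11]


def find_mismatches(numbers):
    """Return the numbers whose last character differs from the computed check digit."""
    mismatches = []
    for number in map(compact, numbers):
        if calc_business_check_digit(number[:-1]) != number[-1]:
            mismatches.append(number)
    return mismatches


print(find_mismatches(['52861650A', '50856857D', '52737212B']))
# ['50856857D', '52737212B']
```

For 50856857 the weighted sum is 243, which is 1 mod 11 and maps to M rather than D, so the number is reported as a mismatch.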

I have not tried the online validator for these numbers and I haven't looked at the other number types yet, but I expect the analysis to be pretty simple to repeat with the approach above (perhaps with some tweaks for the numbers that have letters in them).

@unho
Contributor Author

unho commented May 3, 2020

Wow!!!

@unho
Contributor Author

unho commented May 4, 2020

@arthurdejong I have checked all those numbers that do not match and they all seem to be either terminated or cancelled in 2017. Maybe we should go with this algorithm?

@unho
Contributor Author

unho commented May 4, 2020

Yep, verified with another website and all those are deregistered.

@arthurdejong
Owner

I managed to reverse-engineer the local company and other checksums as well (the last one was a much bigger puzzle because of the letters). That only leaves "Foreign Company" numbers (the ones starting with F000).

Do you have some examples of valid numbers for these? The one in the tests doesn't pass the [online validator](https://www.iras.gov.sg/irashome/GST/GST-registered-businesses/Other-services/Checking-if-a-Business-is-GST-Registered/) and there do not seem to be many references to this flavour. Also note that the "other" flavour has a code (FC) for foreign companies, so perhaps it has been replaced?

Do you have some more background and/or examples of valid "Foreign Company" numbers?

Thanks.

@unho
Contributor Author

unho commented May 10, 2020

Sadly, I have found no foreign company UEN numbers. If I recall correctly, the examples used in testing for foreign companies were made up based on the documentation I referenced in the ticket, while all the examples for the other types of UEN numbers are real.

@arthurdejong
Owner

Are you OK with me merging it without the foreign company UEN numbers? If it is used and some valid numbers are rejected, someone will likely complain, while no one is likely to complain if an invalid number is considered valid.

@unho
Contributor Author

unho commented May 16, 2020

@arthurdejong I am 100% OK with that. I would suggest keeping that particular code, but commented out.

@unho unho deleted the sg-tin branch July 8, 2020 19:40