Commit a0484dd

Replace MD5 with Blake2 (#1550)
1 parent e9df3be commit a0484dd

7 files changed, +35 -35 lines changed

CHANGELOG.md

+1 -1
@@ -8,9 +8,9 @@ All notable changes to this project will be documented in this file.
 - Fixed: Updated URL regex pattern to correctly exclude trailing single (') and double (") quotes from matched URLs.
 
 ### Anonymizer
+- Changed: Deprecate `MD5` hash type option, defaulting into `sha256`. Added `blake2b` hash type instead, as a new option.
 
 ### Image Redactor
-
 - Changed: Updated the return type annotation of `ocr_bboxes` in `verify_dicom_instance()` from `dict` to `list`.
 
 ### Presidio Structured

docs/anonymizer/index.md

+10 -10
@@ -236,16 +236,16 @@ of the AnalyzerEngine"
 
 ## Built-in operators
 
-| Operator type | Operator name | Description | Parameters |
-| --- | --- | --- | --- |
-| Anonymize | replace | Replace the PII with desired value | `new_value`: replaces existing text with the given value.<br> If `new_value` is not supplied or empty, default behavior will be: <entity_type\> e.g: <PHONE_NUMBER\> |
-| Anonymize | redact | Remove the PII completely from text | None |
-| Anonymize | hash | Hashes the PII text | `hash_type`: sets the type of hashing. Can be either `sha256`, `sha512` or `md5`. <br> The default hash type is `sha256`. |
-| Anonymize | mask | Replace the PII with a given character | `chars_to_mask`: the amount of characters out of the PII that should be replaced. <br> `masking_char`: the character to be replaced with. <br> `from_end`: Whether to mask the PII from it's end. |
-| Anonymize | encrypt | Encrypt the PII using a given key | `key`: a cryptographic key used for the encryption. |
-| Anonymize | custom | Replace the PII with the result of the function executed on the PII | `lambda`: lambda to execute on the PII data. The lambda return type must be a string. |
-| Anonymize | keep | Preserver the PII unmodified | None |
-| Deanonymize | decrypt | Decrypt the encrypted PII in the text using the encryption key | `key`: a cryptographic key used for the encryption is also used for the decryption. |
+| Operator type | Operator name | Description | Parameters |
+|---------------|---------------|---------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| Anonymize | replace | Replace the PII with desired value | `new_value`: replaces existing text with the given value.<br> If `new_value` is not supplied or empty, default behavior will be: <entity_type\> e.g: <PHONE_NUMBER\> |
+| Anonymize | redact | Remove the PII completely from text | None |
+| Anonymize | hash | Hashes the PII text | `hash_type`: sets the type of hashing. Can be either `sha256`, `sha512` or `blake2b`. <br> The default hash type is `sha256`. |
+| Anonymize | mask | Replace the PII with a given character | `chars_to_mask`: the amount of characters out of the PII that should be replaced. <br> `masking_char`: the character to be replaced with. <br> `from_end`: Whether to mask the PII from it's end. |
+| Anonymize | encrypt | Encrypt the PII using a given key | `key`: a cryptographic key used for the encryption. |
+| Anonymize | custom | Replace the PII with the result of the function executed on the PII | `lambda`: lambda to execute on the PII data. The lambda return type must be a string. |
+| Anonymize | keep | Preserver the PII unmodified | None |
+| Deanonymize | decrypt | Decrypt the encrypted PII in the text using the encryption key | `key`: a cryptographic key used for the encryption is also used for the decryption. |
 
 !!! note "Note"
 When performing anonymization, if anonymizers map is empty or "DEFAULT" key is not stated, the default

docs/api-docs/api-docs.yml

+3 -3
@@ -666,11 +666,11 @@ components:
           type: string
           description: "The hashing algorithm"
           enum:
-            - md5
+            - blake2b
             - sha256
             - sha512
-          example: md5
-          default: md5
+          example: blake2b
+          default: blake2b
 
     Encrypt:
       title: Encrypt

presidio-anonymizer/README.md

+2 -2
@@ -30,10 +30,10 @@ Presidio anonymizer comes by default with the following anonymizers:
 
 - **Redact**: Removes the PII completely from text.
   - Parameters: None
-- **Hash**: Hashes the PII using either sha256, sha512 or md5.
+- **Hash**: Hashes the PII using either sha256, sha512 or blake2b.
   - Parameters:
     - `hash_type`: Sets the type of hashing.
-      Can be either `sha256`, `sha512` or `md5`.
+      Can be either `sha256`, `sha512` or `blake2b` (`md5` is deprecated as of version 2.2.358).
       The default hash type is `sha256`.
 - **Mask**: Replaces the PII with a sequence of a given character.
   - Parameters:

presidio-anonymizer/presidio_anonymizer/operators/hash.py

+5 -5
@@ -1,19 +1,19 @@
 """Hashes the PII text entity."""
 
-from hashlib import md5, sha256, sha512
+from hashlib import blake2b, sha256, sha512
 from typing import Dict
 
 from presidio_anonymizer.operators import Operator, OperatorType
 from presidio_anonymizer.services.validators import validate_parameter_in_range
 
 
 class Hash(Operator):
-    """Hash given text with sha256/sha512/md5 algorithm."""
+    """Hash given text with sha256/sha512/blake2b algorithm."""
 
     HASH_TYPE = "hash_type"
     SHA256 = "sha256"
     SHA512 = "sha512"
-    MD5 = "md5"
+    BLAKE2B = "blake2b"
 
     def operate(self, text: str = None, params: Dict = None) -> str:
         """
@@ -25,14 +25,14 @@ def operate(self, text: str = None, params: Dict = None) -> str:
         hash_switcher = {
             self.SHA256: lambda s: sha256(s),
             self.SHA512: lambda s: sha512(s),
-            self.MD5: lambda s: md5(s),
+            self.BLAKE2B: lambda s: blake2b(s, digest_size=20),
         }
         return hash_switcher.get(hash_type)(text.encode()).hexdigest()
 
     def validate(self, params: Dict = None) -> None:
         """Validate the hash type is string and in range of allowed hash types."""
         validate_parameter_in_range(
-            [self.SHA256, self.SHA512, self.MD5],
+            [self.SHA256, self.SHA512, self.BLAKE2B],
             self._get_hash_type_or_default(params),
             self.HASH_TYPE,
             str,
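
The `digest_size=20` argument in the hunk above determines how long the anonymized replacement text is. The following is a minimal standalone sketch of the same dispatch pattern, not the operator itself: `HASHERS` and `hash_text` are hypothetical names introduced here for illustration, and only the standard-library `hashlib` calls are taken from the diff.

```python
from hashlib import blake2b, sha256, sha512

# Hypothetical standalone copy of the operator's switch table (names are
# illustrative, not part of presidio_anonymizer).
HASHERS = {
    "sha256": lambda b: sha256(b),
    "sha512": lambda b: sha512(b),
    # digest_size is in bytes, so 20 bytes become 40 hex characters.
    "blake2b": lambda b: blake2b(b, digest_size=20),
}


def hash_text(text: str, hash_type: str = "sha256") -> str:
    """Return the hex digest of `text` for the given hash type."""
    return HASHERS[hash_type](text.encode()).hexdigest()


# Length of the hex digest produced by each supported hash type.
digest_lengths = {name: len(hash_text("123456", name)) for name in HASHERS}
print(digest_lengths)  # {'sha256': 64, 'sha512': 128, 'blake2b': 40}
```

With this setting a `blake2b` digest is 40 hex characters, compared with 32 for the removed `md5` option, which is why the expected strings and offsets in the tests change in this commit.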

presidio-anonymizer/tests/integration/test_anonymize_engine.py

+7 -7
@@ -158,13 +158,13 @@ def test_given_intersecting_the_same_entities_then_we_anonymize_correctly():
     # fmt: off
     "hash_type,result",
     [
-        ("md5",
-         '{"text": "hello world, my name is 1c272047233576d77a9b9a1acfdf741c. '
-         'My number is: e7706047f07bf68a5dd73e8c47db3a30", "items": [{"start": 72, '
-         '"end": 104, "entity_type": "PHONE_NUMBER", "text": '
-         '"e7706047f07bf68a5dd73e8c47db3a30", "operator": "hash"}, '
-         '{"start": 24, "end": 56, "entity_type": "NAME", "text": '
-         '"1c272047233576d77a9b9a1acfdf741c", "operator": "hash"}]}'),
+        ("blake2b",
+         '{"text": "hello world, my name is 9784349bcc3a48a6fe6e344c0701d31ee9ec1bd4. '
+         'My number is: 4c43f855362f7ab55108b2a83d50aae7cac3acec", "items": [{"start": 80, '
+         '"end": 120, "entity_type": "PHONE_NUMBER", "text": '
+         '"4c43f855362f7ab55108b2a83d50aae7cac3acec", "operator": "hash"}, '
+         '{"start": 24, "end": 64, "entity_type": "NAME", "text": '
+         '"9784349bcc3a48a6fe6e344c0701d31ee9ec1bd4", "operator": "hash"}]}'),
         ("sha256",
          '{"text": "hello world, my name is '
          '01332c876518a793b7c1b8dfaf6d4b404ff5db09b21c6627ca59710cc24f696a. '
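
The `start`/`end` offsets in the expected JSON above shift because the replacement digest grows from 32 to 40 characters. The sketch below reproduces that arithmetic using the literal text fragments from the test string; `PREFIX`, `MIDDLE`, and `spans` are hypothetical names used only for this illustration.

```python
from hashlib import blake2b

# Literal text fragments taken from the expected output in the test.
PREFIX = "hello world, my name is "
MIDDLE = ". My number is: "


def spans(digest_len: int):
    """Return ((name_start, name_end), (phone_start, phone_end)) offsets."""
    name_start = len(PREFIX)
    name_end = name_start + digest_len
    phone_start = name_end + len(MIDDLE)
    phone_end = phone_start + digest_len
    return (name_start, name_end), (phone_start, phone_end)


# Hex length of a 20-byte blake2b digest, as configured in hash.py.
blake2b_len = len(blake2b(b"x", digest_size=20).hexdigest())

print(spans(32))           # ((24, 56), (72, 104))  32-character md5 digests
print(spans(blake2b_len))  # ((24, 64), (80, 120))  40-character blake2b digests
```

Both results agree with the removed and added expectations in the hunk: the NAME span moves from 24..56 to 24..64 and the PHONE_NUMBER span from 72..104 to 80..120.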

presidio-anonymizer/tests/operators/test_hash.py

+7 -7
@@ -70,14 +70,14 @@ def test_when_given_valid_value_without_hash_type_then_expected_sha256_string_re
             "8500ce5af27e4db23f533e54c8c1ad74de62e93ca77c05e8de90e9eb27c7abe155"
             "e01d47868eded3106ccf6ac1f5c33bbaa95d55d40e9d89091c3d4617cc6d60",
         ), # Sha512 Hash 'Unicode EmojiSources' character
-        ("123456", "md5", "e10adc3949ba59abbe56e057f20f883e"), # MD5 Hash 123456
-        ("54321", "md5", "01cfcd4f6b8770febfb40cb906715822"), # MD5 Hash 54321
+        ("123456", "blake2b", "a0f92ddfdea4892ff18a48f7e0f9fcffc55745f5"), # blake2b Hash 123456
+        ("54321", "blake2b", "d0aaf44602cea8280a6bbdc62edc32762028183c"), # blake2b Hash 54321
         (
             "😈😈😈😈",
-            "md5",
-            "5bf45eeaade2060ac6cdd532e1c35eef",
+            "blake2b",
+            "70d6bc6f072a94ddb04ed81a04d65f2a7304b021",
         ),
-        # MD5 Hash 'Unicode EmojiSources' character
+        # blake2b Hash 'Unicode EmojiSources' character
         # fmt: on
     ],
 )
@@ -99,7 +99,7 @@ def test_when_hash_type_not_in_range_then_ipe_raised():
     with pytest.raises(
         InvalidParamError,
         match="Parameter hash_type value not_a_hash is not in range of values"
-        " \\['sha256', 'sha512', 'md5'\\]",
+        " \\['sha256', 'sha512', 'blake2b'\\]",
     ):
         Hash().validate(params)
 
@@ -118,7 +118,7 @@ def test_when_hash_type_is_empty_string_then_ipe_raised():
     with pytest.raises(
         InvalidParamError,
         match="Parameter hash_type value is not in range of values"
-        " \\['sha256', 'sha512', 'md5'\\]",
+        " \\['sha256', 'sha512', 'blake2b'\\]",
     ):
         Hash().validate(params)
 
