Store arrays offsets for numeric fields natively with synthetic source #124594
Conversation
Nested contexts don't work right with the current offset context logic. For now, we can disable native synthetic source support and fall back to the ignored source mechanism. We can revisit to support native synthetic source within nested contexts at a later date.
This patch removes the `context.isImmediateParentAnArray()` check when deciding whether to store an array offset. This is necessary in case we are parsing fields within an object array. For example, for the document `{"path":[{"int_value":10},{"int_value":20}]}` with `synthetic_source_keep: arrays` on the `int_value` field, we'd want to store the offsets for `int_value` even though the immediate parent is an object and not an array.
Single-element arrays can be unwrapped into arrayless field values, so we need to handle that case.
Hi @jordan-powers, I've created a changelog YAML for you.
Pinging @elastic/es-storage-engine (Team:StorageEngine)
```java
} else {
    value = fieldType().nullValue;
}
if (offsetsFieldName != null && context.canAddIgnoredField()) {
```
I don't check `context.isImmediateParentAnArray()` because we also need to record the offsets for object arrays.

There was a failing test, `IgnoredSourceFieldMapperTests#testIndexStoredArraySourceSingleLeafElementInObjectArray`. It had the following:

mapping: `{"path": {"type": "object", "properties": { "int_value": { "type": "long", "synthetic_source_keep": "arrays" }}}}`
document: `{"path":[{"int_value":10},{"int_value":20}]}`

The values for `int_value` were not immediate children of arrays, and so the `DocumentParser` would fall back to the ignored source mechanism. Then, when the synthetic `_source` was generated, the resulting document would look like this: `{"path":{"int_value":[10, 20]}}`

Then, when this document was round-tripped and indexed into a new synthetic source index, the values would be immediate children of arrays, so the offsets would be recorded. This caused a mismatch in the index writers (since one had the offsets field and the other did not), and so the test would fail.

My solution is to always record the offsets, even when the immediate parent is not an array. This has a couple of drawbacks:

- It creates an inefficiency where we potentially store the offsets even for single-value arrays. We could probably fix that by adding a check to `FieldArrayContext#addToLuceneDocument` to skip recording the offsets if there's only one non-null value.
- In the source loader, we can no longer rely on the presence of the offset field to differentiate between a single value (e.g. `field: 5`) and a single-value array (e.g. `field: [5]`). As such, my current implementation of the loader always unwraps single-value arrays into a single value. I think this is fine because it matches the ignored source implementation.
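The skip-when-single-value idea from the first drawback can be sketched with a toy model. All names here are hypothetical; this is not the real `FieldArrayContext` API, just an illustration of the shape of the check:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the proposed optimization (hypothetical names, not the real
// FieldArrayContext): offsets map each array slot to the ordinal of its value
// in sorted order, and the offsets field is skipped entirely when the array
// holds a single non-null value, since that case round-trips without it.
class OffsetRecorderSketch {
    private final List<Long> slots = new ArrayList<>();

    void recordOffset(Long value) {
        slots.add(value); // a null entry marks an explicit null in the array
    }

    // Returns the per-slot offsets to store, or null when they can be skipped.
    List<Integer> offsetsToStore() {
        if (slots.size() == 1 && slots.get(0) != null) {
            return null; // single non-null value: no offsets field needed
        }
        List<Long> sorted = slots.stream().filter(v -> v != null).sorted().distinct().toList();
        List<Integer> offsets = new ArrayList<>();
        for (Long v : slots) {
            offsets.add(v == null ? -1 : sorted.indexOf(v)); // -1 marks a null slot
        }
        return offsets;
    }
}
```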
I think we need to check for `context.isImmediateParentAnArray()` here. The current offset encoding doesn't work well if the immediate parent is not an array. If there is an object array higher up in the tree, then the array is pushed down to the leaf when synthesizing. This is incorrect, as it is not how the document was provided during indexing. This is why we fall back to ignored source in this case.
I suspect that the reason the test fails (with the `context.isImmediateParentAnArray()` check) is that initially the document gets indexed using an array, but `SortedNumericWithOffsetsDocValuesSyntheticFieldLoader` normalizes that array to a single value; the round-trip indexing then doesn't have an array, and therefore there is no offset field (because `context.isImmediateParentAnArray()` returns false).
The problem is actually the other way around. Here's the error:

`java.lang.AssertionError: round trip {"path":{"int_value":[10,20]}} expected:<[_seq_no, _version, _primary_term, path.int_value]> but was:<[_seq_no, path.int_value.offsets, _version, _primary_term, path.int_value]>`

The first time the document is indexed, it has the structure `{"path":[{"int_value":10},{"int_value":20}]}`. Since `10` and `20` are not immediate children of arrays, the fallback ignored source mechanism kicks in (via the check in `DocumentParser#parseObjectOrField()`).

However, one of the modifications made by synthetic source is that arrays are moved to leaf fields. So when the synthetic source is returned, the document instead has the structure `{"path":{"int_value":[10, 20]}}`.

When this modified document is indexed, the values `10` and `20` are now immediate children of an array, and so the offset encoding happens. This means that the offsets field does not exist the first time the document is indexed, but does exist the second time, after the round trip.
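The structural change described above (arrays moved to leaf fields) can be modeled in a few lines. This is purely illustrative; real synthetic source generation is far more involved, and the names are made up:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: synthetic source does not preserve an array of objects;
// values of same-named leaf fields are collected into one array on the leaf.
class LeafArraySketch {
    // Turns the "path" value [{"int_value":10},{"int_value":20}] into
    // {"int_value":[10,20]} — the shape that gets re-indexed on the round trip.
    @SuppressWarnings("unchecked")
    static Map<String, Object> mergeObjectArray(List<? extends Map<String, ?>> objects) {
        Map<String, Object> merged = new LinkedHashMap<>();
        for (Map<String, ?> obj : objects) {
            for (Map.Entry<String, ?> e : obj.entrySet()) {
                // collect each leaf field's values into a single array
                ((List<Object>) merged.computeIfAbsent(e.getKey(), k -> new ArrayList<>())).add(e.getValue());
            }
        }
        return merged;
    }
}
```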
```java
} else {
    assertThat(actualArray, Matchers.contains(expected));
}
} else {
```
Had to update this check since single-value arrays are unwrapped to single values.
So single-slot arrays should be retained and not be normalized to a single value. This works for the ip and keyword fields as well, so it should work for number field types too?
This was also part of the change removing the check for `context.isImmediateParentAnArray()`, since that change caused single-slot arrays to be returned as single values. Since we're rethinking that solution, I'll probably end up reverting this too.
Thanks @jordan-powers, I did my first review round. This is heading in the right direction.
```diff
@@ -49,7 +48,7 @@ public void testSynthesizeEmptyArray() throws Exception {
     public void testSynthesizeArrayRandom() throws Exception {
         var arrayValues = new Object[randomInt(64)];
         for (int j = 0; j < arrayValues.length; j++) {
-            arrayValues[j] = NetworkAddress.format(randomIp(true));
+            arrayValues[j] = getRandomValue();
```
oops :)
```java
if (offsetsFieldName != null && context.canAddIgnoredField()) {
    if (value != null) {
        final long sortableLongValue = type.sortableLongValue(value);
        context.getOffSetContext().recordOffset(offsetsFieldName, sortableLongValue);
```
I think that casting the `value` variable to `Comparable` works too:

```suggestion
context.getOffSetContext().recordOffset(offsetsFieldName, (Comparable<?>) value);
```

This way we don't need the `type.sortableLongValue(...)` method?
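As a sanity check on that suggestion, here is a minimal stand-in (hypothetical names, not the real offset context API) showing that recording raw values as `Comparable` sorts correctly without a sortable-long conversion, provided a given field only ever records one value type:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Minimal stand-in for the offset context: record raw values as Comparable
// and sort them later. This works for Integer, Long, Double, etc., but mixed
// types under one field would throw a ClassCastException at sort time.
class ComparableOffsetSketch {
    private final List<Comparable<Object>> recorded = new ArrayList<>();

    @SuppressWarnings("unchecked")
    void recordOffset(Object value) {
        recorded.add((Comparable<Object>) value);
    }

    List<Object> sortedValues() {
        List<Comparable<Object>> copy = new ArrayList<>(recorded);
        Collections.sort(copy); // relies on the values' natural ordering
        return new ArrayList<>(copy);
    }
}
```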
I'll give it a shot
```diff
-    || (sourceKeepMode == Mapper.SourceKeepMode.ARRAYS && context.inArrayScope())
+    || (sourceKeepMode == Mapper.SourceKeepMode.ARRAYS
+        && context.inArrayScope()
+        && fieldMapper.supportStoringArrayOffsets() == false)
```
Can you explain why this change is needed?
This was part of the change removing the check for `context.isImmediateParentAnArray()`. Adding this condition disabled the fallback source for that object array so that we could use the offset encoding. Since we're rethinking that solution, I'll probably end up reverting this.
```diff
@@ -97,6 +97,7 @@ static String getOffsetsFieldName(
         && sourceKeepMode == Mapper.SourceKeepMode.ARRAYS
         && hasDocValues
         && isStored == false
+        && context.isInNestedContext() == false
```
This makes sense. Can you add a test to `NativeArrayIntegrationTestCase` which checks that if a leaf array field has a nested parent field, then we always fall back to ignored source?
```java
@@ -159,25 +159,27 @@ public void testOffsetArrayWithNulls() throws Exception {
    }

    public void testOffsetArrayRandom() throws Exception {
        StringBuilder values = new StringBuilder();
```
Is this change required to make tests pass?
The original implementation converted everything to quoted strings (e.g. `["4", "-1", "20"]`). However, Elasticsearch would convert them to numbers when returning the synthetic source (e.g. `[4, -1, 20]`), causing tests to fail. I needed some way to convert the random values into quoted or unquoted strings depending on the type. I figured that since `XContentBuilder` already has that logic, I'd just pass it down.

Alternatively, I could change the return type of `getRandomValue()` to `Object`, then use the `array(String name, Object... values)` method on `XContentBuilder`.
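The quoting problem can be seen with a minimal renderer. This is a made-up stand-in for what a type-aware builder does, not `XContentBuilder` itself:

```java
// Illustrative stand-in for type-aware rendering: quoting every value as a
// string (["4","-1","20"]) breaks the round trip because synthetic source
// returns numbers unquoted ([4,-1,20]); checking the runtime type emits each
// value in the form it will come back in.
class JsonArraySketch {
    static String renderArray(Object... values) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < values.length; i++) {
            if (i > 0) sb.append(',');
            if (values[i] instanceof Number) {
                sb.append(values[i]);                         // numbers: unquoted
            } else {
                sb.append('"').append(values[i]).append('"'); // strings: quoted
            }
        }
        return sb.append(']').toString();
    }
}
```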
```java
import java.io.IOException;

class SortedNumericWithOffsetsDocValuesSyntheticFieldLoader extends SourceLoader.DocValuesBasedSyntheticFieldLoader {
```
I think we should instead implement this as a `CompositeSyntheticFieldLoader.DocValuesLayer` (like `SortedSetWithOffsetsDocValuesSyntheticFieldLoaderLayer`).
```java
    return;
}

int count = count();
```
If this class implements `CompositeSyntheticFieldLoader.DocValuesLayer`, then we don't need the logic here that determines whether something needs to be serialized as an array, and we can serialize regular number values as regular number values (instead of wrapping them in an array).
This patch builds on the work in #122999 and #113757 to natively store array offsets for numeric fields instead of falling back to ignored source when `source_keep_mode: arrays` is set.