Improper truncation when reading from UTF-8 CHAR(n) fields containing characters outside of the Basic Multilingual Plane #1213

YetNothingThunders · 2025-02-27T13:54:23Z

When using the ADO.NET provider to read a UTF-8 CHAR(n) field containing at least one character outside of the Basic Multilingual Plane (e.g. any emoji), the result will be improperly truncated. As an example, reading a CHAR(1) field containing the character '😊' (code point 0x1F60A) will result in a string value containing only the high surrogate (0xD83D). If this same character is stored in a VARCHAR(1) field, reading it works as expected.
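The symptom can be reproduced without a database. The sketch below is not provider code: `SurrogateDemo` and `NaiveTruncate` are hypothetical names, and the method only mimics a truncation that counts UTF-16 code units, assuming the CHAR(1) value arrives as the two-code-unit string for U+1F60A.

```csharp
using System;

static class SurrogateDemo
{
    // Hypothetical stand-in for a truncation that treats charCount
    // as a number of UTF-16 code units rather than code points.
    public static string NaiveTruncate(string s, int charCount) =>
        s.Length > charCount ? s.Substring(0, charCount) : s;
}
```

Calling `SurrogateDemo.NaiveTruncate("\U0001F60A", 1)` returns a one-element string holding only the high surrogate 0xD83D, because the emoji has a `Length` of 2 in .NET.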

I believe the cause of this issue can be found in GdsStatement.ReadRawValue:

NETProvider/src/FirebirdSql.Data.FirebirdClient/Client/Managed/Version10/GdsStatement.cs

Lines 1534 to 1539 in 4230c1e

    
    var s = xdr.ReadString(innerCharset, field.Length);
    if ((field.Length % field.Charset.BytesPerCharacter) == 0 &&
        s.Length > field.CharCount)
    {
        return s.Substring(0, field.CharCount);
    }

After reading the string value from the IXdrReader, that value is truncated to remove the extra characters that were present in the buffer as padding. However, this truncation combines usage of the DbField.CharCount property (the number of Unicode code points stored in the field) with the .NET string.Length property and string.Substring method (which are based on the number of UTF-16 code units), leading to incorrect behavior when a single code point is encoded using multiple code units.
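One possible direction, offered only as a sketch and not as the provider's actual fix, is to truncate by code points instead of code units. `CodePointTruncation` and `TruncateToCodePoints` are hypothetical names. The helper assumes the input is the decoded string and that `maxCodePoints` plays the role of `field.CharCount`.

```csharp
using System;

static class CodePointTruncation
{
    // Hypothetical helper: keeps at most maxCodePoints Unicode code points,
    // never splitting a surrogate pair.
    public static string TruncateToCodePoints(string s, int maxCodePoints)
    {
        var index = 0; // position in UTF-16 code units
        var count = 0; // code points kept so far
        while (index < s.Length && count < maxCodePoints)
        {
            // A valid surrogate pair is two code units but one code point.
            index += char.IsSurrogatePair(s, index) ? 2 : 1;
            count++;
        }
        return s.Substring(0, index);
    }
}
```

With this approach a CHAR(1) value of '😊' is returned whole, and a CHAR(2) value of '😊' followed by padding spaces keeps the emoji and one space. Whether the guard should also compare code point counts rather than `s.Length` is left open here.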


cincuranet · 2025-03-13T08:30:32Z

That's an interesting one. :) I'll look at it.

mrotteveel · 2025-03-13T09:29:01Z

I had a similar issue in Jaybird (see FirebirdSQL/jaybird#760 and FirebirdSQL/jaybird@45ad2eb)

cincuranet self-assigned this Mar 7, 2025

cincuranet added the "component: ado.net provider" and "type: bug" labels Mar 7, 2025
