2021-12-04

MySQL character ordering: numbers before question mark

I have recently upgraded a MySQL data store from some ungodly many-years-out-of-date version to 8.0.26.

In one particular table I store dates associated with each record, but occasionally there are as-yet-unknown future dates. These have always been stored in the format YYY-MM-??, so the field type is VARCHAR(10) rather than DATE, as would be expected if it was possible to always be exact. The field data is otherwise reliably YYY-MM-DD.

However, queries to order this data have recently stopped working as expected, with MySQL reckoning that such an unknown date should be ordered BEFORE an exact date.

A query boils down to something like this: SELECT * FROM table WHERE date_field <= CURDATE()

(Today is 3rd December, so CURDATE is evaluating as 2021-12-03. The same occurs when using the literal string value 2021-12-03 rather than the CURDATE function, so it's definitely a sorting issue rather than clash between data types.)

In those old MySQL versions previously running, 2021-12-?? would evaluate higher/greater than an exact date like 03, and thus not be returned. This would also be expected in line with ASCII sort ordering. Now, however, any such ?? records are also returned, the question mark character apparently being sorted as before/less than a digit.

For the moment I can force the correct and expected behaviour by utilising REPLACE in my query, but this is process-heavy, ugly and inconvenient: SELECT * FROM table WHERE REPLACE(date_field , '??', '99') <= CURDATE()

Can anyone shed some light on why this is occurring and how I might correct it? It is presumably a MySQL bug, given the standard ASCII ordering and the previous experience (of many years standing) of it working correctly?


EDIT: Thanks to the initial replies pointing me to collation. The database uses almost entirely plain English with only occasional accents (etc), so I've rarely had to touch the default settings in the past.

As per ProGu and Álvaro González's responses, I've begun digging around and test queries without the real table/database involved do indeed return as suggested. However, as soon as I attempt to run anything on the real table, it's still not behaving as expected.

The table is on InnoDB, and all tables and (textual) fields across the database are utf8mb4/utf8mb4_0900_ai_ci. I have tried forcing the collation both at query level and by changing the actual table and field collation, yet that pesky 2021-12-?? is always returned, no matter which I choose. I have attempted various query formats to no avail:

SELECT * FROM table WHERE date_field <= CURDATE() ORDER BY date_field COLLATE utf8mb4_0900_ai_ci DESC

SELECT * FROM table WHERE date_field COLLATE utf8mb4_0900_ai_ci <= CURDATE() COLLATE utf8mb4_0900_ai_ci ORDER BY date_field COLLATE utf8mb4_0900_ai_ci DESC

Test based on Álvaro's code, correctly returning 2021-12-03:

with sample_data (sample_value) as (
    select '2021-12-??'
    union all select '2021-12-03'
)
select *
from sample_data
where sample_value <= CURDATE()
order by sample_value COLLATE utf8mb4_0900_ai_ci DESC LIMIT 1;

Is my collation inexperience showing; have I missed something really obvious?



from Recent Questions - Stack Overflow https://ift.tt/3dmPbE8
https://ift.tt/eA8V8J

No comments:

Post a Comment