[BUG] wildcard query with case_insensitive: true fails for some languages (Turkish and Ukrainian for example) #470

@rongothait

What is the bug?

Retrieving documents using a wildcard query with case_insensitive: true fails for some languages (Turkish and Ukrainian for example), while working correctly for others (English and German for example). This failure occurs even when the query value and casing match the stored data exactly.

How can one reproduce the bug?

  1. Create the index:
curl -X PUT "http://localhost:9200/test_idx" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "dynamic": false,
    "properties": {
      "name": {
        "type": "wildcard",
        "doc_values": false
      }
    }
  }
}
'
  2. Insert test documents (English, German, Turkish, Ukrainian):
# English (Control)
curl -X POST "http://localhost:9200/test_idx/_doc/" -H 'Content-Type: application/json' -d'{"name": "Alice Wonderland"}'

# German 
curl -X POST "http://localhost:9200/test_idx/_doc/" -H 'Content-Type: application/json' -d'{"name": "Heinz Meißner"}'

# Turkish
curl -X POST "http://localhost:9200/test_idx/_doc/" -H 'Content-Type: application/json' -d'{"name": "Gökçe İrmak"}'

# Ukrainian
curl -X POST "http://localhost:9200/test_idx/_doc/" -H 'Content-Type: application/json' -d'{"name": "Олександр Зінченко"}'
  3. Run the GET queries with wildcard + case_insensitive, using this query template:
curl -X GET "http://localhost:9200/test_idx/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "wildcard": {
      "name": {
        "value": "*<NameString>*",
        "case_insensitive": true
      }
    }
  }
}
'
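
To make the template easier to run for all four names, it can be wrapped in a small shell helper. This is only a sketch: the function names `build_query` and `run_query` and the `BASE_URL` variable are illustrative and not part of OpenSearch. It assumes the names contain no double quotes or backslashes, since no JSON escaping is done.

```shell
# Sketch: fill the wildcard query template from the steps above.
# BASE_URL, build_query and run_query are hypothetical names for illustration.
BASE_URL="${BASE_URL:-http://localhost:9200}"

# Print the query body for a name ($1) and a case_insensitive flag ($2, default true).
# No JSON escaping is performed, so $1 must not contain quotes or backslashes.
build_query() {
  printf '{"query":{"wildcard":{"name":{"value":"*%s*","case_insensitive":%s}}}}' \
    "$1" "${2:-true}"
}

# Send the query to test_idx and print the raw JSON response.
run_query() {
  curl -s -X GET "$BASE_URL/test_idx/_search?pretty" \
    -H 'Content-Type: application/json' \
    -d "$(build_query "$1" "${2:-true}")"
}

# Example usage (requires the running instance and index from steps 1 and 2):
#   for n in "Alice Wonderland" "Heinz Meißner" "Gökçe İrmak" "Олександр Зінченко"; do
#     run_query "$n" true
#     run_query "$n" false
#   done
```

Running each name with both `true` and `false` puts the two behaviours side by side, which is the comparison this report relies on.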

TEST RESULTS SUMMARY:

[Image: test results summary]

For example, the failed Turkish query looks like this:

curl -X GET "http://localhost:9200/test_idx/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "wildcard": {
      "name": {
        "value": "*Gökçe İrmak*",
        "case_insensitive": true
      }
    }
  }
}
'
{
  "took" : 44,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

What is the expected behavior?

All documents should be returned when the wildcard value matches the stored string. The character set should not impact the retrieval capability of the wildcard type, especially when using exact-case strings.

What is your host/environment?

  • OpenSearch Version: 3.5.0 (latest)
  • Deployment method: Docker Compose

Do you have any additional context?

  • Removing case_insensitive (or setting it to false) returns the documents in these languages as expected:
curl -X GET "http://localhost:9200/test_idx/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "wildcard": {
      "name": {
        "value": "*Gökçe İrmak*",
        "case_insensitive": false
      }
    }
  }
}
'
{
  "took" : 24,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "test_idx",
        "_id" : "_I5m3ZwBBv_NcRY78s9B",
        "_score" : 1.0,
        "_source" : {
          "name" : "Gökçe İrmak"
        }
      }
    ]
  }
}
