SQLite: Case-insensitive matching of Unicode characters

DaltonST · 07-29-2014, 04:04 PM

According to SQLite Frequently Asked Question #18 at http://sqlite.org/faq.html#q18 :

(18) Case-insensitive matching of Unicode characters does not work.
The default configuration of SQLite only supports case-insensitive comparisons of ASCII characters. The reason for this is that doing full Unicode case-insensitive comparisons and case conversions requires tables and logic that would nearly double the size of the SQLite library. The SQLite developers reason that any application that needs full Unicode case support probably already has the necessary tables and functions and so SQLite should not take up space to duplicate this ability. Instead of providing full Unicode case support by default, SQLite provides the ability to link against external Unicode comparison and conversion routines. The application can overload the built-in NOCASE collating sequence (using sqlite3_create_collation()) and the built-in like(), upper(), and lower() functions (using sqlite3_create_function()). The SQLite source code includes an "ICU" extension that does these overloads.

So COLLATE NOCASE in a SQLite table definition or in a SELECT is only good for pure ASCII comparisons, unless the overloads described above have been implemented.
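To make the FAQ's point concrete, here is a minimal sketch using Python's standard sqlite3 module, whose create_collation() and create_function() methods wrap sqlite3_create_collation() and sqlite3_create_function(). The collation name UNICODE_NOCASE, the tags table, and the use of str.casefold() in place of ICU are assumptions made for illustration only; this is not how Calibre does it.

```python
import sqlite3


def unicode_nocase(a: str, b: str) -> int:
    """Compare two strings after Unicode case folding; return -1, 0, or 1."""
    fa, fb = a.casefold(), b.casefold()
    return (fa > fb) - (fa < fb)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tags (name TEXT)")  # hypothetical table
conn.executemany("INSERT INTO tags VALUES (?)", [("GERÇEK",), ("gerçek",)])

# The built-in NOCASE folds only ASCII letters, so 'Ç' and 'ç' stay distinct.
builtin = conn.execute(
    "SELECT count(*) FROM tags WHERE name = 'gerçek' COLLATE NOCASE"
).fetchone()[0]

# Register a Unicode-aware collation under a new (hypothetical) name.
conn.create_collation("UNICODE_NOCASE", unicode_nocase)
custom = conn.execute(
    "SELECT count(*) FROM tags WHERE name = 'gerçek' COLLATE UNICODE_NOCASE"
).fetchone()[0]

# Overload the built-in lower() with Python's Unicode-aware str.lower().
conn.create_function(
    "lower", 1, lambda s: s.lower() if isinstance(s, str) else s
)
lowered = conn.execute("SELECT lower('GERÇEK')").fetchone()[0]

print(builtin, custom, lowered)
```

Note that str.casefold() is locale-independent, so language-specific rules such as the Turkish dotted and dotless i are not handled; that is the kind of case ICU is meant for.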

Does anyone know whether Calibre's SQLite has already been set up for case-insensitive matching of Unicode (UTF-8) text as described above?

For example, Calibre would need this capability when searching for Tags that contain non-ASCII characters, such as the German word sachbüch, the Hindi word NAHĪMṀ, the Spanish word noficción, the Turkish word gerçek, and so forth. The same goes for Authors, Title, and Series.

Thanks in advance.

kovidgoyal · 07-29-2014, 11:11 PM

calibre only uses sqlite as a disk storage format, not a database. All sorting/searching is performed using ICU on an in memory normalized view of the data from the database.
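As a rough illustration of that approach (not calibre's actual code), the sketch below reads rows out of SQLite once, builds an in-memory view of normalized keys, and searches that view instead of asking SQLite to compare strings. It uses unicodedata and str.casefold() from the Python standard library as a stand-in for ICU; the table, the function names, and the sample rows are all hypothetical.

```python
import sqlite3
import unicodedata


def normalize_key(text: str) -> str:
    """Return an NFKC-normalized, case-folded key (stdlib stand-in for ICU)."""
    return unicodedata.normalize("NFKC", text).casefold()


def build_view(rows):
    """Map each record id to the normalized form of its tag."""
    return {row_id: normalize_key(tag) for row_id, tag in rows}


def search(view, query):
    """Return the ids whose normalized tag contains the normalized query."""
    needle = normalize_key(query)
    return sorted(row_id for row_id, key in view.items() if needle in key)


# SQLite is used purely as storage here (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO tags VALUES (?, ?)",
    [(1, "Gerçek"), (2, "NOFICCIÓN"), (3, "sachbüch")],
)
rows = conn.execute("SELECT id, name FROM tags").fetchall()

# All matching happens on the in-memory view, not in SQL.
view = build_view(rows)
matches = search(view, "noficción")
print(matches)
```

With this design the database never needs a Unicode-aware collation, because no comparison is delegated to it.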

rApeNB · 09-15-2014, 03:48 AM

When I update cc.db, I get this error:
no such collation sequence: icu
How can I get rid of it?
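SQLite raises this error when a table or index refers to a collation that has not been registered on the current connection. A minimal sketch of one way around it, assuming the database is opened from Python's sqlite3 module: register a collation named icu before running any statement that touches the affected columns. The comparison function below is a standard-library stand-in, not real ICU, and the demo file and table are hypothetical. If an index was built with a different ordering, a stand-in like this may not match it, so treat it as a workaround to try on a copy of the file.

```python
import os
import sqlite3
import tempfile
import unicodedata


def icu_standin(a: str, b: str) -> int:
    """Stand-in comparison (NOT real ICU): NFKC-normalize, then case-fold."""
    ka = unicodedata.normalize("NFKC", a).casefold()
    kb = unicodedata.normalize("NFKC", b).casefold()
    return (ka > kb) - (ka < kb)


def open_with_icu_collation(path: str) -> sqlite3.Connection:
    """Open a database and register a collation named 'icu' on the connection."""
    conn = sqlite3.connect(path)
    conn.create_collation("icu", icu_standin)
    return conn


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.db")  # hypothetical file, not a real cc.db

    # Create a table whose column depends on the 'icu' collation.
    writer = open_with_icu_collation(path)
    writer.execute("CREATE TABLE items (name TEXT COLLATE icu)")
    writer.executemany("INSERT INTO items VALUES (?)", [("b",), ("A",)])
    writer.commit()
    writer.close()

    # A plain connection has no 'icu' collation, so sorting on the column fails.
    plain = sqlite3.connect(path)
    try:
        plain.execute("SELECT name FROM items ORDER BY name").fetchall()
        error_message = None
    except sqlite3.OperationalError as exc:
        error_message = str(exc)
    plain.close()

    # Registering the collation first lets the same query run.
    fixed = open_with_icu_collation(path)
    names = [r[0] for r in fixed.execute("SELECT name FROM items ORDER BY name")]
    fixed.close()

print(error_message, names)
```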

