unaccent is a text search dictionary that removes accents (diacritic
signs) from lexemes. It's a filtering dictionary, which means its output
is always passed to the next dictionary (if any), unlike the normal
behavior of dictionaries. This allows accent-insensitive processing
for full text search.
The current implementation of unaccent cannot be used as a normalizing
dictionary for the thesaurus dictionary.
An unaccent dictionary accepts the following options:
RULES is the base name of the file containing the list of translation
rules. This file must be stored in $SHAREDIR/tsearch_data/ (where
$SHAREDIR means the PostgreSQL™ installation's shared-data directory).
Its name must end in .rules (which is not to be included in the RULES
parameter).
The rules file has the following format:
Each line represents a pair, consisting of a character with accent followed by a character without accent. The first is translated into the second. For example,
    À        A
    Á        A
    Â        A
    Ã        A
    Ä        A
    Å        A
    Æ        A
A more complete example, which is directly useful for most European
languages, can be found in unaccent.rules, which is installed in
$SHAREDIR/tsearch_data/ when the unaccent module is installed.
Installing the unaccent extension creates a text search template
unaccent and a dictionary unaccent based on it. The unaccent dictionary
has the default parameter setting RULES='unaccent', which makes it
immediately usable with the standard unaccent.rules file.
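As a minimal sketch of the installation step described above, the extension can be created in the current database and the dictionary's stored options inspected in the pg_ts_dict system catalog. This assumes the module's files are already present on the server and that the current role is allowed to create extensions.

```sql
-- Install the extension in the current database
-- (assumes sufficient privileges and that the module's files are installed).
CREATE EXTENSION IF NOT EXISTS unaccent;

-- Look up the dictionary's initialization options in the system catalog.
SELECT dictname, dictinitoption
FROM pg_ts_dict
WHERE dictname = 'unaccent';
```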
If you wish, you can alter the parameter, for example
mydb=# ALTER TEXT SEARCH DICTIONARY unaccent (RULES='my_rules');
or create new dictionaries based on the template.
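Creating a separate dictionary from the template, rather than altering the default one, might look like the following sketch. The dictionary name my_unaccent and the rules file base name my_rules are hypothetical; the file my_rules.rules is assumed to exist already in $SHAREDIR/tsearch_data/.

```sql
-- Hypothetical names: my_unaccent (dictionary) and my_rules (rules file base name).
-- Assumes $SHAREDIR/tsearch_data/my_rules.rules already exists.
CREATE TEXT SEARCH DICTIONARY my_unaccent (
    TEMPLATE = unaccent,
    RULES = 'my_rules'
);

-- Try the new dictionary the same way as the default one.
SELECT ts_lexize('my_unaccent', 'Hôtel');
```

Keeping the default unaccent dictionary untouched means that existing configurations that reference it continue to use the standard rules.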
To test the dictionary, you can try:
mydb=# select ts_lexize('unaccent','Hôtel');
 ts_lexize
-----------
 {Hotel}
(1 row)
Here is an example showing how to insert the unaccent dictionary into
a text search configuration:
mydb=# CREATE TEXT SEARCH CONFIGURATION fr ( COPY = french );
mydb=# ALTER TEXT SEARCH CONFIGURATION fr
        ALTER MAPPING FOR hword, hword_part, word
        WITH unaccent, french_stem;
mydb=# select to_tsvector('fr','Hôtels de la Mer');
    to_tsvector
-------------------
 'hotel':1 'mer':4
(1 row)

mydb=# select to_tsvector('fr','Hôtel de la Mer') @@ to_tsquery('fr','Hotels');
 ?column?
----------
 t
(1 row)

mydb=# select ts_headline('fr','Hôtel de la Mer',to_tsquery('fr','Hotels'));
      ts_headline
------------------------
 <b>Hôtel</b> de la Mer
(1 row)
The unaccent() function removes accents (diacritic signs) from a given
string. Basically, it's a wrapper around the unaccent dictionary, but it
can be used outside normal text search contexts.
unaccent([dictionary, ]string) returns text
For example:
SELECT unaccent('unaccent', 'Hôtel');
SELECT unaccent('Hôtel');
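Outside of text search, the function can also be applied to both sides of an ordinary comparison to make it accent-insensitive. The following is a minimal sketch only; the table hotels and its column name are hypothetical and exist purely for illustration.

```sql
-- Hypothetical table, for illustration only.
CREATE TABLE hotels (name text);
INSERT INTO hotels VALUES ('Hôtel de la Mer'), ('Hotel du Nord');

-- Strip accents (and fold case) on both sides of the comparison,
-- so that accented and unaccented spellings are treated alike.
SELECT name
FROM hotels
WHERE lower(unaccent(name)) LIKE lower(unaccent('hôtel%'));
```

With the standard unaccent.rules file, both rows would be expected to match, since the accented and unaccented spellings reduce to the same string.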