The uchardet library is an encoding detector: it takes a sequence of bytes in an unknown character encoding, without any additional information, and attempts to determine the encoding of the text. The returned encoding names are iconv-compatible.
The uchardet package solves three types of tasks: detecting the encoding of character strings, of raw byte vectors, and of files.
The uchardet package includes demo files. You can get their paths with this command:
dir(system.file("examples", package = "uchardet"), recursive = TRUE, full.names = TRUE)
To detect the encoding of strings, use the detect_str_enc() function. It is vectorized and accepts a character vector. Missing values are skipped.
A simple example, detection of ASCII symbols:
detect_str_enc("Hello, useR!")
#> [1] "ASCII"
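Because the function is vectorized, several strings can be checked in one call. The following is a minimal sketch, assuming the uchardet package is installed; the input vector is made up for illustration, and no output is shown because the results depend on the installed version.

```r
# Illustrative input: an ASCII string, a missing value, and a non-ASCII string
x <- c("Hello, useR!", NA, "fa\u00e7ade")

# Run the detector only when the package is available
if (requireNamespace("uchardet", quietly = TRUE)) {
  enc <- uchardet::detect_str_enc(x)
  print(enc)
}
```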
All strings in R can be held in only three encodings: UTF-8, Latin1 and native. This means that we cannot read a file in WINDOWS-1252 encoding and then keep or print it in that same encoding: it will be converted into one of the basic encodings (usually ASCII or UTF-8).
Because of this limitation, to test detect_str_enc() we have to convert a UTF-8 string into another encoding and then pass the result to detect_str_enc().
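To make the idea of re-encoding concrete, here is a small base R sketch. The string is an illustrative example and is not taken from the demo files. It shows that the same characters are stored as different bytes depending on the encoding, which is the information a detector has to work with.

```r
# Illustrative string; \u00e7 is the letter c with a cedilla
x_utf8 <- "fa\u00e7ade"
Encoding(x_utf8)
#> [1] "UTF-8"

# Re-encode to Latin-1: same characters, different bytes
x_latin1 <- iconv(x_utf8, from = "UTF-8", to = "latin1")
Encoding(x_latin1)
#> [1] "latin1"

# The cedilla takes two bytes in UTF-8 and one byte in Latin-1
charToRaw(x_utf8)
#> [1] 66 61 c3 a7 61 64 65
charToRaw(x_latin1)
#> [1] 66 61 e7 61 64 65
```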
Let’s define a function that reads a demo file into a UTF-8 string:
read_char <- function(path, enc) {
  # get file path
  file <- system.file("examples", path, package = "uchardet")
  # create the file connection with the encoding
  con <- file(file, encoding = enc)
  # close connection on exit
  on.exit(close(con))
  # read file content
  paste(readLines(con, warn = FALSE), collapse = "\n")
}
Detection of a UTF-8 character string:
# read file into the working env
zh_utf8 <- read_char("zh/big5.txt", "BIG-5")
# print content
print(zh_utf8)
#> [1] "繁體中文繁體中文繁體中文繁體中文繁體中文繁體中文繁體中文繁體中文繁體中文繁體中文繁體中文繁體中文繁體中文繁體中文繁體中文繁體中文繁體中文"
# check the encoding of the created object
Encoding(zh_utf8)
#> [1] "UTF-8"
# detection result
detect_str_enc(zh_utf8)
#> [1] "UTF-8"
Detection of the unusual encodings:
# convert zh_utf8 from UTF-8 into unusual encodings
zh_big5 <- iconv(zh_utf8, "UTF-8", "BIG-5")
print(zh_big5)
#> [1] "\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5"
zh_gb <- iconv(zh_utf8, "UTF-8", "GB18030")
print(zh_gb)
#> [1] "\xb7\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xce\xc4"
# detect encoding
detect_str_enc(c(zh_utf8, zh_big5, zh_gb))
#> [1] "UTF-8" "BIG5" "GB18030"
The base Encoding() function reports the converted strings as unknown:
Encoding(c(zh_utf8, zh_big5, zh_gb))
#> [1] "UTF-8" "unknown" "unknown"
Sometimes a file can’t be read as a string, for example when it contains an embedded nul (\000). In such cases, read the file as a raw byte vector and detect its encoding with the detect_raw_enc() function.
Detection of ASCII symbols from a byte vector:
detect_raw_enc(charToRaw("Hello, useR!"))
#> [1] "ASCII"
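The embedded nul problem can be reproduced with base R alone. The sketch below uses a hypothetical six-byte input (a UTF-16BE byte order mark followed by "Hi") written to a temporary file; it is not one of the demo files. The final detection step assumes uchardet is installed, and its output is not shown.

```r
# Hypothetical input: UTF-16BE byte order mark followed by "Hi"
bytes <- as.raw(c(0xfe, 0xff, 0x00, 0x48, 0x00, 0x69))

tmp <- tempfile()
writeBin(bytes, tmp)

# Read the whole file as a raw vector, nul bytes included
raw_data <- readBin(tmp, what = raw(), n = file.size(tmp))
unlink(tmp)

# Converting these bytes to a string fails because of the embedded nul
as_string <- tryCatch(rawToChar(raw_data), error = function(e) NA_character_)
is.na(as_string)
#> [1] TRUE

# The raw vector can still be passed to the detector (if uchardet is installed)
if (requireNamespace("uchardet", quietly = TRUE)) {
  print(uchardet::detect_raw_enc(raw_data))
}
```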
Let’s define a function that reads a demo file as a raw byte vector:
read_raw <- function(path) {
  # get file path
  file <- system.file("examples", path, package = "uchardet")
  # read file to raw vector
  readBin(file, raw(), file.size(file))
}
# print first 5 bytes
read_raw("de/iso-8859-1.txt")[1:5]
#> [1] 49 53 4f 20 38
Detection of unusual encodings (each file has its own encoding):
detect_raw_enc(read_raw("de/iso-8859-1.txt"))
#> [1] "ISO-8859-1"
detect_raw_enc(read_raw("de/windows-1252.txt"))
#> [1] "WINDOWS-1252"
detect_raw_enc(read_raw("fr/utf-16.be"))
#> [1] "UTF-16"
detect_raw_enc(read_raw("zh/big5.txt"))
#> [1] "BIG5"
The detect_file_enc() function is helpful for detecting the encoding of files without importing them into the working environment. detect_file_enc() uses a sliding window 65536 bytes wide, so there is no need to import the entire file.
The function is vectorized and accepts a character vector of file paths. Non-existent files are skipped.
# paths to example files
ex_path <- system.file("examples", package = "uchardet")
ex_files <- Sys.glob(file.path(ex_path, "*", "*"))
# detect encoding
res <- detect_file_enc(ex_files)
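Reading a file window by window can be pictured with base R alone. The sketch below is not the package's implementation; it only illustrates, under that assumption, how a file can be consumed in 65536-byte chunks so that it is never held in memory as a whole. The temporary file and the variable names are made up for the example.

```r
chunk_size <- 65536L

# Hypothetical test file: 150000 bytes, all of them the letter "A"
tmp <- tempfile()
writeBin(as.raw(rep(0x41, 150000)), tmp)

con <- file(tmp, open = "rb")
chunk_lengths <- integer(0)
repeat {
  # readBin() returns at most chunk_size bytes, fewer at the end of the file
  chunk <- readBin(con, what = raw(), n = chunk_size)
  if (length(chunk) == 0L) break
  # a detector would inspect `chunk` here; this sketch only records its size
  chunk_lengths <- c(chunk_lengths, length(chunk))
}
close(con)
unlink(tmp)

chunk_lengths
#> [1] 65536 65536 18928
```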
Let’s compare the results with the original file encodings:
# regex pattern
pattern <- ".*/examples/((.*)/(.*)\\.(?:.*))$"
proto <- list(file = character(1L), lang = character(1L), original = character(1L))
cmp <- strcapture(pattern, ex_files, proto)
cmp$lang <- toupper(cmp$lang)
cmp$original <- toupper(cmp$original)
cmp$uchardet <- res
head(cmp, n = 15)
#>                        file lang          original          uchardet
#> 1         ar/iso-8859-6.txt   AR        ISO-8859-6        ISO-8859-6
#> 2              ar/utf-8.txt   AR             UTF-8             UTF-8
#> 3       ar/windows-1256.txt   AR      WINDOWS-1256      WINDOWS-1256
#> 4       bg/windows-1251.txt   BG      WINDOWS-1251      WINDOWS-1251
#> 5             cs/ibm852.txt   CS            IBM852            IBM852
#> 6         cs/iso-8859-2.txt   CS        ISO-8859-2        ISO-8859-2
#> 7  cs/mac-centraleurope.txt   CS MAC-CENTRALEUROPE MAC-CENTRALEUROPE
#> 8              cs/utf-8.txt   CS             UTF-8             UTF-8
#> 9       cs/windows-1250.txt   CS      WINDOWS-1250      WINDOWS-1250
#> 10        da/iso-8859-1.txt   DA        ISO-8859-1       ISO-8859-15
#> 11       da/iso-8859-15.txt   DA       ISO-8859-15       ISO-8859-15
#> 12             da/utf-8.txt   DA             UTF-8             UTF-8
#> 13      da/windows-1252.txt   DA      WINDOWS-1252      WINDOWS-1252
#> 14        de/iso-8859-1.txt   DE        ISO-8859-1        ISO-8859-1
#> 15      de/windows-1252.txt   DE      WINDOWS-1252      WINDOWS-1252
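Disagreements between the two columns are easier to spot when filtered out. The sketch below copies rows 9 to 11 of the table above into a small data frame (the name cmp_sample is hypothetical) so that it runs without the package; with the full cmp data frame the same two expressions would apply.

```r
# Rows 9-11 of the comparison table, copied by hand
cmp_sample <- data.frame(
  file     = c("cs/windows-1250.txt", "da/iso-8859-1.txt", "da/iso-8859-15.txt"),
  original = c("WINDOWS-1250", "ISO-8859-1", "ISO-8859-15"),
  uchardet = c("WINDOWS-1250", "ISO-8859-15", "ISO-8859-15"),
  stringsAsFactors = FALSE
)

# Rows where the detected encoding differs from the one in the file name
mismatch <- cmp_sample[cmp_sample$original != cmp_sample$uchardet, ]
mismatch$file
#> [1] "da/iso-8859-1.txt"

# Share of rows where both columns agree
agreement <- mean(cmp_sample$original == cmp_sample$uchardet)
```

In this sample the only disagreement is the Danish ISO-8859-1 file, which is reported as ISO-8859-15, a closely related encoding.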