PHP Manual - Detect character encoding

User comments:

yinmj (11-Jan-2012 04:27)

For Chinese developers: please note that the second argument of this function DOES NOT include 'GB2312' and 'GBK' and the return value is 'EUC-CN' when it is detected as a GB2312 string.

Gerg Tisza (18-Feb-2011 11:43)

If you want to use mb_detect_encoding to check whether a string is valid UTF-8, use strict mode; without it the check is pretty worthless.
<?php
$str = 'áéóú'; // ISO-8859-1
mb_detect_encoding($str, 'UTF-8'); // 'UTF-8'
mb_detect_encoding($str, 'UTF-8', true); // false
?>
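When the goal is only to validate a string as UTF-8, rather than to choose among several candidate encodings, mbstring's mb_check_encoding() is a direct alternative to strict-mode detection. A minimal sketch, assuming the mbstring extension is loaded; the helper name looks_like_utf8 is made up for illustration:

```php
<?php
// Hypothetical helper: true only if $bytes is a well-formed UTF-8 byte string.
function looks_like_utf8(string $bytes): bool
{
    return mb_check_encoding($bytes, 'UTF-8');
}

$latin1 = "\xE1\xE9\xF3\xFA";   // 'áéóú' encoded as ISO-8859-1
$utf8   = "\xC3\xA1\xC3\xA9";   // 'áé' encoded as UTF-8

var_dump(looks_like_utf8($latin1));   // bool(false)
var_dump(looks_like_utf8($utf8));     // bool(true)
```

The byte strings are written as hex escapes so the result does not depend on how the source file itself is saved.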

jimmy at powerzone dot dk (31-Jan-2010 08:15)

I needed a function that could tell whether a string is ISO-8859-1 compatible. Unfortunately, I could not get mb_detect_encoding to do this. The following function does the trick.
<?php
function IsLatin1($str)
{
    return (preg_match("/^[\\x00-\\xFF]*$/u", $str) === 1);
}

var_dump(IsLatin1("abc ABC 123")); // true
var_dump(IsLatin1("abc € 123")); // false (because of €)
?>

nat3738 at gmail dot com (22-May-2009 11:58)

A simple way to detect the UTF-8/16/32 encoding of a file by its BOM (this does not work for a string or file without a BOM).
<?php
// The Unicode BOM is U+FEFF; once encoded, it looks like this.
define('UTF32_BIG_ENDIAN_BOM',    chr(0x00) . chr(0x00) . chr(0xFE) . chr(0xFF));
define('UTF32_LITTLE_ENDIAN_BOM', chr(0xFF) . chr(0xFE) . chr(0x00) . chr(0x00));
define('UTF16_BIG_ENDIAN_BOM',    chr(0xFE) . chr(0xFF));
define('UTF16_LITTLE_ENDIAN_BOM', chr(0xFF) . chr(0xFE));
define('UTF8_BOM',                chr(0xEF) . chr(0xBB) . chr(0xBF));

function detect_utf_encoding($filename)
{
    $text = file_get_contents($filename);
    $first2 = substr($text, 0, 2);
    $first3 = substr($text, 0, 3);
    $first4 = substr($text, 0, 4); // was substr($text, 0, 3), which can never match a 4-byte BOM

    // UTF-32 is checked before UTF-16 because the UTF-32LE BOM starts with the UTF-16LE BOM.
    if ($first3 == UTF8_BOM) return 'UTF-8';
    elseif ($first4 == UTF32_BIG_ENDIAN_BOM) return 'UTF-32BE';
    elseif ($first4 == UTF32_LITTLE_ENDIAN_BOM) return 'UTF-32LE';
    elseif ($first2 == UTF16_BIG_ENDIAN_BOM) return 'UTF-16BE';
    elseif ($first2 == UTF16_LITTLE_ENDIAN_BOM) return 'UTF-16LE';
    return null; // no BOM found
}
?>
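The same BOM check can be applied to a string that is already in memory. The following is only a sketch of that idea: the function name detect_bom and the table layout are assumptions for illustration, not part of the comment above.

```php
<?php
// Hypothetical helper: returns the encoding named by a leading BOM, or null if there is none.
function detect_bom(string $bytes): ?string
{
    // Longer BOMs come first: the UTF-32LE BOM begins with the UTF-16LE BOM.
    $boms = [
        'UTF-32BE' => "\x00\x00\xFE\xFF",
        'UTF-32LE' => "\xFF\xFE\x00\x00",
        'UTF-8'    => "\xEF\xBB\xBF",
        'UTF-16BE' => "\xFE\xFF",
        'UTF-16LE' => "\xFF\xFE",
    ];
    foreach ($boms as $encoding => $bom) {
        // strncmp() is binary safe, so the NUL bytes in the UTF-32 BOMs are compared too.
        if (strncmp($bytes, $bom, strlen($bom)) === 0) {
            return $encoding;
        }
    }
    return null;
}

var_dump(detect_bom("\xEF\xBB\xBFhello"));   // string(5) "UTF-8"
var_dump(detect_bom("\xFF\xFE\x00\x00"));    // string(8) "UTF-32LE"
var_dump(detect_bom("plain ASCII"));         // NULL
```

For a file, it should be enough to read only the first four bytes, for example with file_get_contents($filename, false, null, 0, 4), and pass them to the helper. Note that UTF-16LE text whose first character is U+0000 is indistinguishable from a UTF-32LE BOM with this approach.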

prgss at bk dot ru (30-Mar-2009 10:16)

Another lightweight way to detect character encoding:
<?php
function detect_encoding($string)
{
    static $list = array('utf-8', 'windows-1251');

    foreach ($list as $item) {
        // Convert the string to the same encoding; if it survives unchanged, it is valid in that encoding.
        $sample = iconv($item, $item, $string);
        if (md5($sample) == md5($string))
            return $item;
    }
    return null;
}
?>

matthijs at ischen dot nl (28-Mar-2009 06:33)

I seriously underestimated the importance of setlocale...
<?php
$strings = array(
    "mais coisas a pensar sobre diário ou dois!",
    "plus de choses à penser à journalier ou à deux !",
    "¿más cosas a pensar en diario o dos!",
    "più cose da pensare circa giornaliere o due!",
    "flere ting å tenke på hver dag eller to!",
    "Další věcí, přemýšlet o každý den nebo dva!",
    "mehr über Spaß spät schönen",
    "më vonë gjatë fun bukur",
    "több mint szórakozás késő csodálatos kenyér"
);

$convert = array();
setlocale(LC_CTYPE, 'de_DE.UTF-8');
foreach ($strings as $string)
    $convert[] = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $string);
?>
Produces the following:
Array
(
    [0] => mais coisas a pensar sobre diario ou dois!
    [1] => plus de choses a penser a journalier ou a deux !
    [2] => ?mas cosas a pensar en diario o dos!
    [3] => piu cose da pensare circa giornaliere o due!
    [4] => flere ting aa tenke paa hver dag eller to!
    [5] => Dalsi veci, premyslet o kazdy den nebo dva!
    [6] => mehr ueber Spass spaet schoenen
    [7] => me vone gjate fun bukur
    [8] => toebb mint szorakozas keso csodalatos kenyer
)
whereas
<?php
$convert = array();
setlocale(LC_CTYPE, 'nl_NL.UTF-8');
foreach ($strings as $string)
    $convert[] = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $string);
?>
produces:
Array
(
    [0] => mais coisas a pensar sobre di?rio ou dois!
    [1] => plus de choses ? penser ? journalier ou ? deux !
    [2] => ?m?s cosas a pensar en diario o dos!
    [3] => pi? cose da pensare circa giornaliere o due!
    [4] => flere ting ? tenke p? hver dag eller to!
    [5] => Dal?? v?c?, p?em??let o ka?d? den nebo dva!
    [6] => mehr ?ber Spass sp?t sch?nen
    [7] => m? von? gjat? fun bukur
    [8] => t?bb mint sz?rakoz?s k?s? csod?latos keny?r
)
This might be of interest when converting UTF-8 strings into ASCII suitable for URLs and the like. It was never obvious to me, since I had only used the us and nl locales.

dennis at nikolaenko dot ru (06-Oct-2008 05:18)

Beware of a bug in the detection of Russian encodings: http://bugs.php.net/bug.php?id=38138

hmdker at gmail dot com (24-Aug-2008 05:58)

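A function to detect UTF-8; it may be useful when mb_detect_encoding is not available.
<?php
function is_utf8($str)
{
    $c = 0; $b = 0;
    $bits = 0;
    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        $c = ord($str[$i]);
        // >= 128 rather than > 128: byte 0x80 is a continuation byte, not ASCII
        if ($c >= 128) {
            // The lead byte tells how many bytes the sequence has
            if (($c >= 254)) return false;
            elseif ($c >= 252) $bits = 6;
            elseif ($c >= 248) $bits = 5;
            elseif ($c >= 240) $bits = 4;
            elseif ($c >= 224) $bits = 3;
            elseif ($c >= 192) $bits = 2;
            else return false; // stray continuation byte
            if (($i + $bits) > $len) return false; // sequence runs past the end
            // Every following byte must be a continuation byte (10xxxxxx)
            while ($bits > 1) {
                $i++;
                $b = ord($str[$i]);
                if ($b < 128 || $b > 191) return false;
                $bits--;
            }
        }
    }
    return true;
}
?>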

yaqy at qq dot com (21-Jul-2008 06:14)

<?php
/*
 * QQ: 290359552
 * Convert to UTF-8 if $str is not detected as 'UTF-8'.
 * Anything that is not UTF-8 is assumed to be GBK.
 */
function convToUtf8($str)
{
    if (mb_detect_encoding($str, "UTF-8, ISO-8859-1, GBK") != "UTF-8") {
        return iconv("gbk", "utf-8", $str);
    } else {
        return $str;
    }
}
?>

hoermann dot j at gmail dot com (20-Mar-2008 01:35)

Referring to the bug in mb_detect_encoding described by telemach (http://de2.php.net/manual/de/function.mb-detect-encoding.php#55228), I want to offer a simple workaround. Because
<?php mb_detect_encoding('accentué', 'UTF-8, ISO-8859-1'); ?>
gives the wrong result (UTF-8), but
<?php mb_detect_encoding('accentuée', 'UTF-8, ISO-8859-1'); ?>
does not, you should always append an ISO-8859-1 character to your string for this check. Do this:
<?php mb_detect_encoding($myVal . 'a', 'UTF-8, ISO-8859-1'); ?>
This avoids the situation in which the error occurs and does not modify your variable. It will also keep working if the function is fixed one day.

mark at kinoko dot fr (12-Oct-2007 03:56)

For: rl at itfigures dot nl
Just note that your Euro symbol being \x80 is NOT standard for ISO-8859-1 or ISO-8859-15, as \x80 is a reserved character there. It is, however, "common practice" for Windows developers to mix up windows-1252 and ISO-8859-1. Just convert to windows-1252 instead of ISO-8859-1 and you'll get your € symbol in the right place.
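The difference is easy to see by converting a single euro sign and printing the resulting byte. This is a small sketch that assumes the mbstring extension is loaded and uses mb_convert_encoding() rather than iconv; the variable names are illustrative only.

```php
<?php
$euro = "\xE2\x82\xAC";   // '€' (U+20AC) encoded as UTF-8

// windows-1252 assigns byte 0x80 to the euro sign.
$cp1252 = bin2hex(mb_convert_encoding($euro, 'Windows-1252', 'UTF-8'));
echo $cp1252, "\n";       // 80

// ISO-8859-15 also has a euro sign, but at byte 0xA4.
$latin9 = bin2hex(mb_convert_encoding($euro, 'ISO-8859-15', 'UTF-8'));
echo $latin9, "\n";       // a4
```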

rl at itfigures dot nl (04-Sep-2007 10:00)

I used Chris's function "detectUTF8" to detect the need for conversion from UTF-8 to 8859-1, which works fine. I did have a problem with the iconv conversion that follows. The iconv conversion to 8859-1 (with //TRANSLIT) replaces the euro sign with EUR, although it is common practice to use \x80 as the euro sign in the 8859-1 charset. I could not use 8859-15 since that mangled some other characters, so I added two str_replace calls:
if (detectUTF8($str)) {
    $str = str_replace("\xE2\x82\xAC", "&euro;", $str);
    $str = iconv("UTF-8", "ISO-8859-1//TRANSLIT", $str);
    $str = str_replace("&euro;", "\x80", $str);
}
If HTML output is needed, the last line is not necessary (and is even unwanted).

sunggsun (15-Aug-2006 08:26)

From PHPDIG:
function isUTF8($str)
{
    // A valid UTF-8 string survives a round trip through UTF-32 unchanged.
    if ($str === mb_convert_encoding(mb_convert_encoding($str, "UTF-32", "UTF-8"), "UTF-8", "UTF-32")) {
        return true;
    } else {
        return false;
    }
}

chris AT w3style.co DOT uk (03-Aug-2006 10:22)

Based on the snippet below that uses preg_match(), I needed something faster and less specific. That function works and is brilliant, but it scans the entire string and checks that it conforms to UTF-8. I wanted something purely to check whether a string contains UTF-8 characters, so that I could switch the character encoding from iso-8859-1 to utf-8. I modified the pattern to look only for non-ASCII multibyte sequences in the UTF-8 range and to stop once it finds at least one multibyte sequence. This is quite a lot faster.
<?php
function detectUTF8($string)
{
    return preg_match('%(?:
        [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        |\xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
        |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
        |\xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
        |\xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
        |[\xF1-\xF3][\x80-\xBF]{3}         # planes 4-15
        |\xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
        )+%xs', $string);
}
?>

telemach (28-Jul-2005 02:48)

Beware: even if you need to distinguish between UTF-8 and ISO-8859-1, and you use the following detection order (as Chrigu suggests),
mb_detect_encoding('accentuée', 'UTF-8, ISO-8859-1')
returns ISO-8859-1, while
mb_detect_encoding('accentué', 'UTF-8, ISO-8859-1')
returns UTF-8.
Bottom line: a trailing 'é' (and probably other accented characters) misleads mb_detect_encoding.

Chrigu (29-Mar-2005 03:32)

If you need to distinguish between UTF-8 and ISO-8859-1 encoding, list UTF-8 first in your encoding_list:
mb_detect_encoding($string, 'UTF-8, ISO-8859-1');
If you list ISO-8859-1 first, mb_detect_encoding() will always return ISO-8859-1.

php-note-2005 at ryandesign dot com (17-Feb-2005 03:57)

A much simpler UTF-8-ness checker, using a regular expression created by the W3C:
<?php
// Returns true if $string is valid UTF-8 and false otherwise.
function is_utf8($string)
{
    // From http://w3.org/International/questions/qa-forms-utf-8.html
    // preg_match() returns 1 or 0, so compare against 1 to get a real boolean.
    return preg_match('%^(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*$%xs', $string) === 1;
} // function is_utf8
?>
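A shorter check that is sometimes used instead relies on PCRE itself: with the /u modifier, preg_match() refuses a subject that is not valid UTF-8 and returns false. The sketch below illustrates that technique; the function name is_utf8_pcre is made up here, and this is a different approach from the W3C pattern above.

```php
<?php
// Hypothetical helper: lets PCRE's own UTF-8 validation do the work.
function is_utf8_pcre(string $bytes): bool
{
    // The empty pattern always matches a valid subject (result 1);
    // on a malformed UTF-8 subject preg_match() returns false instead.
    return preg_match('//u', $bytes) === 1;
}

var_dump(is_utf8_pcre("caf\xC3\xA9"));   // bool(true):  'é' as UTF-8
var_dump(is_utf8_pcre("caf\xE9"));       // bool(false): 'é' as ISO-8859-1
```

Unlike the W3C expression, this accepts all ASCII control characters, not only tab, line feed and carriage return, so the two checks are not interchangeable for form validation.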

jaaks at playtech dot com (14-Jan-2005 08:27)

The last example for verifying UTF-8 (valid_utf8) has one small bug. If a 10xxxxxx byte occurs alone, i.e. not inside a multibyte character, it is accepted even though this is against the UTF-8 rules. Make the following replacement to repair it. Replace
} // goto next char
with
} else {
    return false; // 10xxxxxx occurring alone
} // goto next char

maarten (12-Jan-2005 11:55)

Sometimes mb_detect_encoding is not what you need. When using pdflib, for example, you want to VERIFY the correctness of UTF-8. mb_detect_encoding reports some iso-8859-1 encoded text as utf-8. To verify UTF-8, use the following:
//
// utf8 encoding validation developed based on Wikipedia entry at:
// http://en.wikipedia.org/wiki/UTF-8
//
// Implemented as a recursive descent parser based on a simple state machine
// copyright 2005 Maarten Meijer
//
// This cries out for a C-implementation to be included in PHP core
//
function valid_1byte($char) {
    if(!is_int($char)) return false;
    return ($char & 0x80) == 0x00;
}

function valid_2byte($char) {
    if(!is_int($char)) return false;
    return ($char & 0xE0) == 0xC0;
}

function valid_3byte($char) {
    if(!is_int($char)) return false;
    return ($char & 0xF0) == 0xE0;
}

function valid_4byte($char) {
    if(!is_int($char)) return false;
    return ($char & 0xF8) == 0xF0;
}

function valid_nextbyte($char) {
    if(!is_int($char)) return false;
    return ($char & 0xC0) == 0x80;
}

function valid_utf8($string) {
    $len = strlen($string);
    $i = 0;
    while( $i < $len ) {
        $char = ord(substr($string, $i++, 1));
        if(valid_1byte($char)) { // continue
            continue;
        } else if(valid_2byte($char)) { // check 1 byte
            if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                return false;
        } else if(valid_3byte($char)) { // check 2 bytes
            if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                return false;
            if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                return false;
        } else if(valid_4byte($char)) { // check 3 bytes
            if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                return false;
            if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                return false;
            if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                return false;
        } // goto next char
    }
    return true; // done
}
For a drawing of the state machine, see http://www.xs4all.nl/~mjmeijer/unicode.png and http://www.xs4all.nl/~mjmeijer/unicode2.png
