Perl's Encoding::FixLatin equivalent in PHP
By: squeegee
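I think this is a reasonable port of Perl's Encoding::FixLatin by Grant McLean, which converts a string with mixed encodings (ASCII, ISO-8859-1, CP1252, and UTF-8) to UTF-8. Byte sequences that already look like UTF-8 are passed through unchanged; any other byte above 0x7F is treated as CP1252 (falling back to ISO-8859-1 where CP1252 defines no character) and re-encoded as UTF-8.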
<?php
function init_byte_map(){
    global $byte_map;
    // Default: treat every high byte as ISO-8859-1.
    // (utf8_encode() is deprecated as of PHP 8.2; mb_convert_encoding() does the same job.)
    for($x=128;$x<256;++$x){
        $byte_map[chr($x)]=mb_convert_encoding(chr($x),'UTF-8','ISO-8859-1');
    }
    // Override the 0x80-0x9F bytes that CP1252 assigns to printable characters.
    $cp1252_map=array(
        "\x80" => "\xE2\x82\xAC", // EURO SIGN
        "\x82" => "\xE2\x80\x9A", // SINGLE LOW-9 QUOTATION MARK
        "\x83" => "\xC6\x92",     // LATIN SMALL LETTER F WITH HOOK
        "\x84" => "\xE2\x80\x9E", // DOUBLE LOW-9 QUOTATION MARK
        "\x85" => "\xE2\x80\xA6", // HORIZONTAL ELLIPSIS
        "\x86" => "\xE2\x80\xA0", // DAGGER
        "\x87" => "\xE2\x80\xA1", // DOUBLE DAGGER
        "\x88" => "\xCB\x86",     // MODIFIER LETTER CIRCUMFLEX ACCENT
        "\x89" => "\xE2\x80\xB0", // PER MILLE SIGN
        "\x8A" => "\xC5\xA0",     // LATIN CAPITAL LETTER S WITH CARON
        "\x8B" => "\xE2\x80\xB9", // SINGLE LEFT-POINTING ANGLE QUOTATION MARK
        "\x8C" => "\xC5\x92",     // LATIN CAPITAL LIGATURE OE
        "\x8E" => "\xC5\xBD",     // LATIN CAPITAL LETTER Z WITH CARON
        "\x91" => "\xE2\x80\x98", // LEFT SINGLE QUOTATION MARK
        "\x92" => "\xE2\x80\x99", // RIGHT SINGLE QUOTATION MARK
        "\x93" => "\xE2\x80\x9C", // LEFT DOUBLE QUOTATION MARK
        "\x94" => "\xE2\x80\x9D", // RIGHT DOUBLE QUOTATION MARK
        "\x95" => "\xE2\x80\xA2", // BULLET
        "\x96" => "\xE2\x80\x93", // EN DASH
        "\x97" => "\xE2\x80\x94", // EM DASH
        "\x98" => "\xCB\x9C",     // SMALL TILDE
        "\x99" => "\xE2\x84\xA2", // TRADE MARK SIGN
        "\x9A" => "\xC5\xA1",     // LATIN SMALL LETTER S WITH CARON
        "\x9B" => "\xE2\x80\xBA", // SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
        "\x9C" => "\xC5\x93",     // LATIN SMALL LIGATURE OE
        "\x9E" => "\xC5\xBE",     // LATIN SMALL LETTER Z WITH CARON
        "\x9F" => "\xC5\xB8"      // LATIN CAPITAL LETTER Y WITH DIAERESIS
    );
    foreach($cp1252_map as $k=>$v){
        $byte_map[$k]=$v;
    }
}
function fix_latin($instr){
    // No need for the rest if it's all valid UTF-8 already.
    if(mb_check_encoding($instr,'UTF-8'))return $instr;
    global $nibble_good_chars,$byte_map;
    $outstr='';
    $char='';
    $rest='';
    while(strlen($instr)>0){
        if(1==preg_match($nibble_good_chars,$instr,$match)){
            // Leading run of ASCII or one UTF-8 sequence: copy it through unchanged.
            $char=$match[1];
            $rest=$match[2];
            $outstr.=$char;
        }elseif(1==preg_match('@^(.)(.*)$@s',$instr,$match)){
            // Any other single byte is looked up in the CP1252/ISO-8859-1 map.
            $char=$match[1];
            $rest=$match[2];
            $outstr.=$byte_map[$char];
        }
        $instr=$rest;
    }
    return $outstr;
}
$byte_map=array();
init_byte_map();
// Byte-level patterns (no /u modifier, so PCRE matches raw bytes).
$ascii_char='[\x00-\x7F]';
$cont_byte='[\x80-\xBF]';
$utf8_2='[\xC0-\xDF]'.$cont_byte;
$utf8_3='[\xE0-\xEF]'.$cont_byte.'{2}';
$utf8_4='[\xF0-\xF7]'.$cont_byte.'{3}';
$utf8_5='[\xF8-\xFB]'.$cont_byte.'{4}';
// Group 1: a run of ASCII or one UTF-8 sequence; group 2: the rest of the string.
$nibble_good_chars = "@^($ascii_char+|$utf8_2|$utf8_3|$utf8_4|$utf8_5)(.*)$@s";
?>
Then just call fix_latin wherever you need it.
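To make the behaviour concrete, here is a minimal self-contained sketch of the same idea applied to a mixed-encoding string. The name fix_latin_sketch is hypothetical and this is not the port above: it assumes mbstring's 'Windows-1252' table is an acceptable stand-in for the hand-written byte map (so bytes that CP1252 leaves undefined, such as 0x81, are handled however mbstring handles them), and it skips the 5-byte pattern.

```php
<?php
// Hypothetical helper for illustration only; not part of the original port.
function fix_latin_sketch(string $input): string
{
    // Tried in order at each position (byte mode, no /u modifier):
    //   a run of ASCII, a 2-, 3- or 4-byte UTF-8-shaped sequence,
    //   or, failing those, any single byte from 0x80 to 0xFF.
    $pattern = '/[\x00-\x7F]+'
             . '|[\xC0-\xDF][\x80-\xBF]'
             . '|[\xE0-\xEF][\x80-\xBF]{2}'
             . '|[\xF0-\xF7][\x80-\xBF]{3}'
             . '|[\x80-\xFF]/';

    $result = preg_replace_callback($pattern, function (array $m): string {
        $chunk = $m[0];
        // A lone high byte is assumed to be CP1252 and is re-encoded.
        if (strlen($chunk) === 1 && ord($chunk) >= 0x80) {
            return mb_convert_encoding($chunk, 'UTF-8', 'Windows-1252');
        }
        // ASCII runs and UTF-8-shaped sequences pass through unchanged.
        return $chunk;
    }, $input);

    // preg_replace_callback() returns null on a PCRE error; fall back to the input.
    return $result ?? $input;
}

// CP1252 curly quotes, a UTF-8 e-acute, and an ISO-8859-1 i-diaeresis in one string.
$mixed = "\x93caf\xC3\xA9\x94 na\xEFve";
$fixed = fix_latin_sketch($mixed);
var_dump(mb_check_encoding($fixed, 'UTF-8')); // bool(true)
```

With the tutorial's code loaded instead, the call has the same shape: fix_latin($mixed).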