If you’re working with UTF-8 encoded strings, you may lose characters when you extract part of them with the PHP substr function. substr counts bytes, not characters, so if the string is cut in the middle of a non-ASCII character you can end up with a question mark or replacement character at the end of the resulting substring.
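Here’s an example:
$str1 = utf8_encode("Feliz día"); // assumes the script file is saved as ISO-8859-1; utf8_encode is deprecated as of PHP 8.2
$str2 = substr($str1, 0, 8); // 8 bytes: "Feliz d" plus only the first byte of "í"
echo utf8_decode($str2); // will output Feliz d?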
This happens because UTF-8 is a variable-length encoding: each Unicode character takes between 1 and 4 bytes. ASCII characters take one byte, while í takes two, so "Feliz día" is 9 characters long but 10 bytes long.
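To make the byte counts concrete, here is a minimal sketch comparing byte length with character length. It assumes PHP 7.0 or later (for the \u{...} escape, which avoids depending on how the script file is saved) and that the mbstring extension is loaded; the variable names are only illustrative.

```php
<?php
// Assumes PHP 7.0+ and the mbstring extension.
$text = "Feliz d\u{00ED}a"; // "Feliz día" encoded as UTF-8

$byteLength = strlen($text);             // strlen counts bytes
$charLength = mb_strlen($text, 'UTF-8'); // mb_strlen counts characters

echo $byteLength, "\n";         // 10
echo $charLength, "\n";         // 9
echo bin2hex("\u{00ED}"), "\n"; // c3ad, the two bytes that make up "í"
```

The one-byte difference between the two lengths is the second byte of í, which is exactly what a byte-based cut can split off.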
A safe way of cutting these strings without losing anything is to use the mb_substr PHP function instead. It works almost the same way as substr, but it counts characters rather than bytes, and it accepts an optional fourth parameter specifying the encoding, whether that is UTF-8 or something else.
$str3 = mb_substr($str1, 0, 8, 'UTF-8'); // 8 characters, not 8 bytes
echo utf8_decode($str3); // will output Feliz dí
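One way to see the difference without relying on how a terminal or browser renders broken bytes is to validate both results. The following is a rough sketch, again assuming PHP 7.0 or later and the mbstring extension; $byteCut and $charCut are hypothetical names chosen for this illustration.

```php
<?php
// Assumes PHP 7.0+ and the mbstring extension.
$text = "Feliz d\u{00ED}a"; // "Feliz día" encoded as UTF-8

$byteCut = substr($text, 0, 8);             // 8 bytes: stops halfway through "í"
$charCut = mb_substr($text, 0, 8, 'UTF-8'); // 8 characters: "Feliz dí"

var_dump(mb_check_encoding($byteCut, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($charCut, 'UTF-8')); // bool(true)
var_dump(strlen($charCut));                     // int(9)
```

The byte-based cut is no longer valid UTF-8, while the character-based cut is valid and occupies 9 bytes for its 8 characters.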
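As of PHP 5.3 you can also declare the script's encoding with the encoding directive, but this does not make substr character-aware. The directive only tells PHP how the script file itself is encoded, and substr still counts bytes. The following snippet appears to work only because the first 9 bytes of the string happen to end exactly after the í:
declare(encoding='UTF-8'); // must be the first statement in the script
$str4 = "Feliz día"; // assumes the script file is saved as UTF-8
$str5 = substr($str4, 0, 9); // 9 bytes, which is 8 characters here
echo $str5; // will output Feliz dí, but a length of 8 would split the í again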