Extracting a substring from a UTF-8 string in PHP

If you’re working with strings encoded as UTF-8, you may lose characters when you extract part of them with the PHP substr function. substr works on bytes, so if the string is cut in the middle of a non-ASCII character, the leftover bytes are invalid and usually show up as question marks or replacement characters (�) in the resulting substring.

Here’s an example:

// Assumes this file is saved as UTF-8
$str1 = "Feliz día";
$str2 = substr($str1, 0, 8); // 8 bytes: "Feliz d" plus the first byte of "í"
echo $str2; // will output Feliz d�

This happens because UTF-8 is a variable-length encoding: each Unicode character takes between 1 and 4 bytes. substr counts bytes, not characters, and "í" takes 2 bytes, so the eighth byte falls in the middle of it.

A safe way of cutting these strings without losing anything is to use the mb_substr PHP function instead. It works almost the same way as substr, but its offsets count characters rather than bytes, and it takes an optional fourth parameter that specifies the encoding, whether that is UTF-8 or something else.

Example #2

$str3 = mb_substr($str1, 0, 8, 'UTF-8'); // 8 characters
echo $str3; // will output Feliz dí
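
The difference between counting bytes and counting characters can be checked directly. The following is a minimal sketch, assuming the script file is saved as UTF-8 and the mbstring extension is loaded; the variable names are only for illustration.

```php
<?php
// Assumption: this file is saved as UTF-8 and mbstring is available.
$text = "Feliz día";

$byteCount = strlen($text);              // counts bytes
$charCount = mb_strlen($text, 'UTF-8');  // counts characters

echo "bytes: $byteCount, characters: $charCount\n"; // bytes: 10, characters: 9

// Split into characters (the /u flag makes PCRE treat the string as UTF-8)
// and print how many bytes each one occupies.
foreach (preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY) as $char) {
    echo $char, ' => ', strlen($char), "\n";
}

// A byte-based cut can leave an invalid sequence; a character-based cut does not.
$byteCut = substr($text, 0, 8);
$charCut = mb_substr($text, 0, 8, 'UTF-8');
var_dump(mb_check_encoding($byteCut, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($charCut, 'UTF-8')); // bool(true)
```

Only "í" reports 2 bytes here; every other character in the string is plain ASCII and takes 1 byte, which is why the byte count is one higher than the character count.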

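As of PHP 5.3 you can also declare the encoding directive, but it does not make substr safe: declare(encoding='UTF-8') only tells PHP how the script file itself is encoded (and only when zend.multibyte is enabled), and substr still counts bytes. What you can do is set the default encoding once with mb_internal_encoding and then call mb_substr without the fourth parameter.

Example #3

mb_internal_encoding('UTF-8');
$str4 = "Feliz día";
$str5 = mb_substr($str4, 0, 8);
echo $str5; // will output Feliz dí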
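
One option that does not depend on any script-wide or global encoding setting is a small wrapper that always passes the encoding explicitly. This is only a sketch: utf8_substr is a hypothetical helper name, not a PHP built-in, and the nullable type hint assumes PHP 7.1 or later with the mbstring extension loaded.

```php
<?php
// Hypothetical helper (not part of PHP): cut by characters, always as UTF-8.
function utf8_substr(string $text, int $start, ?int $length = null): string
{
    // Refuse input that is not valid UTF-8 instead of silently mangling it.
    if (!mb_check_encoding($text, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    // A null length means "up to the end of the string".
    return mb_substr($text, $start, $length, 'UTF-8');
}

// Assumes this file is saved as UTF-8.
echo utf8_substr("Feliz día", 0, 8), "\n"; // Feliz dí
echo utf8_substr("Feliz día", -3), "\n";   // día
```

Pinning the encoding inside the helper means the result stays the same even if other code changes the internal encoding later in the request.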

One comment on “Extracting a substring from a UTF-8 string in PHP”
  1. jiguro says:

    Many thanks for this post!
    For some reason, mb_substr was returning the whole string, not the substring it was supposed to return, but mb_strcut did the trick.
    You really helped me out with this, cheers! :)
