Textcube : Brand yourself!

Cyrus Hackford · 2009-03-04 01:07:39

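Topic: Introducing a faster version of the UTF8::lessenAsByte method.

Hello, this is Cyrus (formerly Ikaris C. Faust).

Here is a faster version of lessenAsByte from the UTF8 class that Textcube currently uses.
I wrote it for my own use, but I'm tossing it out here in case it's useful.

A speed comparison between this method and lessenAsByte is available at http://cyrush.com/4db/library/test.php.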

public static function Slice($siclTextStr,$siclLengthInt=255) {
    if(strlen($siclTextStr)<=$siclLengthInt)
        return $siclTextStr;
    
    $siclVerifyInt=ord($siclTextStr[$siclLengthInt-1])>>6;
    if(($siclVerifyInt>>1)===0) // Single-byte (ASCII) character.
        $siclLocationSubtInt=0;
    elseif($siclVerifyInt===3) // Lead byte of a multi-byte character.
        $siclLocationSubtInt=1;
    elseif($siclVerifyInt===2) { // Continuation byte of a multi-byte character.
        if(isset($siclTextStr[$siclLengthInt])===false) { // End of the string
            for($siclLoopInt=2;true;$siclLoopInt++) { // Seek backwards for the lead byte.
                if((ord($siclTextStr[$siclLengthInt-$siclLoopInt])>>6)===2)
                    continue;
                else
                    break;
            }
            $siclVerifyLengthInt=ord($siclTextStr[$siclLengthInt-$siclLoopInt])>>4;
            if($siclVerifyLengthInt>=0 && $siclVerifyLengthInt<=7) // Broken byte.
                $siclLocationSubtInt=$siclLoopInt-1;
            else {
                switch($siclVerifyLengthInt) { // Identify the length of current character.
                    case 12:
                    case 13:
                        $siclVerifiedLengthInt=2;
                        break;
                    case 14:
                        $siclVerifiedLengthInt=3;
                        break;
                    case 15:
                        $siclVerifiedLengthInt=4;
                        break;
                }
                if($siclLoopInt!==$siclVerifiedLengthInt) // We're in the middle of the character.
                    $siclLocationSubtInt=$siclLoopInt;
                else // The byte we're verifying is the last byte of the character.
                    $siclLocationSubtInt=0;
            }
            unset($siclLoopInt,$siclVerifiedLengthInt);
        } elseif((ord($siclTextStr[$siclLengthInt])>>6)!==2) // Last byte of the character.
            $siclLocationSubtInt=0;
        else {
            for($siclLoopInt=2;true;$siclLoopInt++) { // Seeking for head byte.
                if((ord($siclTextStr[$siclLengthInt-$siclLoopInt])>>6)===2)
                    continue;
                else
                    break;
            }
            $siclLocationSubtInt=$siclLoopInt;
            unset($siclLoopInt);
        }
    }
    $siclSlicedStr=substr($siclTextStr,0,$siclLengthInt-$siclLocationSubtInt);
    unset($siclTextStr,$siclLengthInt,$siclLocationSubtInt);
    
    return $siclSlicedStr;
}

inureyes · 2009-03-05 11:27:44

Re: Introducing a faster version of the UTF8::lessenAsByte method.

Oh, thank you ^^ I'll apply it after a code review!

"Everything looks different on the other side."

-Ian Malcolm, from Michael Crichton's 'Jurassic Park'

Cyrus Hackford · 2009-03-05 23:41:04

Re: Introducing a faster version of the UTF8::lessenAsByte method.

Hold on!

Here is new code!

This is a simpler version with a few fewer ifs. It turned out the extra ifs weren't needed.

public static function Slice($siclTextStr,$siclLengthInt=255) {
    if(strlen($siclTextStr)<=$siclLengthInt)
        return $siclTextStr;
    
    $siclVerifyInt=ord($siclTextStr[$siclLengthInt-1])>>6;
    if(($siclVerifyInt>>1)===0) // Single-byte (ASCII) character.
        $siclLocationSubtInt=0;
    elseif($siclVerifyInt===3) // Lead byte of a multi-byte character.
        $siclLocationSubtInt=1;
    elseif($siclVerifyInt===2) { // Continuation byte of a multi-byte character.
        for($siclLoopInt=2;true;$siclLoopInt++) { // Seek backwards for the lead byte.
            if((ord($siclTextStr[$siclLengthInt-$siclLoopInt])>>6)!==2)
                break;
        }
        switch(ord($siclTextStr[$siclLengthInt-$siclLoopInt])>>4) { // Expected length of the character, from the lead byte's high nibble.
            case 12:
            case 13:
                $siclVerifiedLengthInt=2;
                break;
            case 14:
                $siclVerifiedLengthInt=3;
                break;
            case 15:
                $siclVerifiedLengthInt=4;
                break;
            default: // Not a valid lead byte (malformed input).
                $siclVerifiedLengthInt=0;
                break;
        }
        if($siclLoopInt===$siclVerifiedLengthInt) // The byte we're verifying is the last byte of the character.
            $siclLocationSubtInt=0;
        else
            $siclLocationSubtInt=$siclLoopInt;
        unset($siclLoopInt,$siclVerifiedLengthInt);
    }
    $siclSlicedStr=substr($siclTextStr,0,$siclLengthInt-$siclLocationSubtInt);
    unset($siclTextStr,$siclLengthInt,$siclLocationSubtInt);
    
    return $siclSlicedStr;
}
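
For readers following the switch on the lead byte's high nibble, here is a small illustrative sketch of that mapping. It assumes PHP 7 or later, and `utf8SequenceLength` is a hypothetical helper name, not something from Textcube or the posted method.

```php
<?php
// Hypothetical helper (not part of Textcube): expected byte length of a UTF-8
// sequence, derived from the high nibble of its first byte.
function utf8SequenceLength(string $char): int
{
    $nibble = ord($char[0]) >> 4;
    if ($nibble <= 7) {
        return 1; // 0xxxxxxx: ASCII
    }
    if ($nibble === 12 || $nibble === 13) {
        return 2; // 110xxxxx
    }
    if ($nibble === 14) {
        return 3; // 1110xxxx
    }
    if ($nibble === 15) {
        return 4; // 11110xxx
    }
    return 0; // 10xxxxxx: a continuation byte, not a lead byte
}

// One sample character for each sequence length: a, é, 한, 😀
foreach (["a", "\u{00E9}", "\u{D55C}", "\u{1F600}"] as $char) {
    printf("%d-byte sequence, lead nibble %d\n", utf8SequenceLength($char), ord($char[0]) >> 4);
}
```

The nibble values 12 and 13 share a case because a two-byte lead byte is 110xxxxx, so its fourth bit can be either 0 or 1.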

gendoh · 2009-03-06 16:39:48

Re: Introducing a faster version of the UTF8::lessenAsByte method.

Apart from the underflow problem in the for loop, it looks fine.

Cyrus Hackford · 2009-03-07 00:07:10

Re: Introducing a faster version of the UTF8::lessenAsByte method.

$siclLengthInt-$siclLoopInt

Do you mean this part? In that case I should change the for condition from true to $siclLengthInt>=$siclLoopInt.

// FROM
for($siclLoopInt=2;true;$siclLoopInt++) { // Seeking for head byte.
    if((ord($siclTextStr[$siclLengthInt-$siclLoopInt])>>6)!==2)
        break;
}

// TO
for($siclLoopInt=2;$siclLengthInt>=$siclLoopInt;$siclLoopInt++) { // Seeking for head byte.
    if((ord($siclTextStr[$siclLengthInt-$siclLoopInt])>>6)!==2)
        break;
}
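
One thing to watch with the bounded loop: if it ends without reaching `break` (which should only happen on malformed input that starts with continuation bytes), `$siclLoopInt` is left one past `$siclLengthInt`, so the index used after the loop becomes -1. A minimal sketch of one way to keep the backward scan inside the string follows. It is a standalone variant under assumptions (PHP 7 or later, the hypothetical name `sliceUtf8Bytes`), not the code that was eventually merged.

```php
<?php
// Hypothetical standalone variant, not the code that shipped in Textcube.
// Cuts $text to at most $maxBytes bytes without splitting a UTF-8 character.
function sliceUtf8Bytes(string $text, int $maxBytes = 255): string
{
    if ($maxBytes <= 0) {
        return '';
    }
    if (strlen($text) <= $maxBytes) {
        return $text;
    }

    // Walk back from the last byte that fits until a non-continuation byte
    // (anything other than 10xxxxxx) is found, never going below index 0.
    $index = $maxBytes - 1;
    while ($index > 0 && (ord($text[$index]) >> 6) === 2) {
        $index--;
    }

    // Work out how many bytes the character starting at $index should have.
    $lead = ord($text[$index]);
    if ($lead < 0x80) {
        $expected = 1;
    } elseif (($lead >> 5) === 0x06) {
        $expected = 2;
    } elseif (($lead >> 4) === 0x0E) {
        $expected = 3;
    } elseif (($lead >> 3) === 0x1E) {
        $expected = 4;
    } else {
        $expected = 0; // stray continuation byte at index 0, or an invalid lead byte
    }

    // Keep that character only if all of its bytes fit within $maxBytes.
    $end = $index + $expected;
    return substr($text, 0, ($expected > 0 && $end <= $maxBytes) ? $end : $index);
}

$mixed = "ab\u{D55C}cd"; // two ASCII bytes, one 3-byte Hangul syllable, two ASCII bytes
var_dump(sliceUtf8Bytes($mixed, 4)); // the syllable does not fit, so it is dropped whole
var_dump(sliceUtf8Bytes($mixed, 5)); // the syllable fits exactly and is kept
```

The `$index > 0` guard plays the role of the loop condition discussed above, and the final `else` branch covers the case where the scan stops on a byte that cannot start a character.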

Last edited by Cyrus Hackford (2009-03-07 00:12:28)

inureyes · 2009-04-17 14:27:40

Re: Introducing a faster version of the UTF8::lessenAsByte method.

I'm looking at the code. I have one question: won't this approach cause problems when Korean and English text are mixed?

Cyrus Hackford · 2010-01-05 22:33:44

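Re: Introducing a faster version of the UTF8::lessenAsByte method.

inureyes wrote:

I'm looking at the code. I have one question: won't this approach cause problems when Korean and English text are mixed?

Sorry for digging up this relic of a thread.

The advantage of UTF-8 is that the role of each byte is unambiguous, so mixed text never gets confused.
The code checks whether the 255th byte is a header (lead) byte or a body (continuation) byte and cuts accordingly. If the 255th byte is a header byte, it cuts immediately before it. If it is a body byte, the code determines whether that byte is the last byte of its character and acts on the result.

As a result, it is faster than cutting while counting characters from the very beginning.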
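
The header/body distinction can be seen by printing the top two bits of each byte in a mixed string. This is an illustrative sketch assuming PHP 7 or later, and `utf8ByteKind` is a hypothetical helper, not part of the posted code.

```php
<?php
// Hypothetical helper for illustration: classify one byte of a UTF-8 string
// by its top two bits, the same test the posted code performs with ord() >> 6.
function utf8ByteKind(string $byte): string
{
    switch (ord($byte) >> 6) {
        case 0:
        case 1:
            return 'ascii';        // 0xxxxxxx
        case 2:
            return 'continuation'; // 10xxxxxx ("body" byte)
        default:
            return 'lead';         // 11xxxxxx ("header" byte)
    }
}

$text = "A\u{D55C}b"; // Latin letter, Hangul syllable, Latin letter
foreach (str_split($text) as $offset => $byte) {
    printf("%d: 0x%02X %s\n", $offset, ord($byte), utf8ByteKind($byte));
}
```

Because an ASCII byte can never look like a lead or continuation byte, the Latin letters on either side of the Hangul syllable cannot be mistaken for part of it.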

Last edited by Cyrus Hackford (2010-01-05 22:35:04)

inureyes · 2010-01-08 11:28:04

Re: Introducing a faster version of the UTF8::lessenAsByte method.

Yes, I'll try to get this into 1.8.2.

P.S. Since 1.8 and later explicitly require PHP 5 or higher, I'm thinking of adding a branch that detects whether the server properly supports Unicode and, if so, uses PHP's built-in functions. I should build it and measure the performance ^^
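
The post does not say which built-in is meant. One candidate is `mb_strcut` from the mbstring extension, which cuts by byte count without splitting a character. A rough sketch of such a branch follows, with the hypothetical function names `cutUtf8` and `cutUtf8Fallback` and assuming PHP 7 or later and well-formed UTF-8 input.

```php
<?php
// Pure-PHP path used only when mbstring is unavailable (hypothetical helper).
function cutUtf8Fallback(string $text, int $maxBytes): string
{
    if (strlen($text) <= $maxBytes) {
        return $text;
    }
    $cut = max($maxBytes, 0);
    // Move the cut point left while the byte just after it is a continuation
    // byte (10xxxxxx), i.e. while cutting here would split a character.
    while ($cut > 0 && (ord($text[$cut]) >> 6) === 2) {
        $cut--;
    }
    return substr($text, 0, $cut);
}

// Prefer the built-in when the extension is loaded, otherwise fall back.
function cutUtf8(string $text, int $maxBytes = 255): string
{
    if (function_exists('mb_strcut')) {
        return mb_strcut($text, 0, $maxBytes, 'UTF-8');
    }
    return cutUtf8Fallback($text, $maxBytes);
}

$mixed = "ab\u{D55C}cd"; // two ASCII bytes, one 3-byte Hangul syllable, two ASCII bytes
var_dump(cutUtf8($mixed, 4));
```

Checking with `function_exists` at call time keeps the fallback available on servers without mbstring. Whether the built-in is actually faster for a given workload is something to measure, as the post suggests.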

Cyrus Hackford · 2010-01-10 11:49:56

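Re: Introducing a faster version of the UTF8::lessenAsByte method.

inureyes wrote:

I'm thinking of adding a branch that uses PHP's built-in functions. I should build it and measure the performance ^^

Well, in that case you should of course use the PHP built-in function. The speed difference between C and PHP.......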

inureyes · 2010-01-21 15:37:30

Re: Introducing a faster version of the UTF8::lessenAsByte method.

I've applied this to the code of another project, and I'll now port it to 1.8 as well ^^
(Wherever a PHP function can do the job, I wrote it to simply use that.)
