[Pharo-project] invalid utf8 input detected

Adrian Lienhard adi at netstyle.ch
Sat May 23 20:49:06 CEST 2009

Wow, great analysis, Nicolas!

I have been trying to find the cause for several hours now. Your third
track exactly matches my findings.

For example, in Object>>#doesNotUnderstand:, prior to the condensing,
the source contained a non-ASCII character (UTF8 encoded as the two
bytes 192 160). This gets correctly transferred during the condensing
into the new changes file. When you don't save the image (and hence
have the standard stream without UTF8 encoder), what you see in the
source is the character with code 192, followed by 160. That is, we
suddenly have two characters, 192 and 160, where before there was just
one. If you load a package, MC will compare methods and think this is a
change. When loading the method from the MC file, the source is UTF8
encoded, producing the Unicode character 160. When storing this source
to the file (still without the encoder), it will just directly put the
byte 160 there. At this point we have lost the leading byte 192. Next
time we start or save the image and have the right encoder again, it
will choke because 160 is an invalid first byte in UTF8.
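
The sequence above can be replayed outside the image. What follows is
only a sketch: it uses Python's codecs as stand-ins for the stream
classes (an assumption, the real code path is Smalltalk), and it takes
code point 160 as the non-ASCII character. Python's UTF-8 codec encodes
that code point as the bytes 194 160, so the lead byte printed here is
194.

```python
# Sketch of the corruption path. Python codecs stand in for the image's
# file streams; this is an illustration, not the Pharo implementation.

nbsp = "\u00a0"  # the non-ASCII character, code point 160

# 1. Condensing writes the method source through a UTF-8 encoder.
condensed = nbsp.encode("utf-8")  # two bytes on disk

# 2. Reading those bytes through a plain Latin-1 stream yields two
#    characters where there was one.
seen_in_image = condensed.decode("latin-1")

# 3. The method is loaded again as a single character and written
#    through the plain stream: one byte, no lead byte.
rewritten = nbsp.encode("latin-1")


def decodes_as_utf8(data):
    """Return True if data is valid UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# 4. Once the file is read as UTF-8 again, the lone byte is rejected.
print(list(condensed))
print([ord(c) for c in seen_in_image])
print(list(rewritten), decodes_as_utf8(rewritten))
```

The last line is the failure mode: a single byte 160 is a continuation
byte with nothing in front of it, so a strict UTF-8 reader refuses it.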

I think it's safe to fix the invalid methods by overwriting their
source. So we don't have to backtrack to version 10297.
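
To find which stretches of a changes file need this treatment, one
could scan the raw bytes for places where strict UTF-8 decoding fails.
A minimal sketch, assuming the file content is available as bytes;
invalid_utf8_offsets is a hypothetical helper, not something that
exists in the image.

```python
def invalid_utf8_offsets(data):
    """Return the byte offsets at which strict UTF-8 decoding fails."""
    offsets = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the remainder decodes cleanly
        except UnicodeDecodeError as err:
            # err.start is relative to the slice, so shift it by pos.
            offsets.append(pos + err.start)
            # Skip the offending byte and keep scanning.
            pos += err.start + 1
    return offsets


# A lone byte 160 is invalid; the pair 194 160 is a valid encoding.
sample = b"ok \xa0 then \xc2\xa0 fine \xa0"
print(invalid_utf8_offsets(sample))
```

Each reported offset points into a method source that would have to be
re-accepted.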


On May 23, 2009, at 19:57 , Nicolas Cellier wrote:

> I confirm the scenario:
> 1) update10298 runs condenseChanges, which leaves (SourceFiles at: 2) class =
> StandardFileStream
>   This is the seed of further problems, because further changes will
> be encoded in latin1 (or MacRoman, I don't really want to know)
> 2) update10302 changes the methods with non ASCII characters
> 3) Stef saves the image after update10304, which does reopen
> (SourceFiles at: 2) in UTF-8, but that's too late, the worm is in the
> apple.
> If you save the image just after the condenseChanges, no problem
> because (SourceFiles at: 2) is opened in Latin1 AFTER all the changes
> have gotten into it, and reopened UTF-8 before any changes got into
> it.
> We must track undue usage of StandardFileStream, such as in
> #condenseChanges.
> 2009/5/23 Nicolas Cellier <nicolas.cellier.aka.nice at gmail.com>:
>> What happened exactly is very hard to trace because these FileStream
>> are a can of worms...
>> Here are some of my peregrinations:
>> All methods were changed in 10305.
>> Monticello snapshot/source.st is not UTF-8.
>> If the file is opened UTF-8, then we get decompiledCode, I don't know
>> why yet...
>> But the changes still go into the change log in correct UTF-8 form, so
>> that's just another bug, but not the real source of the problem.
>> For getting some worms out of the can just browse inst var defs of
>> converter in MultiByteFileStream:
>> The accessor #converter initializes converter with TextConverter
>> defaultSystemConverter, which depends on LanguageEnvironment.
>> That is a Latin1TextConverter in my latin image.
>> Unless #reset is called first, in which case it will initialize with a
>> UTF8TextConverter.
>> Yes, but open: fileName forWrite: writeMode does the job too, with a
>> UTF8TextConverter.
>> You still follow? Me neither.
>> A better-behaved one is #setConverterForCode, which should let non UTF-8
>> .mcz work in a UTF-8 environment, but I'm not sure it is called where
>> required...
>> I think Yoshiki's changes are necessary only for writing source code
>> with character codes > 255.
>> This was not the case for the incriminated methods.
>> Everything going to the change log passes through the MultiByteFileStream,
>> so how did non UTF-8 characters get in?
>> I tried to follow two other clues:
>> 1) There are senders of #primWrite:from:startingAt:count: not
>> redefined in MultiByteFileStream...
>> for example, using #next:putAll:startingAt: will bypass the converter.
>> 2) using nextPutAll: with a ByteArray argument also bypasses the
>> converter (See MultiByteFileStream>>#nextPutAll:)
>> I did not find the senders (you really believe senders of nextPutAll:
>> can be analyzed?).
>> I tried to instrument code with Notification, but I'm unable to
>> reproduce the problem, so that was in vain...
>> http://gforge.inria.fr/frs/download.php/22283/Pharo0.1Core-10304cl.zip
>> has the invalid UTF-8 problem, just before 10305 changes that
>> introduced decompiled code...
>> So we might attack the problem with another code snippet:
>> (SystemNavigation default browseAllCallsOn: (Smalltalk associationAt:
>> #SourceFiles))...
>> Hmm, I might have a better clue now.
>> The problem might possibly come from the condenseChanges in
>> update10298.
>> What happens in a condenseChanges?
>> Changes are copied to this file:
>> f := FileStream fileNamed: 'ST80.temp'.
>> So far, so good, because the concreteStream is a MultiByteFileStream.
>> But the end finishes with:
>>       SourceFiles
>>               at: 2
>>               put: (StandardFileStream oldFileNamed: oldChanges name)
>> Waouh, no MultiByteFileStream here, so no more UTF-8.
>> But hey, that would be the inverse problem: reading UTF-8 text with
>> latin1 reader: I can't get an error doing this, only some strange
>> sequence of characters... (The UTF-8 encoding)...
>> Unless incriminated methods are further changed in #script376 or any
>> other method... In which case they are written in latin1 in the
>> changeLog...
>> Hmm... That could be the case eventually. We must restart update
>> process from http://gforge.inria.fr/frs/download.php/22167/Pharo0.1Core-10296cl-2.zip
>> One thing is sure, at next returnFromSnapshot, FileDirectory
>> class>>startup will reopen changes UTF-8.
>> So saving the image will reopen UTF-8...
>> But wait... Maybe we get enough pieces of the puzzle:
>> Analyzing the Pharo0.1Core-10304cl.changes tells that Stephane applied
>> several updates before snapshotting the image. So if Kernel and
>> System-Support are changed between 10298 and 10304, then we get the
>> explanation:
>> - condense changes puts everything in the .changes in UTF-8 but reopens
>> the changes in latin1
>> - further updates up to 10304 write changes in latin1
>> - image snapshot reopens changes in UTF-8 and thus we get further
>> invalid UTF-8...
>> That's easy to reproduce. Stef, can you confirm?
>> That also explains why I did not get the problem at home: I update
>> early and always save my image after.
>> After that we still have to detect and clean up where Monticello sources
>> are interpreted as UTF-8 when they should not be (FIRST TRACK), and
>> eventually make source code go UTF-8 in Monticello, so that non-latin
>> programmers can use their favourite language...
>> Nicolas
>> 2009/5/23 Stéphane Ducasse <stephane.ducasse at inria.fr>:
>>> No problem I never interpreted it like that.
>>> Me too I want a system that is working
>>> Adrian I will publish a fix for DNU now
>>> and I will try later to check the fixes proposed by yoshiki
>>> stef
>>> On May 23, 2009, at 1:29 PM, Tudor Girba wrote:
>>>> Actually, the fix is even simpler: if you find a method that raises
>>>> "invalid utf8 input detected", just browse to it with a class browser,
>>>> and re-accept it :).
>>>> With my previous mail, I was not implying that someone should fix it
>>>> for me, I was merely asking what a quick solution could be,
>>>> because I was a bit lost (scared) :). Now, I am happy. Thanks for
>>>> discussing it.
>>>> Cheers,
>>>> Doru
>>>> On 23 May 2009, at 13:07, Tudor Girba wrote:
>>>>> Hi,
>>>>> I attached here a DNU implementation I took from an older image.
>>>>> After filing this one in, I can debug DNU problems.
>>>>> Cheers,
>>>>> Doru
>>>>> <Object-doesNotUnderstand.st>
>>>>> On 23 May 2009, at 13:04, Stéphane Ducasse wrote:
>>>>>> I did the following
>>>>>> (Object>>#doesNotUnderstand:) getSourceFromFile and I get an
>>>>>> invalid....
>>>>>> Now when I take another method
>>>>>> (BalloonFontTest>>#testDefaultFont) I do not get problem.
>>>>>> I will reread carefully the mails of nicolas to try to understand,
>>>>>> I do not know if the fixes of yoh
>>>>>>    http://bugs.squeak.org/view.php?id=5996
>>>>>> is related.
>>>>>> Nicolas
>>>>>>>> {Object>>#doesNotUnderstand:.
>>>>>>>> SystemNavigation>>#browseMethodsWhoseNamesContain:.
>>>>>>>> Utilities class>>#changeStampPerSe.
>>>>>>>> Utilities class>>#methodsWithInitials:} collect: [:e | (e
>>>>>>>> getSourceFromFile select: [:s | s charCode > 127]) asArray
>>>>>>>> collect:
>>>>>>>> [:c | c charCode]]
>>>>>> I cannot get that code running, it breaks before that for me.
>>>>>> Stef
>>>>>> _______________________________________________
>>>>>> Pharo-project mailing list
>>>>>> Pharo-project at lists.gforge.inria.fr
>>>>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
>>>>> --
>>>>> www.tudorgirba.com
>>>>> "Not knowing how to do something is not an argument for how it
>>>>> cannot be done."
>>>> --
>>>> www.tudorgirba.com
>>>> "Problem solving efficiency grows with the abstractness level of
>>>> problem understanding."