[Pharo-project] TextConverter handling of binary streams

Sven Van Caekenberghe sven at beta9.be
Tue Nov 30 11:50:54 CET 2010


Philippe,

I would go even further: the test UTF8TextConverterTest>>#testPutSingleCharacter is plain wrong. It seems to have been written as an afterthought:

testPutSingleCharacter
	| actual |
	actual := ByteArray streamContents: [ :stream |
		| converter |
		converter := UTF8TextConverter new.
		converter
			nextPut: $a
			toStream: stream.
		converter
			nextPut: (Unicode value: 16r20AC)
			toStream: stream ].
	self assert: actual = #[97 0 0 32 172]

The correct result is #[97 226 130 172]!

If you gave the bytes that the test expects to any other system, it would never be able to decode them as UTF-8: 172 is a continuation byte, and there is no lead byte in front of it.
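
For reference, the three bytes for the euro sign can be worked out by hand. This is only a sketch of the standard UTF-8 bit layout for a code point between 16r0800 and 16rFFFF, written with plain Integer arithmetic and not with any TextConverter API; the variable names are purely illustrative:

```smalltalk
| codePoint bytes |
codePoint := 16r20AC.	"EURO SIGN"
bytes := ByteArray
	"lead byte: 1110xxxx, carrying the top 4 bits"
	with: (16rE0 bitOr: (codePoint bitShift: -12))
	"continuation byte: 10xxxxxx, carrying the middle 6 bits"
	with: (16r80 bitOr: ((codePoint bitShift: -6) bitAnd: 16r3F))
	"continuation byte: 10xxxxxx, carrying the low 6 bits"
	with: (16r80 bitOr: (codePoint bitAnd: 16r3F)).
bytes
```

Evaluating this should answer #[226 130 172], which together with 97 for $a gives #[97 226 130 172].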

Sven

On 30 Nov 2010, at 08:46, Philippe Marschall wrote:

> On 11/29/2010 03:23 PM, Sven Van Caekenberghe wrote:
>> Hi,
>> 
>> TextConverter and its subclasses seem to break the contract of #nextFromStream: and #nextPut:toStream: when the stream #isBinary. Consider the following two examples:
>> 
>> ByteArray streamContents: [ :stream | | encoder |
>> 	encoder := UTF8TextConverter new.
>> 	'élève en français' do: [ :each | encoder nextPut: each toStream: stream ] ].
>> 
>> #[233 108 232 118 101 32 101 110 32 102 114 97 110 231 97 105 115]
>> 
>> (String streamContents: [ :stream | | encoder |
>> 	encoder := UTF8TextConverter new.
>> 	'élève en français' do: [ :each | encoder nextPut: each toStream: stream ] ]) asByteArray.
>> 
>> #[195 169 108 195 168 118 101 32 101 110 32 102 114 97 110 195 167 97 105 115]
>> 
>> The first answer is incorrect, the second is correct (as far as I understand it).
> 
> I would agree.
> 
>> This is apparently on purpose, from the implementation of, for example, UTF8TextConverter>>#nextPut:toStream:
>> 
>> nextPut: aCharacter toStream: aStream 
>> 	| leadingChar nBytes mask shift ucs2code |
>> 	aStream isBinary ifTrue: [^aCharacter storeBinaryOn: aStream].
>> 	leadingChar := aCharacter leadingChar.
>> 	(leadingChar = 0 and: [aCharacter asciiValue < 128]) ifTrue: [
>> 		aStream basicNextPut: aCharacter.
>> 		^ aStream.
>> 	].
>> 
>> 	"leadingChar > 3 ifTrue: [^ aStream]."
>> 
>> 	ucs2code := aCharacter asUnicode.
>> 	ucs2code ifNil: [^ aStream].
>> 
>> 	nBytes := ucs2code highBit + 3 // 5.
>> 	mask := #(128 192 224 240 248 252 254 255) at: nBytes.
>> 	shift := nBytes - 1 * -6.
>> 	aStream basicNextPut: (Character value: (ucs2code bitShift: shift) + mask).
>> 	2 to: nBytes do: [:i | 
>> 		shift := shift + 6.
>> 		aStream basicNextPut: (Character value: ((ucs2code bitShift: shift) bitAnd: 63) + 128).
>> 	].
>> 
>> 	^ aStream.
>> 
>> I would say that the contract of #nextPut:toStream: is to take a Character object and write its binary representation in a specific encoding to a stream. However, when given an #isBinary stream, it no longer does any encoding at all!
> 
> There are many cases where Squeak/Pharo does silent coercion: something
> has obviously gone wrong (you pass a ByteArray where a String is
> expected, or vice versa), and the system tries to be clever and do
> something other than raise an exception and force you to fix your
> code. The result is often wrong because the system cannot guess what
> you wanted to do.
> 
>> The same is true for the other converters as well as for #nextFromStream:. 
>> 
>> Does anyone know why that is the case?
>> 
>> And if it is by design, how should one do UTF-8 encoding on a binary stream?
>> 
>> Thx,
>> 
>> Sven
>> 
>> PS: If others also think this is strange, I could open an issue; I am just not sure this is a bug.
> 
> I find that strange as well.
> 
> Cheers
> Philippe




