[Pharo-project] TextConverter handling of binary streams

Philippe Marschall philippe.marschall at netcetera.ch
Tue Nov 30 08:46:30 CET 2010


On 11/29/2010 03:23 PM, Sven Van Caekenberghe wrote:
> Hi,
> 
> TextConverter and its subclasses seem to break the contract of #nextFromStream: and #nextPut:toStream: when the stream #isBinary. Consider the following two examples:
> 
> ByteArray streamContents: [ :stream | | encoder |
> 	encoder := UTF8TextConverter new.
> 	'élève en français' do: [ :each | encoder nextPut: each toStream: stream ] ].
> 
>  #[233 108 232 118 101 32 101 110 32 102 114 97 110 231 97 105 115]
> 
> (String streamContents: [ :stream | | encoder |
> 	encoder := UTF8TextConverter new.
> 	'élève en français' do: [ :each | encoder nextPut: each toStream: stream ] ]) asByteArray.
> 
>  #[195 169 108 195 168 118 101 32 101 110 32 102 114 97 110 195 167 97 105 115]
> 
> The first answer is incorrect, the second is correct (as far as I understand it).

I would agree.

> This is apparently on purpose, judging from the implementation of, for example, UTF8TextConverter>>#nextPut:toStream:
> 
> nextPut: aCharacter toStream: aStream 
> 	| leadingChar nBytes mask shift ucs2code |
> 	aStream isBinary ifTrue: [^aCharacter storeBinaryOn: aStream].
> 	leadingChar := aCharacter leadingChar.
> 	(leadingChar = 0 and: [aCharacter asciiValue < 128]) ifTrue: [
> 		aStream basicNextPut: aCharacter.
> 		^ aStream.
> 	].
> 
> 	"leadingChar > 3 ifTrue: [^ aStream]."
> 
> 	ucs2code := aCharacter asUnicode.
> 	ucs2code ifNil: [^ aStream].
> 
> 	nBytes := ucs2code highBit + 3 // 5.
> 	mask := #(128 192 224 240 248 252 254 255) at: nBytes.
> 	shift := nBytes - 1 * -6.
> 	aStream basicNextPut: (Character value: (ucs2code bitShift: shift) + mask).
> 	2 to: nBytes do: [:i | 
> 		shift := shift + 6.
> 		aStream basicNextPut: (Character value: ((ucs2code bitShift: shift) bitAnd: 63) + 128).
> 	].
> 
> 	^ aStream.
> 
> I would say that the contract of #nextPut:toStream: is to take a Character object and write its binary representation in a specific encoding to a stream. However, when given an #isBinary stream, it no longer does any encoding at all!

There are many cases where Squeak/Pharo silently coerce arguments when
something has obviously gone wrong (you pass a ByteArray where a String
is expected, or vice versa). The system tries to be clever and does
something other than raise an exception and force you to fix your code.
The result is often wrong because the system cannot guess what you
wanted to do.
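
A minimal workspace sketch of the fail-fast alternative, for
illustration only. The guard and its error message are hypothetical and
not existing Pharo behaviour; only UTF8TextConverter, #isBinary and
#nextPut:toStream: come from the code quoted in this thread.

```smalltalk
"Workspace sketch: refuse a binary stream outright instead of
 silently falling back to #storeBinaryOn:. The guard is hypothetical."
| encoder stream |
encoder := UTF8TextConverter new.
stream := WriteStream on: ByteArray new.
stream isBinary
	ifTrue: [ Error signal: 'Expected a character stream, got a binary one' ]
	ifFalse: [ encoder nextPut: $a toStream: stream ].
```

Evaluated as is, this should signal the Error, since a WriteStream on a
ByteArray is expected to answer true to #isBinary. The caller then has
to decide explicitly what to do instead of getting unencoded bytes.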

> The same is true for the other converters as well as for #nextFromStream:. 
> 
> Does anyone know why that is the case?
> 
> And if it is by design, how should one do UTF-8 encoding on a binary stream?
> 
> Thx,
> 
> Sven
> 
> PS: If others also think this is strange, I could open an issue; I am just not sure this is a bug.

I find that strange as well.
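
In the meantime, one possible workaround, sketched with only the calls
that appear in the two examples above: encode into an intermediate
String first, then copy its bytes onto the binary stream. The temporary
names are mine, and I am assuming the converter behaves on a String
stream as shown in the second example.

```smalltalk
| encoder encoded |
encoder := UTF8TextConverter new.
"Step 1: encode into a String, where the converter does real UTF-8 encoding."
encoded := String streamContents: [ :stream |
	'élève en français' do: [ :each | encoder nextPut: each toStream: stream ] ].
"Step 2: copy the already encoded bytes onto the binary stream."
ByteArray streamContents: [ :binaryStream |
	binaryStream nextPutAll: encoded asByteArray ]
```

Going by the second result quoted above, this should answer the same
20-byte ByteArray as that example, at the cost of an intermediate
String.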

Cheers
Philippe




