[Pharo-project] TextConverter handling of binary streams

Stéphane Ducasse stephane.ducasse at inria.fr
Mon Nov 29 20:12:34 CET 2010


Sven,

I'm terribly busy, and then some, until mid-December or later.
Did you check whether the behavior is the same in Squeak?
S.

On Nov 29, 2010, at 3:23 PM, Sven Van Caekenberghe wrote:

> Hi,
> 
> TextConverter and its subclasses seem to break the contract of #nextFromStream: and #nextPut:toStream: when the stream #isBinary. Consider the following two examples:
> 
> ByteArray streamContents: [ :stream | | encoder |
> 	encoder := UTF8TextConverter new.
> 	'élève en français' do: [ :each | encoder nextPut: each toStream: stream ] ].
> 
> #[233 108 232 118 101 32 101 110 32 102 114 97 110 231 97 105 115]
> 
> (String streamContents: [ :stream | | encoder |
> 	encoder := UTF8TextConverter new.
> 	'élève en français' do: [ :each | encoder nextPut: each toStream: stream ] ]) asByteArray.
> 
> #[195 169 108 195 168 118 101 32 101 110 32 102 114 97 110 195 167 97 105 115]
> 
> The first answer is incorrect, the second is correct (as far as I understand it).
> 
> This is apparently on purpose, judging from the implementation of, for example, UTF8TextConverter>>#nextPut:toStream:
> 
> nextPut: aCharacter toStream: aStream 
> 	| leadingChar nBytes mask shift ucs2code |
> 	aStream isBinary ifTrue: [^aCharacter storeBinaryOn: aStream].
> 	leadingChar := aCharacter leadingChar.
> 	(leadingChar = 0 and: [aCharacter asciiValue < 128]) ifTrue: [
> 		aStream basicNextPut: aCharacter.
> 		^ aStream.
> 	].
> 
> 	"leadingChar > 3 ifTrue: [^ aStream]."
> 
> 	ucs2code := aCharacter asUnicode.
> 	ucs2code ifNil: [^ aStream].
> 
> 	nBytes := ucs2code highBit + 3 // 5.
> 	mask := #(128 192 224 240 248 252 254 255) at: nBytes.
> 	shift := nBytes - 1 * -6.
> 	aStream basicNextPut: (Character value: (ucs2code bitShift: shift) + mask).
> 	2 to: nBytes do: [:i | 
> 		shift := shift + 6.
> 		aStream basicNextPut: (Character value: ((ucs2code bitShift: shift) bitAnd: 63) + 128).
> 	].
> 
> 	^ aStream.
> 
> I would say that the contract of #nextPut:toStream: is to take a Character object and write its binary representation, in a specific encoding, to a stream. However, when given an #isBinary stream, it no longer does any encoding at all!
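To make the difference concrete, here is a small workspace sketch that replays the arithmetic of the encoding branch quoted above for a single character, é (code point 233, as in the first byte array). The variable names first and second are only illustrative; everything else follows the quoted method, and the snippet uses nothing beyond plain Integer arithmetic.

```smalltalk
"Sketch: replay the encoding branch of the quoted method for é (code point 233)."
| code nBytes mask shift first second |
code := 233.
nBytes := code highBit + 3 // 5.                          "(8 + 3) // 5 = 2 bytes"
mask := #(128 192 224 240 248 252 254 255) at: nBytes.    "192, the 2-byte lead marker"
shift := nBytes - 1 * -6.                                 "-6"
first := (code bitShift: shift) + mask.                   "3 + 192 = 195"
second := ((code bitShift: shift + 6) bitAnd: 63) + 128.  "41 + 128 = 169"
{ first. second }                                         "195 and 169"
```

These are the two bytes that open the second byte array (195 169), whereas the binary-stream path wrote the single byte 233.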
> 
> The same is true for the other converters as well as for #nextFromStream:. 
> 
> Does anyone know why that is the case?
> 
> And if it is by design, how should one do UTF8 encoding on a binary stream?
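One possible workaround, offered only as a sketch and not as a statement about the intended design, is to reuse the second example from the message: encode through a String stream first, then copy the resulting bytes onto the binary stream. The names encoded and binaryStream are hypothetical; the calls are the ones already used above, plus #nextPutAll:.

```smalltalk
"Sketch: encode via a String stream, then copy the bytes to the binary stream."
| encoded |
encoded := String streamContents: [ :stream | | encoder |
	encoder := UTF8TextConverter new.
	'élève en français' do: [ :each | encoder nextPut: each toStream: stream ] ].
ByteArray streamContents: [ :binaryStream |
	binaryStream nextPutAll: encoded asByteArray ]
```

Going by the second example above, this should answer the UTF-8 byte array beginning with 195 169, at the cost of an intermediate String.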
> 
> Thx,
> 
> Sven
> 
> PS: If others also think this is strange, I could open an issue; I am just not sure this is a bug.
> 
> 
> 




