[Pharo-project] Issue 3360 in pharo: TextConverter handling of binary streams is wrong

Tue Nov 30 13:00:58 CET 2010

New issue 3360 by sven.van.caekenberghe: TextConverter handling of binary  
streams is wrong

It seems that the way binary (#isBinary true) streams are handled by  
TextConverter and its subclasses is wrong. When given a binary stream, the  
core text converter methods (#nextPut:toStream: and #nextFromStream:) simply  
no longer encode or decode at all.

Moreover, the unit test UTF8TextConverter>>#testPutSingleCharacter seems  
plain wrong. The actual encoded bytes should be #[97 226 130 172].
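
For reference, the last three of those bytes can be checked by hand. This is  
only a sketch of the arithmetic, assuming the second character in that test is  
the euro sign (code point 16r20AC), which is what the bytes 226 130 172 decode  
to. The temporary name cp is made up for the illustration:

	"UTF-8 three byte form: 1110xxxx 10xxxxxx 10xxxxxx"
	| cp |
	cp := 16r20AC.
	ByteArray
		with: 16rE0 + (cp bitShift: -12)
		with: 16r80 + ((cp bitShift: -6) bitAnd: 16r3F)
		with: 16r80 + (cp bitAnd: 16r3F)
	"evaluates to #[226 130 172]; $a is plain ASCII, so it stays 97"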

However, this behavior seems to have been added by design, so it is hard to  
estimate the impact of changing it.

It is currently very ugly to get a binary UTF-8 encoding: one has to write  
to a character stream and then turn those characters into bytes.
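
A rough sketch of that workaround, not verified against a particular image:  
it assumes that UTF8TextConverter writes one character per encoded byte when  
given a non-binary stream, so that #asByteArray on the result gives the  
UTF-8 bytes. The temporary names are made up for the illustration.

	| converter encoded |
	converter := UTF8TextConverter new.
	"Step 1: encode into a character stream"
	encoded := String streamContents: [ :stream |
		(String with: $a with: (Unicode value: 16r20AC)) do: [ :each |
			converter nextPut: each toStream: stream ] ].
	"Step 2: turn those characters into bytes"
	encoded asByteArray

The detour through a String is the ugly part: the intermediate characters  
are really bytes in disguise.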

I wrote an alternative UTF-8 encoder as a support class to the Zinc HTTP  
Components (http://www.squeaksource.com/ZincHTTPComponents.html) together  
with the following unit test:

	"The examples are taken from ..."
	| encoder inputBytes outputBytes inputString outputString |
	encoder := ZnUTF8Encoder new.
	inputString := String
		with: $$
		with: (Unicode value: 16r00A2)
		with: (Unicode value: 16r20AC)
		with: (Unicode value: 16r024B62).
	inputBytes := #[16r24 16rC2 16rA2 16rE2 16r82 16rAC 16rF0 16rA4 16rAD 16rA2].
	outputBytes := self encodeString: inputString with: encoder.
	self assert: outputBytes = inputBytes.
	outputString := self decodeBytes: inputBytes with: encoder.
	self assert: outputString = inputString

based on the helper methods:

encodeString: string with: encoder
	^ ByteArray streamContents: [ :stream |
		string do: [ :each |
			encoder nextPut: each toStream: stream ] ]

decodeBytes: bytes with: encoder
	| input |
	input := bytes readStream.
	^ String streamContents: [ :stream |
		[ input atEnd ] whileFalse: [
			stream nextPut: (encoder nextFromStream: input) ] ]

The new encoder code is simpler, but it might not handle everything that is  
needed (leading chars, language codes). Is all of that still needed?


