[Pharo-project] [Moose-dev] Re: how to deal with string position in relation to cr/crlf

Toon Verwaest toon.verwaest at gmail.com
Thu Apr 28 12:03:17 CEST 2011


> Indeed. The problem is that the token of PetitParser only knows the character position from the stream. This would mean that we would have to modify the tracking of the position with extra information.
>
> Is there no other option?
If what you are doing is relating it back to the original source code 
... isn't the original source code stored in 1 specific format, \r, \n 
or \r\n? Or do you use the models that are parsed once to map it back to 
different versions of the same files on different platforms? In that 
case you could always convert the input file into the format you want.

To me however it seems like it makes most sense to keep the line + 
column count if you are going to keep anything yourself anyway. You do 
not need to rely on what petitparser knows already, you can keep this 
data yourself. Petitparser needs to have the char location since that's 
where it's parsing. The line+column is metadata that you need, not 
petitparser.
To implement this you just need to keep track of all the newlines 
you see. Every time you see a newline you update your newline count AND 
keep track of the position where the newline happened. This way you 
always have the column count (the actual position - the position where 
the last newline occurred).
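
A minimal sketch of that bookkeeping, written in Python rather than 
Smalltalk just to keep it short. The names LineTracker, see, 
line_and_column and locate are made up for illustration and are not part 
of PetitParser. It assumes 0-based character positions, 1-based lines and 
columns, and that \r\n counts as a single newline.

```python
class LineTracker:
    """Counts newlines seen so far and remembers where the current line starts."""

    def __init__(self):
        self.line = 1              # 1-based number of the current line
        self.line_start = 0        # char position of the first char on this line
        self._prev_was_cr = False  # True when the previous char was a carriage return

    def see(self, position, char):
        """Feed every character in stream order together with its 0-based position."""
        if char == "\n" and self._prev_was_cr:
            # Second half of a CRLF pair: the line was already counted at the CR.
            self.line_start = position + 1
        elif char in ("\r", "\n"):
            self.line += 1
            self.line_start = position + 1
        self._prev_was_cr = (char == "\r")

    def line_and_column(self, position):
        """Line and 1-based column of a position on the current line."""
        return self.line, position - self.line_start + 1


def locate(text, position):
    """Convenience helper: replay the text up to `position` and report line + column."""
    tracker = LineTracker()
    for pos, ch in enumerate(text[:position]):
        tracker.see(pos, ch)
    return tracker.line_and_column(position)
```

In a real parser you would call see from wherever the stream is 
consumed and ask for line_and_column when a token is created, instead of 
replaying the text as locate does.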

Another option I see is always parsing using a \r or \n file format by 
first converting it. Then when you show the position, you will have to 
check if the file is actually \r, \n or if it's rather \r\n. If it's \r 
or \n then you just give back the number as is. Otherwise you walk over 
the file to find out where all the newlines occur. From this you can 
build an array that tells you how many chars to add for each range of 
positions.

For example, if the newlines occur at [0, 10, 15, 17, 20] in the \r\n 
file, they occur at [0, 9, 13, 14, 16] in the converted file (subtract 1 
char for every earlier newline, since each of those shrank from a 2-sized 
newline to a 1-sized newline). Now you can translate a position by 
looking for the highest newline position lower than it. For example if 
you were looking at position 15, this finds 14, which is the 4th 
newline, so you have to do + 4 -> the real position is 19. This is just a 
binary search in the array of newlines for each position, so it's 
O(number of tokens * log(number of newlines in file)) to translate the 
model to become architecture-dependent.
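
Here is a rough sketch of that lookup, in Python for brevity. The 
function names newline_positions and to_crlf_position are hypothetical. 
It assumes 0-based positions, that the parsed text uses \n as its 1-sized 
newline, and that a newline's own position maps onto the \r of the pair.

```python
from bisect import bisect_left


def newline_positions(normalized_text):
    """0-based positions of every newline in text that uses 1-char newlines."""
    return [i for i, ch in enumerate(normalized_text) if ch == "\n"]


def to_crlf_position(position, newlines):
    """Map a position in the 1-char-newline text onto the CRLF version of it.

    Every newline strictly before `position` is one char longer in the CRLF
    file, so the shift equals the number of such newlines. bisect_left on the
    sorted list returns exactly that count via binary search.
    """
    return position + bisect_left(newlines, position)


# The example from above: newlines at [0, 9, 13, 14, 16] in the converted text.
newlines = [0, 9, 13, 14, 16]
real_position = to_crlf_position(15, newlines)  # 15 + 4 newlines before it
```

Going the other way (from a \r\n position back to the converted text) 
works the same with the array of \r\n newline positions and a minus 
instead of a plus.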


The last option is to just store both position formats in your model 
directly, and figure out which file format you are mapping back onto 
when you need a position. This is O(1) per lookup but requires double the 
data for position numbers (no biggy I suppose); it also requires your 
parser to keep track of the position info itself again. The previous 
option avoids that.
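
Something along these lines, again as a Python sketch and not as 
PetitParser code. DualPosition and dual_positions are invented names, 
and it assumes the text being parsed uses \n while the other format is 
\r\n.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DualPosition:
    lf: int    # 0-based offset in the text with 1-char newlines
    crlf: int  # 0-based offset of the same character in the CRLF text


def dual_positions(normalized_text):
    """Yield a DualPosition for every character of the 1-char-newline text."""
    shift = 0  # number of newlines seen so far, each adds one char in CRLF
    for lf, ch in enumerate(normalized_text):
        yield DualPosition(lf, lf + shift)
        if ch == "\n":
            shift += 1


def position_for(pos, file_uses_crlf):
    """Pick the stored offset that matches the file being mapped back onto."""
    return pos.crlf if file_uses_crlf else pos.lf
```

A token would then hold a DualPosition instead of a single number, and 
position_for picks the right one once you know the format of the file on 
disk.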

Hope this helps to make some sort of a decision :)

Toon


