Qooper on 11/8/2014 at 18:03
Quote Posted by R Soul
I have a syntax suggestion. Allow this:
if(4 < n < 7) ... as well as the usual if(n > 4 && n < 7) ...
Actually I use this sometimes in Python and it can make certain checks slightly more elegant. I'll definitely consider it.
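For reference, Python evaluates the chained form exactly as the explicit conjunction, so the two checks below are equivalent (a minimal illustration):
Code:
n = 5
if 4 < n < 7:           # chained: evaluated as (4 < n) and (n < 7)
    print("in range")
if n > 4 and n < 7:     # the usual explicit form
    print("in range")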
Quote Posted by zombe
Decided that i need to write down the pseudocode for the lexer
Wow, the mechanics of your lexer are so simple, yet the lexer itself is so powerful. My lexer (which I used to call tokenizer) is much more complex, but I guess it could be simplified with further design and refactoring. Here's the Python code for my current lexer:
Code:
from enum_tools import enum_start, enum_next  # small helper module for sequential enum ids

token_newline = (enum_start(), "newline")
token_string = (enum_next(), "string")
token_number = (enum_next(), "number")
token_dot = (enum_next(), "dot")
token_comma = (enum_next(), "comma")
token_parens_open = (enum_next(), "parens_open")
token_parens_close = (enum_next(), "parens_close")
token_brackets_open = (enum_next(), "brackets_open")
token_brackets_close = (enum_next(), "brackets_close")
token_braces_open = (enum_next(), "braces_open")
token_braces_close = (enum_next(), "braces_close")
token_symbol = (enum_next(), "symbol")
token_operator = (enum_next(), "operator")

def tokenize_whitespace(state):
    # Consume spaces, tabs and newlines; emit a single newline token if any newline was seen.
    newline = False
    i = state.i
    code = state.code
    length = state.length
    line_count = state.line_count
    char_count = state.char_count
    while i < length:
        s = code[i]
        if s == "\n":
            newline = True
            line_count += 1
            char_count = 0
        elif s in " \t":
            pass
        else:
            break
        i += 1
        char_count += 1
    if newline:
        state.tokens.append(Token(token_newline, None, state.line_count, state.char_count, state.i))
    state.i = i
    state.line_count = line_count
    state.char_count = char_count

def tokenize_string(state):
    # Read until an unescaped closing quote; the quotes themselves are not stored.
    start = state.i + 1
    end = start
    code = state.code
    length = state.length
    line_count = state.line_count
    char_count = state.char_count
    while end < length:
        if code[end] == "\"" and code[end - 1] != "\\":
            break
        elif code[end] == "\n":
            line_count += 1
            char_count = 0
        end += 1
        char_count += 1
    state.tokens.append(Token(token_string, code[start:end], state.line_count, state.char_count, state.i))
    state.i = end + 1
    state.line_count = line_count
    state.char_count = char_count + 1

def tokenize_comment(state):
    # A "#" comment runs to the end of the line and produces no token.
    i = state.i + 1
    code = state.code
    length = state.length
    line_count = state.line_count
    char_count = state.char_count
    while i < length:
        if code[i] == "\n":
            line_count += 1
            char_count = 0
            break
        i += 1
        char_count += 1
    state.i = i + 1
    state.line_count = line_count
    state.char_count = char_count + 1

def tokenize_dot(state):
    state.tokens.append(Token(token_dot, None, state.line_count, state.char_count, state.i))
    state.i += 1
    state.char_count += 1

def tokenize_single_character(state):
    code = state.code
    s = code[state.i]
    token = None
    if s == ",":
        token = token_comma
    elif s == "(":
        token = token_parens_open
    elif s == ")":
        token = token_parens_close
    elif s == "[":
        token = token_brackets_open
    elif s == "]":
        token = token_brackets_close
    elif s == "{":
        token = token_braces_open
    elif s == "}":
        token = token_braces_close
    state.tokens.append(Token(token, None, state.line_count, state.char_count, state.i))
    state.i += 1
    state.char_count += 1

def tokenize_number(state):
    # Accepts integers and floats; at most one "." which must be followed by a digit.
    floating = False
    start = state.i
    end = start
    code = state.code
    length = state.length
    char_count = state.char_count
    while end < length:
        if code[end].isdigit():
            pass
        elif code[end] == "." and end + 1 < length and code[end + 1].isdigit() and not floating:
            floating = True
        else:
            break
        end += 1
        char_count += 1
    value = float(code[start:end]) if floating else int(code[start:end])
    state.tokens.append(Token(token_number, value, state.line_count, state.char_count, state.i))
    state.i = end
    state.char_count = char_count

def tokenize_symbol(state):
    start = state.i
    end = start + 1
    code = state.code
    length = state.length
    char_count = state.char_count
    while end < length:
        if not code[end].isalnum() and code[end] != "_":
            break
        end += 1
        char_count += 1
    state.tokens.append(Token(token_symbol, code[start:end], state.line_count, state.char_count, state.i))
    state.i = end
    state.char_count = char_count

tokenizer_rules = [
    (tokenize_whitespace, lambda code, i: code[i] in " \n\t"),
    (tokenize_string, lambda code, i: code[i] == "\""),
    (tokenize_comment, lambda code, i: code[i] == "#"),
    (tokenize_number, lambda code, i: code[i].isdigit() or (code[i] == "." and i + 1 < len(code) and code[i + 1].isdigit())),
    (tokenize_dot, lambda code, i: code[i] == "." and (i + 1 >= len(code) or code[i + 1] != ".")),
    (tokenize_symbol, lambda code, i: code[i].isalpha() or (code[i] == "_" and i + 1 < len(code) and (code[i + 1] == "_" or code[i + 1].isalnum()))),
    (tokenize_single_character, lambda code, i: code[i] in "_,()[]{}"),
]

def tokenize_operator(state):
    # Greedy: extend the operator until some other rule claims the next character.
    start = state.i
    end = start + 1
    code = state.code
    length = state.length
    char_count = state.char_count
    while end < length:
        for rule in tokenizer_rules:
            if rule[0] != tokenize_operator:
                if rule[1](code, end):
                    break
        else:  # no other rule matched at `end`, so it is still part of the operator
            end += 1
            char_count += 1
            continue
        break
    state.tokens.append(Token(token_operator, code[start:end], state.line_count, state.char_count, state.i))
    state.i = end
    state.char_count = char_count

tokenizer_rules.append((tokenize_operator, lambda code, i: True))  # catch-all, must stay last

def tokenize_code(code):
    state = TokenizerState(code)
    code = state.code
    length = state.length
    while state.i < length:
        for rule in tokenizer_rules:
            if rule[1](code, state.i):
                rule[0](state)
                break
        else:  # unreachable while the catch-all operator rule is in place
            print("error in tokenize_code: no rule matched conditions!")
    return state.tokens

class TokenizerState(object):
    def __init__(self, code):
        self.i = 0
        self.line_count = 1
        self.char_count = 0
        self.tokens = []
        self.code = code
        self.length = len(code)

class Token(object):
    def __init__(self, type, data, line, char, code_i):
        self.type = type
        self.data = data
        self.line = line
        self.char = char
        self.code_i = code_i

    def __str__(self):
        return "[" + self.type[1] + ((", " + str(self.data)) if self.data is not None else "") + "]"

    def __repr__(self):
        return self.__str__()
tokenize_code is the function that does the lexing when given the source as a string. The rest are simply the many moving parts of the complex machine that is the lexer.
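For example, feeding tokenize_code a small snippet goes like this (assuming the enum_tools helper module is importable; the sample source is made up):
Code:
source = 'print("hello", 42) # greet\nx = 3.14'
for tok in tokenize_code(source):
    print("%d:%d %s" % (tok.line, tok.char, tok))
# prints one token per line, e.g. [symbol, print], [parens_open],
# [string, hello], [comma], [number, 42] ... and finally [number, 3.14]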
zombe on 11/8/2014 at 20:43
I have touched Python code twice in my life:
* Once nearly 10 years ago when some script malfunctioned. Did not understand any of the code but found the bad regex - and i did understand those. Easy fix :) ... well, technically, i did not touch any parts of the Python related stuff, but anyway. Let's say it counts.
* There was some unnecessary internet check in the Minecraft Coders Pack that failed for me. So, tracked it down and commented it out. :D
In short, my grasp of Python is extremely limited - to say the least. But it seems i can understand enough of the posted code by intuition/experience from other languages. Looks odd, but seems to end up doing the same work (minus multi-line comments and stuff needed for the Off-side rule). Btw, since Python is kind of the canonical example of the Off-side rule (http://en.wikipedia.org/wiki/Off-side_rule) and you dig Python - why not use it in your language too?
------------------------------
For completeness' sake (and the fact that sooner or later i need to implement it), the "find_beginning_of_next_token" and "read_and_create_token" parts would go something like this (in my case):
Code:
// all the stuff obviously tracks the line number and column - which i continue to omit.
find_beginning_of_next_token
    while at < end
        char <- code[at]
        if char not in allowed_chars // ex: note "\t" is invalid, as is pretty much anything not in the ASCII set.
            ERROR: unexpected character found - junk in source file // note also that no such limitation applies to comments, strings etc
        if char in " \n\r" // long-rant-note: "\r" never advances "column", and fuck the old Macs (ie. the newline convention depends on what program/platform was used to create the file, and it can change in the middle of the file if multiple editors were used during the source's lifetime etc: "\r", "\n", "\r\n", "\n\r")
            at += 1
        else if char == "/" and at + 1 < end
            if code[at + 1] == "/" // "//" - line end comment
                at += 2
                while at < end and code[at] != "\n" // skip til end of line
                    at++
            else if code[at + 1] == "*" // "/*" - multi-line comment, nesting (C++ has no nesting for some stupid reason - annoying)
                comment = 1
                at += 2
                while at < end and comment != 0
                    if at + 1 < end
                        if code[at] == "/" and code[at+1] == "*" // look out for the next opening multi-line comment
                            at += 1
                            comment++
                        else if code[at] == "*" and code[at+1] == "/" // look out for the next closing multi-line comment
                            at += 1
                            comment--
                    at += 1
                if comment != 0
                    WARNING: multi-line comment missing end marker
            else
                break
        else
            break

// note: the token id is an enum, it would be silly and quite inefficient to do otherwise
read_and_create_token // precond: at < end
    char <- code[at]
    if char in uppercaseLetters
        tok.id <- Type
        tok.extra <- readAlphaNum // Set: [a-zA-Z0-9_]
    else if char in lowercaseLetters
        tok.id <- Name // the only difference between a keyword and an identifier is the token id,
        tok.extra <- readAlphaNum // which means i can degrade a keyword back to an identifier
        new_id = keyword_map[tok.extra] // whenever the parser asks for one and the keyword allows it (boolean lookup by its token id)
        if new_id not invalid
            tok.id <- new_id
    else if char in [0..9]
        tok.id <- Value
        tok.extra <- readValue // using my conversion function wrappers - what C++ provides in its libraries is extremely badly implemented and needs sanitizing etc
    else
        if char == "_" and at + 1 < end and code[at+1] in Letters
            tok.id <- Unit
            tok.extra <- readAlphaNum
        switch char
            case "+": if at < end and code[at] == "+" ... // be greedy: fail "+-" "+++" "++-", try "++", "+=", and if nothing else matched then just plain "+"
            case "_": tok.id <- Underline // no extra part, and as there is no "__"/"_???" token there is no need to check what follows ("_name" was already handled)
            case "\"": ...
            ...
            default:
                ERROR: unexpected character found - junk in source file
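For comparison with the Python lexer above, the nesting multi-line comment skip translates to roughly this (a sketch with made-up names, not zombe's actual code):
Code:
def skip_nested_comment(code, at):
    # "at" points at the opening "/"; returns the index just past the closing "*/"
    depth = 1
    at += 2  # step over the opening "/*"
    end = len(code)
    while at < end and depth != 0:
        if at + 1 < end:
            if code[at] == "/" and code[at + 1] == "*":
                at += 1
                depth += 1  # a nested "/*" opens one level deeper
            elif code[at] == "*" and code[at + 1] == "/":
                at += 1
                depth -= 1  # "*/" closes the innermost level
        at += 1
    if depth != 0:
        print("warning: multi-line comment missing end marker")
    return at

print(skip_nested_comment("/* blah /* blah */ blah */ rest", 0))  # -> 26, just past the closing "*/"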
Pyrian on 11/8/2014 at 21:20
I have mixed feelings about Python. I really like the Off-side rule, but I feel like Python half-assed it by requiring a ":" at the front anyway. I think any late binding should be optional, explicit, and basically an exception for special cases. Using it for everything seems to me like a terrible idea. That sort of thing bit me too many times in Visual Basic.
I tried to give "Boo" a try in Unity, but I found the documentation horribly inadequate, and switched to C#. Kind of interesting how easy it was for me to work in C# without any prior experience; I've done quite a bit of JavaScript, enough to know I didn't want to use it if I didn't have to, I guess.
I know! I should write a new language, just how I like everything!
...Or not. :cheeky:
zombe on 11/8/2014 at 22:08
Quote Posted by Pyrian
I have mixed feelings about Python. I really like the Off-side rule, but I feel like Python half-assed it by requiring a ":" at the front anyway.
You mean this stuff:
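Code:
if something:    # the ":" Python requires at the end of every block header
    blah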
Yeah, that ":" is quite baffling to me too - my intention is:
Code:
if something \ blah
// "\" is for readability for both programmer and compiler, however ...
if something
    blah
// ... needs no such thing. In fact ...
if something \
    blah
// ... will cause a syntax error. Reason being:
if something_fairly_long \ something_i_missed
// some comments perhaps ... and then
    something_i_added_later // ... not noticing that the "if" has a body already
// ... oops.
Quote Posted by Pyrian
I think any late binding should be optional, explicit, and basically an exception for special cases. Using it for everything seems to me like a terrible idea. That sort of thing bit me too many times in Visual Basic.
In what sense? Late binding like for example in C++ or JavaScript? Both are fairly fucked up in different ways.
Share a pitfall example you have endured?
Quote Posted by Pyrian
I know! I should write a new language, just how I like everything!
...Or not. :cheeky:
Yeah. IMHO, it cannot be justified ... besides with "i just want to". A hobby of sorts.
Pyrian on 12/8/2014 at 01:10
Quote Posted by zombe
Yeah, that ":" is quite baffling to me too - my intention is:
That looks great. :D Nothing extra unless you want to do something abnormal, and then you can.
Quote Posted by zombe
In what sense? Late binding like for example in C++ or JavaScript? Both are fairly fucked up in different ways.
I thought late bindings in C++ were explicit? Like, you had to do something to set them up, rather than it just happening whether you want it or not. I have two issues with late binding. The first is that it's mostly unnecessary. The second is that it compiles and runs things that were actually errors on my part. It's generally combined with implicit variable declaration, which just kills me.
I did a lot of work in Visual Basic (mostly VBA in Excel, actually), and some of its close cousins (like LotusScript). You could turn off implicit declaration, and late binding was accomplished by having variables of type "Variant", which was also the default type if you used implicit declaration or simply didn't specify. It was okay because I didn't have to use any of it, and the instances where I'd deliberately use Variants were when I wanted functions to operate without knowing what they were getting (essentially necessary if I was working with Excel cell contents).
Quote Posted by zombe
Share a pitfall example you have endured?
Errors that compile and run, basically. Most commonly, I would have a typo that declared and set a variable instead of modifying an existing one. Grr. You'd at least get a warning for that nowadays - unless you made the same typo again, which I tend to. Or, I'd drop the wrong variable into an expression, and instead of getting an error, it would run with it to some strange and almost inexplicable end. Default object properties meant that if I forgot to specify which property I was accessing, I'd get the object name or something.
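(Plain Python shares that failure mode, for what it's worth - assignment silently declares a new name, so a typo like this runs without a complaint; a made-up illustration:)
Code:
total_count = 0
for x in [1, 2, 3]:
    total_cout = total_count + x  # typo: silently creates a new variable instead of updating total_count
print(total_count)  # prints 0, not 6 - and no error anywhere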
zombe on 12/8/2014 at 21:23
Quote Posted by Pyrian
I thought late bindings in C++ were explicit?
Yes, they are (in an ass-backward way). However, (non-virtual) overriding is silent and will bite you in the back. Not to mention other silliness.
Quote Posted by Pyrian
The second is that it compiles and runs things that were actually errors on my part. It's generally combined with implicit variable declaration, which just kills me.
That kind of crap is pretty much at the top of the list of things i hate when scripting languages do it.
As far as i know i have designed out all those kinds of problems (including the shit-ton of related infuriating crap C++ does).
-----------------------------------------
Today i implemented nearly all of the tokenizer/lexer and tested it (and fixed the bugs, some of which were already present in the pseudocode).
Turning stuff like:
Code:
/* blah /* blah */ blah */
context Fooo // dlhgskldfhg
    final function test
        -> Meh
        if foo == bar
            if blah && meh < asd \ donothing
            else
                snafu
        dodad <<= whatever
            + more
            - orless * asd
// token mess
!!?=$@~-*+#--<<===++ += %/:,..
... into xhtml output (wisdom from the school days: html is an excellent tool to help debug/reason about parsers etc): (http://postimg.org/image/3unu3j0ej/)
Nice: tokens have an html title "tooltip" noting the line and column numbers too (it actually ended up pointing out one of the errors for me).
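(zombe's actual generator isn't shown, but the same trick in Python, reusing the Token class from the lexer earlier in the thread, could look something like this:)
Code:
import html

def tokens_to_html(tokens):
    # one <span> per token; the title attribute shows up as a hover tooltip
    parts = []
    for tok in tokens:
        tip = "line %d, col %d" % (tok.line, tok.char)
        parts.append('<span title="%s">%s</span>' % (tip, html.escape(str(tok))))
    return "<html><body>" + " ".join(parts) + "</body></html>"

with open("tokens.html", "w") as f:
    f.write(tokens_to_html(tokenize_code('x = 3.14')))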
All checks out, finally.
Yay, today was not a total waste :D