zombe on 6/8/2014 at 12:24
Quote Posted by Qooper
/.../ lang that compiles to C and then to native binary. /.../ writing the compiler has been /.../
A long-long time ago when i did something like that (my-lang -> asm) i was corrected by the teacher, when talking about compiling, that it would be more appropriate to call my "compiler" a translator. Primarily because my "compiler" did none of the non-trivial work (like optimizing). Eh, don't care - splitting hair mighty thin i say. Anyway - do you do any optimizations or do you offload it all to C compiler?
------------------------------------------------------
PS. How about a code sample?
Quote Posted by Qooper
I'm sure there are people on these forums that have implemented their own toy-languages.
Several. Mostly transformation languages ... (or translators if one insists) ... that produce something useful (a'la lots of SQL for testserver database structure and test data update reusing what is already present on the server and can be reasoned with in the new context) from my custom language. Almost everywhere using LL1 context free grammars with recursive descent parsers - as that is just a ridonkulously easy way to do it.
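For anyone who has not seen the LL1 + recursive descent combination before, here is a minimal sketch in Python. The toy arithmetic grammar and the names `tokenize`, `Parser` and `evaluate` are made up for illustration and are not from any of the languages mentioned here. It also evaluates on the fly instead of building a tree, just to stay short. The point is that every grammar rule becomes one function, and a single token of lookahead (`peek`) is always enough to pick the branch.

```python
import re

# Hypothetical toy grammar (an assumption, not from the thread):
#   expr   -> term (('+' | '-') term)*
#   term   -> factor (('*' | '/') factor)*
#   factor -> NUMBER | '(' expr ')'

TOKEN_RE = re.compile(r"\s*([0-9]+|[-+*/()])")

def tokenize(text):
    """Cut the input into number and operator tokens."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        # The single token of lookahead that LL(1) relies on.
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.pos += 1
        return tok

    def expr(self):
        value = self.term()
        while self.peek() in ("+", "-"):
            op = self.take()
            rhs = self.term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term(self):
        value = self.factor()
        while self.peek() in ("*", "/"):
            op = self.take()
            rhs = self.factor()
            # Integer division keeps the toy language integer-only.
            value = value * rhs if op == "*" else value // rhs
        return value

    def factor(self):
        if self.peek() == "(":
            self.take("(")
            value = self.expr()
            self.take(")")
            return value
        tok = self.take()
        if not tok.isdigit():
            raise SyntaxError(f"expected number, got {tok!r}")
        return int(tok)

def evaluate(text):
    parser = Parser(tokenize(text))
    result = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"trailing input: {parser.peek()!r}")
    return result
```

With this sketch, `evaluate("2 + 3 * (4 - 1)")` gives 11, and malformed input such as `"2 +"` raises `SyntaxError`.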
My most recent one (risen out of "that is just stupid, and wicked cool at the same time - i have to make it"):
* LL1: of-bloody-course.
* parsing rules as a data array (interesting approach, but very inefficient)
* compiles into bytecode run by my own virtual machine (stack based)
* if the language parses successfully (ie. valid in language) then there is only one possible-failstate: stack overflow. Ie, it runs ANY gibberish without error (ex: calling non-existing functions etc is fine).
* hot-swap: code AND DATA can be changed on the fly (hey, as long as there is no stack overflow then whatever you do is fine :D ).
Yeah, kinda silly.
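To make the "only failstate is stack overflow" idea concrete, here is a rough Python sketch of such a machine. It is not the actual VM: the opcode names, `TinyVM` and `StackOverflow` are all hypothetical, and it assumes values are plain integers. Unknown opcodes, calls to missing functions and popping an empty stack are all tolerated, and function bodies can be swapped between runs.

```python
class StackOverflow(Exception):
    """The single failure mode of the sketch."""

class TinyVM:
    """Hypothetical stack machine: everything except overflow is tolerated."""

    def __init__(self, max_stack=64):
        self.stack = []
        self.max_stack = max_stack
        self.functions = {}  # name -> bytecode list, replaceable at any time

    def push(self, value):
        if len(self.stack) >= self.max_stack:
            raise StackOverflow("data stack")
        self.stack.append(value)

    def pop(self):
        # Underflow is not an error: an empty stack just yields 0.
        return self.stack.pop() if self.stack else 0

    def _exec(self, code, depth=0):
        if depth > self.max_stack:
            raise StackOverflow("call stack")
        for op, arg in code:
            if op == "push":
                self.push(arg)
            elif op == "add":
                self.push(self.pop() + self.pop())
            elif op == "call":
                # Calling a function that does not exist runs an empty body.
                self._exec(self.functions.get(arg, []), depth + 1)
            # Any other opcode is silently skipped.

    def run(self, code):
        self._exec(code)
        return self.pop()
```

Hot-swap in this sketch is just assigning a new list to `vm.functions["f"]` between two `run` calls: the next call picks up the new body and nothing else has to be touched.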
-----------------------------------------------------
As it happens, i am gearing up on writing another bloody language (our compiler [not the full name of the course obviously, but i do not remember and translating it would be annoying anyway] teacher opened with, paraphrased: "The last thing the world needs is another bloody programming language as there are literally millions already - but let's make one anyway.").
Boy, i have to say - C++XX is the best anti-example in the world. I got to think about it a few days ago and i still can not think of any part of C++XX that is not horrendously fucked up. None! C++XX is by far THE worst widespread programming language! Can anyone name anything in C++XX they find to be good/well-done? Obviously ignoring the history and other such reasons - judging just by the language itself.
Random examples of my crap (only a few simple things that crossed my mind here) expressed in an infodump-mess:
Code:
/* /* multi-line comments are nesting ... */ ... of course */
// classes/structures begin with capital letters, identifiers with lower case.
// Case sensitive for use but case insensitive for name clash checks. Ie: one cannot use "snafu" via "snafU" whereas "snafu" clashes with "snAfu".
// http://en.wikipedia.org/wiki/Off-side_rule
// dereferences are automatic where needed ("." is used for namespace selection, pointer dereference, etc)
// non-private variables cannot be overloaded in child blocks (except by the function scope itself)
// non-private functions cannot be overloaded similarly - unless explicitly declaring the intent
// operators can be overloaded
// operators have sane grouping, order and direction (vs C++)
// there are no default functions generated for anything
// etc. scoping/overloading is significantly less retarded than in C++.
// default function parameters cannot be omitted, instead "_" token must be used (= don't care token, usable in deconstructors also). Also, "default" is misleading word - as what is used is implementation private detail not open to outside world.
// object lifetime is managed by the language: class uses generic garbage collector by default (can be changed)
class SomeNamespace.Parent -> Blah.Child as MyChild // MyChild is optional alias
const Float constVar = fooB(0) // can only be set in a constructor (last assignment wins) - access restriction cannot be removed
Int* publicVar // pointer to Int (pointer can be null and hence cannot be dereferenced without null check)
read Int& restrictedVar // read = only MyChild has write access. reference to Int - cannot be null.
Int #privateVar 'I iCheeky' = 32 // private variable with iCheeky named interface that drops all access restrictions. Has default value
// functions are attached to type separately (note: nothing needs to be declared before first use) ... however, fooA is implicitly attached to MyChild as that is the default context after class/struct/namespace/context
member fooA \ return 7 // member function returning 7. "\" is a special continuation token used to avoid ambiguity and/or bad-readability (LL1 for the win).
// static variable - happens to be a public constant
const something = fooB(0) // const = "something" is compile time constant and can be used as such.
// explicitly attached to MyChild - private function (does not require the member object), exposed by iCheeky interface
static #MyChild.fooB(Int par) 'I iCheeky' -> Float
return Float(par) // there are no implicit conversions (besides "is-a" and other simple rules, etc) ... Float(par) is construction
// ~= pure virtual in C++. "virtual" is never declared - the language does that when and if it needs it under the hood itself.
abstract fooC -> {Int, Float}
// note: {} = manual construction. In this case: an anonymous structure, example use:
// var a = "something_irrelevant", {b, c} = someMyChild.fooC
// overload a function (makes the original ~= virtual and uses vtable WHEN and IF needed). Overloading can be disallowed by "final" keyword.
overload fooD \ return 99
member new \ code // constructor
member new.namedConstructor \ code // named constructor
member bye \ code // destructor
static Something.doStuff(MyChild& par)
// -------------------
var tmp = par // var ~= auto in C++
i := 99 // equivalent to: var i = 99
par.fooB = par.fooA // equivalent to: par.fooB(par.fooA())
MyChild.iCheeky.fooB = 99 // using non-public interface
// -------------------
var lambda = $\ return 42 // tiny lambda expression
var morecomplex = $(Int par) -> Float
if par == 1 \ return 3 else return 7
// -------------------
if foo \ somestuff_hidden_away_at_the_end_of_line
someother_stuff // ERROR: unexpected scope change - use of "scope", if that was the intent, is mandatory
// -------------------
// intentionally ugly way to write: for(Snafu a; a<10; a++) a.doStuff();
scope Snafu a; while a < 10 \ (defer? a++); a.doStuff // defer defers a block of code to scope exit, where adding "?" allows skipping the block by break/continue/goto/etc. If there is more than one then the order of execution of the blocks is reversed.
// its like using an anvil to kill a fly, probably sufficient to write
Snafu a; while a < 10 \ a.doStuff; a++
// or, if convenient, use foreach (PS. there is no "for", only "do-while" and "while" ... "for" is kinda useless and i can not think of any nice syntax for it)
foreach i in [0..10] \ code
foreach i in [blah.min < blah.max] \ code // PS. blah.min can be omitted, if "i" already has a suitable value
// ... or basically anything that gives an iterator
foreach {k, v} in someMap \ code
foreach {k, s, t} in somethingThatGivesTriplets \ code
foreach Float f in {1.0, 2.0, 5.0 ,7.0 ,11.0} \ code
// some manual construction examples
{a, b} = {b, a} // left side of = implicitly uses references => deconstructor = constructor => swap a, b
var mc = new MyChild{publicVar = 0, restrictedVar = &someInt} // manual construction ... this is invalid example as restrictedVar is, well, restricted and cannot be used in manual construction. It also does not have a default value, unlike privateVar - which does.
var nc = new MyChild.namedConstructor
var bc = new MyChild // basic constructor
// -------------------- so-called "continuation tokens" (ex: , && || etc) - lines cannot start with them and so they eat all the preceding whitespace
if foo == 1 && bar == 2
|| bar == 6
|| bar == 9
thenDoStuff
Still ironing out a few things.
Qooper on 6/8/2014 at 18:26
Huzzah, I've finally found my very own nerd corner! :D
Quote Posted by zombe
Anyway - do you do any optimizations or do you offload it all to C compiler?
At the moment I'm not doing any. My very first objective is to get things working. But I already know a few good places where I can do some neat optimizations, when the time comes to tweak things towards performance. But also I wouldn't be able to call my compiler a translator, since some of the things I'm providing are not a direct part of C. Instead I'm _implementing_ them using elaborate chunks of C.
Quote:
PS. How about a code sample?
So here's a little piece that's fine by the spec but doesn't exactly compile yet.
Code:
import sys #the import command works like the import in Java or Python, and not like the include directive in C or C++
## Comments always start with the '#' character and extend until a '\n' -character or EOF is encountered
## Here is an example how classes can be used to create automatically managed algebraic datatypes
##
class Tree #this creates an abstract class that can be inherited from
class Node(left: Tree, right: Tree): Tree
class Leaf(data: any): Tree
close Tree #and this closes the class so that it cannot be extended through inheritance any further
class Player {
val name: string
var health: int
var xp: int
def Player(name, startingHealth) = { #a pretty straight forward constructor, nothing special going on here
self.name = name
health = startingHealth
xp = 0
}
def getName() = name #if a function or method has only one expression in its definition, it gets returned automatically and type inference works here
def getHealth() = health
def getXP() = xp
def gainXP(amt: int): void = xp += amt #nothing is returned here since return type was explicitly defined to be void
}
def printTree(t): void = match t { #type inference can deduce that t has to be either a Node or a Leaf, and the closest common ancestor of both is Tree
Node(l, r) {
println("opening node:")
println("left:")
printTree(l)
println("right:")
printTree(r)
}
Leaf(d) {
println("data: " + d + " (" + type(d) + ")") #when an object is referred to by an any-reference, it carries an extra pointer to its type information, which in other cases is already known by the compiler
}
}
def printStuff(text: string, count: int): void = for _ <- 0 to count {sys.println(text)} #for-loops work by iterating containers or ranges
def tupleFun(x: int): (int, int) = (x, x*3) #tuples work as you'd expect (under the hood they're converted to C structs), and they provide a natural way to return several values from a function or method
def genericFunc(x) = x*2 #one type of genericity is to let type inference happen at the point of usage (this function definition will not result in any C code unless the function is actually called somewhere)
def main(args: string[]) = { #once again type inference can deduce the return type of this function to be int
val f = printStuff("Hello world!") #this creates a functor with the first argument bound
f(2) #this calls the functor with arguments ("Hello world!", 2)
val t = Node(
Node(Leaf("asdf"), Leaf(7)),
Leaf(8.75))
printTree(t) #prints the following lines:
#opening node:
#left:
#opening node:
#left:
#data: asdf (string)
#right:
#data: 7 (int)
#right:
#data: 8.75 (float)
val tpl1 = tupleFun(2) #returns the tuple (2, 6)
var x = 7 #val means shallow immutable and var means mutable
sys.println(x) #prints 7
(x, val y) = tupleFun(3) #puts the value of 3 into x and defines the newly declared immutable value y to be 9
sys.println(tpl1[0])
sys.println(x) #prints 3
sys.println(y) #prints 9
## A little bit about memory management.
## This language will have three ways of allocating memory for objects.
## 1. stack-based, which is automatically freed
## 2. heap-based, which needs to be handled by the programmer
## 3. reference-counted, which will be freed when the last reference to it goes out of scope
## And stack-based is what you get when you simply call a constructor, like in the next example:
##
val p = Player("Doomguy", 100) #creates a Player on the stack and assigns p to refer to it
sys.println(p.getName()) #prints Doomguy
val xpFunc = p.getXP #also creates a functor, because methods are simply functions with the first parameter being a pointer to a C struct
sys.println(xpFunc()) #prints 0
println(genericFunc(2)) #prints 4, which is what you get when you say 2*2
println(genericFunc("asdf")) #prints asdfasdf, which is what you get when you say "asdf"*2
return 0
}
Quote:
Boy, i have to say - C++XX is the best anti-example in the world. I got to think about it a few days ago and i still can not think of any part of C++XX that is not horrendously fucked up. None! C++XX is by far THE worst widespread programming language! Can anyone name anything in C++XX they find to be good/well-done? Obviously ignoring the history and other such reasons - judging just by the language itself.
I'm happy to see someone else shares my grief with C++. On so many occasions I've tried to explain what's wrong with this beloved/behated language, but no one seems to see the large elephant in the room. You know? "It's right there guys, you're almost touching it. Can't you see? And goodness gracious, now you're walking in its dung. Oh dear." Response: "Hmm nope, can't see it. I dunno what you're talking about."
I need to take a small break from writing and read your infodump, it looked interesting at first glance. Will post my thoughts on it later. So far, thanks everyone for contributing to this.. whatever this is :D
Yakoob on 7/8/2014 at 03:47
Quote Posted by zombe
* if the language parses successfully (ie. valid in language) then there is only one possible-failstate: stack overflow. Ie, it runs ANY gibberish without error (ex: calling non-existing functions etc is fine).
All the other stuff is neat but this one point I'm gonna grind my teeth at. PHP does the same thing, and it's one of the main reasons why it's horrible.
Quote:
Can anyone name anything in C++XX they find to be good/well-done?
I quite like the syntax and general structure (classes, functions, enums etc.) Plus inheritance can be useful if you know how to use it. Tho I agree there are many inconsistencies and sometimes confusion over explicit/implicit stuff.
I mentioned before, but C# is my favorite; feels like C++ but with all the annoying/dangerous shit taken out and useful features put in (GC, .Net framework etc.) It has its share of kinks tho, but many of those have been getting ironed out over time.
Quote Posted by Pyrian
I haven't had any fundamental problems with C# so far (using it with Unity, too), but MonoDevelop, geez. It's like... It's trying to do more than it knows how to do? Often I don't know what it's doing, and it doesn't seem to matter which style I use, it screws up any of them, nevermind "autocompleting" good identifiers into bad ones.
What kills me, though, is the fact it's flat out BROKEN. Sometimes I get weird graphical glitches where one line of code displays at a wrong place. Sometimes my tabs freeze and stop working. Sometimes I get cryptic compilation errors that go away when I restart. I just want it to... work...
Quote:
What killed us was that I thought it might be a good idea to concatenate strings using an operator (+), like virtually every other language does without issue. It turns out, that was a terrible idea. I kind of knew it was a terrible idea, but I guessed the optimizer would realize it didn't really need four temporary objects to resolve A=B+C, yet I was wrong. Worked fine in small test groups, but when we tried to scale up to hundreds of units, we could only run it by turning DEBUG off. Since that was unacceptable, we had to return to good ol' strcat, and I got to be the team idiot.
Oh yea, it's one of those early lessons of STL you learn, and that's where it helps to know what really goes on behind the scenes (i.e. vector<> is a contiguous memory block but gets realloced as it grows, so always pre-size it if you know the final limit; list<> is a pointer list so there's no reallocation cost when adding elements, but random access is super slow since you have to iterate over everything, etc).
That was a big deal back in the day when processing power was very limited, so having multiple STL containers for different needs was crucial. It still is in some very core engine aspects (like graphics or physics) but 99% of the time you don't need to worry about it. There's some horrendously unoptimized code in my games that I tell myself I will fix later when it becomes a performance issue... but it never does.
zombe on 7/8/2014 at 12:39
Quote Posted by Qooper
Currently I'm doing the compilation in three steps:
1. read code for a compilation unit and tokenize it into smallest meaningful chunks
2. analyze this list of tokens and generate a tree structure that contains all the pieces of the program in a somewhat abstract form
3. take the syntax tree and emit target code (in this case C)
Tokenizer - cuts out the possibly meaningful parts
Lexer - adds extra information to tokens (a'la - this is a number / namespace [as only those can start with a capital letter]). After lexer there is only a generic token for integer numbers where the number is in its extra information slot.
Tokenizer itself is rather unusable - and i and probably you too abuse its name to mean a lexer :D.
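A small Python sketch of that tokenizer/lexer split, under made-up assumptions: the token kinds (`INT`, `TYPE`, `IDENT`, `SYMBOL`), the `Token` tuple and the capital-letter rule for type names are illustrative only, loosely following the description above. The tokenizer only cuts text into chunks, and the lexer decides what each chunk is and fills the extra information slot.

```python
import re
from typing import NamedTuple, Optional

class Token(NamedTuple):
    kind: str                    # "INT", "TYPE", "IDENT" or "SYMBOL" (hypothetical kinds)
    text: str                    # the raw chunk as cut out by the tokenizer
    value: Optional[int] = None  # extra information slot, used for integers

CHUNK_RE = re.compile(r"[0-9]+|[A-Za-z_]\w*|\S")

def tokenize(source):
    """Tokenizer: only cuts the text into possibly meaningful chunks."""
    return CHUNK_RE.findall(source)

def lex(source):
    """Lexer: classifies each chunk and attaches extra information."""
    tokens = []
    for chunk in tokenize(source):
        if chunk.isascii() and chunk.isdigit():
            # One generic integer token; the number lives in the extra slot.
            tokens.append(Token("INT", chunk, int(chunk)))
        elif chunk[0].isupper():
            # Assumed rule: only type/namespace names start with a capital.
            tokens.append(Token("TYPE", chunk))
        elif chunk[0].isalpha() or chunk[0] == "_":
            tokens.append(Token("IDENT", chunk))
        else:
            tokens.append(Token("SYMBOL", chunk))
    return tokens
```

On its own, `tokenize("Int foo = 42")` only yields the strings `"Int"`, `"foo"`, `"="`, `"42"`, which is why the bare tokenizer is not much use until the lexer has labelled them.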
My planned route:
* tokenizer (well, lexer with no prior separate tokenizer step)
* LL1 descent parser to generate abstract syntax tree
> LL1 using leftmost element to recursively descend on (http://en.wikipedia.org/wiki/Recursive_descent_parser). 1 = a single token of lookahead is enough, you know exactly what language construct is being described at every point (http://en.wikipedia.org/wiki/LL_parser).
> It is very simple and i suspect that even if you are not familiar with this stuff then you probably are intuitively using it.
> PS. do you have a list of production rules for your language? (http://en.wikipedia.org/wiki/Backus-Naur_form ?)
* analyze the tree and convert it into an internal tree (as stuff that exists in language does not necessarily exist in compiler. For example - "aliases"/"contexts" are meaningless for my compiler)
* compiler checks the validity of the program (propagates types where none were given, ensures variables/functions exist and are accessible etc). Where compile-time constants are required the parse tree will be run in virtual machine (basically C++ template meta-programming + constexpr without the restrictions and unneeded special syntax).
* convert into intermediate assembler with SSA (http://en.wikipedia.org/wiki/Static_single_assignment_form)
* do optimizations (those that could not have been done sooner)
* assign registers and write out machine code
Random thoughts on code example:
Quote Posted by Qooper
class Tree #this creates an abstract class that can be inherited from
class Node(left: Tree, right: Tree): Tree
class Leaf(data: any): Tree
close Tree #and this closes the class so that it cannot be extended through inheritance any further
Wow, this was confusing (probably less so than for you to decipher my info dump). Only noticed that the last line has "close" on my third read x_x - what can i say, it's 30C+ here.
... and i am still not completely sure i get what it exactly represents. How does my ~pseudo-C++ sound?
struct Tree {};
struct Node : Tree { Tree *left, *right; };
struct Leaf : Tree { Anything data; };
+ disallow further extending Tree (i assume one is still free to extend Node and Leaf).
Right?
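The same reading can be written down as a runnable Python sketch. It assumes the interpretation above is correct (only direct extension of Tree is blocked, while Node and Leaf stay extensible), and the `_closed` flag and `describe` helper are invented for the illustration.

```python
from dataclasses import dataclass
from typing import Any

class Tree:
    """Abstract base; direct subclassing is refused once it has been closed."""
    _closed = False

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Only direct children of Tree are blocked after closing;
        # Node and Leaf themselves can still be extended.
        if Tree in cls.__bases__ and Tree._closed:
            raise TypeError("Tree is closed and cannot be extended")

@dataclass
class Node(Tree):
    left: Tree
    right: Tree

@dataclass
class Leaf(Tree):
    data: Any

Tree._closed = True  # plays the role of "close Tree"

def describe(t):
    """Because the set of cases is closed, two checks cover every Tree."""
    if isinstance(t, Node):
        return f"({describe(t.left)} {describe(t.right)})"
    if isinstance(t, Leaf):
        return f"{t.data!r}:{type(t.data).__name__}"
    raise TypeError("not a Tree")
```

The practical gain of closing the hierarchy is that a match over Node and Leaf can be known to be exhaustive, which is what makes it behave like an algebraic datatype.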
* "val", "var". A bit concerning that it is a bit hard to visually distinguish the two. Why not use "const" (by your description, that seems to match what i use it for)?
* "var xp: int" - i assume you can omit "int" if it can be inferred? Then why not "int xp" or "var xp" (which is the way i went)? That said, type names and identifiers are easy to distinguish, in my case, which helps ("Int foo" instead of requiring "var Int foo" to stay LL1).
* "this" vs "self", my native language is not English, so i am intrigued by your choice. Is one better than the other in your case? Why?
* does the xp initialization to zero have to be in the constructor? In my case i can initialize (even using the member functions of the class being initialized) in class declaration and override it in constructor if needed (the compiler reorders all the tidbits (relatively simple/intuitive and deterministic rule) to make sense or cry error if it can not be done).
* "getName", i take that default access restriction is ~= C++ private? ... interesting, i highly prefer the opposite because prototyping/trying-things comes before setting-things-in-stone. Private is an unnecessary burden till then. Scripting language eccentricity i guess.
* "match t" ... Speaking of keywords, what do you have? Mine (current):
Code:
Keywords usable as identifiers:
alias, struct, class, context, namespace,
override, abstract, member, static,
final, as, bye, const, interface
Keywords that cannot be used as identifiers:
foreach, in, do, while,
switch, if, else,
defer, scope,
return, break, continue, goto,
var, null, new, read, out, is, instanceof, assume
* "def... : void = for _ <- 0 to count {...}" - ah, "_" usage :). My version: "foreach _ in [0 <= count] \ code" ... assuming yours is not exclusive to "count". A bit ambiguous.
PS. What is "void" :P ... and why do you seem to declare that the function returns it?
PS. Be mindful when using "<-". Seems fine here, but can easily become ambiguous (and break LL1).
* Tuples as "(x, x*3)" looks like function parameters. Before i had "{}" free for use i also used "()", but it totally broke LL1 (even tho it looked like it would not at first glance [might be fine in your case ... so far]). Even tried "?(...)", but it looked silly. Thankfully the solution was clear - drop all other uses for "{}" (used it for in-line block) and i can use "{}" as a more powerful and less confusing win-win alternative.
* "val f = printStuff("Hello world!") #this creates a functor with the first argument bound" interesting, but i don't think i would be ok with it.
I assume the functor generation is triggered by missing parameter (or alternatively by missing invocation if it had no parameters to begin with)? What happens when someone down the line adds a printStuff with one parameter? Stuff breaks in a bad manner (possibly non-compiler detectable semantic change because of unrelated changes elsewhere - add inheritance to that and shit is bound to hit the fan). That is a no-no :/.
I strongly recommend adding explicitness (i like the idea itself and will at some point consider adding something like that myself).
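As a point of comparison, this is roughly what the explicit variant looks like in Python, where binding arguments has to be requested through `functools.partial` and a call with a missing argument is simply an error. `print_stuff` is a stand-in for printStuff that returns the lines instead of printing them, purely so the result can be checked.

```python
from functools import partial

def print_stuff(text, count):
    """Stand-in for printStuff: returns the lines instead of printing them."""
    return [text] * count

# Explicit binding: the intent to create a functor is visible at the call site.
f = partial(print_stuff, "Hello world!")
lines = f(2)  # same as print_stuff("Hello world!", 2)

# A call with a missing argument never silently turns into a functor.
try:
    print_stuff("Hello world!")
    silently_bound = True
except TypeError:
    silently_bound = False
```

Because the binding is spelled out, later adding a one-parameter variant of the function cannot change what `f` means, which is the failure mode described above.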
Slightly related: i can use function invocations as variables - which your mentioned stuff would disallow (a'la javascript getters/setters with non-bonkers syntax):
Code:
context Example // ie. i do not care what Example is (class? struct? namespace?), i just want to conveniently add my junk to it, so set it as current context.
Int #hidden_int = 42 // a "jailed"/restricted integer (PS. Int is alias for Int32, there is no such thing as just Int)
static i_am_int_var \ return hidden_int // getter (with no additional logic for brevity)
static i_am_int_var(Int val) \ hidden_int = val; return val // setter (no additional logic also)
static i_am_another_int_var \ return &hidden_int // setter and getter in one (as the value is returned as reference).
static i_am_writeonly_int_var(Int val) \ hidden_int = val // setter only, and does not even return what the value was actually set at.
static i_am_readonly_int_var \ return hidden_int // getter only
// those are all functions, just that you do not have to use them as such.
static test
Int i_am_a_real_int_var = 10
i_am_a_real_int_var -= i_am_int_var // OK
i_am_int_var += 10 // OK
i_am_another_int_var += i_am_another_int_var + i_am_readonly_int_var // OK
i_am_writeonly_int_var = 0 // OK
i_am_writeonly_int_var += 666 // ERROR!
i_am_readonly_int_var = i_am_a_real_int_var // ERROR!
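For readers more at home in a mainstream language, the closest analogue is probably Python's built-in `property`. This is only an approximation under assumptions: the class and attribute names below are shortened, hypothetical versions of the ones above, and the errors show up at runtime as `AttributeError` rather than at compile time.

```python
class Example:
    def __init__(self):
        self._hidden_int = 42  # the "jailed" integer

    # Getter and setter pair: usable like a plain variable.
    @property
    def int_var(self):
        return self._hidden_int

    @int_var.setter
    def int_var(self, val):
        self._hidden_int = val

    # Getter only: assigning to it raises AttributeError.
    @property
    def readonly_int_var(self):
        return self._hidden_int

    # Setter only: a property without a getter, so reading it
    # (including the read hidden inside "+=") raises AttributeError.
    def _set_writeonly(self, val):
        self._hidden_int = val

    writeonly_int_var = property(fset=_set_writeonly)
```

So `e.int_var += 10` and `e.writeonly_int_var = 0` work, while `e.writeonly_int_var += 666` and `e.readonly_int_var = 1` fail, mirroring the OK/ERROR lines in the test function above.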
Since you mentioned memory/object-lifetime management i mention mine :). Since my language is meant for scripting and not to be a standalone language then it will use whatever the host language provides. So far i already have (not made with the language in mind, just made prior for whatever):
* Incremental mark-and-sweep garbage collector implemented as optional trait that can be added to any C++ class.
* Reference counting, similarly as an optional trait.
* Single-owner/Unique
On memory management part there is also:
* Non-freeable allocator as trait.
* Pool as trait.
* Single-owner random chunks of memory.
I intend to expose it on scripting side by generic means (ie. rule and code lists). "class" uses GC by default, "struct" does not and in fact cannot as it has an immutable trait of "trivial-copy".
Code:
// example of traits on C++ side
class Snafu {
USE_GC; // instances of Snafu will use garbage collector. proper use is enforced, as much as possible (C++ is kinda shit), by compile and runtime guards/thingies.
// alternatively, using the generic trait list: USE(GC). Good if you want to add multiple traits like: USE(POOL, LOG)
public:
my stuff ...
};
Quote Posted by Qooper
I wouldn't be able to call my compiler a translator, since some of the things I'm providing are not a direct part of C. Instead I'm _implementing_ them using elaborate chunks of C.
That is still translation :P (One could say that no human language can be directly translated to another - they are different with their own quirks and features, adding stuff that was not in the original is unavoidable when translating).
PS. do not pay much attention to me on that subject - i am just expressing my frustration on the master-hair-splitter teacher i had nearly 20y ago. The reality is that any translation in programming language context is compiling as that is how the word is defined (ex: translation of C language into machine code language). "Compiler": a tad bit superfluous word nowadays - it used to be a badly defined word and now it is just kinda pointless.
Quote Posted by Pyrian
I'm sorry, C++XX? What's the XX? Versioning?
Yes. While C++14 and its prior versions have made the language less atrocious (... sorta) - it is still shit.
Quote Posted by Yakoob
All the other stuff is neat but this one point I'm gonna grind my teeth at. PHP does the same thing, and it's one of the main reasons why it's horrible.
Wait, since when is using nonexistent functions fine with PHP? :P
Anyway, it was needed for the main feature of changing any parts of code/data at runtime without shit hitting the fan - fitness of the language was never a concern. Like i said, it was a silly experiment (that taught me that hot-swap capability is a must-have for any scripting language - it is just too good not to have, just do not go as far as i did).
Quote Posted by Yakoob
I quite like the syntax and general structure (classes, functions, enums etc.) Plus inheritance can be useful if you know how to use it.
Interesting. Inheritance, and well, all of the OOP related stuff, is one of the most fucked up parts of C++ (fucked up way beyond any hope of redemption IMNSHO) :/
I would probably run over post-length-restriction even listing all the mind-boggling fuckups ... so, i just drop this (aka. welcome to 20 years ago): http://archive.adaic.com/intro/ada-vs-c/cppcv3.pdf (third edition)
Amusing quote: "The critique is long; it would be good if it were shorter, but that would be possible only if there were less flaws in C++."
I would say it, as an anti-example, is a must-read for everyone going out to write a new language. In addition to specialization questions like those: http://www.redmountainsw.com/wordpress/2007/06/11/considerations-when-designing-your-own-programmingscripting-language/ (list seems a bit too short, but could not find anything better :( ). Not to mention the number one advice: DON'T WRITE ANOTHER BLOODY LANGUAGE! ... unless you just want to - then it is fine.