FAIR WARNING: THIS ARTICLE IS UNPUBLISHED. AN EC# COMPILER DOES NOT EXIST YET.
Enhanced C# for PL Nerds, Part 1 of 4: Introduction
====================================================
Enhanced C# (EC#) is a new statically-typed programming language that combines C# with features of LISP, D, and other languages.
This article is a quick introduction to Enhanced C# for programming language wonks, the kind of people who fantasize about using continuations to model higher-order polymorphic multimethod monads to conjure the spirit of Robin Milner, or something.
I'm not that kind of guy, though. Computer science is wonderful, but my main concern in life is creating useful tools for the real world. Frankly, I don't know the first thing about automated theorem provers, no professor at my university ever taught type theory or Prolog (I took Computer Engineering, so I guess that's the difference between Computer Engineering and Computer Science!), and monads give me a headache (arrows seem much easier, but I digress).
I'll discuss EC# briefly, but my main goal here is to explain the grand ideas behind it, and perhaps recruit some bright people to work on it with me as volunteers. EC# rules are currently tentative and subject to change. Even the name is debatable, as long as we're at version 0.1; someone suggested I should call it "C Major", for example. But as you'll see later, EC# syntax really puts the "#" in "C#", and EC# is more Google-friendly.
EC# is the first language of the Language of Your Choice (Loyc) project. If it becomes popular, Loyc will be a platform for building programming languages, analyzing code and transforming syntax trees. I want Loyc to be the foundation of all kinds of programming language tools: compilers, IDEs, language conversion tools, analysis tools, graphical code editors, domain-specific languages, things that Bret Victor might invent*, anything you can think of. Thus, EC# is merely an instance of a "Loyc language", much like C# is an instance of a ".NET language". The fact that EC# and Loyc are built on .NET is just an implementation detail--a very important detail today, but perhaps not that important in the grand scheme of things.
* EC# was invented on principle.
Quite frankly I think my goals are loftier than my abilities, but I can do a better job than most because I've been thinking about making a language for over ten years. I assume that the people who are most qualified to create something like Loyc are either (A) hashing out some aspect of type theory at a university, writing papers that no one outside their niche can understand, published in journals inaccessible to the public, or (B) making big bucks on proprietary tools like Resharper or the DMS Software Reengineering Toolkit. Nevertheless, EC# is my itch, so I'm scratching it.
Like C#, EC# is a statically typed object-oriented language in the C family. When complete, it should be about 99.7% backward compatible with C#. At first it will compile down to plain C#; eventually I want it to have a proper .NET compiler, and someday, a native-code compiler. EC# enhances C# with the following categories of features:
1. A procedural macro system
2. Compile-time code execution (CTCE)
3. A template system (the "parameterized program tree")
4. An alias system (which is a minor tweak to the type system)
5. Miscellaneous specific syntax enhancements
6. Miscellaneous semantic enhancements
Only item #1 and most of #5 currently exist. But even with just the macro system and syntax extensions, EC# is much more powerful than C#. The term "macro" comes from LISP; a macro is a method that runs at compile time and (usually) takes and returns code instead of normal data (in EC#, code is just data of type Node).
EC# does not substantially change the type system, and the syntax is only slightly extensible because the need for backward compatibility with C# limits the possibilities. However, with the help of macros, EC# syntax is vastly more flexible than C#.
EC# is mainly a compile-time metaprogramming system built on top of C#, but it also provides lots of useful enhancements for people that are not interested in metaprogramming. Many of these enhancements are built using the metaprogramming facilities, but developers don't have to know or care about that.
"Metaprogramming" here refers to compile-time code generation and/or analysis; CTCE, templates and macros will each contribute in a different way to EC#'s metaprogramming system.
Here are some quick highlights. First, EC# offers some syntactic shortcuts compared to C#. Many shortcuts are offered by macros; others are built-in syntactic sugar.
public struct MyPoint
{
    // Simultaneously declares two fields "X" and "Y" and a constructor for them.
    // In general, EC# allows new() as a quicker syntax for writing constructors;
    // the syntax [[...]] is a built-in syntactic sugar that means "call this
    // macro on the statement that follows"; thus, set() is simply a method in
    // the EC# standard library that creates and initializes X and Y for you.
    [[set]]
    public new(public int X, public int Y);

    // You could make a test() macro for writing unit tests easily, and use it
    // with syntax like this:
    test {
        var p = new MyPoint(2, 3);
        check(p.X == 2 && p.Y == 3);
        var q = p;
        q.X = 5;
        check(p.X == 2);
    }
}
// The field() macro creates a property with a backing field of the same type.
[[field(protected _foo)]]
public int Foo { get; set; }
// String interpolation (the $ prefix avoids breaking C# compatibility):
string name = "Dave";
MessageBox.Show($"I can't let you do that, $(name).");
// Method forwarding:
int Min(int x, int y) ==> Math.Min;
int two = Min(12, 2);
// Quick variable binding (creates a variable called "r" to hold the result)
if (Database.TryRunQuery()::r != null)
    Console.WriteLine("Found {0} results", r.Count);
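Since EC# will initially compile down to plain C#, each of these shortcuts has a straightforward expansion. Here is a sketch of what the converter might emit for two of them; the exact output is my guess, not a specification:

```csharp
// Plausible plain-C# expansions (the translation strategy is an assumption):

// $"I can't let you do that, $(name)." might become:
MessageBox.Show(string.Format("I can't let you do that, {0}.", name));

// if (Database.TryRunQuery()::r != null) ... might become:
var r = Database.TryRunQuery();
if (r != null)
    Console.WriteLine("Found {0} results", r.Count);
```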
You can think of EC# syntax as "generalized C#"; the parser accepts almost anything that looks vaguely like C# code, and macros are required to convert code that EC# would not understand into code that has a well-defined meaning. In addition, the parser accepts some things like [[foo]] and `foo` that don't even look like C#.
Compile-time code execution will "just work". The const() pseudo-function forces compile-time evaluation, and it is implied in any context where a constant is required:
const double TripleWhammy = new[] { Math.PI, Math.E, Math.Sqrt(2) }.Sum();
The template system will complement the existing C# generics system. In C#, a generic method that compiles is guaranteed to work for all types that meet the method's constraints. That's a useful property, but it limits what you can do. Templates (marked with $ on the type argument) are handy when you need compile-time duck typing:
public static T Sum<$T>(IEnumerable<T> list)
{
    T sum = (T)0;
    foreach (var item in list)
        sum += item;
    return sum;
}
.NET Generics are well-designed and bless their hearts I love 'em, but they can't do anything like this because "+" and "0" don't exist for all types, and any attempt to write a "generic" sum function is very clunky in one way or another. What you see here is a C++-style template, and it solves the problem neatly.
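For concreteness, here is roughly what an instantiation of the template for T = int could look like after expansion to plain C# (the name Sum_int is made up; the compiler's actual naming scheme is unspecified):

```csharp
// Hypothetical result of instantiating Sum<$T> with T = int.
// This compiles because "+" and "0" are checked against the concrete
// type int, something C# generic constraints cannot express.
public static int Sum_int(IEnumerable<int> list)
{
    int sum = (int)0;
    foreach (var item in list)
        sum += item;
    return sum;
}
```

A separate copy would be generated for double, decimal, and so on, just as with C++ templates.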
Background
----------
First of all, I'm a performance guy; I've had to optimize most of the programs I've ever written (from the old Super Nintendo emulator SNEqr to the FastNav GPS system, which is optimized for a 400MHz ARMV4I machine), so I'm in the habit of constantly thinking about performance. However, the go-to language for performance is C++... and I hate C++ the more I use it. C++ is horrible. It's ugly, it compiles really slowly, it's bureaucratic, it requires copious code duplication, and GCC's error messages are a cruel joke. I could go on.
Not long ago I discovered a practical alternative, D version 2, which I like to call D2 because Google doesn't understand "D". The delightful D language, which inspires EC# in many ways, can't replace C++ fast enough if you ask me. It has many great features including slices and ranges, compile-time reflection, lambdas as template arguments, string and template mixins, rudimentary SIMD support, and "alias this" which is sort of like overloading the "." operator.
However, it also has a lot of rough edges that make me uncomfortable, and since its compiler is written in C++, I am not interested in hacking on it myself. Besides compiler bugs and a general beta feel, it also can't target Android, doesn't really support dynamic loading of DLLs, doesn't have a runtime reflection system (though I'm sure the compile-time reflection is really nice if you can figure out how to use it), and has a rather basic garbage collector.
I also really want an extensible language, and I have a clear impression that D's creator, Walter Bright, does not.
My need for an extensible language crystallized after I wrote an extension for the boo programming language that provided unit inference. You'd write x = y / 2`hr` in one place, and x = 160`km/hr` elsewhere, and hey presto, the compiler would infer that the variable y has units of `km`. If you then pass this to a function that expects `mph`, you get a compiler scolding. But when I announced this creation on boo's mailing list, no one acknowledged it. Apparently it didn't matter to anyone there. Plus, it required a tweak to boo's syntax, but boo's creator or one of the developers (I forget) told me that there was no willingness to change the parser. Disappointed, I stopped working on the inference engine, and that was that.
Soon afterward I had the idea for Loyc: Language of your choice. It would be a system that lets individual users, not some slow and conservative language committee or ostensibly benevolent dictator, tweak the syntax and semantics of a language. This idea, however, puttered along very slowly because I had, besides a full-time job, no idea how to actually create an extensible language. How do you make a language in which lots of syntax and semantic extensions written by different people can magically get along? I kept asking myself that question and my brain was all like, ppffft, I dunno buddy!
As of August 2012, I reduced my full-time job to a part-time one. Inspired by D2 and a brief study of LISP, the ideas for EC# finally started crystallizing.
Don't We Have Enough Languages Already?
---------------------------------------
The world is already awash with hundreds (thousands?) of programming languages, so when a new language comes out, it's always fair to question if it's needed.
EC#'s foremost feature, macros, is based on LISP, and it's fair to ask, why don't I just use LISP? Well, for one thing, it's dynamically typed and I prefer static typing. But it's not just that.
Throughout my programming career I've mostly gone with the flow, learning mostly popular and well-known languages (BASIC, Pascal, C, C++, Ruby, C#, etc.). But every so often someone like Paul Graham or a random guy on Slashdot gushes about one of those "alternative" languages like LISP or Haskell, or possibly Prolog, Erlang or OCaml... the kinds of languages you should learn "even if you never use it for real-world apps".
And I know they're right. I have studied LISP and Haskell enough to know that the concepts they teach are very important. But somehow, it seems like every time I sit down to study them, I can't bring myself to write real software with them.
Part of the problem is just that I've been using conventional languages so long that it's difficult to change. I know that the popular languages are limited beasts and full of warts, but I know many of those warts inside and out. I understand the way a computer works almost down to the level of logic circuits; I am comfortable with the imperative model of computing that maps so well to the underlying machine; I understand the performance characteristics of my languages; and I enjoy working in Visual Studio. It's difficult to step out of that comfortable world into something as alien as LISP or Haskell.
But it's not just that, either. I think these powerful, but unpopular, languages (call them "powerful fringe languages", or PFLs) have two other problems that keep them unpopular:
1. The communication gap
2. The integration gap
The goal of EC# is to be as powerful as LISP (not to mention D2), without having these two problems.
The "communication gap" is the gulf between the terminology and mindset teachers use to describe "far-out" languages, compared with the terminology and mindset that seasoned programmers already understand. It also refers to the way that some PFLs "abstract away the computer" so that one can no longer understand how a program works in terms of the physical machine (I am thinking of Haskell and Prolog here, not so much LISP). Thus, the better you understand a computer as a machine rather than as a mathematical tool, the more difficulty you have learning the PFL.
Whenever I went out to the web to learn about Haskell, the tutorials I found tended to treat me like a programming novice with a degree in mathematics, which, of course, is exactly backwards. From what I've seen, tutorials about Haskell do not mention the memory model of the computer, or how an executable program is a sequence of little instructions that fetch data from memory and manipulate it, or how pointers are combined with little objects on the "heap" to construct complex data structures. And why should they? Haskell is a "powerful" language that abstracts away these details far more than C, C++, Java, or even LISP. So instead they talk about how you can write math-like equations that describe the relationship between input and output; they talk about currying and partial application and recursion and higher-order functions and monads.
Reading about Haskell can be interesting, overwhelming and/or exhausting, and yet I walk away disappointed, unsure if I have really learned anything. Indeed, after several unsuccessful attempts to understand what a "monad" was, I finally found and read an article about them and thought: "oh, I see, yeah, I think I get it now". A month later I realized that I had completely forgotten what I had learned.
I have never seen a Haskell tutorial that talked about how a Haskell program is actually executed, how memory is managed, how to scale it to many processors or whether that is even possible, how to use hashtables in Haskell, or how the concept of typeclasses compares with the familiar concepts of classes, inheritance, and virtual function tables. See the communication gap? Haskell programmers speak a different language than seasoned "real-world" devs. This gap is especially wide for me: most of the programs I've written have needed high performance and low memory usage; for me it's maddening that I am unable to even guess the computational complexity of a piece of Haskell, let alone guess its wall-clock time. Ditto for memory usage. Even the cost of basic properties like lazy evaluation is unknown to me.
The LISP family of languages, meanwhile, is (at least in some incarnations) closer to familiar imperative languages in terms of semantics, but its syntax is even more unusual than Haskell's. And again, tutorials are usually written as if for programming novices, without relating LISP to popular languages whose names start with "J" or "C". Some LISP programmers even continue to use functions like "car" and "cdr" which are utterly meaningless outside the LISP world, the automotive world, and the optical media world. You could make similar complaints about C functions like "strstr" and "atoi", of course; the difference is that C can get away with a lot of stupid crap because it's already popular.
Indeed, the very fact that LISP continues to persist using s-expression syntax when perfectly good alternatives exist (e.g. sweet expressions) makes it obvious why LISP has not taken off. LISPers insist on communicating in their own special way, and it puts people off.
One of LISP's main advantages is its macro system, which allows you to easily manipulate syntax trees at compile-time. This allows you to automate the task of writing sequences of code that are similar, but not similar in a way that allows them to be combined into a single function or a C++ template. A simple example of this would be if you have a series of 3-dimensional (X, Y, Z) points. Suppose that sometimes you need to manipulate the X axis, sometimes the Y axis, sometimes the Z axis, and sometimes all three (a k-d tree works this way, for example); in a conventional language, the only practical alternative is to use an array instead of naming each coordinate; but in many languages this is inefficient because the array is stored in a separate heap object from the "Point" object and then requires a bounds check on every access. In situations like this, I always give up and write separate, nearly identical, copies of the necessary code. LISP macros, though, solve this kind of problem easily, with zero runtime cost.
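To make the duplication concrete, here is the kind of hand-written, nearly identical C# that a macro could generate from a single description of the pattern (Point3 and the Mirror* methods are hypothetical names, invented for illustration):

```csharp
// Nearly identical per-axis code, written by hand. A LISP-style macro
// could stamp out all three variants from one template, with no arrays,
// no bounds checks, and no runtime cost.
struct Point3 { public double X, Y, Z; }

static Point3 MirrorX(Point3 p) { p.X = -p.X; return p; }
static Point3 MirrorY(Point3 p) { p.Y = -p.Y; return p; }
static Point3 MirrorZ(Point3 p) { p.Z = -p.Z; return p; }
```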
Anyway, I find it ironic that while LISP has an unparalleled ability to manipulate syntax trees, making new forms of automation possible, it has no standard ability to manipulate syntax itself, forcing LISP programmers to "think like the machine" in a way that other languages don't. When it comes to creating a program, LISP says: doing a dull, repetitive task? No, no, let the computer handle that for you! But they still ask you to perform the dull, repetitive mental task of parsing LISP. Of course, humans are made for parsing, so I'm sure we could get used to LISP--someday. But while waiting for someday to come, LISP-based languages languish in obscurity.
Now sure, you could go to some extra effort to install a special LISP reader that supports infix syntax. But when you go read a LISP tutorial, that won't be the syntax they use to teach it. When you go to a LISP mailing list, that won't be the syntax of all the code there. When you download a large LISP program, same thing. In theory you can transform LISP to any syntax you want, but as a student this fact doesn't help at all.
Finally, I think I should take a moment to unfairly criticize Microsoft's own F#, too. I tried to learn it, but I found its syntax very confusing; for instance I couldn't understand why there were such strikingly different syntaxes for defining functions that do the same thing ("static member inc(x) = x+1", "let inc x = x+1", "fun x -> x+1"), I couldn't figure out what parts were the "core" language and what parts were syntactic sugar, and so on. Since F# is impure and based on the .NET type system which I already grok, the problems I faced were purely matters of communication; the web-based tutorials I read simply didn't tell me what I wanted to know (if you get that vibe from EC#, by the way, let me know what to clarify!)
EC# solves the communication gap by providing LISP-like macros and other new features on top of the familiar syntax of C#. And by converting the code to plain C#, you can see the effect of any macro or syntactic sugar.
The "integration gap" is just as important: it refers to the difficulty of combining PFLs into existing systems and code bases.
The problem is that you can't just mash up LISP with Haskell or Haskell with Ruby or the language of your choice. I mean, I've already got a nice personal toolbox of C++ code and C# code. Some of it was written by me, some of it by third parties. Wherever it came from, I want to be able to simply use that existing code, without any hassles, from my PFL program. And I can't. Equally important, even if I'd like to write a brand new code library or program with no dependencies on existing code, the reverse problem appears: I can't take my new library or parts of that program and import it into one of the popular languages.
EC# solves the integration gap by
1. being backward compatible with C# and by
2. being callable from all other .NET languages.
Not only that, the foundation of EC#, which I call Loyc (Language of your choice), is an idea for a general tool of syntactic manipulation and transformation, so
1. Someday it may be possible to integrate EC# code seamlessly with languages like C++, D or maybe even dynamic languages like Javascript.
2. By writing new parsers, other people could shoehorn existing languages into the Loyc system, which then allows code written in multiple languages to be combined seamlessly into a single assembly (DLL or executable file), possibly with circular dependencies between languages (not that you should, but it's possible in principle.)
3. EC# can solve the integration gap not only for itself, but for other experimental languages too. For instance, if you want to experiment with a new type system, why would you write an experimental compiler that produces native code or even LLVM bitcode? If you build your language to produce Loyc syntax trees instead, you can arrange to allow method calls and complex type sharing between code in your new language and EC# or .NET code. Because of that, your language is usable for real work and is easier to "sell" to your engineer friends. You could also borrow EC# syntax for your language, which saves you the trouble of writing a parser.
In summary, I am trying to create not just a programming language, but the most useful programming tool in the world. EC# is not based on C# because I love C# (although it is my second-favorite language after D). EC# is based on C# because I believe it is the best way to build an ecosystem of users and tools.
Please don't get the impression that I don't like Haskell, by the way, or LISP for that matter. No, these languages are really neat, but there's a gap that needs to be bridged. In fact, I hope that Loyc will evolve into a system in which most code is functional code. However, I think that the functional language should be built on top of an imperative language. I imagine a set of macro libraries that transform functional code back into imperative code, e.g. replacing tail-recursion with looping, and copying with in-place modification where possible, with intermediate code that the programmer can readily see and understand. If he is unhappy with the performance of the functional code, he can change the functional code until the imperative equivalent works as desired; he could even replace the functional code with the imperative equivalent and hand-optimize it.
Ahem!
-----
And now, let us pause for a moment of code examples.
// Element-in-set test
if (x in (2, 3, 5, 7))
    Console.WriteLine("Congratulations, it's a prime! One digit, even! or odd!");
// Tuples and tuple unpacking
(var x, var y) = (a * b, a + b);
// Safe navigation operator (textBox??.Text means textBox==null?null:textBox.Text):
assert(textBox??.Text == model.Value.ToString());
var firstPart = orderNo??.Substring(0, orderNo.IndexOf("-"));
// Switch without braking, er, breaking:
switch (choice) {
    case "y", "Y": Console.WriteLine("Yes!!");
    case "n", "N": Console.WriteLine("No!!");
    default: Console.WriteLine("What you say!!");
}
// Statements as expressions (some restrictions apply):
Console.WriteLine(
    switch (choice) {
        case "y", "Y": "Yes!!"
        case "n", "N": "No!!"
        default: "What you say!!"
    });
// Expressions as statements (needs a macro to transform it into something useful):
x => y | z;
// "using" cast operator: allows a conversion only when it is guaranteed to succeed
int a = 7, b = 8;
var eight = Math.Max(a, b using double); // 'eight' has type double
// Symbols (a kind of extensible enum)
var sym = @@ThisIsASymbol;
// This is a macro definition.
[SimpleMacro("static_assert")]
Node StaticAssert(Node condition, IMessageSink msgs)
{
    return s_quote {
        static_if(!$condition) {
            @#error("Static assertion failed: {0}", quote($condition));
        }
    };
}
Here, we create a static_assert macro that can check a condition at compile-time. StaticAssert() is an ordinary method (available at compile-time and run-time), but the [SimpleMacro] attribute tells the compiler to also treat it as a macro, in this case giving it the name "static_assert".
The argument "condition" is passed to the method as a syntax tree. s_quote {...} quotes a block of code and requests substitution; within the quoted block, the dollar sign $ operator expands the value of "condition". When this method is called as a macro (using the name static_assert instead of StaticAssert), the compiler passes its argument as a syntax tree, and then expands the return value. So
static_assert(Math.PI > 3);
becomes
static_if(!(Math.PI > 3)) {
@#error("Static assertion failed: {0}", quote(Math.PI > 3));
}
static_if() is itself a macro, which collapses to nothing (vanishes) at compile-time because !(Math.PI > 3) is false. Macros can be called at the class level, outside of any method, so static_if can be used to decide whether to create a method or not.
static_if() uses syntactic sugar in order to take the braced block {...} as an argument. The parser automatically transforms statements of the form
foo(expr) { stmt; }
into
foo(expr, { stmt; });
where {...} denotes a list of statements. This transformation occurs without knowing whether "foo" is a macro or not, but normally it is.
@#error is a special built-in pseudo-method that prints a compile-time error. Note that "#error" cannot be used here because it would be parsed as a preprocessor directive, so the error would be printed by the preprocessor before the macro is even parsed, let alone invoked. The "@" sign marks "#error" as an identifier, and the "#" sign indicates that the identifier has some special meaning (in this case, the compiler recognizes #error as a directive to print a compile-time error).
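For context, plain C# already uses "@" to let reserved words act as ordinary identifiers; EC# builds on that convention, extending it to identifiers like #error that contain special characters. A reminder of the plain C# behavior:
// Plain C#: "@" lets a keyword serve as an ordinary identifier.
int @class = 3;
int @if = @class + 1;    // a variable whose name is "if"
Console.WriteLine(@if);  // prints 4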
When EC# is complete it will allow global operators, too, so EC# can serve as a do-it-yourself MATLAB:
public T[] operator+ <$T>(T[] a, T[] b)
{
if (a.Length != b.Length)
throw new ArgumentException("operator+: array lengths differ");
T[] result = new T[a.Length];
for (int i = 0; i < a.Length; i++)
result[i] = a[i] + b[i];
return result;
}
public T[] a<T>(params T[] list)
{
return list;
}
var fivePrimes = a(1,2,3,4,5) + a(1,1,2,3,6); // (2, 3, 5, 7, 11)
Note that since "operator+" appears outside any class, it is implicitly static; you don't have to tell the compiler the obvious.
Why .NET?
---------
You may wonder, if I care so much about performance, why do I want a .NET language?
First of all, at least on Windows, .NET's performance isn't bad at all. As for Mono... well, I'm sure it'll improve someday.
But the really key thing I like about .NET is that it is specifically a multi-language, multi-OS platform with a standard binary format on all platforms--much like Java, but technically superior. It's only "multi-OS", of course, insofar as you avoid Windows-only libraries such as WPF and WCF, and I would encourage you to do so (if your boss will allow it). .NET solves the "integration gap" as long as the languages you want to mash up are available in .NET.
Without a multi-language runtime, interoperability is limited to the lowest common denominator, which is C, which is an extremely painful limitation. Modern languages should be able to interact on a much higher level.
A multi-language platform avoids a key problem with most new languages: they define their own new "standard library" from scratch. You know what they say about standards, that the nice thing about standards is that there are so many to choose from? I hate that! I don't want to learn a new standard library with every new language I learn. Nor do I want to be locked into a certain API based on the operating system (Windows APIs and Linux APIs). And all the engineering effort that goes into the plethora of "standard" libraries could be better spent on other things. The existence of many standard libraries increases learning curves and severely limits interoperability between languages.
In my opinion, this is the most important problem .NET solves; there is just one standard library for all languages and all operating systems (plus an optional "extra" standard library per language, if necessary). All languages and all operating systems are interoperable--it's really nice! The .NET standard library (called the BCL, base class library) definitely could be and should be designed better, but at least they have the Right Idea.
Java has the same appeal, but it was always designed for a single language, Java, which is an unusually limited language. It lacks several important features that .NET has, especially value types, reified generics, pass-by-reference, delegates, and "unsafe" pointers. It doesn't matter if you switch to a different JVM language, those limitations have a significant cost even if the language hides them.
And C# is likeable. It's not an incredibly powerful language, but it's carefully designed and it's more powerful and more efficient than Java. Similarly for the .NET CLR: yeah, it has some significant limitations. For example, it has no concept of slices or Go-style interfaces. And the BCL is impoverished in places--its networking libraries are badly designed, its newest libraries are horribly bloated, and lots of important stuff is still missing from the BCL. Even so, in my opinion .NET is the best platform available.
Welcome to EC# 2.0
------------------
When I began to design EC#, I couldn't figure out how to accomplish my end goal--a fully extensible language--so instead I decided, as a starting point, that I would simply improve the existing C# language with a series of specific and modest features such as the "null dot" or safe navigation operator (now denoted ??.), the quick-binding operator (::), multiple-source name lookup (which, among other things, basically allows static extension methods), the "if" clause and method forwarding clauses, return value covariance, CTCE, aliases, and most notably, compile-time templates. My plan was to start with NRefactory in order to get the new language working as quickly as possible.
Once I was done drafting the new language, however, I noticed that despite all the work I'd put into the draft alone, the new language still didn't address one of the most requested features from C# users: "Provide a way for INotifyPropertyChanged to be implemented for you automatically on classes".
INotifyPropertyChanged is a simple interface for allowing code to subscribe to an object to find out when it changes. For example:
interface INotifyPropertyChanged {
event PropertyChangedEventHandler PropertyChanged;
}
class Person : INotifyPropertyChanged {
public event PropertyChangedEventHandler PropertyChanged;
string _name;
public string Name
{
get { return _name; }
set {
if (_name != value) {
_name = value;
Changed("Name");
}
}
}
string _address;
public string Address
{
... same thing ...
}
DateTime _dateOfBirth;
public DateTime DateOfBirth
{
... same thing ...
}
void Changed(string prop)
{
if (PropertyChanged != null)
// The .NET "EventArgs" concept is stupid, but I digress
PropertyChanged(this, new PropertyChangedEventArgs(prop));
}
}
The problem is that you need 10 new lines of code for every new property in your class. Couldn't the language have some kind of feature to make this easier? But I didn't want to have a feature that was specifically designed for INotifyPropertyChanged and nothing else. Besides, there are different ways that users might want the feature to work: maybe some people want the event to fire even if the value is not changing. Maybe some people want to do a reference comparison while others want to do object.Equals(). Maybe it can't be fully automatic--some properties may need some additional processing besides just firing the event. How could a feature built into the language be flexible enough for all scenarios?
I decided at that point that it was time to learn LISP. After studying it briefly, I figured out that what would really help in this case is macros. Macros would allow users to provide a code template that is expanded for each property. But ideally, LISP-style procedural macros also require some straightforward representation of the syntax tree so that syntax trees can be easily manipulated.
I suppose in the case of INotifyPropertyChanged, a macro doesn't really have to introspect a syntax tree; it just needs a way to "plug things in" and some sort of trickery to convert an identifier to a string and back. Something like this:
[SimpleMacro]
Node NPC(Node dataType, Node propName, Node fieldName)
{
string nameText = propName.ToString();
return s_quote {
// The substitution operator "$" inserts code from elsewhere.
public $dataType $propName {
get { return $fieldName; }
set {
if (!object.Equals($fieldName, value)) {
$fieldName = value;
Changed($nameText);
}
}
}
};
}
/* Usage: */
class Person : INotifyPropertyChanged {
public event PropertyChangedEventHandler PropertyChanged;
string _name;
NPC(string, Name, _name);
string _address;
NPC(string, Address, _address);
DateTime _dateOfBirth;
NPC(string, DateOfBirth, _dateOfBirth);
...
}
This feature would need substantial changes to NRefactory's parser, so it was looking quite difficult, given that I didn't have a clue how to use NRefactory's parser generator (called Jay). But for my macro system to be *really* powerful, it needed powerful tools to manipulate syntax trees. So I re-examined my whole plan--maybe I shouldn't start with NRefactory after all--and I thought about what kind of syntax tree would be best in a really good macro system.
I considered using LISP-style lists, but they didn't fit into C# very well. In particular I didn't feel that they were the right way to model complex C# declarations such as
[Serializable] struct X<Y> : IX<Y> where Y : class, Z { ... }
Besides, LISP lists don't hold any metadata such as the original source file and line number of a piece of code. And how would I address the needs of IDEs--incremental parsing, for example? So I fiddled around with different ideas until I found something that felt right for C# and other languages descended from Algol. Eventually I decided to call my idea the "Loyc node" or "Loyc tree".
The concept of a Loyc node involves three main parts: a "head", an "argument list" and "attributes".
I'll talk more about Loyc trees later. With my new idea ready, I decided to ditch my ideas for EC# 1.0 and go straight to the language of my dreams, EC# 2.0. No longer based on NRefactory, it would use a new syntax tree, a brand new parser, and a new set of features.
Footnote: the Loyc tree was originally two ideas that I developed concurrently: one idea for the concept of the tree as viewed by an end-user (i.e. other developers), and another idea for how the implementation works. The implementation originally involved two parallel trees, "red" and "green", inspired by a similar idea in Microsoft Roslyn, but I had difficulties with this approach and reverted to a simpler "just green" design in which the tree is entirely immutable and parents of nodes are unknown.
Enhanced EC# for PL Nerds, Part 2 of 4: Macros and Loyc trees
=============================================================
Features of the macro system
----------------------------
As I mentioned, a macro is a method that is called at compile time, that takes syntax trees as arguments and returns another syntax tree, which is expanded in-place. A macro can also take values as arguments, which are evaluated as compile-time constants.
The static_if macro is a simple example:
[SimpleMacro]
public Node static_if(bool condition, Node then)
{
return condition ? then : quote();
}
As a macro argument, "condition" must be evaluated at compile-time. Here, [SimpleMacro] is given no arguments, which means that the method always acts as a macro and cannot be called as an ordinary method.
By convention, macro names are lowercase to convey their nature as extensions of the programming language, but the parser does not attempt to distinguish macro calls from ordinary method calls; figuring out which is which requires a method lookup--a process that is probably more complex than macro expansion itself.
Here's another simple macro. All it does is run a statement twice:
[SimpleMacro]
public Node twice(Node action)
{
return s_quote {
$action;
$action;
};
}
quote {...} and s_quote {...} cause a block of code to be treated as data; surrounding code with quote {...} or s_quote {...} is called "quoting" the code. This mechanism is similar to C# expression trees, but more flexible: expression trees can only represent simple functions, but quote {...} can contain any code whatsoever: statements, methods, properties, classes, events, "using" directives, and more.
The main difference between quote {...} and s_quote {...} is that s_quote {...} supports substitution (the $ operator). In a s_quote {...} block, the substitution operator $ inserts the value of a variable or expression into the quoted code. The same character is used to insert values into strings:
MessageBox.Show($"Access to $(filename) has been denied.");
When EC# is complete, you'll be able to define a macro and call it in the same source file (or from another source file in the same project). Order will not matter; you could call static_if above or below its definition in the source code. However, calling a macro defined in the same project requires compile-time code execution (CTCE), which is not yet implemented. Therefore, right now you must define macros in a separate project, compile that project to an assembly (i.e. DLL or dynamic-link library), and then pass the macro assembly as a "compile-time reference" to the compiler.
When macro calls are nested, they are executed outside-in. So if you nest a static_if() within another static_if(), the outer one is executed first, then the inner one--which is the reverse of how normal method calls work. This makes sense, of course, since manipulations by the outer macro can affect the inner macro, or even eliminate the inner macro from existence.
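For instance, nesting one static_if inside another (a sketch; DEBUG and VERBOSE stand for compile-time boolean constants):
static_if (DEBUG) {
    static_if (VERBOSE) {
        void LogVerbose(string msg) { Console.WriteLine(msg); }
    }
}
The outer static_if is expanded first. If DEBUG is false, its output is empty, so the inner static_if is discarded without ever running; only if DEBUG is true does the inner macro get a chance to decide whether LogVerbose exists.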
A macro cannot be overloaded with ordinary methods, and a warning is issued if the name lookup process finds any ordinary methods that accept the same number of arguments as the macro call. In a sense, macros can be overloaded with each other, but a macro itself, rather than the compiler, decides whether it applies to the arguments supplied. If a macro does not apply, it simply has to return a quoted @#error() node (as the top-level syntax element), or throw an exception. If exactly one macro does not error out, it is called.
However, if this is not enough to disambiguate between two macros with the same name, you can use the resolution mechanism that is already built into C#: the namespace system. As long as different people define their macros in different namespaces, it's easy to disambiguate them, either using a fully-qualified name, or by not importing two namespaces that have conflicting macros in the first place. Even if two macros in two different DLLs are defined in the same namespace and have the same name, you can still disambiguate them with C#'s "extern alias" feature. You can disambiguate a method call from a macro call with the same mechanisms.
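For example (the vendor namespaces and the "unless" macro here are hypothetical):
using MacroVendorA;  // suppose this namespace defines a macro unless()
using MacroVendorB;  // suppose this one defines a conflicting unless()

void Example(bool error)
{
    // unless(error) {...} would be ambiguous here, so qualify it:
    MacroVendorA.unless(error) {
        Console.WriteLine("All is well.");
    }
}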
Macro hygiene
-------------
Macros are hygienic by default. Symbols in quoted code (s_quote(...) or s_quote {...}) are resolved in the context where they appear; for instance, when you use this macro:
[Macro]
public Node positive(params Node[] content)
{
return s_quote(Math.Abs($content));
}
It will always use global::System.Math.Abs, even if "Math" is redefined as something else at the call site:
void Oops()
{
string Math = "No thanks";
int seven = positive(-7); // No problem
}
Similarly, a macro can safely declare variables:
[Macro]
public Node swap(Node a, Node b)
{
return s_quote {{
var temp = $a;
$a = $b;
$b = temp;
}};
}
Here, s_quote {{ ... }} is used to create a new scope. s_quote {{ ... }} is not really a special syntax; it is just a block "{...}" nested inside a quotation "s_quote { }". "s_quote" does not propagate the outer braces to the output, so an extra pair is needed to create a block.
The extra braces are needed because swap() is not isolated from itself. If you call swap() twice, you don't want "var temp" to be declared twice in the same scope, which is illegal according to the plain C# rules inherited by EC#.
However, the braced block {...} doesn't help you in this case:
int temp = 0, zero = 5;
swap(temp, zero);
A naive expansion of this would be silly:
int temp = 0, zero = 5;
{
var temp = temp; // don't worry, EC# doesn't actually do this.
temp = zero;
zero = temp;
}
Luckily, you don't have to worry about that. Even though swap() is not isolated from copies of itself, it is isolated from its caller. Therefore, the plain C# will actually end up looking something like this:
int temp = 0, zero = 5;
{
var temp_1 = temp;
temp = zero;
zero = temp_1;
}
During conversion to plain C#, the name "temp" that came from the macro is renamed to prevent a conflict; however, I don't have the details worked out! I am open to ideas about the exact rules and mechanism that should be used to provide this isolation.
Please note that only "s_quote" quoted code offers isolation (hygiene). quote {...} constructs a syntax tree without any special processing. For example:
[Macro] Node get_x() {
return quote(x);
}
void got_x()
{
int x = 5;
get_x() *= 2;
}
Here, get_x() returns a non-isolated reference to "x". Therefore, it is able to access the variable "x" from the calling scope. If get_x() returned s_quote(x) instead, the compiler would complain that "x" could not be found, because "x" was interpreted in a special context, isolated from the macro call.
'quote' does not support the substitution operator "$". The quote
quote {
try {
$foo;
} catch {
Console.WriteLine("foo flubbed!");
}
}
is parsed successfully, but it does not look up "foo" nor does it substitute its value; it merely captures the syntax of substitution itself. Usually this is not what you want, so be sure to use s_quote {...} when you need substitution to occur.
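To put the two side by side (a sketch, assuming "x" holds a syntax tree for the expression "a + b"):
Node x = quote(a + b);
Node q1 = quote($x * 2);   // the tree is literally "$x * 2"; "$x" is kept as syntax
Node q2 = s_quote($x * 2); // the tree is "(a + b) * 2"; the value of x is inserted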
The difference between s_quote(...) and s_quote {...} is that s_quote(...) accepts a list of expressions, while s_quote {...} accepts a list of statements. Similarly, quote(...) quotes a list of expressions, while quote {...} quotes a list of statements.
The "swap" macro above is more powerful than a standard method such as
void Swap<T>(ref T a, ref T b) { ... }
because you can swap properties, not just fields:
var p = new System.Drawing.Point(3, 4);
swap(p.X, p.Y); // OK
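Applying the expansion of swap shown earlier, that call becomes something like this in plain C#:
var p = new System.Drawing.Point(3, 4);
{
    var temp = p.X;
    p.X = p.Y;
    p.Y = temp;
}
A method taking "ref" parameters could never do this, because C# does not allow a property to be passed by reference.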
On the other hand, this macro is rather dangerous as-written because it evaluates each of its arguments twice, which is almost never a good idea. In many cases this can harm performance; in rare cases, it can cause incorrect behavior, as in
var r = new Random();
swap(array[r.Next(10)], array[r.Next(10)]);
This expands to
var r = new Random();
{
var temp = array[r.Next(10)];
array[r.Next(10)] = array[r.Next(10)];
array[r.Next(10)] = temp;
}
Which, of course, will do something quite weird. Avoiding this problem in swap() would require that the macro create temporary variables to hold intermediate values such as the two r.Next(10)s; but such transformations are too advanced for this introductory article.
Luckily, most macros just read variables and do not change them. Such macros can simply read the value once and store the result in a variable, e.g.
[Macro]
public Node square(Node x)
{
return s_quote {{
var tmp = $x;
tmp * tmp;
}};
}
Here we create a temporary variable "tmp" to avoid evaluating "x" twice; the second statement implicitly returns the value of tmp*tmp to the macro's caller.
Macros versus templates
-----------------------
Of course, "square" is a bad example of a macro because you should really just write an ordinary method:
public T Square<$T>(T x)
{
return x * x;
}
This template method easily supports any numeric type.
There is a relationship between templates and macros. Both templates and macros generate code; the difference is that macros generate code at the location they are called, while templates generate code remotely. For example, Square<int>(10) generates a Square(int) method in the location where Square<$T> was defined, while square(10) generates the code "{ var tmp = 10; tmp * tmp; }" right where it is called. Another difference is that templates generate code once for a given type argument (or set of arguments), while a macro generates new code each time it is called.
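To make the difference concrete (a sketch, using the Square<$T> template and the square macro above):
// Template: one Square(int) method is generated, next to the template's
// definition, and both calls share it.
int a = Square<int>(5);
int b = Square<int>(6);
// Macro: each call site expands to its own freshly-generated block, in place.
int c = square(5);   // { var tmp = 5; tmp * tmp; }
int d = square(6);   // { var tmp = 6; tmp * tmp; }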
For squaring numbers, it is much more appropriate to use a template than a macro, because
- The code of the macro is longer and more complicated.
- The square() macro will generate more code than Square<$T> and the output C# code may look messy.
- Templates are very similar to C# generics, so C# developers can understand them more easily
In fact, I would always recommend using templates or generic methods instead of macros if it is possible to do so. Sadly, template support is not yet implemented! Therefore, you can use a macro as a temporary substitute. Or, you can write it the old-fashioned way, with copy-and-paste programming:
public int Square(int x) { return x * x; }
public long Square(long x) { return x * x; }
public float Square(float x) { return x * x; }
public double Square(double x) { return x * x; }
TODO: write a macro:
[[expand_for(int, long, float, double)]]
public T Square<$T>(T x) { return x * x; }
Statement/expression equivalence
--------------------------------
From the caller's perspective, it doesn't matter which of the four forms (s_quote(...), s_quote {...}, quote(...), or quote {...}) the macro actually uses; in fact, the macro may use none of them: it could construct syntax trees directly using the Loyc and EC# compiler APIs.
And even though square() produces a block containing two statements, you can still use it inside an expression:
var r = new Random();
int teen = 13 + square(r.Next(7));
and this works as expected, because EC# allows statements to be inserted in a location where expressions are expected and vice versa. The macro expands to
int teen = 13 + {
var tmp = r.Next(7);
tmp * tmp
};
which is later converted to plain C# as
int __0;
{
var tmp = r.Next(7);
__0 = tmp * tmp;
}
int teen = 13 + __0;
An important feature of EC# is that the difference between "statements" and "expressions" is only syntactic, not semantic, i.e. the difference is only skin-deep. Any EC# expression can be a statement, and any statement can be written in the form of an expression. The EC# parser needs to know whether to expect expressions or statements in a given location, but once a syntax tree is built, the distinction between statements and expressions disappears.
Of course, plain C# doesn't work that way. Therefore, the conversion to plain C# can sometimes be messy and difficult, as you can imagine from the above code (which is a very simple example).
List nodes and splicing
-----------------------
A macro can return multiple statements at the outer level (it can also return multiple expressions, which is no different.) For example, suppose you define the following useless macro:
public static Random _r = new Random();
[Macro]
public static Node passTwoRandomDigitsTo(Node method)
{
return s_quote {
$method(_r.Next(10));
$method(_r.Next(10));
};
}
And suppose you call it like this:
static void FourRandoms()
{
passTwoRandomDigitsTo(Trace.WriteLine);
passTwoRandomDigitsTo(Trace.WriteLine);
}
Because s_quote {...} has only a single value but contains two statements, it actually creates the two statements in a special kind of node called a "list", which is denoted #(...) or #{...} in code. Thus, the macro expansion is
static void FourRandoms()
{
#{
Trace.WriteLine(_r.Next(10));
Trace.WriteLine(_r.Next(10));
}
#{
Trace.WriteLine(_r.Next(10));
Trace.WriteLine(_r.Next(10));
}
}
Since the concept of a list does not exist in plain C#, the list is eliminated during conversion:
static void FourRandoms()
{
Trace.WriteLine(_r.Next(10));
Trace.WriteLine(_r.Next(10));
Trace.WriteLine(_r.Next(10));
Trace.WriteLine(_r.Next(10));
}
Note that the compiler will ensure that '_r' refers to the Random instance associated with the macro, not to a local variable or something. Also, note that '_r' must be declared public so that it is accessible to anyone that calls the macro.
When a list is used in an expression context, it is evaluated by executing the expressions in the list and then taking the value of the final expression. For example,
int nine = #(int three = 3, three * three);
actually means
int three = 3;
int nine = three * three;
and is a horrendously bad style that should not be used. The point is that if I write
Console.WriteLine("{0}", passTwoRandomDigitsTo(Math.Cos));
it will expand to
Console.WriteLine("{0}", #{ Math.Cos(_r.Next(10)); Math.Cos(_r.Next(10)) });
which is equivalent to
Math.Cos(_r.Next(10));
Console.WriteLine("{0}", Math.Cos(_r.Next(10)));
which, of course, makes this useless macro seem even more stupid (if someone can propose a macro that returns two statements but is actually useful, I might use that instead).
Sometimes you don't want it to work this way; sometimes you want it to "splice" a list in the context where it is used. The "splice" macro uses the "Splice" option to make this happen:
[Macro(Splice = true)]
Node splice(Node n)
{
// This works for lists, but the real splice() also supports tuples
return n;
}
So if you write
double[] digits = new double[] { splice(passTwoRandomDigitsTo(Math.Abs)) };
it expands to
double[] digits = new double[] { Math.Abs(_r.Next(10)), Math.Abs(_r.Next(10)) };
which makes a lot more sense than its meaning without splicing:
Math.Abs(_r.Next(10));
double[] digits = new double[] { Math.Abs(_r.Next(10)) };
You could also add the Splice=true option to passTwoRandomDigitsTo itself, which avoids the need for splice().
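That version would look like this (a sketch):
[Macro(Splice = true)]
public static Node passTwoRandomDigitsTo(Node method)
{
    return s_quote {
        $method(_r.Next(10));
        $method(_r.Next(10));
    };
}
// Now the two statements splice directly into the calling context:
double[] digits = new double[] { passTwoRandomDigitsTo(Math.Abs) };
// expands to:
// double[] digits = new double[] { Math.Abs(_r.Next(10)), Math.Abs(_r.Next(10)) };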
The chicken-and-egg problem
---------------------------
Obviously, in the course of generating the program tree, EC# will often have to run parts of the program, and the results of compile-time code execution (CTCE) may be used to create new parts of the program. This creates a puzzle in the semantics of EC#, because a method running at compile-time could refer to a part of the program that has not yet been created. Even worse, the name could ambiguously refer both to something that has been created already and to something that has not been created yet. In other words, because the metaprogram can use parts of the program being compiled, there can be a circular dependency between the program and the metaprogram.
In my opinion, it is crucial to address this problem. The following example illustrates it:
int CallFunction(int x) { return (int)Overloaded(x); }
static_if (CallFunction(2) != 2)
{
int Overloaded(int j) { return j; }
}
long Overloaded(long i) { return i*i; }
const int C1 = Overloaded(3); // will C1 be 3 or 9?
const int C2 = CallFunction(3); // will C2 be 3 or 9?
First, the question arises: which should be evaluated first, C1 or static_if()? The answer is static_if(); the compiler expands macros as soon as possible, before analyzing fields or other members, so that other members have access to whatever the macro creates. But in order to evaluate the macro, the macro itself and its non-code arguments must be semantically analyzed, and then somehow executed. The "natural" sequence of events is as follows:
1. To understand what "CallFunction(2)" means, the compiler forces symbol tables
to be built for the surrounding code. Then it looks up this name in the
current scope and finds the definition on the first line.
2. The compiler analyzes CallFunction(int). To understand what Overloaded(x)
means, the compiler looks up "Overloaded" in the current scope and finds the
definition on the second line, "long Overloaded(long i)".
3. The compiler analyzes Overloaded(), then executes "CallFunction(2) != 2".
Overloaded(2) returns 4, so CallFunction(2) also returns 4, so the first
argument to static_if() is "true".
4. static_if() is called, which simply returns the method definition that was
passed to it.
5. This new method definition is inserted in place of the original call to
static_if().
6. C1 is evaluated. Since Overloaded(int) exists, it is called, so C1 is 3.
7. C2 is evaluated. It calls CallFunction(), which the compiler has already
analyzed! Without any obvious reason to repeat the analysis, the compiler
keeps the existing interpretation, so CallFunction() calls Overloaded(long)
even though Overloaded(int) is a better match. Therefore, C2 is 9.
However, this behavior is counterintuitive and may depend on the implementation details of the compiler.
In my opinion, it would be best to report an error when code like this is used. But how can the error be detected? I'm planning to use a conservative approach, which remembers the set of all names (with argument counts) that have been looked up so far in a given space. If a method or property by that name is created later (that supports that same number of arguments), the compiler issues an error.
This problem doesn't really exist yet, since you can only call macros in pre-built assemblies. But it will.
The Loyc syntax tree
---------------------
In most compilers, the syntax tree is very strongly typed, with separate classes or data structures for, say, variable declarations, binary operators, method calls, method declarations, unary operators, and so forth. Loyc, however, only has a single data type, Node, for all nodes*. There are several reasons for this:
- Simplicity. Many projects have thousands of lines of code dedicated
to the AST (abstract syntax tree) data structure itself, because each
kind of AST node has its own class. Simplicity means I write less code
and users learn to use it faster.
- Serializability. Loyc nodes can always be serialized to a plain text
"prefix tree" and deserialized back to objects, even by programs that
are not designed to handle the language that the tree represents**. This
makes it easy to visualize syntax trees or exchange them between
programs.
- Extensibility. Loyc nodes can represent any programming language
imaginable, and they are suitable for embedded DSLs (domain-specific
languages). Since nodes do not enforce a particular structure, they can
be used in different ways than originally envisioned. For example, most
languages only have "+" as a binary operator, that is, with two arguments.
If Loyc had a separate class for each AST, there would probably be a
PlusOperator class derived from BinaryOperator, with a LeftChild and a
RightChild. But since there is only one node class, a "+" operator with
three arguments is easy; this is denoted by #+(a, b, c) in EC# source
code. The EC# compiler won't understand it, but it might be meaningful
to another compiler or to a macro.
* In fact, there are a family of node classes, but this is just an
implementation detail.
** Currently, the only supported syntax for plain-text Loyc trees is
EC# syntax, either normal EC# or prefix-tree notation.
EC# syntax trees are stored in a universal format that I call a "Loyc tree". All nodes in a Loyc tree consist of up to four parts:
1. An attribute list (the Attrs property)
2. A Value
3. A Head or a Name (if a node has a Head, Name refers to Head.Name)
4. An argument list (the Args property)
The EC# language does not allow (2) and (3) to appear together (specifically, a Value can only be represented in source code if the Name is "#literal"), so for most purposes you can think of Value, Head and Name as a discriminated union known informally as "the head part of a node". There is no easy and efficient way to represent a discriminated union in .NET, so all five properties (Attrs, Value, Head, Name, Args) are present on all nodes.
Almost any Loyc node can be expressed in EC# using either "prefix notation" or ordinary code. The basic syntax of prefix notation is
[attributes] head(argument_list)
where the [attributes] and (argument_list) are both optional, and the head part could be a simple name. For example, the EC# statement
[Foo] Console.WriteLine("Hello");
is a single Node object with three children: Foo, Console.WriteLine, and "Hello". Foo is an attribute, Console.WriteLine is a Head, and "Hello" is an argument. Each of these children is a Node too, but neither Foo nor "Hello" have children of their own. The Head, Console.WriteLine, is a Node named "#." with two arguments, Console and WriteLine. The above statement could be expressed equivalently as
[Foo] #.(Console, WriteLine)("Hello");
This makes its structure explicit, but the infix dot notation is preferred.
Conceptually, Loyc trees have either a Head node or a Name symbol but not both. Foo, Console, WriteLine, and #. are all node names, while Console.WriteLine is a head node. However, you can always ask a node what its Name is; if the node has a Head rather than a Name, Name returns Head.Name. Thus, #. is the Name of the entire statement.
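In terms of the node properties listed above, that looks like this (a sketch; the exact Node API is hypothetical and still evolving):
Node stmt = quote(Console.WriteLine("Hello"));
var name = stmt.Name;     // "#." -- forwarded from stmt.Head.Name
var head = stmt.Head;     // the node for Console.WriteLine
var arg0 = stmt.Args[0];  // the node for the literal "Hello"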
Attributes can only appear at the beginning of an expression or statement. Use parentheses to clarify your intention if necessary, but please note that parentheses are represented explicitly in the syntax tree, not discarded by the parser. Parentheses cause a node to be inserted into the head of another node, so
(x())
is a node with no arguments, whose Head points to another node that represents x(). Attributes have lower precedence than everything else, and they do not require prefix notation, so
[Attr] x = y;
associates the attribute Attr with the "=" node, not with the "x" node.
Unlike C# attributes, EC# attributes can be any list of expressions, and do not imply any particular semantics. You can attach any expression as an attribute to any other statement or expression, e.g.
[4 * y << z()]
Console.WriteLine("What is this attribute I see before me?");
When the time comes to generate code, the compiler will warn you that it does not understand what the hell "4 * y << z()" is supposed to mean, but otherwise this statement is legal. Attributes serve as an information side-channel, used for instructions to macros or to the compiler. Macros can use attributes to receive information from users, to store information in a syntax tree temporarily, or to communicate with other macros.
Enhanced EC# for PL Nerds, Part 3 of 4: EC# Syntax
==================================================
Although I'd like EC# to be an extensible language, C# is far from an ideal starting point for an extensible syntax; when you try to add stuff to C#, it tends to become ambiguous and you may lose backward compatibility. Therefore, the EC# parser is not extensible; instead, I selected a set of syntax changes that are useful for many purposes, while preserving almost total backward compatibility. For example, the new backquoted strings (`foo`) are considered to be operators, and users can define custom operators through this mechanism as long as they surround them with backquotes.
Grammatically, EC# is a very different language from C# except that it "just so happens" to accept virtually all valid C# code. EC# is an expression-based language with two different syntactic styles, "expression style" and "statement style" (it also allows raw token lists for DSLs, but I'll skip those in this article). For any statement there is an equivalent expression in "expression style" (which may or may not use "prefix notation"). And of course, you can put an expression anywhere that a statement is allowed (just add a semicolon at the end).
As I mentioned before, EC# syntax is "generalized C#"; the parser accepts almost anything that looks vaguely like C# code, as well as some other stuff that doesn't look like C# at all.
Overview: new operators and other junk
--------------------------------------
There are some new operators in EC#; they are listed here and will be discussed in more detail in another article.
- binary "??=": fills a gap in C#. x ??= y is a shortcut for x = x ?? y.
- binary "??.": safe navigation operator, like in the language Groovy.
Its behavior is not built into the compiler, but rather described by a macro.
- binary "**": exponentiation, e.g. 2 ** 8 = 256. Can be overloaded.
- binary "in": e.g. x in (1,2,3) checks whether (x==1 || x==2 || x==3).
Its behavior is described with a macro.
- unary "::": (::x) is a shortcut for global::x.
- unary ".": (.foo) is a valid expression, but it has no predefined meaning
and requires a macro to make sense of it.
- forwarding operator "==>": "==> foo" forwards a method or property to foo;
this arrow can also be used as a clause on a method.
- unary "is legal": "expr is legal" checks whether the expression "expr" compiles
without errors; it evaluates to a boolean constant. Same precedence as "is".
- "using" cast operator: a cast that is legal only if it is guaranteed to succeed.
- suffix and infix `backquoted string`: used to define custom operators.
- .. range operator: no built-in meaning; must be overloaded.
- unary dollar sign: $name and $(expr) perform code substitution
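To make the intended semantics of a few of these operators concrete, here is a hedged Python sketch of the desugarings described above; the helper function names are invented for this illustration and do not reflect how the compiler implements them:

```python
# Invented helper names; each mirrors a desugaring described above.

def null_coalesce_assign(x, y):
    # x ??= y  is a shortcut for  x = x ?? y
    return x if x is not None else y

def safe_navigate(obj, attr):
    # the ??. safe navigation operator yields null instead of failing
    # when the left-hand side is null
    return None if obj is None else getattr(obj, attr)

def is_in(x, candidates):
    # x in (1,2,3)  checks whether (x==1 || x==2 || x==3)
    return any(x == c for c in candidates)

# binary "**" is ordinary exponentiation, e.g. 2 ** 8 == 256,
# which Python happens to share.
```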
In addition there are some new bits of syntax that aren't classified as "operators":
- Variable declarations in subexpressions, e.g. Console.WriteLine(string s = "Hi")
- (a, b, c): creates a tuple (normally of type System.Tuple<...>)
- #X: denotes a special identifier called "#X".
- #{ a; b; c; }: forces statement notation without creating a new scope
- { a; b; c; }: forces statement notation and creates a new scope
- #(a, b, c): specifies a list of expressions. The value of the final expression
(in this case, c) is the value of the whole thing.
- $ keyword: gets the length of an array being indexed.
- $X: creates a Symbol (of type Loyc.Symbol) named "X"
- @(code), @{code} and @[code]: creates a syntax tree from the given code.
- @@(code), @@{code}: creates a syntax tree from the given code, and then
performs isolation and substitution.
- $"Hello $(e)": substitutes the value of e into the string at run-time.
- The switch() statement is now reclassified as an expression
- this(...) constructor syntax
- Generalized expression syntax (see below)
- Generalized statement syntax (see below)
- typeof<e>: a type (not an expression) based on the type of the expression e
- try a(); finally b(); without braces
- forwarding clause: int Foo() ==> Bar;
- "if" clause: int Foo<$T>() if type T is string { ... }
There is no "comma" operator in EC#. Rather, the comma ',' separates expressions, just as ';' separates statements. Its precise meaning depends on where it appears; at statement level, something like "x = 0, y = z;" is interpreted as two separate statements x = 0 and y = z, except that you can put both statements in a location where one statement was expected:
if (x)
x = 0, y = z;
Generalized expression syntax
-----------------------------
Most of the time, EC# expression syntax is the same as C#, with operators like + and += that work just as before. There are some new operators (mentioned above) but the syntax of this new stuff is nothing special, e.g. the new "x in y" operator is no different syntactically than "x + y" or "x && y".
However, at the beginning of every expression there can be a "preamble", so an EC# subexpression is parsed differently at the beginning than "in the middle". The "expression preamble" has the following parts which are all optional and must be provided in order:
1. a single word followed by a colon, e.g. foo:. In statement context, the parser considers it a label (which becomes a separate statement in the Loyc tree); in all other situations it is interpreted as a named argument.
2. a list of attributes, e.g. [x, y] or [x] [y] (the two styles are equivalent). "Macros as attributes", which have the form [[m(args)]], can also appear here; they are parsed like attributes but behave like a method call that takes the remainder of the expression as an argument.
3. either
3a. a list of modifier keywords (such as public or ref)
3b. a list of "attribute words" followed by a variable declaration
"attribute words" refers to identifiers ("foo" but not "foo.bar"), "modifier" keywords, and words preceded by a dollar sign (e.g. $x). The "modifier keywords" are listed in the first column of this table of C# keywords:
+-------------------------------------------------------------+
| (allowed as attrs.) | (not allowed as keyword attributes) |
| | | | |
| Modifier keywords | Statement names | Types** | Other*** |
|:-------------------:|:---------------:|:-------:|:---------:|
| abstract | break | bool | operator |
| const | case | byte | sizeof |
| explicit | checked | char | typeof |
| extern | class | decimal |:---------:|
| implicit | continue | double | else |
| internal | default | float | catch |
| new* | delegate | int | finally |
| override | do | long |:---------:|
| params | enum | object | in |
| private | event | sbyte | as |
| protected | fixed | short | is |
| public | for | string |:---------:|
| readonly | foreach | uint | base |
| ref | goto | ulong | false |
| sealed | if | ushort | null |
| static | interface | void | true |
| unsafe | lock | | this |
| virtual | namespace | |:---------:|
| volatile | return | | stackalloc|
| out | struct | | |
| | switch | | |
| | throw | | |
| | try | | |
| | unchecked | | |
| | using | | |
| | while | | |
+-------------------------------------------------------------+
* Allowed on declarations only ('new' is ambiguous otherwise)
** Type keywords would be unambiguous as attributes on variable
declarations only, but allowing them would be confusing, so you can't
*** Ambiguous examples showing that these should not be allowed as modifiers:
- "A(X) is Y" could mean "A(X, #{is Y})" at statement level
- "if (C) X; else Y;" could mean "{ if (C) X; } { else Y; }"
- "this.X = Y" could mean "this (.X = Y)"
- "typeof(X)" could mean simply (X), with an attribute
For example,
arg: [Attr] public partial wacky int x = y + z
is a valid EC# expression, i.e. the parser accepts it, although the attributes "public", "partial" and "wacky" may not be meaningful to the rest of the compiler. "arg" is a named argument, "Attr" is a normal attribute, and "int x", of course, is a variable declaration. Since this is an expression, not a statement, it can appear within some other expression, e.g.
HappyJoy(arg: [Attr] public partial wacky int x = y + z);
The preamble is only allowed at the beginning of a subexpression, so
error = [Attr] public partial wacky int x = y + z; // ERROR
gives a syntax error at [Attr]; you must instead write
weird = ([Attr] public partial wacky int x = y + z); // OK (but weird)
using '(' to start a new subexpression. ',' also starts a new subexpression, so
HappyJoy(a, b, arg: [Attr] public partial wacky int x = y + z, c = d);
is parseable ("c" must already exist; it is not part of the variable declaration!). Of course, this is useless in "plain" EC#, and the compiler will give you an error if you directly give it this statement. Instead, a macro is required to interpret the "attributes" and transform them into something that EC# or C# can compile. The non-attribute parts, "arg: int x = y + z" do have a built-in meaning: "arg:" specifies a named argument (i.e. HappyJoy must have an argument called "arg"), and "int x = ..." declares a variable and assigns a value to it.
Type arguments are ambiguous: "A < B > C" could be parsed as the variable declaration "A<B> C" or as a pair of comparisons, "(A < B) > C". To solve this problem, EC# requires that any potential variable declaration
1. is followed by '=' or '=>' (e.g. A<B> C = 2), or is at the end of an expression (followed by ';', ',' or ')'), and
2. if its type uses '<', is also either
2a. Followed by '=',
2b. Preceded by non-keyword identifier attributes, e.g. "set A<B> C"
2c. At the beginning of a statement (where '<' is not normally used for comparison),
2d. Inside the argument list of an apparent method declaration
2e. Contained in a pair of parenthesis that is followed by '=' or '=>', e.g.
(x, A<B> C) = (y, D)
(A<B> C) => C + C
Examples that follow these rules include
(List<T> a, int b) = (new List<T>(), 7);
big bad List<T> list;
foo(List<T> list = new List<T>());
foo((List<T> b, a) => b.Add(a));
Examples that do not follow these rules include
int x = int y = 4; // Syntax error; "int x = (int y = 4);" is legal
foo(int b + 1); // Syntax error; "b" should be followed by "="
foo(big bad wolf * 2); // Syntax error; same problem
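Rule 1 above can be sketched as a tiny predicate (purely illustrative; it assumes the type and the name have already been lexed into single tokens):

```python
def may_be_var_decl(tokens, i):
    """Rule 1 sketch: 'Type name' starting at position i only counts as a
    potential variable declaration if the name is followed by '=' or '=>'
    or ends the expression (';', ',' or ')')."""
    follower = tokens[i + 2] if i + 2 < len(tokens) else ";"
    return follower in ("=", "=>", ";", ",", ")")
```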
You might wonder: "hey, why don't you let me put attributes and variable declarations anywhere in the expression?" Well, some locations can't support attributes because certain cases like
(foo) [a] (bar)
are ambiguous. This example could mean either
1. (foo[a])(bar): get element [a] of foo, and then call it as a delegate.
2. (foo)([a] bar): apply attribute [a] to bar, then cast bar to type foo.
The C/C++/C# cast syntax, by the way, is an endless source of problems. Whenever I need to know if a proposed syntax is ambiguous, the type cast is one of the first places I look for trouble.
Similarly, if we allow variable declarations anywhere, there are cases like
f(a + B<C> d = e);
that are just not worth the headache. This example could mean either
1. f((((a + B) < C) > d) = e);
2. f((a + (B<C> d)) = e);
The generalized expression syntax is also used to parse argument lists of method declarations. This means that an argument list such as
A B(C D, out int X, [E] F G) { ... }
is simply parsed as a list of expressions, and indeed you can use syntax that really doesn't make sense in a method declaration, such as
A B(a, b, c) { ... }
A B(int a, abstract Savior = Jesus, getBlubberOf(whale)) { ... }
The parser accepts these but the compiler rejects them, since a macro could give them meaning later (the [[set]] macro does exactly that, allowing arguments such as "set int X").
Note: C# 5's "await" is handled specially because it is not a real keyword and doesn't quite fit into this parsing framework.
Prefix notation
---------------
As you've seen, EC# supports a prefix notation that allows you to represent arbitrary Loyc trees:
[attributes] head(argument_list)
Recall that this represents three of the four parts of a Loyc node (the Value part is missing; literals such as "strings" are the only kind of nodes with values that EC# allows.) In terms of parsing, the (argument_list) has high precedence and binds tightly to the head, while the [attributes] must appear at the far left side of the subexpression and bind to the whole thing.
Prefix notation is simply a kind of expression notation, so you can freely mix prefix notation with ordinary expression notation.
The prefix notation often involves special tokens of the form #X, where X is
1. A C# or EC# identifier
2. A C# keyword
3. A C# or EC# operator
4. A backquoted string
5. One of the following pairs of tokens: {} or []
6. Whitespace, an open parenthesis '(', or a brace '{' not immediately
followed by '}'. In this case, the whitespace, paren or brace is not
included in the token.
As it builds the AST, the parser translates all of these forms into a Symbol that also starts with '#'. The following examples show how source code text is translated into symbol name strings:
#foo ==> "#foo" #>> ==> "#>>"
#? ==> "#?" #{} ==> "#{}"
#while ==> "#while" #`Newline\n` ==> "#Newline\n"
@#while ==> "#while" #(whatever) ==> "#"
#`while` ==> "#while"
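The translation shown above could be sketched as follows (an invented helper for illustration, not the actual parser code):

```python
def hash_symbol(text):
    """Sketch of the token-to-symbol translation table above."""
    if text.startswith("@#"):
        text = text[1:]                 # @#while ==> "#while"
    rest = text[1:]                     # drop the leading '#'
    if len(rest) > 1 and rest.startswith("`") and rest.endswith("`"):
        return "#" + rest[1:-1]         # #`while` ==> "#while"
    if rest[:2] in ("{}", "[]"):
        return "#" + rest[:2]           # #{} ==> "#{}"
    for i, ch in enumerate(rest):
        # whitespace, '(' or '{' (not followed by '}') ends the token
        # and is not included in it
        if ch.isspace() or ch in "({":
            return "#" + rest[:i]
    return "#" + rest
```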
The parser treats all of these forms as a special "hash-keyword" (#keyword) token. A #keyword token is parsed like an identifier but has the semantics of a keyword: the parser treats it like an identifier, while the rest of the compiler treats it like a keyword. For example, "#struct" has the same meaning as "struct" but the syntax is completely different. The following forms are equivalent:
struct X : I { int x; } // standard notation
#struct(X, #(I), #(int x)); // prefix notation
Ordinary method calls like Foo(x, y) also count as prefix notation; it just so happens that plain C# assigns the same meaning to this notation as EC# does. In fact, syntactically, "#return(7);" is simply a method call to a method called "#return". Although the parser treats it like a method call, it produces the same syntax tree as "return 7;" would have.
The main purpose of #keyword-prefix notation is to show the structure of a syntax tree. Normally you should not use #struct(...), but it is occasionally useful when writing a macro, because it allows you to visualize the syntax tree in order to understand it or to debug problems with it. If a macro produces a syntax tree that cannot be represented by normal EC# code, the code printer (Node.Print()) will automatically display it with prefix notation instead.
So #struct is a keyword that is parsed like an identifier. This is different from the notation @struct which already exists in plain C#; this is an ordinary identifier that has a "@" sign in front to ensure that the parser does not mistake it for a keyword. To the parser, @struct and #struct are almost the same except that the parser removes the @ sign but not the # sign. However, later stages of the compiler treat @struct and #struct in completely different ways.
Since the "#" character is already reserved in plain C# for preprocessor directives, any node name such as "#if" and "#else" that could be mistaken for an old-fashioned preprocessor directive must use "@#" instead of "#" if it is at the beginning of a line. For example, the statement "if (failed) return;" can be represented in prefix notation as "@#if(failed, #return)"; the node name of "@#if" is actually "#if". Please note that preprocessor directives themselves are not part of the normal syntax tree, because they can appear midstatement. For example, this is valid C#:
if (condition1
#if DEBUG
&& condition2
#endif
) return;
How to represent these shenanigans is not yet decided. The macro system eliminates the need for ugly tricks like this, but I won't sacrifice backward compatibility.
The special #X tokens don't require an argument list. When a #keyword token lacks an argument list, the parser treats it like a variable name.
Loyc trees that have values cannot be expressed in EC#, except for literal nodes such as "Hello, World!" or 1337 (e.g. the value of the node that represents 1337 is simply (object)1337.) Literal nodes always have the name #literal, and they cannot be expressed in prefix notation (literals can have attributes, though).
Attributes can only appear at the beginning of an expression, so if you want to attach an attribute to "Duke Nukem" in
hero = "Duke Nukem";
you must use parenthesis to start a new subexpression*:
hero = ([Attr] "Duke Nukem");
* this changes the syntax tree slightly by introducing a nested head node, but it's no big deal.
Using '#' unnecessarily, whether it's a preprocessor statement or prefix notation, should be considered poor style because it is an advanced syntax that newbies don't need to know about.
Generalized statement syntax
----------------------------
EC# smooshes "executable statements" and "declarations" into a single grammar. For example, the parser will allow input such as
Console.WriteLine("Look ma, no method!");
public static void Main(string[] args)
{
Console.WriteLine("Hey, boy! What's going on up there?");
}
By itself, this won't work--you have to put your executable statements into a method--but the parser still understands it. This fact allows you to call macros outside methods, and it allows any kind of quoted code, since we might quote a method, or we might quote an executable statement:
Node A = @@{
public static void Main(string[] args) {}
};
Node B = @@{
Console.WriteLine("Look ma, no method!");
};
There are two syntactic sugars for invoking macros. The first looks like an attribute:
[[foo(x)]] statement;
This is intended for use outside methods or outside classes, in locations where "normal" attributes like [Serializable] or [Conditional] would appear. After encountering one of these pseudo-attributes, the parser immediately rewrites it as foo(x, statement). If you mix [[Macro_calls]] with normal attributes, only attributes after the macro call are passed to the macro. For example,
[A] [[B(c)]] [D]
void Explode(string reason) { throw new Exception(reason); }
means
[A] B(c,
[D] void Explode(string reason) { throw new Exception(reason); }
);
but the second form is not actually valid EC# syntax because "Explode" is written in statement form, in a location where a statement is not allowed. Anyway, you get the idea. After the compiler calls the macro B, it adds [A] at the front of the attribute list of the node returned by B.
The second syntactic sugar makes a macro call look similar to a built-in statement, and there are two variations:
- Method call style: macroName(x) { y; } means macroName(x, { y; })
- Property style: macroName { y; } means macroName({ y; })
A simple motivating example is the "unless" macro, which is the inverse of "if":
[Macro]
Node unless(Node cond, Node then)
{
return @@{
if (!$cond)
$then;
};
}
It would be nice if you could use it without braces, like an "if" statement:
unless (x == null)
x.Dispose();
In fact, the parser allows it, but braces are preferred. "macroName" can be a dotted name such as "A.B" and it can have type parameters as in "B<C>", but there are ambiguities to resolve, since "B<C>(X) { y; }" could be interpreted as
(B < C) > ((X) { y; }) ... or as
B<C> (X, { y; })
and similarly "B<C> { y; } -d;" could be parsed as
two statements, "B<C> { y; }" and "-d", or as
one statement, "(B < C) > ({ y; }) - d"
However, the plain C# parser already assumes that B<C>(X) is a method call; based on that assumption, it is reasonable to extend that assumption to macro syntax, using the interpretation B<C> (X, { y; }). The ambiguous input "B<C> { y; }" can be resolved in the same way, by assuming that { y; } is an argument to a macro called B<C>.
Observe that C# does not assign any meaning to a statement of the form
x (expr) y ...
where x and y are identifiers. Therefore, if the parser sees this form, it can assume that the second part "y ..." is intended to be an argument to the first part, and the interpretation is
x (expr, { y ...});
Unfortunately, the substatement is ambiguous in general:
unless(x) -y; could mean unless(x, { -y; }) or (unless(x)) - y
unless(x) [y] .z = 1; could mean unless(x, { [y] .z = 1; }) or (unless(x))[y].z = 1
The latter interpretation is used in such cases.
Also, one of the most common mistakes in C# is to forget a semicolon. If the user simply forgets a semicolon in
foo (expr)
bar = 0;
then the parser mistakenly nests "bar = 0" inside the call to "foo". There are two terrible things about this:
- The error message may talk about failing to find an overload of foo() with 2 arguments!
- There may be no error message--it may compile successfully, with an unintended meaning caused by calling foo with an extra argument!
To mitigate these problems, the parser issues a "probably missing semicolon" warning when there are no braces and either
1. The "macro" name does not start with a lowercase letter, or
2. The "inner" statement is not indented with respect to the outer statement.
Furthermore, since the "name (expr) stmt" syntax is sometimes ambiguous, the parser prints a note, by default, that braces are recommended (one message per source file).
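The warning heuristic above can be sketched like this (a hypothetical signature; the real parser works on tokens, not raw indent counts):

```python
def probably_missing_semicolon(macro_name, outer_indent, inner_indent):
    """Warn when a braceless 'name (expr) stmt' looks like a forgotten ';':
    either the name does not start with a lowercase letter, or the inner
    statement is not indented relative to the outer statement."""
    return (not macro_name[:1].islower()) or inner_indent <= outer_indent
```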
The parser understands the following patterns as macro calls:
- x (y) z... where z is an identifier
- x (y) z... where z is a keyword that cannot be an infix operator (not one of: in, as, is, using)
- x (y) ++z... where z is an identifier
- x (y) --z... where z is an identifier
Braces are mandatory if you want to start the child statement any other way, e.g.
- with an opening parenthesis '('
- with a prefix operator (-, *, dot, etc.)
- with an attribute ([x] or [[x]])
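The four accepted patterns, and the cases that require braces, might be recognized like this (a simplified sketch over pre-lexed tokens; the `keywords` parameter is an assumed set of C# keywords supplied by the caller):

```python
INFIX_KEYWORDS = {"in", "as", "is", "using"}   # keywords that CAN be infix

def starts_macro_child(tokens, keywords):
    """Can 'x (y) <tokens>' be parsed as a macro call with a child statement?"""
    if not tokens:
        return False
    t = tokens[0]
    if t in ("++", "--"):                      # x (y) ++z / x (y) --z
        return len(tokens) > 1 and tokens[1].isidentifier()
    if t in keywords:                          # keyword that cannot be infix
        return t not in INFIX_KEYWORDS
    return t.isidentifier()                    # plain identifier; '(' , '-',
                                               # '[' etc. require braces
```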
The contents of a property definition are parsed the same way as statements in a type definition or a method definition. This allows you to place arbitrary statements inside a property definition:
int X {
int x;
get { return x; }
set { x = value; }
}
The EC# compiler does not understand code like this, but a macro could.
The parser isn't explicitly programmed to understand that "get" and "set" are special inside the property definition; instead it simply parses "get {...}" and "set {...}" the same way that it would parse "hello {...}" or anything else of that form: as a macro call. Similarly, event definitions can have arbitrary statements; "add" and "remove" are not treated specially. "get" and "set" are treated specially only after parsing, while building the program tree.
Ambiguities of EC#
------------------
The biggest ambiguities of generalized expression syntax have been discussed already, but be warned: the syntactic woes are just beginning.
Here's another one: the new quick-binding operator, ::, clashes with C#'s own scope-resolution operator; X::Y could be creating a variable called Y or referring to an existing symbol in namespace X.
This can be resolved with a postprocessing step after (or maybe during) parsing. We
1. scan the syntax tree for applicable "extern alias" and "using" statements, then
2. find all cases of x::y and change them to x:::y if no alias "x" has been defined. x:::y (with three colons) denotes variable creation unambiguously. This can "mistakenly" replace "::" with ":::" in contrived cases, but it will handle all existing C# code correctly.
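The two steps might be sketched like this (illustrative only; a real implementation would walk the syntax tree rather than run a regex over text):

```python
import re

def disambiguate_colons(code, aliases):
    """Rewrite x::y as x:::y unless 'x' is a declared extern/using alias."""
    def fix(m):
        x, y = m.group(1), m.group(2)
        return x + ("::" if x in aliases else ":::") + y
    return re.sub(r"\b(\w+)::(\w+)\b", fix, code)
```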
Nullable types are almost ambiguous with expressions. Unlimited lookahead is required to find out whether a question mark represents the conditional operator or a nullable type:
int x = (A ? x = a+b+c+...); // Declare variable x of nullable type A?
int x = (A ? x = a+b+c+... : y); // Conditional operator
Pointer syntax is ambiguous; it looks as if
T* ptr = stackalloc T[n];
has a multiplication on the left side: (T * ptr) = stackalloc T[n];
That's unlikely (but possible), so the parser just assumes that, at statement level, the patterns X * Y and X * Y = Z declare pointers (when X looks like it might be a type name and Y is a simple identifier). Also, declaring pointers inside expressions is not allowed.
Now what about statements?
EC# smooshes executable statements and declarations together, which tends to create ambiguities, especially after adding the new features of EC#. The statements that use keywords do not cause any important conflicts, since the keywords guarantee a particular interpretation. So all of the following statements are non-issues:
using x;
class x { ... }
struct x { ... }
enum x { ... }
interface x { ... }
namespace x { ... }
event type x;
if (e) ... else ...
for (...) ...
foreach (...) ...
while (...) ...
switch (...) ...
checked { ... }
unchecked { ... }
using (e) ...
fixed (e) ...
lock (e) ...
return ...
goto ...
break; continue;
do ... while(e);
try ... catch ... finally
case ...:
default:
delegate ...;
To avoid any problems, generalized expression syntax does not allow any of these keywords to be used as attributes; so, what we really have to worry about are statements that don't necessarily use any keywords.
The basic types of declarations in C# are
1. Directives (using X; extern alias X;)
2. Events, delegates, and other keyword-based statements
3. Space declarations (namespace X.Y {...}, class F : I where... {...})
4. Method definitions (T F() where... {...}) and declarations (T F();)
5. Field definitions (A<B> C = D)
6. Properties (A B { ... })
None of these are ambiguous with expression-statements, except field definitions, but since fields and variable definitions have basically the same syntax, we can use the same rules as plain C#, e.g. L<T> x; is assumed to be a variable declaration and not a pair of comparisons. But now EC#'s generalized expression syntax adds the following twists:
7. Expressions that contain variable declarations: (x + (int y = 2))
8. Macro calls that look like built-in statements: (x(y) {z;} and x {z;})
9. Generalized expressions (expressions that start with "attributes")
10. Alias statements (alias X = ...)
11. Trait statements (trait T { ... })
12. "if" clauses, "def"
13. Substitution: $(x.ToString())
Due to (7), a statement like
Foo(A a = c, B b = d);
is ambiguous: it could either (A) declare a constructor called Foo that takes two arguments, or (B) create two variables and invoke a method or macro named Foo. A postprocessing step after (or during) parsing can resolve this by checking whether this "Foo" is enclosed in a type by the same name; EC# also provides a new constructor syntax
new(int x = 5, int y = 6) {}
which can be interpreted unambiguously as a constructor.
Method definitions are not usually hard to parse; if you see patterns like
abstract A B(...);
abstract A B(...) {...}
abstract A B(...) ==> ...
abstract A B(...) where ...
(where A and B are simple or dotted identifiers and 'abstract' could be any number of 'attribute keywords', or none of them) there is only one interpretation: they must be method declarations.
It gets slightly more difficult when you consider methods that return pointers or generics, or that are themselves generic:
A B<T>(...) ...
A* B<U>(...) ...
A<U> B<U>(...) ...
A<L<U>> B<U>(...) ...
A<T<V>,U> B<U>(...) ...
The first and last cases are not ambiguous, but the others could possibly be expressions. In these cases EC# just assumes that these are method declarations (as you've seen, looking at the contents of the parentheses does not necessarily help to resolve the ambiguity, so the parser doesn't even try).
It may appear that allowing expressions to start with "keyword attributes" is a problem:
public Foo(int x = 0);
This could be a constructor or an expression that starts with a "public" attribute; however, it is not the keywords themselves that cause the problem; this case is really no different from
Foo(int x = 0);
and we have already discussed why this input is difficult. However, non-keyword attributes are ambiguous:
partial X(int x = 0);
This could be interpreted as a method call with attribute "partial", or as a method that returns a value of type "partial". For this reason, non-keyword attributes are not allowed on most expression-statements, such as method calls; but they are allowed on lots of other things:
- Variable and field declarations
- Statements based on keywords (such as "try", "do" and "class")
- Property definitions
- Method definitions
- trait and alias definitions
(8) and (9) can be ambiguous but we've mostly covered this ground already.
The alias statement (10) is a bit of a troublemaker. An "alias" statement looks like this:
alias Map<K,V> = Dictionary<K,V>
{
new bool TryGetValue(K key, out V value) {
if (key != null)
return base.TryGetValue(key, out value);
value = default(V);
return false;
}
}
Aliases will be interchangeable names for types that can, optionally, modify the set of methods available on a type. In this case the alias Map<K,V> is a synonym for Dictionary<K,V> except that its TryGetValue method never throws exceptions (obviously, this is how it should have worked in the first place). The names are completely interchangeable; you can replace Dictionary<A,B> with Map<A,B> at any random place in any program and it will still compile. The default accessibility in an alias is public, so it is not specified here.
Anyway, since alias is not a real keyword, it can be ambiguous. Statements like
int alias = 0;
alias = 1;
alias(x);
are clearly not alias definitions. However, a type named alias will cause trouble:
struct alias {} // OK, not ambiguous
@alias Y = new alias(); // OK
alias X = Y; // ERROR: no type found with the name 'Y'
Oops. EC# always assumes that code of the form "alias x = ..." is an alias definition; this technically breaks C# compatibility, but since class names are normally capitalized, such an error is highly unlikely.
(11) is similarly ambiguous, since
public trait X;
public trait X { ... }
look like field and property definitions, respectively. EC# assumes instead that these are trait definitions, and this assumption will break compatibility with C# if there is a type defined called "trait". (Note: traits are not yet implemented and are a low priority, since they can be simulated with macros.)
(12) can cause a little trouble when an "if" clause is used with properties and fields:
A P if (C) { ... } // is it a property? or an "if" statement with non-keyword attributes?
Consequently, the "if" statement cannot have non-keyword attributes.
"def" can be used in place of (or in addition to) a method's return value:
def Square(int x) { return x * x; }
This tells the compiler to determine the return type automatically. A warning should be issued if there is a type in scope called "def"; actually, there should really be warnings just for defining types named "alias", "trait", "def", "var" or even "async".
Finally we have (13), dollar-sign-substitution. The rules about substitution are mostly relevant in quoted code (@@{ ... }) but the parser has to be designed to work in all situations. Just as we have to worry about pointer syntax ambiguities even though pointers are rarely used, so too must we be prepared to handle a dollar sign wherever it may appear. It may appear:
1. Where a statement is expected: $a; substitutes the statement stored in 'a'
2. Where an expression is expected: foo($a); substitutes the expression in 'a'
3. Where a type is expected: $a b; creates a variable 'b' whose type depends on 'a'.
4. Where an attribute is expected: [$a] class foo {}
5. Where a macro call is expected: $a (x) { y(z); } or [[$a]] int foo(int x);
6. Where a variable name, method name, property name, or type name is expected:
int $name;
int $name(int $arg);
int $name { get; set; }
class $name { $name() {} }
Substitution cannot occur where attribute words or keywords are expected, and it cannot be used to insert "class", "struct", etc. into a type definition. And obviously, it cannot replace punctuation:
$accessibility void f(); // SYNTAX ERROR
$type X : IEquatable<X> { } // SYNTAX ERROR
int z = x $operator y; // WHAT, ARE YOU NUTS?
It's an advanced topic, but you can still express these three ideas with prefix notation.
The substitution operator is not limited to @@(code) blocks, but the dollar sign must be followed immediately (without whitespace) by an identifier or by an open parenthesis, which denotes a subexpression that computes the value to be substituted.
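The core idea of substitution into a quoted tree can be modeled in a few lines of Python. Node here is a toy (name, children) tree, not the real Node API:

```python
class Node:
    """Toy syntax-tree node: a name plus zero or more child nodes."""
    def __init__(self, name, *children):
        self.name, self.children = name, list(children)
    def __repr__(self):
        if not self.children:
            return self.name
        return f"{self.name}({', '.join(map(repr, self.children))})"

def substitute(node, env):
    """Replace each $x node with the tree stored in env['x'], recursively."""
    if node.name.startswith("$"):
        return env[node.name[1:]]
    return Node(node.name, *(substitute(c, env) for c in node.children))

# foo($a), with 'a' holding the expression bar(x), becomes foo(bar(x)):
quoted = Node("foo", Node("$a"))
result = substitute(quoted, {"a": Node("bar", Node("x"))})
print(result)  # foo(bar(x))
```

Note how the model naturally handles cases 1 through 6 above but not the forbidden ones: a $-node can stand in for any whole subtree (statement, expression, type, name), but there is no tree node corresponding to a bare keyword or a piece of punctuation.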
Returning without return
------------------------
EC# allows you to embed statements inside expressions; as discussed in Part 2, the final statement in the list becomes the value of the block. Methods also have the same convention, so you can write simply
int Square(int x) { x*x }
double PI { get { Math.PI } }
without "return". The compiler will complain if your "expression-statement" is not the final value of the block:
int Abs(int x) {
	x*x;        // ERROR
	if (x < 0)
		-x      // OK
	else
		x       // OK
}
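The rule can be modeled as a checker over a list of statements: a bare expression is legal only in final position, where it becomes the block's value. A toy sketch:

```python
def block_value(statements):
    """Model of the rule: an expression-statement is only legal as the
    final statement of a block, where it becomes the block's value.
    Each statement is ('expr', value) or ('stmt', ...)."""
    for i, (kind, value) in enumerate(statements):
        is_last = i == len(statements) - 1
        if kind == "expr" and not is_last:
            raise SyntaxError("expression-statement must be the final statement")
        if kind == "expr" and is_last:
            return value
    return None  # a block with no final expression has no value

print(block_value([("expr", 25)]))               # 25
print(block_value([("stmt", 0), ("expr", 7)]))   # 7
```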
Enhanced EC# for PL Nerds, Part 4 of 4: Future Features, Help Wanted
====================================================================
Aliases
-------
Aliases are the minor tweak to the type system that I mentioned before.
The future
----------
- Resource imports & diagrammatic programming
- Slices & ranges
- Extensible syntax
Help wanted
-----------
Editor features:
Tooltips and F1 help to explain the meaning of operators and other punctuation syntax such as ??, [[, and @@{...}.
Smart syntax coloring based on the syntax tree. It may be expensive to run the macros of an EC# program, so syntax highlighting should be based on the syntax tree as much as possible. Existing C# syntax coloring is based mostly on token types or very superficial syntax features; I'd like to see highlighting like this:
<demo> we must write a colorizer anyway for these articles
struct Point<$T>
{
	static readonly Point Empty = new Point<T>();
	[[set]] public new(
}
- Type names are italicized or have a different background color
- Different levels of parenthesis should be highlighted differently.
- Definitions are highlighted differently from usages, e.g. the x in "int x"
should be highlighted differently from the x in "x++", and the "Foo" in
"void Foo();" should be highlighted differently from a call to Foo().
- Modifier words (public, partial), type keywords (int) and other keywords
can all use different colors.
- In strings, escape sequences and substitutions can be highlighted
Editor wishlist:
- Different font sizes: smaller font for large method bodies, larger font for class-level declarations
- region("comment") { ... }, details("comment") { ... } macros detected by editor. "(expr_in_parens);" for one-line minutiae
- Lines that contain only a single { or } should be slightly reduced in height
- Context gathered at the top of the screen
- Multiple keyboard interface styles: Visual Studio, vi, emacs
- Tabs for indentation, spaces for alignment by default
- Elastic tab stops
- Rectangular selections exactly as in VS2010
The end
-------
- "protected override void Finalize()" finalizer syntax
I would like to thank John McCarthy, the D folks and bearophile's random links on the D forums for many of the interesting ideas that are planned for EC#.
Overview of the EC# compilation process
---------------------------------------
The phases of compilation are
1. Parsing (source code is lexed and parsed into an abstract syntax tree).
2. Building the program tree.
3. Semantic analysis and building the executable code.
4. Converting EC# code to plain C# or a .NET assembly (the back-end).
Phases 1 to 3 are collectively known as the front-end. The parser is quite separate from the other phases, but phases 2 through 4 may overlap, because in order to build the program tree, phase 2 requires some of the executable code to be built in advance, and it may (theoretically) use the back-end to help run code at compile-time.
Thus, the compiler's most difficult responsibility is to convert the syntax tree to a "program tree", which is a tree of "spaces" and "members".
1. Spaces are addressable code containers, such as namespaces, classes, and aliases, that can contain members and other spaces. "Addressable" means that every space has a unique name in the tree, and any part of a program can refer to a particular space. All data types are spaces, but some spaces are not data types. Aliases are a special kind of space that refers to another space, optionally creating a new perspective on that space.
2. Members are other named code elements defined in spaces, such as methods, fields, and properties. Members cannot contain spaces (although a macro could be used to simulate creating a local data type inside a method). The most important kind of member is the method, which contains executable code.
The program tree is like a file system in which spaces are folders, members are files, and aliases are symbolic links. Some spaces and members can be composed from multiple syntax trees; for example, two "partial class" definitions with the same name are combined into a single space. Thus, the program tree is a separate entity derived from the syntax tree, but it refers back to some of the code that the syntax tree contains.
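The folder/file/symlink analogy can be sketched directly, including the merging of partial definitions. This models the description above, not the actual compiler:

```python
class Space:
    """A 'folder': a named container of members and child spaces.
    Two partial definitions with the same name merge into one space."""
    def __init__(self, name):
        self.name, self.spaces, self.members = name, {}, {}
    def add_space(self, name):
        # Like 'partial class': reuse the existing space if one exists.
        return self.spaces.setdefault(name, Space(name))
    def add_member(self, name, body):
        self.members[name] = body   # a 'file': method, field, property...

root = Space("global")
a = root.add_space("Foo")   # partial class Foo { void F() {} }
a.add_member("F", "...")
b = root.add_space("Foo")   # partial class Foo { void G() {} }
b.add_member("G", "...")
print(a is b)                              # True: merged into one space
print(sorted(root.spaces["Foo"].members))  # ['F', 'G']
```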
So far, this is all very similar to plain C#, but the program tree can also, temporarily, contain executable code outside of any member; in particular it can have macro calls. A macro is executed at compile-time to produce new code that replaces the macro call. The new code, in turn, can also contain macros, which are called at compile-time.
A macro is just a method with the [Macro] attribute. When the compiler determines that a method call refers to a macro, it "quotes" the macro's arguments instead of evaluating them; these arguments, which have a data type of "Node", are passed to the macro, then the macro is executed, and then the compiler replaces the macro call with whatever code the macro returned.
"Node" is a real data type that can be used at runtime as well as compile-time if necessary, and can represent any syntax tree whatsoever, including an infinite variety of syntax trees that would be meaningless to the EC# compiler.
EC# also calls non-macro methods at compile-time inside any "const" context, e.g. when computing the value of a "const" variable.
Once the program tree is complete, EC# builds any executable code that wasn't built in advance (phase 3), and then it converts the program tree to an output language (phase 4).
CTCE will be limited to a subset of EC#, and this subset will slowly expand as EC# is developed. Eventually the goal is to allow you to run any safe code (i.e. code that is not marked unsafe) that does not access global (static) variables, subject to restrictions on the use of external assemblies (by default, only certain whitelisted BCL classes will be accessible, e.g. you will not be able to access the file system at compile-time).
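The whitelist restriction can be modeled as an evaluator that only invokes approved functions; the names below are invented for illustration:

```python
import math

# Hypothetical whitelist: pure, side-effect-free functions allowed at compile time.
WHITELIST = {"sqrt": math.sqrt, "len": len}

def ctce(fn_name, *args):
    """Model of compile-time code execution: only whitelisted functions
    may run; anything else (e.g. file access) is rejected outright."""
    if fn_name not in WHITELIST:
        raise PermissionError(f"{fn_name} is not available at compile time")
    return WHITELIST[fn_name](*args)

print(ctce("sqrt", 16.0))    # 4.0
try:
    ctce("open", "secrets.txt")
except PermissionError as e:
    print(e)                 # open is not available at compile time
```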