From: Paul S. <pm...@gm...> - 2013-05-11 12:29:26
|
Hello,

I'm interested in making SDCC accept iCode (its own intermediate
representation) as an input format. The motivation is to take the
intermediate format from other compilers' frontends, which support more
languages and have better high-level optimizations, convert it to iCode,
and feed it to SDCC for code generation for 8-bit microcontrollers.

The target I specifically have in mind is LLVM IR and C++. I did an
initial comparison of LLVM IR and iCode, and they appear to share the
same level of concepts; the biggest difference I've seen so far is that
LLVM IR is in SSA form, while iCode is not. I'm sure there are lots of
devils in the details, but I was satisfied with a quickly written Python
script that converts a trivial LLVM IR program to iCode, so I proceeded
to look at how to feed iCode into SDCC in the first place.

Well, before talking about taking iCode as an input, the iCode format
should first be defined. So far, iCode is just an internal SDCC data
structure, with some ad hoc and verbose dumps available for that
internal structure. Actually, it's not even possible to produce iCode
for the *input* C program, because even --dumpraw starts dumping only
after some code transformations have already started.

So, my initial steps were to add an --emit-i-code switch to dump iCode
directly after construction from the AST, and then to try to define an
"external representation" of iCode and add dumping support for it, while
leaving the existing "debugging representation" mostly intact.

This work is available in this branch:
https://github.com/pfalcon/sdcc/commits/icode-output

Based on that branch, there's another branch,
https://github.com/pfalcon/sdcc/commits/icode-input, which provides
(initial so far) iCode parsing support. After considering how I could
add support for an iCode parsing grammar, I went with the solution
described here:
http://stackoverflow.com/questions/16452737/is-it-possible-to-call-one-yacc-parser-from-another-to-parse-specific-token-subs

The icode-input branch is largely a work in progress. What I have
achieved so far is that compiling trivial C code to iCode, and then
recompiling that iCode to iCode again, produces matching iCode and asm
output. As an example,

========
int a, b, c;

int foo()
{
    return a & b;
}
========

is compiled into the external iCode representation as:

========
defvar a{int fixed as=data}
defvar b{int fixed as=data}
defvar c{int fixed as=data}

_entry:
proc foo{int ( ) fixed as=code}
iTemp0{int fixed as=data} = a{int fixed as=data} & b{int fixed as=data}
ret iTemp0{int fixed as=data}
_return:
eproc foo{int ( ) fixed as=code}
========

I intend to submit this work upstream, so I would appreciate comments on
the idea, the approach, and the early implementation available so far.
Please note that the branches above will be rebased, as I'm striving to
provide a clean patchset. (Btw, did you guys consider switching to git?)

--
Best regards,
Paul mailto:pm...@gm...
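To make the "external representation" above a bit more concrete, here is a minimal Python sketch that pulls the `name{type as=space}` operands out of one line of the example and prints them back unchanged. The operand shape is inferred only from the single example in this message; it is not the grammar used in the icode-input branch (which is yacc-based), and the names `Operand`, `OPERAND_RE` and `parse_operands` are made up for illustration.

```python
import re
from typing import List, NamedTuple

# Operand shape guessed from the example dump: name{<type words> as=<space>}.
# This is an assumption for illustration, not the real iCode grammar.
OPERAND_RE = re.compile(r"(?P<name>\w+)\{(?P<type>[^{}]*?)\s+as=(?P<space>\w+)\}")


class Operand(NamedTuple):
    name: str   # e.g. "iTemp0" or "a"
    type: str   # e.g. "int fixed"
    space: str  # address space, e.g. "data" or "code"

    def dump(self) -> str:
        """Render the operand back in the same textual form."""
        return f"{self.name}{{{self.type} as={self.space}}}"


def parse_operands(line: str) -> List[Operand]:
    """Return every name{type as=space} operand found on one iCode line."""
    return [Operand(m["name"], m["type"], m["space"])
            for m in OPERAND_RE.finditer(line)]


if __name__ == "__main__":
    line = "iTemp0{int fixed as=data} = a{int fixed as=data} & b{int fixed as=data}"
    for operand in parse_operands(line):
        print(operand.dump())
```

A round trip of this kind (parse, dump, compare) is the same idea as the iCode-to-iCode check described in the message, only at the level of a single operand.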
From: Philipp K. K. <pk...@sp...> - 2013-05-11 15:51:01
|
On 11.05.2013 14:29, Paul Sokolovsky wrote:
> Hello,
>
> I'm interested to make SDCC accept iCode (its own intermediate
> representation) as an input format. The motivation being taking
> intermediate format from other compilers' frontends, which support more
> languages and have better high-level optimizations, convert to iCode
> and feed to SDCC for codegeneration to 8-bit microcontrollers.

Uh, if you are able to generate iCode, wouldn't it be just as easy to
just generate C code? iCode corresponds pretty closely to C, and I can't
think of an iCode that it wouldn't be easy to generate C code for. And
you'd never have to worry about sdcc changing its iCode.

Philipp
From: Paul S. <pm...@gm...> - 2013-05-11 16:44:56
|
Hello,

On Sat, 11 May 2013 17:50:43 +0200
Philipp Klaus Krause <pk...@sp...> wrote:

> On 11.05.2013 14:29, Paul Sokolovsky wrote:
> > Hello,
> >
> > I'm interested to make SDCC accept iCode (its own intermediate
> > representation) as an input format. The motivation being taking
> > intermediate format from other compilers' frontends, which support
> > more languages and have better high-level optimizations, convert to
> > iCode and feed to SDCC for codegeneration to 8-bit microcontrollers.
>
> Uh, if you are able to generate iCode, wouldn't it be just as easy to
> just generate C code? iCode corresponds pretty closely to C, and I
> can't think of an iCode that it wouldn't be easy to generate C code
> for. And you'd never have to worry about sdcc changing its iCode.

In other words, you propose that I write a C code generation backend for
LLVM, after your previous experiences with it
(http://lists.cs.uiuc.edu/pipermail/llvmdev/2006-November/007352.html ,
etc.) and after the project itself found such a thing unmaintainable
and buried it? ;-)

The real problem, though, is concept mismatch. C and RTL are languages
of different levels, and even if it were possible to express RTL in a
(pretty dirty-looking) C subset, there's no guarantee that it will match
the original RTL after passing through the compiler. Dirty hacks would
then need to be added to the *C* syntax to achieve that, with a result
hardly more robust and definitely less pretty than relying on the iCode
format (there are only so many ways to do RTL, after all).

So, based on the above and prior art (LLVM's own backend,
http://www.cminusminus.org), I decided to skip looking at the
to-be-revived LLVM C backend altogether and stay at the level of
intermediate representations.

>
> Philipp

--
Best regards,
Paul mailto:pm...@gm...
From: Philipp K. K. <pk...@sp...> - 2016-10-11 11:13:39
|
On 11.05.2013 18:44, Paul Sokolovsky wrote:
> Hello,
>
> On Sat, 11 May 2013 17:50:43 +0200
> Philipp Klaus Krause <pk...@sp...> wrote:
>
>> On 11.05.2013 14:29, Paul Sokolovsky wrote:
>>> Hello,
>>>
>>> I'm interested to make SDCC accept iCode (its own intermediate
>>> representation) as an input format. The motivation being taking
>>> intermediate format from other compilers' frontends, which support
>>> more languages and have better high-level optimizations, convert to
>>> iCode and feed to SDCC for codegeneration to 8-bit microcontrollers.
>>
>> Uh, if you are able to generate iCode, wouldn't it be just as easy to
>> just generate C code? iCode corresponds pretty closely to C, and I
>> can't think of an iCode that it wouldn't be easy to generate C code
>> for. And you'd never have to worry about sdcc changing its iCode.
>
> In other words, you propose me to write C codegeneration backend for
> LLVM, after your previous experiences with it
> (http://lists.cs.uiuc.edu/pipermail/llvmdev/2006-November/007352.html ,
> etc.) and after the project itself found such a thing non-maintainable
> and buried it? ;-)

The C backend has been resurrected, and some work has been done on it
outside the LLVM project. Using it, I created an experimental LLVM+SDCC
toolchain:

http://www.colecovision.eu/llvm+sdcc/

There are still some issues to be fixed, but it is already somewhat
working (so far I have only tested with C as the input language and the
STM8 as the target; basic Z80 support is there, and I intend to start
looking for issues in that configuration later this year).

Philipp
From: Paul S. <pm...@gm...> - 2013-05-11 19:24:49
|
Hello,

On Sat, 11 May 2013 19:44:45 +0300
Paul Sokolovsky <pm...@gm...> wrote:

[]

> > Uh, if you are able to generate iCode, wouldn't it be just as easy
> > to just generate C code? iCode corresponds pretty closely to C, and
> > I can't think of an iCode that it wouldn't be easy to generate C
> > code for. And you'd never have to worry about sdcc changing its
> > iCode.
>
> In other words, you propose me to write C codegeneration backend for
> LLVM, after your previous experiences with it
> (http://lists.cs.uiuc.edu/pipermail/llvmdev/2006-November/007352.html ,
> etc.) and after the project itself found such a thing non-maintainable
> and buried it? ;-)
>
> The real problem though is concept mismatch. C and RTL are languages
> of different levels, and even if it would be possible to express RTL
> in (pretty dirty-looking) C subset, there's no guarantee that it,
> after passing thru compiler, will match original RTL. Then dirty
> hacks would need to be added to achieve that, to *C* syntax, with
> result hardly more robust and definitely less pretty than relying on
> iCode format (there're only so many ways to do RTL after all).

Actually, indeed, why not? The claim that a C-from-RTL conversion won't
match the original RTL semantics can be happily (re)verified for one
compiler after another, SDCC in this case. The main point is that it is
somehow taken for granted that an LLVM C backend must be hundreds of
kilobytes of messy C++ code, written over years, only to be torn out as
a complete mess eventually. Indeed, it doesn't have to be: it can be a
quick Python script which converts LLVM IR to C.

Anyway, I started with iCode-to-C conversion, because I wonder how soon
*that* will diverge. Well, the original iCode and the iCode obtained
after iCode-to-C conversion are of course rather different, but for a
trivial program SDCC was able to optimize the difference away, so the
asm outputs match modulo label numbers (btw, I found that if you use a
label like _iftrue_0 in C, it will clash with the generated iCode ;-) ).

It all breaks down when trying to call a function. The LLVM syntax for
this is call foo(arg1, arg2, ...), i.e. it fully abstracts the calling
convention, while iCode already has target dependence, with stuff like:

send 0x5 {const-unsigned-char literal}{argreg = 1}
bparam<addr parm>{char fixed as=data} := 0xa {const-unsigned-char literal}
iTemp2{int fixed as=code} = call bar{int ( char fixed, char fixed) fixed as=code}

Well, assuming the parameter setup statements directly precede the call
and come in order, it's still easily convertible to C...

--
Best regards,
Paul mailto:pm...@gm...
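The conversion hinted at in the last sentence can be sketched in a few lines of Python. This is only an illustration under the stated assumption (parameter setup statements directly precede the call and appear in argument order). The three regular expressions are guessed from the three dump lines quoted above and are not a definition of the iCode format; `calls_to_c` is a hypothetical helper name.

```python
import re

# Patterns guessed from the three example lines; purely illustrative.
SEND_RE = re.compile(r"^send\s+(?P<val>\S+)")
PARAM_RE = re.compile(r"^\w+<addr parm>\{[^}]*\}\s*:=\s*(?P<val>\S+)")
CALL_RE = re.compile(r"^(?P<dst>\w+)\{[^}]*\}\s*=\s*call\s+(?P<fn>\w+)\{")


def calls_to_c(icode_lines):
    """Fold send/param setup followed by a call into C call statements.

    Assumes the setup statements directly precede the call and come in
    argument order; anything else in between is reported as an error.
    """
    pending = []  # argument values collected since the last call
    out = []
    for raw in icode_lines:
        line = raw.strip()
        setup = SEND_RE.match(line) or PARAM_RE.match(line)
        if setup:
            pending.append(setup["val"])
            continue
        call = CALL_RE.match(line)
        if call:
            args = ", ".join(pending)
            out.append(f"{call['dst']} = {call['fn']}({args});")
            pending = []
            continue
        if pending:
            raise ValueError(f"statement between parameter setup and call: {line}")
    return out


if __name__ == "__main__":
    example = [
        "send 0x5 {const-unsigned-char literal}{argreg = 1}",
        "bparam<addr parm>{char fixed as=data} := 0xa {const-unsigned-char literal}",
        "iTemp2{int fixed as=code} = call bar{int ( char fixed, char fixed) fixed as=code}",
    ]
    print("\n".join(calls_to_c(example)))
```

The sketch deliberately fails loudly when the ordering assumption does not hold, since silently reordering arguments would be worse than refusing to convert.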
From: Philipp K. K. <pk...@sp...> - 2013-05-11 19:31:36
|
On 11.05.2013 18:44, Paul Sokolovsky wrote:
> Hello,
>
> On Sat, 11 May 2013 17:50:43 +0200
> Philipp Klaus Krause <pk...@sp...> wrote:
>
>> On 11.05.2013 14:29, Paul Sokolovsky wrote:
>>> Hello,
>>>
>>> I'm interested to make SDCC accept iCode (its own intermediate
>>> representation) as an input format. The motivation being taking
>>> intermediate format from other compilers' frontends, which support
>>> more languages and have better high-level optimizations, convert to
>>> iCode and feed to SDCC for codegeneration to 8-bit microcontrollers.
>>
>> Uh, if you are able to generate iCode, wouldn't it be just as easy to
>> just generate C code? iCode corresponds pretty closely to C, and I
>> can't think of an iCode that it wouldn't be easy to generate C code
>> for. And you'd never have to worry about sdcc changing its iCode.
>
> In other words, you propose me to write C codegeneration backend for
> LLVM, after your previous experiences with it
> (http://lists.cs.uiuc.edu/pipermail/llvmdev/2006-November/007352.html ,
> etc.) and after the project itself found such a thing non-maintainable
> and buried it? ;-)

Well, maybe a rewrite of the backend will help.

> The real problem though is concept mismatch. C and RTL are languages of
> different levels, and even if it would be possible to express RTL in
> (pretty dirty-looking) C subset, there's no guarantee that it, after
> passing thru compiler, will match original RTL. Then dirty hacks would
> need to be added to achieve that, to *C* syntax, with result hardly more
> robust and definitely less pretty than relying on iCode format
> (there're only so many ways to do RTL after all).

Well, as long as the goal is to be able to use LLVM frontends with sdcc
backends in some way, I see no need to closely match. Functional
equivalence between the representations is all one needs.

So one could have LLVM transforming C++ (or C) into LLVM IR, LLVM would
do machine-independent optimizations on LLVM IR, then transform LLVM IR
into C (which doesn't have to look like the input), then sdcc parses C,
transforms it into iCode, does machine-dependent optimizations on the
iCode, generates asm code.

The end result would be well-optimized asm code for any target sdcc
supports.

Philipp
From: Paul S. <pm...@gm...> - 2013-05-12 01:38:05
Attachments:
strlen.diff
strlen.c
|
Hello,

On Sat, 11 May 2013 21:31:28 +0200
Philipp Klaus Krause <pk...@sp...> wrote:

[]

> > The real problem though is concept mismatch. C and RTL are
> > languages of different levels, and even if it would be possible to
> > express RTL in (pretty dirty-looking) C subset, there's no
> > guarantee that it, after passing thru compiler, will match original
> > RTL. Then dirty hacks would need to be added to achieve that, to
> > *C* syntax, with result hardly more robust and definitely less
> > pretty than relying on iCode format (there're only so many ways to
> > do RTL after all).
>
> Well, as long as the goal is to be able to use LLVM frontends with
> sdcc backends in some way, I see no need to closely match. Functional
> equivalence between the representations is all one needs.

Yes, but how do we prove it's working correctly? The idea is to
(experimentally) prove that the round trip's result (C->IR->C->ASM)
matches the original result (C->ASM) on a large test corpus. Without
that, it will do exactly that, "work in some way", but hardly correctly.

So, as I wrote in another mail, I went for that, just to see where
exactly it would break. It didn't take long; a mere strlen() did it. It
turns out that SDCC doesn't even optimize "+ 1"s into increments. On top
of that, it faithfully allocates every temporary variable, without
trying to eliminate the unneeded ones in any way. For example, it fails
to see that:

iTemp3 = p + 0x1;
p = iTemp3;

==

p = p + 0x1;

=

++p

>
> So one could have LLVM transforming C++ (or C) into LLVM IR, LLVM
> would do machine-independent optimizations on LLVM IR, then transform
> LLVM IR into C (which doesn't have to look like the input), then sdcc
> parses C, transforms it into iCode, does machine-dependent
> optimizations on the iCode, generates asm code.
>
> The end result would be well-optimized asm code for any target sdcc
> supports.
>
> Philipp

--
Best regards,
Paul mailto:pm...@gm...
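For illustration, the equivalence in the example above (a temporary that is computed and immediately copied) can be expressed as a tiny forwarding pass. The Python sketch below works on made-up `(dst, op, a, b)` tuples rather than on real iCode, and `forward_temps` is a hypothetical name. It is not how SDCC's optimizer is written; it only shows the rewrite that the round-tripped code would need.

```python
def forward_temps(stmts):
    """Fold `t = a op b; x = t` into `x = a op b`.

    stmts is a list of (dst, op, a, b) tuples; a plain copy uses op "="
    with b set to None. The fold is applied only when the temporary t
    (any name starting with "iTemp") is read exactly once in the whole
    list, so removing it cannot change any other statement.
    """
    # Count how many times each name is read as an operand.
    uses = {}
    for _dst, _op, a, b in stmts:
        for operand in (a, b):
            if operand is not None:
                uses[operand] = uses.get(operand, 0) + 1

    out = []
    i = 0
    while i < len(stmts):
        dst, op, a, b = stmts[i]
        nxt = stmts[i + 1] if i + 1 < len(stmts) else None
        is_copy_of_temp = (
            nxt is not None
            and dst.startswith("iTemp")
            and nxt[1] == "="
            and nxt[2] == dst
            and uses.get(dst, 0) == 1
        )
        if is_copy_of_temp:
            out.append((nxt[0], op, a, b))  # write straight into the copy target
            i += 2
        else:
            out.append(stmts[i])
            i += 1
    return out


if __name__ == "__main__":
    before = [("iTemp3", "+", "p", "0x1"), ("p", "=", "iTemp3", None)]
    print(forward_temps(before))
```

Turning the resulting `p = p + 0x1` into an increment would be a separate, target-specific step and is not attempted here.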
From: Philipp K. K. <pk...@sp...> - 2013-05-12 10:01:23
|
On 12.05.2013 03:37, Paul Sokolovsky wrote:
> Hello,
>
> On Sat, 11 May 2013 21:31:28 +0200
> Philipp Klaus Krause <pk...@sp...> wrote:
>
> []
>
>>> The real problem though is concept mismatch. C and RTL are
>>> languages of different levels, and even if it would be possible to
>>> express RTL in (pretty dirty-looking) C subset, there's no
>>> guarantee that it, after passing thru compiler, will match original
>>> RTL. Then dirty hacks would need to be added to achieve that, to
>>> *C* syntax, with result hardly more robust and definitely less
>>> pretty than relying on iCode format (there're only so many ways to
>>> do RTL after all).
>>
>> Well, as long as the goal is to be able to use LLVM frontends with
>> sdcc backends in some way, I see no need to closely match. Functional
>> equivalence between the representations is all one needs.
>
> Yes, but how to prove it's working correctly? The idea is to
> (experimentally) prove that round trip's result (C->IR->C->ASM) matches
> original result (C->ASM) on the large test corpus. Without that, it will
> exactly "work in some way", but hardly correctly.
>

Well, you could e.g. send the regression tests (sdcc's or LLVM's or
gcc's) through it and test if they still pass when the original tests
are replaced by the C code we get from LLVM.

Philipp
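The stricter check discussed earlier in the thread, comparing two asm outputs "modulo label numbers", could be automated roughly as in the Python sketch below. It assumes that the generated local labels look like `00101$`; that pattern is an assumption about the assembler output and may need adjusting per target. `normalize_labels` and `same_modulo_labels` are hypothetical helper names.

```python
import re

# Assumed shape of compiler-generated local labels, e.g. "00101$".
LABEL_RE = re.compile(r"\b\d+\$")


def normalize_labels(asm: str) -> str:
    """Rename numeric local labels in order of first appearance."""
    mapping = {}

    def rename(match):
        # The first distinct label becomes L0$, the second L1$, and so on.
        return mapping.setdefault(match.group(0), f"L{len(mapping)}$")

    return LABEL_RE.sub(rename, asm)


def same_modulo_labels(asm_a: str, asm_b: str) -> bool:
    """True if the two listings differ only in their label numbers."""
    return normalize_labels(asm_a) == normalize_labels(asm_b)


if __name__ == "__main__":
    direct = "00101$:\n\tjr\t00101$\n\tjr\t00102$\n00102$:\n"
    round_trip = "00105$:\n\tjr\t00105$\n\tjr\t00107$\n00107$:\n"
    print(same_modulo_labels(direct, round_trip))
```

Such a textual comparison is stricter than running the regression tests: two listings can be functionally equivalent and still differ here, so a mismatch is a hint to look closer rather than proof of a bug.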
From: Philipp K. K. <pk...@sp...> - 2013-05-12 11:46:53
|
On 11.05.2013 21:24, Paul Sokolovsky wrote:

> Anyway, I start with iCode to C conversion, because I wonder how soon
> *that* will diverge. Well, original iCode and one gotten from
> iCode-to-C conversion are of course rather different, but for trivial
> program, SDCC was able to optimize it out, so asm's match modulo label
> numbers (btw, I found that if you use label like _iftrue_0 in C, it
> will clash with iCode generated ;-) ).

Please file a bug report for that, so it won't be forgotten. _iftrue_0
should be an allowed label. I've done quite some work in the past on
similar issues. I guess sdcc should prepend one additional underscore to
the name of the automatically generated label to be on the safe side.

Philipp
From: Paul S. <pm...@gm...> - 2013-05-13 17:12:32
|
Hello,

On Sun, 12 May 2013 13:46:45 +0200
Philipp Klaus Krause <pk...@sp...> wrote:

[]

> > modulo label numbers (btw, I found that if you use label like
> > _iftrue_0 in C, it will clash with iCode generated ;-) ).
>
> Please file a bug report for that, so it won't be forgotten. _iftrue_0
> should be an allowed label. I've done quite some work in the past on
> similar issues. I guess sdcc should prepend one additional
> underscore to the name of the automatically-generated label to be on
> the safe side.

Done: https://sourceforge.net/p/sdcc/bugs/2163/

--
Best regards,
Paul mailto:pm...@gm...
From: Sandeep D. <san...@ie...> - 2013-05-11 15:51:06
|
Hi Paul,

A very interesting thought indeed. SDCC uses an AST form (after parsing)
before going to iCode form. Do you think you could convert the output of
LLVM into a more C-like syntax after optimization? Then you could create
a simple parser in SDCC to re-create the AST form. That may be a simpler
route; I will let the other (more current) maintainers chime in here.

Sandeep

On May 11, 2013, at 5:29 AM, Paul Sokolovsky <pm...@gm...> wrote:

> Hello,
>
> I'm interested to make SDCC accept iCode (its own intermediate
> representation) as an input format. The motivation being taking
> intermediate format from other compilers' frontends, which support more
> languages and have better high-level optimizations, convert to iCode
> and feed to SDCC for codegeneration to 8-bit microcontrollers.
>
> The target I specifically have in mind is LLVM IR and C++. I did
> initial comparison of LLVM IR and iCode and they appear to share same
> level of concepts, the biggest difference I've seen so far is that LLVM
> IR is SSA form, while iCode is not. I'm sure there're lots of
> devils-in-details, but I was satisfied with a quickly put Python script
> to convert trivial LLVM IR program to iCode, so I proceeded with
> looking how to feed iCode into SDCC in the first place.
>
> Well, before talking about taking iCode format as an input, first iCode
> format should be defined. So far, iCode is just internal SDCC data
> structure, with some adhoc and verbose dumps available for that
> internal structure. Actually, it's not even possible to produce iCode
> for *input* C program, because even --dumpraw starts dumping after some
> code transformations are started.
>
> So, my initial steps were to add --emit-i-code switch to dump iCode
> directly after construction from AST, and then try to define "external
> representation" of iCode and add dumping support for it, while leaving
> existing "debugging representation" mostly intact.
>
> This work is available in this branch:
> https://github.com/pfalcon/sdcc/commits/icode-output
>
> Based on that branch, there's another branch
> https://github.com/pfalcon/sdcc/commits/icode-input which provides
> (initial so far) iCode parsing support. After considering how I can add
> support for iCode parsing grammar, I went with the solution as described
> here:
> http://stackoverflow.com/questions/16452737/is-it-possible-to-call-one-yacc-parser-from-another-to-parse-specific-token-subs
>
> icode-input branch is largely work in progress, what I achieved so far
> is that compiling trivial C code to iCode, and then recompiling to iCode
> again produces matching output iCode's and asm's. As an example,
>
> ========
> int a, b, c;
>
> int foo()
> {
>     return a & b;
> }
> ========
>
> is compiled into external iCode representation as:
>
> ========
> defvar a{int fixed as=data}
> defvar b{int fixed as=data}
> defvar c{int fixed as=data}
>
> _entry:
> proc foo{int ( ) fixed as=code}
> iTemp0{int fixed as=data} = a{int fixed as=data} & b{int fixed as=data}
> ret iTemp0{int fixed as=data}
> _return:
> eproc foo{int ( ) fixed as=code}
> ========
>
>
> I would target this work to be submitted upstream, so would appreciate
> comments on the idea, approach, and early implementation available so
> far. Please note that branches above will be rebased as I'm striving to
> provide clean patchset. (Btw, did you guys consider switching to git?)
>
>
> --
> Best regards,
> Paul mailto:pm...@gm...
>
> _______________________________________________
> sdcc-devel mailing list
> sdc...@li...
> https://lists.sourceforge.net/lists/listinfo/sdcc-devel
From: Philipp K. K. <pk...@sp...> - 2013-05-11 15:54:40
|
On 11.05.2013 17:35, Sandeep Dutta wrote:
> Hi Paul,
>
> A very interesting thought indeed. SDCC uses an AST form (after parsing)
> before going to iCode form. Do you think you could convert the output of
> LLVM to output a more C like syntax after optimization ? Then you could
> create a simple parser in SDCC to re-create the AST form. That may be
> a simpler route, I will let the other (more current maintainers) chime in here.
>
> Sandeep

But then what's the advantage over just using a subset of C as output?

Philipp
From: Paul S. <pm...@gm...> - 2013-05-11 16:25:09
|
Hello,

On Sat, 11 May 2013 08:35:52 -0700
Sandeep Dutta <san...@ie...> wrote:

> Hi Paul,
>
> A very interesting thought indeed. SDCC uses an AST form (after
> parsing) before going to iCode form. Do you think you could convert
> the output of LLVM to output a more C like syntax after
> optimization ? Then you could create a simple parser in SDCC to
> re-create the AST form. That may be a simpler route, I will let the
> other (more current maintainers) chime in here.

Well, a typical LLVM language frontend, like Clang (C & C++), also
parses into an AST and then converts that into LLVM IR, which has a
well-defined external representation and serves as input to
machine-dependent code generation. So trying to convert it back to a
C-like language is like trying to decompile it, without a 100% guarantee
of equivalence.

>
> Sandeep
>

[]

--
Best regards,
Paul mailto:pm...@gm...