Re: [Tcl9-cloverfield] Happy new year!
From: Andy G. <unu...@ai...> - 2009-01-20 16:19:37
Dammit, I was just about to go to work, and then your email hit my
inbox! Next time, wait five more minutes before sending! ;^)

"Frédéric Bonnet" <fre...@fr...> wrote:
> Large objects can use contiguous malloc'd blocks wrapped around custom
> objects,

Good idea; I didn't think of that. It's not like your work completely
replaces malloc. They can be used together.

> or they can be split into smaller nodes and assembled using a
> tree (like ropes do with string chunks).

That is what I was thinking of. I was just a little worried that this
would become cumbersome.

> the latter is preferable and would give better performances especially
> because of data sharing and limited memory allocation overhead.

I doubt it'll be measurable, if my assumption holds true. My assumption
is that large blocks will be rare and infrequently reallocated. I'm
thinking a couple might be created when the extension library is loaded
or initialized. Possibly they won't even be reachable from the
Cloverfield memory manager; they can just be referenced by pointers
kept in the library's data segment. Internal stuff, you understand.
Note that these blocks might contain pointers to stuff that is
allocated using your manager, which is just fine.

> Regarding mutability, I've dropped refcounted objects in favor of a
> mark-and-sweep GC

Let me get this straight. In Tcl, the structure only has to be copied
on write when it is shared. The copy will not be shared, so it can be
modified in-place. After doing the copy, it's possible that the
original might cease to be shared.

In Cloverfield, nothing can be modified in-place, since everything
could potentially be shared. There's no way to prove otherwise. So
modifications are always done by creating new objects that are altered
copies of the original. To cut down on the overhead of copying, you
want to keep the object size as small as possible.

Sound right?
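To make sure we're talking about the same thing, here's roughly what I
mean by the Tcl rule, sketched in C. The names (Obj, obj_share,
obj_writable) are mine, not Tcl's; the real interpreter does it through
Tcl_Obj refcounts and Tcl_IsShared, but the shape is the same: copy
only when shared, then modify the unshared copy in place.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical refcounted object, illustrating Tcl-style copy-on-write. */
typedef struct {
    int refcount;
    size_t len;
    char *bytes;
} Obj;

Obj *obj_new(const char *s) {
    Obj *o = malloc(sizeof *o);
    o->refcount = 1;
    o->len = strlen(s);
    o->bytes = malloc(o->len + 1);
    memcpy(o->bytes, s, o->len + 1);
    return o;
}

/* Taking another reference makes the object shared. */
Obj *obj_share(Obj *o) {
    ++o->refcount;
    return o;
}

/* Return an object that is safe to modify: the same object if it is
 * unshared, otherwise a fresh copy (dropping one reference to the
 * original, which may thereby cease to be shared). */
Obj *obj_writable(Obj *o) {
    if (o->refcount == 1) {
        return o;          /* unshared: modify in place */
    }
    --o->refcount;         /* shared: copy on write */
    return obj_new(o->bytes);
}
```

In Cloverfield, as I understand it, obj_writable would always take the
copy branch, because with a mark-and-sweep GC there's no refcount to
prove the object is unshared.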
> > > - traversal is near linear in most case because we can keep
> > > contextual info in iterator structures and directly access
> > > the leaves that contain actual data
> >
> > Can you explain this for me?
>
> Have a look at the iterator code.

Sorry, I have forgotten to read your code, and I certainly don't have
time now. I don't have time to write this email either, but I'm doing
it anyway. ;^)

> In short, iterators are
> stack-allocated structures that store the current iteration context,
> that is, the leaf node and the local index where iteration takes place
> plus info about nodes at upper levels.

So it won't be necessary to start over all the time. Got it. I've seen
this at work with linked lists.

node* get_next_node(node* head) {
    return head->next;
}

node* get_node_by_index(node* head, int index) {
    for (; index > 0; --index) {
        head = get_next_node(head);
    }
    return head;
}

The first function is what you describe: it takes constant time, and it
can be used to iterate over a list in linear time. The second function
takes linear time, and it can be used to iterate over a list in
quadratic time. How can it be used to iterate over a list?, I hear you
ask.

void process_list(node* head) {
    int index = 0;
    node* current;
    while (1) {
        current = get_node_by_index(head, index);
        if (current == NULL) {
            break;
        }
        process_node(current);
        ++index;
    }
}

Now I hear you ask WTF?!, but I can't answer that question. I wish I
could... Seriously, I have seen code like this before. It didn't really
matter, since the list was only one or two elements long. But it was
gross nevertheless. I would have replaced it with a preallocated array
with a reasonable maximum size, such as ten.

> - small objects *should* fit within a single cell as much as
> possible.

So do you have code to automatically split the object across cells?

> - large objects *must* be split into smaller ones;

Now you have me confused, unless there's a third class of object that
you consider to be neither small nor large.
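(Back to the linked lists for a second: for contrast with that
quadratic monstrosity, the linear version just follows the next pointer
once per element. The node type with value/next fields and the sum_list
name are stand-ins I made up for illustration.)

```c
#include <stddef.h>

/* Minimal singly linked list node; a stand-in for the list type
 * discussed above. */
typedef struct node {
    int value;
    struct node *next;
} node;

/* Linear-time traversal: follow next once per element, instead of
 * calling an index-based lookup inside the loop, which re-walks the
 * list from the head every time and is therefore quadratic. */
int sum_list(const node *head) {
    int total = 0;
    for (const node *current = head; current != NULL;
         current = current->next) {
        total += current->value;
    }
    return total;
}
```

This is also the point of your iterator structures: keeping the
position (the "current" pointer) across steps is what turns quadratic
into linear.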
So how do you define small and large? Hmm. Later in this email you talk
about pages as well as cells. I guess you're trying to say that the
preferred object size is at most one cell, and all "physical" objects
must be at most one page. The caller can string together "physical"
objects into "logical" objects if they need to be really big.

> alternatively, one memory mapping (a strength rather than a weakness).

What's memory mapping, in this context?

> I believe that only very specialized types would be impacted by this
> limit, and it's also likely that those would already rely on underlying
> libraries that provide their own structures

The underlying libraries are likely to use malloc anyway, or perhaps
they have their own customized allocator schemes. Some require the
caller to manage memory by putting everything on the stack, in the data
or BSS segments, or on the heap. In that last case the programmer will
still need malloc, but that is fine so long as the programmer is given
the opportunity to register a destructor callback to be executed when
the object is reaped by the garbage collector.

> My design is based on the observation that modern architectures are
> neither limited in computation performances nor in memory size, [...]

The traditional malloc approach is optimized for ease of programming in
C-like languages. It makes perfect sense to pick something else that
may be harder to use in C when the programmer will be coding in a
higher-level scripting language anyway. The goal of any programming
language is to make things easy for the programmer, so if the
programmer isn't going to be using malloc directly, it's no longer
necessary to incur malloc's performance cost. Unload the baggage. ;^)

> The current code builds flat strings if the whole data fits into no
> more than 3 cells. This is a totally arbitrary value that I chose as a
> compromise between memory consumption and data sharing/duplication.
Once there is Real Code written in this New Scripting Language, we can
graph this threshold versus execution time and memory utilization, then
pick whatever threshold performs best.

> > set a (0 1 2)
> > set b ({*}$a 3 4 5)
> > set c {{*}$a 3 4 5}
>
> % puts [lindex $b 0]
> 0
> % puts [lindex $c 0]
> $a

Why is {*} interpreted inside a brace-quoted word? Tcl performs no
modifications to the contents of a brace-quoted word; the only
processing it does is finding the closing brace. I would prefer to keep
things this way, even if it's more complicated now to find the closing
brace. Otherwise, won't there be trouble with brace-quoted scripts? It
should be possible to get the string representation of a script without
losing all the {*}'s inside.

> b is expanded at evaluation time and forms a 6-element list.

The parse tree for the last word of the second line would say:

- construct a list as follows:
  - expand to multiple elements: current value of variable "a"
  - delimiter: " "
  - element: "3"
  - delimiter: " "
  - element: "4"
  - delimiter: " "
  - element: "5"

When executed, this parse tree would construct an object with the
following two representations:

1. String representation: rope concatenating the string value of "a"
   with " 3 4 5"
2. List representation: rope concatenating the list value of "a" with
   (3 4 5)
   - Each element is a rope pointing into the same memory as the
     non-space parts of representation (1)

It's necessary to build both representations for a parenthesized word,
due to the way you defined parens quoting. Basically it will index the
same data in two different ways. It's possible to instead make a
unified third representation that looks a lot like the above parse
tree, then later build string or list representations from this third
representation, as needed.

> % set v [string repeat "0123456789" 10000]
> 012345678901234567890123456789<...99940 chars
> skipped>012345678901234567890123456789
>
> The idea could be extended not only to strings but to containers as
> well (lists).

So you could apply this concept to errors caused by stale references.
Reminds me of Python's str()/repr() dichotomy. Not necessarily a bad
thing. Making them one and the same is a nice goal, but it has
limitations. So this [info] subcommand would return a tagged list
describing the structure of the object being queried, without
necessarily returning its actual string representation.

> (Note that Cloverfield avoids this class of problems by design since
> references always have a valid string rep and a known value in case
> they are unresolved, although this may change if we allow the string
> rep generation to fail, as is possible with file backed memory
> mapping.)

The concern I have about references having valid string
representations is that they can be constructed by accident or as a
result of malicious user input. Tcl has duck typing, meaning that an
object's "type" is determined by how the code uses the object. But
references are automatically dereferenced, not explicitly dereferenced,
so it's possible to convince otherwise-correct code to do the wrong
thing just by feeding it a specially crafted string. My references
don't have a string representation, so the "referencehood" is signaled
by an unforgeable out-of-band condition. It's a similar deal to null
being signaled by variable existence.

> > Now for pointer trickery. Thankfully, pointers always have string
> > representations, even when they point to variables that don't exist
> > yet. ;^)
>
> So I get that your @ suffix is syntactic sugar for an explicit pointer
> dereference.

Yeah, but I think it's a bit more than sugar, since it enables the
programmer to communicate his intent directly to the variable lookup
code without detouring through a command substitution. This should
result in much better bytecodes, and it's easier for humans and
debuggers to read.
It's a major distraction to back up to the beginning of the
substitution to insert "[deref", then return to where the pointer
actually gets followed and insert "]". A similar problem happens in C,
with "*" instead of "deref". So I put the pointer-follow notation at
the point where it actually happens! It's like "->" or "[0]" in C. I've
actually had to use the latter in macros; it sometimes comes in handy.

> > puts $&[whatever]{0} -> 5
> > puts $&(5 4 3 2 1 0){0} -> 5
>
> Interesting. This allows the variable indexing syntax without an
> actual variable.

That's the idea.

> Kind of C++ anonymous objects.

Where the function returns by reference? This can be done somewhat in
C, except the caller has to remember to dereference using * or ->.

> This means that you can merge (in the string sense) two
> "objects" without caring about their actual string rep.

What happens when two lists are merged, versus when two strings are
merged? Tcl:

% proc string_concat {a b} {return $a$b}
% concat {0 1 2} {3 4 5}
0 1 2 3 4 5
% string_concat {0 1 2} {3 4 5}
0 1 23 4 5

By the way, I think a [string concat] command would be very useful in
functional contexts.

> in the grand OO tradition, ropes define an interface, not a structure.

Actually I think that's the same as what I suggest, in that the
different representations are unified, not special cases. You're
proposing to also unify strings with the other object types. Sounds
good to me.

-- 
Andy Goth | http://andy.junkdrome.org/
unununium@{aircanopy.net,openverse.com} |