From: Ethan A M. <me...@uw...> - 2020-10-15 20:52:24
On Thursday, 15 October 2020 12:26:22 PDT Allin Cottrell wrote:
> On Thu, 15 Oct 2020, Ethan A Merritt wrote:
>
> > On Thursday, 15 October 2020 08:02:27 PDT Allin Cottrell wrote:
> >> Maybe I'm missing something, but isn't the documentation for
> >> "jitter" backwards with respect to the swarm/square choice?
> >>
> >> The text reads thus: "The default jittering operation displaces
> >> points only along x. This produces a distinctive pattern sometimes
> >> called a "bee swarm plot". The optional keyword square adjusts the y
> >> coordinate of displaced points in addition to their x coordinate so
> >> that the points lie in distinct layers..."
> >
> > The text is correct as written. Perhaps the attached figure,
> > combining two plots from jitter.dem, will clarify.
> >
> > Left panel:
> >    original data, randomly distributed on y, all x values the same
> > Center panel:
> >    "beeswarm" result from displacing points along x
> >    if they would otherwise overlap.
> >    Points are still randomly distributed on y.
> > Right panel:
> >    "square" plot uses the x displacements that would
> >    generate a beeswarm plot, and adds an y displacement
> >    that is effectively a floor(y) operation where the unit of
> >    the floor operation is the "overlap" parameter to jitter.
> >
> >>
> >> In the demo and in my own usage it seems the default is in fact to
> >> displace in both the x and y dimensions while "square" limits the
> >> scatter to the x dimension. (Except that in some cases I'm not
> >> getting any y displacement with either choice, but that's another
> >> issue.)
>
> Thanks, Ethan. I think I get it now. But this is potentially quite
> confusing -- to understand exactly what jitter is doing one really
> has to look closely at the y data in numerical form. So 'swarm'
> preserves the original y values and just shoves points sideways to
> get them off each other, while 'square' will also regularize the y
> values to get the points into straight rows (more or less).

Bee swarm plots were new to me. I came across one in a paper I was
reading and made a note to myself to look into it. The resulting
gnuplot implementation was guided by what R does.

A salient feature of bee swarm plots is that the jitter operation is
reversible. If you consider any single point you can reconstruct its
original [x,y] coordinates by projecting back onto the corresponding
discrete x value.

The "square" option loses this. It is essentially a representation of
binned data with a small number of points in each bin. There is nothing
to distinguish two points in the same bin from each other, as their
original coordinates have been lost.

As the number of points becomes large, both options are inferior to
a violin plot. The violinplot demo compares them and also shows
a Gaussian jitter.

> And then, given a pile-up of data points at some discrete {x,y}
> value, there's no option to nudge them apart in both dimensions to
> form a cloud?

This sounds nice, but if x and y are both continuous I don't think
it is a well-defined operation. For a small-ish number of points
you could define an energy function that has a steep penalty
gradient for overlap and then minimize the total energy by Monte Carlo.
That rapidly becomes compute-intensive as the number of points increases.

For x and y discrete there's a better option, sometimes called a
bubble plot. At each discrete [x,y], draw a circle with size
proportional to the number of points that piled up there.
The "size" is a sore point, however. Radius or area?
I have not given much thought to whether that can be done with
existing options in gnuplot or whether it would require new code
or an external data processing stage.

	cheers,

		Ethan
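
For what it's worth, the "external data processing stage" for a bubble plot could be as small as the Python sketch below. It is only an illustration under stated assumptions, not anything gnuplot does itself: the function name bubble_table and the parameters max_radius and by_area are made up for this example. It counts how many points pile up at each discrete [x,y] and offers both answers to the radius-or-area question.

```python
import math
from collections import Counter


def bubble_table(points, max_radius=0.4, by_area=True):
    """Collapse repeated discrete (x, y) points into (x, y, count, radius) rows.

    max_radius (hypothetical parameter, in axis units) is the radius given
    to the most crowded location. With by_area=True the circle AREA is
    proportional to the count; with by_area=False the RADIUS is.
    """
    counts = Counter(points)          # number of points at each (x, y)
    if not counts:
        return []
    largest = max(counts.values())
    rows = []
    for (x, y), n in sorted(counts.items()):
        frac = n / largest
        # area ~ r^2, so area-proportional sizing needs the square root
        radius = max_radius * (math.sqrt(frac) if by_area else frac)
        rows.append((x, y, n, radius))
    return rows


if __name__ == "__main__":
    data = [(1, 1), (1, 1), (1, 1), (1, 1), (2, 1), (2, 3), (2, 3)]
    # One line per occupied location: x y count radius
    for x, y, n, r in bubble_table(data):
        print(f"{x} {y} {n} {r:.3f}")
```

If that output is saved to a file (say bubbles.dat, a placeholder name), something like "plot 'bubbles.dat' using 1:2:4 with circles" should draw it, since the circles style takes x, y and a radius. Note that the radius is then interpreted in x-axis units, so max_radius would need adjusting to the spacing of the discrete x values.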