Re: [proteowizard-developer] Reader reorg

Hi Darren. When we settle on some of these top-level design issues I'll
feel more comfortable jumping in and contributing code, so please excuse
this long discussion over rather trivial details. It should sync us up
(and perhaps future readers) on the intended purpose and scope of the
library. :)

Responses inline.

Kessner, Darren E. wrote:
> data: data file abstraction (file I/O -> basic data structures)
>
> analysis: anything on top of the basic data structures
>
> I'd still rather have vendor readers outside msdata, since the data
> model is independent of any plug-ins:
>
> - data
>   - msdata
>   - msdata_vendor_readers
>
> Yes, I agree we should use vendor name suffixes to avoid collisions, so
> 'rawfile' -> 'util/vendor_access/Thermo_RAW' or something along those
> lines.
>
> For analysis module naming, we can start with prefixes for reusable libs
> (e.g. peaks -> FT_frequency or something like that) and then deepen the
> hierarchy as needed.  If there are only one or two FT-specific libs, I
> don't think there's a need to create a subdir for them -- on the other
> hand, I'm not absolutely opposed to an FT subdir.
msdata is independent of the readers, but the readers are not
independent of msdata. I think that's a good reason to make
vendor_readers a subdirectory of msdata in the actual file layout. This
is somewhat hypocritical given my previous argument that the file
layout need not reflect the dependency hierarchy if the dependencies
are documented elsewhere, so I'll amend that argument: dependent code
that is on the same root branch (data/analysis/utility/build/etc.) as
its dependency should live in the dependency's directory or in a
subdirectory of it.
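
To make the one-way dependency concrete, here is a rough C++ sketch of
what I have in mind. Every name in it (MSData, Reader, accepts,
Reader_Thermo, the vendor_readers namespace) is made up for
illustration and is not meant to describe the existing pwiz code; the
extension check is also deliberately naive (case-sensitive ".raw"
only). The point is only that the core declares the interface and
knows nothing about vendors, while the vendor reader pulls in the core.

```cpp
// Hypothetical sketch: none of these names are taken from the real code base.
#include <cassert>
#include <string>
#include <vector>

namespace msdata {

// Core data model (would live in msdata/): no vendor knowledge at all.
struct MSData {
    std::string sourceName;
    std::vector<double> mz;
};

// Plug-in interface owned by the core; vendor readers implement it.
class Reader {
public:
    virtual ~Reader() = default;
    virtual bool accepts(const std::string& filename) const = 0;
    virtual void read(const std::string& filename, MSData& result) const = 0;
};

// Would live in msdata/vendor_readers/: depends on msdata, never the reverse.
namespace vendor_readers {

class Reader_Thermo : public Reader {
public:
    // Naive check: accept only names ending in lowercase ".raw".
    bool accepts(const std::string& filename) const override {
        const std::string ext = ".raw";
        return filename.size() >= ext.size() &&
               filename.compare(filename.size() - ext.size(), ext.size(), ext) == 0;
    }

    // Placeholder body: a real reader would call the vendor library here.
    void read(const std::string& filename, MSData& result) const override {
        assert(accepts(filename));
        result.sourceName = filename;
    }
};

} // namespace vendor_readers
} // namespace msdata
```

With this arrangement, deleting msdata/vendor_readers leaves msdata
compiling unchanged, which is the property the directory nesting is
supposed to advertise.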

With regard to prefix designations to indicate technology-specificity 
(FT_blah), I prefer staying away from that when it's feasible. Why do 
you incline toward a flat hierarchy with prefixes instead of a 
conventional hierarchy (/FT/blah)? Does it cause some maintenance issues 
I haven't considered? It would mean a few extra Jamfiles, but it would 
also allow those Jamfiles to wrap up the various technology-specific 
functionality into a single library, although I'm not sure that's 
actually of any consequence.

Also, are you saying that technology-specific code can include both data 
("basic data structures") and analysis (anything else on top of that)? 
Can you give a few concrete examples of that? I'm thinking that having 
tech-specific data structures shared in the core data directory could 
possibly mean easier reusability between the various tech-specific 
analysis libraries, but if we take the conventional hierarchy route we 
can put that shared code in the tech-specific root directory instead of 
the core data directory. For instance:
- data
    - msdata
- analysis
    - FT (shared FT-related data structures and functions go in here)
        - "domain" as combination of freq/transient, or separate
          "frequency" and "transient" directories (analysis code
          dependent on FT data structures and functions)
versus
- data
    - msdata
    - FT (data structures and functions used by /analysis/FT, and
      possibly as independently useful code)
- analysis
    - FT (analysis code dependent on /data/FT)
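
As a rough illustration of the first layout, here is how the shared FT
code and its dependents might look if namespaces mirrored the
directories. All of the names (Transient, durationSeconds, sampleCount,
binSpacingHz) are hypothetical; I picked them only to have something
small that both subdirectories could plausibly share.

```cpp
// Hypothetical names throughout; this only illustrates the first layout.
#include <cstddef>
#include <vector>

namespace analysis {
namespace FT {

// analysis/FT/: shared FT-related data structure.
struct Transient {
    double sampleRateHz;
    std::vector<double> samples;
};

// analysis/FT/: shared helper usable by every subdirectory below.
inline double durationSeconds(const Transient& t) {
    return t.sampleRateHz > 0 ? t.samples.size() / t.sampleRateHz : 0.0;
}

namespace transient {
// analysis/FT/transient/: depends only on the parent directory.
inline std::size_t sampleCount(const FT::Transient& t) {
    return t.samples.size();
}
} // namespace transient

namespace frequency {
// analysis/FT/frequency/: bin spacing of a Fourier transform taken over
// the whole transient is 1 / duration.
inline double binSpacingHz(const FT::Transient& t) {
    const double d = FT::durationSeconds(t);
    return d > 0 ? 1.0 / d : 0.0;
}
} // namespace frequency

} // namespace FT
} // namespace analysis
```

In the second layout, Transient and durationSeconds would move to
/data/FT and the two inner namespaces would stay under /analysis/FT;
the code itself would not change, only where the shared part lives.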

If we (can) eliminate the prospect of alternative core "data" code by 
assuming that such alternatives would be technology-specific code in 
/analysis/SomeTech, we can flatten /data/msdata to just /msdata. Is 
there an example of some generic "data" code that would break that 
assumption and make me look like an ass? ;)

> Just a second response regarding the library name...  I agree that a
> non-proteo name would appeal to more users.
>
> My ideal course of action would be to get Josh on board fully, get
> msconvert to the point where it can effectively replace ReAdW, and then
> reincarnate the project as "the larger collaborative effort" with a new
> name.  Or maybe just a funky symbol and it can be TPFKAPW (the project
> formerly known as...).  How about we continue to overload Microsoft
> names and use MSIL (mass spec interface layer)  ;) 
Great! I'm glad we can come up with a name collaboratively. It's easier 
and more natural to promote the effort that way (just like mzML tries to 
be as inclusive as possible and, of course, went through several name 
changes).

> Actually, your proposal is pretty close to the overall dependency level
> layout (in docs and poster), so let's make it explicit:
>
> pwiz
>  - utility
>    - as you proposed
>  - data
>    - msdata (main data structures, but nothing vendor-specific)
>    - msdata_readers (vendor-specific Reader plugins for msdata)
>    - other data file stuff (e.g. transients, peak data, etc.)
>  - analysis
>    - as you proposed
>  - tools
>   
Do we want:
proteome vs. proteomics vs. proteomic? I can't decide between the first two.

Also:
- analysis
    - proteomics (analysis algorithms specific to proteomics)
    - genomics (mystery category)
- tools
    - proteomics (tools specific to proteomics)
    - genomics (guess what goes here!)
or
- analysis (analysis algorithms that are theoretically generic to all
  mass spec, e.g. peak picking and deisotoping)
- tools (tools that are theoretically generic to all mass spec)
- proteomics
    - analysis (analysis algorithms specific to proteomics)
    - tools (tools specific to proteomics)
- genomics
    - analysis
    - tools

The first approach above would probably keep our project tidier in the
long run (especially for people who don't want the analysis bits, only
the basic I/O).

-Matt