Menu

sRAP differential expression error

Anonymous
2014-02-03
2018-07-31
  • Anonymous

    Anonymous - 2014-02-03

    I am a researcher trying to run your sRAP package on my RNA-Seq data. I followed your vignette until I tried the RNA.deg function and got the following error:

    stat.table <- RNA.deg(sample.table, expression.mat, project.name, project.folder, box.plot=FALSE, ref.group=TRUE, ref="control",method="aov", color.palette=c("green","orange"))

    Error in apply(ref.mat, 1, mean, na.rm = T) :
    dim(X) must have a positive length

    I am sorry to bother you, but would you know what is causing this error? Thank you very much.

     
  • Charles Warden

    Charles Warden - 2014-02-03

    Thank you for using sRAP!

    I may have to sit down to review the code more carefully, but I'd like to get some more general details from you first.

    1) Does the demo code work?

    2) What is your experimental design? 2 groups with triplicates, large patient cohort, etc?

    3) How are you creating the input file for sRAP?

    4) Perhaps similar to #3, it looks like you have made it to the RNA.deg() step. Am I correct in assuming that RNA.norm() and RNA.qc() work correctly? If you haven't tried testing those functions, would you mind doing so an letting me know if that can identify a problem with an upstream step?

     
  • Anonymous

    Anonymous - 2014-02-03

    Thanks a lot for the timely response.

    1) The demo code does indeed work.

    2) My experiment is two groups. The experimental group has three biological replicates, the other (control/reference) group only has one replicate.

    3&4) The RNA.norm and RNA.qc() functions do work. The original input file was a text file created by saving an excel file as tab-delimited text. The excel file had samples listed in rows and each gene listed in columns.

    Also, I tested the RNA.deg function with the argument ref.group=FALSE and it created a list of differentially expressed genes for me. However, using the function in this manner, I think I discard the information that one of the samples is a "reference" from which the others should be calculating fold-changes.

    Please let me know if my response is unclear and thanks again!

     
  • Charles Warden

    Charles Warden - 2014-02-03

    Ok - in the long term, I will work on troubleshooting this problem (since it would be best if the function didn't straight-up crash in situations like this). I will let you know about my progress on this.

    However, I see some potential higher-level problems:

    1) Without replicates in both groups, sRAP shouldn't be able to calculate a p-value / FDR value. So, RNA.deg() may really only be providing you with fold-change values (in terms of usefulness).

    2) Fold-change values (and p-value/FDR values, when you have replicates) will be affected by missing values. Biologically, this comes from genes that aren't expressed (or expressed at extremely low levels, such that you had no aligned reads). Based upon the error description, it sounds like this problem could be caused by a gene without any expression values. If you follow the pipeline sequentially, this shouldn't happen because a rounding factor (0.1, by default) should be used in the normalization from RNA.norm(). However, I think it would be pretty much guaranteed to happen if you had a table of RPKM values that you directly provided directly to the RNA.deg() function. So, does it sound like this might be a problem in your specific case? If so, I would recommend using the value from RNA.norm() rather than a table directly imported from your Excel file.

    Also, would you mind explaining how you choose to use the sRAP package? For example, were you specifically interested in the RNA.deg() function or the pipeline as a whole (including use of BD-Func for functional enrichment)?

    I ask because of specific issue number #1: algorithms to analyze samples with replicates will almost always be different than those that analyze samples without replicates (even if they are different implementations of the same package). Based upon my experience, I would say that that I don't think the p-values provided in samples without replicates aren't really better at producing gene lists than using a filter based upon fold-change value alone (for genes with sufficiently high coverage).

    So, if a gene list was your end point and you wanted to have p-values for your differentially expressed genes, then sRAP might not be the best option. Among the popular RNA-Seq tools, I think DESeq is probably the best. Last time I checked, DESeq was good for sample with replicates but the p-value / FDR values tended to be high for groups without replicates (although p-values were provided). Philosophically, I actually think this support my interpretation of such experimental designs, but it is possible that it will work better for you (and I have only tested DESeq - maybe DESeq2 works better when dealing with a lack of replicates).

    If you wanted to use BD-Func, please let me know. Regardless of what solution you use for differential expression, I want to make sure you can use this function. Actually, BD-Func doesn't care about p-values for differential expression: it only uses the normalized expression values or fold-change values (depending upon which implementation that you use). In that respect, it may actually be a quite good fit for you (in comparison to things like GO enrichment tools, which may not be very meaningful for extremely large gene lists - such as lists of 3000+ differential expressed genes).

    If you still want to use the sRAP fold-change values (and don't care about BD-Func), can you please confirm that your table had normalized expression values (such as RPKM values) and not raw read counts? As long as this is the case, RNA.norm() implements the strategy used in my fold-change paper (if that is why you wanted to read my algorithm). I will certainly look into keeping the package from crashing, but I can give you a heads up that this is actually something you can do from the exported Excel file (using the AVERAGE function to get the mean expression values for each group, and you can then calculate a log ratio; the 1.5 fold-change value would correspond to |log2ratio| > 0.58). Even if I fix the problem with my first try, it takes a few days to update the Bioconductor website. So, this might work faster for you in this specific case.

     
    • Anonymous

      Anonymous - 2018-07-31
      Post awaiting moderation.
  • Anonymous

    Anonymous - 2014-02-03

    I initially chose sRAP because it uses RPKM values as an input. Our lab uses the CLC Bio RNA-Seq tool to generate RPKM values, so I already had a list of RPKM values in excel format.

    I looked at the original Excel table and found I did have some genes that had a 0 RPKM value for at least one of the four samples. I had already removed all the genes that had a 0 RPKM value for all 4 samples. I ran the RNA.norm function on the tab-delimited text file generated from this Excel file and included the argument RPKM.cutoff 0.1, but when I used the resulting expression.mat file in the RNA.deg functions, I got the same error as before.

    I noticed in the normalized_expression.xlsx file created by the RNA.norm function, the samples are in the columns and the genes are in the rows, which is not the acceptable input for the RNA.deg function according to the documentation. When I tried to manually transpose the data in Excel and save it to a text file and then ran the RNA.deg function, I received the following error:

    Error in 2:ncol(expression.table) : argument of length 0

    I work on an invertebrate species that we have a set of gene models for, so while I would like to test the BD-Func, I am not sure if I have the correct input files for my species. I use the gene model identifier numbers to trace my differentially expressed genes as I proceed through the sRAP package.

    I don't care too much about the p-values at this point, I am mainly trying to generate a list of differentially expressed genes using different fold-change parameters. So I am mainly interested in the RNA.norm and RNA.deg functions. I will obtain replicates for the control/reference group in future experiments, but that may take some time.

    Right now, I am trying the approach you suggested in your last paragraph using the Excel AVERAGE function. I will let you know how it turns out.

    Please let me know if something is unclear or I left something out that might help you.

     
  • Charles Warden

    Charles Warden - 2014-02-03

    Ok, good - this makes sense.

    If you do use Excel, you should remember that those are log2(RPKM + 0.1) expression values. So a log2Ratio is AVERAGE(Treatment) - AVERAGE (Control). You'll have problems if you try to use log2(AVERAGE(Treatment)/AVERAGE(Control)).

    I'm surprised that CLC Bio won't provide fold-change values, but I do generally tend to prefer CLC Bio for DNA-Seq and Partek for RNA-Seq. At least RNA.norm() will help avoid getting genes with ridiculously high fold-change values due to low-coverage genes.

    Sitting down and carefully reviwing the algorithm and the description of your problems is something I will do, but it probably can't be done today. That said, the transposition was supposed to happen. Excel allows for more rows than columns, so I felt I pretty much needed to have samples defined on columns and genes on rows. That is also how the statistical results should be laid out (so, you can sort by fold-change, p-value in other cases, etc.). If you are familiar with R, you can use dim() to check the dimensions of the data frame and head() to check the first few columns (or by using something like object.name[1:10,]). That way, you can make sure the format for your version of RNA.deg() is the same as should be produced (using the demo dataset as a reference).

    A non-standard reference genome won't matter for anything except BD-Func. If you do happen to know functional categories with up-regulated and down-regulated sets of genes for your species, I can point out how to define a custom list of signatures (I know this works with the standalone version, and I'm pretty sure I left this option available in the Bioconductor version). Otherwise, I unfortunately agree that it probably won't help you very much.

     
  • Anonymous

    Anonymous - 2014-02-03

    I'm glad you mentioned the Excel tip in your last email. Just to clarify (and sorry to bother you with this, I just want to make sure I have it correct), I am taking the average of the 3 "treatment" samples from the normalized_expression.xlsx file (output from the RNA.norm function) and subtracting from this the one control sample value I have (average is itself) to arrive at a log2Ratio. I can then find the fold-change by taking 2 to the power of log2Ratio (if I am correct?) for each gene.

    I am not that familiar with R, but I will try to use the dim() function as you suggested.

     
  • Charles Warden

    Charles Warden - 2014-02-03

    I would probably try to leave the log2Ratio as is. At least the way I define fold-change, you would need to use a conditional statement:

    If log2Ratio > 0, fold-change = 2^log2Ratio

    If log2Ratio < 0, fold-change = -(2^(-log2Ratio))

    Either way, positive values mean the gene has higher expresssion in the treatment, while negative values have higher expression in the control. If you don't have a p-value, I would probably focus on genes with at least a 2-fold difference (which is conviently above 1 or below -1). The standard criteria I would us is |fold-change| > 1.5 (which is |log2Ratio| > 0.58) and FDR < 0.05.

     
  • Anonymous

    Anonymous - 2014-07-15
    Post awaiting moderation.

Anonymous
Anonymous

Add attachments
Cancel