Thread: [Gmod-schema] Loading chromosome sequences greater than 1G bp long.

Brought to you by: daveclements, girlwithglasses, gk_fan, hueyling, and 10 others

gmod-schema

[Gmod-schema] Loading chromosome sequences greater than 1G bp long.

From: Cheng, Chun-H. <chu...@ws...> - 2023-05-09 17:44:42

Hi,

We have a big genome with large chromosomes that each of them is greater than 1G bp. We're having trouble loading them into the 'feature' table. We've tried the Tripal FASTA loader and a custom script, but both failed with some error (in Postgres v12 log: ERROR:  invalid memory alloc request size 1161290884). Is there any way we can load the sequences for the genome into Chado?

Here's some length information about the genome:

chr1     1853204363
chr2     1709916750
chr3     1527935595
chr4     1588398909
chr5     1297479159
chr6     1379031673


Thank you very much for your help,

Chun-Huai Cheng

Re: [Gmod-schema] Loading chromosome sequences greater than 1G bp long.

From: Joe C. <jwc...@lb...> - 2023-05-11 23:14:33

Hello,

I cannot be done. Even though the documentation for a TEXT type in postgresql says it is ‘variable unlimited length', there is an internal limitation of the size of an allocated buffer. It’s in the documentation for the character data type https://www.postgresql.org/docs/current/datatype-character.html:

> In any case, the longest possible character string that can be stored is about 1 GB.

I recently ran into this problem myself and tried various ways around the limit by storing chunks and trying to concatenate with plpgsql functions, but that did not work. (https://www.postgresql.org/message-id/80025ECD-44A6-454F-A4F9-784474B84952%40lbl.gov <https://www.postgresql.org/message-id/800...@lb...>).

You need to store residues as chunked pieces in a separate table and rely on your middleware code to split pieces on injest and concatenate on output.

Joe Carlson

> On May 9, 2023, at 10:22 AM, Cheng, Chun-Huai via Gmod-schema <gmo...@li...> wrote:
> 
> Hi,
> 
> We have a big genome with large chromosomes that each of them is greater than 1G bp. We're having trouble loading them into the 'feature' table. We've tried the Tripal FASTA loader and a custom script, but both failed with some error (in Postgres v12 log: ERROR:  invalid memory alloc request size 1161290884). Is there any way we can load the sequences for the genome into Chado?
> 
> Here's some length information about the genome:
> 
> chr1     1853204363
> chr2     1709916750
> chr3     1527935595
> chr4     1588398909
> chr5     1297479159
> chr6     1379031673
> 
> 
> Thank you very much for your help,
> 
> Chun-Huai Cheng
> _______________________________________________
> Gmod-schema mailing list
> Gmo...@li... <mailto:Gmo...@li...>
> https://lists.sourceforge.net/lists/listinfo/gmod-schema <https://lists.sourceforge.net/lists/listinfo/gmod-schema>

Re: [Gmod-schema] Loading chromosome sequences greater than 1G bp long.

From: Scott C. <sc...@sc...> - 2023-05-12 05:19:49

<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head><body dir="auto"><div dir="ltr"></div><div dir="ltr">I was kind of remembering that you wrote about this. I wonder if we could/should compile our own Postgres server to bump up that limit. At first blush, that sounds like a terrible idea but I could be convinced.&nbsp;</div><div dir="ltr"><br><blockquote type="cite">On May 11, 2023, at 4:14 PM, Joe Carlson &lt;jwc...@lb...&gt; wrote:<br><br></blockquote></div><blockquote type="cite"><div dir="ltr"><meta http-equiv="Content-Type" content="text/html; charset=utf-8">Hello,<div class=""><br class=""></div><div class="">I cannot be done. Even though the documentation for a TEXT type in postgresql says it is ‘variable unlimited length', there is an internal limitation of the size of an allocated buffer. It’s in the documentation for the character data type&nbsp;<a href="https://www.postgresql.org/docs/current/datatype-character.html:" class="">https://www.postgresql.org/docs/current/datatype-character.html:</a></div><div class=""><br class=""></div><div class=""><blockquote type="cite" class=""><span style="font-family: &quot;Open Sans&quot;, sans-serif; font-size: 14.4px; font-variant-ligatures: normal; orphans: 2; widows: 2; text-decoration-thickness: initial;" class="">In any case, the longest possible character string that can be stored is about 1 GB.</span></blockquote></div><div class=""><br class=""></div><div class="">I recently ran into this problem myself and tried various ways around the limit by storing chunks and trying to concatenate with plpgsql functions, but that did not work. (<a href="https://www.postgresql.org/message-id/800...@lb..." class="">https://www.postgresql.org/message-id/80025ECD-44A6-454F-A4F9-784474B84952%40lbl.gov</a>).</div><div class=""><br class=""></div><div class="">You need to store residues as chunked pieces in a separate table and rely on your middleware code to split pieces on injest and concatenate on output.</div><div class=""><br class=""></div><div class="">Joe Carlson<br class=""><div><br class=""><blockquote type="cite" class=""><div class="">On May 9, 2023, at 10:22 AM, Cheng, Chun-Huai via Gmod-schema &lt;<a href="mailto:gmo...@li..." class="">gmo...@li...</a>&gt; wrote:</div><br class="Apple-interchange-newline"><div class=""><div class="elementToProof ContentPasted0" style="font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt;">Hi,<div class=""><br class="ContentPasted0"></div><div class="ContentPasted0">We have a big genome with large chromosomes that each of them is greater than 1G bp. We're having trouble loading them into the 'feature' table. We've tried the Tripal FASTA loader and a custom script, but both failed with some error (in Postgres v12 log: ERROR: &nbsp;invalid memory alloc request size 1161290884). Is there any way we can load the sequences for the genome into Chado?</div><div class=""><br class="ContentPasted0"></div><div class="ContentPasted0">Here's some length information about the genome:</div><div class="ContentPasted0"><br class=""></div><div class="ContentPasted0">chr1&nbsp; &nbsp;&nbsp; 1853204363</div><div class="ContentPasted0">chr2&nbsp;&nbsp;&nbsp;&nbsp; 1709916750</div><div class="ContentPasted0">chr3&nbsp;&nbsp;&nbsp;&nbsp; 1527935595</div><div class="ContentPasted0">chr4&nbsp;&nbsp;&nbsp;&nbsp; 1588398909</div><div class="ContentPasted0">chr5&nbsp;&nbsp;&nbsp;&nbsp; 1297479159</div><div class="ContentPasted0">chr6&nbsp;&nbsp;&nbsp;&nbsp; 1379031673</div><div class=""><br class="ContentPasted0"></div><div class=""><br class="ContentPasted0"></div><div class="ContentPasted0">Thank you very much for your help,</div><br class=""></div><div class="elementToProof" style="caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none;"><div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt;" class=""><span style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt;" class="">Chun-Huai Cheng</span><br class=""></div></div><span style="caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; float: none; display: inline !important;" class="">_______________________________________________</span><br style="caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none;" class=""><span style="caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; float: none; display: inline !important;" class="">Gmod-schema mailing list</span><br style="caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none;" class=""><a href="mailto:Gmo...@li..." style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px;" class="">Gmo...@li...</a><br style="caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none;" class=""><a href="https://lists.sourceforge.net/lists/listinfo/gmod-schema" style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px;" class="">https://lists.sourceforge.net/lists/listinfo/gmod-schema</a></div></blockquote></div><br class=""></div><span>_______________________________________________</span><br><span>Gmod-schema mailing list</span><br><span>Gmo...@li...</span><br><span>https://lists.sourceforge.net/lists/listinfo/gmod-schema</span><br></div></blockquote></body></html>

Re: [Gmod-schema] Loading chromosome sequences greater than 1G bp long.

From: Joe C. <jwc...@lb...> - 2023-05-12 16:08:49

Hi Scott,

For those of us in institutionalized settings, running a custom compilation of postgres might be a hurdle. They tend not to like custom builds of privileged code.

I think we might want to talk about other issues where chado is showing its age. Chromosomes are getting bigger and it’s only a matter of time before fmin and fmax need to be larger than 2G. (I’m looking at a pine tree outside my window now.) We’re running a big db - the current size is 11T - and I’ve needed to promote some counters to bigints. We can call it okiichado.

I’ve wondered whether it makes sense to store residues in chunks in a dimension table rather than as a column in the feature table. I realize that internally postgres breaks things up and stores it differently than the main records. But what are the access patterns that we really use? For most features (mRNAs, peptides,…) we will want to grab the whole thing. But for chromosomes don’t we normally just want a substring? Would storing things in chunks make that more efficient? We’d need a function call to be able to access a substring so that makes it more complicated.

(Right now we’re a hybrid of storing residues for some things accessed often in the feature table and chunked for the bigger things in a dimension table.)

> On May 11, 2023, at 9:54 PM, Scott Cain <sc...@sc...> wrote:
> 
> I was kind of remembering that you wrote about this. I wonder if we could/should compile our own Postgres server to bump up that limit. At first blush, that sounds like a terrible idea but I could be convinced. 
> 
>> On May 11, 2023, at 4:14 PM, Joe Carlson <jwc...@lb...> wrote:
>> 
>> Hello,
>> 
>> I cannot be done. Even though the documentation for a TEXT type in postgresql says it is ‘variable unlimited length', there is an internal limitation of the size of an allocated buffer. It’s in the documentation for the character data type https://www.postgresql.org/docs/current/datatype-character.html: <https://www.postgresql.org/docs/current/datatype-character.html:>
I really _should_ learn to proofread. I meant to say ‘It cannot be done…'

Joe
>> 
>>> In any case, the longest possible character string that can be stored is about 1 GB.
>> 
>> 
>> I recently ran into this problem myself and tried various ways around the limit by storing chunks and trying to concatenate with plpgsql functions, but that did not work. (https://www.postgresql.org/message-id/80025ECD-44A6-454F-A4F9-784474B84952%40lbl.gov <https://www.postgresql.org/message-id/800...@lb...>).
>> 
>> You need to store residues as chunked pieces in a separate table and rely on your middleware code to split pieces on injest and concatenate on output.
>> 
>> Joe Carlson
>> 
>>> On May 9, 2023, at 10:22 AM, Cheng, Chun-Huai via Gmod-schema <gmo...@li... <mailto:gmo...@li...>> wrote:
>>> 
>>> Hi,
>>> 
>>> We have a big genome with large chromosomes that each of them is greater than 1G bp. We're having trouble loading them into the 'feature' table. We've tried the Tripal FASTA loader and a custom script, but both failed with some error (in Postgres v12 log: ERROR:  invalid memory alloc request size 1161290884). Is there any way we can load the sequences for the genome into Chado?
>>> 
>>> Here's some length information about the genome:
>>> 
>>> chr1     1853204363
>>> chr2     1709916750
>>> chr3     1527935595
>>> chr4     1588398909
>>> chr5     1297479159
>>> chr6     1379031673
>>> 
>>> 
>>> Thank you very much for your help,
>>> 
>>> Chun-Huai Cheng
>>> _______________________________________________
>>> Gmod-schema mailing list
>>> Gmo...@li... <mailto:Gmo...@li...>
>>> https://lists.sourceforge.net/lists/listinfo/gmod-schema <https://lists.sourceforge.net/lists/listinfo/gmod-schema>
>> _______________________________________________
>> Gmod-schema mailing list
>> Gmo...@li...
>> https://lists.sourceforge.net/lists/listinfo/gmod-schema

Re: [Gmod-schema] Loading chromosome sequences greater than 1G bp long.

From: Cheng, Chun-H. <chu...@ws...> - 2023-05-12 16:04:56

Thank you both for answering my questions.

I don't mind using the GMOD version of Postgres if it's well maintained. Or at least should we urge the Postgres developers to bump that limit in the future release?

Chun-Huai
________________________________
From: Scott Cain <sc...@sc...>
Sent: Thursday, May 11, 2023 9:54 PM
To: Joe Carlson <jwc...@lb...>
Cc: Cheng, Chun-Huai <chu...@ws...>; gmo...@li... <gmo...@li...>
Subject: Re: [Gmod-schema] Loading chromosome sequences greater than 1G bp long.

[EXTERNAL EMAIL]

I was kind of remembering that you wrote about this. I wonder if we could/should compile our own Postgres server to bump up that limit. At first blush, that sounds like a terrible idea but I could be convinced.

On May 11, 2023, at 4:14 PM, Joe Carlson <jwc...@lb...> wrote:

Hello,

I cannot be done. Even though the documentation for a TEXT type in postgresql says it is ‘variable unlimited length', there is an internal limitation of the size of an allocated buffer. It’s in the documentation for the character data type https://www.postgresql.org/docs/current/datatype-character.html:<https://urldefense.com/v3/__https://www.postgresql.org/docs/current/datatype-character.html:__;!!JmPEgBY0HMszNaDT!r6EhVZuJU69Aj_RlYei7ShU_fjjwROeYApefHMo6TG3Hczdk1aP0_uYvPhCLfdMhFAMsNrqhpke6KF8kD5N0eg$>

In any case, the longest possible character string that can be stored is about 1 GB.

I recently ran into this problem myself and tried various ways around the limit by storing chunks and trying to concatenate with plpgsql functions, but that did not work. (https://www.postgresql.org/message-id/80025ECD-44A6-454F-A4F9-784474B84952%40lbl.gov<https://urldefense.com/v3/__https://www.postgresql.org/message-id/800...@lb...__;!!JmPEgBY0HMszNaDT!r6EhVZuJU69Aj_RlYei7ShU_fjjwROeYApefHMo6TG3Hczdk1aP0_uYvPhCLfdMhFAMsNrqhpke6KF9HEvjIzQ$>).

You need to store residues as chunked pieces in a separate table and rely on your middleware code to split pieces on injest and concatenate on output.

Joe Carlson

On May 9, 2023, at 10:22 AM, Cheng, Chun-Huai via Gmod-schema <gmo...@li...<mailto:gmo...@li...>> wrote:

Hi,

We have a big genome with large chromosomes that each of them is greater than 1G bp. We're having trouble loading them into the 'feature' table. We've tried the Tripal FASTA loader and a custom script, but both failed with some error (in Postgres v12 log: ERROR:  invalid memory alloc request size 1161290884). Is there any way we can load the sequences for the genome into Chado?

Here's some length information about the genome:

chr1     1853204363
chr2     1709916750
chr3     1527935595
chr4     1588398909
chr5     1297479159
chr6     1379031673

Thank you very much for your help,

Chun-Huai Cheng
_______________________________________________
Gmod-schema mailing list
Gmo...@li...<mailto:Gmo...@li...>
https://lists.sourceforge.net/lists/listinfo/gmod-schema<https://urldefense.com/v3/__https://lists.sourceforge.net/lists/listinfo/gmod-schema__;!!JmPEgBY0HMszNaDT!r6EhVZuJU69Aj_RlYei7ShU_fjjwROeYApefHMo6TG3Hczdk1aP0_uYvPhCLfdMhFAMsNrqhpke6KF-P204wlg$>

_______________________________________________
Gmod-schema mailing list
Gmo...@li...
https://lists.sourceforge.net/lists/listinfo/gmod-schema<https://urldefense.com/v3/__https://lists.sourceforge.net/lists/listinfo/gmod-schema__;!!JmPEgBY0HMszNaDT!r6EhVZuJU69Aj_RlYei7ShU_fjjwROeYApefHMo6TG3Hczdk1aP0_uYvPhCLfdMhFAMsNrqhpke6KF-P204wlg$>