#40 Multiple streams input

Milestone: 1.0

Status: open

Owner: nobody

Labels: None

Updated: 2020-09-24

Created: 2020-09-24

Creator: Jordi Camps

Private: No

I'm using reformat.sh for a FastQ to SAM conversion. For proper results, I need to pre-process the FastQ files, and I would like to avoid writing the intermediate results to disk. That means feeding reformat.sh two streams of data (one per read end), so stdin alone is not enough.
The easy way would be to use bash's process substitution:

reformat.sh in1=<(preprocess "$fq1") in2=<(preprocess "$fq2") ...

but this outputs zero data, presumably because reformat.sh reads the first bytes to decide on the format and quality encoding of the input and then resets the stream, expecting to start over again. That works on files, but not on streams.
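
The presumed failure mode can be illustrated outside reformat.sh. The following is a minimal bash sketch, not a confirmed diagnosis of what reformat.sh does internally: the helper peek_then_count is hypothetical and simply imitates a format-detection pass followed by a second open of the same path.

```shell
#!/usr/bin/env bash
# Sketch: "peek at the header, then reopen" works on files but not on pipes.

# Hypothetical helper: read the first line (as a format sniffer might), then
# open the same path again and count all lines (as the real read would).
peek_then_count() {
    head -n 1 "$1" > /dev/null   # pass 1: format detection
    wc -l < "$1" | tr -d ' '     # pass 2: reopen and read everything
}

sample=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n' > "$sample"

# Regular file: the second open starts again at the first byte.
file_lines=$(peek_then_count "$sample")
# Pipe: whatever pass 1 consumed is gone by the time pass 2 opens the path.
pipe_lines=$(peek_then_count <(cat "$sample"))

echo "file: $file_lines lines, pipe: $pipe_lines lines"
rm -f "$sample"
```

With a regular file both passes see the complete record. With the process substitution, the second pass sees less than the full input (typically nothing at all), which would match the zero reads reported above.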

Adding the parameter qin=33 does not help: the format still has to be detected, and the results are the same.

Appending the format extension to the input name (as in stdin.fq) also fails:

reformat.sh in1=<(preprocess "$fq1").fq in2=<(preprocess "$fq2").fq qin=33 ...

In this case everything is known in advance, so there is no real need to read the file before the actual processing. But now the file name passed to the program is /dev/fd/63.fq, and opening it fails because the file that actually exists is /dev/fd/63.
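
The expansion can be checked directly in bash. This is a small sketch in which `true` is only a placeholder producer; the descriptor number, and on systems without /dev/fd the whole path, may differ.

```shell
#!/usr/bin/env bash
# Show what a program receives when a suffix is glued onto a process
# substitution. `true` is a placeholder for the real producer.
arg=$(echo in1=<(true).fq)
echo "$arg"                      # for example: in1=/dev/fd/63.fq

# Strip the key to get the path the program would try to open.
path=${arg#in1=}
if [ -e "$path" ]; then
    echo "exists: $path"
else
    echo "no such file: $path"
fi
```

Bash substitutes the pipe's path first and then appends the literal .fq, so the resulting name no longer refers to the pipe.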

Finally, the workaround I used for all this is named pipes:

tmp1=$(mktemp --dry-run tmp.XXXXXX_1.fastq)
tmp2=$(mktemp --dry-run tmp.XXXXXX_2.fastq)
mkfifo "$tmp1" "$tmp2"
preprocess "$fq1" > "$tmp1" &
preprocess "$fq2" > "$tmp2" &
reformat.sh in1="$tmp1" in2="$tmp2" ...
wait
rm "$tmp1" "$tmp2"

but this is cumbersome. It requires creating temporary file names, creating the pipes, removing them afterwards (better done with a trap on EXIT), sending jobs to the background, and waiting for them. That is a lot of structural code, and it is more error-prone than the simple process-substitution approach.
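
Until something simpler works, the structural code can at least be kept in one place. The sketch below wraps the named-pipe workaround in a bash function with cleanup in an EXIT trap. The function name with_paired_fifos, the FIFO file names, and the preprocess stand-in are all hypothetical. It also assumes the wrapped command accepts in1= and in2= as trailing arguments.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the ticket's preprocess command; replace as needed.
preprocess() { cat "$1"; }

# with_paired_fifos FQ1 FQ2 COMMAND [ARGS...]
# Runs COMMAND with in1=<fifo1> in2=<fifo2> appended to its arguments.
# The body is a subshell, so the EXIT trap fires when the function returns.
with_paired_fifos() (
    fq1=$1 fq2=$2
    shift 2
    # A private directory avoids the race of picking an unused name first.
    dir=$(mktemp -d) || exit 1
    pids=
    # Stop any writer still blocked on its FIFO, then remove the directory.
    trap 'kill $pids 2>/dev/null; rm -rf "$dir"' EXIT
    mkfifo "$dir/reads_1.fastq" "$dir/reads_2.fastq" || exit 1
    preprocess "$fq1" > "$dir/reads_1.fastq" &
    pids=$!
    preprocess "$fq2" > "$dir/reads_2.fastq" &
    pids="$pids $!"
    # The subshell's exit status is the status of the wrapped command.
    "$@" in1="$dir/reads_1.fastq" in2="$dir/reads_2.fastq"
)
```

A call would then look like with_paired_fifos "$fq1" "$fq2" reformat.sh out=test.sam, assuming reformat.sh does not care about the order of its key=value arguments. The .fastq suffix on the FIFO names keeps extension-based format detection working, as in the workaround above. Note that this sketch does not check the exit status of the preprocess jobs.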

It would be good to have a way to specify all the unknowns of the input and output files through parameters instead of relying on file naming or auto-detection. That would let the simpler approaches work.

Discussion

I just discovered the extin=fq parameter. But having

reformat.sh in1=<(preprocess "$fq1") in2=<(preprocess "$fq2") extin=fq qin=33 ...

does not solve the issue either:

$ reformat.sh qin=33 extin=fq in1=<(<test_1.fastq) in2=<(<test_2.fastq) out=test.sam
java -ea -Xms300m -cp /apps/BBMAP/38.57/bbmap/current/ jgi.ReformatReads qin=33 extin=fq in1=/dev/fd/63 in2=/dev/fd/62 out=test.sam
Executing jgi.ReformatReads [qin=33, extin=fq, in1=/dev/fd/63, in2=/dev/fd/62, out=test.sam]

Set INTERLEAVED to false
Input is being processed as paired
Input:                          0 reads                 0 bases
Output:                         0 reads (NaN%)  0 bases (NaN%)

Time:                           0.127 seconds.
Reads Processed:           0    0.00k reads/sec
Bases Processed:           0    0.00m bases/sec

Last edit: Jordi Camps 2020-09-24
