Originally created by: ziflex
This proposal introduces asynchronous, stream-based query execution.
Currently, every time you scrape a website, Ferret does not return any results until query execution completes. It keeps all results in memory, and if an error occurs, all data is lost.
In other words, you either get all of the data or nothing.
This proposal offers an optional alternative: asynchronous execution using streams.
Instead of keeping data in memory and waiting for execution to finish, we could return results as soon as they arrive, using stream-based data structures.
On each iteration, we would push results to a stream, and the caller of the compiled query would receive them.
In situations where having the complete result set is not critical, this will improve stability and efficiency.
Here is a possible syntax for an async execution:
package main

import (
	"context"
	"fmt"
	"io"

	"github.com/MontFerret/ferret/pkg/compiler"
	"github.com/MontFerret/ferret/pkg/drivers"
	"github.com/MontFerret/ferret/pkg/drivers/cdp"
)

func main() {
	query := `
		LET doc = DOCUMENT('https://www.theverge.com/tech', { driver: "cdp" })

		WAIT_ELEMENT(doc, '.c-compact-river__entry', 5000)

		LET articles = ELEMENTS(doc, '.c-entry-box--compact__image-wrapper')
		LET links = (
			FOR article IN articles
				RETURN article.attributes.href
		)

		FOR link IN links
			// The Verge has pretty heavy pages, so let's increase the navigation wait time
			NAVIGATE(doc, link, 20000)

			WAIT_ELEMENT(doc, '.c-entry-content', 5000)

			LET texter = ELEMENT(doc, '.c-entry-content')

			YIELD texter.innerText
	`

	comp := compiler.New()

	program, err := comp.Compile(query)

	if err != nil {
		panic(err)
	}

	ctx := drivers.WithContext(context.Background(), cdp.NewDriver())

	// .Run now returns an io.Reader,
	// even if the query does not use async iteration
	out, err := program.Run(ctx)

	if err != nil {
		panic(err)
	}

	data, err := io.ReadAll(out)

	if err != nil {
		panic(err)
	}

	fmt.Println(string(data))
}
Originally posted by: PierreBrisorgueil
It's a good idea! In my particular case I will generally be fetching small amounts of data, very regularly. On the other hand, for larger one-shot needs it could be important! It may see little use, but it would be a good feature for big scrapes, or maybe for a site that itself broadcasts its data as a stream. O.O
Originally posted by: ziflex
I've been thinking about this issue for some time.
Right now I'm leaning towards adding a new YIELD keyword that would indicate that the returned value must be represented as a stream of data.
The returned data structure would only allow sequential reads (no access by index or key).
It could be treated/implemented more like lazy evaluation than streaming (if we can avoid creating new goroutines and not deal with parallelism, that would be great), since that's how generators work in JavaScript and other languages that support a YIELD keyword. But the Program API would be changed in a stream-like fashion in order to deliver data to the user as soon as possible.