Originally created by: ziflex
This proposal introduces asynchronous, stream-based query execution.
Currently, every time you scrape a website, Ferret does not return any results until query execution completes. It keeps all results in memory, and if an error occurs, all data is lost.
In other words, you either get all of the data or nothing.
This proposal offers an optional alternative: asynchronous execution using streams.
Instead of keeping data in memory and waiting for execution to finish, we could return results as soon as they arrive, using stream-based data structures.
On each iteration, we would push results to a stream, and the caller of the compiled query would receive them.
In situations where having the complete result set is not critical, this will improve stability and efficiency.
Here is a possible syntax for an async execution:
package main

import (
	"context"
	"fmt"
	"io"

	"github.com/MontFerret/ferret/pkg/compiler"
	"github.com/MontFerret/ferret/pkg/drivers"
	"github.com/MontFerret/ferret/pkg/drivers/cdp"
)

func main() {
	query := `
		LET doc = DOCUMENT('https://www.theverge.com/tech', { driver: "cdp" })

		WAIT_ELEMENT(doc, '.c-compact-river__entry', 5000)

		LET articles = ELEMENTS(doc, '.c-entry-box--compact__image-wrapper')
		LET links = (
			FOR article IN articles
				RETURN article.attributes.href
		)

		FOR link IN links
			// The Verge has pretty heavy pages, so let's increase the navigation wait time
			NAVIGATE(doc, link, 20000)

			WAIT_ELEMENT(doc, '.c-entry-content', 5000)

			LET texter = ELEMENT(doc, '.c-entry-content')

			YIELD texter.innerText
	`

	comp := compiler.New()

	program, err := comp.Compile(query)

	if err != nil {
		panic(err)
	}

	ctx := drivers.WithContext(context.Background(), cdp.NewDriver())

	// .Run now returns an io.Reader,
	// even if the query does not use async iteration
	out, err := program.Run(ctx)

	if err != nil {
		panic(err)
	}

	data, err := io.ReadAll(out)

	if err != nil {
		panic(err)
	}

	fmt.Println(string(data))
}
Originally posted by: PierreBrisorgueil
It's a good idea! In my particular case I will generally be fetching small amounts of data, very regularly. On the other hand, for larger one-shot needs it could be important! It may see little use, but it would be a good feature for big scrapes, or maybe for a site that itself broadcasts its data as a stream. O.O
Originally posted by: ziflex
I've been thinking about this issue for some time.
Right now I'm leaning towards adding a new YIELD keyword that would indicate that the returned value must be represented as a stream of data.
The returned data structure would only allow sequential reads (no access by index or key).
It could be treated/implemented more like lazy evaluation than streaming (if we can avoid creating new goroutines and not deal with parallelism, that would be great), since that's how generators work in JavaScript and other languages that support a YIELD keyword. But the Program API would be changed in a stream-like fashion in order to deliver data to the user as soon as possible.