DataExtracter Help


DataExtracter helps you quickly extract data from any web page.

All you need to do is:

  • Find out the selectors for the target data
  • Type scripts in the console of the extension's background page, as introduced below.

Quick Start

Extract current page

$('.item', ['a', 'a@href']);

Extract multiple pages (1-10, interval 1)

$('.item', ['a', 'a@href'], "http://sample.com/?pn=${page}", 1, 10, 1);

Extract multiple urls (list)

$('.item', ['a', 'a@href'], ["http://sample.com/abc", "http://sample.com/xyz"]);

Extract specified pages (1,3,5)

$('.item', ['a', 'a@href'], "http://sample.com/?pn=${page}", [1, 3, 5]);
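The two URL-template forms above can be read as producing a plain list of URLs. The following is a minimal sketch of that expansion, not the extension's actual code: `expandPages` and `expandRange` are hypothetical helper names, and it assumes that `${page}` is replaced literally and that the range includes both `from` and `to` (as the "1-10" example suggests).

```javascript
// Hypothetical helpers: illustrate how a URL template might expand to a URL list.
// Assumption: "${page}" is a literal placeholder inside an ordinary string.
function expandPages(urlTemplate, pages) {
  return pages.map((page) => urlTemplate.split("${page}").join(String(page)));
}

// Assumption: the range is inclusive of both `from` and `to`.
function expandRange(urlTemplate, from, to, interval) {
  if (interval <= 0) throw new RangeError("interval must be positive");
  const pages = [];
  for (let page = from; page <= to; page += interval) pages.push(page);
  return expandPages(urlTemplate, pages);
}

const template = "http://sample.com/?pn=${page}";
console.log(expandRange(template, 1, 10, 1)); // pages 1 through 10
console.log(expandPages(template, [1, 3, 5])); // pages 1, 3 and 5 only
```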

Task Call Signatures

// extract data from current page
function (itemsSelector:string, fieldSelectors:string[])
// extract data from a range of pages
function (itemsSelector:string, fieldSelectors:string[], urlTemplate:string, from:number, to:number, interval:number)
// extract data from specified pages of a url template
function (itemsSelector:string, fieldSelectors:string[], urlTemplate:string, pages:number[])
// extract data from a list of urls
function (itemsSelector:string, fieldSelectors:string[], urls:string[])
// extract data from urls that were extracted by the previous task
function (itemsSelector:string, fieldSelectors:string[], urls:ExtractResult)

Stop Tasks

The only way to stop tasks before they finish is to close the target tab.

Tasks wait for their target elements to appear, because some elements are loaded asynchronously.
If you type a wrong selector, the task waits forever for elements that don't exist.

Extract Attributes

e.g. extract the link text and the link target (use 'selector@attribute'):

new Extractor().task('.item', ['a', 'a@href']).start();
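To make the 'selector@attribute' notation concrete, here is a small sketch of how such a field string could be split. `parseFieldSelector` is a hypothetical helper, not part of DataExtracter, and it assumes the last "@" separates the CSS selector from the attribute name.

```javascript
// Hypothetical parser for the 'selector@attribute' notation described above.
// Assumption: the last "@" separates the CSS selector from the attribute name;
// a field without "@" refers to the element's text.
function parseFieldSelector(field) {
  const at = field.lastIndexOf("@");
  if (at === -1) return { selector: field, attribute: null };
  return { selector: field.slice(0, at), attribute: field.slice(at + 1) };
}

console.log(parseFieldSelector("a"));      // selector "a", no attribute
console.log(parseFieldSelector("a@href")); // selector "a", attribute "href"
```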

Advanced Usage

Use Task Chain

e.g. collect links from http://sample.com/abc, then extract data from each link:

e = new Extractor()
e.task('.search-list-item', ['a@href'], ["http://sample.com/abc"])
    .task('list-item', ["a.title", "p.content"])
    .start();
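Conceptually, the second task in a chain visits the URLs that the first task produced. The sketch below shows one way that hand-off could look. It is an illustration under assumptions: it treats a task result as an array of rows with one string per field selector (the real `ExtractResult` type may differ), `urlsFromResult` is a hypothetical name, and skipping duplicate links is a choice made here, not documented behaviour.

```javascript
// Assumption: a task result is an array of rows, one row per matched item,
// each row holding one string per field selector. The real ExtractResult
// type may differ.
function urlsFromResult(rows, column = 0) {
  const seen = new Set();
  const urls = [];
  for (const row of rows) {
    const url = row[column];
    // Keep non-empty strings only, and keep each URL once.
    if (typeof url === "string" && url !== "" && !seen.has(url)) {
      seen.add(url);
      urls.push(url);
    }
  }
  return urls;
}

// Made-up sample rows standing in for the first task's ['a@href'] output.
const firstTaskRows = [
  ["http://sample.com/1"],
  ["http://sample.com/2"],
  ["http://sample.com/1"], // duplicate link, skipped
];
console.log(urlsFromResult(firstTaskRows));
```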

Extractor Options

Specify extra options to make tasks perform some actions before scraping the data.

var job = new Extractor({ "scrollToBottom": 1 });

Available options:

  • scrollToBottom: Try to scroll pages to the bottom; some elements are loaded only when the user needs them.

Export Result of Any Task

For a multi-task Extractor e:

e = new Extractor()
e.task('.search-list-item', ['a@href'], ["http://sample.com/abc"])
    .task('list-item', ["a.title", "p.content"])
    .start();

You will be asked to export the final result when it finishes.

In case you want to export it again, use:

e.export()

To export the result of a task other than the final one:

// export the result of the first task
// in the example above, that is a list of urls
e.export(0)
// export the result of the second task
e.export(1)
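One way to picture the index argument is as a zero-based position in the list of task results, with the final task as the default. The helper below is only a sketch of that reading: `pickResult` is a hypothetical name, and the assumption that results are stored in task order is not confirmed by this document.

```javascript
// Hypothetical helper: which task result a given export index would refer to.
// Assumption: results are kept in task order, and omitting the index
// means "the final task".
function pickResult(results, index) {
  const i = index === undefined ? results.length - 1 : index;
  if (!Number.isInteger(i) || i < 0 || i >= results.length) {
    throw new RangeError("no task result at index " + index);
  }
  return results[i];
}

const results = [["url list"], ["final rows"]]; // made-up placeholder data
console.log(pickResult(results));    // result of the last task
console.log(pickResult(results, 0)); // result of the first task
```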

Task Management

Continue Tasks

Sometimes it is hard to finish all tasks in a single execution; that is why tasks can be continued.

You can always continue tasks with the following call, even if execution stopped in the middle of a task:

e.start()

The Extractor keeps the state of the last execution and starts from where it stopped.

Restart Tasks

What if you don't want to continue from the last state, but want to restart from a certain task instead?

// restart all tasks
e.restart(0)
// restart from 2nd task
e.restart(1)
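The difference between continuing and restarting can be illustrated with a toy model. This is a sketch only: `ToyRunner` is a hypothetical class, it assumes the state is simply the list of results of completed tasks, and it does not model resuming in the middle of a single task, which the real Extractor is described as supporting.

```javascript
// Toy model of the continue/restart behaviour described above.
// Assumption: the state is the list of results of completed tasks;
// the real Extractor's internal state is not documented here.
class ToyRunner {
  constructor(tasks) {
    this.tasks = tasks; // array of functions; each receives the previous result
    this.results = [];  // results of completed tasks, in task order
  }
  // Continue from the first task that has no result yet.
  start() {
    for (let i = this.results.length; i < this.tasks.length; i++) {
      this.results.push(this.tasks[i](this.results[i - 1]));
    }
    return this.results;
  }
  // Discard results from task `from` onward, then run again from there.
  restart(from = 0) {
    this.results.length = Math.min(from, this.results.length);
    return this.start();
  }
}

const runs = [0, 0]; // how many times each task has executed
const runner = new ToyRunner([
  () => { runs[0]++; return ["u1", "u2"]; },
  (urls) => { runs[1]++; return urls.map((u) => u + ":data"); },
]);
runner.start();    // runs both tasks
runner.start();    // nothing left to do
runner.restart(1); // reruns only the second task
console.log(runs); // [ 1, 2 ]
```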

Save & Load State

Some jobs cannot be finished even in a single day, so we need a way to save the current state and come back tomorrow.

Create and run an extractor:

e = new Extractor()
e.task('.search-list-item', ['a@href'], ["http://sample.com/abc"])
    .task('list-item', ["a.title", "p.content"])
    .start();

Save the state:

e.save();

Load the state:

Open the popup window and upload the saved state file. Then, in the background console:

e = new Extractor().load();