# DataExtracter Help

----------------------------

DataExtracter helps you quickly extract data from any web page.

All you need to do is:

- Find the selectors for the target data
- Type scripts in the console of the `extension background page`, as introduced below.

## Quick Start

Extract current page

```js
$('.item', ['a', 'a@href']);
new Extractor().task('.item', ['a', 'a@href']).start();
// fieldSelectors can be empty strings if items have no child to select
new Extractor().task('.item a', ['', '@href']).start();
```

> `$(...args)` is the short form of `new Extractor().task(...args).start();`, which is introduced later.

Extract multiple pages (1-10, interval 1)

```js
$('.item', ['a', 'a@href'], "http://sample.com/?pn=${page}", 1, 10, 1);
```

Extract multiple urls (list)

```js
$('.item', ['a', 'a@href'], ["http://sample.com/abc", "http://sample.com/xyz"]);
```

Extract specified pages (1,3,5)

```js
$('.item', ['a', 'a@href'], "http://sample.com/?pn=${page}", [1, 3, 5]);
```

## Task Call Signatures

```ts
// extract data from current page
function (itemsSelector:string, fieldSelectors:string[])
// extract data from a range of pages
function (itemsSelector:string, fieldSelectors:string[], urlTemplate:string, from:number, to:number, interval:number)
// extract data from a list of pages
function (itemsSelector:string, fieldSelectors:string[], urlTemplate:string, pages:number[])
// extract data from a list of urls
function (itemsSelector:string, fieldSelectors:string[], urls:string[])
// extract data from urls that were extracted by the previous task
function (itemsSelector:string, fieldSelectors:string[], urls:ExtractResult)
```

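How the URL-related arguments in these signatures might translate into a list of pages to visit can be sketched as follows. `resolveUrls` is a hypothetical helper written for illustration, not part of DataExtracter. It assumes that `${page}` in the template is replaced by each page number and that the `to` bound is inclusive, as the "1-10" example in Quick Start suggests.

```js
// Hypothetical helper: NOT part of the DataExtracter API.
// Assumes `${page}` is a literal placeholder and that `to` is inclusive.
function resolveUrls(...args) {
  const [first, ...rest] = args;
  // (urls: string[]) -> use the list as given
  if (Array.isArray(first)) return [...first];

  const fill = (page) => first.replace('${page}', String(page));

  // (urlTemplate, pages: number[]) -> one URL per listed page
  if (Array.isArray(rest[0])) return rest[0].map(fill);

  // (urlTemplate, from, to, interval) -> one URL per step in the range
  const [from, to, interval] = rest;
  if (!(interval > 0)) throw new Error('interval must be a positive number');
  const urls = [];
  for (let page = from; page <= to; page += interval) urls.push(fill(page));
  return urls;
}

console.log(resolveUrls('http://sample.com/?pn=${page}', 1, 10, 1).length);
console.log(resolveUrls('http://sample.com/?pn=${page}', [1, 3, 5]));
console.log(resolveUrls(['http://sample.com/abc', 'http://sample.com/xyz']));
```

Note that the template is an ordinary quoted string, so `${page}` reaches the extractor unchanged instead of being evaluated by JavaScript as a template literal.
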
## Stop Tasks

Close the target tab in which the current tasks are running.

Or use `job.stop()`:

```js
job = new Extractor().task('.search-list-item', ['a@href'], ["http://sample.com/abc"])
    .task('list-item', ["a.title", "p.content"])
    .start();
job.stop();
```

> Next time you call `job.start();`, the job continues from where it stopped.

## Extract Attributes

e.g.: link text and target (use 'selector@attribute')

```js
new Extractor().task('.item', ['a', 'a@href']).start();
```

## Click Selected Elements

Prefix a field selector with `!` to click the selected element. The following clicks the selected links and extracts each link's `text` and `href`:

```js
new Extractor().task('.item', ['!a', 'a@href']).start();
```

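The field selector notation used so far (`selector`, `selector@attribute`, an empty selector, and the `!` click directive) can be summarized with a small parser. This is only a sketch of the notation: `parseFieldSelector` is a hypothetical function, and DataExtracter's actual parsing may differ, for example in how it treats an `@` inside a selector.

```js
// Hypothetical parser: illustrates the notation only.
// DataExtracter's real implementation may differ.
function parseFieldSelector(field) {
  // A leading "!" asks for the element to be clicked.
  const click = field.startsWith('!');
  const body = click ? field.slice(1) : field;
  // Everything after the last "@" is taken as the attribute name.
  const at = body.lastIndexOf('@');
  return {
    click,
    selector: at === -1 ? body : body.slice(0, at), // '' means the item itself
    attribute: at === -1 ? null : body.slice(at + 1), // null means element text
  };
}

for (const field of ['a', 'a@href', '', '@href', '!a']) {
  console.log(JSON.stringify(field), parseFieldSelector(field));
}
```
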
## Advanced Usage

### Use Task Chain

e.g.: Collect links from `http://sample.com/abc`, then extract data from each link

```js
e = new Extractor()
e.task('.search-list-item', ['a@href'], ["http://sample.com/abc"])
    .task('list-item', ["a.title", "p.content"])
    .start();
```

### Extractor Options

Specify extra options to make tasks perform some actions before scraping the data.

```js
var job = new Extractor({ "scrollToBottom": 1 });
```

Available options:

- `scrollToBottom`: Try to scroll pages to the bottom; some elements are loaded only when the user needs them.

### Export Result of Any Task

For a multi-task Extractor `e`:

```js
e = new Extractor()
e.task('.search-list-item', ['a@href'], ["http://sample.com/abc"])
    .task('list-item', ["a.title", "p.content"])
    .start();
```

You will be asked to export the final result when it finishes.

In case you want to export it again, use:

```js
e.export()
```

To export another task result, other than the final one:

```js
// export the result of first task
// in the example above, that is a list of urls
e.export(0)
// export the result of second task
e.export(1)
```

## Task Management

### Continue Tasks

Sometimes it's hard to finish all tasks in a single execution; that's why tasks can be continued.

You can always continue tasks by starting the extractor again, no matter in what phase it stopped.

```js
e.start()
```

The `Extractor` keeps the execution state and resumes from where it stopped.

### Restart Tasks

What if you don't want to continue from the last state, but restart certain tasks?

```js
// restart all tasks
e.restart(0)
// restart from 2nd task
e.restart(1)
```

### Save & Load State

It may also be hard to finish tasks even in a single day, so we need a way to save the current state and come back tomorrow.

Create and run an extractor:

```js
e = new Extractor()
e.task('.search-list-item', ['a@href'], ["http://sample.com/abc"])
    .task('list-item', ["a.title", "p.content"])
    .start();
```

Save the state:

```js
e.save();
```

Load the state:

Open the popup window and upload the saved state file. Then, in the background console:

```js
e = new Extractor().load();
e.start();
```

> The uploaded state will be cleared after 30 seconds if you don't load it.

## Watch Mode

Watch mode tries to extract data from every page you visit **in the current window**.

```js
e = new Extractor();
e.task('.search-list-item', ['a@href'], ["http://sample.com/abc"])
    .task('list-item', ["a.title", "p.content"]);
e.watch(1); // start watching for first task
```

To stop watching, you can either close the current window, or:

```js
e.stop();
```

## Results Operation

To get the results of a task:

```js
let results = job.results(0);
```

Visit URLs (if any) in the results one by one:

```js
results.visit();
```

Walk through all results one by one:

```js
results.walk((row, col, value) => { console.log(value); });
```

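To make the `(row, col, value)` callback shape concrete, here is a minimal stand-alone sketch of a walk over tabular results. It does not use the real `ExtractResult` class: `rows` and `walkRows` are hypothetical, and the sketch assumes results are stored as one array per extracted item with one entry per field selector, and that `row` and `col` are zero-based indexes.

```js
// Hypothetical stand-in for extracted data: one inner array per item,
// one entry per field selector. The real internal layout is an assumption.
const rows = [
  ['First link', 'http://sample.com/abc'],
  ['Second link', 'http://sample.com/xyz'],
];

// Calls fn(row, col, value) for every cell, row by row.
function walkRows(data, fn) {
  data.forEach((fields, row) => {
    fields.forEach((value, col) => fn(row, col, value));
  });
}

const seen = [];
walkRows(rows, (row, col, value) => seen.push(`${row}:${col}=${value}`));
console.log(seen.join('\n'));
```
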
## Development

Clone this project and execute:

```sh
npm i
npm run prod
# or
npm run dev
```