# DataExtracter Help
----------------------------

DataExtracter helps you quickly extract data from any web page. All you need to do is:

- Find the selectors for the target data
- Type scripts in the console of the `extension background page`, as introduced below.

![](template/assets/console.png)

## Quick Start

Extract the current page:

```js
$('.item', ['a', 'a@href']);
new Extractor().task('.item', ['a', 'a@href']).start();
// fieldSelectors can be empty strings if items have no children to select
new Extractor().task('.item a', ['', '@href']).start();
```

> `$(...args)` is the short form of `new Extractor().task(...args).start();`, which is introduced later.

Extract multiple pages (1-10, interval 1):

```js
$('.item', ['a', 'a@href'], "http://sample.com/?pn=${page}", 1, 10, 1);
```

Extract multiple URLs (list):

```js
$('.item', ['a', 'a@href'], ["http://sample.com/abc", "http://sample.com/xyz"]);
```

Extract specified pages (1, 3, 5):

```js
$('.item', ['a', 'a@href'], "http://sample.com/?pn=${page}", [1, 3, 5]);
```

## Task Call Signatures

```ts
// extract data from the current page
function (itemsSelector: string, fieldSelectors: string[])

// extract data from a range of pages
function (itemsSelector: string, fieldSelectors: string[], urlTemplate: string, from: number, to: number, interval: number)

// extract data from a list of pages
function (itemsSelector: string, fieldSelectors: string[], urlTemplate: string, pages: number[])

// extract data from a list of URLs
function (itemsSelector: string, fieldSelectors: string[], urls: string[])

// extract data from URLs extracted from the last task's result
function (itemsSelector: string, fieldSelectors: string[], urls: ExtractResult)
```

## Stop Tasks

Close the target tab in which the current tasks are running, or use `job.stop()`:

```js
job = new Extractor()
  .task('.search-list-item', ['a@href'], ["http://sample.com/abc"])
  .task('list-item', ["a.title", "p.content"])
  .start();
job.stop();
```

> The next time you call `job.start();`, the job will continue from where it stopped.

## Extract Attributes

For example, extract the link text and target (use `selector@attribute`):

```js
new Extractor().task('.item', ['a', 'a@href']).start();
```

## Click Selected Elements

The following clicks the selected links (note the `!` prefix) and extracts each link's `text` and `href`:

```js
new Extractor().task('.item', ['!a', 'a@href']).start();
```

## Advanced Usage

### Use a Task Chain

For example, collect links from `http://sample.com/abc`, then extract data from each link:

```js
e = new Extractor();
e.task('.search-list-item', ['a@href'], ["http://sample.com/abc"])
  .task('list-item', ["a.title", "p.content"])
  .start();
```

### Extractor Options

Specify extra options to make tasks perform some actions before scraping the data.

```js
var job = new Extractor({ "scrollToBottom": 1 });
```

Available options:

- `scrollToBottom`: try to scroll pages to the bottom; some elements are loaded only when the user needs them.

### Export Result of Any Task

For a multi-task Extractor `e`:

```js
e = new Extractor();
e.task('.search-list-item', ['a@href'], ["http://sample.com/abc"])
  .task('list-item', ["a.title", "p.content"])
  .start();
```

You will be asked to export the final result when it finishes.
In case you want to export it again, use:

```js
e.export()
```

To export the result of another task, other than the final one:

```js
// export the result of the first task
// in the example above, that is a list of URLs
e.export(0)
// export the result of the second task
e.export(1)
```

## Task Management

### Continue Tasks

Sometimes it's hard to finish all tasks in a single execution; that's why tasks can be continued. You can always continue tasks by starting them again, no matter in what phase they stopped.

```js
e.start()
```

The `Extractor` keeps the execution state and starts from where it stopped.

### Restart Tasks

What if you don't want to continue from the last state, but restart certain tasks?

```js
// restart all tasks
e.restart(0)
// restart from the 2nd task
e.restart(1)
```

### Save & Load State

It may also be hard to finish tasks even in a single day, so we need a way to save the current state and come back tomorrow.

Create and run an extractor:

```js
e = new Extractor();
e.task('.search-list-item', ['a@href'], ["http://sample.com/abc"])
  .task('list-item', ["a.title", "p.content"])
  .start();
```

Save the state:

```js
e.save();
```

Load the state: open the popup window and upload the saved state file. Then, in the background console:

```js
e = new Extractor().load();
e.start();
```

> The uploaded state will be cleaned up after 30 seconds if you don't load it.

## Watch Mode

Watch mode tries to extract data from every page you visit **in the current window**.

```js
e = new Extractor();
e.task('.search-list-item', ['a@href'], ["http://sample.com/abc"])
  .task('list-item', ["a.title", "p.content"]);
e.watch(1); // start watching for the first task
```

To stop watching, you can either close the current window, or:

```js
e.stop();
```

## Results Operation

To get the results of a task:

```js
let results = job.results(0);
```

Visit the URLs (if any) in the results one by one:

```js
results.visit();
```

Walk through all results one by one:

```js
results.walk((row, col, value) => { console.log(value) });
```

## Development

Clone this project and execute:

```sh
npm i
npm run prod # or npm run dev
```
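
## Putting It Together

For quick reference, here is a minimal sketch that strings together the calls documented above (an option, a task chain, saving state, exporting, and walking results). The selectors and URLs are the placeholders reused from the earlier examples, and only methods already shown in this document are used.

```js
// A sketch only: selectors and URLs are placeholders from the examples above.
var e = new Extractor({ "scrollToBottom": 1 });  // scroll pages to the bottom before scraping

e.task('.search-list-item', ['a@href'], ["http://sample.com/abc"]) // first task: collect links
  .task('list-item', ["a.title", "p.content"])                     // second task: extract from each link
  .start();

// Later, in the background console:
e.save();                 // persist the current state for another session
e.export(0);              // export the link list collected by the first task

var rows = e.results(1);  // results of the second task
rows.walk((row, col, value) => { console.log(value) });
```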