# DataExtracter Help
----------------------------

DataExtracter helps you quickly extract data from any web page.

All you need to do is:

- Find the selectors for the target data.
- Type scripts in the console of the `extension background page`, as introduced below.

## Quick Start

Extract the current page:

```js
$('.item', ['a', 'a@href']);
```

Extract multiple pages (1-10, interval 1):

```js
$('.item', ['a', 'a@href'], "http://sample.com/?pn=${page}", 1, 10, 1);
```
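
The page-range form can be pictured as expanding the `${page}` placeholder over the range before each page is scraped. The sketch below is an illustration only (the helper name `expandUrlTemplate` is hypothetical, not part of the extension's API):

```js
// Illustration only: expand a "${page}" URL template over a page range.
function expandUrlTemplate(urlTemplate, from, to, interval) {
  const urls = [];
  for (let page = from; page <= to; page += interval) {
    // '${page}' is a literal marker here, not a JS template string
    urls.push(urlTemplate.replace('${page}', String(page)));
  }
  return urls;
}

expandUrlTemplate("http://sample.com/?pn=${page}", 1, 3, 1);
// → ["http://sample.com/?pn=1", "http://sample.com/?pn=2", "http://sample.com/?pn=3"]
```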

Extract multiple URLs (list):

```js
$('.item', ['a', 'a@href'], ["http://sample.com/abc", "http://sample.com/xyz"]);
```

Extract specified pages (1, 3, 5):

```js
$('.item', ['a', 'a@href'], "http://sample.com/?pn=${page}", [1, 3, 5]);
```

## Task Call Signatures

```ts
// extract data from the current page
function (itemsSelector: string, fieldSelectors: string[])
// extract data from a range of pages
function (itemsSelector: string, fieldSelectors: string[], urlTemplate: string, from: number, to: number, interval: number)
// extract data from a list of pages
function (itemsSelector: string, fieldSelectors: string[], urlTemplate: string, pages: number[])
// extract data from a list of URLs
function (itemsSelector: string, fieldSelectors: string[], urls: string[])
// extract data from the URLs produced by the previous task's result
function (itemsSelector: string, fieldSelectors: string[], urls: ExtractResult)
```

## Stop Tasks

Close the target tab in which the current tasks are running.

## Extract Attributes

e.g. link text and link target (use `selector@attribute`):

```js
new Extractor().task('.item', ['a', 'a@href']).start();
```
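
The `selector@attribute` convention splits each field selector into a CSS selector and an optional attribute name. A minimal sketch of that split, assuming a plain selector means "extract the element's text" (the helper `parseFieldSelector` is hypothetical, shown only to clarify the syntax):

```js
// Illustration only: split a 'selector@attribute' field selector.
function parseFieldSelector(fieldSelector) {
  const at = fieldSelector.lastIndexOf('@');
  if (at === -1) {
    // no '@' — take the element's text content
    return { selector: fieldSelector, attribute: null };
  }
  return {
    selector: fieldSelector.slice(0, at),   // CSS selector part
    attribute: fieldSelector.slice(at + 1), // attribute to read
  };
}

parseFieldSelector('a@href');
// → { selector: 'a', attribute: 'href' }
```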

## Advanced Usage

### Use a Task Chain

e.g. collect links from `http://sample.com/abc`, then extract data from each link:

```js
e = new Extractor()
e.task('.search-list-item', ['a@href'], ["http://sample.com/abc"])
    .task('.list-item', ["a.title", "p.content"])
    .start();
```

### Extractor Options

Specify extra options to make a task perform some actions before scraping the data.

```js
var job = new Extractor({ "scrollToBottom": 1 });
```

Available options:

- `scrollToBottom`: try to scroll pages to the bottom; some elements are lazy-loaded only when the user scrolls to them.


### Export the Result of Any Task

For a multi-task Extractor `e`:

```js
e = new Extractor()
e.task('.search-list-item', ['a@href'], ["http://sample.com/abc"])
    .task('.list-item', ["a.title", "p.content"])
    .start();
```

You will be asked to export the final result when it finishes.

In case you want to export it again, use:

```js
e.export()
```

To export another task's result, other than the final one:

```js
// export the result of the first task
// in the example above, that is a list of URLs
e.export(0)
// export the result of the second task
e.export(1)
```

## Task Management

### Continue Tasks

Sometimes it's hard to finish all tasks in a single execution; that's why tasks can be continued.

You can always continue tasks by starting them again, no matter in what phase they stopped:

```js
e.start()
```

The `Extractor` keeps its execution state and resumes from where it stopped.
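
The resume behaviour can be pictured as a runner that records the index of the last completed task and only advances it when a task succeeds, so a second `start()` skips work that is already done. This is a hypothetical sketch of the idea, not the extension's source:

```js
// Illustration only: a runner that resumes from where it stopped.
class ResumableRunner {
  constructor(tasks) {
    this.tasks = tasks; // array of functions, run in order
    this.done = 0;      // index of the next task to run
  }
  start() {
    while (this.done < this.tasks.length) {
      this.tasks[this.done]();  // may throw, e.g. if the tab was closed
      this.done += 1;           // advance only after the task succeeds
    }
  }
}
```

If `start()` fails partway through, calling `start()` again re-runs only from the failed task onward.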

### Restart Tasks

What if you don't want to continue from the last state, but restart from a certain task?

```js
// restart all tasks
e.restart(0)
// restart from the 2nd task
e.restart(1)
```

### Save & Load State

It may be hard to finish tasks even within a single day, so you need a way to save the current state and come back tomorrow.

Create and run an extractor:

```js
e = new Extractor()
e.task('.search-list-item', ['a@href'], ["http://sample.com/abc"])
    .task('.list-item', ["a.title", "p.content"])
    .start();
```

Save the state:

```js
e.save();
```

Load the state:

Open the popup window and upload the saved state file. Then, in the background console:

```js
e = new Extractor().load();
e.start();
```

> The uploaded state will be cleaned up after 30 seconds if you don't load it.

## Development

Clone this project and execute:

```sh
npm i
npm run prod
# or
npm run dev
```