34 Commits (6bd5f49c412be8edaadc5bc707dcd2a521207da5)

Author SHA1 Message Date
Michael Peter Christen 9fcd8f1bda added canonical filter
2 years ago
Michael Peter Christen 5a52b01c09 front-end integration of tag valency
2 years ago
Michael Peter Christen a2a40a3096 new link to crawlstart api documentation
2 years ago
Michael Peter Christen 3959d43a5c fixed doku link
3 years ago
Michael Peter Christen d0abb0cedb enabling all crawl profiles in all network modes
4 years ago
Michael Peter Christen f03e16d3df enhanced crawl start url check experience
5 years ago
luccioman 6b45cd5799 New optional crawl filter on the URL a doc must match to crawl its links
6 years ago
luccioman fcf6b16db4 Added new crawler attribute for finer control over Media Type detection
6 years ago
luccioman 92e10d7d1c Added a crawl start hint message on availability or not of wkhtmltopdf
6 years ago
luccioman 534f09e92b Added and updated hint messages about remote crawler status
6 years ago
luccioman cced94298a Added a new crawler document filter type using Solr syntax
7 years ago
luccioman fb3032c530 Added a crawl filtering possibility on documents Media Type (MIME)
7 years ago
Michael Peter Christen 187075b878 added nav filter
7 years ago
luccioman 7c644090ff Fixed CrawlStartExpert.html HTML validation errors
7 years ago
luccioman 519fc9a600 Issue #156 : new option to clean up (or not) search cache on crawl start
7 years ago
luccioman eb20589e29 Fixed issue #158 : completed div CSS class ignore in crawl
7 years ago
luccioman 79a2ba306a Updated links to Java Regular Expressions documentation to version 8
7 years ago
Michael Peter Christen 25573bd5ab added a crawl filter based on <div> tag class names
7 years ago
luccioman 0f80c978d6 Limit the number of initially previewed links in crawl start pages.
8 years ago
luccioman 62f75417ef Updated Pattern JavaDoc links to current minimum (1.7) JDK version.
8 years ago
luccioman 812abfc868 Converted one more set of URLs to pure relative ones.
8 years ago
Michael Peter Christen 97930a6aad added must-not-match filter to snapshot generation.
10 years ago
Michael Peter Christen fed26f33a8 enhanced timezone management for indexed data:
10 years ago
Michael Peter Christen 1309619a71 remove remote indexing option in crawl start if not in p2p mode
10 years ago
Michael Peter Christen b5ac29c9a5 added a html field scraper which reads text from html entities of a
10 years ago
Michael Peter Christen 8df8ffbb6d enhanced the snapshot functionality:
10 years ago
Michael Peter Christen 6f0167fac1 get cloned crawl start parameter for snapshots
10 years ago
Michael Peter Christen 97f6089a41 YaCy can now create web page snapshots as pdf documents which can later
10 years ago
orbiter f642cfbe30 added hint to the regular expression tester
10 years ago
Michael Peter Christen 2de159719b added an option to set 'obey nofollow' for links with rel="nofollow"
10 years ago
Michael Peter Christen 1b279d7a7e fixed external link
11 years ago
reger 89e2c5e884 fix: allow enable of CrawlStartExpert.html #file
11 years ago
Michael Peter Christen a2fba6584f use submitted default userAgent if cloning a crawl
11 years ago
orbiter d29b6db270 made crawl start pages public since they do not reveal individual
11 years ago