Commit Graph

189 Commits (43d5cd101ef7790a993f0810e20db38a559019f8)

Author SHA1 Message Date
Michael Christen 4304e07e6f crawl profile adoption to new tag valency attribute
2 years ago
reger24 a7e93d9328 Add option to add host to default blacklist from search result
3 years ago
reger24 027e284ef9 Enhance notability of current blacklist by diff color in header
3 years ago
reger24 18dddb74c9 Harmonize loading/reading blacklist
3 years ago
reger24 f28d705cd0 update IndexBroser_p add to blacklist button
3 years ago
reger24 6a5f0b3684 Servlet IndexBroser_p add button "Add to blacklist"
3 years ago
Daleth Darko 3ced06c731 Various javadoc fixes
3 years ago
Michael Peter Christen bd3f2483a1 replaced url and date retrieval by only url retrieval
3 years ago
Michael Peter Christen 9be36800a4 increased redirect depth by one
4 years ago
luccioman 61c337f29a Decode blacklist entries for easier edition of non ascii chars
6 years ago
luccioman ed93221fa1 Improved normalization of blacklist path patterns having non ascii chars
6 years ago
luccioman dbf4c1cd76 Improved blacklist entries editing operations :
7 years ago
luccioman 7baa99f26f Fixed stored URL in web cache when redirection(s) occurs.
7 years ago
luccioman 5db1c9155a Do locale independent case conversion on hosts, schemes, and file exts.
7 years ago
Michael Peter Christen 25573bd5ab added a crawl filter based on <div> tag class names
7 years ago
luccioman d8eaf621cc Fixed blacklist returned location URL on empty parameters
7 years ago
luccioman 1e84956721 Support loading local files with a per request specified maximum size.
7 years ago
luccioman bf55f1d6e5 Started support of partial parsing on large streamed resources.
7 years ago
luccioman 0b75e92ac2 Do not wrap unnecessarily loader IOExceptions in IOExceptions
7 years ago
luccioman 433bdb7c0d Respect maxFileSize limit also when streaming HTTP and when relevant.
7 years ago
luccioman 8399275142 Properly close file output streams even on exceptions scenarios.
8 years ago
luccioman a04feac064 Ensure file input streams proper closing in both success and failures
8 years ago
luccioman a9cb083fa1 Improved consistency between loader openInputStream and load functions
8 years ago
luccioman 522a268305 Improved new blacklist entries URL scheme detection.
8 years ago
luccioman 58d23047dd Handle '?' and '+' chars as valid wild cards when adding to blacklist.
8 years ago
luccioman f66438442e Extended Mediawiki dump import to remote URLs.
8 years ago
luccioman 54405577aa Replaced absolute redirection locations by relative ones when possible.
8 years ago
luccioman 339f005ced Blacklist import and update performance improvements.
8 years ago
reger 395f2e8946 Make ServletRequest implement the standardized HttpServletRequest interface,
8 years ago
luccioman f0639d810c Customized name for Threads still using the default "Thread-n" pattern.
8 years ago
luccioman da362628fb Added fine log level for too long blacklist matching processing.
8 years ago
luccioman 4b699c469a Blacklist refactoring : extracted a function for easier unit testing
8 years ago
luccioman 242707f9b4 Fixed loadFromCache with strategy IFFRESH.
8 years ago
Michael Peter Christen 5e165a8150 removed unused imports
8 years ago
reger 5e335b32da fix Blacklist.contains() matching path pattern to string
8 years ago
reger 5e9e871192 fix Blacklist.remove by using pattern.toString to find pattern to remove,
8 years ago
reger 1843ea7e69 on Blacklist.add pattern to source file also update internal entry maps
8 years ago
reger efb9f1a8b7 save resource for unused blacklistFiles map
9 years ago
reger 06d0e2aeb9 result heuristic (also used in greedy learning mode) to use outbound links if result is full index doc. Otherwise use default loader method.
9 years ago
reger b7e8358645 make use of header.getContentType where possible (mime is normalized afterwards)
9 years ago
luc f01d49c37a Process large or local file images dealing directly with content
9 years ago
Michael Peter Christen fed26f33a8 enhanced timezone management for indexed data:
10 years ago
Michael Peter Christen b5ac29c9a5 added a html field scraper which reads text from html entities of a
10 years ago
Michael Peter Christen 69eacdf4eb applying precompiled CommonPattern.COMMA.split to all places where
10 years ago
Michael Peter Christen 8df8ffbb6d enhanced the snapshot functionality:
10 years ago
reger ff18129def ViewFile servlet: update index if newer,
10 years ago
Michael Peter Christen e586e423aa in case that loading from the cache fails, load from wkhtmltopdf without
10 years ago
Michael Peter Christen d5bac64421 recognize more html file types for snapshots
10 years ago
Michael Peter Christen 25a64c51b3 moved snapshot generation out of the html handler to prevent that
10 years ago
reger 48aed15c48 skip loader wait cycle on concurrent access in nocache configuration.
10 years ago