Commit Graph

374 Commits (01cc32217fb2dccaf20284261579addca0d42474)

Author SHA1 Message Date
sgaebel fc03c4b4fe removes some warning and unused objects
4 years ago
sgaebel 9bc2297161 fixes deleting during recrawl
4 years ago
sgaebel 80785b785e adds deleting during recrawl
4 years ago
Michael Peter Christen e0ad8ca9da replaced json library from JSON.org with libandroid-json-java
5 years ago
luccioman 6b45cd5799 New optional crawl filter on the URL a doc must match to crawl its links
6 years ago
sgaebel 8d2e7262d9 Recrawl:
6 years ago
luccioman 08ea0b0397 Added a configurable timeout to wkhtmltopdf calls for pdf snapshots
6 years ago
luccioman fcf6b16db4 Added new crawler attribute for finer control over Media Type detection
6 years ago
luccioman 7adbd1f87d Fixed raw IPV6 addresses snapshots read/write on FAT32 and NTFS fs
6 years ago
luccioman 4ee14ff3c5 Fixed NullPointerException case on malformed crawl queue folder name
6 years ago
luccioman 21ad9435ec Fixed crawl queue folder naming for IPv6 hosts on MS Windows filesystems
6 years ago
luccioman dcad393fe5 Fixed exceeding max size of failreason_s Solr field on large link list
6 years ago
luccioman c726154a59 Fixed removal of URLs from the delegatedURL remote crawl stack
6 years ago
luccioman a15ac8e0ca Made CrawlProfile loading tolerant to malformed json string attribute
7 years ago
luccioman a715bb7876 Fixed rendering of solr mustNoMatch value on CrawlProfileEditor_p.xml
7 years ago
luccioman 0b302c5004 Do not block whole server startup on persisted crawl profile load error
7 years ago
luccioman 4d9aa4ed1e Fixed default crawl profile solr mustnotmatch query from previous commit
7 years ago
luccioman cced94298a Added a new crawler document filter type using Solr syntax
7 years ago
Michael Christen e0dc632020 removed transformer
7 years ago
luccioman fa4399d5d2 Small perf improvement : initialize threads names early when possible
7 years ago
luccioman fb3032c530 Added a crawl filtering possibility on documents Media Type (MIME)
7 years ago
luccioman e45afedee4 Added support for enclosures (media links) to the RSS loader
7 years ago
luccioman aaefd5219c Reduce log verbosity of RSS loader on feed items with no link
7 years ago
luccioman 17c7a85f18 Make StreamResponse usable in Java try-with-resources statements
7 years ago
luccioman 80fb1026d0 Create recrawl requests with the relevant crawl profile.
7 years ago
luccioman 46b5249c20 Removed time condition on HostBalancer initialization in JUnit test.
7 years ago
luccioman 8b572b7337 Commit Solr index before simulating or starting recrawl job.
7 years ago
luccioman 7baa99f26f Fixed stored URL in web cache when redirection(s) occurs.
7 years ago
luccioman 9ddf92d143 Removed unnecessary reflection usage for workflow tasks.
7 years ago
luccioman 897d3d30cc Added new recrawl job profile to the list of default crawl profiles
7 years ago
luccioman b712a0671e Added a specific default crawl profile for the recrawl job.
7 years ago
luccioman adf3fa493d Added comments about crawl profiles recrawl cycles
7 years ago
luccioman 3638e16c2e More comprehensive log on rejected recrawls caused by date constraint
7 years ago
luccioman d47afe6fab Use a constant for crawler reject reason prefix with specific processing
7 years ago
luccioman 4e03335625 Added more details to the recrawl job report
7 years ago
luccioman 433e241e4f Added a report info box about eventual last terminated recrawl job
7 years ago
luccioman b2af25b14f Added a stop condition to the Recrawl busy thread
7 years ago
luccioman 421728d25a Made possible to customize selection query before launching a recrawl
7 years ago
luccioman 09c4ee56a7 Added optional https support for remote crawl and profile operations
7 years ago
luccioman 5db1c9155a Do locale independent case conversion on hosts, schemes, and file exts.
7 years ago
Michael Peter Christen 25573bd5ab added a crawl filter based on <div> tag class names
7 years ago
luccioman 46f37e38dc Customized Threads with generic name for easier monitoring.
7 years ago
luccioman 046be566e1 Updated a license header typo.
7 years ago
Apply55gx 3c905a2a5c fix typo
7 years ago
luccioman 6cec2cdcb5 Use unredirected robots.txt URL when adding an entry to the table.
7 years ago
luccioman 3f0446f14b Ensure proper synchronous robots entry retrieval on first check.
7 years ago
luccioman 5a646540cc Support parsing gzip files from servers with redundant headers.
7 years ago
luccioman 11a7f923d4 Distinguish response parsing failures from unexpected exceptions.
7 years ago
luccioman 452a17a8d5 Finer control on bounded input streams with custom stream implementation
7 years ago
luccioman 1e84956721 Support loading local files with a per request specified maximum size.
7 years ago