401 Commits (c88c30a5c52dafb46c6d3eb401d23aa5feed63f1)

Author SHA1 Message Date
Michael Peter Christen 910a496c9f replaced http links with https
4 months ago
zutto 5268ae2ce9 check the document protocol & host values before proceeding to form final url.
5 months ago
zutto d958d1c0c4 ensure that returned SolrDocument is not null
5 months ago
Michael Peter Christen 9fcd8f1bda added canonical filter
2 years ago
Michael Peter Christen 5a52b01c09 front-end integration of tag valency
2 years ago
Michael Peter Christen 7f728bb4b4 crawl profile storage extension for tag valency
2 years ago
Michael Christen 4304e07e6f crawl profile adoption to new tag valency attribute
2 years ago
Michael Peter Christen 5acd98f4da introduction of tag-to-indexing relation TagValency
2 years ago
Michael Christen 99174282d8 try to shut down in a bit more ordered way
2 years ago
Michael Christen 8a06beaf24 removed finalize() methods, deprecated
2 years ago
Michael Peter Christen 23f1dc3741 addressing/fixing some concurrency issues from
2 years ago
Michael Peter Christen fc98ca7a9c removed ContentControl servlet and functionality
2 years ago
Daleth Darko 3ced06c731 Various javadoc fixes
3 years ago
Michael Peter Christen d7b17d8935 fixed missing thread name revert after balancer waiting
3 years ago
Michael Peter Christen bd3f2483a1 replaced url and date retrieval by only url retrieval
3 years ago
Michael Peter Christen 163ba26d90 replaced check for load time method
3 years ago
Michael Peter Christen 1ead7b85b5 remove compiler warning
3 years ago
sgaebel cdf901270c always use HTTPClient by 'try with resources' pattern to free up
3 years ago
sgaebel 69adaa9f55 makes our HTTPClient closable
3 years ago
Michael Peter Christen d19872fd26 making sure that crawl queues are closed correctly to prevent data loss
3 years ago
Michael Peter Christen e6a87e0426 enhanced crawler
3 years ago
Michael Peter Christen 9e13d77de4 removed call to class.finalize() because of deprecation in java 9
3 years ago
Michael Christen b2af745dd6 Merge pull request #404 from lnceballosz/master
4 years ago
sgaebel c69c462a15 replaces an expensive getLoadTimeURL() by exists()
4 years ago
Lina Ceballos a96752f5ab adding SPDX license and copyright headers
4 years ago
Michael Peter Christen 63f58e4785 enhanced strategy in host browser
4 years ago
Michael Peter Christen 9be36800a4 increased redirect depth by one
4 years ago
sgaebel fc03c4b4fe removes some warning and unused objects
4 years ago
sgaebel 9bc2297161 fixes deleting during recrawl
4 years ago
sgaebel 80785b785e adds deleting during recrawl
4 years ago
Michael Peter Christen e0ad8ca9da replaced json library from JSON.org with libandroid-json-java
5 years ago
luccioman 6b45cd5799 New optional crawl filter on the URL a doc must match to crawl its links
6 years ago
sgaebel 8d2e7262d9 Recrawl:
6 years ago
luccioman 08ea0b0397 Added a configurable timeout to wkhtmltopdf calls for pdf snapshots
6 years ago
luccioman fcf6b16db4 Added new crawler attribute for finer control over Media Type detection
6 years ago
luccioman 7adbd1f87d Fixed raw IPV6 addresses snapshots read/write on FAT32 and NTFS fs
6 years ago
luccioman 4ee14ff3c5 Fixed NullPointerException case on malformed crawl queue folder name
6 years ago
luccioman 21ad9435ec Fixed crawl queue folder naming for IPv6 hosts on MS Windows filesystems
6 years ago
luccioman dcad393fe5 Fixed exceeding max size of failreason_s Solr field on large link list
6 years ago
luccioman c726154a59 Fixed removal of URLs from the delegatedURL remote crawl stack
6 years ago
luccioman a15ac8e0ca Made CrawlProfile loading tolerant to malformed json string attribute
7 years ago
luccioman a715bb7876 Fixed rendering of solr mustNoMatch value on CrawlProfileEditor_p.xml
7 years ago
luccioman 0b302c5004 Do not block whole server startup on persisted crawl profile load error
7 years ago
luccioman 4d9aa4ed1e Fixed default crawl profile solr mustnotmatch query from previous commit
7 years ago
luccioman cced94298a Added a new crawler document filter type using Solr syntax
7 years ago
Michael Christen e0dc632020 removed transformer
7 years ago
luccioman fa4399d5d2 Small perf improvement : initialize threads names early when possible
7 years ago
luccioman fb3032c530 Added a crawl filtering possibility on documents Media Type (MIME)
7 years ago
luccioman e45afedee4 Added support for enclosures (media links) to the RSS loader
7 years ago
luccioman aaefd5219c Reduce log verbosity of RSS loader on feed items with no link
7 years ago