182 Commits (c2ad1950e88842a6211585379f5dd3a3ce05f280)

Author SHA1 Message Date
Michael Peter Christen 9fcd8f1bda added canonical filter
2 years ago
Michael Peter Christen 5a52b01c09 front-end integration of tag valency
2 years ago
Michael Peter Christen 7f728bb4b4 crawl profile storage extension for tag valency
2 years ago
Michael Christen 4304e07e6f crawl profile adoption to new tag valency attribute
2 years ago
Michael Peter Christen 5acd98f4da introduction of tag-to-indexing relation TagValency
2 years ago
Michael Christen 8a06beaf24 removed finalize() methods, deprecated
2 years ago
Daleth Darko 3ced06c731 Various javadoc fixes
3 years ago
Michael Peter Christen e6a87e0426 enhanced crawler
3 years ago
Michael Peter Christen 9e13d77de4 removed call to class.finalize() because of deprecation in java 9
3 years ago
Lina Ceballos a96752f5ab adding SPDX license and copyright headers
4 years ago
Michael Peter Christen e0ad8ca9da replaced json library from JSON.org with libandroid-json-java
5 years ago
luccioman 6b45cd5799 New optional crawl filter on the URL a doc must match to crawl its links
6 years ago
luccioman 08ea0b0397 Added a configurable timeout to wkhtmltopdf calls for pdf snapshots
6 years ago
luccioman fcf6b16db4 Added new crawler attribute for finer control over Media Type detection
6 years ago
luccioman 7adbd1f87d Fixed raw IPV6 addresses snapshots read/write on FAT32 and NTFS fs
6 years ago
luccioman 4ee14ff3c5 Fixed NullPointerException case on malformed crawl queue folder name
6 years ago
luccioman dcad393fe5 Fixed exceeding max size of failreason_s Solr field on large link list
6 years ago
luccioman c726154a59 Fixed removal of URLs from the delegatedURL remote crawl stack
6 years ago
luccioman a15ac8e0ca Made CrawlProfile loading tolerant to malformed json string attribute
7 years ago
luccioman a715bb7876 Fixed rendering of solr mustNoMatch value on CrawlProfileEditor_p.xml
7 years ago
luccioman 4d9aa4ed1e Fixed default crawl profile solr mustnotmatch query from previous commit
7 years ago
luccioman cced94298a Added a new crawler document filter type using Solr syntax
7 years ago
luccioman fa4399d5d2 Small perf improvement : initialize threads names early when possible
7 years ago
luccioman fb3032c530 Added a crawl filtering possibility on documents Media Type (MIME)
7 years ago
luccioman 09c4ee56a7 Added optional https support for remote crawl and profile operations
7 years ago
luccioman 5db1c9155a Do locale independant case conversion on hosts, schemes, and file exts.
7 years ago
Michael Peter Christen 25573bd5ab added a crawl filter based on <div> tag class names
7 years ago
luccioman 9dd790087d Added HT Cache basic statistics (hit rate)
8 years ago
luccioman 28b451a0b3 Made Cache compression level and lock timeout user configurable
8 years ago
luccioman a7394b479b Limit the synchronization blocking time on some Cache operations.
8 years ago
luccioman 8399275142 Properly close file output streams even on exceptions scenarios.
8 years ago
luccioman d98c04853d Ensure proper closing of file input streams.
8 years ago
luccioman 39e081ef38 Fixed display of crawler pending URLs counts in HostBrowser.html page.
8 years ago
reger 87f6631a2a adjust Cache getHeader to prev. changes/commit
8 years ago
luccioman f0639d810c Customized name for Threads still using the default "Thread-n" pattern.
8 years ago
luccioman dcdea2d02f Fixed shutdown for crawler.MaxActiveThreads value greater than 200
8 years ago
luccioman 3ee4f56c39 Improved ErrorCache behavior when switching networks
8 years ago
Michael Peter Christen 5e165a8150 removed unused imports
8 years ago
reger eb2a00b1d8 fix NPE on missing crawldepth_i
9 years ago
reger 7be1c7a05a fix logger name
9 years ago
reger 379e9b330d use supplied url port to get robots.txt in crawlers hostqueue
9 years ago
Ryszard Goń a98c395023 Add the Autocrawl thread
9 years ago
reger 90686a75a2 fix flux factor (additional crawl delay by access count) calculation
9 years ago
reger 367fe388b9 fix exception throw after sendError in DefaultServlet
9 years ago
Michael Peter Christen 8f90767889 fix for filesystem crawl
9 years ago
Michael Peter Christen fbeae20b3a try a healing of the cache if the index file is corrupted
9 years ago
Michael Peter Christen 3c4c69adea fix for
10 years ago
Michael Peter Christen 9c12555be5 added link to Snapshots in search results if the snapshot exists and
10 years ago
Michael Peter Christen 197f7449e5 All entities of crawl profiles are now editable in the crawl profile
10 years ago
reger 3e742d1e34 Init remote crawler on demand
10 years ago