95 Commits (6db7f5525b153b0ceb9d5c39a38a16772bc60e5b)

Author SHA1 Message Date
Michael Peter Christen 25573bd5ab added a crawl filter based on <div> tag class names
7 years ago
luccioman 46f37e38dc Customized Threads with generic name for easier monitoring.
7 years ago
reger 51a4e03c93 Allow to stop currently running warc import (stop button)
7 years ago
reger 1737af37cf Set request originator to own peer in warc importer
8 years ago
reger 039162fbf0 Change warc importer to use defaultsurrogate-crawl profile, as reported
8 years ago
luccioman edd7ccac40 Added some JavaDoc
8 years ago
reger bec34d3546 Add url input field as source for WarcImporter
8 years ago
luccioman f66438442e Extended Mediawiki dump import to remote URLs.
8 years ago
reger ba339a2a45 Add servlet to import warc file from filesystem IndexImportWarc_p.html.
8 years ago
reger 510f11d374 Implement surrogate import from Warc archives (as first option handle
8 years ago
luccioman 6a4d51d8f9 Cleaned up some Javadoc warnings.
8 years ago
luccioman eec5779889 Added a name prefix to pooled threads for easier monitoring.
8 years ago
luccioman f0639d810c Customized name for Threads still using the default "Thread-n" pattern.
8 years ago
luc 571bc55937 Refactoring : use StandardCharsets constants instead of hard-coded
9 years ago
reger 46ac0867ff fix poison mediawikiimporter output queue also after ExecutionException
9 years ago
reger a7591d3ed0 fix mediawikiimporter number format exception on coordinate parsing
9 years ago
reger 6b7c10cef8 fix dc:date in mediawikiimporter/document.writexml to use lastmodified
9 years ago
luc 7736ee5a42 Updated MediaWimporter main() : display usage in console and stop
9 years ago
reger bbe9df2bb3 fix MediawikiImporter for bz2 dump
9 years ago
Michael Peter Christen 6f4fe4b175 revert of 8a7c68e4c7
10 years ago
Michael Peter Christen fed26f33a8 enhanced timezone management for indexed data:
10 years ago
Michael Peter Christen b5ac29c9a5 added a html field scraper which reads text from html entities of a
10 years ago
Michael Peter Christen 321840fde3 Replaced all fixed thread pools with cached thread pools. The cached
10 years ago
orbiter 97983ba89f fixed generics warnings for generic array instantiation that appeared
11 years ago
reger 8a7c68e4c7 content of surrogates/out never accessed (remove)
11 years ago
reger 121d25be38 recover sax fatal error on OAI-PMH import of xml with entity error
11 years ago
Michael Peter Christen da86f150ab - added a new Crawler Balancer: HostBalancer and HostQueues:
11 years ago
Michael Peter Christen 8b44fcf0f4 added missing @Override annotation
11 years ago
Michael Peter Christen 61c5e40687 - replaced the properties object in AnchorURL with distinct variables
11 years ago
Michael Peter Christen 5e31bad711 - the webgraph shall store all links which appear on a web page and not
11 years ago
Michael Peter Christen 765943a4b7 Redesign of crawler identification and robots steering. A non-p2p user
11 years ago
Michael Peter Christen 47b1c81d08 - refactoring
11 years ago
Roland Haeder 841a28ae76 Added 'final' for all exception blocks as this helps the Java compiler
11 years ago
Michael Peter Christen 5878c1d599 - refactoring of log to ConcurrentLog:
12 years ago
Michael Peter Christen 8f2d3ce2f9 reduced locking situation in crawler: shifted synchronized location and
12 years ago
Michael Peter Christen 5f0ab25382 removed the option to prevent removal of &amp; parts inside of the
12 years ago
Michael Peter Christen 1533bfd63b refactoring
12 years ago
Michael Peter Christen 00c1c777fa refactoring
12 years ago
Michael Peter Christen 24d9db1613 snippet retrieval loading processes may use a smaller minimum load time
12 years ago
orbiter 0cbda0b2b8 - replaced all length() == 0 and size() == 0 with isEmpty()
13 years ago
Michael Peter Christen 0301aba1e9 removed unused method parameters
13 years ago
Michael Peter Christen d3964253ae - added @SuppressWarnings to unused servlet method parameters
13 years ago
Michael Peter Christen ea10766bfd cleaned unnecessary nested code
13 years ago
Michael Peter Christen 1825f165b8 better integration of blacklist according to use case
13 years ago
Michael Peter Christen ce8d4b87d9 fixes for new eclipse 'Juno' warning 'Resource leak'.
13 years ago
Michael Peter Christen 963f92ed9a - merged files
13 years ago
Michael Peter Christen dd88d0ace2 more logging
13 years ago
Michael Peter Christen 461a0ce052 removed warnings
13 years ago
Michael Peter Christen 964406ad17 added concurrency enhancement to xml parser
13 years ago
Michael Peter Christen 4d3cc02168 replaced old bzip2 library against better documented commons-compress
13 years ago