62 Commits (7d5ba2afa4810fc52dc229a82a9020768fb6c150)

Author SHA1 Message Date
reger 9e94989237 upd to PDFBox 2.0.1
9 years ago
reger 24b0fa2a38 extend snapshot Html2Image.pdf2image to use PDFBox image export capability
9 years ago
reger 06d0e2aeb9 result heuristic (also used in greedy learning mode) to use outbound links if result is full index doc. Otherwise use default loader method.
9 years ago
luc 571bc55937 Refactoring : use StandardCharsets constants instead of hard-coded
9 years ago
reger 7d0d19cb8e avoid File.deleteOnExit() on temp files
9 years ago
Michael Peter Christen fed26f33a8 enhanced timezone management for indexed data:
10 years ago
Michael Peter Christen b5ac29c9a5 added a html field scraper which reads text from html entities of a
10 years ago
Michael Peter Christen 6ad43c4a8b removed debug code
10 years ago
Michael Peter Christen 8c3e5b7b6d added experimental pdf splitting which enables YaCy to split pdfs during
10 years ago
reger 59c6532a65 add link extraction to pdfParser
10 years ago
reger aa2e15d846 allow url parameter in worktable apicall
10 years ago
reger 3b559e7846 optimize pdfParser
11 years ago
reger 09f73b790f fix pdfParser not closed warning from pdfbox
11 years ago
orbiter 19a051bec8 more monitoring for postprocessing and enhanced layout in Crawler
11 years ago
Michael Peter Christen 81d9e23532 fixed another memory leak in the PDF parser:
11 years ago
Michael Peter Christen 61c5e40687 - replaced the properties object in AnchorURL with distinct variables
11 years ago
Michael Peter Christen 5e31bad711 - the webgraph shall store all links which appear on a web page and not
11 years ago
Michael Peter Christen 35ab2cef7b added parsing of 'date', 'dc:date', 'dc.date' and 'last-modified' in
11 years ago
Michael Peter Christen 5878c1d599 - refactoring of log to ConcurrentLog:
12 years ago
Michael Peter Christen 3e1e358fdc calling pdf cache flush on class initialization because calling of the
12 years ago
Michael Peter Christen 5344a1c5f7 getting the trash out
12 years ago
Michael Peter Christen 95712fdc8b update to pdf parser
12 years ago
sixcooler 6c50d016ed pdf- and zipParser should not use forced Memory-Limits
12 years ago
Michael Peter Christen 528d6763fa - added new solr fields:
12 years ago
orbiter d9173ba7ed added more solr fields to integrate values from URIMetadataRow. All
12 years ago
orbiter 0cbda0b2b8 - replaced all length() == 0 and size() == 0 with isEmpty()
13 years ago
orbiter 78fc3cf8f8 refactoring and new usage of SentenceReader: this class appeared as one
13 years ago
Michael Peter Christen 0c345d1559 giving threads names so it's easier to see what's happening during
13 years ago
Michael Peter Christen 5fc6524ca8 - moved triple store to net.yacy.cora.lod (should be generalized there
13 years ago
Michael Peter Christen 4540174fe0 memory hacks
13 years ago
Michael Peter Christen 1f4f60654a Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
13 years ago
reger 32104360ce PDFParser - return at least first 3 pages of PDF
13 years ago
Michael Peter Christen eadb58dd87 small enhancements in pdf parser
13 years ago
reger b616de5973 PDFParser - return at least first 3 pages of PDF
13 years ago
Michael Peter Christen b7bb84c0bb set a limit to CharBuffer object size to fight against bad/too large
13 years ago
sixcooler 69570fda24 bring my master to stuff from remote
13 years ago
sixcooler f280e339a8 no force on Memory Request for these parser
13 years ago
orbiter 8a428d3e77 ensure termination of pdf parser to avoid deadlocking of other processes during search result preparation
13 years ago
orbiter 77fe69395d added jempbox-1.5.0.jar which is required by pdfbox-1.5 as stated in http://pdfbox.apache.org/dependencies.html
14 years ago
orbiter 0c1b29f3c9 - applied many small performance hacks
14 years ago
orbiter b77b8cac0c - enhanced html parser: recognized much more details in the content
14 years ago
orbiter 9b25d07295 - added geo information parsing to html parser
14 years ago
orbiter 1989ebc24b removed more warnings
14 years ago
orbiter 694fa3a2a5 - replaced more direct string-based UTF-8 conversions by predefined UTF-8 conversion
14 years ago
low012 c0274bd123 *) minor changes
14 years ago
orbiter 7bdb13bf7f more fixes to smb crawling: better file names
14 years ago
orbiter b769cce433 - added a catch-all parser for all documents that cannot be parsed: they will contribute only their document url to the search index
14 years ago
orbiter 114bdd8ba7 fixed old sitemap importer which was not able to parse urls containing post elements
14 years ago
orbiter c0b08ac59b slightly changed way of pdf parser integration
14 years ago
orbiter 5fe828fa06 - replaced pdfbox and fontbox version 1.1.0 with 1.2.1
14 years ago