53 Commits (740cbfd875d1abd414c8ad7af4a40b84a27f93df)

Author SHA1 Message Date
Michael Peter Christen 0579a9546a changed link to new forum location
3 years ago
Michael Peter Christen 3959d43a5c fixed doku link
3 years ago
sgaebel dd9d4b1188 replace org.junit.Assert.assertThat by
4 years ago
Michael Peter Christen 64a17faca0 added debug code to parser test to investigate why this fails in travis
4 years ago
Michael Christen 3a46b07603 fixed many links to old forum, now https://searchlab.eu
6 years ago
luccioman e90405b6f0 Support parsing audio URLs without file extension
6 years ago
luccioman 3fb449b3b6 Properly resolve relative URLs against document URL in html base tags
6 years ago
luccioman 685122363d Added a parser for XZ compressed archives.
6 years ago
luccioman 2c155ece77 Fixed JUnit test after removal of unused Transformer
7 years ago
luccioman eb20589e29 Fixed issue #158 : completed div CSS class ignore in crawl
7 years ago
luccioman 5a14d34a7d Refactoring : documented and extracted autotagging processing functions.
7 years ago
luccioman 58b9834729 Added HTML microdata typed items parsing capability.
7 years ago
Michael Peter Christen 25573bd5ab added a crawl filter based on <div> tag class names
7 years ago
luccioman e2f6427a63 Added a basic JUnit test for the Visio parser (vsdParser)
7 years ago
luccioman d41ad7af6f Restore initial locale at the end of a JUnit test case which modify it.
7 years ago
luccioman e0eda84c24 Remove old hard-coded holiday dates from DateDection class.
7 years ago
luccioman 73977ec0fe Added a html parser charset detection unit test
7 years ago
luccioman 32c9dfa768 Added partial bzip2 stream parsing support and bzipParser Junit test
7 years ago
luccioman c6ae87168a Added unit tests on the gzip parser.
7 years ago
luccioman 169ffdd1c7 Finer control on max links to parse in the html parser.
7 years ago
luccioman e41d046a9d Improved parsing support for OOXML spreadsheets (.xlsx)
7 years ago
luccioman 780173008e Implemented partial stream parsing of tar archives.
7 years ago
luccioman acab6a6def Also handle text content when parsing XML within limits.
7 years ago
luccioman ed678186a8 Updated xml parser limited parsing test for use latest jdk.
7 years ago
luccioman bf55f1d6e5 Started support of partial parsing on large streamed resources.
7 years ago
luccioman 2a87b08cea Removed temporary html parser test code
7 years ago
luccioman 90a7c1affa HTML parser : removed unnecessary remaining recursive processing
7 years ago
luccioman 9b1bb2545e Refactored plain-text URLs detection implementation.
7 years ago
luccioman 8da3174867 Ensure lower case conversion consistency with any default locale.
7 years ago
luccioman 286f3018bd Made mime type and extension normalization locale independent.
7 years ago
luccioman 319231a458 Added a generic XML parser, able to parse elements text and URLs.
7 years ago
luccioman 1acb7005d0 Added a basic JUnit test with test gz files for the gzip parser
8 years ago
luccioman 1e2fb76720 Properly close test files in htmlParser unit test
8 years ago
Michael Peter Christen 6fe735945d migrated Solr 5.5 -> Solr 6.6 and from Java 1.7 -> 1.8
8 years ago
luccioman a04feac064 Ensure file input streams proper closing in both success and failures
8 years ago
luccioman d98c04853d Ensure proper closing of file input streams.
8 years ago
reger 077d062be3 Adjust mergeDocuments to keep youngest last-modified date of document
8 years ago
reger 18c7563dbe Extend DCEntry.getLanguage convert to ISO639-1 codes for more languages
8 years ago
reger 41e2ee0eca Fix call parameter for ConnectionInfo in MonitorHandler
8 years ago
reger f254fcfc67 fix htmlParser <script> text extraction on code containing expression
8 years ago
luccioman c9889991b9 Fixed 2 failing JUNit tests.
8 years ago
reger cb95b7339a include html5 <time> tag in content scraper,
8 years ago
luccioman 7717a3d43d Fixed license headers on files created to improve favicon management.
8 years ago
luccioman 6e1959f469 Merge branch 'master' of https://github.com/yacy/yacy_search_server.git
8 years ago
reger b752bcfecb adjust date in text detection to ignore some program version strings
8 years ago
reger b017e97421 optimize condenser language detection a little.
8 years ago
reger ae3717d087 adjust Tokenizer sentence count to ignore repeated punktuation (like !!!! )
8 years ago
reger 1a79c64495 generalize DateDetection with holiday date rules readily available in icu
8 years ago
reger 272cdd496a reactivate sentence counter in WordTokenizer for phrasepos ranking,
8 years ago
reger e310ec5f70 fix posInText ranking calculation to score 0 on no position info
8 years ago