From 5acd98f4da615428eb2c03f34fab25d9484516f5 Mon Sep 17 00:00:00 2001
From: Michael Peter Christen
Date: Fri, 13 Jan 2023 17:20:18 +0100
Subject: [PATCH] introduction of tag-to-indexing relation TagValency

---
 htroot/Crawler_p.html | 86 +--
 .../net/yacy/crawler/data/CrawlProfile.java | 282 +++----
 source/net/yacy/data/BookmarkHelper.java | 13 +-
 .../document/parser/html/ContentScraper.java | 692 +++++++++---------
 .../yacy/document/parser/html/Scraper.java | 54 +-
 .../parser/html/ScraperInputStream.java | 5 +-
 .../yacy/document/parser/html/TagValency.java | 30 +
 .../parser/html/TransformerWriter.java | 20 +-
 .../net/yacy/document/parser/htmlParser.java | 21 +-
 source/net/yacy/htroot/Crawler_p.java | 18 +-
 10 files changed, 651 insertions(+), 570 deletions(-)
 create mode 100644 source/net/yacy/document/parser/html/TagValency.java

diff --git a/htroot/Crawler_p.html b/htroot/Crawler_p.html
index 4e0b868aa..3d0b8fcda 100644
--- a/htroot/Crawler_p.html
+++ b/htroot/Crawler_p.html
@@ -28,13 +28,13 @@
Queues
-
+
-
+
@@ -89,13 +89,13 @@
Index Size
Queue
 
Size
 
Local Crawler
-
+
-
+
@@ -124,12 +124,12 @@ Progress
Database
 
Entries
 
Seg-
ments
Documents
solr search api
-
+
-
+
@@ -147,7 +147,7 @@
@@ -180,13 +180,13 @@
@@ -219,23 +219,23 @@ window.setInterval("setTableSize()", 1000);
If you crawl any un-wanted pages, you can delete them here.
::
No embedded local Solr index is connected. This is required to use a Solr query filter.
- You can configure this with the Index Sources & targets page.::
-
- The Solr filter query syntax is not valid : #[solrQuery]#::
-
- Could not parse the Solr filter query : #[solrQuery]#
+ You can configure this with the Index Sources & targets page.::
+
+ The Solr filter query syntax is not valid : #[solrQuery]#::
+
+ Could not parse the Solr filter query : #[solrQuery]#
#(/info)#

#(wontReceiptRemoteResults)#::
-

You asked for remote indexing, but remote crawl results won't be added to the local index as the remote crawler is currently disabled on this peer.

-

You can activate it in the Remote Crawl Configuration page.

+

You asked for remote indexing, but remote crawl results won't be added to the local index as the remote crawler is currently disabled on this peer.

+

You can activate it in the Remote Crawl Configuration page.

#(/wontReceiptRemoteResults)#
Indicator
 
Level
 
Speed / PPM
(Pages Per Minute)
Crawler PPM     - +