fixed indexing of external posts (#2983)

niebles · web-flow · commit b50db2e7133b · 2025-01-26T18:35:13.000-08:00
This should fix several issues with indexing external posts, including #1828. In short, I found that the issue with indexing was that the index builder was receiving 'empty' documents. To fix that, I'm setting the document content to be the post content as retrieved from the rss feed or the text extracted from the external page. I've tested with various blog sources and it seems to be working as expected now.
diff --git a/_plugins/external-posts.rb b/_plugins/external-posts.rb
@@ -62,6 +62,7 @@ def create_document(site, source_name, url, content)
       doc.data['description'] = content[:summary]
       doc.data['date'] = content[:published]
       doc.data['redirect'] = url
+      doc.content = content[:content]
       site.collections['posts'].docs << doc
     end
 
@@ -90,8 +91,12 @@ def fetch_content_from_url(url)
       parsed_html = Nokogiri::HTML(html)
 
       title = parsed_html.at('head title')&.text.strip || ''
-      description = parsed_html.at('head meta[name="description"]')&.attr('content') || ''
-      body_content = parsed_html.at('body')&.inner_html || ''
+      description = parsed_html.at('head meta[name="description"]')&.attr('content')
+      description ||= parsed_html.at('head meta[name="og:description"]')&.attr('content')
+      description ||= parsed_html.at('head meta[property="og:description"]')&.attr('content')
+
+      body_content = parsed_html.search('p').map { |e| e.text }
+      body_content = body_content.join() || ''
 
       {
         title: title,