HARPA AI | Reddit Thread Extraction

HARPA.AI

Extracts key info from Reddit posts and comments while preserving thread structure. Use this command on a Reddit thread page. #extraction

Created by Adrian Larsson

Updated on Nov 9, 2024 04:32

Installed 114 times

RUNS JS CODE

How to Use

Get HARPA AI from https://get.harpa.ai
Open HARPA AI with Alt+A on this page
Click IMPORT COMMAND

IMPORT COMMAND

Content

- type: say
  message: >-
    📌 Use this command on a Reddit thread page. You can use this command or JS
    code as a base for creating other commands or automations.
- type: ask
  param: targetMessageCount
  message: How many comments would you like to extract?
  options:
    - label: 50 comments
      value: 50
    - label: 100 comments
      value: 100
    - label: 200 comments
      value: 200
    - $custom
  default: ''
  vision:
    enabled: false
    mode: area
    hint: ''
    send: true
  optionsInvalid: false
- type: js
  args: targetMessageCount
  code: |2-
      async function scrollAndCollectMessages (targetMessageCount) {
        const config = {
          commentTree: '#comment-tree',
          comment: 'shreddit-comment',
          author: '.font-bold',
          timestamp: 'faceplate-timeago',
          content: '[slot="comment"]',
          pageTitle: 'h1',
          postAuthor: '.author-name',
          postContent: '[slot="text-body"]',
        }

        async function scrollAndCollectComments () {
          const comments = []
          const uniqueComments = new Set()
          let retries = 0
          const maxRetries = 5

          async function scrollAndWait () {
            window.scrollTo(0, document.body.scrollHeight)
            await new Promise(resolve => setTimeout(resolve, 200))
          }

          // Scroll back to top function
          function scrollToTop() {
            window.scrollTo(0, 0)
          }

          function extractCurrentComments () {
            const commentTree = document.querySelector(config.commentTree)
            const commentElements = commentTree.querySelectorAll(config.comment)
            let newCommentsFound = false

            commentElements.forEach(comment => {
              const author =
                comment.querySelector(config.author)?.innerText || 'Unknown'
              const time = comment.querySelector(config.timestamp)?.innerText || ''
              const content = comment.querySelector(config.content)?.innerText || ''
              const depth = parseInt(comment.getAttribute('depth')) || 0

              const commentId = `${time}-${author}-${content}`

              if (!uniqueComments.has(commentId) && author && time && content) {
                const prefix = ' → '.repeat(depth)
                const formattedComment = `${prefix}${time}. ${author}: ${content}`

                uniqueComments.add(commentId)
                comments.push(formattedComment)
                newCommentsFound = true
              }
            })

            return newCommentsFound
          }

          while (comments.length < targetMessageCount && retries < maxRetries) {
            const newCommentsFound = extractCurrentComments()

            if (comments.length < targetMessageCount) {
              await scrollAndWait()
              if (!newCommentsFound) {
                retries++
              } else {
                retries = 0
              }
            }
          }

          if (comments.length > targetMessageCount) {
            comments.splice(targetMessageCount)
          }

          // Scroll back to top after collecting comments
          scrollToTop()
          
          return comments
        }

        try {
          const result = {
            post: {
              title: document.querySelector(config.pageTitle)?.innerText || '',
              author: document.querySelector(config.postAuthor)?.innerText || '',
              content: document.querySelector(config.postContent)?.innerText || '',
            },
            comments: await scrollAndCollectComments(),
            url: window.location.href
          }

          return result
        } catch (error) {
          return null
        }
      }

      return scrollAndCollectMessages(targetMessageCount)
  param: array
  timeout: 15000
  silent: true
- type: calc
  func: extract-json
  to: array
  param: array
  index: all
- type: say
  message: >-
    **Post:** [{{title}}]({{url}})

    **Author:**
    [{{array.post.author}}](https://www.reddit.com/user/{{array.post.author}})


    **Data Array:**


    {{array}}

Notice: Please read before using

This automation command is created by a community member. HARPA AI team does not audit community commands.

Please review the command carefully and only install if you trust the creator.

Designed and engineered in Finland 🇫🇮

Home Use Cases Guides Privacy Policy Terms of Service

CAN WE STORE COOKIES?

Our website uses cookies for the purposes of accessibility and security. They also allow us to gather statistics in order to improve the website for you. More info: Privacy Policy

🧩 Reddit Thread Extraction

Extracts key info from Reddit posts and comments while preserving thread structure. Use this command on a Reddit thread page. #extraction

How to Use

Content