Text preprocessing for summarization Innovation Release

When working with real-world text data like CRM notes, emails, or documents, preprocessing can significantly improve summarization quality.

The following utility functions are written in pure PL/pgSQL. They allow you to clean text and extract high-value content before summarizing, and they require no external dependencies beyond AI Accelerator.

clean_text

The clean_text function removes general noise from text that often dilutes summarization quality. It handles:

  • Markdown formatting: **bold**, # headers, and lists.
  • Common filler words and phrases: "basically", "just wanted to", "at the end of the day".
  • Extra whitespace and empty lines.
CREATE OR REPLACE FUNCTION clean_text(input_text TEXT)
RETURNS TEXT
LANGUAGE plpgsql IMMUTABLE STRICT AS $$
DECLARE
    result TEXT;
BEGIN
    result := input_text;

    -- Remove markdown formatting: **bold**, __underline__, # headers, --- separators
    result := regexp_replace(result, '\*{1,3}', '', 'g');
    result := regexp_replace(result, '_{1,3}', '', 'g');
    result := regexp_replace(result, '^#{1,6}\s*', '', 'gm');
    result := regexp_replace(result, '^-{3,}$', '', 'gm');
    result := regexp_replace(result, '^\s*[-*+]\s+', '', 'gm');

    -- Remove numbered list prefixes (1. 2. etc) but keep the content
    result := regexp_replace(result, '^\s*\d+\.\s+', '', 'gm');

    -- Remove common filler phrases (use \y for word boundary in PostgreSQL)
    result := regexp_replace(result, '\y(just wanted to|I just wanted to|wanted to follow up|as discussed|per our conversation|as mentioned|going forward|at the end of the day|in terms of|with respect to|in regards to|please find attached|hope this helps|let me know if you have any questions|looking forward to hearing from you)\y', '', 'gi');

    -- Remove common filler words (use \y for word boundary in PostgreSQL)
    result := regexp_replace(result, '\y(basically|essentially|actually|literally|honestly|frankly|obviously|clearly|simply|really|very|quite|rather|pretty much|kind of|sort of|in order to|due to the fact that|at this point in time|for all intents and purposes)\y', '', 'gi');

    -- Lowercase
    result := LOWER(result);

    -- Collapse multiple spaces/tabs to single space
    result := regexp_replace(result, '[ \t]+', ' ', 'g');

    -- Collapse multiple newlines to single newline
    result := regexp_replace(result, '\n\s*\n+', E'\n', 'g');

    -- Remove leading/trailing whitespace per line
    result := regexp_replace(result, '^\s+', '', 'gm');
    result := regexp_replace(result, '\s+$', '', 'gm');

    -- Remove empty lines
    result := regexp_replace(result, '^\s*$\n?', '', 'gm');

    -- Final trim
    result := TRIM(result);

    RETURN result;
END;
$$;

COMMENT ON FUNCTION clean_text(TEXT) IS
'Removes markdown, filler words, and extra whitespace from text. Returns cleaned lowercase text.';

Usage

SELECT clean_text('**Meeting Notes**

Just wanted to follow up on our discussion. Basically, the customer is interested in the product.');
Output
                                clean_text
---------------------------------------------------------------------------
 meeting notes                                                            +
 follow up on our discussion. , the customer is interested in the product.
(1 row)

clean_crm_boilerplate

The clean_crm_boilerplate function removes common CRM-specific boilerplate phrases that add no value to summarization. This function is separated from clean_text so you can customize the regex patterns for your organization's specific patterns.

CREATE OR REPLACE FUNCTION clean_crm_boilerplate(input_text TEXT)
RETURNS TEXT
LANGUAGE plpgsql IMMUTABLE STRICT AS $$
DECLARE
    result TEXT;
BEGIN
    result := input_text;

    -- Remove boilerplate CRM patterns (use \y for word boundary in PostgreSQL)
    result := regexp_replace(result, '\yNo update[s]?\y\.?', '', 'gi');
    result := regexp_replace(result, '\yCalendar invite sent[.]?\y', '', 'gi');
    result := regexp_replace(result, '\ySent (proposal|case study|documentation|overview|pricing) (documentation |via email|as requested)?\.?\y', '', 'gi');
    result := regexp_replace(result, '\yWaiting (on|for) callback\.?\y', '', 'gi');
    result := regexp_replace(result, '\yUpdated CRM with latest info\.?\y', '', 'gi');
    result := regexp_replace(result, '\yMeeting confirmed for next week\.?\y', '', 'gi');
    result := regexp_replace(result, '\yFollowing standard sales process\.?\y', '', 'gi');
    result := regexp_replace(result, '\yMeeting went as expected\.?\y', '', 'gi');

    -- Remove empty lines created by removals
    result := regexp_replace(result, '^\s*$\n?', '', 'gm');

    RETURN TRIM(result);
END;
$$;

COMMENT ON FUNCTION clean_crm_boilerplate(TEXT) IS
'Removes common CRM boilerplate phrases like "No updates", "Calendar invite sent", etc. Customize this function for your organization.';

Usage

SELECT clean_crm_boilerplate('Customer is interested in the product.

No updates.

Calendar invite sent.

Will follow up next week.');
Output
         clean_crm_boilerplate
----------------------------------------
 Customer is interested in the product.+
 .                                     +
 Will follow up next week.
(1 row)

extract_sentences

The extract_sentences function performs extractive summarization by selecting the most relevant sentences from text based on keyword matching. It scores sentences by:

  • Keyword overlap with the provided search terms.
  • Sentence length (bonus for substantive sentences over 200 characters).
  • Presence of numbers (bonus for data-rich content).
  • Causal/analytical language (bonus for words like "because", "reason", "issue", "competitor", "pricing").
CREATE OR REPLACE FUNCTION extract_sentences(
    input_text TEXT,
    keywords TEXT,
    num_sentences INTEGER DEFAULT 10
)
RETURNS TEXT
LANGUAGE plpgsql IMMUTABLE STRICT AS $$
DECLARE
    keyword_arr TEXT[];
    sentences TEXT[];
BEGIN
    IF input_text IS NULL OR input_text = '' THEN
        RETURN '';
    END IF;

    -- Parse keywords: lowercase, split on spaces, remove short words
    keyword_arr := ARRAY(
        SELECT DISTINCT LOWER(TRIM(word))
        FROM unnest(string_to_array(LOWER(keywords), ' ')) AS word
        WHERE LENGTH(TRIM(word)) > 2
    );

    IF array_length(keyword_arr, 1) IS NULL THEN
        RETURN LEFT(input_text, 2000);
    END IF;

    -- Split text into sentences on . ? ! and newlines
    -- Keep sentences that are meaningful (> 20 chars)
    sentences := ARRAY(
        SELECT TRIM(s)
        FROM unnest(
            regexp_split_to_array(
                regexp_replace(input_text, E'\n+', '. ', 'g'),
                E'(?<=[.!?])\\s+'
            )
        ) AS s
        WHERE LENGTH(TRIM(s)) > 20
    );

    IF array_length(sentences, 1) IS NULL THEN
        RETURN input_text;
    END IF;

    -- Score each sentence and collect top N
    RETURN (
        SELECT STRING_AGG(scored_sentence, E'\n' ORDER BY rank_score DESC)
        FROM (
            SELECT
                s AS scored_sentence,
                (
                    SELECT COUNT(*)::NUMERIC
                    FROM unnest(keyword_arr) AS keyword
                    WHERE LOWER(s) LIKE '%' || keyword || '%'
                )
                -- Bonus for longer, more substantive sentences
                + CASE WHEN LENGTH(s) > 200 THEN 0.5 ELSE 0 END
                -- Bonus for sentences with numbers (data-rich)
                + CASE WHEN s ~ '\d+' THEN 0.3 ELSE 0 END
                -- Bonus for causal/analytical language
                + CASE WHEN LOWER(s) ~ '(because|reason|due to|caused|result|impact|issue|problem|concern|risk|challenge|blocker|gap|lack|missing|competitor|pricing|budget|cost|expensive|cheaper|alternative|decided|chose|prefer|switched|rejected|declined)' THEN 1.0 ELSE 0 END
                AS rank_score
            FROM unnest(sentences) AS s
            ORDER BY rank_score DESC
            LIMIT num_sentences
        ) ranked
    );
END;
$$;

COMMENT ON FUNCTION extract_sentences(TEXT, TEXT, INTEGER) IS
'Extracts the N most relevant sentences from a text block based on keyword matching. Uses keyword overlap scoring with bonuses for substantive, data-rich, and causal/analytical language.';

Usage

SELECT extract_sentences(
    'The customer loved the demo. They mentioned pricing concerns. The competitor offers a cheaper alternative. We discussed the technical requirements. The meeting went well overall.',
    'pricing competitor cost',
    2
);
Output
              extract_sentences
----------------------------------------------
 They mentioned pricing concerns.            +
 The competitor offers a cheaper alternative.
(1 row)

extract_relevant

The extract_relevant function combines text cleaning and sentence extraction into a single call. It first cleans the input text (including CRM boilerplate), then extracts the most relevant sentences.

CREATE OR REPLACE FUNCTION extract_relevant(
    input_text TEXT,
    prompt TEXT,
    num_sentences INTEGER DEFAULT 10
)
RETURNS TEXT
LANGUAGE plpgsql IMMUTABLE STRICT AS $$
BEGIN
    RETURN extract_sentences(
        clean_text(clean_crm_boilerplate(input_text)),
        prompt,
        num_sentences
    );
END;
$$;

COMMENT ON FUNCTION extract_relevant(TEXT, TEXT, INTEGER) IS
'Cleans text (removes markdown, filler, CRM boilerplate) then extracts the N most relevant sentences based on keyword matching against the prompt.';

Example: Analyzing lost deals with preprocessing

This example shows how to use extract_relevant with summarize_text_aggregate to analyze why deals were lost. The preprocessing extracts only the most relevant sentences about lost deals before summarizing.

-- Create a model for summarization
SELECT aidb.create_model(
    'nvidia_llama'::text,
    'nim_completions'::character varying,
    config => '{"model": "nvidia/llama-3.3-nemotron-super-49b-v1.5"}'::JSONB,
    credentials => jsonb_build_object('api_key', '<your-api-key>')::JSONB,
    replace_credentials => true
);

-- Analyze lost deals by product
SELECT p.product_name,
       COUNT(*) AS lost_deals,
       aidb.summarize_text_aggregate(
           extract_relevant(
               sn.note_text,
               'lost deal reason competitor pricing budget cost declined rejected',
               2
           ),
           aidb.summarize_text_config(
               'nvidia_llama',
               prompt => 'Analyze these sales notes for lost deals. Identify the top 3 reasons we lost. Be specific and actionable. Keep your response under 250 words. Format as a numbered list.'
           )::json ORDER BY sn.created_at
       ) AS loss_insights
FROM sales_orders so
JOIN products p ON so.product_id = p.product_id
JOIN sales_notes sn ON sn.order_id = so.order_id
WHERE so.status = 'lost'
  AND so.order_date >= '2025-11-01'
  AND so.order_date <= '2026-02-01'
GROUP BY p.product_name
ORDER BY lost_deals DESC;

The extract_relevant function:

  1. Cleans each note (removes filler words, markdown, boilerplate).
  2. Extracts the two most relevant sentences based on keywords like "lost", "competitor", "pricing", "budget".

This preprocessed text is then aggregated and summarized, producing more focused insights than summarizing the raw notes directly.

Customizing the functions

These functions are designed as modular starting points. You can adapt them to your specific domain or organizational needs:

  • clean_text: Add or modify the filler word patterns to match your communication style.
  • clean_crm_boilerplate: Add your organization's specific CRM templates and boilerplate phrases.
  • extract_sentences: Adjust the bonus keywords in the scoring section to prioritize different types of content, such as technical terms, specific product names, or sentiment indicators.
  • extract_relevant: Change the default num_sentences based on your typical text length and summarization needs.