Skip to main content
Version: 2.2

Log-Scale IQR-Based Outlier Detection

Summary:

We use validated rates

to estimate outlier boundaries using the IQR method on the log scale, then use these boundaries on non-validated rates to flag outliers.

We define outlier boundaries in the following way:

(lowerc,upperc)={(Q11.5IQR,Q3+1.5IQR),if nc>40(medicare×10%,medicare×1000%),if nc<40(\text{lower}_c, \text{upper}_c) = \left\{ \begin{array}{ll} \left(Q_1 - 1.5 \cdot IQR'', Q_3 + 1.5 \cdot IQR''\right), & \text{if } n_c > 40 \\ \left(\text{medicare} \times 10\%, \text{medicare} \times 1000\%\right), & \text{if } n_c < 40 \end{array} \right.

where IQRIQR'' is a truncated IQR and ncn_c is the number of validated rates for a billing code, cc.

Validated rates are always included. Non-validated rates falling outside of the boundaries are considered outliers and excluded.

Methodology

Background:

Traditional IQR-based outlier detection calculates the Interquartile Range (IQR=Q3Q1IQR = Q3 - Q1), where Q3 is the 75th percentile and Q1 is the 25th percentile. Then, data are considered to be outliers if greater than Q3+1.5IQRQ3 + 1.5 * IQR or less than Q11.5IQRQ1 - 1.5 * IQR.

This IQR-based method faces challenges with negotiated rates, because negotiated rates often follow a right-skewed distribution with a long tail, dense central mass, and are bounded below by 0. Using the traditional IQR-based approach would flag too many outliers due to a thin IQR. Further, the traditional approach often yields negative outlier thresholds (e.g. Q11.5IQR<0Q1 - 1.5 * IQR < 0), which fails to capture any outliers on the low end.

Solution:

A common solution is to apply a log transformation to reduce the right-skew, making the data more closely resemble a normal distribution.

Log Transform the Data:

X={log(x1),log(x2),...,log(xn)}X' = \{\log(x_1), \log(x_2), ..., \log(x_n)\}

Q1=25th percentile of XQ1' = \text{25th percentile of } X'
Q3=75th percentile of XQ3' = \text{75th percentile of } X'

Compute the interquartile range:

IQR=Q3Q1IQR' = Q3' - Q1'

Identify Outliers:

A rate xix_i' in log-transformed space is considered an outlier if:

xi<Q1k×IQRorxi>Q3+k×IQRx_i' < Q1' - k \times IQR' \quad \text{or} \quad x_i' > Q3' + k \times IQR'

We'll evaluate k[1.5,2,2.5]k \in [1.5, 2, 2.5]

See SQL
CREATE TABLE hive.cld_utils.outlier_bounds_2025_03_13 AS
WITH
log_transformed AS (
SELECT
billing_code_type,
billing_code,
COALESCE(CAST(asp.payment_limit AS DOUBLE), medicare_rate) as medicare_rate,
ln(canonical_rate) as log_canonical_rate
FROM hive.cld_v0_1.prod_combined_all c
LEFT JOIN hive.plancheck_hospital_rates.asp_standardized_2025_01 asp
ON c.billing_code = asp.code
WHERE
canonical_rate_score = 13
AND canonical_rate > 0.1
AND
(
(
substring(billing_code, 1, 1) IN ('J','Q')
AND
canonical_rate_type like '%drug_dosage_std%'
)
OR
substring(billing_code, 1, 1) NOT IN ('J','Q')
)
),
iqr AS (
SELECT
billing_code_type,
billing_code,
AVG(medicare_rate) as avg_medicare_rate,
APPROX_PERCENTILE(log_canonical_rate, 0.25) as q1,
APPROX_PERCENTILE(log_canonical_rate, 0.75) as q3,
CAST(
APPROX_PERCENTILE(log_canonical_rate, 0.75)
- APPROX_PERCENTILE(log_canonical_rate, 0.25)
AS DECIMAL(8,5)
) as iqr,
count(*) as n
FROM log_transformed
GROUP BY 1,2
),
iqr_truncated AS (
SELECT
*,
CASE
WHEN iqr > 1 THEN 1
ELSE iqr
END as iqr_truncated
FROM iqr
)
SELECT
billing_code_type,
billing_code,
avg_medicare_rate,
q1,
q3,
iqr,
iqr_truncated,
CASE
WHEN n < 40
THEN avg_medicare_rate * 0.1
ELSE exp(q1 - 1.5 * iqr_truncated)
END as lower_bound,
CASE
WHEN n < 40
THEN avg_medicare_rate * 10
ELSE exp(q3 + 1.5 * iqr_truncated)
END as upper_bound,
n
FROM iqr_truncated
WHERE
(
avg_medicare_rate IS NOT NULL
OR
n > 40
)

Plot Example

In the plots below, we show the distribution of validated rates for CPT code 31755 in the top row and non-validated rates for CPT code 31755 in the bottom row. The left column shows the distributions on the linear scale, whereas the right column shows the distributions on the logarithmic scale.

In all four plots, the horizontal dotted lines indicate outlier thresholds, computed using validated rates, on the log scale (e.g. Q3+kIQRQ3' + k \cdot IQR'). For the plots on the left column, these log scale thresholds are converted back to the linear scale (e.g. eQ3+kIQRe^{Q3' + k \cdot IQR'})


31755

alt text

Exceptions

Exception 1: Truncating IQR

Sometimes, the spread of validated rates can be too wide. Consider code 75605 where validated rates may be as high as $60k and as low as $55. The IQR is 3.42 and q3 is 9.21, then e(9.21+1.53.42)=e^{(9.21 + 1.5*3.42)} = $1,689,595. To prevent such an extreme upper bound, we truncate the IQR to 1.

IQR=min(IQR,1)IQR'' = min(IQR', 1)

This is an aggressive edit but only affects where the rate variability is unbelievably high.

Here is a demonstration of truncation:

image

Here is the distribution of IQRs (blue = untruncated, orange = truncated):

image

Exception 2: Low Ns

If there are not enough validated rates for a given code, we cannot reliably estimate outlier bounds using this methodology. In this case, defer to medicare based benchmarks (10% of medicare, 1000% of medicare).

More Examples