The performance of PL/R window functions is fairly poor. By building the "windows" by hand and passing them to R as vectors, I can get much better performance. Would it be possible for PL/R to use a similar approach to constructing windows?
The two queries below give the same results, but the first takes 37 seconds and the second 6 seconds. The performance of the second is even better (about 3 seconds) if I eliminate the extra array_agg/unnest steps I'm taking to be able to keep track of the data.
Query 1:
Here is a window function:
CREATE OR REPLACE FUNCTION public.winsorize(
    double precision,
    double precision)
RETURNS double precision AS
$BODY$
    library(psych)
    return(winsor(as.vector(farg1), arg2)[prownum])
$BODY$ LANGUAGE plr WINDOW;
Here is a query using this window function:
WITH raw_data AS (
SELECT gvkey, datadate, date_part('year', datadate) AS year,
CASE WHEN lag(ceq) OVER w >0 THEN ni/lag(ceq) OVER w END AS roe
FROM comp.funda
WINDOW w AS (PARTITION BY gvkey ORDER BY datadate)
ORDER BY gvkey, datadate
LIMIT 100000)
SELECT year, gvkey, datadate, roe,
winsorize(roe, 0.05) OVER (PARTITION BY year) AS roe_w
FROM raw_data
WHERE roe IS NOT NULL
ORDER BY year, roe;
Query 2:
Here is an alternative version of the query above in which the "windows" are constructed by hand.
WITH raw_data AS (
SELECT gvkey, datadate, date_part('year', datadate) AS year,
CASE WHEN lag(ceq) OVER w >0 THEN ni/lag(ceq) OVER w END AS roe
FROM comp.funda
WINDOW w AS (PARTITION BY gvkey ORDER BY datadate)
ORDER BY gvkey, datadate
LIMIT 100000),
intermediate AS (
SELECT year,
array_agg(gvkey ORDER BY roe) AS gvkeys,
array_agg(datadate ORDER BY roe) AS datadates,
array_agg(roe ORDER BY roe) AS roes,
winsorize_vec(array_agg(roe ORDER BY roe), 0.05) AS roe_ws
FROM raw_data
WHERE roe IS NOT NULL
GROUP BY year)
SELECT year, unnest(gvkeys) AS gvkey,
unnest(datadates) AS datadate,
unnest(roes) AS roe,
unnest(roe_ws) AS roe_w
FROM intermediate
ORDER BY year, roe;
The query is using this function:
CREATE OR REPLACE FUNCTION public.winsorize_vec(
    float8[],
    double precision)
RETURNS float8[] AS
$BODY$
    library(psych)
    return(winsor(as.vector(arg1), arg2))
$BODY$ LANGUAGE plr;
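As a possible variation on Query 2, the four parallel unnest() calls in the SELECT list could be replaced by a single multi-argument unnest() in the FROM clause, which expands the arrays in lockstep. This is only a sketch, not something I have timed: it assumes PostgreSQL 9.4 or later (where multi-argument unnest is available) and reuses the winsorize_vec function above. The alias u and its column names are just illustrative.

```sql
-- Sketch: same hand-built windows as Query 2, but the arrays are expanded
-- with one multi-argument unnest() in FROM (assumes PostgreSQL 9.4+).
WITH raw_data AS (
    SELECT gvkey, datadate, date_part('year', datadate) AS year,
           CASE WHEN lag(ceq) OVER w > 0 THEN ni / lag(ceq) OVER w END AS roe
    FROM comp.funda
    WINDOW w AS (PARTITION BY gvkey ORDER BY datadate)
    ORDER BY gvkey, datadate
    LIMIT 100000),
intermediate AS (
    -- One row per year; each array holds that year's values ordered by roe.
    SELECT year,
           array_agg(gvkey ORDER BY roe) AS gvkeys,
           array_agg(datadate ORDER BY roe) AS datadates,
           array_agg(roe ORDER BY roe) AS roes,
           winsorize_vec(array_agg(roe ORDER BY roe), 0.05) AS roe_ws
    FROM raw_data
    WHERE roe IS NOT NULL
    GROUP BY year)
SELECT i.year, u.gvkey, u.datadate, u.roe, u.roe_w
FROM intermediate AS i
-- Element n of every array ends up in the same output row.
CROSS JOIN LATERAL unnest(i.gvkeys, i.datadates, i.roes, i.roe_ws)
     AS u(gvkey, datadate, roe, roe_w)
ORDER BY i.year, u.roe;
```

This should return the same rows as Query 2; whether it changes the timing at all is something I would have to measure.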
One idea I had was to create an SQL function like this:
CREATE OR REPLACE FUNCTION public.winsorize_sql(
double precision,
double precision)
RETURNS double precision AS
$BODY$
SELECT unnest(winsorize_vec(array_agg($1), $2))
$BODY$ LANGUAGE sql WINDOW;
This compiles and I can use it in a query, but I get NULL as the return value every time.