Skip to content

Window function performance seems lacking. #28

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
iangow opened this issue Apr 3, 2017 · 0 comments
Open

Window function performance seems lacking. #28

iangow opened this issue Apr 3, 2017 · 0 comments

Comments

@iangow
Copy link

iangow commented Apr 3, 2017

The performance of PL/R window functions is fairly poor. By building the "windows" by hand and pass them to R as vectors, I can get much better performance. Would it be possible for PL/R to use a similar approach to constructing windows?

The two queries below give the same results, but the first takes 37 seconds and the second 6 seconds. The performance of the second is even better (about 3 seconds) if I eliminate the extra array_agg/unnest steps I'm taking to be able to keep track of the data.

Query 1:

Here is a window function

CREATE OR REPLACE FUNCTION public.winsorize(
    double precision,
    double precision)
  RETURNS double precision AS
$BODY$
	library(psych)
	return(winsor(as.vector(farg1), arg2)[prownum])
$BODY$ LANGUAGE plr WINDOW ;

Here is a query using this window function:

WITH raw_data AS (
	SELECT gvkey, datadate, date_part('year', datadate) AS year,
	    CASE WHEN lag(ceq) OVER w > 0 THEN ni/lag(ceq) OVER w END AS roe
	FROM comp.funda
	WINDOW w AS (PARTITION BY gvkey ORDER BY datadate)
	ORDER BY gvkey, datadate
	LIMIT 100000)

SELECT year, gvkey, datadate, roe,
    winsorize(roe, 0.05) OVER (PARTITION BY year) AS roe_w
FROM raw_data
WHERE roe IS NOT NULL
ORDER BY year, roe;

Query 2:
Here is an alternative version of the query above in which the "windows" are constructed by hand.

WITH raw_data AS (
	SELECT gvkey, datadate, date_part('year', datadate) AS year,
	    CASE WHEN lag(ceq) OVER w > 0 THEN ni/lag(ceq) OVER w END AS roe
	FROM comp.funda
	WINDOW w AS (PARTITION BY gvkey ORDER BY datadate)
	ORDER BY gvkey, datadate
	LIMIT 100000),

intermediate AS (
    SELECT year, 
	array_agg(gvkey ORDER BY roe) AS gvkeys,
	array_agg(datadate ORDER BY roe) AS datadates,
	array_agg(roe ORDER BY roe) AS roes,
	winsorize_vec(array_agg(roe ORDER BY roe), 0.05) AS roe_ws
    FROM raw_data
    WHERE roe IS NOT NULL
    GROUP BY year)

SELECT year, unnest(gvkeys) AS gvkey,
    unnest(datadates) AS datadate,
    unnest(roes) AS roe,
    unnest(roe_ws) AS roe_w
FROM intermediate
ORDER BY year, roe;

The query is using this function:

CREATE OR REPLACE FUNCTION public.winsorize_vec(
    float8[],
    double precision)
  RETURNS float8[] AS
$BODY$
	library(psych)
	return(winsor(as.vector(arg1), arg2))
$BODY$ LANGUAGE plr; 

One idea I had was to create an SQL function like this:

CREATE OR REPLACE FUNCTION public.winsorize_sql(
    double precision,
    double precision)
  RETURNS double precision AS
$BODY$
     SELECT unnest(winsorize_vec(array_agg($1), $2))
$BODY$ LANGUAGE sql WINDOW 

This compiles and I can use it in a query, but I get NULL as the return value every time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant