sql - Redshift SELECT DISTINCT returns repeated values
I have a database where each object property is stored in a separate row. The attached query does not return distinct values in a Redshift database, but works as expected when testing in a MySQL-compatible database.
SELECT DISTINCT distinct_value
FROM (
    SELECT uri,
           (SELECT DISTINCT value_string
            FROM `test_organization__app__testsegment` AS x
            WHERE x.uri = parent.uri
              AND name = 'hasteststring'
              AND parent.value_string IS NOT NULL) AS distinct_value
    FROM `test_organization__app__testsegment` AS parent
    WHERE uri IN (SELECT uri
                  FROM `test_organization__app__testsegment`
                  WHERE name = 'types'
                    AND value_uri_multivalue = 'document')
) AS t
WHERE distinct_value IS NOT NULL
ORDER BY distinct_value ASC
LIMIT 10000 OFFSET 0
This is not a bug. The behavior is intentional, though not straightforward.
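In Redshift, you can declare constraints on tables, but Redshift doesn't enforce them, i.e. it allows duplicate values if you insert them. The difference here is that when you run a SELECT DISTINCT query against a column that doesn't have a primary key declared, Redshift scans the whole column and collects the unique values, whereas if you run the same query on a column that has a primary key constraint, it returns the output without that deduplication step, because it trusts the declared key to be unique. This is how you can get duplicate entries back if you have inserted them.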
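To make this concrete, here is a minimal sketch of the behavior described above. The table `demo_events` and its columns are hypothetical names made up for illustration, and the exact result of the last query depends on the plan Redshift chooses, so treat the comments as what you may see rather than guaranteed output.

```sql
-- Hypothetical table: the PRIMARY KEY is declared but Redshift treats it as informational only.
CREATE TABLE demo_events (
    id    INT PRIMARY KEY,
    label VARCHAR(32)
);

-- Both rows are accepted even though they share the same "primary key" value.
INSERT INTO demo_events (id, label) VALUES (1, 'a'), (1, 'a');

-- The planner may rely on the declared key being unique and skip deduplication,
-- so this query can return the value 1 more than once.
SELECT DISTINCT id FROM demo_events;
```

If the constraint is not declared, or the data really is unique, the same SELECT DISTINCT behaves the way you would expect from MySQL.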
Why is it done this way? Redshift is optimized for large datasets, and it's faster to copy data if you don't need to check constraint validity for every row that you copy or insert. If you want, you can declare a primary key constraint as part of your data model, but you then need to support it explicitly, either by removing duplicates or by designing your ETL in such a way that no duplicates are loaded.
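One way to remove duplicates explicitly is to rebuild the table and keep a single row per key. This is only a sketch under assumptions: `demo_events` (with columns `id` and `label`) and `demo_events_clean` are hypothetical names, and which of the duplicate rows survives is decided by the ORDER BY inside the window, which you would adapt to your own data.

```sql
-- Assumes a hypothetical source table demo_events(id, label) that may contain
-- several rows with the same id despite a declared primary key.
CREATE TABLE demo_events_clean AS
SELECT id, label
FROM (
    SELECT id,
           label,
           -- Number the rows within each id; the ORDER BY picks which one is kept.
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY label) AS rn
    FROM demo_events
) ranked
WHERE rn = 1;
```

This uses ROW_NUMBER rather than SELECT DISTINCT, so it does not rely on the same deduplication step that the declared key can cause Redshift to skip. After checking the result you could swap the clean table in for the original as part of your ETL.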
More information with specific examples is in the Heap blog post "Redshift Pitfalls and How to Avoid Them".