Let's say we have the following table:
+--------------------+--------------------+------+------------+--------------------+
|                host|                path|status|content_size|                time|
+--------------------+--------------------+------+------------+--------------------+
|js002.cc.utsunomi...|/shuttle/resource...|   404|           0|1995-08-01 00:07:...|
|  tia1.eskimo.com   |/pub/winvn/releas...|   404|           0|1995-08-01 00:28:...|
|grimnet23.idirect...|/www/software/win...|   404|           0|1995-08-01 00:50:...|
|miriworld.its.uni...|/history/history.htm|   404|           0|1995-08-01 01:04:...|
|   ras38.srv.net    |/elv/delta/uncons...|   404|           0|1995-08-01 01:05:...|
| cs1-06.leh.ptd.net |                    |   404|           0|1995-08-01 01:17:...|
|dialip-24.athenet...|/history/apollo/a...|   404|           0|1995-08-01 01:33:...|
| h96-158.ccnet.com  |/history/apollo/a...|   404|           0|1995-08-01 01:35:...|
| h96-158.ccnet.com  |/history/apollo/a...|   404|           0|1995-08-01 01:36:...|
| h96-158.ccnet.com  |/history/apollo/a...|   404|           0|1995-08-01 01:36:...|
| h96-158.ccnet.com  |/history/apollo/a...|   404|           0|1995-08-01 01:36:...|
| h96-158.ccnet.com  |/history/apollo/a...|   404|           0|1995-08-01 01:36:...|
| h96-158.ccnet.com  |/history/apollo/a...|   404|           0|1995-08-01 01:36:...|
| h96-158.ccnet.com  |/history/apollo/a...|   404|           0|1995-08-01 01:36:...|
| h96-158.ccnet.com  |/history/apollo/a...|   404|           0|1995-08-01 01:37:...|
| h96-158.ccnet.com  |/history/apollo/a...|   404|           0|1995-08-01 01:37:...|
| h96-158.ccnet.com  |/history/apollo/a...|   404|           0|1995-08-01 01:37:...|
|hsccs_gatorbox07....|/pub/winvn/releas...|   404|           0|1995-08-01 01:44:...|
|www-b2.proxy.aol....|/pub/winvn/readme...|   404|           0|1995-08-01 01:48:...|
|www-b2.proxy.aol....|/pub/winvn/releas...|   404|           0|1995-08-01 01:48:...|
+--------------------+--------------------+------+------------+--------------------+
How can I filter this table so that it has distinct paths in PySpark? The resulting table should still contain all the columns.
If you want to keep only the rows whose values in a specific column are distinct, call the dropDuplicates
method on the DataFrame. For example:
dataframe = ...
dataframe.dropDuplicates(['path'])
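where path is the column name. The result keeps all columns, with one row per distinct path. Which of the duplicate rows is kept is not guaranteed.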