Merge pull request #10 from Arkoniak/dev
Custom parsers and keyword arguments
Arkoniak authored Apr 18, 2020
2 parents 917aae5 + 998046f commit cc516a3
Showing 6 changed files with 206 additions and 91 deletions.
5 changes: 3 additions & 2 deletions Project.toml
@@ -1,7 +1,7 @@
name = "UrlDownload"
uuid = "856ac37a-3032-4c1c-9122-f86d88358c8b"
authors = ["Andrey Oskin"]
version = "0.1.0"
version = "0.1.1"

[deps]
HTTP = "cd3eb016-35fb-5094-929b-558a96fad6f3"
@@ -18,6 +18,7 @@ Feather = "becb17da-46f6-5d3c-ad1b-1c5fe96bc73c"
ImageMagick = "6218d12a-5da1-5696-b52f-db25d2ecc6d1"
JSON3 = "0f8b85d8-7281-11e9-16c2-39a750bddbf1"
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"

[targets]
test = ["Test", "ImageMagick", "Feather", "CSV", "JSON3"]
test = ["Test", "ImageMagick", "Feather", "CSV", "JSON3", "DataFrames"]
88 changes: 83 additions & 5 deletions README.md
@@ -5,15 +5,32 @@
[![Build Status](https://travis-ci.com/Arkoniak/UrlDownload.jl.svg?branch=master)](https://travis-ci.com/Arkoniak/UrlDownload.jl)
[![Codecov](https://codecov.io/gh/Arkoniak/UrlDownload.jl/branch/master/graph/badge.svg)](https://codecov.io/gh/Arkoniak/UrlDownload.jl)

This is a simple package that aims to simplify downloading data without storing intermediate files. Additionally,
`UrlDownload.jl` provides a progress bar for big files with long download times. Currently these types of data are supported:
This is a small package that aims to simplify downloading data without storing intermediate files. Additionally, `UrlDownload.jl` provides a progress bar for big files with long download times.

Currently these types of data are supported:

* PIC: image files, such as jpeg, png, bmp etc
* CSV: files with comma separated values
* FEATHER
* JSON

# Usage
Unsupported file formats can be processed with the help of custom parsers.

# Installation

To install `UrlDownload` either do

```julia
using Pkg
Pkg.add("UrlDownload")
```

or switch to `Pkg` mode with `]` and issue
```julia
pkg> add UrlDownload
```

# Basic usage

## Download CSV files

@@ -33,6 +50,24 @@ df = urldownload(url) |> DataFrame
# │ 2 │ 3 │ 4 │
```

For CSV and other file formats, keyword arguments from the corresponding
library can be used. For example, to process a CSV file with a nonstandard
delimiter, pass the `delim` argument from `CSV.jl`:

```julia
using UrlDownload
using DataFrames

url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/semicolon.csv"
df = urldownload(url, delim = ';') |> DataFrame
# 2×2 DataFrame
# │ Row │ x │ y │
# │ │ Int64 │ Int64 │
# ├─────┼───────┼───────┤
# │ 1 │ 1 │ 2 │
# │ 2 │ 3 │ 4 │
```
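
The forwarding itself is plain Julia keyword splatting. Judging by `src/UrlDownload.jl` in this commit, every built-in loader accepts `kw...`, but only `load_csv` passes them on to its library, so keywords currently have an effect for CSV and TSV only. The sketch below shows the pattern without any network access; `toy_reader` and `toy_load` are hypothetical stand-ins, not part of `UrlDownload.jl` or `CSV.jl`.

```julia
# Hypothetical stand-in for a library reader such as `CSV.File`.
# It splits every line of the buffer on `delim`.
toy_reader(buf::IO; delim = ',') = [split(line, delim) for line in eachline(buf)]

# Hypothetical stand-in for a format loader: it accepts arbitrary keyword
# arguments and forwards them unchanged to the reader.
toy_load(data::Vector{UInt8}; kw...) = toy_reader(IOBuffer(data); kw...)

data = Vector{UInt8}("x;y\n1;2\n3;4")
rows = toy_load(data, delim = ';')
```

A keyword that the reader does not know raises a `MethodError`, which is the behaviour to expect when a misspelled keyword reaches a library function this way.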

## Images

Images are supported through `ImageMagick.jl`
@@ -78,7 +113,8 @@ df = urldownload(url)
# │ 2 │ 3 │ 4 │
```

# Progress Meter
# Additional functionality
## Progress Meter

By default nothing is shown while data is downloading, but this can be changed by passing `true` as
the second argument to the function `urldownload`:
@@ -92,7 +128,49 @@ urldownload(url, true)
# Progress: 45%|████████████████████ | Time: 0:00:01
```

# Undetected file types
## Custom parsers
If a file type is not supported by `UrlDownload.jl`, a custom parser can be used
to process the data. Such a parser should accept one positional argument of type
`Vector{UInt8}` and can have optional keyword arguments.

It is supplied through the `parser` keyword argument of `urldownload`.

```julia
using UrlDownload
using DataFrames
using CSV

url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/ext.csv"
res = urldownload(url, parser = x -> DataFrame(CSV.File(IOBuffer(x))))
# 2×2 DataFrame
# │ Row │ x │ y │
# │ │ Int64 │ Int64 │
# ├─────┼───────┼───────┤
# │ 1 │ 1 │ 2 │
# │ 2 │ 3 │ 4 │
```

If the custom parser takes keyword arguments, they receive their values from
the keyword arguments passed to the `urldownload` function:

```julia
using UrlDownload
using DataFrames
using CSV

wrapper(x; kw...) = DataFrame(CSV.File(IOBuffer(x); kw...))

url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/semicolon.csv"
res = urldownload(url, parser = wrapper, delim = ';')
# 2×2 DataFrame
# │ Row │ x │ y │
# │ │ Int64 │ Int64 │
# ├─────┼───────┼───────┤
# │ 1 │ 1 │ 2 │
# │ 2 │ 3 │ 4 │
```
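
Because a custom parser only has to accept a `Vector{UInt8}`, it can be checked locally on hand-made bytes before it is used with a real url. A minimal sketch, assuming a hypothetical `lines_parser` helper (not part of `UrlDownload.jl`) that treats the payload as UTF-8 text; the `keepempty` keyword is only there to illustrate keyword forwarding.

```julia
# Hypothetical parser: returns the lines of text contained in the raw bytes.
# `copy` is used because `String(::Vector{UInt8})` takes ownership of its argument.
lines_parser(x::Vector{UInt8}; keepempty = false) =
    split(String(copy(x)), '\n'; keepempty = keepempty)

# Check the parser on bytes built by hand.
bytes = Vector{UInt8}("a\nb\n\nc")
without_empty = lines_parser(bytes)
with_empty = lines_parser(bytes; keepempty = true)

# With a real url the same parser would be passed as
# urldownload(url, parser = lines_parser, keepempty = true)
```

Note that, according to the `urldownload` source in this commit, a parser passed this way takes precedence over `format` and over the extension-based detection.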

## Undetected file types
Sometimes the file type can't be detected from the url. In this case one can supply the optional
`format` argument to force the necessary behavior.

79 changes: 36 additions & 43 deletions src/UrlDownload.jl
@@ -18,29 +18,29 @@ const ext2sym = Dict(
)

const sym2func = Dict(
:FEATHER => (x, y) -> load_feather(x, y),
:PIC => (x, y) -> load_pic(x, y),
:CSV => (x, y) -> load_csv(x, y),
:TSV => (x, y) -> load_csv(x, y),
:JSON => (x, y) -> load_json(x, y)
:FEATHER => (x, y; kw...) -> load_feather(x, y; kw...),
:PIC => (x, y; kw...) -> load_pic(x, y; kw...),
:CSV => (x, y; kw...) -> load_csv(x, y; kw...),
:TSV => (x, y; kw...) -> load_csv(x, y; kw...),
:JSON => (x, y; kw...) -> load_json(x, y; kw...)
)

function load_feather(buf, data)
function load_feather(buf, data; kw...)
lib = checked_import(:Feather)
return Base.invokelatest(lib.read, buf)
end

function load_csv(buf, data)
function load_csv(buf, data; kw...)
lib = checked_import(:CSV)
return Base.invokelatest(lib.File, buf)
return Base.invokelatest(lib.File, buf; kw...)
end

function load_pic(buf, data)
function load_pic(buf, data; kw...)
lib = checked_import(:ImageMagick)
return Base.invokelatest(lib.load_, data)
end

function load_json(buf, data)
function load_json(buf, data; kw...)
lib = checked_import(:JSON3)
return Base.invokelatest(lib.read, data)
end
@@ -85,47 +85,36 @@ function datatype(url)
error("$ext is unsupported.")
end

function wrapdata(url, data, format)
function wrapdata(url, data, format; kw...)
buf = IOBuffer(data)
dtype = format == nothing ? datatype(url) : format

sym2func[dtype](buf, data)
sym2func[dtype](buf, data; kw...)
end

# This is one is mimic FileIO.query, may be it should be
# moved to FileIO.
# function wrapdata(url, data, ::Nothing)
# buf = IOBuffer(data)
# _, ext = splitext(url)
# if haskey(ext2sym, ext)
# sym = ext2sym[ext]
# no_magic = !hasmagic(sym)
# # Sorry, I prefer CSV...
# if sym ∈ CSVS
# return CSV.File(buf)
# end
# if lensym(sym) == 1 && no_magic # we only found one candidate and there is no magic bytes, trust the extension
# return load(Stream{DataFormat{sym}, typeof(buf)}(buf, nothing))
# elseif lensym(sym) > 1 && no_magic
# return load(Stream{DataFormat{sym[1]}, typeof(buf)}(buf, nothing))
# end
# if no_magic && !hasfunction(sym)
# error("Some formats with extension ", ext, " have no magic bytes; use `urldownload(url, format = :FMT)` to resolve the ambiguity.")
# end
# end
# # Check the magic bytes
# load(query(buf, nothing))
# end


"""
urldownload(url; format = nothing, progress = false, headers = HTTP.Header[], update_period = 1, kw...)
Download file from the corresponding url
urldownload(url, progress = false; parser = nothing, format = nothing, headers = HTTP.Header[], httpkw = Pair[], update_period = 1, kw...)
Download a file from the given url into memory and parse it into the corresponding data structure.
*Arguments*
* `url`: url of download
* `progress`: show ProgressMeter, by default it is not shown
* `parser`: custom parser, a function that should accept one positional argument of type `Vector{UInt8}` and optional
keyword arguments and return the necessary data structure. If `parser` is set, then it overrides all other settings, such as `format`.
If `parser` is not set, then the internal parsers are used to process the data.
* `format`: one of the fixed formats (:CSV, :TSV, :PIC, :FEATHER, :JSON); if set, overrides the autodetection mechanism.
* `headers`: `HTTP.jl` argument that sets the HTTP headers of the request.
* `httpkw`: additional `HTTP.jl` keyword arguments that are passed to the `GET` request. Should be supplied as a vector of
pairs.
* `update_period`: period of `ProgressMeter` update, by default 1 sec
* `kw...`: any keyword arguments that should be passed to the data parser.
"""
function urldownload(url, progress = false; format = nothing, headers = HTTP.Header[], update_period = 1, kw...)
function urldownload(url, progress = false; parser = nothing, format = nothing, headers = HTTP.Header[],
update_period = 1, httpkw = Pair[], kw...)
body = UInt8[]
HTTP.open("GET", url, headers; kw...) do stream
HTTP.open("GET", url, headers; httpkw...) do stream
resp = startread(stream)
eof(stream) && return
total_bytes = Int(floor(parse(Float64, HTTP.header(resp, "Content-Length", "NaN"))))
@@ -141,7 +130,11 @@ function urldownload(url, progress = false; format = nothing, headers = HTTP.Hea
end
end

return wrapdata(url, body, format)
if parser == nothing
return wrapdata(url, body, format; kw...)
else
return parser(body; kw...)
end
end

end # module
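
Note on `httpkw`: after this change the free keyword arguments `kw...` go to the parser, while options for the HTTP request travel separately as a vector of pairs that is splatted into `HTTP.open`. The sketch below shows only the splatting mechanism; `fake_request` is a hypothetical stand-in for the HTTP call, and `readtimeout` and `retry` are assumed to be valid `HTTP.jl` request options (check the `HTTP.jl` documentation before relying on them).

```julia
# Hypothetical stand-in for the HTTP call: it reports the keywords it received.
fake_request(url; kw...) = Dict(kw)

# Same shape as the `httpkw` argument: a vector of `Symbol => value` pairs.
httpkw = Pair[:readtimeout => 30, :retry => false]

# Splatting after `;` turns each pair into a keyword argument.
received = fake_request("https://example.com"; httpkw...)
```

The keys must be `Symbol`s; a pair with a `String` key cannot be splatted into keyword arguments.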
51 changes: 10 additions & 41 deletions test/runtests.jl
@@ -1,47 +1,16 @@
module TestUrlDownload
using Test
using UrlDownload

@testset "Standard CSVs" begin
url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/ext.csv"
res = urldownload(url)
for file in sort([file for file in readdir(@__DIR__) if
match(r"^test.*_.*\.jl$", file) !== nothing])
m = match(r"^test[_0-9]*(.*).jl$", file)

url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/ext.tsv"
res = urldownload(url)
@testset "$(m[1])" begin
# Here you can optionally exclude some test files
# VERSION < v"1.1" && file == "test_xxx.jl" && continue

url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/noextcsv"
res = urldownload(url, format = :CSV)

url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/noexttsv"
res = urldownload(url, format = :TSV)
end

@testset "Force format CSVs" begin
url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/semicolon.csv"
res = urldownload(url)

url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/semicolonnoextcsv"
res = urldownload(url, format = :CSV)
end

@testset "Pics" begin
url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/test.jpg"
res = urldownload(url)

url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/test.png"
res = urldownload(url)
end

@testset "Feather" begin
url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/test.feather"
res = urldownload(url)
include(file)
end
end

@testset "Json" begin
url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/test.json"
res = urldownload(url)
end

@testset "Progress meter" begin
url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/ext.csv"
res = urldownload(url, true)
end
end # module
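
Note on the new runner: it discovers test files by name and derives each testset label from the file name. The two regular expressions can be tried in isolation; `is_test_file` and `testset_name` are hypothetical helper names introduced only for this sketch, and the file list is written by hand instead of being read with `readdir`.

```julia
# The two patterns used by the runner in test/runtests.jl.
is_test_file(file) = match(r"^test.*_.*\.jl$", file) !== nothing
testset_name(file) = match(r"^test[_0-9]*(.*).jl$", file)[1]

files = ["runtests.jl", "test01_basic.jl", "test02_wrapper.jl"]
selected = sort(filter(is_test_file, files))
labels = testset_name.(selected)
```

`runtests.jl` is skipped because it does not start with `test`, so the runner never includes itself, and the numeric prefix fixes the execution order after sorting.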
56 changes: 56 additions & 0 deletions test/test01_basic.jl
@@ -0,0 +1,56 @@
module TestBasic

using Test
using UrlDownload

@testset "Standard CSVs" begin
url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/ext.csv"
res = urldownload(url)

url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/ext.tsv"
res = urldownload(url)

url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/noextcsv"
res = urldownload(url, format = :CSV)

url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/noexttsv"
res = urldownload(url, format = :TSV)
end

@testset "Force format CSVs" begin
url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/semicolon.csv"
res = urldownload(url)

url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/semicolonnoextcsv"
res = urldownload(url, format = :CSV)
end

@testset "Pics" begin
url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/test.jpg"
res = urldownload(url)

url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/test.png"
res = urldownload(url)
end

@testset "Feather" begin
url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/test.feather"
res = urldownload(url)
end

@testset "Json" begin
url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/test.json"
res = urldownload(url)
end

@testset "Progress meter" begin
url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/ext.csv"
res = urldownload(url, true)
end

@testset "Keyword arguments" begin
url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/semicolon.csv"
res = urldownload(url, delim = ';')
end

end # module
18 changes: 18 additions & 0 deletions test/test02_wrapper.jl
@@ -0,0 +1,18 @@
module TestWrapper

using Test
using UrlDownload
using DataFrames
using CSV

wrapper(x; kw...) = DataFrame(CSV.File(IOBuffer(x); kw...))

@testset "wrappers" begin
url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/ext.csv"
res = urldownload(url, parser = x -> DataFrame(CSV.File(IOBuffer(x))))

url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/semicolon.csv"
res = urldownload(url, parser = wrapper, delim = ';')
end

end # module
