parse-string encoding issue #2

Open: wants to merge 1 commit into base: master
11 changes: 6 additions & 5 deletions src/pl/danieljanus/tagsoup.clj
@@ -102,16 +102,17 @@ removes empty (whitespace-only) PCDATA from in between the tags, which
makes the resulting tree cleaner. If prefer-header-http-info is true
and the encoding is specified in both <meta http-equiv> tag and the
HTTP headers (in this case, input must be a URL or a string
-representing one), the latter is preferred."
-[input & {:keys [xml strip-whitespace prefer-header-http-info], :or {strip-whitespace true}}]
+representing one), the latter is preferred. If encoding cannot be
+inferred from the input, use 'encoding' as the default."
+[input & {:keys [xml strip-whitespace prefer-header-http-info encoding], :or {strip-whitespace true}}]
(with-local-vars [tree (zip/vector-zip []) pcdata "" reparse false]
-(let [{:keys [stream encoding]} (input-stream input)
+(let [{:keys [stream input-stream-encoding]} (input-stream input)
stream (BufferedInputStream. stream)
source (InputSource. stream)
reparse-exception (Exception. "reparse")
xml-encoding (when xml (read-xml-encoding-declaration stream))
_ (.mark stream 65536)
-_ (.setEncoding source (or (and xml xml-encoding) encoding))
+_ (.setEncoding source (or (and xml xml-encoding) input-stream-encoding encoding))
flush-pcdata #(let [data (var-get pcdata)]
(when-not (empty? data)
(when-not (and strip-whitespace (re-find #"^\s+$" data))
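
The encoding fallback changed in this hunk can be tried in isolation. The snippet below is a minimal sketch, not the library's code: `stream-info` is a hypothetical stand-in for the map returned by `input-stream`, assumed to carry the detected encoding under `:encoding` (as the removed destructuring line suggests), and `choose-encoding` is an illustrative name. It shows that `:keys [input-stream-encoding]` looks up the key `:input-stream-encoding`, so under that assumption a renaming destructure would be needed for the detected encoding to take part in the `or` chain.

```clojure
;; Hypothetical stand-in for the result of `input-stream`; the :encoding key
;; is an assumption based on the destructuring in the removed line.
(def stream-info {:stream nil, :encoding "ISO-8859-2"})

;; :keys binds each symbol to the keyword of the same name, so this looks up
;; :input-stream-encoding, which is absent from the map.
(def via-keys
  (let [{:keys [stream input-stream-encoding]} stream-info]
    input-stream-encoding))

;; Renaming destructure: bind the local input-stream-encoding to :encoding.
(def via-rename
  (let [{:keys [stream], input-stream-encoding :encoding} stream-info]
    input-stream-encoding))

;; Illustrative helper mirroring the precedence in the changed .setEncoding
;; call: XML declaration first, then the detected encoding, then the default.
(defn choose-encoding
  [xml? xml-encoding input-stream-encoding default-encoding]
  (or (and xml? xml-encoding) input-stream-encoding default-encoding))
```

With these definitions, `via-keys` is `nil` while `via-rename` holds the detected encoding, and `(choose-encoding false nil nil "UTF-8")` falls through to the caller-supplied default.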
@@ -158,7 +159,7 @@ representing one), the latter is preferred."
(defn parse-string
"Parses a given string as HTML, passing options to `parse'."
[s & options]
-(apply parse (-> s .getBytes ByteArrayInputStream.) options))
+(apply parse (-> s (.getBytes "UTF-8") ByteArrayInputStream.) :encoding "UTF-8" options))
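
The reason for pinning the charset on both sides can be illustrated with a short round trip. This is a sketch under stated assumptions rather than part of the patch: `sample`, `utf8-bytes`, and `decode` are illustrative names. `String.getBytes()` with no argument uses the platform default charset, so the encoder and the decoder have to agree explicitly for non-ASCII text to survive.

```clojure
(import '(java.io ByteArrayInputStream))

;; "zażółć" written with Unicode escapes so the example does not depend on
;; the encoding of the source file itself.
(def sample "za\u017c\u00f3\u0142\u0107")

;; Encode explicitly as UTF-8 instead of relying on the platform default.
(def utf8-bytes (.getBytes ^String sample "UTF-8"))

;; Illustrative helper: decode a byte array using the named charset.
(defn decode [^bytes bs ^String charset]
  (String. bs charset))

;; Matching charsets round-trip; a mismatched one yields different text.
(def round-trip-ok? (= sample (decode utf8-bytes "UTF-8")))
(def mismatch-differs? (not= sample (decode utf8-bytes "ISO-8859-1")))

;; Reading the bytes back through a stream with the same declared encoding,
;; roughly what passing :encoding "UTF-8" alongside the stream is meant to do.
(def from-stream
  (slurp (ByteArrayInputStream. utf8-bytes) :encoding "UTF-8"))
```

Here the four non-ASCII characters take two bytes each in UTF-8, so the six-character string becomes ten bytes, and decoding those bytes as ISO-8859-1 produces a longer, garbled string.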

(defn parse-xml
"Parses a given XML using TagSoup and returns the parse result