Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not add value of preceding HTTP header field if there is no value #74

Merged
merged 1 commit into from
May 2, 2017

Conversation

sebastian-nagel
Copy link
Contributor

If a HTTP header field is empty (or contains only white space), the HttpHeaderParser does not use the empty value but uses the value from the preceding HTTP header field. See commoncrawl#11 for an example and test data.

This PR fixes the problem and adds a unit test.

@ldko
Copy link
Member

ldko commented May 2, 2017

Looking at this, I think the code change is fine. However, I ran it with and without the change on the sample WARC file attached at commoncrawl#11 and found that in both cases I couldn't check the value of the Server header because the relevant records in the WAT files in both cases didn't have Actual-Content-Type or HTTP-Response-Metadata fields in the Payload-Metadata. I think this is because of this difference between your code and the iipc code regarding WARCs created with wget:

diff --git a/src/main/java/org/archive/extract/ExtractingResourceFactoryMapper.java b/src/main/java/org/archive/extract/ExtractingResourceFactoryMapper.java
index ad10be4..0afe16f 100644
--- a/src/main/java/org/archive/extract/ExtractingResourceFactoryMapper.java
+++ b/src/main/java/org/archive/extract/ExtractingResourceFactoryMapper.java
@@ -153,7 +153,10 @@ public class ExtractingResourceFactoryMapper implements ResourceFactoryMapper {
        private boolean isHTTPResponseWARCResource(MetaData envelope) {
                return childFieldEquals(envelope,WARC_HEADER_METADATA,
                                WARCConstants.CONTENT_TYPE,
-                               WARCConstants.HTTP_RESPONSE_MIMETYPE);
+                               WARCConstants.HTTP_RESPONSE_MIMETYPE)
+                       || childFieldEquals(envelope,WARC_HEADER_METADATA,
+                               WARCConstants.CONTENT_TYPE,
+                               WARCConstants.HTTP_RESPONSE_MIMETYPE_NS);
        }
        private boolean isWARCJSONResource(MetaData envelope) {
                return childFieldEquals(envelope,WARC_HEADER_METADATA,
diff --git a/src/main/java/org/archive/format/warc/WARCConstants.java b/src/main/java/org/archive/format/warc/WARCConstants.java
index 93a81f9..4f2fa57 100644
--- a/src/main/java/org/archive/format/warc/WARCConstants.java
+++ b/src/main/java/org/archive/format/warc/WARCConstants.java
@@ -209,7 +209,9 @@ public interface WARCConstants extends ArchiveFileConstants {
        "application/http; msgtype=request";
     public static final String HTTP_RESPONSE_MIMETYPE =
        "application/http; msgtype=response";
-    
+  public static final String HTTP_RESPONSE_MIMETYPE_NS =
+      "application/http;msgtype=response";                   // wget does this
+
     public static final String FTP_CONTROL_CONVERSATION_MIMETYPE =
         "text/x-ftp-control-conversation";

Does this seem correct to you, @sebastian-nagel ?

@sebastian-nagel
Copy link
Contributor Author

Ok, I see. Of course, application/http;msgtype=request should also be covered. The fork commoncrawl/ia-web-commons contains still a couple of changes not merged into iipc/webarchive-commons, mainly the WETExtractor. I'm on course to push the remaining changes. But feel free to pull anything!

@ldko
Copy link
Member

ldko commented May 2, 2017

Ok, I will merge this along with the differences from WARCConstants.java and ExtractingResourceFactoryMapper.java. Thanks!

@ldko ldko merged commit cf34a3e into iipc:master May 2, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants